Honey Duck

A Polars/Parquet data pipeline with Dagster orchestration, using dlt for data ingestion and DuckDB for SQL queries.

Quick Start

# Install dependencies
uv sync

# Start Dagster UI
uv run dg dev

# Or run pipeline via CLI
uv run dg launch --job processors_pipeline

# List all definitions
uv run dg list defs

Open http://localhost:3000 and click "Materialize all" to run the pipeline!

Features

  • 7 parallel implementations sharing a common harvest layer
  • Lazy evaluation with Polars for optimal performance
  • Soda data quality validation with contract-based schema checks
  • Column lineage tracking with example values
  • Visualization helpers for Dagster asset metadata (Altair charts, markdown tables)
  • Reusable processors for DuckDB, Polars, and Pandas transforms

Architecture

CSV files → dlt harvest → Transform assets → Output assets → JSON
                 ↓              ↓                 ↓
           Parquet raw    Parquet storage    JSON files

Documentation

  • Getting Started: Get up and running in 15 minutes with the quick start tutorial.
  • User Guide: Best practices, Polars patterns, and performance tuning.
  • API Reference: Auto-generated documentation from source code.
  • Integrations: Elasticsearch, S3, PostgreSQL, and more.

Implementations

Suffix         Description                          Features
(none)         Original with processor classes      Processor pattern
_polars        Pure Polars with intermediate steps  Lazy evaluation, visualization
_duckdb        Pure DuckDB SQL queries              SQL-based transforms
_soda          DuckDB + Soda validation             Column lineage, schema contracts
_polars_fs     Polars variant (different group)     Same logic, separate group
_polars_ops    Graph-backed assets with ops         Detailed observability
_polars_multi  Multi-asset pattern                  Tightly coupled steps