# Honey Duck
A Polars/Parquet data pipeline with Dagster orchestration, using dlt for data ingestion and DuckDB for SQL queries.
## Quick Start
```bash
# Install dependencies
uv sync

# Start Dagster UI
uv run dg dev

# Or run pipeline via CLI
uv run dg launch --job processors_pipeline

# List all definitions
uv run dg list defs
```
Open http://localhost:3000 and click "Materialize all" to run the pipeline!
## Features
- 7 parallel implementations sharing a common harvest layer
- Lazy evaluation with Polars, so query plans are optimized before execution (see the sketch after this list)
- Soda data quality validation with contract-based schema checks
- Column lineage tracking with example values
- Visualization helpers for Dagster asset metadata (Altair charts, markdown tables)
- Reusable processors for DuckDB, Polars, and Pandas transforms
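
A minimal sketch of the lazy-evaluation pattern, with hypothetical file path, column names, and function (the real assets live in the project source): building on `pl.scan_parquet` keeps the query lazy until `.collect()`, so Polars can push the filter and projection down into the Parquet scan.

```python
import polars as pl

# Hypothetical input produced by the dlt harvest layer.
RAW_ORDERS = "data/raw/orders.parquet"

def transform_orders() -> pl.DataFrame:
    # scan_parquet returns a LazyFrame: nothing is read until .collect(),
    # so the filter and aggregation are planned before any I/O happens.
    return (
        pl.scan_parquet(RAW_ORDERS)
        .filter(pl.col("status") == "complete")
        .group_by("customer_id")
        .agg(pl.col("amount").sum().alias("total_amount"))
        .collect()
    )
```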
## Architecture
```
CSV files → dlt harvest → Transform assets → Output assets → JSON
                 ↓                ↓                 ↓
            Parquet raw    Parquet storage      JSON files
```
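
To make one hop of the diagram concrete, here is a minimal sketch, assuming a hypothetical upstream asset name and file paths: a Dagster asset that reads harvested Parquet and writes the JSON output.

```python
import dagster as dg
import polars as pl

@dg.asset(deps=["harvest_orders"])  # hypothetical dlt harvest asset
def orders_output() -> None:
    # Parquet in, JSON out: one Transform → Output hop from the diagram.
    df = pl.scan_parquet("data/raw/orders.parquet").collect()
    df.write_json("data/out/orders.json")
```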
## Documentation
- **Getting Started**: Get up and running in 15 minutes with the quick start tutorial.
- **User Guide**: Best practices, Polars patterns, and performance tuning.
- **API Reference**: Auto-generated documentation from source code.
- **Integrations**: Elasticsearch, S3, PostgreSQL, and more.
## Implementations
| Suffix | Description | Features |
|---|---|---|
| (none) | Original with processor classes | Processor pattern |
| `_polars` | Pure Polars with intermediate steps | Lazy evaluation, visualization |
| `_duckdb` | Pure DuckDB SQL queries | SQL-based transforms |
| `_soda` | DuckDB + Soda validation | Column lineage, schema contracts |
| `_polars_fs` | Polars variant (different group) | Same logic, separate group |
| `_polars_ops` | Graph-backed assets with ops | Detailed observability |
| `_polars_multi` | Multi-asset pattern | Tightly coupled steps |
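
As an illustration of the `_duckdb` style, a minimal sketch with hypothetical paths and columns: DuckDB queries the harvested Parquet directly with SQL, no load step required.

```python
import duckdb

# read_parquet scans the file in place; no import into DuckDB is needed.
totals = duckdb.sql(
    """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM read_parquet('data/raw/orders.parquet')
    GROUP BY customer_id
    ORDER BY total_amount DESC
    """
).pl()  # materialize the result as a Polars DataFrame
```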