Processor Equivalents¶
Common Honeysuckle processors and their native Polars/DuckDB equivalents.
Why Native Operations?
Native Polars/DuckDB operations are:
- Faster: No Python loop overhead, vectorized execution
- Simpler: Less boilerplate, more readable
- Type-safe: Better IDE support and error messages
Test Script
All examples are tested in scripts/test_processor_equivalents.py.
Run with: uv run scripts/test_processor_equivalents.py --verbose
Documentation Links¶
- Polars: docs.pola.rs | Expressions Guide | API Reference
- DuckDB: duckdb.org/docs | Functions | Friendly SQL
- Honeysuckle: GitHub (private repo)
Top 35 Processors by Usage¶
Based on analysis of Huntington Extractor, Universal Yiddish Library, and Royal Society SITM codebases.
| Rank | Processor | Usage | Operation |
|---|---|---|---|
| 1 | FillEmptyProcessor | 55 | Fill nulls with column/constant |
| 2 | RemoveMultivalueNullsProcessor | 47 | Remove nulls from lists |
| 3 | StripStringProcessor | 44 | Strip whitespace |
| 4 | ExtractFirstProcessor | 42 | Get first element from list |
| 5 | ExtractOnConditionProcessor | 29 | Extract value conditionally |
| 6 | ColumnsToDictsProcessor | 27 | Combine columns into nested dicts |
| 7 | ContainsBoolProcessor | 26 | Check if string contains pattern |
| 8 | StringConstantDataframeProcessor | 23 | Add constant column |
| 9 | AppendOnConditionProcessor | 18 | Append to list conditionally |
| 10 | ImplodeProcessor | 15 | Aggregate values into lists |
| 11 | ConcatProcessor | 15 | Concatenate columns |
| 12 | RenameProcessor | 13 | Rename columns |
| 13 | IsEmptyProcessor | 13 | Check if column is null |
| 14 | ExplodeColumnsProcessor | 13 | Unnest lists to rows |
| 15 | ReplaceOnConditionProcessor | 10 | Conditional value replacement |
| 16 | LowerStringProcessor | 9 | Lowercase strings |
| 17 | KeepOnlyProcessor | 9 | Keep only specified columns |
| 18 | DropNullColumnsProcessor | 9 | Drop all-null columns |
| 19 | DropColumnsProcessor | 9 | Drop specified columns |
| 20 | CopyFieldDataframeProcessor | 9 | Copy column values |
| 21 | ReplaceProcessor | 8 | String replacement |
| 22 | MergeProcessor | 8 | Join dataframes |
| 23 | FillMissingTuplesProcessor | 8 | Fill missing struct fields |
| 24 | ConditionalBoolProcessor | 8 | Set boolean from expression |
| 25 | CapitalizeProcessor | 4 | Capitalize first character |
| 26 | AppendStringProcessor | 4 | Append/prepend string |
| 27 | AddNumberProcessor | 4 | Add value to column |
| 28 | SubtractNumberProcessor | 4 | Subtract value from column |
| 29 | SplitStringProcessor | 3 | Split string to list |
| 30 | AsTypeProcessor | — | Cast column types |
| 31 | DropRowsOnConditionProcessor | — | Filter rows by condition |
| 32 | AppendProcessor | — | Append multiple columns to list |
| 33 | EnsureTuplesProcessor | — | Convert scalars to lists |
| 34 | ExtractCharacterDataframeProcessor | — | Extract pattern from string |
| 35 | SumColumnValuesProcessor | — | Sum values by group |
1. FillEmptyProcessor¶
Fill empty values with values from a different column or a constant.
| Honeysuckle | fill_empty_processor.py |
| Polars | fill_null() |
| DuckDB | coalesce() / ifnull() |
# Honeysuckle
FillEmptyProcessor(column_name="title", column_fill="alt_title")
FillEmptyProcessor(column_name="price", constant_fill=0)
FillEmptyProcessor(column_name="name", column_fill="display_name", constant_fill="Unknown")
Example¶
| title | alt_title | | title |
|---|---|---|---|
| None | "Backup" | → | "Backup" |
| "Main" | "Backup" | → | "Main" |
Polars¶
df = df.with_columns(
pl.col("title").fill_null(pl.col("alt_title")),
pl.col("price").fill_null(0),
)
Chained fallback:
DuckDB¶
Chained coalesce:
2. RemoveMultivalueNullsProcessor¶
Remove null values from list/array columns.
| Honeysuckle | remove_multivalue_nulls_processor.py |
| Polars | list.eval() with drop_nulls(), list.len() |
| DuckDB | list_filter() with lambda, nullif() |
Example¶
| tags | | tags |
|---|---|---|
| ["a", None, "b"] | → | ["a", "b"] |
| [None, None] | → | null |
Polars¶
# Step 1: Remove nulls from list
df = df.with_columns(pl.col("tags").list.eval(pl.element().drop_nulls()).alias("tags"))
# Step 2: Convert empty lists to null
return df.with_columns(
pl.when(pl.col("tags").list.len() == 0).then(None).otherwise(pl.col("tags")).alias("tags")
)
DuckDB¶
How it works:
- `list_filter(tags, x -> x IS NOT NULL)`: lambda filters nulls from the list
- `nullif(..., [])`: converts an empty list `[]` to `NULL`
Alternative list comprehension:
3. StripStringProcessor¶
Strip whitespace (or specific characters) from strings.
| Honeysuckle | strip_string_processor.py |
| Polars | str.strip_chars() |
| DuckDB | regexp_replace() for full whitespace; .trim() for spaces only |
# Honeysuckle
StripStringProcessor(column_name="name", new_column="name_clean")
StripStringProcessor(column_name="code", new_column="code", strip_string="-")
Example¶
| name | | name_clean |
|---|---|---|
| " hello " | → | "hello" |
| "\tworld\n" | → | "world" |
Polars¶
Additional patterns:
# Strip specific character
df = df.with_columns(pl.col("code").str.strip_chars("-"))
# Strip only leading (start) or trailing (end)
df = df.with_columns(
pl.col("name").str.strip_chars_start(), # Left only
pl.col("name").str.strip_chars_end(), # Right only
)
DuckDB¶
-- trim() only removes spaces; use regexp_replace for all whitespace
SELECT name, regexp_replace(name, '^\\s+|\\s+$', '', 'g') AS name_clean
Additional patterns:
-- Strip specific character
SELECT trim(code, '-') AS code FROM items
-- Strip only leading/trailing
SELECT name.ltrim() AS name FROM items -- Left only
SELECT name.rtrim() AS name FROM items -- Right only
4. ExtractFirstProcessor¶
Extract first value from multivalue column.
| Honeysuckle | extract_first_processor.py |
| Polars | list.first() |
| DuckDB | list[1] (1-indexed) or list_extract() |
# Honeysuckle
ExtractFirstProcessor(column_name="authors", result_column="primary_author")
ExtractFirstProcessor(column_name="tags") # Overwrites same column
Example¶
| authors | | primary_author |
|---|---|---|
| ["Alice", "Bob"] | → | "Alice" |
| ["Charlie"] | → | "Charlie" |
Polars¶
DuckDB¶
Replace in-place:
5. ExtractOnConditionProcessor¶
Extract value from one column to another based on a condition.
| Honeysuckle | extract_on_condition_processor.py |
| Polars | when().then().otherwise() |
| DuckDB | CASE WHEN or if() |
# Honeysuckle
ExtractOnConditionProcessor(
expression="is_on_sale == True",
extract_columns=["price"],
result_columns=["sale_price"],
)
Example¶
| price | is_on_sale | | sale_price |
|---|---|---|---|
| 100 | true | → | 100 |
| 200 | false | → | null |
Polars¶
df = df.with_columns(
pl.when(pl.col("is_on_sale")).then(pl.col("price")).otherwise(None).alias("sale_price")
)
Multiple conditions:
df = df.with_columns(
pl.when(pl.col("status") == "active")
.then(pl.col("current_value"))
.when(pl.col("status") == "pending")
.then(pl.col("estimated_value"))
.otherwise(None)
.alias("display_value")
)
DuckDB¶
Multiple conditions with CASE:
SELECT CASE
WHEN status = 'active' THEN current_value
WHEN status = 'pending' THEN estimated_value
END AS display_value
FROM items
6. ContainsBoolProcessor¶
Check if a string contains a pattern and set boolean result.
| Honeysuckle | contains_bool_processor.py |
| Polars | str.contains() |
| DuckDB | .contains() (dot notation) or LIKE/ILIKE |
# Honeysuckle
ContainsBoolProcessor(column_name="description", result_column="has_keyword", pattern="important")
Example¶
| description | | has_keyword |
|---|---|---|
| "This is important" | → | true |
| "Nothing here" | → | false |
Polars¶
Regex pattern:
DuckDB¶
Additional patterns:
-- Case-insensitive with ILIKE
SELECT description ILIKE '%important%' AS has_keyword FROM items
-- Regex pattern with regexp_matches()
SELECT regexp_matches(email, '@gmail\.com$') AS is_gmail FROM items
See also: regexp_matches()
7. StringConstantDataframeProcessor¶
Add a column with a constant string value.
| Honeysuckle | string_constant_dataframe_processor.py |
| Polars | pl.lit() |
| DuckDB | Literal value in SELECT |
Example¶
| id | | id | source |
|---|---|---|---|
| 1 | → | 1 | "huntington" |
| 2 | → | 2 | "huntington" |
Polars¶
Add multiple constants:
df = df.with_columns(
pl.lit("huntington").alias("source"),
pl.lit("2.0").alias("version"),
pl.lit(True).alias("is_published"),
)
DuckDB¶
Add multiple constants:
8. AppendOnConditionProcessor¶
Append columns to a list/tuple based on a condition (uses pandas query syntax).
Inefficient Implementation
Honeysuckle uses for i in range(len(data.index)) row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | append_on_condition_processor.py |
| Polars | when().then() with list.concat(), concat_list() |
| DuckDB | list_concat() with CASE WHEN |
# Honeysuckle
AppendOnConditionProcessor(
initial_column="tags",
append_columns=["category"],
condition="is_featured == True"
)
Example¶
| tags | category | is_featured | | tags |
|---|---|---|---|---|
| ["art"] | "new" | true | → | ["art", "new"] |
| ["photo"] | "old" | false | → | ["photo"] |
Polars¶
# Step 1: Wrap scalar in list
df = df.with_columns(pl.concat_list(pl.col("category")).alias("category_list"))
# Step 2: Conditionally append
return df.with_columns(
pl.when(pl.col("is_featured"))
.then(pl.col("tags").list.concat(pl.col("category_list")))
.otherwise(pl.col("tags"))
.alias("tags")
).drop("category_list")
Why the intermediate step? The list.concat() method requires both operands to be lists. Using concat_list() wraps the scalar in a list cleanly.
DuckDB¶
SELECT
CASE WHEN is_featured
THEN list_concat(tags, [category])
ELSE tags
END AS tags,
category,
is_featured
9. ImplodeProcessor¶
Aggregate values into a list during group by.
| Honeysuckle | implode_processor.py |
| Polars | Automatic in group_by().agg() |
| DuckDB | list() with GROUP BY ALL |
Example¶
| artwork_id | tag | | artwork_id | tag |
|---|---|---|---|---|
| 1 | "oil" | → | 1 | ["oil", "portrait"] |
| 1 | "portrait" | | 2 | ["landscape"] |
| 2 | "landscape" | | | |
Polars¶
With ordering inside the list:
DuckDB¶
With ordering inside the list:
10. ConcatProcessor¶
Concatenate multiple columns into a single string.
Inefficient Implementation
Honeysuckle uses itertuples() row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | concat_processor.py |
| Polars | pl.concat_str() |
| DuckDB | concat() or || operator |
# Honeysuckle
ConcatProcessor(new_field="full_name", join_fields=["first", "last"], join_on=" ")
ConcatProcessor(new_field="code", join_fields=["prefix", "id"], join_on="-")
Example¶
| first | last | | full_name |
|---|---|---|---|
| "John" | "Doe" | → | "John Doe" |
| "Jane" | "Smith" | → | "Jane Smith" |
| null | "Unknown" | → | "None Unknown" |
Polars¶
df = df.with_columns(
pl.concat_str(
[pl.col("first").fill_null("None"), pl.col("last").fill_null("None")],
separator=" ",
ignore_nulls=False,
).alias("full_name")
)
Best practice - skip nulls instead of stringifying them:
df = df.with_columns(
pl.concat_str(["prefix", "id"], separator=" ", ignore_nulls=True).alias("code")
)
DuckDB¶
SELECT
first,
last,
concat(coalesce(first::VARCHAR, 'None'), ' ', coalesce(last::VARCHAR, 'None')) AS full_name
Best practice - skip nulls with concat_ws():
11. RenameProcessor¶
Rename columns in a dataframe.
| Honeysuckle | rename_processor.py |
| Polars | rename() |
| DuckDB | AS in SELECT |
Example¶
| old_name | title | | new_name | artwork_title |
|---|---|---|---|---|
| "value" | "Art" | → | "value" | "Art" |
Polars¶
DuckDB¶
Keep all columns, rename specific ones:
12. IsEmptyProcessor¶
Check if column values are null and set boolean result.
| Honeysuckle | is_empty_processor.py |
| Polars | is_null() / is_not_null() |
| DuckDB | IS NULL / IS NOT NULL |
# Honeysuckle
IsEmptyProcessor(column_name="email", result_column="missing_email")
IsEmptyProcessor(column_name="email", result_column="has_email", invert=True)
Example¶
| email | | missing_email | has_email |
|---|---|---|---|
| null | → | true | false |
| "a@b.com" | → | false | true |
Polars¶
df = df.with_columns(
pl.col("email").is_null().alias("missing_email"),
pl.col("email").is_not_null().alias("has_email"),
)
DuckDB¶
13. ExplodeColumnsProcessor¶
Unnest/explode list columns into separate rows.
| Honeysuckle | explode_columns_processor.py |
| Polars | explode() |
| DuckDB | unnest() |
Example¶
| id | tags | | id | tags |
|---|---|---|---|---|
| 1 | ["a", "b"] | → | 1 | "a" |
| | | | 1 | "b" |
Polars¶
Explode multiple columns (must have same length lists):
DuckDB¶
Keep all other columns:
14. ReplaceOnConditionProcessor¶
Replace values based on a condition (==, !=, <, >, <=, >=).
| Honeysuckle | replace_on_condition_processor.py |
| Polars | when().then().otherwise() |
| DuckDB | CASE WHEN / greatest() / least() |
# Honeysuckle
ReplaceOnConditionProcessor(
target_field="status",
conditional_operator="==",
conditional_value="",
replacement="unknown"
)
ReplaceOnConditionProcessor(
target_field="price",
conditional_operator="<",
conditional_value=0,
replacement=0
)
Example¶
| status | | status |
|---|---|---|
| "" | → | "unknown" |
| "active" | → | "active" |
Polars¶
df = df.with_columns(
pl.when(pl.col("status") == "")
.then(pl.lit("unknown"))
.otherwise(pl.col("status"))
.alias("status"),
pl.max_horizontal(pl.col("price"), pl.lit(0)).alias("price"),
)
Note: max_horizontal() returns the maximum value across columns, useful for clamping values.
DuckDB¶
Complex condition with CASE:
15. LowerStringProcessor¶
Convert strings to lowercase.
| Honeysuckle | lower_string_processor.py |
| Polars | str.to_lowercase() |
| DuckDB | .lower() (dot notation) |
Example¶
"John@Example.COM" |
→ | "john@example.com" |
Polars¶
DuckDB¶
Traditional function syntax:
16. KeepOnlyProcessor¶
Keep only specified columns, drop all others.
| Honeysuckle | keep_only_processor.py |
| Polars | select() |
| DuckDB | Column list in SELECT |
Example¶
| id | title | author | internal_code | | id | title |
|---|---|---|---|---|---|---|
| 1 | "Art" | "Jane" | "X1" | → | 1 | "Art" |
Polars¶
Using pl.col:
DuckDB¶
17. DropNullColumnsProcessor¶
Drop columns where ALL values are null.
| Honeysuckle | drop_null_columns_processor.py |
| Polars | Filter columns using all() |
| DuckDB | Dynamic SQL (columns with all nulls) |
Example¶
| id | empty_col | title | | id | title |
|---|---|---|---|---|---|
| 1 | null | "Art" | → | 1 | "Art" |
| 2 | null | "Photo" | | 2 | "Photo" |
Polars¶
DuckDB¶
18. DropColumnsProcessor¶
Drop specified columns from the dataframe.
| Honeysuckle | drop_columns_processor.py |
| Polars | drop() |
| DuckDB | EXCLUDE |
Example¶
| id | internal_id | title | | id | title |
|---|---|---|---|---|---|
| 1 | "abc123" | "Art" | → | 1 | "Art" |
Polars¶
Drop single column:
DuckDB¶
Or explicitly list columns to keep:
19. CopyFieldDataframeProcessor¶
Copy values from one column to another (create new column).
| Honeysuckle | copy_field_processor.py |
| Polars | alias() |
| DuckDB | SELECT col AS new_col |
Example¶
| original_title | | original_title | display_title |
|---|---|---|---|
| "Mona Lisa" | → | "Mona Lisa" | "Mona Lisa" |
Polars¶
Copy multiple columns:
df = df.with_columns(
pl.col("title").alias("display_title"),
pl.col("date").alias("publication_date"),
)
DuckDB¶
Copy multiple:
20. ReplaceProcessor¶
Replace exact values in a column.
| Honeysuckle | replace_processor.py |
| Polars | when().then().otherwise() for exact value replacement |
| DuckDB | CASE WHEN for exact value replacement |
# Honeysuckle
ReplaceProcessor(target_field="status", to_replace="", replacement="unknown")
ReplaceProcessor(target_field="category", to_replace="N/A", replacement=None)
Example¶
| status | | status |
|---|---|---|
| "" | → | "unknown" |
Polars¶
df = df.with_columns(
pl.when(pl.col("status") == "")
.then(pl.lit("unknown"))
.otherwise(pl.col("status"))
.alias("status")
)
Best practice for substring/regex replacement with str.replace_all():
DuckDB¶
Best practice for substring/regex replacement with regexp_replace():
21. MergeProcessor¶
Join/merge two dataframes together.
| Honeysuckle | merge_processor.py |
| Polars | join() |
| DuckDB | JOIN |
# Honeysuckle
MergeProcessor(
merge_dataframe=artists_df,
merge_columns=["artist_name", "birth_year"],
merge_dataframe_left="artist_id",
merge_dataframe_right="id",
how="left"
)
Example¶
sales:
| id | artist_id |
|---|---|
| 1 | 100 |
artists:
| id | artist_name |
|---|---|
| 100 | "Monet" |
Result:
| id_x | artist_id | artist_name | id_y |
|---|---|---|---|
| 1 | 100 | "Monet" | 100 |
Polars¶
# Polars drops right_on column; keep it explicitly as id_y
merged = sales.join(
artists.select(["id", "artist_name"]).with_columns(pl.col("id").alias("id_y")),
left_on="artist_id",
right_on="id",
how="left",
)
return merged.rename({"id": "id_x"})
The suffix parameter automatically appends a suffix to duplicate column names from the right DataFrame.
Additional patterns:
# Inner join
df = sales.join(artists, left_on="artist_id", right_on="id", how="inner")
# Join with custom suffix for duplicate column names
df = sales.join(artists, left_on="artist_id", right_on="id", suffix="_artist")
DuckDB¶
SELECT s.id AS id_x, s.artist_id, a.artist_name, a.id AS id_y
FROM 'sales.parquet' s
LEFT JOIN 'artists.parquet' a ON s.artist_id = a.id
Additional patterns:
-- Inner join
SELECT s.*, a.artist_name
FROM sales s
INNER JOIN artists a ON s.artist_id = a.id
-- Using USING for same column name
SELECT * FROM sales JOIN artists USING (artist_id)
22. FillMissingTuplesProcessor¶
Fill missing tuple/list values in multiple columns by inserting nulls to match the row's tuple length.
Inefficient Implementation
Honeysuckle uses itertuples() row-by-row iteration with nested loops. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | fill_missing_tuples_processor.py |
| Polars | struct.field() with fill_null() |
| DuckDB | Struct access with coalesce() |
Example¶
| names | roles | | names | roles |
|---|---|---|---|---|
| ("Alice", "Bob") | null | → | ("Alice", "Bob") | (null, null) |
| null | ("editor", "viewer") | → | (null, null) | ("editor", "viewer") |
Polars¶
df = df.with_columns(
pl.when(pl.col("names").is_null())
.then(pl.lit([None, None]))
.otherwise(pl.col("names"))
.alias("names"),
pl.when(pl.col("roles").is_null())
.then(pl.lit([None, None]))
.otherwise(pl.col("roles"))
.alias("roles"),
)
DuckDB¶
23. ConditionalBoolProcessor¶
Evaluate a pandas query expression and set boolean result.
| Honeysuckle | conditional_bool_processor.py |
| Polars | Boolean expressions |
| DuckDB | Boolean expressions |
# Honeysuckle
ConditionalBoolProcessor(expression="price > 1000 and status == 'active'", result_column="is_premium")
Example¶
| price | status | | is_premium |
|---|---|---|---|
| 1500 | "active" | → | true |
| 500 | "active" | → | false |
Polars¶
df = df.with_columns(
((pl.col("price") > 1000) & (pl.col("status") == "active")).alias("is_premium")
)
Complex expressions:
df = df.with_columns(
(
(pl.col("price") > 1000) |
(pl.col("featured") == True)
).alias("is_highlight")
)
DuckDB¶
Complex expressions:
24. CapitalizeProcessor¶
Capitalize the first character of strings.
| Honeysuckle | capitalize_processor.py |
| Polars | str.head() + str.to_uppercase(), or str.to_titlecase() |
| DuckDB | initcap() |
Example¶
| title | | title |
|---|---|---|
| "hello world" | → | "Hello world" |
Polars¶
# Step 1: Get first char and rest
df = df.with_columns(
pl.col("title").str.head(1).str.to_uppercase().alias("_first"),
pl.col("title").str.slice(1).str.to_lowercase().alias("_rest"),
)
# Step 2: Combine and clean up
return df.with_columns((pl.col("_first") + pl.col("_rest")).alias("title")).drop(
["_first", "_rest"]
)
Note: str.head(1) is cleaner than .str.slice(0, 1) for extracting the first character.
Title case (capitalize each word):
DuckDB¶
How it works:
- `title[1]`: DuckDB uses 1-based indexing for strings
- `title[2:]`: slice from the second character to the end
Title case (capitalize each word):
25. AppendStringProcessor¶
Append or prepend a string to column values.
| Honeysuckle | append_string_processor.py |
| Polars | + operator or pl.concat_str() |
| DuckDB | || operator or concat() |
# Honeysuckle
AppendStringProcessor(column="id", value="ID-", prefix=True) # Prepend
AppendStringProcessor(column="filename", value=".jpg") # Append
Example¶
| id | | id |
|---|---|---|
| "123" | → | "ID-123" |
Polars¶
df = df.with_columns(
(pl.lit("ID-") + pl.col("id")).alias("id"),
(pl.col("filename") + pl.lit(".jpg")).alias("filename"),
)
Using concat_str:
DuckDB¶
Using concat:
26. AddNumberProcessor¶
Add a numeric value to column values.
| Honeysuckle | add_number_processor.py |
| Polars | + operator |
| DuckDB | + operator |
Example¶
| year | | next_year |
|---|---|---|
| 2024 | → | 2025 |
Polars¶
Add to same column:
DuckDB¶
Add to computed expression:
27. SubtractNumberProcessor¶
Subtract a numeric value from column values.
| Honeysuckle | subtract_number_processor.py |
| Polars | - operator |
| DuckDB | - operator |
Example¶
| year | | prev_year |
|---|---|---|
| 2024 | → | 2023 |
Polars¶
Calculate difference:
DuckDB¶
Calculate difference:
28. SplitStringProcessor¶
Split string into a list based on delimiter.
| Honeysuckle | split_string_processor.py |
| Polars | str.split() |
| DuckDB | string_split() |
Example¶
| tags | | tags |
|---|---|---|
| "a,b,c" | → | ["a", "b", "c"] |
Polars¶
Split with default space delimiter:
DuckDB¶
Split with default space:
29. AsTypeProcessor¶
Cast column to a different data type.
| Honeysuckle | as_type_processor.py |
| Polars | cast() |
| DuckDB | ::TYPE or CAST() |
# Honeysuckle
AsTypeProcessor(column_name="year", new_type="int")
AsTypeProcessor(column_name="price", new_type="float")
Example¶
| year | | year |
|---|---|---|
| "2024" | → | 2024 (int) |
Polars¶
DuckDB¶
Alternative CAST syntax:
30. DropRowsOnConditionProcessor¶
Filter/drop rows based on a condition.
| Honeysuckle | drop_rows_on_condition_processor.py |
| Polars | filter() |
| DuckDB | WHERE clause |
# Honeysuckle
DropRowsOnConditionProcessor(target_field="status", conditional_operator="==", conditional_value="deleted")
DropRowsOnConditionProcessor(target_field="price", conditional_operator="<", conditional_value=0)
Example¶
| status | price | | status | price |
|---|---|---|---|---|
| "active" | 100 | → | "active" | 100 |
| "deleted" | 50 | → | (dropped) | |
Polars¶
DuckDB¶
31. ColumnsToDictsProcessor¶
Combine multiple columns into a single column containing a list of dictionaries (nested structs). Useful for creating nested JSON output. Handles both single-value and multivalue columns.
Inefficient Implementation
Honeysuckle uses itertuples() and for...range row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | columns_to_dicts_processor.py |
| Polars | pl.struct() wrapped in pl.concat_list() |
| DuckDB | Struct literals in list |
# Honeysuckle
ColumnsToDictsProcessor(
new_field="dimensions",
column_names=["width", "height", "depth"],
new_column_names=["w", "h", "d"] # Optional rename
)
Example (single values)¶
| width | height | depth | | dimensions |
|---|---|---|---|---|
| 10 | 20 | 5 | → | [{"w": 10, "h": 20, "d": 5}] |
| 15 | 30 | 8 | → | [{"w": 15, "h": 30, "d": 8}] |
Example (multivalue - matching length lists)¶
| name | role | | people |
|---|---|---|---|
| ["Alice", "Bob"] | ["admin", "user"] | → | [{"name": "Alice", "role": "admin"}, {"name": "Bob", "role": "user"}] |
Polars¶
df = df.with_columns(
pl.concat_list(
pl.struct(
pl.col("width").alias("w"),
pl.col("height").alias("h"),
pl.col("depth").alias("d"),
)
).alias("dimensions")
).drop(["width", "height", "depth"])
For multivalue columns (lists of same length):
df = pl.DataFrame({
"name": [["Alice", "Bob"]],
"role": [["admin", "user"]]
})
# Explode, create struct, then group back
df = (
df.with_row_index("_idx")
.explode(["name", "role"])
.with_columns(
pl.struct(["name", "role"]).alias("person")
)
.group_by("_idx")
.agg(pl.col("person").alias("people"))
.drop("_idx")
)
DuckDB¶
Multivalue: transform parallel lists into list of structs using list_zip():
-- list_zip() pairs elements into structs with fields list_1, list_2;
-- the comprehension cannot destructure, so access the fields explicitly
SELECT [
    {'name': x.list_1, 'role': x.list_2}
    FOR x IN list_zip(names, roles)
] AS people
FROM items
32. AppendProcessor¶
Append values from multiple columns into a single list column.
Inefficient Implementation
Honeysuckle uses itertuples() and row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | AppendProcessor.py |
| Polars | concat_list() |
| DuckDB | list_concat() |
Example¶
| tags | category | source | | tags (after) |
|---|---|---|---|---|
| ("a", "b") | "new" | "web" | → | ["a", "b", "new", "web"] |
Polars¶
# Append category and source to tags list (modifies tags in-place)
return df.with_columns(pl.concat_list(["tags", "category", "source"]).alias("tags"))
DuckDB¶
-- Append category and source to tags list (modifies tags in-place)
SELECT list_concat(list_concat(tags, [category]), [source]) AS tags, category, source
33. EnsureTuplesProcessor¶
Convert scalar values to single-element lists (tuples in Honeysuckle terminology).
Inefficient Implementation
Honeysuckle uses itertuples() and iat[] for row-by-row updates. The Polars/DuckDB equivalents are vectorized.
| Honeysuckle | EnsureTuplesProcessor.py |
| Polars | concat_list() |
| DuckDB | Array literal [value] |
Example¶
| names | roles | | names | roles |
|---|---|---|---|---|
| "Alice" | "admin" | → | ["Alice"] | ["admin"] |
| "Bob" | "user" | → | ["Bob"] | ["user"] |
Polars¶
# Wrap scalar values in single-element lists
return df.with_columns(
pl.concat_list(pl.col("names")).alias("names"),
pl.concat_list(pl.col("roles")).alias("roles"),
)
DuckDB¶
34. ExtractCharacterDataframeProcessor¶
Extract a pattern from a string column using regex.
Inefficient Implementation
Honeysuckle uses .apply() with a lambda function for row-by-row regex extraction. The Polars/DuckDB equivalents are vectorized.
| Honeysuckle | ExtractCharacterDataframeProcessor.py |
| Polars | str.extract() |
| DuckDB | regexp_extract() |
# Honeysuckle
ExtractCharacterDataframeProcessor(
target_field="text",
pattern=r"-(\d+)-",
new_field="number"
)
Example¶
| text | | number |
|---|---|---|
| "Item-123-A" | → | "123" |
| "Product-456-B" | → | "456" |
Polars¶
DuckDB¶
35. SumColumnValuesProcessor¶
Sum column values grouped by another column and join back to the original dataframe.
Inefficient Implementation
Honeysuckle uses itertuples() and row-by-row value setting via loc[]. The Polars/DuckDB equivalents use vectorized aggregation and joins.
| Honeysuckle | SumColumnValuesProcessor.py |
| Polars | group_by().agg() + join() |
| DuckDB | Subquery with GROUP BY + JOIN |
# Honeysuckle
SumColumnValuesProcessor(
column_to_sum="amount",
result_column="total",
result_label_column="category",
sum_counts=True,
group_sum_counts_by="category"
)
Example¶
| category | amount | | category | amount | total |
|---|---|---|---|---|---|
| "A" | 10 | → | "A" | 10 | 90 |
| "B" | 20 | → | "B" | 20 | 60 |
| "A" | 30 | → | "A" | 30 | 90 |
| "B" | 40 | → | "B" | 40 | 60 |
| "A" | 50 | → | "A" | 50 | 90 |
Polars¶
sums = df.group_by("category").agg(pl.col("amount").sum().alias("total"))
return df.join(sums, on="category", how="left")
DuckDB¶
SELECT s.*, totals.total
FROM 'source.parquet' s
JOIN (SELECT category, sum(amount) AS total FROM 'source.parquet' GROUP BY category) totals
USING (category)
Quick Reference¶
| Processor | Polars | DuckDB |
|---|---|---|
| FillEmptyProcessor | .fill_null() | coalesce() / ifnull() |
| RemoveMultivalueNullsProcessor | .list.eval() + .list.len() | list_filter() + nullif() |
| StripStringProcessor | .str.strip_chars() | regexp_replace() |
| ExtractFirstProcessor | .list.first() | col[1] |
| ExtractOnConditionProcessor | pl.when().then().otherwise() | if() / CASE WHEN |
| ContainsBoolProcessor | .str.contains() | .contains() |
| StringConstantDataframeProcessor | pl.lit() | 'value' AS col |
| AppendOnConditionProcessor | concat_list() + .list.concat() | list_concat() + CASE |
| ImplodeProcessor | .group_by().agg() | list() + GROUP BY ALL |
| ConcatProcessor | pl.concat_str() | concat_ws() / \|\| |
| RenameProcessor | .rename() | AS new_name |
| IsEmptyProcessor | .is_null() / .is_not_null() | IS NULL / IS NOT NULL |
| ExplodeColumnsProcessor | .explode() | unnest() |
| ReplaceOnConditionProcessor | pl.when().then().otherwise() | if() / greatest() |
| LowerStringProcessor | .str.to_lowercase() | .lower() |
| KeepOnlyProcessor | .select() | Column list |
| DropNullColumnsProcessor | Filter columns | Dynamic SQL |
| DropColumnsProcessor | .drop() | EXCLUDE |
| CopyFieldDataframeProcessor | .alias() | AS new_col |
| ReplaceProcessor | when().then().otherwise() | CASE WHEN |
| MergeProcessor | .join() | JOIN |
| FillMissingTuplesProcessor | .struct.field().fill_null() | coalesce(struct.field) |
| ConditionalBoolProcessor | Boolean expression | Boolean expression |
| CapitalizeProcessor | .str.head() + .str.to_uppercase() | upper(title[1]) \|\| lower(title[2:]) |
| AppendStringProcessor | + / pl.concat_str() | \|\| / concat() |
| AddNumberProcessor | + | + |
| SubtractNumberProcessor | - | - |
| SplitStringProcessor | .str.split() | string_split() |
| AsTypeProcessor | .cast() | ::TYPE |
| DropRowsOnConditionProcessor | .filter() | WHERE |
| ColumnsToDictsProcessor | pl.concat_list(pl.struct()) | [{...} FOR ... IN list_zip()] |
| AppendProcessor | pl.concat_list() | list_concat() |
| EnsureTuplesProcessor | pl.concat_list() | [value] |
| ExtractCharacterDataframeProcessor | .str.extract() | regexp_extract() |
| SumColumnValuesProcessor | .group_by().agg() + .join() | GROUP BY + JOIN |