Processor Equivalents¶
Common Honeysuckle processors and their native Polars/DuckDB equivalents.
Why Native Operations?
Native Polars/DuckDB operations are:
- Faster: No Python loop overhead, vectorized execution
- Simpler: Less boilerplate, more readable
- Type-safe: Better IDE support and error messages
Test Script
All examples are tested in scripts/test_processor_equivalents.py.
Run with: uv run scripts/test_processor_equivalents.py --verbose
Documentation Links¶
- Polars: docs.pola.rs | Expressions Guide | API Reference
- DuckDB: duckdb.org/docs | Functions | Friendly SQL
- Honeysuckle: GitHub (private repo)
Top 35 Processors by Usage¶
Based on analysis of Huntington Extractor, Universal Yiddish Library, and Royal Society SITM codebases.
| Rank | Processor | Usage | Operation |
|---|---|---|---|
| 1 | FillEmptyProcessor | 55 | Fill nulls with column/constant |
| 2 | RemoveMultivalueNullsProcessor | 47 | Remove nulls from lists |
| 3 | StripStringProcessor | 44 | Strip whitespace |
| 4 | ExtractFirstProcessor | 42 | Get first element from list |
| 5 | ExtractOnConditionProcessor | 29 | Extract value conditionally |
| 6 | ColumnsToDictsProcessor | 27 | Combine columns into nested dicts |
| 7 | ContainsBoolProcessor | 26 | Check if string contains pattern |
| 8 | StringConstantDataframeProcessor | 23 | Add constant column |
| 9 | AppendOnConditionProcessor | 18 | Append to list conditionally |
| 10 | ImplodeProcessor | 15 | Aggregate values into lists |
| 11 | ConcatProcessor | 15 | Concatenate columns |
| 12 | RenameProcessor | 13 | Rename columns |
| 13 | IsEmptyProcessor | 13 | Check if column is null |
| 14 | ExplodeColumnsProcessor | 13 | Unnest lists to rows |
| 15 | ReplaceOnConditionProcessor | 10 | Conditional value replacement |
| 16 | LowerStringProcessor | 9 | Lowercase strings |
| 17 | KeepOnlyProcessor | 9 | Keep only specified columns |
| 18 | DropNullColumnsProcessor | 9 | Drop all-null columns |
| 19 | DropColumnsProcessor | 9 | Drop specified columns |
| 20 | CopyFieldDataframeProcessor | 9 | Copy column values |
| 21 | ReplaceProcessor | 8 | String replacement |
| 22 | MergeProcessor | 8 | Join dataframes |
| 23 | FillMissingTuplesProcessor | 8 | Fill missing struct fields |
| 24 | ConditionalBoolProcessor | 8 | Set boolean from expression |
| 25 | CapitalizeProcessor | 4 | Capitalize first character |
| 26 | AppendStringProcessor | 4 | Append/prepend string |
| 27 | AddNumberProcessor | 4 | Add value to column |
| 28 | SubtractNumberProcessor | 4 | Subtract value from column |
| 29 | SplitStringProcessor | 3 | Split string to list |
| 30 | AsTypeProcessor | — | Cast column types |
| 31 | DropRowsOnConditionProcessor | — | Filter rows by condition |
| 32 | AppendProcessor | — | Append multiple columns to list |
| 33 | EnsureTuplesProcessor | — | Convert scalars to lists |
| 34 | ExtractCharacterDataframeProcessor | — | Extract pattern from string |
| 35 | SumColumnValuesProcessor | — | Sum values by group |
1. FillEmptyProcessor¶
Fill empty values with values from a different column or a constant.
| Honeysuckle | fill_empty_processor.py |
| Polars | fill_null() |
| DuckDB | coalesce() / ifnull() |
# Honeysuckle
FillEmptyProcessor(column_name="title", column_fill="alt_title")
FillEmptyProcessor(column_name="price", constant_fill=0)
FillEmptyProcessor(column_name="name", column_fill="display_name", constant_fill="Unknown")
Example¶
| title | alt_title | | title |
|---|---|---|---|
| None | "Backup" | → | "Backup" |
| "Main" | "Backup" | → | "Main" |
Polars¶
df = df.with_columns(
pl.col("title").fill_null(pl.col("alt_title")),
pl.col("price").fill_null(0),
)
Chained fallback:
DuckDB¶
Chained coalesce:
2. RemoveMultivalueNullsProcessor¶
Remove null values from list/array columns.
| Honeysuckle | remove_multivalue_nulls_processor.py |
| Polars | list.eval() with drop_nulls(), list.len() |
| DuckDB | list_filter() with lambda, nullif() |
Example¶
| tags | | tags |
|---|---|---|
| ["a", None, "b"] | → | ["a", "b"] |
| [None, None] | → | null |
Polars¶
# Step 1: Remove nulls from list
df = df.with_columns(pl.col("tags").list.eval(pl.element().drop_nulls()).alias("tags"))
# Step 2: Convert empty lists to null
return df.with_columns(
pl.when(pl.col("tags").list.len() == 0).then(None).otherwise(pl.col("tags")).alias("tags")
)
DuckDB¶
How it works:
- `list_filter(tags, x -> x IS NOT NULL)`: lambda filters nulls from the list
- `nullif(..., [])`: converts an empty list `[]` to `NULL`
Alternative list comprehension:
3. StripStringProcessor¶
Strip whitespace (or specific characters) from strings.
| Honeysuckle | strip_string_processor.py |
| Polars | str.strip_chars() |
| DuckDB | regexp_replace() for full whitespace; .trim() for spaces only |
# Honeysuckle
StripStringProcessor(column_name="name", new_column="name_clean")
StripStringProcessor(column_name="code", new_column="code", strip_string="-")
Example¶
| name | | name_clean |
|---|---|---|
| " hello " | → | "hello" |
| "\tworld\n" | → | "world" |
Polars¶
Additional patterns:
# Strip specific character
df = df.with_columns(pl.col("code").str.strip_chars("-"))
# Strip only leading (start) or trailing (end)
df = df.with_columns(
pl.col("name").str.strip_chars_start(), # Left only
pl.col("name").str.strip_chars_end(), # Right only
)
DuckDB¶
-- trim() only removes spaces; use regexp_replace for all whitespace
SELECT name, regexp_replace(name, '^\\s+|\\s+$', '', 'g') AS name_clean
Additional patterns:
-- Strip specific character
SELECT trim(code, '-') AS code FROM items
-- Strip only leading/trailing
SELECT name.ltrim() AS name FROM items -- Left only
SELECT name.rtrim() AS name FROM items -- Right only
4. ExtractFirstProcessor¶
Extract first value from multivalue column.
| Honeysuckle | extract_first_processor.py |
| Polars | list.first() |
| DuckDB | list[1] (1-indexed) or list_extract() |
# Honeysuckle
ExtractFirstProcessor(column_name="authors", result_column="primary_author")
ExtractFirstProcessor(column_name="tags") # Overwrites same column
Example¶
| authors | | primary_author |
|---|---|---|
| ["Alice", "Bob"] | → | "Alice" |
| ["Charlie"] | → | "Charlie" |
Polars¶
DuckDB¶
Replace in-place:
5. ExtractOnConditionProcessor¶
Extract value from one column to another based on a condition.
| Honeysuckle | extract_on_condition_processor.py |
| Polars | when().then().otherwise() |
| DuckDB | CASE WHEN or if() |
# Honeysuckle
ExtractOnConditionProcessor(
expression="is_on_sale == True",
extract_columns=["price"],
result_columns=["sale_price"],
)
Example¶
| price | is_on_sale | | sale_price |
|---|---|---|---|
| 100 | true | → | 100 |
| 200 | false | → | null |
Polars¶
df = df.with_columns(
pl.when(pl.col("is_on_sale")).then(pl.col("price")).otherwise(None).alias("sale_price")
)
Multiple conditions:
df = df.with_columns(
pl.when(pl.col("status") == "active")
.then(pl.col("current_value"))
.when(pl.col("status") == "pending")
.then(pl.col("estimated_value"))
.otherwise(None)
.alias("display_value")
)
DuckDB¶
Multiple conditions with CASE:
SELECT CASE
WHEN status = 'active' THEN current_value
WHEN status = 'pending' THEN estimated_value
END AS display_value
FROM items
6. ContainsBoolProcessor¶
Check if a string contains a pattern and set boolean result.
| Honeysuckle | contains_bool_processor.py |
| Polars | str.contains() |
| DuckDB | .contains() (dot notation) or LIKE/ILIKE |
# Honeysuckle
ContainsBoolProcessor(column_name="description", result_column="has_keyword", pattern="important")
Example¶
| description | | has_keyword |
|---|---|---|
| "This is important" | → | true |
| "Nothing here" | → | false |
Polars¶
Regex pattern:
DuckDB¶
Additional patterns:
-- Case-insensitive with ILIKE
SELECT description ILIKE '%important%' AS has_keyword FROM items
-- Regex pattern with regexp_matches()
SELECT regexp_matches(email, '@gmail\.com$') AS is_gmail FROM items
See also: regexp_matches()
7. StringConstantDataframeProcessor¶
Add a column with a constant string value.
| Honeysuckle | string_constant_dataframe_processor.py |
| Polars | pl.lit() |
| DuckDB | Literal value in SELECT |
Example¶
| id | | id | source |
|---|---|---|---|
| 1 | → | 1 | "huntington" |
| 2 | → | 2 | "huntington" |
Polars¶
Add multiple constants:
df = df.with_columns(
pl.lit("huntington").alias("source"),
pl.lit("2.0").alias("version"),
pl.lit(True).alias("is_published"),
)
DuckDB¶
Add multiple constants:
8. AppendOnConditionProcessor¶
Append columns to a list/tuple based on a condition (uses pandas query syntax).
Inefficient Implementation
Honeysuckle uses for i in range(len(data.index)) row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | append_on_condition_processor.py |
| Polars | when().then() with list.concat(), concat_list() |
| DuckDB | list_concat() with CASE WHEN |
# Honeysuckle
AppendOnConditionProcessor(
initial_column="tags",
append_columns=["category"],
condition="is_featured == True"
)
Example¶
| tags | category | is_featured | | tags |
|---|---|---|---|---|
| ["art"] | "new" | true | → | ["art", "new"] |
| ["photo"] | "old" | false | → | ["photo"] |
Polars¶
# Step 1: Wrap scalar in list
df = df.with_columns(pl.concat_list(pl.col("category")).alias("category_list"))
# Step 2: Conditionally append
return df.with_columns(
pl.when(pl.col("is_featured"))
.then(pl.col("tags").list.concat(pl.col("category_list")))
.otherwise(pl.col("tags"))
.alias("tags")
).drop("category_list")
Why the intermediate step? The list.concat() method requires both operands to be lists. Using concat_list() wraps the scalar in a list cleanly.
DuckDB¶
SELECT
CASE WHEN is_featured
THEN list_concat(tags, [category])
ELSE tags
END AS tags,
category,
is_featured
9. ImplodeProcessor¶
Aggregate values into a list during group by.
| Honeysuckle | implode_processor.py |
| Polars | Automatic in group_by().agg() |
| DuckDB | list() with GROUP BY ALL |
Example¶
| artwork_id | tag | | artwork_id | tag |
|---|---|---|---|---|
| 1 | "oil" | → | 1 | ["oil", "portrait"] |
| 1 | "portrait" | | 2 | ["landscape"] |
| 2 | "landscape" | | | |
Polars¶
With ordering inside the list:
DuckDB¶
With ordering inside the list:
10. ConcatProcessor¶
Concatenate multiple columns into a single string.
Inefficient Implementation
Honeysuckle uses itertuples() row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | concat_processor.py |
| Polars | pl.concat_str() |
| DuckDB | concat() or || operator |
# Honeysuckle
ConcatProcessor(new_field="full_name", join_fields=["first", "last"], join_on=" ")
ConcatProcessor(new_field="code", join_fields=["prefix", "id"], join_on="-")
Example¶
| first | last | | full_name |
|---|---|---|---|
| "John" | "Doe" | → | "John Doe" |
| "Jane" | "Smith" | → | "Jane Smith" |
| null | "Unknown" | → | "None Unknown" |
Polars¶
df = df.with_columns(
pl.concat_str(
[pl.col("first").fill_null("None"), pl.col("last").fill_null("None")],
separator=" ",
ignore_nulls=False,
).alias("full_name")
)
Best practice - skip nulls instead of stringifying them:
df = df.with_columns(
pl.concat_str(["prefix", "id"], separator=" ", ignore_nulls=True).alias("code")
)
DuckDB¶
SELECT
first,
last,
concat(coalesce(first::VARCHAR, 'None'), ' ', coalesce(last::VARCHAR, 'None')) AS full_name
Best practice - skip nulls with concat_ws():
11. RenameProcessor¶
Rename columns in a dataframe.
| Honeysuckle | rename_processor.py |
| Polars | rename() |
| DuckDB | AS in SELECT |
Example¶
| old_name | title | | new_name | artwork_title |
|---|---|---|---|---|
| "value" | "Art" | → | "value" | "Art" |
Polars¶
DuckDB¶
Keep all columns, rename specific ones:
12. IsEmptyProcessor¶
Check if column values are null and set boolean result.
| Honeysuckle | is_empty_processor.py |
| Polars | is_null() / is_not_null() |
| DuckDB | IS NULL / IS NOT NULL |
# Honeysuckle
IsEmptyProcessor(column_name="email", result_column="missing_email")
IsEmptyProcessor(column_name="email", result_column="has_email", invert=True)
Example¶
| email | | missing_email | has_email |
|---|---|---|---|
| null | → | true | false |
| "a@b.com" | → | false | true |
Polars¶
df = df.with_columns(
pl.col("email").is_null().alias("missing_email"),
pl.col("email").is_not_null().alias("has_email"),
)
DuckDB¶
13. ExplodeColumnsProcessor¶
Unnest/explode list columns into separate rows.
| Honeysuckle | explode_columns_processor.py |
| Polars | explode() |
| DuckDB | unnest() |
Example¶
| id | tags | | id | tags |
|---|---|---|---|---|
| 1 | ["a", "b"] | → | 1 | "a" |
| | | | 1 | "b" |
Polars¶
Explode multiple columns (must have same length lists):
DuckDB¶
Keep all other columns:
14. ReplaceOnConditionProcessor¶
Replace values based on a condition (==, !=, <, >, <=, >=).
| Honeysuckle | replace_on_condition_processor.py |
| Polars | when().then().otherwise() |
| DuckDB | CASE WHEN / greatest() / least() |
# Honeysuckle
ReplaceOnConditionProcessor(
target_field="status",
conditional_operator="==",
conditional_value="",
replacement="unknown"
)
ReplaceOnConditionProcessor(
target_field="price",
conditional_operator="<",
conditional_value=0,
replacement=0
)
Example¶
| status | | status |
|---|---|---|
| "" | → | "unknown" |
| "active" | → | "active" |
Polars¶
df = df.with_columns(
pl.when(pl.col("status") == "")
.then(pl.lit("unknown"))
.otherwise(pl.col("status"))
.alias("status"),
pl.max_horizontal(pl.col("price"), pl.lit(0)).alias("price"),
)
Note: max_horizontal() returns the maximum value across columns, useful for clamping values.
DuckDB¶
Complex condition with CASE:
15. LowerStringProcessor¶
Convert strings to lowercase.
| Honeysuckle | lower_string_processor.py |
| Polars | str.to_lowercase() |
| DuckDB | .lower() (dot notation) |
Example¶
"John@Example.COM" |
→ | "john@example.com" |
Polars¶
DuckDB¶
Traditional function syntax:
16. KeepOnlyProcessor¶
Keep only specified columns, drop all others.
| Honeysuckle | keep_only_processor.py |
| Polars | select() |
| DuckDB | Column list in SELECT |
Example¶
| id | title | author | internal_code | | id | title |
|---|---|---|---|---|---|---|
| 1 | "Art" | "Jane" | "X1" | → | 1 | "Art" |
Polars¶
Using pl.col:
DuckDB¶
17. DropNullColumnsProcessor¶
Drop columns where ALL values are null.
| Honeysuckle | drop_null_columns_processor.py |
| Polars | Filter columns using all() |
| DuckDB | Dynamic SQL (columns with all nulls) |
Example¶
| id | empty_col | title | | id | title |
|---|---|---|---|---|---|
| 1 | null | "Art" | → | 1 | "Art" |
| 2 | null | "Photo" | | 2 | "Photo" |
Polars¶
DuckDB¶
18. DropColumnsProcessor¶
Drop specified columns from the dataframe.
| Honeysuckle | drop_columns_processor.py |
| Polars | drop() |
| DuckDB | EXCLUDE |
Example¶
| id | internal_id | title | | id | title |
|---|---|---|---|---|---|
| 1 | "abc123" | "Art" | → | 1 | "Art" |
Polars¶
Drop single column:
DuckDB¶
Or explicitly list columns to keep:
19. CopyFieldDataframeProcessor¶
Copy values from one column to another (create new column).
| Honeysuckle | copy_field_processor.py |
| Polars | alias() |
| DuckDB | SELECT col AS new_col |
Example¶
| original_title | | original_title | display_title |
|---|---|---|---|
| "Mona Lisa" | → | "Mona Lisa" | "Mona Lisa" |
Polars¶
Copy multiple columns:
df = df.with_columns(
pl.col("title").alias("display_title"),
pl.col("date").alias("publication_date"),
)
DuckDB¶
Copy multiple:
20. ReplaceProcessor¶
Replace exact values in a column.
| Honeysuckle | replace_processor.py |
| Polars | when().then().otherwise() for exact value replacement |
| DuckDB | CASE WHEN for exact value replacement |
# Honeysuckle
ReplaceProcessor(target_field="status", to_replace="", replacement="unknown")
ReplaceProcessor(target_field="category", to_replace="N/A", replacement=None)
Example¶
| status | | status |
|---|---|---|
| "" | → | "unknown" |
Polars¶
df = df.with_columns(
pl.when(pl.col("status") == "")
.then(pl.lit("unknown"))
.otherwise(pl.col("status"))
.alias("status")
)
Best practice for substring/regex replacement with str.replace_all():
DuckDB¶
Best practice for substring/regex replacement with regexp_replace():
21. MergeProcessor¶
Join/merge two dataframes together.
| Honeysuckle | merge_processor.py |
| Polars | join() |
| DuckDB | JOIN |
# Honeysuckle
MergeProcessor(
merge_dataframe=artists_df,
merge_columns=["artist_name", "birth_year"],
merge_dataframe_left="artist_id",
merge_dataframe_right="id",
how="left"
)
Example¶
sales:
| id | artist_id |
|---|---|
| 1 | 100 |
artists:
| id | artist_name |
|---|---|
| 100 | "Monet" |
Result:
| id_x | artist_id | artist_name | id_y |
|---|---|---|---|
| 1 | 100 | "Monet" | 100 |
Polars¶
# Polars drops right_on column; keep it explicitly as id_y
merged = sales.join(
artists.select(["id", "artist_name"]).with_columns(pl.col("id").alias("id_y")),
left_on="artist_id",
right_on="id",
how="left",
)
return merged.rename({"id": "id_x"})
The suffix parameter automatically appends a suffix to duplicate column names from the right DataFrame.
Additional patterns:
# Inner join
df = sales.join(artists, left_on="artist_id", right_on="id", how="inner")
# Join with custom suffix for duplicate column names
df = sales.join(artists, left_on="artist_id", right_on="id", suffix="_artist")
DuckDB¶
SELECT s.id AS id_x, s.artist_id, a.artist_name, a.id AS id_y
FROM 'sales.parquet' s
LEFT JOIN 'artists.parquet' a ON s.artist_id = a.id
Additional patterns:
-- Inner join
SELECT s.*, a.artist_name
FROM sales s
INNER JOIN artists a ON s.artist_id = a.id
-- Using USING for same column name
SELECT * FROM sales JOIN artists USING (artist_id)
22. FillMissingTuplesProcessor¶
Fill missing tuple/list values in multiple columns by inserting nulls to match the row's tuple length.
Inefficient Implementation
Honeysuckle uses itertuples() row-by-row iteration with nested loops. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | fill_missing_tuples_processor.py |
| Polars | struct.field() with fill_null() |
| DuckDB | Struct access with coalesce() |
Example¶
| names | roles | | names | roles |
|---|---|---|---|---|
| ("Alice", "Bob") | null | → | ("Alice", "Bob") | (null, null) |
| null | ("editor", "viewer") | → | (null, null) | ("editor", "viewer") |
Polars¶
df = df.with_columns(
pl.when(pl.col("names").is_null())
.then(pl.lit([None, None]))
.otherwise(pl.col("names"))
.alias("names"),
pl.when(pl.col("roles").is_null())
.then(pl.lit([None, None]))
.otherwise(pl.col("roles"))
.alias("roles"),
)
DuckDB¶
23. ConditionalBoolProcessor¶
Evaluate a pandas query expression and set boolean result.
| Honeysuckle | conditional_bool_processor.py |
| Polars | Boolean expressions |
| DuckDB | Boolean expressions |
# Honeysuckle
ConditionalBoolProcessor(expression="price > 1000 and status == 'active'", result_column="is_premium")
Example¶
| price | status | | is_premium |
|---|---|---|---|
| 1500 | "active" | → | true |
| 500 | "active" | → | false |
Polars¶
df = df.with_columns(
((pl.col("price") > 1000) & (pl.col("status") == "active")).alias("is_premium")
)
Complex expressions:
df = df.with_columns(
(
(pl.col("price") > 1000) |
(pl.col("featured") == True)
).alias("is_highlight")
)
DuckDB¶
Complex expressions:
24. CapitalizeProcessor¶
Capitalize the first character of strings.
| Honeysuckle | capitalize_processor.py |
| Polars | str.head() + str.to_uppercase(), or str.to_titlecase() |
| DuckDB | initcap() |
Example¶
| title | | title |
|---|---|---|
| "hello world" | → | "Hello world" |
Polars¶
# Step 1: Get first char and rest
df = df.with_columns(
pl.col("title").str.head(1).str.to_uppercase().alias("_first"),
pl.col("title").str.slice(1).str.to_lowercase().alias("_rest"),
)
# Step 2: Combine and clean up
return df.with_columns((pl.col("_first") + pl.col("_rest")).alias("title")).drop(
["_first", "_rest"]
)
Note: str.head(1) is cleaner than .str.slice(0, 1) for extracting the first character.
Title case (capitalize each word):
DuckDB¶
How it works:
- `title[1]`: DuckDB uses 1-based indexing for strings
- `title[2:]`: slice from the second character to the end
Title case (capitalize each word):
25. AppendStringProcessor¶
Append or prepend a string to column values.
| Honeysuckle | append_string_processor.py |
| Polars | + operator or pl.concat_str() |
| DuckDB | || operator or concat() |
# Honeysuckle
AppendStringProcessor(column="id", value="ID-", prefix=True) # Prepend
AppendStringProcessor(column="filename", value=".jpg") # Append
Example¶
| id | | id |
|---|---|---|
| "123" | → | "ID-123" |
Polars¶
df = df.with_columns(
(pl.lit("ID-") + pl.col("id")).alias("id"),
(pl.col("filename") + pl.lit(".jpg")).alias("filename"),
)
Using concat_str:
DuckDB¶
Using concat:
26. AddNumberProcessor¶
Add a numeric value to column values.
| Honeysuckle | add_number_processor.py |
| Polars | + operator |
| DuckDB | + operator |
Example¶
| year | | next_year |
|---|---|---|
| 2024 | → | 2025 |
Polars¶
Add to same column:
DuckDB¶
Add to computed expression:
27. SubtractNumberProcessor¶
Subtract a numeric value from column values.
| Honeysuckle | subtract_number_processor.py |
| Polars | - operator |
| DuckDB | - operator |
Example¶
| year | | prev_year |
|---|---|---|
| 2024 | → | 2023 |
Polars¶
Calculate difference:
DuckDB¶
Calculate difference:
28. SplitStringProcessor¶
Split string into a list based on delimiter.
| Honeysuckle | split_string_processor.py |
| Polars | str.split() |
| DuckDB | string_split() |
Example¶
| tags | | tags |
|---|---|---|
| "a,b,c" | → | ["a", "b", "c"] |
Polars¶
Split with default space delimiter:
DuckDB¶
Split with default space:
29. AsTypeProcessor¶
Cast column to a different data type.
| Honeysuckle | as_type_processor.py |
| Polars | cast() |
| DuckDB | ::TYPE or CAST() |
# Honeysuckle
AsTypeProcessor(column_name="year", new_type="int")
AsTypeProcessor(column_name="price", new_type="float")
Example¶
| year | | year |
|---|---|---|
| "2024" | → | 2024 (int) |
Polars¶
DuckDB¶
Alternative CAST syntax:
30. DropRowsOnConditionProcessor¶
Filter/drop rows based on a condition.
| Honeysuckle | drop_rows_on_condition_processor.py |
| Polars | filter() |
| DuckDB | WHERE clause |
# Honeysuckle
DropRowsOnConditionProcessor(target_field="status", conditional_operator="==", conditional_value="deleted")
DropRowsOnConditionProcessor(target_field="price", conditional_operator="<", conditional_value=0)
Example¶
| status | price | | status | price |
|---|---|---|---|---|
| "active" | 100 | → | "active" | 100 |
| "deleted" | 50 | → | (dropped) | |
Polars¶
DuckDB¶
31. ColumnsToDictsProcessor¶
Combine multiple columns into a single column containing a list of dictionaries (nested structs). Useful for creating nested JSON output. Handles both single-value and multivalue columns.
Inefficient Implementation
Honeysuckle uses itertuples() and for...range row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | columns_to_dicts_processor.py |
| Polars | pl.struct() wrapped in pl.concat_list() |
| DuckDB | Struct literals in list |
# Honeysuckle
ColumnsToDictsProcessor(
new_field="dimensions",
column_names=["width", "height", "depth"],
new_column_names=["w", "h", "d"] # Optional rename
)
Example (single values)¶
| width | height | depth | | dimensions |
|---|---|---|---|---|
| 10 | 20 | 5 | → | [{"w": 10, "h": 20, "d": 5}] |
| 15 | 30 | 8 | → | [{"w": 15, "h": 30, "d": 8}] |
Example (multivalue - matching length lists)¶
| name | role | | people |
|---|---|---|---|
| ["Alice", "Bob"] | ["admin", "user"] | → | [{"name": "Alice", "role": "admin"}, {"name": "Bob", "role": "user"}] |
Polars¶
df = df.with_columns(
pl.concat_list(
pl.struct(
pl.col("width").alias("w"),
pl.col("height").alias("h"),
pl.col("depth").alias("d"),
)
).alias("dimensions")
).drop(["width", "height", "depth"])
For multivalue columns (lists of same length):
df = pl.DataFrame({
"name": [["Alice", "Bob"]],
"role": [["admin", "user"]]
})
# Explode, create struct, then group back
df = (
df.with_row_index("_idx")
.explode(["name", "role"])
.with_columns(
pl.struct(["name", "role"]).alias("person")
)
.group_by("_idx")
.agg(pl.col("person").alias("people"))
.drop("_idx")
)
DuckDB¶
Multivalue: transform parallel lists into list of structs using list_zip():
-- list_zip() pairs elements into structs with fields list_1, list_2;
-- the comprehension cannot destructure, so access the fields explicitly
SELECT [
    {'name': x.list_1, 'role': x.list_2}
    FOR x IN list_zip(names, roles)
] AS people
FROM items
32. AppendProcessor¶
Append values from multiple columns into a single list column.
Inefficient Implementation
Honeysuckle uses itertuples() and row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.
| Honeysuckle | AppendProcessor.py |
| Polars | concat_list() |
| DuckDB | list_concat() |
Example¶
| tags | category | source | | tags (after) |
|---|---|---|---|---|
| ("a", "b") | "new" | "web" | → | ["a", "b", "new", "web"] |
Polars¶
# Append category and source to tags list (modifies tags in-place)
return df.with_columns(pl.concat_list(["tags", "category", "source"]).alias("tags"))
DuckDB¶
-- Append category and source to tags list (modifies tags in-place)
SELECT list_concat(list_concat(tags, [category]), [source]) AS tags, category, source
33. EnsureTuplesProcessor¶
Convert scalar values to single-element lists (tuples in Honeysuckle terminology).
Inefficient Implementation
Honeysuckle uses itertuples() and iat[] for row-by-row updates. The Polars/DuckDB equivalents are vectorized.
| Honeysuckle | EnsureTuplesProcessor.py |
| Polars | concat_list() |
| DuckDB | Array literal [value] |
Example¶
| names | roles | | names | roles |
|---|---|---|---|---|
| "Alice" | "admin" | → | ["Alice"] | ["admin"] |
| "Bob" | "user" | → | ["Bob"] | ["user"] |
Polars¶
# Wrap scalar values in single-element lists
return df.with_columns(
pl.concat_list(pl.col("names")).alias("names"),
pl.concat_list(pl.col("roles")).alias("roles"),
)
DuckDB¶
34. ExtractCharacterDataframeProcessor¶
Extract a pattern from a string column using regex.
Inefficient Implementation
Honeysuckle uses .apply() with a lambda function for row-by-row regex extraction. The Polars/DuckDB equivalents are vectorized.
| Honeysuckle | ExtractCharacterDataframeProcessor.py |
| Polars | str.extract() |
| DuckDB | regexp_extract() |
# Honeysuckle
ExtractCharacterDataframeProcessor(
target_field="text",
pattern=r"-(\d+)-",
new_field="number"
)
Example¶
| text | | number |
|---|---|---|
| "Item-123-A" | → | "123" |
| "Product-456-B" | → | "456" |
Polars¶
DuckDB¶
35. SumColumnValuesProcessor¶
Sum column values grouped by another column and join back to the original dataframe.
Inefficient Implementation
Honeysuckle uses itertuples() and row-by-row value setting via loc[]. The Polars/DuckDB equivalents use vectorized aggregation and joins.
| Honeysuckle | SumColumnValuesProcessor.py |
| Polars | group_by().agg() + join() |
| DuckDB | Subquery with GROUP BY + JOIN |
# Honeysuckle
SumColumnValuesProcessor(
column_to_sum="amount",
result_column="total",
result_label_column="category",
sum_counts=True,
group_sum_counts_by="category"
)
Example¶
| category | amount | | category | amount | total |
|---|---|---|---|---|---|
| "A" | 10 | → | "A" | 10 | 90 |
| "B" | 20 | → | "B" | 20 | 60 |
| "A" | 30 | → | "A" | 30 | 90 |
| "B" | 40 | → | "B" | 40 | 60 |
| "A" | 50 | → | "A" | 50 | 90 |
Polars¶
sums = df.group_by("category").agg(pl.col("amount").sum().alias("total"))
return df.join(sums, on="category", how="left")
DuckDB¶
SELECT s.*, totals.total
FROM 'source.parquet' s
JOIN (SELECT category, sum(amount) AS total FROM 'source.parquet' GROUP BY category) totals
USING (category)
Quick Reference¶
| Processor | Polars | DuckDB |
|---|---|---|
| FillEmptyProcessor | .fill_null() | coalesce() / ifnull() |
| RemoveMultivalueNullsProcessor | .list.eval() + .list.len() | list_filter() + nullif() |
| StripStringProcessor | .str.strip_chars() | regexp_replace() |
| ExtractFirstProcessor | .list.first() | col[1] |
| ExtractOnConditionProcessor | pl.when().then().otherwise() | if() / CASE WHEN |
| ContainsBoolProcessor | .str.contains() | .contains() |
| StringConstantDataframeProcessor | pl.lit() | 'value' AS col |
| AppendOnConditionProcessor | concat_list() + .list.concat() | list_concat() + CASE |
| ImplodeProcessor | .group_by().agg() | list() + GROUP BY ALL |
| ConcatProcessor | pl.concat_str() | concat_ws() / \|\| |
| RenameProcessor | .rename() | AS new_name |
| IsEmptyProcessor | .is_null() / .is_not_null() | IS NULL / IS NOT NULL |
| ExplodeColumnsProcessor | .explode() | unnest() |
| ReplaceOnConditionProcessor | pl.when().then().otherwise() | if() / greatest() |
| LowerStringProcessor | .str.to_lowercase() | .lower() |
| KeepOnlyProcessor | .select() | Column list |
| DropNullColumnsProcessor | Filter columns | Dynamic SQL |
| DropColumnsProcessor | .drop() | EXCLUDE |
| CopyFieldDataframeProcessor | .alias() | AS new_col |
| ReplaceProcessor | when().then().otherwise() | CASE WHEN |
| MergeProcessor | .join() | JOIN |
| FillMissingTuplesProcessor | .struct.field().fill_null() | coalesce(struct.field) |
| ConditionalBoolProcessor | Boolean expression | Boolean expression |
| CapitalizeProcessor | .str.head() + .str.to_uppercase() | upper(title[1]) \|\| lower(title[2:]) |
| AppendStringProcessor | + / pl.concat_str() | \|\| / concat() |
| AddNumberProcessor | + | + |
| SubtractNumberProcessor | - | - |
| SplitStringProcessor | .str.split() | string_split() |
| AsTypeProcessor | .cast() | ::TYPE |
| DropRowsOnConditionProcessor | .filter() | WHERE |
| ColumnsToDictsProcessor | pl.concat_list(pl.struct()) | [{...} FOR ... IN list_zip()] |
| AppendProcessor | pl.concat_list() | list_concat() |
| EnsureTuplesProcessor | pl.concat_list() | [value] |
| ExtractCharacterDataframeProcessor | .str.extract() | regexp_extract() |
| SumColumnValuesProcessor | .group_by().agg() + .join() | GROUP BY + JOIN |