
Processor Equivalents

Common Honeysuckle processors and their native Polars/DuckDB equivalents.

Why Native Operations?

Native Polars/DuckDB operations are:

  • Faster: No Python loop overhead, vectorized execution
  • Simpler: Less boilerplate, more readable
  • Type-safe: Better IDE support and error messages

Test Script

All examples are tested in scripts/test_processor_equivalents.py. Run with: uv run scripts/test_processor_equivalents.py --verbose

Top 35 Processors by Usage

Based on analysis of Huntington Extractor, Universal Yiddish Library, and Royal Society SITM codebases.

| Rank | Processor | Usage | Operation |
|---|---|---|---|
| 1 | FillEmptyProcessor | 55 | Fill nulls with column/constant |
| 2 | RemoveMultivalueNullsProcessor | 47 | Remove nulls from lists |
| 3 | StripStringProcessor | 44 | Strip whitespace |
| 4 | ExtractFirstProcessor | 42 | Get first element from list |
| 5 | ExtractOnConditionProcessor | 29 | Extract value conditionally |
| 6 | ColumnsToDictsProcessor | 27 | Combine columns into nested dicts |
| 7 | ContainsBoolProcessor | 26 | Check if string contains pattern |
| 8 | StringConstantDataframeProcessor | 23 | Add constant column |
| 9 | AppendOnConditionProcessor | 18 | Append to list conditionally |
| 10 | ImplodeProcessor | 15 | Aggregate values into lists |
| 11 | ConcatProcessor | 15 | Concatenate columns |
| 12 | RenameProcessor | 13 | Rename columns |
| 13 | IsEmptyProcessor | 13 | Check if column is null |
| 14 | ExplodeColumnsProcessor | 13 | Unnest lists to rows |
| 15 | ReplaceOnConditionProcessor | 10 | Conditional value replacement |
| 16 | LowerStringProcessor | 9 | Lowercase strings |
| 17 | KeepOnlyProcessor | 9 | Keep only specified columns |
| 18 | DropNullColumnsProcessor | 9 | Drop all-null columns |
| 19 | DropColumnsProcessor | 9 | Drop specified columns |
| 20 | CopyFieldDataframeProcessor | 9 | Copy column values |
| 21 | ReplaceProcessor | 8 | String replacement |
| 22 | MergeProcessor | 8 | Join dataframes |
| 23 | FillMissingTuplesProcessor | 8 | Fill missing struct fields |
| 24 | ConditionalBoolProcessor | 8 | Set boolean from expression |
| 25 | CapitalizeProcessor | 4 | Capitalize first character |
| 26 | AppendStringProcessor | 4 | Append/prepend string |
| 27 | AddNumberProcessor | 4 | Add value to column |
| 28 | SubtractNumberProcessor | 4 | Subtract value from column |
| 29 | SplitStringProcessor | 3 | Split string to list |
| 30 | AsTypeProcessor | — | Cast column types |
| 31 | DropRowsOnConditionProcessor | — | Filter rows by condition |
| 32 | AppendProcessor | — | Append multiple columns to list |
| 33 | EnsureTuplesProcessor | — | Convert scalars to lists |
| 34 | ExtractCharacterDataframeProcessor | — | Extract pattern from string |
| 35 | SumColumnValuesProcessor | — | Sum values by group |

1. FillEmptyProcessor

Fill empty values with values from a different column or a constant.

Honeysuckle fill_empty_processor.py
Polars fill_null()
DuckDB coalesce() / ifnull()
# Honeysuckle
FillEmptyProcessor(column_name="title", column_fill="alt_title")
FillEmptyProcessor(column_name="price", constant_fill=0)
FillEmptyProcessor(column_name="name", column_fill="display_name", constant_fill="Unknown")

Example

| title (before) | alt_title | title (after) |
|---|---|---|
| None | "Backup" | "Backup" |
| "Main" | "Backup" | "Main" |

Polars

df = df.with_columns(
    pl.col("title").fill_null(pl.col("alt_title")),
    pl.col("price").fill_null(0),
)

Chained fallback:

df = df.with_columns(
    pl.col("name").fill_null(pl.col("display_name")).fill_null("Unknown")
)

DuckDB

SELECT
    coalesce(title, alt_title) AS title,
    alt_title,
    ifnull(price, 0) AS price

Chained coalesce:

SELECT coalesce(name, display_name, 'Unknown') AS name FROM items


2. RemoveMultivalueNullsProcessor

Remove null values from list/array columns.

Honeysuckle remove_multivalue_nulls_processor.py
Polars list.eval() with drop_nulls(), list.len()
DuckDB list_filter() with lambda, nullif()
# Honeysuckle
RemoveMultivalueNullsProcessor(column="tags")

Example

| tags (before) | tags (after) |
|---|---|
| ["a", None, "b"] | ["a", "b"] |
| [None, None] | null |

Polars

    # Step 1: Remove nulls from list
    df = df.with_columns(pl.col("tags").list.eval(pl.element().drop_nulls()).alias("tags"))
    # Step 2: Convert empty lists to null
    return df.with_columns(
        pl.when(pl.col("tags").list.len() == 0).then(None).otherwise(pl.col("tags")).alias("tags")
    )

DuckDB

SELECT nullif(list_filter(tags, x -> x IS NOT NULL), []) AS tags

How it works:

  • list_filter(tags, x -> x IS NOT NULL) keeps only the non-null elements
  • nullif(..., []) turns the resulting empty list into NULL

Alternative list comprehension:

SELECT nullif([x FOR x IN tags IF x IS NOT NULL], []) AS tags
FROM items


3. StripStringProcessor

Strip whitespace (or specific characters) from strings.

Honeysuckle strip_string_processor.py
Polars str.strip_chars()
DuckDB regexp_replace() for full whitespace; .trim() for spaces only
# Honeysuckle
StripStringProcessor(column_name="name", new_column="name_clean")
StripStringProcessor(column_name="code", new_column="code", strip_string="-")

Example

| name | name_clean |
|---|---|
| " hello " | "hello" |
| "\tworld\n" | "world" |

Polars

df = df.with_columns(pl.col("name").str.strip_chars().alias("name_clean"))

Additional patterns:

# Strip specific character
df = df.with_columns(pl.col("code").str.strip_chars("-"))

# Strip only leading (start) or trailing (end)
df = df.with_columns(
    pl.col("name").str.strip_chars_start(),  # Left only
    pl.col("name").str.strip_chars_end(),    # Right only
)

DuckDB

-- trim() only removes spaces; use regexp_replace for all whitespace
SELECT name, regexp_replace(name, '^\s+|\s+$', '', 'g') AS name_clean

Additional patterns:

-- Strip specific character
SELECT trim(code, '-') AS code FROM items

-- Strip only leading/trailing
SELECT name.ltrim() AS name FROM items  -- Left only
SELECT name.rtrim() AS name FROM items  -- Right only


4. ExtractFirstProcessor

Extract first value from multivalue column.

Honeysuckle extract_first_processor.py
Polars list.first()
DuckDB list[1] (1-indexed) or list_extract()
# Honeysuckle
ExtractFirstProcessor(column_name="authors", result_column="primary_author")
ExtractFirstProcessor(column_name="tags")  # Overwrites same column

Example

| authors | primary_author |
|---|---|
| ["Alice", "Bob"] | "Alice" |
| ["Charlie"] | "Charlie" |

Polars

df = df.with_columns(pl.col("authors").list.first().alias("primary_author"))

DuckDB

SELECT authors, authors[1] AS primary_author

Replace in-place:

SELECT * REPLACE (tags[1] AS tags) FROM items


5. ExtractOnConditionProcessor

Extract value from one column to another based on a condition.

Honeysuckle extract_on_condition_processor.py
Polars when().then().otherwise()
DuckDB CASE WHEN or if()
# Honeysuckle
ExtractOnConditionProcessor(
    expression="is_on_sale == True",
    extract_columns=["price"],
    result_columns=["sale_price"],
)

Example

| price | is_on_sale | sale_price |
|---|---|---|
| 100 | true | 100 |
| 200 | false | null |

Polars

df = df.with_columns(
    pl.when(pl.col("is_on_sale")).then(pl.col("price")).otherwise(None).alias("sale_price")
)

Multiple conditions:

df = df.with_columns(
    pl.when(pl.col("status") == "active")
      .then(pl.col("current_value"))
      .when(pl.col("status") == "pending")
      .then(pl.col("estimated_value"))
      .otherwise(None)
      .alias("display_value")
)

DuckDB

SELECT price, is_on_sale, if(is_on_sale, price, NULL) AS sale_price

Multiple conditions with CASE:

SELECT CASE
    WHEN status = 'active' THEN current_value
    WHEN status = 'pending' THEN estimated_value
END AS display_value
FROM items


6. ContainsBoolProcessor

Check if a string contains a pattern and set boolean result.

Honeysuckle contains_bool_processor.py
Polars str.contains()
DuckDB .contains() (dot notation) or LIKE/ILIKE
# Honeysuckle
ContainsBoolProcessor(column_name="description", result_column="has_keyword", pattern="important")

Example

| description | has_keyword |
|---|---|
| "This is important" | true |
| "Nothing here" | false |

Polars

df = df.with_columns(pl.col("description").str.contains("important").alias("has_keyword"))

Regex pattern:

df = df.with_columns(
    pl.col("email").str.contains(r"@gmail\.com$").alias("is_gmail")
)

DuckDB

SELECT description, description.contains('important') AS has_keyword

Additional patterns:

-- Case-insensitive with ILIKE
SELECT description ILIKE '%important%' AS has_keyword FROM items

-- Regex pattern with regexp_matches()
SELECT regexp_matches(email, '@gmail\.com$') AS is_gmail FROM items



7. StringConstantDataframeProcessor

Add a column with a constant string value.

Honeysuckle string_constant_dataframe_processor.py
Polars pl.lit()
DuckDB Literal value in SELECT
# Honeysuckle
StringConstantDataframeProcessor(target_field="source", value="huntington")

Example

| id (before) | id (after) | source |
|---|---|---|
| 1 | 1 | "huntington" |
| 2 | 2 | "huntington" |

Polars

df = df.with_columns(pl.lit("huntington").alias("source"))

Add multiple constants:

df = df.with_columns(
    pl.lit("huntington").alias("source"),
    pl.lit("2.0").alias("version"),
    pl.lit(True).alias("is_published"),
)

DuckDB

SELECT *, 'huntington' AS source

Add multiple constants:

SELECT *,
    'huntington' AS source,
    '2.0' AS version,
    true AS is_published
FROM items


8. AppendOnConditionProcessor

Append columns to a list/tuple based on a condition (uses pandas query syntax).

Inefficient Implementation

Honeysuckle uses for i in range(len(data.index)) row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.

Honeysuckle append_on_condition_processor.py
Polars when().then() with list.concat(), concat_list()
DuckDB list_concat() with CASE WHEN
# Honeysuckle
AppendOnConditionProcessor(
    initial_column="tags",
    append_columns=["category"],
    condition="is_featured == True"
)

Example

| tags (before) | category | is_featured | tags (after) |
|---|---|---|---|
| ["art"] | "new" | true | ["art", "new"] |
| ["photo"] | "old" | false | ["photo"] |

Polars

    # Step 1: Wrap scalar in list
    df = df.with_columns(pl.concat_list(pl.col("category")).alias("category_list"))
    # Step 2: Conditionally append
    return df.with_columns(
        pl.when(pl.col("is_featured"))
        .then(pl.col("tags").list.concat(pl.col("category_list")))
        .otherwise(pl.col("tags"))
        .alias("tags")
    ).drop("category_list")

Why the intermediate step? The list.concat() method requires both operands to be lists. Using concat_list() wraps the scalar in a list cleanly.

DuckDB

SELECT
    CASE WHEN is_featured
        THEN list_concat(tags, [category])
        ELSE tags
    END AS tags,
    category,
    is_featured

9. ImplodeProcessor

Aggregate values into a list during group by.

Honeysuckle implode_processor.py
Polars Automatic in group_by().agg()
DuckDB list() with GROUP BY ALL
# Honeysuckle
ImplodeProcessor(column_names=["tag"], index_column="artwork_id")

Example

Before:

| artwork_id | tag |
|---|---|
| 1 | "oil" |
| 1 | "portrait" |
| 2 | "landscape" |

After:

| artwork_id | tag |
|---|---|
| 1 | ["oil", "portrait"] |
| 2 | ["landscape"] |

Polars

df = df.group_by("artwork_id").agg(pl.col("tag")).sort("artwork_id")

With ordering inside the list:

df = df.group_by("artwork_id").agg(
    pl.col("filename").sort_by("sort_order").alias("media_files")
)

DuckDB

SELECT artwork_id, list(tag) AS tag
FROM 'source.parquet'
GROUP BY ALL
ORDER BY artwork_id

With ordering inside the list:

SELECT artwork_id, list(filename ORDER BY sort_order) AS media_files
FROM media
GROUP BY ALL


10. ConcatProcessor

Concatenate multiple columns into a single string.

Inefficient Implementation

Honeysuckle uses itertuples() row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.

Honeysuckle concat_processor.py
Polars pl.concat_str()
DuckDB concat() or || operator
# Honeysuckle
ConcatProcessor(new_field="full_name", join_fields=["first", "last"], join_on=" ")
ConcatProcessor(new_field="code", join_fields=["prefix", "id"], join_on="-")

Example

| first | last | full_name |
|---|---|---|
| "John" | "Doe" | "John Doe" |
| "Jane" | "Smith" | "Jane Smith" |
| null | "Unknown" | "None Unknown" |

Polars

df = df.with_columns(
    pl.concat_str(
        [pl.col("first").fill_null("None"), pl.col("last").fill_null("None")],
        separator=" ",
        ignore_nulls=False,
    ).alias("full_name")
)

Best practice - skip nulls instead of stringifying them:

df = df.with_columns(
    pl.concat_str(["prefix", "id"], separator=" ", ignore_nulls=True).alias("code")
)

DuckDB

SELECT
    first,
    last,
    concat(coalesce(first::VARCHAR, 'None'), ' ', coalesce(last::VARCHAR, 'None')) AS full_name

Best practice - skip nulls with concat_ws():

SELECT concat_ws(' ', first, last) AS full_name FROM items


11. RenameProcessor

Rename columns in a dataframe.

Honeysuckle rename_processor.py
Polars rename()
DuckDB AS in SELECT
# Honeysuckle
RenameProcessor(mapping={"old_name": "new_name", "title": "artwork_title"})

Example

| old_name (before) | title (before) | new_name (after) | artwork_title (after) |
|---|---|---|---|
| "value" | "Art" | "value" | "Art" |

Polars

df = df.rename({"old_name": "new_name", "title": "artwork_title"})

DuckDB

SELECT old_name AS new_name, title AS artwork_title

Keep all columns, rename specific ones:

SELECT * REPLACE (title AS artwork_title) FROM items


12. IsEmptyProcessor

Check if column values are null and set boolean result.

Honeysuckle is_empty_processor.py
Polars is_null() / is_not_null()
DuckDB IS NULL / IS NOT NULL
# Honeysuckle
IsEmptyProcessor(column_name="email", result_column="missing_email")
IsEmptyProcessor(column_name="email", result_column="has_email", invert=True)

Example

| email | missing_email | has_email |
|---|---|---|
| null | true | false |
| "a@b.com" | false | true |

Polars

df = df.with_columns(
    pl.col("email").is_null().alias("missing_email"),
    pl.col("email").is_not_null().alias("has_email"),
)

DuckDB

SELECT
    email,
    email IS NULL AS missing_email,
    email IS NOT NULL AS has_email

13. ExplodeColumnsProcessor

Unnest/explode list columns into separate rows.

Honeysuckle explode_columns_processor.py
Polars explode()
DuckDB unnest()
# Honeysuckle
ExplodeColumnsProcessor(column_names=["tags"])

Example

Before:

| id | tags |
|---|---|
| 1 | ["a", "b"] |

After:

| id | tags |
|---|---|
| 1 | "a" |
| 1 | "b" |

Polars

df = df.explode("tags")

Explode multiple columns (must have same length lists):

df = df.explode("tags", "tag_ids")

DuckDB

SELECT id, unnest(tags) AS tags

Keep all other columns:

SELECT *, unnest(tags) AS tag FROM items


14. ReplaceOnConditionProcessor

Replace values based on a condition (==, !=, <, >, <=, >=).

Honeysuckle replace_on_condition_processor.py
Polars when().then().otherwise()
DuckDB CASE WHEN / greatest() / least()
# Honeysuckle
ReplaceOnConditionProcessor(
    target_field="status",
    conditional_operator="==",
    conditional_value="",
    replacement="unknown"
)
ReplaceOnConditionProcessor(
    target_field="price",
    conditional_operator="<",
    conditional_value=0,
    replacement=0
)

Example

| status (before) | status (after) |
|---|---|
| "" | "unknown" |
| "active" | "active" |

Polars

df = df.with_columns(
    pl.when(pl.col("status") == "")
    .then(pl.lit("unknown"))
    .otherwise(pl.col("status"))
    .alias("status"),
    pl.max_horizontal(pl.col("price"), pl.lit(0)).alias("price"),
)

Note: max_horizontal() returns the maximum value across columns, useful for clamping values.

DuckDB

SELECT
    if(status = '', 'unknown', status) AS status,
    greatest(price, 0) AS price

Complex condition with CASE:

SELECT CASE WHEN price < 0 THEN 0 ELSE price END AS price FROM items


15. LowerStringProcessor

Convert strings to lowercase.

Honeysuckle lower_string_processor.py
Polars str.to_lowercase()
DuckDB .lower() (dot notation)
# Honeysuckle
LowerStringProcessor(column_name="email")

Example

| email (before) | email (after) |
|---|---|
| "John@Example.COM" | "john@example.com" |

Polars

df = df.with_columns(pl.col("email").str.to_lowercase())

DuckDB

SELECT email.lower() AS email

Traditional function syntax:

SELECT lower(email) AS email FROM items


16. KeepOnlyProcessor

Keep only specified columns, drop all others.

Honeysuckle keep_only_processor.py
Polars select()
DuckDB Column list in SELECT
# Honeysuckle
KeepOnlyProcessor(column_names=["id", "title"])

Example

Before:

| id | title | author | internal_code |
|---|---|---|---|
| 1 | "Art" | "Jane" | "X1" |

After:

| id | title |
|---|---|
| 1 | "Art" |

Polars

df = df.select(["id", "title"])

Using pl.col:

df = df.select(pl.col("id", "title"))

DuckDB

SELECT id, title

17. DropNullColumnsProcessor

Drop columns where ALL values are null.

Honeysuckle drop_null_columns_processor.py
Polars Filter columns using all()
DuckDB Dynamic SQL (columns with all nulls)
# Honeysuckle
DropNullColumnsProcessor()  # No arguments

Example

Before:

| id | empty_col | title |
|---|---|---|
| 1 | null | "Art" |
| 2 | null | "Photo" |

After:

| id | title |
|---|---|
| 1 | "Art" |
| 2 | "Photo" |

Polars

null_cols = [col for col in df.columns if df[col].is_null().all()]
df = df.drop(null_cols)

DuckDB

-- Must enumerate non-null columns manually
SELECT id, title

18. DropColumnsProcessor

Drop specified columns from the dataframe.

Honeysuckle drop_columns_processor.py
Polars drop()
DuckDB EXCLUDE
# Honeysuckle
DropColumnsProcessor(column_names=["internal_id"])

Example

Before:

| id | internal_id | title |
|---|---|---|
| 1 | "abc123" | "Art" |

After:

| id | title |
|---|---|
| 1 | "Art" |

Polars

df = df.drop(["internal_id"])

Drop single column:

df = df.drop("internal_id")

DuckDB

SELECT * EXCLUDE (internal_id)

Or explicitly list columns to keep:

SELECT id, title FROM items


19. CopyFieldDataframeProcessor

Copy values from one column to another (create new column).

Honeysuckle copy_field_processor.py
Polars alias()
DuckDB SELECT col AS new_col
# Honeysuckle
CopyFieldDataframeProcessor(target_field="original_title", new_field="display_title")

Example

| original_title (before) | original_title (after) | display_title |
|---|---|---|
| "Mona Lisa" | "Mona Lisa" | "Mona Lisa" |

Polars

df = df.with_columns(pl.col("original_title").alias("display_title"))

Copy multiple columns:

df = df.with_columns(
    pl.col("title").alias("display_title"),
    pl.col("date").alias("publication_date"),
)

DuckDB

SELECT *, original_title AS display_title

Copy multiple:

SELECT *, title AS display_title, date AS publication_date FROM items


20. ReplaceProcessor

Replace exact values in a column.

Honeysuckle replace_processor.py
Polars when().then().otherwise() for exact value replacement
DuckDB CASE WHEN for exact value replacement
# Honeysuckle
ReplaceProcessor(target_field="status", to_replace="", replacement="unknown")
ReplaceProcessor(target_field="category", to_replace="N/A", replacement=None)

Example

| status (before) | status (after) |
|---|---|
| "" | "unknown" |

Polars

df = df.with_columns(
    pl.when(pl.col("status") == "")
    .then(pl.lit("unknown"))
    .otherwise(pl.col("status"))
    .alias("status")
)

Best practice for substring/regex replacement with str.replace_all():

df = df.with_columns(
    pl.col("text").str.replace_all(r"\s+", " ")  # Collapse whitespace
)

DuckDB

SELECT CASE WHEN status = '' THEN 'unknown' ELSE status END AS status

Best practice for substring/regex replacement with regexp_replace():

SELECT regexp_replace(text, '\s+', ' ', 'g') AS text FROM items


21. MergeProcessor

Join/merge two dataframes together.

Honeysuckle merge_processor.py
Polars join()
DuckDB JOIN
# Honeysuckle
MergeProcessor(
    merge_dataframe=artists_df,
    merge_columns=["artist_name", "birth_year"],
    merge_dataframe_left="artist_id",
    merge_dataframe_right="id",
    how="left"
)

Example

sales:

| id | artist_id |
|---|---|
| 1 | 100 |

artists:

| id | artist_name |
|---|---|
| 100 | "Monet" |

Result:

| id_x | artist_id | artist_name | id_y |
|---|---|---|---|
| 1 | 100 | "Monet" | 100 |

Polars

    # Polars drops right_on column; keep it explicitly as id_y
    merged = sales.join(
        artists.select(["id", "artist_name"]).with_columns(pl.col("id").alias("id_y")),
        left_on="artist_id",
        right_on="id",
        how="left",
    )
    return merged.rename({"id": "id_x"})

The suffix parameter automatically appends a suffix to duplicate column names from the right DataFrame.

Additional patterns:

# Inner join
df = sales.join(artists, left_on="artist_id", right_on="id", how="inner")

# Join with custom suffix for duplicate column names
df = sales.join(artists, left_on="artist_id", right_on="id", suffix="_artist")

DuckDB

SELECT s.id AS id_x, s.artist_id, a.artist_name, a.id AS id_y
FROM 'sales.parquet' s
LEFT JOIN 'artists.parquet' a ON s.artist_id = a.id

Additional patterns:

-- Inner join
SELECT s.*, a.artist_name
FROM sales s
INNER JOIN artists a ON s.artist_id = a.id

-- Using USING for same column name
SELECT * FROM sales JOIN artists USING (artist_id)


22. FillMissingTuplesProcessor

Fill missing tuple/list values in multiple columns by inserting nulls to match the row's tuple length.

Inefficient Implementation

Honeysuckle uses itertuples() row-by-row iteration with nested loops. The Polars/DuckDB equivalents are vectorized and significantly faster.

Honeysuckle fill_missing_tuples_processor.py
Polars struct.field() with fill_null()
DuckDB Struct access with coalesce()
# Honeysuckle
FillMissingTuplesProcessor(columns=["names", "roles"])

Example

Before:

| names | roles |
|---|---|
| ("Alice", "Bob") | null |
| null | ("editor", "viewer") |

After:

| names | roles |
|---|---|
| ("Alice", "Bob") | (null, null) |
| (null, null) | ("editor", "viewer") |

Polars

df = df.with_columns(
    # The expected tuple length (2 in this example) must be known up front
    pl.when(pl.col("names").is_null())
    .then(pl.lit([None, None]))
    .otherwise(pl.col("names"))
    .alias("names"),
    pl.when(pl.col("roles").is_null())
    .then(pl.lit([None, None]))
    .otherwise(pl.col("roles"))
    .alias("roles"),
)

DuckDB

SELECT
    coalesce(names, [NULL, NULL]) AS names,
    coalesce(roles, [NULL, NULL]) AS roles

23. ConditionalBoolProcessor

Evaluate a pandas query expression and set boolean result.

Honeysuckle conditional_bool_processor.py
Polars Boolean expressions
DuckDB Boolean expressions
# Honeysuckle
ConditionalBoolProcessor(expression="price > 1000 and status == 'active'", result_column="is_premium")

Example

| price | status | is_premium |
|---|---|---|
| 1500 | "active" | true |
| 500 | "active" | false |

Polars

df = df.with_columns(
    ((pl.col("price") > 1000) & (pl.col("status") == "active")).alias("is_premium")
)

Complex expressions:

df = df.with_columns(
    ((pl.col("price") > 1000) | pl.col("featured")).alias("is_highlight")
)

DuckDB

SELECT *, (price > 1000 AND status = 'active') AS is_premium

Complex expressions:

SELECT *, (price > 1000 OR featured = true) AS is_highlight FROM items


24. CapitalizeProcessor

Capitalize the first character of strings.

Honeysuckle capitalize_processor.py
Polars str.head() + str.to_uppercase(), or str.to_titlecase()
DuckDB initcap()
# Honeysuckle
CapitalizeProcessor(column_name="title")

Example

| title (before) | title (after) |
|---|---|
| "hello world" | "Hello world" |

Polars

    # Step 1: Get first char and rest
    df = df.with_columns(
        pl.col("title").str.head(1).str.to_uppercase().alias("_first"),
        pl.col("title").str.slice(1).str.to_lowercase().alias("_rest"),
    )
    # Step 2: Combine and clean up
    return df.with_columns((pl.col("_first") + pl.col("_rest")).alias("title")).drop(
        ["_first", "_rest"]
    )

Note: str.head(1) is cleaner than .str.slice(0, 1) for extracting the first character.

Title case (capitalize each word):

df = df.with_columns(
    pl.col("title").str.to_titlecase()
)

DuckDB

SELECT upper(title[1]) || lower(title[2:]) AS title

How it works:

  • title[1] — DuckDB uses 1-based indexing for strings
  • title[2:] — Slice from second character to end

Title case (capitalize each word):

SELECT initcap(title) AS title FROM items


25. AppendStringProcessor

Append or prepend a string to column values.

Honeysuckle append_string_processor.py
Polars + operator or pl.concat_str()
DuckDB || operator or concat()
# Honeysuckle
AppendStringProcessor(column="id", value="ID-", prefix=True)   # Prepend
AppendStringProcessor(column="filename", value=".jpg")          # Append

Example

| id (before) | id (after) |
|---|---|
| "123" | "ID-123" |

Polars

df = df.with_columns(
    (pl.lit("ID-") + pl.col("id")).alias("id"),
    (pl.col("filename") + pl.lit(".jpg")).alias("filename"),
)

Using concat_str:

df = df.with_columns(
    pl.concat_str([pl.lit("ID-"), pl.col("id")]).alias("id")
)

DuckDB

SELECT
    'ID-' || id AS id,
    filename || '.jpg' AS filename

Using concat:

SELECT concat('ID-', id) AS id FROM items


26. AddNumberProcessor

Add a numeric value to column values.

Honeysuckle add_number_processor.py
Polars + operator
DuckDB + operator
# Honeysuckle
AddNumberProcessor(column_name="year", new_column="next_year", value=1)

Example

| year | next_year |
|---|---|
| 2024 | 2025 |

Polars

df = df.with_columns((pl.col("year") + 1).alias("next_year"))

Add to same column:

df = df.with_columns(
    pl.col("price") + 10
)

DuckDB

SELECT year, year + 1 AS next_year

Add to computed expression:

SELECT *, price + tax AS total FROM items


27. SubtractNumberProcessor

Subtract a numeric value from column values.

Honeysuckle subtract_number_processor.py
Polars - operator
DuckDB - operator
# Honeysuckle
SubtractNumberProcessor(column_name="year", new_column="prev_year", value=1)

Example

| year | prev_year |
|---|---|
| 2024 | 2023 |

Polars

df = df.with_columns((pl.col("year") - 1).alias("prev_year"))

Calculate difference:

df = df.with_columns(
    (pl.col("sale_price") - pl.col("cost")).alias("profit")
)

DuckDB

SELECT year, year - 1 AS prev_year

Calculate difference:

SELECT *, sale_price - cost AS profit FROM items


28. SplitStringProcessor

Split string into a list based on delimiter.

Honeysuckle split_string_processor.py
Polars str.split()
DuckDB string_split()
# Honeysuckle
SplitStringProcessor(column_name="tags", delimiter=",")

Example

| tags (before) | tags (after) |
|---|---|
| "a,b,c" | ["a", "b", "c"] |

Polars

df = df.with_columns(pl.col("tags").str.split(","))

Split with default space delimiter:

df = df.with_columns(
    pl.col("full_name").str.split(" ")
)

DuckDB

SELECT tags.split(',') AS tags

Split with default space:

SELECT string_split(full_name, ' ') AS name_parts FROM items


29. AsTypeProcessor

Cast column to a different data type.

Honeysuckle as_type_processor.py
Polars cast()
DuckDB ::TYPE or CAST()
# Honeysuckle
AsTypeProcessor(column_name="year", new_type="int")
AsTypeProcessor(column_name="price", new_type="float")

Example

| year (before) | year (after) |
|---|---|
| "2024" | 2024 (int) |

Polars

df = df.with_columns(
    pl.col("year").cast(pl.Int64),
    pl.col("price").cast(pl.Float64),
)

DuckDB

SELECT year::INTEGER AS year, price::DOUBLE AS price

Alternative CAST syntax:

SELECT CAST(year AS INTEGER) AS year FROM items


30. DropRowsOnConditionProcessor

Filter/drop rows based on a condition.

Honeysuckle drop_rows_on_condition_processor.py
Polars filter()
DuckDB WHERE clause
# Honeysuckle
DropRowsOnConditionProcessor(target_field="status", conditional_operator="==", conditional_value="deleted")
DropRowsOnConditionProcessor(target_field="price", conditional_operator="<", conditional_value=0)

Example

Before:

| status | price |
|---|---|
| "active" | 100 |
| "deleted" | 50 |

After:

| status | price |
|---|---|
| "active" | 100 |

Polars

df = df.filter((pl.col("status") != "deleted") & (pl.col("price") >= 0))

DuckDB

SELECT *
FROM 'source.parquet'
WHERE status != 'deleted' AND price >= 0

31. ColumnsToDictsProcessor

Combine multiple columns into a single column containing a list of dictionaries (nested structs). Useful for creating nested JSON output. Handles both single-value and multivalue columns.

Inefficient Implementation

Honeysuckle uses itertuples() and for...range row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.

Honeysuckle columns_to_dicts_processor.py
Polars pl.struct() wrapped in pl.concat_list()
DuckDB Struct literals in list
# Honeysuckle
ColumnsToDictsProcessor(
    new_field="dimensions",
    column_names=["width", "height", "depth"],
    new_column_names=["w", "h", "d"]  # Optional rename
)

Example (single values)

| width | height | depth | dimensions |
|---|---|---|---|
| 10 | 20 | 5 | [{"w": 10, "h": 20, "d": 5}] |
| 15 | 30 | 8 | [{"w": 15, "h": 30, "d": 8}] |

Example (multivalue - matching length lists)

| name | role | people |
|---|---|---|
| ["Alice", "Bob"] | ["admin", "user"] | [{"name": "Alice", "role": "admin"}, {"name": "Bob", "role": "user"}] |

Polars

df = df.with_columns(
    pl.concat_list(
        pl.struct(
            pl.col("width").alias("w"),
            pl.col("height").alias("h"),
            pl.col("depth").alias("d"),
        )
    ).alias("dimensions")
).drop(["width", "height", "depth"])

For multivalue columns (lists of same length):

df = pl.DataFrame({
    "name": [["Alice", "Bob"]],
    "role": [["admin", "user"]]
})

# Explode, create struct, then group back
df = (
    df.with_row_index("_idx")
    .explode(["name", "role"])
    .with_columns(
        pl.struct(["name", "role"]).alias("person")
    )
    .group_by("_idx")
    .agg(pl.col("person").alias("people"))
    .drop("_idx")
)

DuckDB

SELECT [{'w': width, 'h': height, 'd': depth}] AS dimensions

Multivalue: transform parallel lists into a list of structs with list_zip() (which yields structs with fields list_1, list_2) and list_transform():

SELECT list_transform(
    list_zip(names, roles),
    x -> {'name': x.list_1, 'role': x.list_2}
) AS people
FROM items


32. AppendProcessor

Append values from multiple columns into a single list column.

Inefficient Implementation

Honeysuckle uses itertuples() and row-by-row iteration. The Polars/DuckDB equivalents are vectorized and significantly faster.

Honeysuckle AppendProcessor.py
Polars concat_list()
DuckDB list_concat()
# Honeysuckle
AppendProcessor(
    initial_column="tags",
    append_columns=["category", "source"]
)

Example

| tags (before) | category | source | tags (after) |
|---|---|---|---|
| ("a", "b") | "new" | "web" | ["a", "b", "new", "web"] |

Polars

    # Append category and source to tags list (modifies tags in-place)
    return df.with_columns(pl.concat_list(["tags", "category", "source"]).alias("tags"))

DuckDB

-- Append category and source to tags list (modifies tags in-place)
SELECT list_concat(list_concat(tags, [category]), [source]) AS tags, category, source

33. EnsureTuplesProcessor

Convert scalar values to single-element lists (tuples in Honeysuckle terminology).

Inefficient Implementation

Honeysuckle uses itertuples() and iat[] for row-by-row updates. The Polars/DuckDB equivalents are vectorized.

Honeysuckle EnsureTuplesProcessor.py
Polars concat_list()
DuckDB Array literal [value]
# Honeysuckle
EnsureTuplesProcessor(columns=["names", "roles"])

Example

Before:

| names | roles |
|---|---|
| "Alice" | "admin" |
| "Bob" | "user" |

After:

| names | roles |
|---|---|
| ["Alice"] | ["admin"] |
| ["Bob"] | ["user"] |

Polars

    # Wrap scalar values in single-element lists
    return df.with_columns(
        pl.concat_list(pl.col("names")).alias("names"),
        pl.concat_list(pl.col("roles")).alias("roles"),
    )

DuckDB

-- Wrap scalar values in single-element lists
SELECT [names] AS names, [roles] AS roles

34. ExtractCharacterDataframeProcessor

Extract a pattern from a string column using regex.

Inefficient Implementation

Honeysuckle uses .apply() with a lambda function for row-by-row regex extraction. The Polars/DuckDB equivalents are vectorized.

Honeysuckle ExtractCharacterDataframeProcessor.py
Polars str.extract()
DuckDB regexp_extract()
# Honeysuckle
ExtractCharacterDataframeProcessor(
    target_field="text",
    pattern=r"-(\d+)-",
    new_field="number"
)

Example

| text | number |
|---|---|
| "Item-123-A" | "123" |
| "Product-456-B" | "456" |

Polars

    return df.with_columns(pl.col("text").str.extract(r"-(\d+)-", group_index=1).alias("number"))

DuckDB

SELECT text, regexp_extract(text, '-(\d+)-', 1) AS number

35. SumColumnValuesProcessor

Sum column values grouped by another column and join back to the original dataframe.

Inefficient Implementation

Honeysuckle uses itertuples() and row-by-row value setting via loc[]. The Polars/DuckDB equivalents use vectorized aggregation and joins.

Honeysuckle SumColumnValuesProcessor.py
Polars group_by().agg() + join()
DuckDB Subquery with GROUP BY + JOIN
# Honeysuckle
SumColumnValuesProcessor(
    column_to_sum="amount",
    result_column="total",
    result_label_column="category",
    sum_counts=True,
    group_sum_counts_by="category"
)

Example

Before:

| category | amount |
|---|---|
| "A" | 10 |
| "B" | 20 |
| "A" | 30 |
| "B" | 40 |
| "A" | 50 |

After:

| category | amount | total |
|---|---|---|
| "A" | 10 | 90 |
| "B" | 20 | 60 |
| "A" | 30 | 90 |
| "B" | 40 | 60 |
| "A" | 50 | 90 |

Polars

    sums = df.group_by("category").agg(pl.col("amount").sum().alias("total"))
    return df.join(sums, on="category", how="left")

DuckDB

SELECT s.*, totals.total
FROM 'source.parquet' s
JOIN (SELECT category, sum(amount) AS total FROM 'source.parquet' GROUP BY category) totals
USING (category)

Quick Reference

| Processor | Polars | DuckDB |
|---|---|---|
| FillEmptyProcessor | .fill_null() | coalesce() / ifnull() |
| RemoveMultivalueNullsProcessor | .list.eval() + .list.len() | list_filter() + nullif() |
| StripStringProcessor | .str.strip_chars() | regexp_replace() |
| ExtractFirstProcessor | .list.first() | col[1] |
| ExtractOnConditionProcessor | pl.when().then().otherwise() | if() / CASE WHEN |
| ContainsBoolProcessor | .str.contains() | .contains() |
| StringConstantDataframeProcessor | pl.lit() | 'value' AS col |
| AppendOnConditionProcessor | concat_list() + .list.concat() | list_concat() + CASE |
| ImplodeProcessor | .group_by().agg() | list() + GROUP BY ALL |
| ConcatProcessor | pl.concat_str() | concat_ws() / \|\| |
| RenameProcessor | .rename() | AS new_name |
| IsEmptyProcessor | .is_null() / .is_not_null() | IS NULL / IS NOT NULL |
| ExplodeColumnsProcessor | .explode() | unnest() |
| ReplaceOnConditionProcessor | pl.when().then().otherwise() | if() / greatest() |
| LowerStringProcessor | .str.to_lowercase() | .lower() |
| KeepOnlyProcessor | .select() | Column list |
| DropNullColumnsProcessor | Filter columns | Dynamic SQL |
| DropColumnsProcessor | .drop() | EXCLUDE |
| CopyFieldDataframeProcessor | .alias() | AS new_col |
| ReplaceProcessor | when().then().otherwise() | CASE WHEN |
| MergeProcessor | .join() | JOIN |
| FillMissingTuplesProcessor | .struct.field().fill_null() | coalesce(struct.field) |
| ConditionalBoolProcessor | Boolean expression | Boolean expression |
| CapitalizeProcessor | .str.head() + .str.to_uppercase() | upper(title[1]) \|\| lower(title[2:]) |
| AppendStringProcessor | + / pl.concat_str() | \|\| / concat() |
| AddNumberProcessor | + | + |
| SubtractNumberProcessor | - | - |
| SplitStringProcessor | .str.split() | string_split() |
| AsTypeProcessor | .cast() | ::TYPE |
| DropRowsOnConditionProcessor | .filter() | WHERE |
| ColumnsToDictsProcessor | pl.concat_list(pl.struct()) | list_transform() + list_zip() |
| AppendProcessor | pl.concat_list() | list_concat() |
| EnsureTuplesProcessor | pl.concat_list() | [value] |
| ExtractCharacterDataframeProcessor | .str.extract() | regexp_extract() |
| SumColumnValuesProcessor | .group_by().agg() + .join() | GROUP BY + JOIN |