
What Category Theory Teaches Us About DataFrames


The article explores how dataframe algebra and category theory can reduce hundreds of complex pandas methods into a few fundamental operators, helping developers understand the true structure of data manipulation.


Every dataframe library ships with hundreds of operations. pandas alone has over 200 methods on a DataFrame. Is pivot different from melt? Is apply different from map? What about transform, agg, applymap, pipe? Some of these seem like the same operation wearing different hats. Others seem genuinely distinct. Without a framework for telling them apart, you end up memorizing APIs instead of understanding structure.

I ran into this question while building my own dataframe library. I needed to decide which operations were truly fundamental and which were just surface-level variations. That search led me to Petersohn et al.’s Towards Scalable Dataframe Systems. They’d built Modin, a drop-in replacement for pandas, and needed to understand the actual structure underneath the API. They analyzed 1 million Jupyter notebooks, cataloged how people use pandas, and proposed a dataframe algebra: a formal set of about 15 operators that can express what all 200+ pandas operations do.

That algebra was a huge compression, but I kept wondering whether there was a level below it: a smaller set of truly primitive operations that the 15 are built from. If those exist, they would be a real foundation: operations small enough to be obviously correct, yet expressive enough to build everything else.

Petersohn’s dataframe algebra

It’s worth spending some time on what Petersohn et al. actually did, because it frames everything that follows.

They started by defining what a dataframe is. Surprisingly, nobody had done this formally before. Their Definition 4.1 says a dataframe is a tuple (A, R, C, D): an array of data A, row labels R, column labels C, and a vector of column domains D. This is more precise than “a table” because it captures things that make dataframes different from relational tables. Rows and columns are both ordered, both labeled, and treated symmetrically. You can transpose a dataframe. You can promote data values into column labels. These aren’t things you can do with a SQL table.
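To make the (A, R, C, D) tuple concrete, here is a minimal sketch in plain Python. The `Frame` class, its field layout, and the choice to reset domains to `object` on transpose are my own illustrative assumptions, not the paper's formalism or any library's API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Frame:
    """Toy model of the (A, R, C, D) tuple; names are illustrative only."""
    A: list[list[Any]]  # array of data, one inner list per row
    R: list[Any]        # ordered row labels
    C: list[Any]        # ordered column labels
    D: list[type]       # one domain per column (here: a Python type)

    def transpose(self) -> "Frame":
        # Rows and columns swap roles, so their labels swap as well.
        # Assumption: each new column gets the generic domain `object`,
        # since a former row may mix values from several column domains.
        flipped = [list(col) for col in zip(*self.A)]
        return Frame(A=flipped, R=self.C, C=self.R, D=[object] * len(self.R))


df = Frame(
    A=[["Ada", 120], ["Grace", 135]],
    R=[0, 1],
    C=["name", "salary"],
    D=[str, int],
)
t = df.transpose()
# t.R is now ["name", "salary"] and t.C is [0, 1]
```

Because row and column labels are stored the same way, transposing is just a swap of `R` and `C` plus a flip of the array. A relational table has no row labels to swap with, which is one way to see why this operation has no SQL counterpart.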

Then they identified the operators. Here’s their Table 1, condensed:

| Operator | Origin | What it does |
|---|---|---|
| SELECTION | Relational | Eliminate rows |
| PROJECTION | Relational | Eliminate columns |
| UNION | Relational | Combine two dataframes vertically |
| DIFFERENCE | Relational | Rows in one but not the other |
| CROSS PRODUCT / JOIN | Relational | Combine two dataframes by key |
| DROP DUPLICATES | Relational | Remove duplicate rows |
| GROUPBY | Relational | Group rows by column values |
| SORT | Relational | Reorder rows |
| RENAME | Relational | Rename columns |
| WINDOW | SQL | Sliding-window functions |
| TRANSPOSE | Dataframe | Swap rows and columns |
| MAP | Dataframe | Apply a function to every row |
| TOLABELS | Dataframe | Promote data to column/row labels |
| FROMLABELS | Dataframe | Demote labels back to data |

The “Origin” column matters. The first nine operators come from relational algebra and have direct analogs in SQL. WINDOW comes from SQL extensions. The last four (TRANSPOSE, MAP, TOLABELS, FROMLABELS) are unique to dataframes. They exist because dataframes treat rows and columns symmetrically and allow data to move between values and metadata. Relational databases can’t do that.

Petersohn et al. showed that over 85% of the pandas API can be rewritten as compositions of these operators. Operations like fillna, isnull, str.upper, and cummax are all special cases of MAP. Operations like sort_values, set_index, reset_index, merge, groupby, and pivot all map one-to-one onto operators in the algebra. That’s a huge compression: 200+ ad hoc methods become about 15 composable primitives.
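As a rough illustration of the "special case of MAP" idea, the sketch below defines one generic row-wise operator and derives three pandas-like helpers from it. It models a dataframe as a list of dicts; the function names mirror pandas methods but are hypothetical stand-ins written for this example, not pandas itself.

```python
from typing import Any, Callable

Row = dict[str, Any]


def map_rows(rows: list[Row], fn: Callable[[Row], Row]) -> list[Row]:
    """MAP as described in the table above: apply one function to every row."""
    return [fn(row) for row in rows]


def fillna(rows: list[Row], value: Any) -> list[Row]:
    # Replace missing values (modelled here as None) in every cell.
    return map_rows(rows, lambda r: {k: value if v is None else v for k, v in r.items()})


def isnull(rows: list[Row]) -> list[Row]:
    # Turn every cell into a boolean "is missing" flag.
    return map_rows(rows, lambda r: {k: v is None for k, v in r.items()})


def str_upper(rows: list[Row], column: str) -> list[Row]:
    # Upper-case one string column, leaving the others untouched.
    return map_rows(rows, lambda r: {**r, column: r[column].upper()})


rows = [{"name": "ada", "salary": None}, {"name": "grace", "salary": 135}]
filled = fillna(rows, 0)
flags = isnull(rows)
shouted = str_upper(rows, "name")
```

Each helper differs only in the function it hands to `map_rows`, which is the sense in which the three collapse into a single operator.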

But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?

Shapes of schema change

I kept staring at Petersohn’s table, and a pattern emerged. Some operators change the schema, meaning which columns exist and what types they have. Others leave the schema alone and only affect the rows. And if you focus on the schema-changing ones, they fall into three groups.

Restructuring. You rearrange, subset, or relabel columns. The data stays the same; only the shape changes. In SQL terms: SELECT name, salary FROM employees produces a two-column result from a three-column table. This covers Petersohn’s PROJECTION and RENAME. The output schema is a function of the input schema. You can compute it without looking at any data.
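The claim that the output schema depends only on the input schema can be sketched directly: the functions below take a schema and never see a row. The `Schema` alias, the function names, and the three `employees` columns are assumptions chosen to match the SQL example.

```python
Schema = dict[str, type]  # ordered mapping: column name -> column type


def project(schema: Schema, keep: list[str]) -> Schema:
    """PROJECTION on schemas: keep the listed columns, in the listed order."""
    return {name: schema[name] for name in keep}


def rename(schema: Schema, mapping: dict[str, str]) -> Schema:
    """RENAME on schemas: relabel columns, leaving order and types alone."""
    return {mapping.get(name, name): typ for name, typ in schema.items()}


# Assumed three-column table behind `SELECT name, salary FROM employees`.
employees: Schema = {"name": str, "department": str, "salary": float}

selected = project(employees, ["name", "salary"])
relabeled = rename(employees, {"salary": "pay"})
```

Neither function touches data, so a query planner could run them before reading a single row.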

Merging. You collapse rows that share a key, either into a summary or a collection. In SQL: SELECT department, AVG(salary) FROM employees GROUP BY department. Multiple rows map to the same key and get combined. This covers Petersohn’s GROUPBY and UNION.
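A minimal sketch of merging, assuming rows are dicts and the summary is an average. The helper name `group_avg` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_avg(rows: list[dict], key: str, value: str) -> list[dict]:
    """Collapse rows that share `key` into one row holding the mean of `value`."""
    groups: dict = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])  # many rows land on the same key
    return [{key: k, f"avg_{value}": mean(vs)} for k, vs in groups.items()]


employees = [
    {"name": "Ada", "department": "eng", "salary": 100},
    {"name": "Grace", "department": "eng", "salary": 140},
    {"name": "Linus", "department": "ops", "salary": 90},
]
summary = group_avg(employees, "department", "salary")
```

Swapping `mean(vs)` for `vs` itself would give the "collection" variant mentioned above: the rows still merge by key, but the values are kept rather than summarized.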

Pairing. You find rows in two tables that agree on a shared key and stitch them into a wider row. In SQL: SELECT * FROM employees INNER JOIN departments USING (department). Shared keys appear once; unique columns from each side are concatenated. This covers Petersohn’s CROSS PRODUCT / JOIN.
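Pairing can be sketched as a small hash join. This is an illustrative version that assumes a single shared key column and no other overlapping column names; `inner_join` and the sample tables are hypothetical.

```python
def inner_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Stitch together rows from both sides that agree on `key`."""
    # Index the right side by key so each left row finds its partners quickly.
    index: dict = {}
    for r in right:
        index.setdefault(r[key], []).append(r)

    out = []
    for l in left:
        for r in index.get(l[key], []):
            # The shared key appears once; the remaining columns are concatenated.
            extra = {k: v for k, v in r.items() if k != key}
            out.append({**l, **extra})
    return out


employees = [
    {"name": "Ada", "department": "eng"},
    {"name": "Grace", "department": "ops"},
    {"name": "Linus", "department": "lab"},  # no matching department row
]
departments = [
    {"department": "eng", "floor": 3},
    {"department": "ops", "floor": 1},
]
joined = inner_join(employees, departments, "department")
```

The unmatched `lab` row drops out, which is the inner-join behavior of the SQL example.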

What doesn’t fit. Two relational operators resist this grouping. DIFFERENCE and DROP DUPLICATES both change which rows are present, but they don’t restructure columns, collapse by key, or pair across tables. They feel set-theoretic: one computes a complement, the other computes an image.
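The set-theoretic flavor of these two operators shows up in a sketch that models rows as hashable tuples. Both helpers are illustrative assumptions; keeping first-seen order is my choice, since the text only says that dataframes are ordered.

```python
def difference(left: list[tuple], right: list[tuple]) -> list[tuple]:
    """Rows of `left` that do not occur in `right` (a complement)."""
    exclude = set(right)
    return [row for row in left if row not in exclude]


def drop_duplicates(rows: list[tuple]) -> list[tuple]:
    """Each distinct row once, in first-seen order (the image of the row list)."""
    return list(dict.fromkeys(rows))


a = [("Ada", "eng"), ("Grace", "ops"), ("Ada", "eng")]
b = [("Grace", "ops")]

only_in_a = difference(a, b)
distinct = drop_duplicates(a)
```

Neither function inspects column names or keys: they only ask whether a whole row is present, which is why they sit outside the three schema-changing patterns.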

So five of Petersohn’s relational operators (PROJECTION, RENAME, GROUPBY, UNION, JOIN) map cleanly onto three patterns: restructuring, merging, pairing. The schema-preserving operators (SELECTION, SORT, WINDOW) are orthogonal. And the dataframe-specific operators (TRANSPOSE, MAP, TOLABELS, FROMLABELS) live outside the relational model entirely.

I had three patterns and two outliers. What I didn’t have was a reason why it should be these three patterns. Are restructuring, merging, and pairing truly fundamental, or did I just happen to group things this way? Is there a theory that predicts all of this?

That’s the question my mentor Sam Stites pointed me toward when he suggested I read Fong and Spivak’s Seven Sketches in Compositionality. The answer turns out to come from category theory.


Source: Hacker News
