What Category Theory Teaches Us About DataFrames

The article explores how dataframe algebra and category theory can reduce hundreds of complex pandas methods into a few fundamental operators, helping developers understand the true structure of data manipulation.

Every dataframe library ships with hundreds of operations. pandas alone has over 200 methods on a DataFrame. Is pivot different from melt? Is apply different from map? What about transform, agg, applymap, pipe? Some of these seem like the same operation wearing different hats. Others seem genuinely distinct. Without a framework for telling them apart, you end up memorizing APIs instead of understanding structure.

I ran into this question while building my own dataframe library. I needed to decide which operations were truly fundamental and which were just surface-level variations. That search led me to Petersohn et al.’s Towards Scalable Dataframe Systems. They’d built Modin, a drop-in replacement for pandas, and needed to understand the actual structure underneath the API. They analyzed 1 million Jupyter notebooks, cataloged how people use pandas, and proposed a dataframe algebra: a formal set of about 15 operators that can express most of what the 200+ pandas methods do.

That algebra was a huge compression, but I kept wondering whether there was a level below it: a smaller set of truly primitive operations that the 15 are built from. If those exist, they would be a real foundation: operations small enough to be obviously correct, yet expressive enough to build everything else.

Petersohn’s dataframe algebra

It’s worth spending some time on what Petersohn et al. actually did, because it frames everything that follows.

They started by defining what a dataframe is. Surprisingly, nobody had done this formally before. Their Definition 4.1 says a dataframe is a tuple (A, R, C, D): an array of data A, row labels R, column labels C, and a vector of column domains D. This is more precise than “a table” because it captures things that make dataframes different from relational tables. Rows and columns are both ordered, both labeled, and treated symmetrically. You can transpose a dataframe. You can promote data values into column labels. These aren’t things you can do with a SQL table.
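A minimal sketch of how the four components show up in pandas. The variable names `A`, `R`, `C`, and `D` mirror the definition; using pandas dtypes as a stand-in for the column domains is an assumption of this example, not something the paper prescribes.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Grace"], "salary": [120, 135]},
    index=["r0", "r1"],
)

A = df.to_numpy()                 # the array of data
R = list(df.index)                # row labels
C = list(df.columns)              # column labels
D = [str(t) for t in df.dtypes]   # column domains (pandas dtypes as a stand-in)

# Rows and columns are symmetric: transposing swaps the two label sets.
transposed = df.T
print(list(transposed.index), list(transposed.columns))
```

After the transpose, the old column labels are the row labels and vice versa, which is the symmetry a relational table does not offer.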

Then they identified the operators. Here’s their Table 1, condensed:

| Operator | Origin | What it does |
|---|---|---|
| SELECTION | Relational | Eliminate rows |
| PROJECTION | Relational | Eliminate columns |
| UNION | Relational | Combine two dataframes vertically |
| DIFFERENCE | Relational | Rows in one but not the other |
| CROSS PRODUCT / JOIN | Relational | Combine two dataframes by key |
| DROP DUPLICATES | Relational | Remove duplicate rows |
| GROUPBY | Relational | Group rows by column values |
| SORT | Relational | Reorder rows |
| RENAME | Relational | Rename columns |
| WINDOW | SQL | Sliding-window functions |
| TRANSPOSE | Dataframe | Swap rows and columns |
| MAP | Dataframe | Apply a function to every row |
| TOLABELS | Dataframe | Promote data to column/row labels |
| FROMLABELS | Dataframe | Demote labels back to data |

The “Origin” column matters. The first nine operators come from relational algebra and have direct analogs in SQL. WINDOW comes from SQL extensions. The last four (TRANSPOSE, MAP, TOLABELS, FROMLABELS) are unique to dataframes. They exist because dataframes treat rows and columns symmetrically and allow data to move between values and metadata. Relational databases can’t do that.
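To make the movement between values and metadata concrete, here is a small pandas sketch. Treating `set_index` and `reset_index` as stand-ins for TOLABELS and FROMLABELS is an approximation for illustration; the toy data is made up.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp": [4, 19]})

# TOLABELS-like: promote the values in "city" to row labels.
labeled = df.set_index("city")

# FROMLABELS-like: demote the row labels back into an ordinary column.
restored = labeled.reset_index()

# TRANSPOSE: what used to be data ("Oslo", "Lima") is now column metadata.
flipped = labeled.T
print(list(flipped.columns))
```

Two steps take a data value such as "Lima" from a cell to a column label, which is the kind of move a relational schema keeps fixed.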

Petersohn showed that over 85% of the pandas API can be rewritten as compositions of these operators. Operations like fillna, isnull, str.upper, and cummax are all special cases of MAP. Operations like sort_values, set_index, reset_index, merge, groupby, and pivot all map one-to-one onto operators in the algebra. That’s a huge compression: 200+ ad hoc methods become 15 composable primitives.
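As a rough illustration of the "special case of MAP" claim, the sketch below defines a hypothetical `map_rows` helper (not a pandas or Modin API) and uses it to reproduce `fillna` and `str.upper` on toy data. It is far slower than the built-ins; the point is only that the same row-wise operator can express both.

```python
import pandas as pd

def map_rows(frame, f):
    """Hypothetical MAP: apply f to each row and reassemble the results."""
    return frame.apply(f, axis=1)

scores = pd.DataFrame({"a": [1.0, None], "b": [None, 2.0]})
filled = map_rows(scores, lambda row: row.fillna(0.0))

people = pd.DataFrame({"name": ["ada", "grace"]})
shouted = map_rows(people, lambda row: row.str.upper())

# Same values as the dedicated pandas methods.
print(filled.values.tolist() == scores.fillna(0.0).values.tolist())
print(shouted["name"].tolist() == people["name"].str.upper().tolist())
```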

But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?

Shapes of schema change

I kept staring at Petersohn’s table, and a pattern emerged. Some operators change the schema, meaning which columns exist and what types they have. Others leave the schema alone and only affect the rows. And if you focus on the schema-changing ones, they fall into three groups.

Restructuring. You rearrange, subset, or relabel columns. The data stays the same; only the shape changes. In SQL terms: SELECT name, salary FROM employees produces a two-column result from a three-column table. This covers Petersohn’s PROJECTION and RENAME. The output schema is a function of the input schema. You can compute it without looking at any data.
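The claim that the output schema depends only on the input schema can be sketched directly. The helpers `project_schema` and `rename_schema` and the type names `"text"` and `"number"` are hypothetical, introduced only for this example.

```python
import pandas as pd

def project_schema(schema, keep):
    """Schema after PROJECTION: keep only the listed columns, in that order."""
    return {col: schema[col] for col in keep}

def rename_schema(schema, mapping):
    """Schema after RENAME: relabel columns, leaving their types alone."""
    return {mapping.get(col, col): dtype for col, dtype in schema.items()}

# The prediction uses the schema alone; no rows are involved.
schema = {"name": "text", "department": "text", "salary": "number"}
predicted = rename_schema(
    project_schema(schema, ["name", "salary"]), {"salary": "pay"}
)

# Now run the same restructuring on actual data and compare.
employees = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "department": ["eng", "eng"],
    "salary": [120.0, 135.0],
})
result = employees[["name", "salary"]].rename(columns={"salary": "pay"})
print(list(predicted) == list(result.columns))
```

The row count is untouched and the predicted column list matches what pandas produces, even though the prediction never saw a row.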

Merging. You collapse rows that share a key, either into a summary or a collection. In SQL: SELECT department, AVG(salary) FROM employees GROUP BY department. Multiple rows map to the same key and get combined. This covers Petersohn’s GROUPBY and UNION.
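A small pandas sketch of both flavors of merging, a summary and a collection, on made-up data:

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "department": ["eng", "eng", "ops"],
    "salary": [100.0, 140.0, 90.0],
})

# Summary: rows sharing a department collapse into one averaged row.
summary = employees.groupby("department", as_index=False)["salary"].mean()

# Collection: the same rows collapse into a list instead of a number.
collected = employees.groupby("department")["name"].agg(list)

print(summary)
print(collected)
```

Three input rows become two output rows either way; only the rule for combining the collapsed rows differs.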

Pairing. You find rows in two tables that agree on a shared key and stitch them into a wider row. In SQL: SELECT * FROM employees INNER JOIN departments USING (department). Shared keys appear once; unique columns from each side are concatenated. This covers Petersohn’s CROSS PRODUCT / JOIN.
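The same inner join in pandas, as a sketch on made-up data (the `floor` column is invented for illustration):

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "department": ["eng", "eng", "ops"],
})
departments = pd.DataFrame({
    "department": ["eng", "ops", "hr"],
    "floor": [3, 1, 2],
})

# Rows that agree on "department" are stitched into one wider row.
paired = employees.merge(departments, on="department", how="inner")
print(paired)
```

The shared key appears once in the output, `name` and `floor` sit side by side, and the `hr` row disappears because nothing on the employee side agrees with it.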

What doesn’t fit. Two relational operators resist this grouping. DIFFERENCE and DROP DUPLICATES both change which rows are present, but they don’t restructure columns, collapse by key, or pair across tables. They feel set-theoretic: one computes a complement, the other computes an image.
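For comparison, here are the two outliers in pandas. pandas has no single method named for DIFFERENCE, so the anti-join via `merge(..., indicator=True)` below is one common idiom chosen for this sketch, not the paper's definition.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 2, 3]})
b = pd.DataFrame({"x": [2]})

# DROP DUPLICATES: each distinct row survives exactly once.
deduped = a.drop_duplicates()

# DIFFERENCE: keep the rows of `a` that have no match in `b`.
marked = a.merge(b.drop_duplicates(), how="left", indicator=True)
difference = marked[marked["_merge"] == "left_only"].drop(columns="_merge")

print(deduped["x"].tolist())
print(difference["x"].tolist())
```

In both cases the columns are exactly what they were before; only the set of rows changes.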

So five of Petersohn’s relational operators (PROJECTION, RENAME, GROUPBY, UNION, JOIN) map cleanly onto three patterns: restructuring, merging, pairing. The schema-preserving operators (SELECTION, SORT, WINDOW) are orthogonal. And the dataframe-specific operators (TRANSPOSE, MAP, TOLABELS, FROMLABELS) live outside the relational model entirely.

I had three patterns and two outliers. What I didn’t have was a reason why it should be these three patterns. Are restructuring, merging, and pairing truly fundamental, or did I just happen to group things this way? Is there a theory that predicts all of this?

That’s the question my mentor Sam Stites pointed me toward when he suggested I read Fong and Spivak’s Seven Sketches in Compositionality. The answer turns out to come from category theory.

Source: Hacker News