
Data Manipulation in Clojure Compared to R and Python


A comparative look at Clojure's Tablecloth library versus R's Tidyverse and Python's Pandas/Polars for data science tasks.


Published 2024-07-18

I spend a lot of time developing and teaching people about Clojure's open source tools for working with data. Almost everybody who wants to use Clojure for this kind of work is coming from another language ecosystem, usually R or Python. Together with Daniel Slutsky, I'm working on formalizing some of the common teachings into a course. Part of that is providing context for people coming from other ecosystems, including "translations" of how to accomplish data science tasks in Clojure.

As part of this development, I wanted to share an early preview in this blog post. The format is inspired by this great blog post I read a while ago comparing R and Polars side by side (where "R" here refers to the tidyverse, an opinionated collection of R libraries for data science, and realistically mostly dplyr specifically). I'm adding Pandas because it's among the most popular dataset manipulation libraries, and of course Clojure, specifically tablecloth, the primary data manipulation library in our ecosystem.

I'll use the same dataset as the original blog post, the Palmer Penguin dataset. For the sake of simplicity, I saved a copy of the dataset as a CSV file and made it available on this website. I will also refer to the data as a "dataset" throughout this post because that's what Clojure people call a tabular, column-major data structure, but it's the same thing that is variously referred to as a dataframe, data table, or just "data" in other languages. I'm also assuming you know how to install the packages required in the given ecosystems, but any necessary imports or requirements are included in the code snippets the first time they appear. Versions of all languages and libraries used in this post are listed at the end. Here we go!

Reading data

Reading data is straightforward in every language, but as a bonus we want to be able to indicate on the fly which values should be interpreted as "missing", whatever that means in the given libraries. In this dataset, the string "NA" means "missing", so we want to tell the dataset constructor this as soon as possible. Here's the comparison of how to accomplish that in various languages:

Tablecloth

(require '[tablecloth.api :as tc])

(def ds
  (tc/dataset "https://codewithkira.com/assets/penguins.csv"))

Note that tablecloth interprets the string "NA" as missing (nil, in Clojure) by default.

R

In reality, in R you would get the dataset from the R package that contains the dataset. This is a fairly common practice in R. In order to compare apples to apples, though, here I'll show how to initialize the dataset from a remote CSV file, using the readr package's read_csv, which is part of the tidyverse:

library(tidyverse)

ds <- read_csv("https://codewithkira.com/assets/penguins.csv",
               na = "NA")

Pandas

import pandas as pd
ds = pd.read_csv("https://codewithkira.com/assets/penguins.csv")

Note that pandas has a fairly long list of values it considers NaN already, so we don't need to specify what missing values look like in our case, since "NA" is already in that list.
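To see that default at work without a network call, here is a minimal sketch that parses a tiny inline CSV. The two rows are made up for illustration and are not the real penguin data. `keep_default_na=False` is the pandas option that switches the built-in list of missing-value strings off.

```python
import io

import pandas as pd

# Two made-up rows standing in for the penguin data; the second has "NA".
csv_text = "species,body_mass_g\nAdelie,3750\nGentoo,NA\n"

# Default behaviour: "NA" is in pandas' built-in list of missing-value strings.
default_ds = pd.read_csv(io.StringIO(csv_text))

# With the built-in list disabled, "NA" survives as an ordinary string.
raw_ds = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_ds["body_mass_g"].isna().tolist())
print(raw_ds["body_mass_g"].tolist())
```

In the second call the column can no longer be parsed as numeric, so both values come back as strings.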

Polars

import polars as pl
ds = pl.read_csv("https://codewithkira.com/assets/penguins.csv",
                 null_values="NA")

Basic commands to explore the dataset

The first thing people usually want to do with their dataset is see it and poke around a bit. Below is a comparison of how to accomplish basic data exploration tasks using each library.

| Operation | tablecloth | dplyr |
|---|---|---|
| see first 10 rows | (tc/head ds 10) | head(ds, 10) |
| see all column names | (tc/column-names ds) | colnames(ds) |
| select column | (tc/select-columns ds "year") | select(ds, year) |
| select multiple columns | (tc/select-columns ds ["year" "sex"]) | select(ds, year, sex) |
| select rows | (tc/select-rows ds #(> (% "year") 2008)) | filter(ds, year > 2008) |
| sort column | (tc/order-by ds "year") | arrange(ds, year) |

| Operation | pandas | polars |
|---|---|---|
| see first 10 rows | ds.head(10) | ds.head(10) |
| see all column names | ds.columns | ds.columns |
| select column | ds[["year"]] | ds.select(pl.col("year")) |
| select multiple columns | ds[["year", "sex"]] | ds.select(pl.col("year", "sex")) |
| select rows | ds[ds["year"] > 2008] | ds.filter(pl.col("year") > 2008) |
| sort column | ds.sort_values("year") | ds.sort("year") |

Note that the libraries differ in how they sort missing values. Tablecloth and polars place them at the beginning (so they're at the top when a column is sorted in ascending order and at the bottom when descending), whereas dplyr and pandas place them last regardless of whether ascending or descending order is specified.
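The pandas half of that note can be checked with a small sketch. The three values below are made up, and only pandas is covered here. `na_position` is the `sort_values` parameter that controls where missing values go.

```python
import pandas as pd

# Made-up values; None becomes NaN in a numeric column.
ds = pd.DataFrame({"body_mass_g": [3800.0, None, 3250.0]})

ascending = ds.sort_values("body_mass_g")
descending = ds.sort_values("body_mass_g", ascending=False)

# Missing values end up last in both directions by default.
print(ascending["body_mass_g"].isna().tolist())
print(descending["body_mass_g"].isna().tolist())

# na_position="first" moves them to the top instead.
nulls_first = ds.sort_values("body_mass_g", na_position="first")
print(nulls_first["body_mass_g"].isna().tolist())
```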

As you can see, these commands are all pretty similar, with the exception of selecting rows in tablecloth. The #(...) form is shorthand syntax for writing an anonymous function in Clojure, which is how rows are selected. Being a functional language, functions in Clojure are "first-class", which basically just means they are passed around as arguments willy-nilly, all over the place, all the time. In this case, the second argument to tablecloth's select-rows function (after the dataset itself) is a predicate (a function that returns a boolean) that takes as its argument a dataset row as a map of column names to values. Don't worry, though, tablecloth doesn't process your entire dataset row-wise. Under the hood datasets are highly optimized to perform column-wise operations as fast as possible.
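For readers more at home in Python, the idea of a predicate that receives each row as a map can be sketched with plain dictionaries. This is only an analogy: `select_rows` here is a hypothetical helper, the rows are made up, and nothing about it reflects how tablecloth evaluates the predicate internally.

```python
# Made-up rows, each a mapping from column name to value.
rows = [
    {"species": "Adelie", "year": 2007},
    {"species": "Gentoo", "year": 2009},
    {"species": "Chinstrap", "year": 2008},
]


def select_rows(rows, predicate):
    """Keep the rows for which predicate(row) is truthy (hypothetical helper)."""
    return [row for row in rows if predicate(row)]


# The lambda plays the role of #(> (% "year") 2008).
after_2008 = select_rows(rows, lambda row: row["year"] > 2008)
print(after_2008)
```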

Here's an example of what it looks like to string a couple of these basic dataset exploration operations together, in this case to get the bill_length_mm of all penguins with body_mass_g below 3800:

Tablecloth

(-> ds
    (tc/select-rows #(and (% "body_mass_g")
                          (< (% "body_mass_g") 3800)))
    (tc/select-columns "bill_length_mm"))

Note that in tablecloth we have to explicitly omit rows where the value we're filtering by is missing, unlike in the other libraries. This is because tablecloth actually uses nil (as opposed to a library-specific construct) to indicate a missing value, and in Clojure nil is not treated as comparable to numbers. If we were to try to compare nil to a number, we would get an exception telling us that we're trying to compare incomparable types. Clojure is fundamentally dynamically typed in that it only does type checking at runtime and bindings can refer to values of any type, but it is also strongly typed, as we see here, in the sense that it explicitly avoids implicit type coercion. For example, deciding whether 0 is greater or less than nil requires some assumptions, and these are intentionally not baked into the core of Clojure or into tablecloth, as they are in some other languages and libraries.
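Python happens to draw the same line for its own None, so the distinction can be sketched there with nothing but the standard library. The values are made up. The last lines show the contrasting convention of a floating-point NaN, which pandas uses for missing numeric values: it answers False to every comparison, so such rows drop out of a filter without any explicit guard.

```python
body_masses = [3750, None, 3250, 4675]  # made-up values; None marks "missing"

# Comparing None with a number raises instead of guessing an ordering.
try:
    None < 3800
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Guard first, much as the tablecloth predicate does with (and (% "body_mass_g") ...).
light = [m for m in body_masses if m is not None and m < 3800]

# A float NaN answers False to every comparison, so it silently drops out.
nan_is_light = float("nan") < 3800

print(comparison_failed, light, nan_is_light)
```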

This example also introduces Clojure's "thread-first" macro. The -> arrow is like R's |> operator or the Unix pipe, effectively passing the output of each function in the chain as the first argument to the next. It comes in very handy for data processing code like this.
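A rough Python approximation of the same idea, assuming a hypothetical `thread_first` helper built on `functools.reduce`, looks like this. Clojure's -> is a macro that rewrites the code before it runs, whereas this sketch simply calls one-argument functions in sequence at runtime, so it captures the data flow but not the mechanism. The rows are made up.

```python
from functools import reduce


def thread_first(value, *steps):
    """Feed value through each step in order (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, value)


rows = [
    {"bill_length_mm": 39.1, "body_mass_g": 3750},
    {"bill_length_mm": 46.1, "body_mass_g": 5000},
]

result = thread_first(
    rows,
    # Step 1: keep the lighter penguins.
    lambda rs: [r for r in rs if r["body_mass_g"] < 3800],
    # Step 2: keep only the bill length of each remaining row.
    lambda rs: [r["bill_length_mm"] for r in rs],
)
print(result)
```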

Here is the equivalent operation in the other libraries:

dplyr

ds |>
  filter(body_mass_g < 3800) |>
  select(bill_length_mm)

Pandas

ds[ds["body_mass_g"] < 3800]["bill_length_mm"]

Polars

ds.filter(pl.col("body_mass_g") < 3800).select(pl.col("bill_length_mm"))

More advanced filtering and selecting

Here is what some more complicated data wrangling looks like across the libraries.

Select all columns except for one

| Library | Code |
|---|---|
| tablecloth | (tc/select-columns ds (complement #{"year"})) |
| dplyr | select(ds, -year) |
| pandas | ds.drop(columns=["year"]) |
| polars | ds.select(pl.exclude("year")) |

Another property of functional languages in general, and especially Clojure, is that they really take advantage of the fact that a lot of operations are just functions.
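The tablecloth selector (complement #{"year"}) is a case in point: a Clojure set can be called as a membership test, and complement wraps any predicate in its negation. A loose Python sketch of that composition follows. The `complement` helper is written here for illustration, and the column list is only a subset of the real one.

```python
def complement(predicate):
    """Return a function that negates predicate, in the spirit of Clojure's complement."""
    return lambda *args: not predicate(*args)


column_names = ["species", "island", "year", "sex"]  # a subset, for illustration

excluded = {"year"}
# A Clojure set is callable as a membership test; in Python we use __contains__.
keep = complement(excluded.__contains__)

selected = [name for name in column_names if keep(name)]
print(selected)
```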


Source: Hacker News
