EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

EsoLang-Bench reveals a dramatic performance collapse in frontier LLMs when tested on esoteric programming languages with scarce training data. The findings suggest that high scores on mainstream benchmarks may reflect data memorization rather than genuine algorithmic reasoning.

Current benchmarks for large language model (LLM) code generation primarily evaluate mainstream languages like Python, where models benefit from massive pretraining corpora. This leads to inflated accuracy scores that may reflect data memorization rather than genuine reasoning ability. We introduce EsoLang-Bench, a benchmark of 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) for which training data is 5,000 to 100,000× scarcer than for Python.
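To give a feel for what these languages demand, here is a minimal Brainfuck interpreter sketch. It is not taken from the benchmark. The function name `run_brainfuck`, the 30,000-cell wrapping tape, 8-bit cells, reading 0 at end of input, and the step limit are all assumptions chosen for illustration. Brainfuck dialects differ on these points.

```python
def run_brainfuck(code: str, stdin: str = "", max_steps: int = 1_000_000) -> str:
    """Interpret a Brainfuck program and return the text it prints."""
    # Pre-compute matching bracket positions so loops can jump directly.
    jumps, stack = {}, []
    for pos, ch in enumerate(code):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise SyntaxError(f"unmatched ']' at {pos}")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise SyntaxError(f"unmatched '[' at {stack[-1]}")

    tape = [0] * 30_000          # assumed tape size; wraps at both ends
    ptr = pc = steps = 0
    inp = iter(stdin)
    out = []
    while pc < len(code):
        steps += 1
        if steps > max_steps:    # guard against non-terminating programs
            raise TimeoutError("step limit exceeded")
        ch = code[pc]
        if ch == ">":
            ptr = (ptr + 1) % len(tape)
        elif ch == "<":
            ptr = (ptr - 1) % len(tape)
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256   # assumed: EOF reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]       # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]       # jump back to the loop start
        pc += 1                  # any other character is a comment
    return "".join(out)


# 8 * 8 + 1 = 65, the character code of "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))
```

Even printing a single letter requires building its character code from increments and a loop, which hints at why tasks that are trivial in Python become hard here.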

We evaluate five frontier models using five prompting strategies and two agentic coding systems. The best-performing model achieves only 3.8% overall accuracy, compared to ~90% on equivalent Python tasks. All models score 0% on problems above the Easy tier, Whitespace remains completely unsolved (0% across all configurations), and self-reflection provides essentially zero benefit. These results reveal a dramatic gap between benchmark performance on mainstream languages and genuine programming ability, suggesting that current LLM code generation capabilities are far narrower than headline metrics imply.

Frontier models achieving 85 to 95% on standard benchmarks score only 0 to 11% on equivalent esoteric tasks, revealing that high scores on mainstream languages do not reflect general programming ability. All models score 0% on Medium, Hard, and Extra-Hard problems across all languages and strategies, indicating a hard ceiling on current reasoning capabilities beyond the simplest tasks.

No model produces valid Whitespace code under any configuration. Whitespace's invisible syntax (spaces, tabs, and newlines only) is a paradigm that is economically irrational to include in pre-training data, so models have had effectively no opportunity to learn it. Few-shot prompting yields no significant improvement over zero-shot (Wilcoxon p = 0.505), suggesting that in-context learning (ICL) success on standard benchmarks reflects activation of training priors rather than genuine learning from the prompt.
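The invisible syntax is easier to appreciate with a concrete instruction. The sketch below encodes the Whitespace "push a number" instruction as described in the language's public tutorial: two spaces select the push operation, then a sign character (space for positive, tab for negative), then the binary digits (space = 0, tab = 1), terminated by a line feed. The helper names `ws_push` and `visualize` are hypothetical and not part of the benchmark.

```python
SPACE, TAB, LF = " ", "\t", "\n"


def ws_push(n: int) -> str:
    """Encode the Whitespace instruction 'push n onto the stack'."""
    sign = TAB if n < 0 else SPACE
    bits = format(abs(n), "b")   # binary digits, most significant first
    digits = "".join(TAB if b == "1" else SPACE for b in bits)
    # [Space][Space] selects stack manipulation: push; LF ends the number.
    return SPACE + SPACE + sign + digits + LF


def visualize(src: str) -> str:
    """Replace each whitespace character with a visible letter (S, T, L)."""
    return src.translate(str.maketrans({SPACE: "S", TAB: "T", LF: "L"}))


print(repr(ws_push(5)))          # the raw instruction: only blanks
print(visualize(ws_push(5)))     # SSSTSTL
```

The raw instruction contains no printable characters at all, so a single misplaced space or tab silently changes the program's meaning.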

Direct interpreter feedback (1 LLM call/iteration) consistently outperforms multi-agent approaches. Adding a critic or planner introduces noise rather than useful signal when all components lack domain knowledge. Tool-augmented agents (Codex, Claude Code) achieve ~2× the accuracy of prompting-only approaches via execution feedback loops that partially compensate for the lack of training data.
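A direct interpreter feedback loop of the kind described above can be sketched as follows. This is an assumed structure, not the benchmark's actual harness: `solve_with_feedback`, its parameters, the retry prompt wording, and the toy stand-ins `toy_run` and `fake_model` are all hypothetical. The point is only that each iteration makes exactly one model call and feeds the raw interpreter result back, with no critic or planner in between.

```python
from typing import Callable, Optional, Tuple


def solve_with_feedback(
    task: str,
    expected: str,
    generate: Callable[[str], str],            # stands in for one LLM call
    run: Callable[[str], Tuple[bool, str]],    # interpreter: (ok, output or error)
    max_iters: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (passing program or None, number of model calls made)."""
    prompt = task
    for attempt in range(1, max_iters + 1):
        program = generate(prompt)             # exactly one model call per iteration
        ok, output = run(program)
        if ok and output == expected:
            return program, attempt
        verdict = "Wrong output" if ok else "Interpreter error"
        # Feed the raw interpreter result straight back into the next prompt.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{program}\n"
            f"{verdict}: {output!r}\nExpected: {expected!r}\nTry again."
        )
    return None, max_iters


def toy_run(program: str) -> Tuple[bool, str]:
    """Toy stand-in interpreter: sums '+'-separated integers."""
    try:
        return True, str(sum(int(tok) for tok in program.split("+")))
    except ValueError as exc:
        return False, str(exc)


# Canned attempts standing in for a model: an error, a wrong answer, then a fix.
attempts = iter(["2+x", "2+3", "2+2"])


def fake_model(prompt: str) -> str:
    return next(attempts)


program, calls = solve_with_feedback("Print 4", "4", fake_model, toy_run)
print(program, calls)   # 2+2 3
```

In a real setup, `generate` would wrap a model API and `run` would wrap an esoteric-language interpreter with a timeout.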

Source: Hacker News