Training mRNA Language Models Across 25 Species for $165

A new end-to-end protein AI pipeline achieves state-of-the-art codon optimization across 25 species using CodonRoBERTa-large-v2, outperforming ModernBERT at a fraction of the cost.
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. We compared several transformer architectures for codon-level language modeling; CodonRoBERTa-large-v2 performed best, with a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI), ahead of ModernBERT. We then scaled to 25 species, trained four production models in 55 GPU-hours, and built a species-conditioned system that, to our knowledge, no other open-source project offers. Complete results, architectural decisions, and runnable code follow.
Source: Hacker News












