Vocabulary shapes cross-lingual variation of word-order learnability in language models

A new study reports that the structure of a language's word and subword vocabulary, rather than a coarse distinction between free and fixed word order, strongly predicts how easily transformer language models learn its word order.

Computer Science > Computation and Language

Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
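To make the two ingredients of the abstract concrete, here is a minimal Python sketch of how a synthetic word-order variant and a surprisal score might be computed. It is not the authors' procedure: the abstract does not say how the variants are constructed, so the `irregularity` parameter (the probability that a sentence is shuffled) and the function names `reverse_sentence`, `perturb_order`, and `mean_surprisal` are assumptions made purely for illustration. Surprisal is taken in its standard sense, the negative log probability a model assigns to a token, averaged over tokens and measured in bits.

```python
import math
import random
from typing import List, Sequence


def reverse_sentence(tokens: Sequence[str]) -> List[str]:
    """Sentence reversal: same words, mirrored order."""
    return list(reversed(tokens))


def perturb_order(
    tokens: Sequence[str], irregularity: float, rng: random.Random
) -> List[str]:
    """Hypothetical irregularity knob (an assumption, not from the paper).

    With probability `irregularity` the sentence is shuffled uniformly at
    random; otherwise it is returned unchanged. Applied over a corpus,
    0.0 keeps the original order everywhere and 1.0 shuffles every sentence.
    """
    out = list(tokens)
    if rng.random() < irregularity:
        rng.shuffle(out)
    return out


def mean_surprisal(token_probs: Sequence[float]) -> float:
    """Average surprisal in bits per token: the mean of -log2 p(token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    sentence = ["the", "dog", "chased", "the", "cat"]
    print(reverse_sentence(sentence))
    for level in (0.0, 0.5, 1.0):
        print(level, perturb_order(sentence, level, rng))
    # The probabilities below are made-up placeholders, not model output.
    print(mean_surprisal([0.5, 0.25]))
```

In an actual experiment the probabilities passed to `mean_surprisal` would come from a language model trained on the corresponding variant corpus; higher values mean the model finds that variant harder to predict.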

Source: arXiv cs.AI Recent
