
WALS Roberta Sets 1-36.zip May 2026

If each set is treated as a hypothetical temporal slice, a recurrent variant of RoBERTa could be trained to simulate how word order or phoneme inventories shift over time.

This dataset is derived from WALS (World Atlas of Language Structures), a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials by a team of 55 authors.

The .zip archive contains structured data files partitioned into 36 sets. Specific naming conventions may vary.
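Because the naming conventions are not fixed, it can help to inspect an archive before assuming any layout. The following is a minimal sketch using only Python's standard `zipfile` module. The function name `group_members_by_top_level` and the `set_01/`, `set_02/` folder names in the demo are hypothetical, chosen for illustration; the real archive may name and nest its files differently.

```python
import io
import zipfile
from collections import defaultdict


def group_members_by_top_level(archive):
    """Group the file names in a zip archive by their first path component.

    `archive` may be a filesystem path or a binary file-like object.
    Files stored at the archive root are grouped under the empty string.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            head, sep, _ = info.filename.partition("/")
            groups[head if sep else ""].append(info.filename)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical layout built in memory, for illustration only.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("set_01/train.jsonl", "{}\n")
        zf.writestr("set_01/dev.jsonl", "{}\n")
        zf.writestr("set_02/train.jsonl", "{}\n")
    buf.seek(0)
    for name, members in sorted(group_members_by_top_level(buf).items()):
        print(name, len(members))
```

To inspect a downloaded copy, pass its path to the function instead of the in-memory buffer.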

This dataset is intended for researchers and practitioners in Natural Language Processing (NLP) and Computational Linguistics.

WALS Roberta Sets 1-36.zip is a specialized dataset bundle derived from the World Atlas of Language Structures (WALS). It is pre-processed and formatted specifically for fine-tuning and evaluating RoBERTa-based language models on linguistic typology tasks. The archive contains 36 distinct data splits (or feature sets), allowing for granular analysis of syntactic, morphological, and phonological features across the world's languages.
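As a rough illustration of what preparing such data for a classification-style typology task might involve, the sketch below maps feature values to integer labels. The row format, the `Example` record, and both helper functions are assumptions made for this example and do not describe the actual file format in the archive. The word-order values reflect the well-known basic orders of the three languages shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One training example: input text plus an integer class label."""
    text: str
    label: int


def build_label_map(values):
    """Map each distinct feature value to a stable integer id (sorted order)."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def encode_rows(rows, label_map):
    """Turn (text, feature_value) rows into Example records."""
    return [Example(text=text, label=label_map[value]) for text, value in rows]


# Hypothetical rows: (input text, value of a basic word order feature).
rows = [
    ("language: Japanese", "SOV"),
    ("language: English", "SVO"),
    ("language: Turkish", "SOV"),
]

label_map = build_label_map(value for _, value in rows)
examples = encode_rows(rows, label_map)

print(label_map)
print([example.label for example in examples])
```

The number of entries in `label_map` is the number of classes, which is the kind of value a sequence-classification head typically needs (for instance the `num_labels` setting in Hugging Face Transformers).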


Note: Please ensure you cite the original WALS database authors if you use this dataset in your research.
