How to Handle Multilingual Data in LLM Pretraining Pipelines
The core problem in multilingual pretraining is no longer about whether we should mix languages, but how to optimize the allocation of tokens. If you give a model enough representation for a specific language, adding other languages doesn't degrade performance; in some cases, it actually helps. The challenge shifts from a binary choice to a data engineering problem: how do we find the highest quality tokens across hundreds of languages without drowning the model in noise?
The Myth of the Curse of Multilinguality
For years, the industry assumed that model capacity was a fixed pie. If a model had 3 billion parameters, adding a 10th or 20th language would theoretically "steal" capacity from the first few. However, research on models in the 1B to 3B parameter range has flipped this narrative. When investigators trained models on corpora spanning from 25 to over 400 languages, they found that combining English with diverse multilingual data doesn't necessarily hurt performance in any of them. As long as each language has a sufficient number of tokens, the model can handle the variety.
This means we can stop worrying about a strict trade-off and start focusing on token allocation: the strategic distribution of training tokens across different languages to ensure the model achieves balanced proficiency. Instead of just dumping raw web crawls into a pipeline, the goal is to identify the "sweet spot" of data volume for each language group to prevent the model from ignoring low-resource languages or over-fitting on high-resource ones.
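To make this concrete, here is a minimal sketch of one common weighting scheme, temperature-based sampling (the approach popularized by multilingual models such as XLM-R). The token counts, language set, and budget below are hypothetical; the exponent `alpha` controls how aggressively low-resource languages are upsampled relative to their raw share.

```python
# Sketch: temperature-based token allocation across languages.
# Raw token counts and the total budget are hypothetical; alpha < 1
# upsamples low-resource languages relative to their raw data share.

raw_tokens = {
    "en": 1_500_000_000,
    "de": 300_000_000,
    "ar": 80_000_000,
    "da": 20_000_000,
}

def allocation_weights(token_counts, alpha=0.3):
    """Exponentially smooth raw data shares so small languages are not drowned out."""
    total = sum(token_counts.values())
    smoothed = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(smoothed.values())
    return {lang: weight / norm for lang, weight in smoothed.items()}

budget = 10_000_000_000  # total training-token budget (hypothetical)
for lang, weight in allocation_weights(raw_tokens).items():
    print(f"{lang}: {weight:.2%} of budget -> {int(weight * budget):,} tokens")
```

With `alpha` closer to 0 the allocation approaches an even split; with `alpha` at 1 it simply mirrors the raw data distribution, so the value you pick is itself a curation decision.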
Using English as a Strategic Pivot
One of the most interesting findings in modern pretraining is the role of the pivot language. A pivot language is a high-resource language that helps the model learn general patterns of human communication, which it then transfers to other languages. While it seems logical to use a language from the same family (like using Spanish to help with Italian), the data suggests that English is the most effective pivot language due to its massive data availability and structural properties that catalyze cross-lingual generalization.
This happens because English provides a dense map of logical connections and knowledge. When a model learns a concept in English, it creates a semantic representation that is easier to map onto other languages, regardless of whether those languages are linguistically related. It's less about shared vocabulary and more about the model learning "how to think" in a way that generalizes across the board.
Moving From Rule-Based to Model-Based Filtering
If you've ever worked with web-crawl data, you know it's mostly garbage. Traditionally, we used rule-based filtering: simple heuristics like "remove pages with too many symbols" or "filter by language tags." While this works for English, it's far too blunt for a multilingual pipeline. This is where model-based filtering comes into play: a technique that uses machine learning classifiers to identify high-quality, knowledge-rich, and structured text samples.
Instead of relying on rigid rules, developers are now using FastText, an efficient library for text classification and representation that can quickly categorize and filter multilingual datasets, alongside Transformer-based classifiers to score the quality of a document. The goal is to find "knowledge-rich" samples: text that actually contains facts and structured reasoning rather than just repetitive SEO spam. In experiments with the FineWeb-2 dataset, this model-based approach allowed a 1B-parameter Llama model to hit baseline MMLU scores using only 15% of the original tokens. That is a massive win for efficiency.
| Feature | Rule-Based Filtering | Model-Based Filtering |
|---|---|---|
| Logic | Hard-coded heuristics (e.g., word counts) | ML Classifiers (FastText/Transformers) |
| Precision | Low; often removes good data or keeps spam | High; targets "knowledge-rich" content |
| Language Support | Strong for English, weak for others | Scalable across diverse language families |
| Efficiency | Fast but requires massive data volume | Slower to set up but reduces token waste |
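As a rough illustration of what the model-based side of this table looks like in code, the sketch below scores documents with a fastText classifier. The model file name, the `__label__hq` label, and the threshold are assumptions; in practice you would train the classifier on your own labeled examples of knowledge-rich versus low-quality text.

```python
# Sketch: scoring web documents with a fastText quality classifier.
# "quality_classifier.bin" and the "__label__hq" label are hypothetical;
# train the classifier on labeled high/low-quality samples first.
import fasttext

quality_model = fasttext.load_model("quality_classifier.bin")

def keep_document(text, threshold=0.9):
    """Return True if the classifier scores the document as knowledge-rich."""
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = quality_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold

documents = ["...raw web-crawl documents..."]  # placeholder input
filtered = [doc for doc in documents if keep_document(doc)]
```

The same pattern scales to a Transformer-based scorer; fastText is simply cheap enough to run over billions of documents before the more expensive models ever see them.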
Practical Implementation: Building Your Pipeline
If you're looking to add new language support to an existing model, you generally have two paths: training from scratch or continual pretraining. Training from scratch gives you the most control over the tokenizer and data balance, but it's incredibly expensive. For most, continual pretraining on an English foundation model is the way to go.
To do this effectively, you can use frameworks like NVIDIA NeMo, a customizable AI framework that provides workflows for tokenizer merging and continual pretraining of multilingual LLMs. The process usually looks like this (a code sketch of the first two steps follows the list):
- Tokenizer Merging: You can't just keep the English tokenizer; it will treat non-English text as a series of unknown characters or fragmented bytes. You must train a tokenizer on your target languages and merge it with the original.
- Architecture Adjustment: Update the model's embedding layer to accommodate the new tokens added during the merge.
- Targeted Pretraining: Feed the model high-quality, filtered data from the target language. For example, using Thai Wikipedia data to adapt a GPT-1.3B model allows it to pick up the nuances of Thai without forgetting its English capabilities.
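NeMo ships its own workflow for these steps; purely as an illustration of the same idea, the sketch below uses the Hugging Face `transformers` APIs to extend a base English tokenizer with target-language tokens and resize the embedding layer. The base model name, vocabulary size, and corpus are placeholders.

```python
# Sketch (illustrative only): extend an English tokenizer with target-language
# tokens and grow the embedding matrix to match. Model name, vocab size, and
# corpus are placeholders; NVIDIA NeMo provides its own merging workflow.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")   # English base tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")      # English foundation model

# Train a tokenizer on the target-language corpus (a real run would stream
# a curated dump, e.g. Thai Wikipedia, instead of this stand-in list).
target_corpus = ["...target-language text...", "..."]
target_tokenizer = base_tokenizer.train_new_from_iterator(target_corpus, vocab_size=16_000)

# Tokenizer merging: add only the tokens the English tokenizer doesn't know.
# (A production merge would also reconcile BPE merge rules; this is simplified.)
base_vocab = base_tokenizer.get_vocab()
new_tokens = [tok for tok in target_tokenizer.get_vocab() if tok not in base_vocab]
num_added = base_tokenizer.add_tokens(new_tokens)

# Architecture adjustment: resize the embedding layer to cover the new tokens.
model.resize_token_embeddings(len(base_tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(base_tokenizer)}")
```

The newly added rows of the embedding matrix start out untrained, which is exactly why the targeted pretraining step that follows is necessary.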
Strategic Data Curation and the FineWeb2-HQ Approach
The recent development of the FineWeb2-HQ dataset provides a blueprint for how to handle a diverse set of languages. By focusing on a mix of Chinese, German, French, Arabic, and Danish, researchers showed that combining different language families creates a synergistic effect. The model doesn't just learn the languages; it learns the underlying patterns of human logic more robustly.
The key takeaway here is that not all tokens are created equal. A trillion tokens of low-quality web scrapings are worth far less than a few billion tokens of curated, structured data. When building your pipeline, prioritize the type of data (textbooks, high-quality Wikipedia entries, and structured forums) rather than just the volume of data. With the availability of open datasets containing over 2 trillion tokens of permissively licensed content, the barrier to entry is lower than ever; the challenge is now purely about how you filter that mountain of information.
Does adding more languages always degrade performance in English?
No. Recent research on 1B to 3B parameter models shows that as long as there is adequate token representation for each language, adding multilingual data does not degrade the performance of the individual language groups. It's about the balance of data, not the number of languages.
Why is English considered the best pivot language?
English has a massive volume of high-quality data, which allows models to develop strong general-purpose representations of logic and knowledge. These representations are then easier to map onto other languages via cross-lingual transfer, regardless of linguistic similarity.
What is the difference between rule-based and model-based filtering?
Rule-based filtering uses fixed heuristics (like word counts or character ratios) to prune data. Model-based filtering uses trained classifiers (like FastText or Transformers) to identify if a piece of text is actually "knowledge-rich" and structured, leading to much higher data quality and better model performance with fewer tokens.
How do I handle tokenizers when adding a new language to an existing model?
You should train a new tokenizer on the target language's dataset and merge it with the existing English tokenizer. This ensures the model can represent the new language efficiently without relying on fragmented byte-level tokens. You'll then need to expand the model's embedding layer to match the new tokenizer size.
What is the "curse of multilinguality"?
It's a theoretical trade-off where adding more languages to a model with a fixed parameter count supposedly degrades the performance of each individual language. However, this effect is less significant in modern model scales (1B-3B) if data is curated and allocated correctly.
Next Steps for Implementation
Depending on your goals, your next move will differ. If you are building a global-first model, start by designing your token allocation strategy. Don't just split the data equally; use a weighted approach based on the available high-quality data for each language. Use a model-based filter to strip out the noise early in the pipeline to save on compute costs.
If you are adapting an existing model for a specific region, focus on continual pretraining. Use NVIDIA NeMo to handle the tokenizer merge and start with a small, high-quality dataset (like a curated Wikipedia dump) before moving to larger web crawls. This prevents the model from "forgetting" its base capabilities while it learns the new linguistic patterns.