How to Handle Multilingual Data in LLM Pretraining Pipelines
The core problem in multilingual pretraining is no longer about whether we should mix languages, but how to optimize the allocation of tokens. If you give a model enough representation for a specific language, adding other languages doesn't degrade performance; in some cases, it actually helps. The challenge shifts from a binary choice to a data engineering problem: how do we find the highest quality tokens across hundreds of languages without drowning the model in noise?
The Myth of the Curse of Multilinguality
For years, the industry assumed that model capacity was a fixed pie. If a model had 3 billion parameters, adding a 10th or 20th language would theoretically "steal" capacity from the first few. However, research on models in the 1B to 3B parameter range has flipped this narrative. When investigators trained models on corpora spanning from 25 to over 400 languages, they found that combining English with diverse multilingual data doesn't necessarily hurt performance in any of them. As long as each language has a sufficient number of tokens, the model can handle the variety.
This means we can stop worrying about a strict trade-off and start focusing on token allocation: the strategic distribution of training tokens across different languages to ensure the model achieves balanced proficiency. Instead of just dumping raw web crawls into a pipeline, the goal is to identify the "sweet spot" of data volume for each language group to prevent the model from ignoring low-resource languages or over-fitting on high-resource ones.
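To make this concrete, here is a minimal sketch of one common weighting scheme, temperature-based sampling (the approach popularized by multilingual models such as XLM-R). The token counts, language set, and budget below are hypothetical; the exponent `alpha` controls how aggressively low-resource languages are upsampled relative to their raw share.

```python
# Sketch: temperature-based token allocation across languages.
# Raw token counts and the total budget are hypothetical; alpha < 1
# upsamples low-resource languages relative to their raw data share.

raw_tokens = {
    "en": 1_500_000_000,
    "de": 300_000_000,
    "ar": 80_000_000,
    "da": 20_000_000,
}

def allocation_weights(token_counts, alpha=0.3):
    """Exponentially smooth raw data shares so small languages are not drowned out."""
    total = sum(token_counts.values())
    smoothed = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(smoothed.values())
    return {lang: weight / norm for lang, weight in smoothed.items()}

budget = 10_000_000_000  # total training-token budget (hypothetical)
for lang, weight in allocation_weights(raw_tokens).items():
    print(f"{lang}: {weight:.2%} of budget -> {int(weight * budget):,} tokens")
```

With `alpha` closer to 0 the allocation approaches an even split; with `alpha` at 1 it simply mirrors the raw data distribution, so the value you pick is itself a curation decision.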
Using English as a Strategic Pivot
One of the most interesting findings in modern pretraining is the role of the pivot language. A pivot language is a high-resource language that helps the model learn general patterns of human communication, which it then transfers to other languages. While it seems logical to use a language from the same family (like using Spanish to help with Italian), the data suggests that English is the most effective pivot language due to its massive data availability and structural properties that catalyze cross-lingual generalization.
This happens because English provides a dense map of logical connections and knowledge. When a model learns a concept in English, it creates a semantic representation that is easier to map onto other languages, regardless of whether those languages are linguistically related. It's less about shared vocabulary and more about the model learning "how to think" in a way that generalizes across the board.
Moving From Rule-Based to Model-Based Filtering
If you've ever worked with web-crawl data, you know it's mostly garbage. Traditionally, we used rule-based filtering: simple heuristics like "remove pages with too many symbols" or "filter by language tags." While this works for English, it's far too blunt for a multilingual pipeline. This is where model-based filtering comes into play: a technique that uses machine learning classifiers to identify high-quality, knowledge-rich, and structured text samples.
Instead of relying on rigid rules, developers are now using FastText, an efficient library for text classification and representation that can quickly categorize and filter multilingual datasets, alongside Transformer-based classifiers to score the quality of a document. The goal is to find "knowledge-rich" samples: text that actually contains facts and structured reasoning rather than just repetitive SEO spam. In experiments with the FineWeb-2 dataset, this model-based approach allowed a 1B-parameter Llama model to hit baseline MMLU scores using only 15% of the original tokens. That is a massive win for efficiency.
| Feature | Rule-Based Filtering | Model-Based Filtering |
|---|---|---|
| Logic | Hard-coded heuristics (e.g., word counts) | ML Classifiers (FastText/Transformers) |
| Precision | Low; often removes good data or keeps spam | High; targets "knowledge-rich" content |
| Language Support | Strong for English, weak for others | Scalable across diverse language families |
| Efficiency | Fast but requires massive data volume | Slower to set up but reduces token waste |
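As a rough illustration of what the model-based side of this table looks like in code, the sketch below scores documents with a fastText classifier. The model file name, the `__label__hq` label, and the threshold are assumptions; in practice you would train the classifier on your own labeled examples of knowledge-rich versus low-quality text.

```python
# Sketch: scoring web documents with a fastText quality classifier.
# "quality_classifier.bin" and the "__label__hq" label are hypothetical;
# train the classifier on labeled high/low-quality samples first.
import fasttext

quality_model = fasttext.load_model("quality_classifier.bin")

def keep_document(text, threshold=0.9):
    """Return True if the classifier scores the document as knowledge-rich."""
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = quality_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold

documents = ["...raw web-crawl documents..."]  # placeholder input
filtered = [doc for doc in documents if keep_document(doc)]
```

The same pattern scales to a Transformer-based scorer; fastText is simply cheap enough to run over billions of documents before the more expensive models ever see them.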
Practical Implementation: Building Your Pipeline
If you're looking to add new language support to an existing model, you generally have two paths: training from scratch or continual pretraining. Training from scratch gives you the most control over the tokenizer and data balance, but it's incredibly expensive. For most, continual pretraining on an English foundation model is the way to go.
To do this effectively, you can use frameworks like NVIDIA NeMo, a customizable AI framework that provides workflows for tokenizer merging and continual pretraining of multilingual LLMs. The process usually looks like this (a code sketch of the first two steps follows the list):
- Tokenizer Merging: You can't just keep the English tokenizer; it will treat non-English text as a series of unknown characters or fragmented bytes. You must train a tokenizer on your target languages and merge it with the original.
- Architecture Adjustment: Update the model's embedding layer to accommodate the new tokens added during the merge.
- Targeted Pretraining: Feed the model high-quality, filtered data from the target language. For example, using Thai Wikipedia data to adapt a GPT-1.3B model allows it to pick up the nuances of Thai without forgetting its English capabilities.
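NeMo ships its own workflow for these steps; purely as an illustration of the same idea, the sketch below uses the Hugging Face `transformers` APIs to extend a base English tokenizer with target-language tokens and resize the embedding layer. The base model name, vocabulary size, and corpus are placeholders.

```python
# Sketch (illustrative only): extend an English tokenizer with target-language
# tokens and grow the embedding matrix to match. Model name, vocab size, and
# corpus are placeholders; NVIDIA NeMo provides its own merging workflow.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")   # English base tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")      # English foundation model

# Train a tokenizer on the target-language corpus (a real run would stream
# a curated dump, e.g. Thai Wikipedia, instead of this stand-in list).
target_corpus = ["...target-language text...", "..."]
target_tokenizer = base_tokenizer.train_new_from_iterator(target_corpus, vocab_size=16_000)

# Tokenizer merging: add only the tokens the English tokenizer doesn't know.
# (A production merge would also reconcile BPE merge rules; this is simplified.)
base_vocab = base_tokenizer.get_vocab()
new_tokens = [tok for tok in target_tokenizer.get_vocab() if tok not in base_vocab]
num_added = base_tokenizer.add_tokens(new_tokens)

# Architecture adjustment: resize the embedding layer to cover the new tokens.
model.resize_token_embeddings(len(base_tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(base_tokenizer)}")
```

The newly added rows of the embedding matrix start out untrained, which is exactly why the targeted pretraining step that follows is necessary.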
Strategic Data Curation and the FineWeb2-HQ Approach
The recent development of the FineWeb2-HQ dataset provides a blueprint for how to handle a diverse set of languages. By focusing on a mix of Chinese, German, French, Arabic, and Danish, researchers showed that combining different language families creates a synergistic effect. The model doesn't just learn the languages; it learns the underlying patterns of human logic more robustly.
The key takeaway here is that not all tokens are created equal. A trillion tokens of low-quality web scrapings are worth far less than a few billion tokens of curated, structured data. When building your pipeline, prioritize the type of data (textbooks, high-quality Wikipedia entries, and structured forums) rather than just the volume of data. With the availability of open datasets containing over 2 trillion tokens of permissively licensed content, the barrier to entry is lower than ever; the challenge is now purely about how you filter that mountain of information.
Does adding more languages always degrade performance in English?
No. Recent research on 1B to 3B parameter models shows that as long as there is adequate token representation for each language, adding multilingual data does not degrade the performance of the individual language groups. It's about the balance of data, not the number of languages.
Why is English considered the best pivot language?
English has a massive volume of high-quality data, which allows models to develop strong general-purpose representations of logic and knowledge. These representations are then easier to map onto other languages via cross-lingual transfer, regardless of linguistic similarity.
What is the difference between rule-based and model-based filtering?
Rule-based filtering uses fixed heuristics (like word counts or character ratios) to prune data. Model-based filtering uses trained classifiers (like FastText or Transformers) to identify if a piece of text is actually "knowledge-rich" and structured, leading to much higher data quality and better model performance with fewer tokens.
How do I handle tokenizers when adding a new language to an existing model?
You should train a new tokenizer on the target language's dataset and merge it with the existing English tokenizer. This ensures the model can represent the new language efficiently without relying on fragmented byte-level tokens. You'll then need to expand the model's embedding layer to match the new tokenizer size.
What is the "curse of multilinguality"?
It's a theoretical trade-off where adding more languages to a model with a fixed parameter count supposedly degrades the performance of each individual language. However, this effect is less significant in modern model scales (1B-3B) if data is curated and allocated correctly.
Next Steps for Implementation
Depending on your goals, your next move will differ. If you are building a global-first model, start by designing your token allocation strategy. Don't just split the data equally; use a weighted approach based on the available high-quality data for each language. Use a model-based filter to strip out the noise early in the pipeline to save on compute costs.
If you are adapting an existing model for a specific region, focus on continual pretraining. Use NVIDIA NeMo to handle the tokenizer merge and start with a small, high-quality dataset (like a curated Wikipedia dump) before moving to larger web crawls. This prevents the model from "forgetting" its base capabilities while it learns the new linguistic patterns.