AI Is Saving Endangered Languages and Killing Them at the Same Time
- Nikita Silaech
- Dec 1, 2025
- 2 min read

Languages are dying faster than ever before. UNESCO estimates that nearly 40% of the world's languages are endangered, and the rate of language loss could triple within 40 years (Venus Moon, 2025). Every two weeks another minority language disappears completely (Goethe Institute, 2019).
Artificial intelligence has emerged as a tool to preserve these languages. Meta's "No Language Left Behind" program expanded machine translation to more than 200 languages, many of which had never been served by translation software before, including Tswana, Dari, and Samoan (Nature, 2024). Google's Project Euphonia uses AI to recognize and document rare speech patterns from endangered language speakers. These efforts matter because without digital records, a language can be lost forever when its last speaker dies.
But there is a darker side to this story. Large language models like ChatGPT are trained predominantly on English and a handful of other data-rich languages. This creates an AI landscape heavily skewed toward the Western world. As AI becomes more integrated into education, healthcare, and governance, the dominance of a small number of languages in these systems risks accelerating linguistic homogenization (Tech Policy Press, 2025).
The paradox is that the same technology being used to preserve endangered languages is making the conditions for their survival worse. When AI systems are deployed globally and primarily speak English, they create an incentive structure that pushes people away from minority languages toward languages the AI understands better.
There is also a data-quality problem. Training AI models on endangered languages requires human language specialists to vet the training data. Without that oversight, models are trained on low-quality text that was itself generated by AI and then produce even more low-quality text, so each new generation of models learns from a further degraded version of the language.
William Lamb, a linguist at the University of Edinburgh, has documented this happening with Scottish Gaelic. Much of the Gaelic text now online is machine-generated, which means new AI models train on AI output, creating a feedback loop that erodes the language itself (Nature, 2024).
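To see why this feedback loop is so corrosive, here is a minimal Python sketch of the dynamic, not a model of real Gaelic data: each "generation" of a model is trained only on text sampled from the previous generation's output, so rare words that happen not to be sampled vanish permanently. The vocabulary size, sample size, and Zipf-like word distribution are arbitrary illustrative assumptions.

```python
# Toy simulation of the feedback loop: each generation trains only on the
# previous generation's output, so lexical diversity shrinks over time.
import random
from collections import Counter

random.seed(0)

VOCAB_SIZE = 5000      # hypothetical number of distinct words in the language
SAMPLE_SIZE = 20000    # hypothetical amount of text each generation trains on
GENERATIONS = 6

# Generation 0: human-written text with a long-tailed (Zipf-like) distribution.
vocab = list(range(VOCAB_SIZE))
weights = [1.0 / (rank + 1) for rank in range(VOCAB_SIZE)]

for gen in range(GENERATIONS):
    # "Train" the next model: its output distribution is just the empirical
    # word frequencies of the text it saw, so unseen words never come back.
    corpus = random.choices(vocab, weights=weights, k=SAMPLE_SIZE)
    counts = Counter(corpus)
    print(f"generation {gen}: {len(counts)} distinct words survive")
    vocab = list(counts.keys())
    weights = [counts[w] for w in vocab]
```

Run it and the count of surviving words falls with every generation, which is the point: without fresh human-written text in the loop, the modelled language only gets narrower.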
The solution requires something more difficult than technology alone: community involvement. Researchers must work with native speakers, local anthropologists, and sociolinguists, not only in building these systems but also in deciding how they are used and deployed. Ensuring linguistic diversity in AI means building smaller language models that preserve linguistic nuance rather than relying on massive English-centric models (Oxford Academic, 2025).
It also requires data sovereignty. Communities that speak endangered languages need control over how their language data is collected, used for training, and commercialized. Right now, much of this work happens without meaningful engagement from the language communities themselves.
AI could be a genuine tool for language revitalization if it were designed with those communities as primary stakeholders rather than as data sources. But at the moment we are heading in the opposite direction: building systems that can preserve languages while simultaneously creating the conditions that make those languages less essential.




