Battle for AI Polyglotism Aligns with Europe's Linguistic Diversity
In an ongoing effort to address the challenge of low-resource languages in large language models (LLMs), European AI companies and projects are actively working towards a more inclusive approach.
The EuroLLM project, spearheaded by Portuguese AI company Unbabel in partnership with European universities, is a significant stride in this direction. EuroLLM focuses on understanding and generating text in all official EU languages, including low-resource ones, as well as languages widely spoken by immigrant communities and major trading partners, such as Hindi, Chinese, and Turkish.
The scarcity of training data for less-spoken languages has been a major challenge for EuroLLM. However, resources like Europarl transcripts, which provide parallel data for official EU languages, have aided training. EuroLLM currently offers models ranging from 1.7B to 22B parameters capable of translation and general multilingual interaction.
Balanced pre-training data distribution is another strategy being employed. Models like Salamandra and EuroLLM intentionally distribute training tokens fairly across languages, boosting performance for smaller languages such as Basque and Galician. However, this can sometimes reduce performance on high-resource languages like Spanish and Catalan.
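As a rough illustration of how such balancing can work, the sketch below uses temperature-based sampling, a common way to flatten a skewed token distribution across languages. This is not the published EuroLLM or Salamandra recipe; the function name and the corpus sizes are illustrative assumptions.

```python
# Hedged sketch: temperature-based sampling as one way to balance
# pre-training tokens across languages. Corpus sizes are made up.

def sampling_weights(token_counts, temperature=0.3):
    """Flatten the raw token distribution: temperature=1.0 keeps the
    natural proportions, temperature -> 0 approaches a uniform split."""
    scaled = {lang: n ** temperature for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Illustrative (fabricated) corpus sizes in tokens
corpus = {"en": 1_000_000_000, "es": 300_000_000, "eu": 5_000_000}

natural = sampling_weights(corpus, temperature=1.0)
balanced = sampling_weights(corpus, temperature=0.3)

# Basque's share of training tokens rises sharply under the flatter
# distribution, while English's share shrinks -- the trade-off the
# article describes.
assert balanced["eu"] > natural["eu"]
assert balanced["en"] < natural["en"]
```

Lower temperatures give smaller languages a larger share of training steps, which is exactly why high-resource languages can lose some performance in the process.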
Continued pre-training and fine-tuning methods are also being used: adding data from low-resource languages alongside English helps prevent catastrophic forgetting while still achieving strong target-language results. Fine-tuning alone improves fluency and style but is less effective at boosting reasoning or QA in low-resource languages without extensive multilingual pre-training.
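A minimal sketch of this data-mixing idea is shown below: during continued pre-training, English documents are interleaved with target-language documents so the model keeps seeing English. The 30% English ratio and the `sample_mixture` function are illustrative assumptions, not a published recipe.

```python
import random

def sample_mixture(target_docs, english_docs, n_steps,
                   english_ratio=0.3, seed=0):
    """Draw n_steps training documents, picking an English document with
    probability english_ratio and a target-language one otherwise, so
    the model is less likely to forget English."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_steps):
        pool = english_docs if rng.random() < english_ratio else target_docs
        stream.append(rng.choice(pool))
    return stream

# Toy example: Basque target data mixed with English "replay" data
basque = ["eu_doc_1", "eu_doc_2", "eu_doc_3"]
english = ["en_doc_1", "en_doc_2"]
stream = sample_mixture(basque, english, n_steps=1000, english_ratio=0.3)
english_share = sum(d.startswith("en") for d in stream) / len(stream)
# english_share will be close to the requested 0.3 ratio
```

Real training pipelines do this at the dataset level (e.g. weighted interleaving of data sources) rather than per document, but the principle is the same: a fixed fraction of the original language's data stays in the stream.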
There is also growing attention to smaller, efficient LLMs optimized for low-resource environments such as edge devices. These compact models offer accessibility and sustainability benefits and can perform well where large models are impractical, promising for languages with limited digital resources.
Infrastructure and ecosystem support are crucial for diverse language development and deployment. Germany and other European countries are investing in AI infrastructure, training, and access programs to empower smaller enterprises and researchers, which could in turn benefit work on less-resourced languages.
Notable European AI initiatives include Lumi, which uses a "cross-lingual training" technique, sharing parameters between high-resource and low-resource languages. Hugging Face, a company promoting open models, is one of the driving forces behind the BLOOM model, a groundbreaking multilingual model. Europe also boasts high-profile AI companies and projects such as Mistral, which offers free-to-use models with multilingual support.
A LinkedIn poll revealed a 50/50 split between people who use AI tools only in English and those who use a mixture of languages. This underscores the need for more inclusive AI tools that cater to a wider range of languages. European initiatives are stepping up to meet this challenge, aiming for equitable coverage across Europe's linguistic diversity.