Curriculum Learning for large language models in low-resource languages

Short description, required if the project has no logo (eu):

Use of the computational resources of the EuroHPC supercomputer to scale up the experiments and build very large models for low-resource European languages

Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for major advances in Natural Language Processing. Building LLMs requires huge amounts of data, which are not available for low-resource languages. As a result, LLMs excel in high-resource languages such as English but lag behind in many others, especially those where training resources are scarce, including many regional languages in Europe. The data scarcity problem is usually alleviated by augmenting the training corpora in the target language with text from a high-resource language (e.g. English). In this project we propose a systematic study of different strategies for performing this combination optimally, framing the existing approaches within a more general curriculum learning paradigm. We will use the computational resources of EuroHPC to scale up the experiments and build LLMs for four low-resource European languages. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.
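
To make the idea of treating data mixing as a curriculum more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the project's actual method: the names `target_fraction` and `sample_batch`, the linear schedule, and the default proportions are all hypothetical. A constant schedule (`start == end`) corresponds to the usual static mix of target-language and high-resource text, while other settings change the mix as training progresses.

```python
import random


def target_fraction(step, total_steps, start=0.1, end=0.9):
    """Share of target-language examples at a given training step.

    Hypothetical linear schedule: moves from `start` to `end` over training.
    Setting start == end gives a fixed (non-curriculum) mixing ratio.
    """
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def sample_batch(high_resource, target, step, total_steps, batch_size,
                 rng, start=0.1, end=0.9):
    """Draw a batch, picking each example from the target-language pool
    with the probability given by the schedule, else from the
    high-resource pool."""
    p = target_fraction(step, total_steps, start, end)
    batch = []
    for _ in range(batch_size):
        pool = target if rng.random() < p else high_resource
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    english = ["en-doc-1", "en-doc-2", "en-doc-3"]   # placeholder data
    low_res = ["xx-doc-1", "xx-doc-2"]               # placeholder data
    for step in (0, 50, 99):
        share = target_fraction(step, 100)
        batch = sample_batch(english, low_res, step, 100, 4, rng)
        print(step, round(share, 2), batch)
```

Other schedules (step-wise, or starting with mostly target-language text and ending with mostly high-resource text) fit the same interface by swapping the body of `target_fraction`.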

Official code:

EHPC-EXT-2024E01-042

Principal investigator:

Aitor Soroa

Institution:

EuroHPC Joint Undertaking

Department:

Hitz Zentroa

Start date:

2024/10/10

End date:

2025/10/09

Group:

IXA group

Group's principal investigator:

Aitor Soroa

IXA members:

Contract:

No

Website:

http://
