Large Language Models (LLMs) have revolutionized the field of AI, opening up new opportunities for research and industry applications. However, LLMs demonstrate impressive capabilities only in high-resource languages, particularly English, while their performance varies substantially across other languages. Especially in the case of low-resource languages, such as Greek, existing open-source LLMs underperform due to a lack of training data.

Recently, there have been efforts to extend the capabilities of open-source LLMs to other languages (e.g., LeoLM for German, Aguila for Spanish, etc.). This shift provides local communities with alternatives to commercial, siloed solutions, and with the control needed to develop safe, application-optimized models.

To address these challenges, the Institute for Language and Speech Processing of the Athena Research Center is thrilled to introduce Meltemi, the first Greek bilingual LLM. While remaining highly proficient in English, Meltemi has been extended to understand and generate fluent text in Modern Greek. Built on top of Mistral-7B through continual pretraining, Meltemi is trained on a corpus of 28.5 billion tokens that includes high-quality Greek texts.

Two models, trained with an 8k context length, have already been released under the Apache 2.0 license: Meltemi-7B-v1 and Meltemi-Instruct-7B-v1, an instruction-tuned variant that can be used for chatbot applications. The performance of the released models has been assessed on an LLM evaluation suite created by ILSP, showing an average improvement of 14.9% over Mistral-7B.
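Instruction-tuned chat models like Meltemi-Instruct-7B-v1 typically consume a conversation as a list of role-tagged messages that is rendered into a single prompt string. The sketch below illustrates that idea in plain Python; the tag format shown is a generic assumption for illustration, not Meltemi's actual chat template, and real use should rely on the tokenizer's own chat-templating support.

```python
# Sketch: rendering a role-tagged conversation into a prompt string for an
# instruction-tuned chat model. The <|role|> tag format is an assumption
# for illustration, NOT the actual Meltemi chat template.

def format_chat(messages):
    """Render a list of {"role", "content"} messages into one prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}")
    parts.append("<|assistant|>\n")  # trailing tag cues the model to respond
    return "\n".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful bilingual assistant."},
    {"role": "user", "content": "Πες μου για το Αιγαίο."},  # "Tell me about the Aegean."
]
prompt = format_chat(conversation)
```

With Hugging Face tokenizers, the equivalent step is usually handled by the model's bundled chat template rather than hand-rolled formatting like this.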

The models were trained on AWS infrastructure made available by GRNET.

You can find out more here.