OpenGPT-X research project: Large AI language model published
The large AI language model from the OpenGPT-X research project is now available for download on Hugging Face: "Teuken-7B" was trained from scratch on the 24 official languages of the EU and has 7 billion parameters.
Language distribution of Teuken-7B-v0.4: in addition to code, the pretraining data of Teuken-7B-v0.4 consists of around 50 % non-English text from 23 European countries and only around 40 % English text (by comparison, Meta Llama3 was trained on only 8 % non-English data). This distinguishes Teuken-7B-v0.4 from most multilingual models available to date, which were extended with multilingual data only during continued pre-training or fine-tuning. © Fraunhofer IAIS

