OpenGPT-X research project

Inka Krischke, 06.12.2024, 11:05

Large AI language model published

The large AI language model of the OpenGPT-X research project is now available for download on Hugging Face: "Teuken-7B" was trained from scratch on all 24 official languages of the EU and comprises 7 billion parameters.

The 'Juwels' supercomputer was used, among others, for the training of Teuken-7B. © Research Center Jülich / Sascha Kreklau

Researchers and companies can use the commercially usable open source model for their own artificial intelligence (AI) applications. With it, the partners of the OpenGPT-X consortium project have launched a large AI language model as a freely usable open source model with a European perspective. The project is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS.

"In the OpenGPT-X project, we have spent the past two years working with strong partners from research and industry to research the basic technology for large AI foundation models and train corresponding models. We are pleased that we can now make our 'Teuken-7B' model freely available worldwide and thus offer an alternative for science and companies that originates from public research," says Prof. Dr. Stefan Wrobel, Institute Director at Fraunhofer IAIS. "Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt or further develop the model for their own work and applications. In this way, we want to make a contribution both within the scientific community and together with companies from different industries to address the growing demand for transparent and customizable generative artificial intelligence solutions."

The language distribution of Teuken-7B-v0.4: In addition to code, Teuken-7B-v0.4 contains around 50% non-English text from 23 European countries and only around 40% English pre-training data (by comparison, Meta's Llama 3 was trained with only 8% non-English languages). This distinguishes Teuken-7B-v0.4 from most of the multilingual models available to date, which were only expanded to include multilingual data in the course of further pre-training or fine-tuning. © Fraunhofer IAIS

Teuken-7B is currently one of the few AI language models that have been developed multilingually from the ground up. It contains around 50% non-English pre-training data and has been trained in all 24 official languages of the EU. According to Fraunhofer IAIS, its performance has proven to be stable and reliable across several languages. This offers added value, particularly for international companies with multilingual communication needs and product and service offerings. Because Teuken-7B is provided as an open source model, companies and organizations can operate their own adapted models in real applications, and sensitive data can remain within the company.

Multilingual 'Tokenizer'

In addition to model training, the OpenGPT-X team also addressed numerous research questions, such as how multilingual AI language models can be trained and operated in a more energy- and cost-efficient manner. To this end, a multilingual 'tokenizer' was developed in the project. The task of a tokenizer is to break down words into individual word components, called tokens: the fewer tokens a text requires, the more quickly and energy-efficiently a language model generates its answer. The tokenizer developed in the project reduced training costs compared to other multilingual tokenizers, such as those of Llama 3 or Mistral. This is particularly important for European languages with long words such as German, Finnish or Hungarian. Efficiency gains can also be achieved in the operation of multilingual AI applications.
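
The effect of vocabulary coverage on token counts can be illustrated with a minimal Python sketch. It is not the OpenGPT-X tokenizer: the greedy longest-match splitting and both toy vocabularies are assumptions invented for this illustration, and the resulting numbers are not measurements of any real model.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of one word into pieces from vocab.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens


def token_count(text, vocab):
    """Total number of tokens for a whitespace-separated text."""
    return sum(len(tokenize(w, vocab)) for w in text.lower().split())


def fertility(text, vocab):
    """Average number of tokens per word (lower means cheaper processing)."""
    return token_count(text, vocab) / len(text.split())


def extra_tokens_percent(text, vocab, baseline_vocab):
    """Additional tokens in % that vocab needs compared to baseline_vocab."""
    return (token_count(text, vocab) / token_count(text, baseline_vocab) - 1) * 100


# Hypothetical toy vocabularies, invented for this illustration only.
ENGLISH_CENTRIC = {"date", "sch", "ver", "und", "or"}
MULTILINGUAL = ENGLISH_CENTRIC | {"daten", "schutz", "grund", "verordnung"}

TEXT = "Datenschutzgrundverordnung"  # one long German compound word

print(tokenize(TEXT.lower(), ENGLISH_CENTRIC))
print(tokenize(TEXT.lower(), MULTILINGUAL))
print(fertility(TEXT, ENGLISH_CENTRIC), fertility(TEXT, MULTILINGUAL))
print(extra_tokens_percent(TEXT, ENGLISH_CENTRIC, MULTILINGUAL))
```

In this toy setup the vocabulary without German word components needs 16 tokens for the compound word, while the vocabulary that contains them needs 4. Since a language model processes text token by token, such differences carry over to training and inference costs.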

Also accessible via the Gaia-X infrastructure

The bar chart shows the performance of Teuken-7B-instruct-research-v0.4 in the multilingual benchmarks ARC, HellaSwag and TruthfulQA in comparison to other open source models of similar size. The bars show the performance for the respective benchmark averaged over 21 European languages, and the mean value of all three benchmarks. In this selection of benchmarks, Teuken-7B-instruct-research-v0.4 is ahead of all other models on average. In the individual benchmarks ARC and HellaSwag, Teuken is in second place behind Salamandra-7b-instruct, and in TruthfulQA in second place behind Mistral-7B-instruct-v0.3. © Fraunhofer IAIS

The OpenGPT-X joint project was funded as part of the BMWK funding program 'Innovative and practical applications and data spaces in the Gaia-X digital ecosystem'. Teuken-7B is therefore also accessible via the Gaia-X infrastructure. Actors in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. In contrast to existing cloud solutions, Gaia-X is a federated system that allows different service providers and data owners to connect with each other. The data always remains with the owner and is only shared according to defined conditions.

"I am delighted with today's release of the Gaia-X-based AI language model Teuken-7B and congratulate the OpenGPT-X project for reaching this important milestone. Teuken-7B also enables the secure use of sensitive company data, as the Gaia-X standards guarantee data storage and processing in accordance with the highest European data protection and security regulations. Innovations such as these strengthen digital sovereignty, competitiveness and also the resilience of Germany and Europe. This is why the BMWK is funding the project with around 14 million euros," says Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.

Prof. Dr. Bernhard Grill, Institute Director at Fraunhofer IIS, emphasizes the importance for safety-relevant applications: "With the completely independently trained language model published here, the project partners demonstrate their ability to generate their own large models. The associated access to a large AI language model enables applications that offer much better control over this technology without the need for non-visible third-party components - e.g. for specific, particularly safety-critical applications in the automotive sector, robotics, medicine or finance. By training with the data relevant to the specific use case and using application-specific architectures, companies can create individual AI solutions that do not require black box components."

Generative AI from a strong network

The diagram shows the additional computing power required to process a non-English text with the tokenizer associated with the language model (in % compared to Llama 3). In comparison, Teuken models require the least amount of additional computing power and therefore incur the lowest surcharge for multilingual queries to the model. © Fraunhofer IAIS

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing very large amounts of data, using powerful European HPC infrastructures and carrying out efficient model training. Teuken-7B was trained using the 'Juwels' supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the German AI Association (KI Bundesverband), TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert and Westdeutscher Rundfunk (WDR) worked on OpenGPT-X as partners. The technology developed in OpenGPT-X will also provide the partners with the basis for training their own models in the future.

"OpenGPT-X serves as an example of how valuable basic technology can be created with the funds of a publicly funded project and the joint efforts of a broad-based consortium - from the underlying infrastructure to the training of models and productive application. In the interests of technology and data sovereignty, it is now important to build on this foundation: We hope that OpenGPT-X will be used as a basis for many subsequent activities," emphasizes Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.

The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimizations and evaluations of the models can take place.

The path to using Teuken-7B

Interested developers from the scientific community or companies can download Teuken-7B free of charge from Hugging Face and work with it in their own development environment. The model has already been optimized for chat by means of instruction tuning. Instruction tuning adapts large AI language models so that they correctly follow instructions from users, which is particularly relevant for using the models in practice, for example in a chat application.

Teuken-7B is available in two versions: a version that can be used for research purposes and a version under the 'Apache 2.0' license, which companies can use for commercial purposes in addition to research and integrate into their own AI applications. The performance of both models is roughly comparable, but some of the data sets used for instruction tuning exclude commercial use and were therefore not used in the Apache 2.0 version.
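
For developers, loading one of the two versions could look roughly like the following Python sketch, which uses the Hugging Face 'transformers' library. The organization name 'openGPT-X', the repository name of the Apache 2.0 variant, the need for trust_remote_code and the plain-prompt call are assumptions not stated in this article; the model card on Hugging Face is the authoritative source, including for the recommended chat prompt format.

```python
# Assumed organization name on Hugging Face; verify before use.
ORG = "openGPT-X"


def repo_id(variant="research", version="v0.4"):
    """Build the assumed repository name for one of the two versions.

    'research' follows the model name used in the benchmark caption;
    'commercial' is an assumed name for the Apache 2.0 version.
    """
    if variant not in ("research", "commercial"):
        raise ValueError("variant must be 'research' or 'commercial'")
    return f"{ORG}/Teuken-7B-instruct-{variant}-{version}"


def load(variant="research"):
    """Download the tokenizer and model weights (several gigabytes)."""
    # Imported here so the helper above works without the package installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(variant)
    # trust_remote_code is an assumption: it is only needed if the
    # repository ships custom tokenizer or model code.
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
    return tokenizer, model


def demo(prompt="Wie funktioniert ein Tokenizer?"):
    """Generate a short answer; requires the 'torch' package as well."""
    tokenizer, model = load("research")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling demo() triggers the download and a first generation. For commercial applications, load("commercial") would select the Apache 2.0 version under the naming assumed here.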
