OpenGPT-X research project
Large AI language model published
The large AI language model of the OpenGPT-X research project is now available for download on Hugging Face: "Teuken-7B" was trained from scratch on all 24 official languages of the EU and has 7 billion parameters.
Researchers and companies can use the commercially usable open source model for their own artificial intelligence (AI) applications. OpenGPT-X is a consortium project funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS. With this release, the project partners have launched a large AI language model as a freely usable open source model with a European perspective.
"In the OpenGPT-X project, we have spent the past two years working with strong partners from research and industry to research the basic technology for large AI foundation models and train corresponding models. We are pleased that we can now make our 'Teuken-7B' model freely available worldwide and thus offer an alternative for science and companies that originates from public research," says Prof. Dr. Stefan Wrobel, Institute Director at Fraunhofer IAIS. "Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt or further develop the model for their own work and applications. In this way, we want to make a contribution both within the scientific community and together with companies from different industries to address the growing demand for transparent and customizable generative artificial intelligence solutions."
Teuken-7B is currently one of the few AI language models that have been developed multilingually from the ground up. Around 50% of its pre-training data is non-English, and it has been trained on all 24 official EU languages. According to Fraunhofer IAIS, its performance has proven to be stable and reliable across several languages. This offers added value, particularly for international companies with multilingual communication needs and multilingual product and service offerings. Because the model is provided as open source, companies and organizations can operate their own adapted models in real applications, and sensitive data can remain within the company.
Multilingual 'Tokenizer'
In addition to model training, the OpenGPT-X team also addressed numerous research questions, such as how multilingual AI language models can be trained and operated in a more energy- and cost-efficient manner. To this end, a multilingual 'tokenizer' was developed in the project. The task of a tokenizer is to break down words into individual word components, called tokens - the fewer tokens a text requires, the more quickly and (energy-) efficiently a language model generates the answer. The tokenizer developed in the project reduced training costs compared with other multilingual tokenizers, such as those of Llama3 or Mistral. This is particularly important for European languages with long words such as German, Finnish or Hungarian. Efficiency gains can also be achieved in the operation of multilingual AI applications.
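The relationship between token count and cost can be illustrated with a small measurement. The following is a minimal sketch and does not use the OpenGPT-X tokenizer: it defines two hypothetical toy tokenizers that cut words into fixed-size chunks and compares their average number of tokens per word (often called "fertility"). The function names, the chunk sizes and the sample text are assumptions chosen purely for illustration.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def chunk_tokenizer(chunk_size: int) -> Tokenizer:
    """Return a toy tokenizer that cuts every word into pieces of at most chunk_size characters."""
    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            tokens.extend(word[i:i + chunk_size] for i in range(0, len(word), chunk_size))
        return tokens
    return tokenize


def fertility(tokenize: Tokenizer, text: str) -> float:
    """Average number of tokens per whitespace-separated word; lower means fewer tokens to process."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


# Hypothetical sample with long German compound words.
sample = "Rechtsschutzversicherung Geschwindigkeitsbegrenzung Haus"
coarse = chunk_tokenizer(8)  # stands in for a vocabulary with longer word pieces
fine = chunk_tokenizer(3)    # stands in for a vocabulary with shorter word pieces

print("coarse tokens per word:", fertility(coarse, sample))
print("fine tokens per word:", fertility(fine, sample))
```

A real comparison would replace the toy tokenizers with the tokenizers of the models under study and use a representative text corpus per language. The principle stays the same: a tokenizer that needs fewer tokens for the same text leaves the model with fewer steps to compute during training and generation.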
Also accessible via the Gaia-X infrastructure
The OpenGPT-X joint project was funded as part of the BMWK funding program 'Innovative and practical applications and data spaces in the Gaia-X digital ecosystem'. Teuken-7B is therefore also accessible via the Gaia-X infrastructure. Actors in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. In contrast to existing cloud solutions, Gaia-X is a federated system that allows different service providers and data owners to connect with each other. The data always remains with the owner and is only shared according to defined conditions.
"I am delighted with today's release of the Gaia-X-based AI language model Teuken-7B and congratulate the OpenGPT-X project for reaching this important milestone. Teuken-7B also enables the secure use of sensitive company data, as the Gaia-X standards guarantee data storage and processing in accordance with the highest European data protection and security regulations. Innovations such as these strengthen digital sovereignty, competitiveness and also the resilience of Germany and Europe. This is why the BMWK is funding the project with around 14 million euros," says Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.
Prof. Dr. Bernhard Grill, Institute Director at Fraunhofer IIS, emphasizes the importance for safety-relevant applications: "With the completely independently trained language model published here, the project partners demonstrate their ability to generate their own large models. The associated access to a large AI language model enables applications that offer much better control over this technology without the need for non-visible third-party components - e.g. for specific, particularly safety-critical applications in the automotive sector, robotics, medicine or finance. By training with the data relevant to the specific use case and using application-specific architectures, companies can create individual AI solutions that do not require black box components."
Generative AI from a strong network
Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing very large amounts of data, using powerful European HPC infrastructures and carrying out efficient model training. Teuken-7B was trained using the 'Juwels' supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the German AI Association (KI Bundesverband), TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert and Westdeutscher Rundfunk (WDR) worked on OpenGPT-X as partners. The technology developed in OpenGPT-X will also provide the partners with the basis for training their own models in the future.
"OpenGPT-X serves as an example of how valuable basic technology can be created with the funds of a publicly funded project and the joint efforts of a broad-based consortium - from the underlying infrastructure to the training of models and productive application. In the interests of technology and data sovereignty, it is now important to build on this foundation: We hope that OpenGPT-X will be used as a basis for many subsequent activities," emphasizes Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.
The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimizations and evaluations of the models can take place.
The path to using Teuken-7B
Interested developers from the scientific community or companies can download Teuken-7B free of charge from Hugging Face and work with it in their own development environment. The model has already been optimized for chat by means of instruction tuning. Instruction tuning adapts a large AI language model so that it correctly understands instructions from users, which is particularly relevant in practice - for example, in a chat application.
Teuken-7B is available in two versions: a research-only version and a version under the 'Apache 2.0' license, which companies can use for commercial purposes in addition to research and integrate into their own AI applications. The performance of both versions is roughly comparable, but some of the data sets used for instruction tuning exclude commercial use and were therefore not used for the Apache 2.0 version.
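For developers, working with the model in their own environment could look roughly like the following sketch, which uses the Hugging Face "transformers" library. Several details are assumptions that are not part of this announcement and should be checked against the model cards on Hugging Face: the two repository IDs, the role name "User", the chat template name "EN" and the need for trust_remote_code. The helper names select_repo, build_messages and generate_reply are hypothetical.

```python
from typing import Dict, List

# Assumption: repository IDs as they appear on Hugging Face; verify before use.
REPOS: Dict[str, str] = {
    "research": "openGPT-X/Teuken-7B-instruct-research-v0.4",
    "commercial": "openGPT-X/Teuken-7B-instruct-commercial-v0.4",
}


def select_repo(commercial_use: bool) -> str:
    """Pick the Apache 2.0 version for commercial use, otherwise the research version."""
    return REPOS["commercial" if commercial_use else "research"]


def build_messages(user_text: str) -> List[Dict[str, str]]:
    """Wrap a user instruction in the chat message format.

    Assumption: the model's chat template expects the role name "User".
    """
    return [{"role": "User", "content": user_text}]


def generate_reply(user_text: str, commercial_use: bool = True, max_new_tokens: int = 128) -> str:
    """Download the selected version (several gigabytes) and generate one reply."""
    # Imported lazily so the helpers above work without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = select_repo(commercial_use)
    # Assumption: the repository ships custom code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo, trust_remote_code=True, torch_dtype=torch.bfloat16
    ).eval()

    # Assumption: the tokenizer provides a chat template named "EN".
    inputs = tokenizer.apply_chat_template(
        build_messages(user_text),
        chat_template="EN",
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    )
    input_ids = inputs["input_ids"].to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Calling generate_reply("What is Gaia-X?") would download the Apache 2.0 version and return the generated answer as text; passing commercial_use=False selects the research version instead. Because the model runs locally, the prompt and any company data in it do not leave the machine.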