Hadoop
Rethinking big data handling
Breaking down the separation of data storage and data processing: That is the goal of the open source platform Hadoop. John Kreisa, Vice President for International Marketing at Hortonworks, talks about an architecture that grows with the volume of data.
Mr. Kreisa, Hadoop is set to play an important role in the field of big data. What is behind it?
Kreisa: Hadoop is a project within the Apache Software Foundation and is intended to provide a fundamental technology for big data. Behind it is a free framework written in Java for scalable, distributed software. This should make it possible to carry out intensive computing processes with large amounts of data on computer clusters. For us at Hortonworks, this is the basis of all our developments. From a technical perspective, Hadoop requires new thinking in the areas of computing, data and analysis.
How does Hadoop differ from previous methods?
Kreisa: Previously, data was stored and processed in different parts of the network. It therefore always had to be moved across the network for processing, and the results then had to be written back to the storage location. This makes it almost impossible to carry out a really fast analysis with a corresponding gain in knowledge. The separation of data storage and data processing is increasingly proving to be a performance bottleneck. Hadoop addresses precisely this problem and combines data storage and processing in so-called nodes. Large computing tasks are divided into many small jobs and distributed across the nodes, in each case to the node where the data is stored.
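The idea of sending small jobs to the nodes that hold the data can be illustrated with a minimal sketch. It is not Hadoop code: the cluster layout, the node and block names, and the functions plan_tasks, run_task and run_job are all hypothetical and only mimic the principle in plain Python.

```python
from collections import Counter

# Hypothetical cluster: each node stores some blocks of a larger log file.
CLUSTER = {
    "node-1": {"block-a": "error ok ok", "block-b": "ok error"},
    "node-2": {"block-c": "ok ok ok"},
    "node-3": {"block-d": "error error ok"},
}


def plan_tasks(cluster):
    """Split the job into one small task per block, assigned to the node storing it."""
    return [(node, block_id) for node, blocks in cluster.items() for block_id in blocks]


def run_task(cluster, node, block_id):
    """Stands in for work done on the node itself: count words in the local block."""
    return Counter(cluster[node][block_id].split())


def run_job(cluster):
    """Merge the small partial results; the raw blocks never leave their nodes."""
    total = Counter()
    for node, block_id in plan_tasks(cluster):
        total += run_task(cluster, node, block_id)
    return dict(total)


result = run_job(CLUSTER)
print(result)
```

Only the compact per-block counts are combined at the end, which is the point of the design: the bulky input stays where it is stored.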
The second point is the focus on the data. Conventional architectures typically store and process data in the rows and columns of a relational database. Some source systems can deliver data in that form, but many cannot. As a result, a number of processing steps have had to be added to bring the data into a processable format: the so-called ETL process. In this process, the data is first extracted from the source system, then transformed for processing, sometimes several times, and finally loaded into the relational database. In many large systems, the ETL process accounts for around 70% of the total system costs. Even before the actual processing and evaluation can begin, more than two thirds of the available budget has already been used up. And despite the immense costs, every transformation also means a loss of information.
John Kreisa: "With Hadoop, the processing comes to the data instead of the data coming to the processing."
© Hortonworks
Instead of transforming the data for processing, Hadoop focuses on storing data in its most original form and optimizing it for processing. The stored data is simply the digital equivalent of physical objects and relationships. In this way, a huge variety of object types can be stored in Hadoop, providing valuable information for future analysis.
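The contrast between transforming data up front and keeping it in its original form can be sketched as follows. This is an illustration only, not Hadoop code: the sample records, the field names, and the functions etl_to_fixed_schema and read_raw are assumptions made for the example. Interpreting raw records only when they are read is often called schema-on-read.

```python
import json

# Hypothetical raw machine events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"machine": "press-1", "temp_c": 71.5, "vibration": 0.02, "operator": "A"}',
    '{"machine": "press-2", "temp_c": 69.0, "image_ref": "img-001"}',
]


def etl_to_fixed_schema(lines, columns=("machine", "temp_c")):
    """ETL style: force every record into fixed columns; other fields are lost."""
    return [tuple(json.loads(line).get(col) for col in columns) for line in lines]


def read_raw(lines, field):
    """Raw style: decide at read time which field a new analysis needs."""
    return [rec[field] for rec in map(json.loads, lines) if field in rec]


print(etl_to_fixed_schema(RAW_EVENTS))   # vibration and image_ref are gone
print(read_raw(RAW_EVENTS, "vibration"))  # still answerable from the raw data
```

A question about vibration that nobody anticipated when the table was designed can still be answered from the raw records, but not from the transformed table.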
The size of the data collection is also an important point, because a single technology provider is hardly in a position to analyze the problems and opportunities of these data volumes on its own. The Hadoop ecosystem is built as an open source community. This means that many people work together on challenges that one provider would otherwise have to face alone. With the Hadoop architecture in particular, the solution grows with the data volume and thus with the challenge: the number of nodes simply increases in what is known as a horizontal scale-out process. These nodes can be simple standard hardware. They do not have to be part of an integrated solution from a single manufacturer.
Hortonworks is very committed to this and leads a community in the Hadoop area, packaging Hadoop in bundles that we distribute and support. The infrastructure and suitable open source tools for data acquisition and data processing are also part of these bundles. In this way, security, manageability and governance meet the expectations that companies have today.
How can this technology help in the manufacturing sector?
Kreisa: Advanced manufacturing processes have evolved into an incredibly complex set of interactions both within a single manufacturer and across the supply chain. These interactions are increasingly dependent on timely information to ensure production quality and supply chain efficiency. More and more production tools and equipment are becoming interconnected, generating a variety of data types, including images, audio, infrared and three-dimensional lidar arrays, that do not fit well into a relational database or traditional analytical systems. At the same time, data from these sources is often merged with other data types. Having everything in a single data pool greatly simplifies this process. Cyber-physical models attempt to capture this complexity, but often run into the limits of the underlying technology. There is far more data, in far more types, representing increasingly complex relationships, and it needs to be analyzed in less time. As Hadoop-based big data solutions are designed precisely for this situation, they are increasingly being used in production and manufacturing.
Can you give a specific example of how they are used in production?
Kreisa: Good examples are predictive maintenance and industrial control systems. Until now, analytical systems have collected and processed industrial control data in periodic aggregations. However, a production process becomes more efficient when it works with a data stream instead of a sequence of periodic states. Instead of having to wait for production data to be analyzed, real-time analysis of control data with tools from the Hadoop ecosystem such as Apache Storm and Apache Kafka enables rapid detection and resolution of machine tolerance problems, saves resources, avoids production downtime and reduces maintenance costs through predictive failure analysis.
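The difference between periodic aggregation and stream processing can be shown with a small sketch in plain Python. It does not use the Storm or Kafka APIs; the sensor readings, the tolerance limits, and the functions periodic_report and stream_alerts are hypothetical and serve only to illustrate the principle.

```python
def periodic_report(readings, window):
    """Batch style: one average per window, available only once the window is full."""
    chunks = [readings[i:i + window] for i in range(0, len(readings), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def stream_alerts(readings, low, high):
    """Stream style: check each reading as it arrives and report it immediately."""
    for position, value in enumerate(readings):
        if not low <= value <= high:
            yield position, value


# Hypothetical sensor values with one out-of-tolerance spike at position 3.
READINGS = [10.0, 10.1, 9.9, 12.5, 10.0, 10.1]

print(periodic_report(READINGS, window=3))
print(list(stream_alerts(READINGS, low=9.0, high=11.0)))
```

With these values, both window averages stay inside the assumed tolerance band of 9.0 to 11.0, so the periodic report hides the spike, while the per-reading check flags it the moment it occurs.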
Or take supply chain risk management, warehousing for just-in-time manufacturing, logistics and routing optimization. Supply, warehousing and logistics are certainly not new topics for producers, but the complexity of the end-to-end system has increased significantly. A shortage of a single, small component can trigger major domino effects that propagate through the supply chain. The bullwhip effect is becoming increasingly evident with the shift to just-in-time warehousing, global sourcing and increased component complexity. The deep insight into production data provided by Hadoop-based systems enables an understanding of these complex effects, which helps with planning and risk management. In addition, the growing stock of availability data from across the supply chain creates another important data flow that needs to be analyzed in order to carry out planning in real time.
Are there any particular challenges in the German manufacturing market compared to other geographical regions?
Kreisa: Big data solutions are characterized by the fact that they provide insights that cannot be obtained from separate, siloed data sets. Data protection and other regulations in the EU and especially in Germany have traditionally caused some hesitation in creating large, centralized data repositories.
Hortonworks and other providers operating in the open source community have responded to this. They offer enterprise-class security, data provenance and governance in their big data solutions. Given the way big data works, this is important. If policies and processes are too restrictive, you end up with the same siloed architecture that prevented innovation in the first place. If the restrictions are handled too loosely, there is a compliance risk for the company. New systems and concepts were needed to ensure granular access control in the dataset. At the same time, the dataset had to be easy to administer and scalable in terms of both the size and complexity of the data.
The challenges faced by the German manufacturing sector stem from some of its greatest strengths. The focus on quality and precision with which it has achieved its leading position sometimes leads to risk aversion among manufacturers. However, big data analytics requires an exploratory approach. This creates a challenge for German manufacturers: to maintain their culture of precision while embracing constant experimentation and rapid implementation.











