Big Data
Real-time analysis
If fast response times are required for data analysis, a cloud solution can reach its limits. An alternative is supercomputer technology: In combination with customized analysis software, it promises analyses in real time.
The hype surrounding big data is huge: merely acquiring a big data analytics solution is often seen as a panacea for business intelligence and ROI, but there are many other factors to consider. Despite all the discussion, at least one thing is certain: as far as the impact of big data is concerned, we are currently seeing only the tip of the iceberg. Big data is the basis of digitalization and will consequently affect more and more areas in the future. Both IT and business must therefore find ways to exploit its potential: turning data into information, and information in turn into knowledge and added value.
After all, what is the point of having huge amounts of data if it is not put to good use and companies can only make appropriate business decisions far too late? This is why the time factor is crucial. In many cases, data must be processed within a very short time for decisions to be profitable at all. Often, analysis in real time is required.
Fundamental problems
Big data analyses are difficult in general. The data volumes are immense, and the data itself is extremely diverse because it comes in every conceivable format. Whether in terms of the size of the data sets, their scope or their complexity, big data analytics is growing almost explosively. This poses additional problems for companies that are already struggling with the unchecked proliferation of clusters, the flood of new applications and the ever-increasing need for faster insights. What's more, technological development in the world of big data is anything but standing still. Technologies such as Spark, Hadoop and graph databases are now ubiquitous in many industries, and innovative approaches such as deep learning and machine learning are also on the rise.
Against this backdrop, solutions are needed that make mountains of data quickly comprehensible and that can be successfully applied in a scalable environment. In addition, a correspondingly high level of computing power is required, which conventional computing architectures are generally unable to deliver.

Fusion of software and hardware
To address these problems, Cray has developed 'Urika-GX', a new agile big data analytics platform designed to tackle the biggest big data challenges despite ever-increasing data volumes, growing complexity and a growing number of application areas. To this end, the characteristics of a supercomputer, namely enormous computing speed, scalability and throughput, were combined with standardized enterprise hardware and an open source software environment (OpenStack for data management and Apache Mesos for dynamic configuration), which ultimately gives the user more convenience and flexibility. In contrast to the often-cited 'shadow IT', in which different cluster architectures are used for different workloads and therefore make applications hard to integrate, the focus here is on uniform, open industry standards. This makes it much easier to integrate new analytics tools.
Image: The 'Urika-GX' system has pre-integrated industry-standard software for easy implementation during operation. (© Cray)
The hardware appliance is designed for demanding analysis workloads and allows multiple analysis tasks, be it Hadoop, Apache Spark or graph, to be executed simultaneously on a single platform. Because even very extensive and complex graph analyses are possible, users have a powerful tool at their disposal to quickly gain insights into large volumes of unstructured data.
The Aries interconnect chip
Image: On the hardware side, the system has Intel Xeon Broadwell cores, 22 TByte RAM and 35 TByte local SSD storage as well as the Aries interconnect chip. (© Cray)
How can this be realized? It is made possible by components that are already in successful use in the 'Cray XC' supercomputers, including the Aries interconnect chip. This high-speed internal network is a distributed interconnect designed for low latency, high bandwidth and high messaging rates. As a result, network-dependent workloads such as Spark or graph-based analyses run faster, because data packets can be fed in continuously ('in-flight') without waiting for a response. The term refers to the network's ability to keep a very large number of data packets active at the same time. This is a prerequisite for so-called 'one-way' communication, in which the sender no longer waits for an acknowledgement from the recipient before sending the next data packet, so that different communication streams can overlap. The result is a very high rate of small data packets on the network.
The Aries interconnect replaces Ethernet or InfiniBand connections between nodes. This eliminates the need to build a separate network fabric between individual nodes, which would otherwise consume time, support effort and capital.
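The effect of 'one-way' communication on message rates can be illustrated with a deliberately simplified model. The sketch below is not a description of how Aries works internally; the function name, the latency figures and the window sizes are assumptions chosen only for illustration. It compares a sender that waits for every acknowledgement with one that may keep many packets in flight.

```python
def message_rate(one_way_latency_s, injection_time_s, max_in_flight):
    """Estimate messages per second for a single sender (toy model).

    one_way_latency_s: time for a packet to cross the network (assumed value)
    injection_time_s:  time the sender needs to put one packet on the wire
    max_in_flight:     packets the sender may have outstanding without an ack
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    # Time until the acknowledgement for a packet is back at the sender.
    round_trip_s = injection_time_s + 2 * one_way_latency_s
    # With a window of N packets, at most N packets complete per round trip.
    window_limited = max_in_flight / round_trip_s
    # The sender can never inject faster than one packet per injection time.
    injection_limited = 1.0 / injection_time_s
    return min(window_limited, injection_limited)


if __name__ == "__main__":
    # Hypothetical figures: 1 microsecond latency, 0.1 microsecond injection.
    latency, injection = 1e-6, 1e-7
    for window in (1, 8, 64):
        rate = message_rate(latency, injection, window)
        print(f"in flight: {window:3d}  ->  {rate:,.0f} messages/s")
```

In this model, a sender that waits for each acknowledgement (window of 1) is bound by the round-trip time, while a large enough window lets the rate climb until only the injection time limits it. This is the overlap of communication streams described above, reduced to its arithmetic.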
Graph analyses in database
Once the large amount of unstructured data has been brought 'into shape', graph analyses come into play. They are a particular strength of the new platform. Graph databases are still the fastest-growing type of database. One reason for their increasing popularity is the realization that they can map relationships between entities much better than relational databases. Graph databases can be used to recognize patterns and relationships between individual variables, which is often very difficult or even impossible with relational databases.
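Why relationships are easier to follow in a graph can be shown with a small, generic sketch. It uses plain Python rather than the Cray Graph Engine, and the entity names and the helper functions `build_graph` and `within_hops` are hypothetical. In a relational schema, following a chain of relationships typically requires one self-join per step; in a graph it is a simple traversal.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Return {entity: distance} for everything reachable within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {n: d for n, d in distance.items() if n != start}


if __name__ == "__main__":
    # Hypothetical data: people linked through shared accounts.
    edges = [
        ("alice", "account-1"),
        ("account-1", "bob"),
        ("bob", "account-2"),
        ("account-2", "carol"),
    ]
    graph = build_graph(edges)
    print(within_hops(graph, "alice", 2))
    print(within_hops(graph, "alice", 4))
```

With a depth of 2 the traversal links alice to bob through a shared account; raising the depth to 4 also reaches carol, without changing the query structure.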
While graph analyses have long been considered one of the most difficult tasks for modern analytics systems in terms of scaling and performance, they can now be performed up to 100 times faster thanks to state-of-the-art technology. In the case described here, the 'Cray Graph Engine' takes over the calculations and enables the fast, complex, iterative deep search that is required. In this environment, it is important that every scenario, from a single processor to thousands of processors, is supported without any loss of performance. Another important factor is the ability to process data sets several terabytes in size without unnecessary data movement.
The graph engine can be used to recognize new patterns within data, make correlations between data points and then formulate corresponding hypotheses. And the analytics workflows on which these hypotheses are based can be run in parallel to compare results in real time and flexibly adapt workflows depending on the outcome.
The difference from conventional cluster architectures is that calculations on this platform do not slow down as the graphs become larger. With traditional clusters, such slowdowns can occur even when additional compute nodes are added, and these nodes generally bring no additional performance benefit anyway.
Author:
Dominik Ulmer is Vice President EMEA Business Operations at Cray.
Application scenarios for big data analysis
Data scientists, IT departments and researchers can use graph analysis capabilities to first build and then query graphs with tens of billions of relationships, which have also been compiled from all kinds of data sources. This opens up new application possibilities for many industries:
- Graph analyses in cancer research:
In cancer research, graph analytics in particular and big data analytics as a whole are being used to analyze genomic data and genome sequencing. Here, too, one of the biggest challenges is that the medical data to be collected is very diverse and fragmented. This is precisely why a standardized platform for recording, analyzing, retrieving and querying data is so essential. The Broad Institute of the Massachusetts Institute of Technology (MIT) and Harvard, a non-profit research institute in the United States that strives for a greater understanding of diseases and progress in their treatment, was able to use the new system to cut the time needed to obtain quality score recalibration (QSR) results from the Apache Spark pipeline of its genome analysis toolkit 'GATK4' from 40 to 9 minutes.
- Predictive maintenance in manufacturing:
Big data analytics also holds enormous potential for the manufacturing industry. A prime example is predictive maintenance. This involves analyzing the data obtained from sensors and machine control systems in order to time maintenance intervals and avoid breakdowns. For this use case, a hardware appliance is advisable instead of a cloud solution for two reasons. Firstly, the latency of the cloud is too high to deliver analysis results quickly enough. Secondly, the data must first be moved to the cloud, which ties up resources and is often not recommended, especially when business-critical data needs to be protected.
- Fending off cyber attacks:
Ensuring a secure network for uninterrupted business operations is more important than ever in today's hyper-connected world. However, IT departments and security managers also face the problem of coping with the sheer volume of machine-generated data. Conventional technologies often reach their limits at this point. Cyber security is therefore another key area of application for big data analytics, and for graph databases in particular. Fast reactions are particularly important here, as otherwise a company's reputation and progress may be at stake. In order to detect cyber attacks or anomalies, hundreds of millions of log entries must be analyzed. If an attack then occurs on a company network, the company must be able to react immediately, i.e. in real time.
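To give an idea of what graph-based log analysis can look like in the cyber security scenario, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method used by any particular product: the log format (source, destination pairs), the host names, the `factor` threshold and the functions `fan_out` and `suspicious_sources` are all hypothetical. The idea is to treat each log entry as an edge in a graph and flag sources that contact unusually many distinct destinations.

```python
from collections import defaultdict
from statistics import median


def fan_out(events):
    """Count the distinct destinations contacted by each source."""
    targets = defaultdict(set)
    for source, destination in events:
        targets[source].add(destination)
    return {source: len(dests) for source, dests in targets.items()}


def suspicious_sources(events, factor=3.0):
    """Return sources whose fan-out exceeds `factor` times the median."""
    counts = fan_out(events)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(s for s, n in counts.items() if n > factor * typical)


if __name__ == "__main__":
    # Hypothetical log entries: (source host, destination host).
    events = [
        ("10.0.0.5", "db1"),
        ("10.0.0.5", "db1"),
        ("10.0.0.6", "web1"),
        ("10.0.0.6", "db1"),
        ("10.0.0.7", "web1"),
    ]
    # One host touches 20 different machines, as a scanner might.
    events += [("10.0.0.9", f"host{i}") for i in range(20)]
    print(suspicious_sources(events))
```

A real deployment would work on hundreds of millions of entries and on far richer patterns than fan-out alone, which is where the scaling properties discussed above become relevant.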













