Data volumes are growing exponentially, and have been since the rise of the Internet at the latest. But a great deal of data is also being collected beyond the Internet and its appetite for data: according to a McKinsey study, a fully occupied airplane generates an estimated 200 terabytes or more of potentially exploitable data on a single one-way flight. Contrary to what the term suggests, however, data volume is by no means the only concern of the Big Data movement. The established arsenal of data analysis, usually summarized under the term Business Intelligence (BI), requires strongly pre-structured, well-designed data models, and building them is a time-consuming process. This is one reason why such projects are very often overtaken by reality: in surveys, end users name the slow response to structural changes and new requirements as the main problem in their BI initiatives, alongside the "evergreen" of query speed.
The Big Data approach therefore differs from conventional approaches in that it handles polystructured data flexibly. In addition to classic structured data, such as that generated by an internal ERP system, it also draws on semi-structured documents, such as those based on HTML or XML, and on completely unstructured documents. Blogs are a good example of the latter: many brand manufacturers try to find out how often and in what context their products are mentioned in blogs and forums. This example illustrates well the problems that Big Data is trying to solve:
First: There is a lot of data, and it is either unstructured or its structure lies outside the company's sphere of influence and can change dynamically.
Second: Evaluating it methodically requires completely new analysis technologies. In this specific case, these are algorithms that recognize the context (is it a complaint, a commendation or a comparison with a competing product?) in all major world languages.
Third: This requires not only additional technology, but also a combination of business understanding and a deep understanding of what is technologically feasible.
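The first point can be made concrete with a minimal Python sketch. It assumes three toy inputs, one per structure type, and a hypothetical brand name "Acme"; the function names and sample texts are illustrative only and do not come from any particular product. The sketch merely counts mentions. Recognizing the context of a mention, as described in the second point, would need far more sophisticated language analysis.

```python
import re
import xml.etree.ElementTree as ET

BRAND = "Acme"  # hypothetical brand name, for illustration only


def text_from_structured(row):
    """Structured data: a fixed, known field holds the text."""
    return row["comment"]


def text_from_semi_structured(xml_string):
    """Semi-structured data: collect all text, whatever the tag layout is."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext())


def text_from_unstructured(raw):
    """Unstructured data: the raw text is all there is."""
    return raw


def count_mentions(text, brand=BRAND):
    """Count case-insensitive whole-word occurrences of the brand name."""
    pattern = rf"\b{re.escape(brand)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Toy inputs (assumed): an ERP-style record, an XML post, a free-text blog entry.
sources = [
    text_from_structured({"order_id": 17, "comment": "Acme blender arrived late"}),
    text_from_semi_structured(
        "<post><title>Acme vs. rivals</title><body>I prefer acme.</body></post>"
    ),
    text_from_unstructured("Bought a blender yesterday. Not an Acme, sadly."),
]

total = sum(count_mentions(text) for text in sources)
print(total)  # 4
```

Each source type gets its own small extraction step, while the counting logic is shared. If the XML layout changes, only the extraction step is affected, which is the kind of flexibility the polystructured approach is after.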
Some industry analysts see this combination of business and technical understanding as the basis of a new profession, the "data scientist". Hotline operators are already experimenting with speech recognition algorithms that infer the context from the tone of voice and can thus distinguish an emotionally charged complaint from a routine inquiry. If such technologies succeed, for example, in detecting changes in customer behavior much earlier, they undoubtedly create a competitive advantage. Scenarios of this kind also carry an inherent uncertainty: in contrast to BI systems based on internal data, both the aggregation and the analysis must work with partially incomplete data and with probabilities, a situation for which conventional BI and data warehouse (DWH) solutions are wholly unprepared today. The well-known IT analyst Wolfgang Martin sums it up pointedly: "The single point of truth goes swimming."
So how is Big Data defined? A uniform, globally recognised definition has not yet been established. One reason is that all major vendors of software, hardware and appliance solutions in the "conventional" business intelligence sector are trying to ride the Big Data wave and push their own definitions. In the German-speaking world, the definition of the BARC Institute in Würzburg is the most popular: "Big Data refers to methods and technologies for the highly scalable collection, storage and analysis of polystructured data."
It will also be exciting to see how the market is divided up among pure software and hardware suppliers, appliance suppliers and specialised service providers. In blog analysis, for example, some providers bundle everything into a service that can be purchased externally and supply their customers with ready-made analyses. Given the complexity of the technology and the high degree of specialisation, this model is an interesting alternative in various areas, even if it means forgoing the competitive advantage that a particularly clever in-house implementation could bring. In other areas, Big Data approaches will in most cases tend to evolve the classic BI architectures or supplement them in individual functional areas. Initial experience shows that the complexity by no means lies only in the technology: the demands on the users are rising just as much as those on the analysis tools in front of them.
As companies move into new functional areas of data analysis with Big Data, often as pioneers, a classic problem that has accompanied the business intelligence industry since its inception remains: demonstrating the ROI is no trivial task and usually has to rest on speculative assumptions. The discounted cash flow of a decision that was made earlier thanks to Big Data, or made possible by it in the first place, is difficult to calculate reliably. Uncertainty about the return therefore accompanies Big Data initiatives from the very beginning.
So is Big Data a new type of software that completely replaces previous investments in business intelligence and data warehouses? Certainly not in such absolute terms. The established business intelligence vendors are currently extending their platforms to make them better suited to Big Data scenarios. Nevertheless, alternative architectures are emerging, some of them from the open source world, where an innovative scene has formed around Big Data. In the area of so-called NoSQL databases ("not only SQL"), CouchDB and MongoDB are often cited as examples.
Hadoop, named by its inventor Doug Cutting after his son's favourite yellow toy elephant, is currently experiencing real hype. Hadoop is based on the so-called MapReduce programming model, which supports the massively parallel processing of large amounts of data and was popularized by Google. The idea behind it is simple: break the task down into its smallest parts, distribute them to as many computers as possible for massively parallel processing (map), and then combine the partial results into the final result (reduce). The hope is that this makes it possible to analyze very large, unstructured amounts of data with a manageable investment in hardware. The work is done as batch processing, which sets a counterpoint to the in-memory databases that are becoming increasingly popular in the classic business intelligence sector. Hadoop is an open source framework written in Java that major vendors such as Microsoft, IBM and SAS are increasingly integrating into or supporting in their own solutions. In addition, Hadoop is now offered by various commercial distributors together with support and related services, which is accelerating its spread into the commercial sector. Hadoop is by no means an "out of the box" solution: the quality of the analyses stands or falls with the complex algorithms that have to be developed for each subject matter.
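The map and reduce steps described above can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not Hadoop's actual API: the function names are assumptions chosen for this example, the sample posts are invented, and a thread pool merely stands in for the many machines of a real cluster. The example counts words across a handful of short blog posts.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key across the outputs of the map step."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents, workers=4):
    # Threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Invented sample posts, for illustration only.
posts = ["great phone great camera", "phone battery weak", "camera great"]
counts = map_reduce(posts)
print(counts)
# {'great': 3, 'phone': 2, 'camera': 2, 'battery': 1, 'weak': 1}
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` at only one key, both steps can be spread across as many workers as are available. The subject-specific logic lives entirely in those two functions, which is where the per-topic development effort mentioned above arises.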