There are challenges on several levels within big data analytics, illustrated in the following figure. All of them must be addressed, together, in order to enable end users to successfully perform analysis of massive data: (i) the hardware and platform level with the capacity to collect, store, and process the necessary volumes of data in real time, (ii) machine learning algorithms to model and analyse the collected data, and (iii) high level tools and functionality to access the results and to allow exploring and visualizing both the data and the models.
Challenge 1. To develop a computation platform suitable for machine learning of massive streaming and distributed data.
One of the important characteristics of Big Data is that it is often streaming or at least constantly updated. It typically originates from a large number of distributed sources, and is, like most real world data, inherently noisy, vague or uncertain. At the same time, due to sheer size, a scalable framework for efficient processing is needed to adequately take advantage of it. However, today’s Big Data platforms are not well adapted to the specific needs of machine learning algorithms:
- Current platforms lack functionality suitable for analysing real-time, streaming and distributed data.
- Machine learning requires storing and updating an internal model of the data. Current platforms lack suitable support for stateful computing.
- The advanced processing in machine learning requires a more flexible computational structure, for example support for iteration, than the map-reduce paradigm of big data platforms provides.
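To make the notion of stateful computing over streaming and distributed data concrete, the following is a minimal sketch, not a description of any platform used in the project. It assumes a hypothetical `RunningStats` class that keeps a constant-size state (count, mean, and sum of squared deviations), updates it one item at a time with Welford's online algorithm, and merges partial states from separate sources with the standard pairwise formula.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical constant-size model state: count, mean, and M2
    (the sum of squared deviations from the mean)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's online update: one pass, no examples are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine the states of two independent sources without
        # revisiting their data (pairwise mean/variance combination).
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n > 0 else 0.0


# Two distributed sources, each processing its own stream.
source_a, source_b = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    source_a.update(x)
for x in [5.0, 6.0, 7.0, 8.0]:
    source_b.update(x)

merged = source_a.merge(source_b)
```

The point of the sketch is that the state lives between items and can be combined across sources, which is the kind of support the list above says current platforms lack.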
Challenge 2. To develop machine learning algorithms suitable for handling both the opportunities and challenges with massive, distributed, and streaming data produced in society.
Much recent research in machine learning, as a means to automatically sift through large amounts of information, model it, and draw conclusions, is motivated by Big Data. Traditional machine learning algorithms are, however, suited neither to the opportunities nor to the challenges that come with massive, distributed, and streaming data:
- Many machine learning methods are designed for small training sets and try to squeeze the maximum out of them, usually by iterating over the examples many times. They then use cross-validation schemes to evaluate the methods on, again, limited amounts of data. With larger, especially streaming, data, there should be no need to iterate over the same examples for training and validation.
- A large class of successful machine learning algorithms is sample based (e.g., kernel density estimators and support vector machines), meaning that the model increases in size as more data arrives. This can quickly become infeasible, so there is a need for more compact models that capture the essence of the data without uncontrolled growth in size.
- Most machine learning approaches assume a fixed training set, or possibly a batch-wise updated scenario, where training and usage can be separated. When new data arrives continuously, and the underlying reality changes constantly, the models need to gradually adapt. When the knowledge is to be used by many users, or by users with varying interests, a single model is not enough. There is a need for methods that can learn many models at the same time, each capturing different aspects of the data, and combine them in flexible ways to provide up-to-date, relevant knowledge.
- Data that comes from different sources and is of different types raises several uncertainty issues, such as the validity, precision, and bias of those sources. This again changes the analytics task in a qualitative way, and calls for principled methods to handle all those aspects throughout the full course of data processing. Machine learning algorithms need to take this uncertainty into account when creating models, and must also be capable of propagating it into their results.
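Several of the points above (single-pass training and validation, a model of fixed size, and gradual adaptation) can be illustrated with a small sketch. It is an illustration under stated assumptions, not one of the project's algorithms: `ForgettingModel` and `prequential_mae` are hypothetical names, the model is simply an exponentially weighted mean and variance, and evaluation follows the test-then-train (prequential) scheme, in which each example is first used for testing and then for training, exactly once.

```python
class ForgettingModel:
    """Hypothetical fixed-size model: exponentially weighted mean and
    variance. Older data is gradually forgotten at rate alpha."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.mean = None   # no estimate before the first example
        self.var = 0.0     # a simple measure of uncertainty

    def predict(self):
        # Returns None until at least one example has been seen.
        return self.mean

    def update(self, x: float) -> None:
        if self.mean is None:
            self.mean = x
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        # Incremental exponentially weighted variance.
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


def prequential_mae(stream, model) -> float:
    """Test-then-train: predict each item before learning from it."""
    total, count = 0.0, 0
    for x in stream:
        prediction = model.predict()
        if prediction is not None:
            total += abs(x - prediction)
            count += 1
        model.update(x)
    return total / count if count else 0.0


# A stream whose underlying level changes abruptly halfway through.
stream = [0.0] * 50 + [10.0] * 50
model = ForgettingModel(alpha=0.2)
mae = prequential_mae(stream, model)
```

After the change in the stream the model's mean moves towards the new level, while its size stays constant regardless of how many examples arrive. The variance estimate is one simple quantity that a downstream consumer could use as an indication of uncertainty.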
Challenge 3. To provide analytics methodology and high-level interactive functionality, to make the value in massive data more easily available to end users.
Big Data Analytics is capable of highlighting interesting aspects and discovering things of which users are completely unaware: detecting deviations, anomalies and trends, analysing key values, relations and co-occurrences, as well as making predictions. A crucial aspect of Big Data Analytics is enabling end users to use machine learning solutions more efficiently. On the one hand, unrealistic expectations have to be addressed by clearer presentation of the models, their quality, and the limits of their applicability. On the other hand, the full capabilities of machine learning in the Big Data context need to be made available to those who can benefit from them the most. The solution is a combination of raising the abstraction level of machine learning algorithms, increasing interactivity, using better visualisation techniques, and engaging end users in the whole data analytics cycle.
- Traditional software and hardware layers for big data analytics lack important services that the human cognitive system needs in processing complex information, and how to make machine learning meet the demands of big data analysts remains an open problem.
- Despite extensive research on how interactive visualization can facilitate the understanding of machine learning algorithms, there is a lack of results indicating the effectiveness of the different visualization techniques in terms of how well they are perceived and understood by the users and how the visualization influences analytical reasoning.
- Most state-of-the-art visual analytics (VA) tools and techniques do not properly accommodate big data. A major challenge is to make visual analysis methods work incrementally and to improve their real-time analytical capability. How to handle streaming data and how to cooperate with approximate machine learning algorithms are open research questions.
- By default, ML algorithms today aim to offer users as much flexibility as possible, based on the assumption that those users are themselves experts in the field. There is a need to develop a high-level, abstract presentation layer on top of such implementations, one that will allow domain experts to use ML solutions efficiently.
To achieve the above objectives, the work plan contains five work packages:
The work packages WP1-WP3 correspond to the three challenges described above. In WP4, the methods developed and implemented in WP1-WP3 will be evaluated on real scenarios, each involving typical aspects of big data, and to which the BIDAF consortium has access through its partners. WP5 contains the administration, management, steering, collaboration, and dissemination activities in the project.