OpenSource Big Data has just been kick-started with the publication of version 1.0 of Apache SPARK. Very quickly, this new framework has gathered a large number of fans – including MRMS and Cloudera – because it adds to HADOOP everything that it was missing: Speed, ease of programming and flexibility.
HADOOP is a distributed data processing model that has become a standard. In this model, processing is performed in parallel on server farms via a system which automatically organises the distribution of the data, fault tolerance and the distribution of the processing. It therefore enables large volumes of data of any type to be handled – both structured and unstructured – with a guarantee of availability and linearity of performance.
Above this model, a whole ecosystem is laid out, made up of applications specialised in processing data: development tools, database engines, statistics libraries, rules engines, machine learning, graph analysis, visualisation, etc.
Hadoop, designed to handle large volumes on disks is little suited for real time processing and interactivity. To overcome this limitation, a number of tools and applications have been enriched with mechanisms for retaining data in memory and bypassing the model’s mechanisms, thereby allowing better performance to be obtained.
What makes the SPARK project so interesting is that taking the general principles of HADOOP, it has managed to transpose them for use in memory.
SPARK allows data sets to be loaded in memory and made persistent, distributed, fault-tolerant and shareable. SPARK seamlessly fits into HADOOP, accesses its storage spaces and is accessible via its development and management tools.
SPARK adds an “In-Memory” layer available simply for applications without deteriorating the quality of service offered by the model. Applications will be able to choose to work on data on disks, on data in memory for complex calculations, for real-time processing or for interactivity.
Besides the performance it provides, SPARK also demonstrates the dynamism, the extraordinary responsiveness and the inventiveness of the HADOOP ecosystem. It opens up a vast field for innovation in calculations and numerical analysis. It demonstrates how important it is to get on board this train which has suddenly sped up even more.