In 2004 Google published a paper describing the data processing model it used. The model is built around a pair of simple functions, Map and Reduce. The first function, Map, takes key/value pairs from the input data (which is itself a set of key/value pairs) and emits a set of intermediate key/value pairs; the second groups these pairs by key and combines them, typically producing fewer pairs than it received.
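A minimal sketch of the two functions, using the classic word-count illustration (the function names and the single-process driver are assumptions for demonstration; a real cluster distributes the same steps across machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused here), value: one line of text;
    # emits an intermediate (word, 1) pair for every word
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word;
    # combines many intermediate pairs into a single output pair
    yield key, sum(values)

def run(pairs):
    # Group intermediate pairs by key, then reduce each group.
    intermediate = defaultdict(list)
    for k, v in pairs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    result = {}
    for ik, ivs in intermediate.items():
        for ok, ov in reduce_fn(ik, ivs):
            result[ok] = ov
    return result

print(run([("doc1", "to be or not to be")]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Note how the reduce stage produces four pairs from six intermediate ones, matching the observation above that the output is usually smaller than the input to Reduce.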
An additional element is the distributed file system GoogleFS, thanks to which the file being processed and all intermediate data are easily accessible from any computer in the cluster. Since the whole processing architecture consists of small functions, the work is easily parallelized across the cluster; splitting the job into separate pieces and recovering after a failure also become simpler. Using the distributed file system, the data is divided into small chunks, and a separate cluster node works on each of them.
The same idea also appears under the name 'Split/Aggregate'. The essence is that the input data, whatever its size, is divided into separate units (the 'split' stage), for example line by line, with each line treated as a separate value for processing. These blocks of lines are distributed across the cluster, where the 'map' function is called for each line. The results are then combined ('reduce'/'aggregate') into an output file. If necessary, the data is stored in a certain order, for example sorted by a key.
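The split/aggregate flow above can be sketched as follows (the helper names and the tiny per-line transform are assumptions; on a real cluster each chunk would be processed on a different machine):

```python
def split(text, chunk_size=2):
    # 'split' stage: divide the input line by line into small blocks
    lines = text.splitlines()
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def process_chunk(lines):
    # 'map' stage: called once per line in the chunk;
    # here we emit (length, line) pairs as a stand-in transform
    return [(len(line), line) for line in lines]

def aggregate(chunk_results):
    # 'reduce/aggregate' stage: merge the partial results of every
    # chunk into one output, stored in key order
    merged = [pair for chunk in chunk_results for pair in chunk]
    return sorted(merged)

text = "alpha\nbe\ngamma\ndelta"
chunks = split(text)
print(aggregate(process_chunk(c) for c in chunks))
# [(2, 'be'), (5, 'alpha'), (5, 'delta'), (5, 'gamma')]
```

Because each chunk is processed independently, a failed chunk can simply be re-run on another node without restarting the whole job, which is the fault-tolerance property mentioned above.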
The method is well suited to long-running processing of large volumes of data, when no particular processing order is required and results are not needed as new data becomes available. In other words, if several gigabytes of data are available up front and only the final result matters, it is an excellent option. But if those same gigabytes arrive continuously in parts and must be sent for processing immediately, with results expected as soon as possible, another method is more convenient than MapReduce and Hadoop.