MapReduce in NCache allows developers to process huge amounts of unstructured data in parallel across an NCache cluster. To distribute input data and analyze it in parallel, MapReduce operates in parallel on all nodes in a cluster of any size.
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. The term “MapReduce” refers to two distinct phases. The first phase is ‘Map’ phase, which takes a set of data and converts it into another set of data, where individual items are broken down into key-value pairs. The second phase is ‘Reduce’ phase, which takes output from ‘Map’ as an input and reduces that data set into a smaller and more meaningful data set.
A user defined Mapper processes a key-value pair to generate a set of intermediate key-value pairs. Reducer processes all those intermediate key-value pairs (having same intermediate key) to aggregate, perform calculations or any other operation on the pairs. Another optional component, Combiner, performs merging of the intermediate key-value pairs generated by Mapper before these key-value pairs can be sent over to the Reducer.
The following example illustrates a MapReduce task (with and without combiner) being executed over a cluster of three nodes. The task takes orders as an input to the Mapper and extracts count of products consumed in it. In figure 1, Mapper’s output is directly sent to the reducer and is being aggregated on Reducer’s node whereas in figure 2, count over a single node is aggregated first and this aggregated count is sent to the Reducer node for final aggregation.
MapReduce without Combiner:
MapReduce with Combiner:
How does MapReduce Work?
Generally, MapReduce consists of two (sometimes three) phases: i.e. Mapping, Combining (optional) and Reducing.
- Mapping phase: Filters and prepares the input for the next phase that may be Combining or Reducing.
- Reduction phase: Takes care of the aggregation and compilation of the final result.
- Combining phase: Responsible for reduction local to the node, before sending the input to the Reducers. Combine phase optimizes performance as it minimizes the network traffic between Mapper and Reducers by sending the output to the Reducer in chunks.
Similarly, NCache MapReduce has three phases: Map, Combine, and Reduce. Only the Mapper is necessary to implement, Reducer and Combiner implementations are optional. NCache MapReduce will execute its default reducer if the user does not implement Reducer. Default reducer merges output omitted by Mapper into an array.
The Mapper, Combiner and Reducer are executed simultaneously during an NCache MapReduce task on the NCache cluster. Mapper output is individually sent to the Combiner. When Combiner’s output reaches the specified chunk size, it is then sent to the Reducer, which finalizes and persists the output.
In order to monitor the submitted task, a traceable object is provided to the user.
Number of tasks to be executed simultaneously and Mapper’s output chunk is configurable. Mapper’s output is sent to combiner or reducer once output chunk reaches the configured chunk size. See NCache Administrator’s Guide.
A typical MapReduce task has the following components:
Mapper: Processes the initial input and enables user to emit the output into a dictionary to be used as an input for the combiner or reducer.
Combiner Factory: creates and manages combiners for each key emitted into output by the mapper.
Combiner: Works as local reducer to the node where Mapper’s output is combined to minimize traffic between Mapper and Reducer.
Reducer Factory: create and manages reducers for each key emitted into output by the mapper or combiner.
Reducer: Processes all those intermediate key-value pairs generated by Mapper or combined by Combiner to aggregate, perform calculations or apply different operations to produce the reduced output.
Key Filter: Key Filter, as the name indicates, allows the user to filter cache data based on its keys before sent to the Mapper. The KeyFilter is called during Mapper phase. If it returns true, the Map will be executed on the key. If it returns false, Mapper will skip the key and move to next one from the Cache.
TrackerTask: This component lets you keep track of the progress of the task and its status as the task is executed. And lets you fetch the output of the task and enumerate it.
Output: The output is stored in-memory, on the server side. It can be enumerated using the TrackableTask instance on the client application.