MapReduce

You can configure MapReduce for processing and generating large data sets with a parallel, distributed algorithm on a cluster. To distribute input data and analyze it in parallel, MapReduce operates in parallel on all nodes in a cluster of any size. The term “MapReduce" refers to two distinct phases. The first phase is ‘Map’ phase, which takes a set of data and converts it into another set of data, where individual items are broken down into key-value pairs. The second phase is ‘Reduce’ phase, which takes output from ‘Map’ as an input and reduces that data set into a smaller and more meaningful data set.

For more detailed explanation of MapReduce please see the MapReduce section.

Configure MapReduce

You can manually edit config.ncconf and client.ncconf located at %NCHOME%/config. %NCHOME% is NCache install directory. These files can be edited through any text editor of your choice.

Step 1:

In case you want to update any cache settings please refer to the Update Cache Config section to follow the set of steps.

Step 2:

MapReduce can be configured through config.ncconf of each server by adding the <tasks-config> tag under the <cache-settings> tag:

<cache-config cache-name="demoClusteredCache">
    <cache-settings ...>
    ...
    <tasks-config max-tasks="10" chunk-size="100" communicate-stats="False"
                  queue-size="10" max-avoidable-exceptions="10"/>
    ...
  </cache-settings>
</cache-config>

The tags in the configuration file are explained below.

max-tasks: In order to provide maximum scalability, the MapReduce job to be performed on data sets are divided into smaller subsets of tasks which can be performed simultaneously. max-tasks is the maximum number of MapReduce tasks to be executed simultaneously. It can be changed according to your requirements. Its default value is 10.

chunk-size: The tasks are processed and stored in bulk before being sent to the Reducer, meaning the data from Mapper is processed in chunks, the size of which is configurable. The Chunk-Size is the number of tasks processed in the Mapper and Combiner - before transmitting to Combiner or Reducer. When Combiner’s output reaches the specified chunk size, it is then sent to the Reducer, which finalizes and persists the output. The default value for chunk-size is 100.

queue-size: The tasks that are performed are lined up in a queue in which the tasks wait for the former tasks to be performed before they can be processed. queue-size is the maximum number of tasks that can wait in queue before they are processed due to the other tasks being processed. The default value is 10.

communicate-stats: Used for the MapReduce tasks to communicate the statistics internally. It is set false by default.

max-avoidable-exceptions: In case you expect exceptions to be thrown during task execution, you can specify the number of exceptions to be avoided from your code, after which the task is failed and logged in the cache error log. The default value is 10.

Note

Repeat this step for all server nodes.

Step 3:

Create a new directory named 'deploy' in %NCHOME%.

Note

Repeat this step for all server nodes.

Step 4:

Create a new directory in %NCHOME%/deploy and name it with the cache name where you want your assemblies to be deployed, for example demoClusteredCache.

Note

Repeat this step for all server nodes.

Step 5:

Place the assembly MapReduce.dll for the implementation of MapReduce in this folder.

The assembly will be placed at "C:\Program Files\NCache\deploy\demoClusteredCache".

Note

Repeat this step for all server nodes.

MapReduce

Configure MapReduce

Note

Note

Note

Note

See Also