Tuesday, March 24, 2015

Hadoop Architecture

The Map/Reduce Paradigm


How many times does this pattern occur in the data?
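The question above can be sketched as a tiny in-memory map/reduce job. This is only an illustration in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_pattern`) are hypothetical, and a single process stands in for the cluster.

```python
import re
from collections import defaultdict

def map_phase(record, pattern):
    """Emit one (key, 1) pair for every match of the pattern in a record."""
    for match in re.findall(pattern, record):
        yield (match, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for a single key."""
    return (key, sum(values))

def count_pattern(records, pattern):
    """Run map, shuffle, and reduce over a list of text records."""
    pairs = [pair for record in records for pair in map_phase(record, pattern)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    logs = ["error at node1", "ok", "error error"]
    print(count_pattern(logs, r"error"))
```

Each record is mapped independently, which is what lets the real framework spread the map phase across many machines.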


Mapping




Introduction:
  1. Each Map task outputs data in the form of Key/Value pairs.
    1. mapreduce.tasktracker.map.tasks.maximum: 8
      1. The maximum number of map tasks that will be run simultaneously by a task tracker.
    2. mapreduce.map.memory.mb: 128
      1. The amount of memory to request from the scheduler for each map task.
  2. The output is stored in a Ring Buffer rather than being written directly to the disk.
  3. When the Ring Buffer reaches 80% capacity, the content is "spilled" to disk.
    1. This process creates multiple files (shuffle spill files) on the local disk of the node running the map task.
    2. mapreduce.map.sort.spill.percent: 0.80
      1. The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
  4. Hadoop merges all the spill files produced by a map task into a single file.
    1. This single file is sorted by key and partitioned according to the number of reducers.
    2. mapreduce.task.io.sort.mb: 512
      1. The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
    3. mapreduce.task.io.sort.factor: 64
      1. The number of streams to merge at once while sorting files. This determines the number of open file handles.
    4. mapreduce.reduce.shuffle.input.buffer.percent: 0.70
      1. The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
    5. mapreduce.reduce.input.buffer.percent: 0.70
      1. The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
    6. mapreduce.reduce.shuffle.parallelcopies: 128
      1. The default number of parallel transfers run by reduce during the copy (shuffle) phase.
    7. mapreduce.reduce.memory.mb: 1024
      1. The amount of memory to request from the scheduler for each reduce task.
    8. mapreduce.reduce.shuffle.merge.percent: 0.66
      1. The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
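The map-side steps above (buffer, spill at the soft limit, merge into one sorted and partitioned file) can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hadoop's implementation: the buffer is measured in records rather than bytes, spilling happens synchronously instead of in a background thread, and `partition_for` and `run_map_side` are hypothetical names, with `partition_for` standing in for the real partitioner.

```python
import heapq

def partition_for(key, num_reducers):
    """Pick a reducer for a key (a deterministic stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers

def run_map_side(pairs, buffer_capacity, spill_percent, num_reducers):
    """Buffer map output, spill sorted runs at the soft limit, then merge them."""
    def sort_key(kv):
        # Order by partition first, then by key within the partition.
        return (partition_for(kv[0], num_reducers), kv[0])

    threshold = max(1, int(buffer_capacity * spill_percent))
    buffer, spills = [], []

    def spill():
        # Each spill is sorted before it is "written" (here: appended to a list).
        spills.append(sorted(buffer, key=sort_key))
        buffer.clear()

    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= threshold:
            spill()
    if buffer:
        spill()  # flush whatever is left when the map task finishes

    # Merge all sorted spill runs into a single sorted, partitioned sequence.
    merged = list(heapq.merge(*spills, key=sort_key))
    return spills, merged

if __name__ == "__main__":
    output = [(k, 1) for k in "bacadbeacf"]
    spills, merged = run_map_side(output, buffer_capacity=5,
                                  spill_percent=0.80, num_reducers=2)
    print(len(spills), merged)
```

In this model a larger buffer or a higher spill percentage produces fewer, larger spill runs, which is the trade-off the sort and spill parameters in the list control.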



Ring Buffer


The Ring Buffer (aka Circular Buffer) is a key concept in the MapReduce ecosystem.

Any map/reduce program has to work within three constraints:

  1. We are dealing with a massive amount of data
    1. If this isn't true, we don't need to use map/reduce
  2. The result of the map tasks cannot be constantly written to disk
    1. This would be too slow
  3. Nor can it be stored entirely within memory
    1. Most systems would not have a sufficient amount of memory

We therefore have to use a combination of disk and memory efficiently.

The circular buffer is fast. Writing to memory is much faster than performing disk I/O, and the data is flushed only when needed.
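To make the idea concrete, here is a minimal ring buffer sketch in Python. It is an illustration only: the `RingBuffer` class and its `flush` callback are hypothetical, and unlike Hadoop's buffer it flushes only when completely full rather than at a configurable soft limit.

```python
class RingBuffer:
    """Fixed-size circular buffer that hands its contents to a flush callback."""

    def __init__(self, capacity, flush):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0       # index of the oldest unflushed item
        self.count = 0      # number of unflushed items
        self.flush = flush  # called with a list of items, e.g. a disk writer

    def write(self, item):
        """Store an item in memory, draining first if the buffer is full."""
        if self.count == self.capacity:
            self.drain()
        tail = (self.head + self.count) % self.capacity  # wraps around the end
        self.slots[tail] = item
        self.count += 1

    def drain(self):
        """Pass all buffered items, oldest first, to the flush callback."""
        items = [self.slots[(self.head + i) % self.capacity]
                 for i in range(self.count)]
        self.head = (self.head + self.count) % self.capacity
        self.count = 0
        if items:
            self.flush(items)

if __name__ == "__main__":
    flushed = []
    rb = RingBuffer(3, flushed.append)
    for i in range(7):
        rb.write(i)
    rb.drain()
    print(flushed)
```

The slots are allocated once and reused, so writes never grow memory; only the `head` and `count` bookkeeping changes as data moves through.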

Continuous logging can fill up space on the systems, causing other programs to also run out of space and fail. In such cases, either logs have to be manually removed or a log rotation policy has to be implemented.



References

  1. Hadoop Internals
    1. One of the best all-in-one overviews of Hadoop Architecture I have read.
    2. The documentation appears to be up to date with YARN and other ecosystem improvements.
  2. Advantages of a Ring Buffer
    1. Map Tasks write to ring (aka Circular) buffers while executing
    2. This article is unrelated to Hadoop, but a knowledge of how this buffer works will aid in understanding mapred-site.xml configuration parameters
      1. Property: mapreduce.map.sort.spill.percent
      2. Description: The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
      3. Default Value: 0.80
  3. [Quora] Apache Spark vs Hadoop
    1. A good discussion of both the map-side and reduce-side differences.  
    2. Helpful for an understanding of Hadoop's design independent of Spark.