Tuesday, March 24, 2015

Hadoop Architecture

The Map/Reduce Paradigm


How many times does this pattern occur in the data?
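
The question above can be sketched as a tiny in-process map/reduce job. This is only an illustration of the paradigm, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_pattern`) and the sample lines are assumptions made up for this example.

```python
import re
from collections import defaultdict


def map_phase(line, pattern):
    """Emit one (key, 1) pair for every match of the pattern in a line."""
    for match in re.findall(pattern, line):
        yield (match, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single key."""
    return key, sum(values)


def count_pattern(lines, pattern):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line, pattern))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["error at node1", "ok", "error then error again"]
    print(count_pattern(sample, r"error"))
```

In a real cluster the map calls run on many nodes in parallel and the shuffle moves data across the network; here everything runs in one process to keep the three phases visible.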


Mapping




Introduction:
  1. Each Map task outputs data in the form of Key/Value pairs.
    1. mapreduce.tasktracker.map.tasks.maximum: 8
      1. The maximum number of map tasks that will be run simultaneously by a task tracker
    2. mapreduce.map.memory.mb: 128
      1. The amount of memory to request from the scheduler for each map task.
  2. The output is stored in a Ring Buffer rather than being written directly to the disk.
  3. When the Ring Buffer reaches 80% capacity, the content is "spilled" to disk.
    1. This process will create multiple files on the datanode (shuffle spill files).
    2. mapreduce.map.sort.spill.percent: 0.80
      1. The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
  4. Hadoop will merge all the spill files produced by a map task on a given datanode into a single file.
    1. This single file is both sorted and partitioned based on number of reducers.
    2. mapreduce.task.io.sort.mb: 512
      1. The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
    3. mapreduce.task.io.sort.factor: 64
      1. The number of streams to merge at once while sorting files. This determines the number of open file handles.
    4. mapreduce.reduce.shuffle.input.buffer.percent: 0.70
      1. The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
    5. mapreduce.reduce.input.buffer.percent: 0.70
      1. The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
    6. mapreduce.reduce.shuffle.parallelcopies: 128
      1. The default number of parallel transfers run by reduce during the copy (shuffle) phase.
    7. mapreduce.reduce.memory.mb: 1024
      1. The amount of memory to request from the scheduler for each reduce task.
    8. mapreduce.reduce.shuffle.merge.percent: 0.66
      1. The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
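
The buffer, spill, and merge steps above can be sketched roughly as follows. This is a simplified model under stated assumptions, not Hadoop's implementation: the buffer is measured in records rather than megabytes, spills happen synchronously instead of in a background thread, and `partition_for` and `run_map_side` are hypothetical helpers invented for this example.

```python
import heapq


def partition_for(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (hypothetical helper)."""
    return sum(key.encode("utf-8")) % num_reducers


def run_map_side(records, num_reducers, buffer_capacity, spill_percent=0.80):
    """Buffer (key, value) records, spill sorted runs at the soft limit,
    then merge every run into one list ordered by (partition, key)."""
    soft_limit = max(1, int(buffer_capacity * spill_percent))
    buffer, spills = [], []

    def spill():
        # Each spill is sorted by partition first, then by key.
        spills.append(sorted(buffer))
        buffer.clear()

    for key, value in records:
        buffer.append((partition_for(key, num_reducers), key, value))
        if len(buffer) >= soft_limit:
            spill()
    if buffer:
        spill()  # flush whatever remains when the map task finishes

    # A k-way merge of the sorted runs plays the role of the final merge.
    merged = list(heapq.merge(*spills))
    return spills, merged


if __name__ == "__main__":
    words = "b a d c f e h g j i l k".split()
    spills, merged = run_map_side([(w, 1) for w in words],
                                  num_reducers=3, buffer_capacity=10)
    print(len(spills), "spill runs,", len(merged), "merged records")
```

With a capacity of 10 records and a spill percent of 0.80, the first spill is triggered by the eighth record; the number of runs merged at once is what mapreduce.task.io.sort.factor bounds in a real job.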



Ring Buffer


The Ring Buffer (aka Circular Buffer) is a key concept in the MapReduce ecosystem.

We have three major challenges in any map/reduce program:

  1. We are dealing with a massive amount of data
    1. If this isn't true, we don't need to use map/reduce
  2. The results of the map tasks cannot be constantly written to disk
    1. This would be too slow
  3. Nor can it be stored entirely within memory
    1. Most systems would not have a sufficient amount of memory

We have to use a combination of disks/memory efficiently.

The circular buffer is fast. Writing to memory is much faster than doing an I/O to disk. Flushing the data is only performed when needed.
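
A minimal sketch of such a buffer, assuming a fixed number of slots and a synchronous flush once a soft limit is reached. The class name `RingBuffer`, the `on_spill` callback, and the slot-count capacity are illustrative assumptions; Hadoop's own buffer works on serialized bytes and spills in a background thread.

```python
class RingBuffer:
    """Fixed-size circular buffer that flushes once a soft limit is reached."""

    def __init__(self, capacity, spill_percent=0.80, on_spill=None):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.soft_limit = max(1, int(capacity * spill_percent))
        self.head = 0   # index of the oldest unflushed item
        self.count = 0  # number of unflushed items
        self.on_spill = on_spill or (lambda items: None)

    def write(self, item):
        """Store an item in memory; flush only when the soft limit is hit."""
        if self.count == self.capacity:
            # Only reachable if the soft limit is set above the capacity.
            raise BufferError("ring buffer is full")
        tail = (self.head + self.count) % self.capacity  # wraps around
        self.slots[tail] = item
        self.count += 1
        if self.count >= self.soft_limit:
            self.flush()

    def flush(self):
        """Hand the unflushed items, oldest first, to the spill callback."""
        items = [self.slots[(self.head + i) % self.capacity]
                 for i in range(self.count)]
        self.head = (self.head + self.count) % self.capacity
        self.count = 0
        if items:
            self.on_spill(items)


if __name__ == "__main__":
    spilled = []
    ring = RingBuffer(capacity=5, on_spill=spilled.append)
    for n in range(10):
        ring.write(n)
    ring.flush()  # final flush, as at the end of a map task
    print(spilled)
```

The slots are reused after each flush, so memory use stays constant no matter how many records pass through, and the slow disk write happens once per batch rather than once per record.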

Continuous logging can fill up space on the systems, causing other programs to also run out of space and fail. In such cases, either logs have to be manually removed or a log rotation policy has to be implemented.



References

  1. Hadoop Internals
    1. One of the best all-in-one overviews of Hadoop Architecture I have read.
    2. The documentation appears to be up to date with YARN and other ecosystem improvements.
  2. Advantages of a Ring Buffer
    1. Map tasks write to ring (aka circular) buffers while executing.
    2. This article is unrelated to Hadoop, but a knowledge of how this buffer works will aid in understanding mapred-site.xml configuration parameters.
      1. Property: mapreduce.map.sort.spill.percent
      2. Description: The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
      3. Default Value: 0.80
  3. [Quora] Apache Spark vs Hadoop
    1. A good discussion of both the map-side and reduce-side differences.  
    2. Helpful for an understanding of Hadoop's design independent of Spark.
