The Map/Reduce Paradigm
How many times does this pattern occur in the data?
Mapping
Introduction:
- Each map task outputs data in the form of key/value pairs (see the mapper sketch after this list).
- mapreduce.tasktracker.map.tasks.maximum: 8
- The maximum number of map tasks that will be run simultaneously by a task tracker
- mapreduce.map.memory.mb: 128
- The amount of memory to request from the scheduler for each map task.
- The output is stored in a Ring Buffer rather than being written directly to the disk.
- When the Ring Buffer reaches 80% capacity, the content is "spilled" to disk.
- This process will create multiple files (spill files) on the node's local disk.
- mapreduce.map.sort.spill.percent: 0.80
- The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
- Hadoop then merges all of a map task's spill files into a single file.
- This single file is both sorted and partitioned according to the number of reducers.
- mapreduce.task.io.sort.mb: 512
- The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.
- mapreduce.task.io.sort.factor: 64
- The number of streams to merge at once while sorting files. This determines the number of open file handles.
- mapreduce.reduce.shuffle.input.buffer.percent: 0.70
- The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
- mapreduce.reduce.input.buffer.percent: 0.70
- The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.
- mapreduce.reduce.shuffle.parallelcopies: 128
- The default number of parallel transfers run by reduce during the copy (shuffle) phase.
- mapreduce.reduce.memory.mb: 1024
- The amount of memory to request from the scheduler for each reduce task.
- mapreduce.reduce.shuffle.merge.percent: 0.66
- The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
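To make the first point above concrete, here is a minimal sketch of a Hadoop mapper that emits (token, 1) key/value pairs, together with a driver that applies a few of the properties listed above through the Java Configuration API. The class name PatternCountMapper and the chosen values are illustrative assumptions, not defaults or recommendations; only the property names come from the list above, and the job setup is deliberately incomplete.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: answers "how many times does this pattern occur?"
// by emitting a (token, 1) key/value pair for every token it sees.
public class PatternCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // Every map output is a key/value pair; these pairs land in the
            // in-memory ring buffer before being spilled, sorted, and partitioned.
            context.write(word, ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative tuning only -- these mirror the values listed above.
        conf.setInt("mapreduce.task.io.sort.mb", 512);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        conf.setInt("mapreduce.task.io.sort.factor", 64);
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 128);

        Job job = Job.getInstance(conf, "pattern-count");
        job.setJarByClass(PatternCountMapper.class);
        job.setMapperClass(PatternCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Input/output paths, a reducer, and job submission would be
        // configured here in a real job; they are omitted in this sketch.
    }
}
```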
Ring Buffer
The Ring Buffer (aka Circular Buffer) is a key concept in the MapReduce ecosystem.
We have two major challenges in any map/reduce program:
- We are dealing with a massive amount of data
- If this isn't true, we don't need to use map/reduce
- The result of the map tasks cannot be constantly written to disk
- This would be too slow
- Nor can it be stored entirely within memory
- Most systems would not have a sufficient amount of memory
We have to use a combination of disks/memory efficiently.
The circular buffer is fast: writing to memory is much faster than performing I/O to disk, and the data is flushed only when needed.
A related concern is continuous logging: it can fill up disk space on the cluster nodes, causing other programs to run out of space and fail. In such cases, either the logs have to be removed manually or a log rotation policy has to be put in place.
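To make the spill behaviour concrete, here is a toy model of a circular buffer with a spill threshold. This is a sketch of the concept only, not Hadoop's actual map-output buffer; the class name and sizes are made up for illustration. With mapreduce.task.io.sort.mb = 512 and mapreduce.map.sort.spill.percent = 0.80, a real map task would begin spilling at roughly 512 * 0.80 ≈ 410 MB of buffered map output.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the map-side spill: a fixed-size circular buffer that collects
// records in memory and "spills" a sorted batch once it is ~80% full.
// Illustrative only -- this is not Hadoop's MapOutputBuffer.
public class SpillingRingBuffer {

    private final String[] slots;
    private final double spillThreshold;   // e.g. 0.80, like mapreduce.map.sort.spill.percent
    private int head = 0;                   // next write position
    private int size = 0;                   // records currently buffered
    private final List<List<String>> spills = new ArrayList<>();

    public SpillingRingBuffer(int capacity, double spillThreshold) {
        this.slots = new String[capacity];
        this.spillThreshold = spillThreshold;
    }

    public void collect(String record) {
        slots[head] = record;
        head = (head + 1) % slots.length;   // wrap around: the "ring" part
        size++;
        if (size >= (int) (slots.length * spillThreshold)) {
            spill();
        }
    }

    private void spill() {
        // Drain the buffered records, sort them, and keep the batch as a
        // stand-in for one spill file written to local disk.
        List<String> batch = new ArrayList<>(size);
        int start = Math.floorMod(head - size, slots.length);
        for (int i = 0; i < size; i++) {
            batch.add(slots[(start + i) % slots.length]);
        }
        batch.sort(null);                   // natural ordering, like the sort before each spill
        spills.add(batch);
        size = 0;
        System.out.println("Spilled " + batch.size() + " records (spill #" + spills.size() + ")");
    }

    public static void main(String[] args) {
        SpillingRingBuffer buffer = new SpillingRingBuffer(10, 0.80);
        for (int i = 0; i < 25; i++) {
            buffer.collect("record-" + i);
        }
        // In Hadoop, the resulting spill files are later merged into a single
        // sorted, partitioned file per map task.
    }
}
```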
References
- Hadoop Internals
- One of the best all-in-one overviews of Hadoop Architecture I have read.
- The documentation appears to be up to date with YARN and other ecosystem improvements.
- Advantages of a Ring Buffer
- Map Tasks write to ring (aka Circular) buffers while executing
- This article is unrelated to Hadoop, but knowledge of how this buffer works will aid in understanding the mapred-site.xml configuration parameters
- Property: mapreduce.map.sort.spill.percent
- Description: The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5
- Default Value: 0.80
- [Quora] Apache Spark vs Hadoop
- A good discussion of both the map-side and reduce-side differences.
- Helpful for an understanding of Hadoop's design independent of Spark.