# mapreduce
## Resources and learning roadmap
- A Quora question, http://www.quora.com/What-are-some-of-the-good-resources-to-learn-Hadoop-and-MapReduce-for-an-absolute-beginner; one answerer mentions Hue, which looks quite good and seems to be a nice platform for experimenting.
- The Intro to Hadoop and MapReduce course on Udacity.
- https://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning, on partitioning.
- http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html, the official tutorial.
- https://docs.google.com/document/d/1v0zGBZ6EHap-Smsr3x3sGGpDW-54m82kDpPKC2M6uiY/edit, from the Udacity Intro to Hadoop and MapReduce course, which provides a virtual machine with Hadoop preinstalled; very handy for getting started.
- http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api, a guide to upgrading to the new MapReduce API.
- https://www.youtube.com/watch?v=tILEXVC95HU, implementing joins with MapReduce (a rough sketch follows this list).
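The join video above comes without code here, so the following is only a minimal sketch of a reduce-side join written in the Hadoop Streaming style. The input file names, the tab-separated column layout, and the A/B tagging scheme are my own assumptions, not taken from the video:

```python
#!/usr/bin/env python
"""Sketch of a reduce-side join for Hadoop Streaming.

Assumed inputs (not from the video): two tab-separated files whose
join key is the first column, e.g. users.txt = (user_id, name) and
orders.txt = (user_id, amount).
"""
import os
import sys

def mapper():
    # Tag each record with its source so the reducer can tell the two
    # sides of the join apart. Hadoop Streaming exposes the current
    # input file via an environment variable (older releases call it
    # map_input_file).
    source = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))
    tag = "A" if "users" in source else "B"
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Emit: join_key \t tag \t rest-of-record
        print("%s\t%s\t%s" % (fields[0], tag, "\t".join(fields[1:])))

def reducer():
    # All records sharing a key reach one reducer consecutively, but
    # the A/B order within a key is not guaranteed, so buffer both
    # sides and emit the cross product when the key changes.
    def flush(key, left, right):
        for l in left:
            for r in right:
                print("%s\t%s\t%s" % (key, l, r))

    cur, left, right = None, [], []
    for line in sys.stdin:
        key, tag, rest = line.rstrip("\n").split("\t", 2)
        if key != cur:
            if cur is not None:
                flush(cur, left, right)
            cur, left, right = key, [], []
        (left if tag == "A" else right).append(rest)
    if cur is not None:
        flush(cur, left, right)

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Note that under Streaming the mapper learns its input file from the environment; when testing this locally with pipes you would have to set that variable by hand.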
## Key points
- Hadoop Streaming lets us write MapReduce programs in Python (see the word-count sketch after this list).
- On data locality, see the quoted passage below (in the original, one sentence was bolded and a screenshot followed; the "green block" it mentions refers to that screenshot):
> By default a TaskTracker runs on the same machine as a DataNode, so the Hadoop framework can have the map tasks work directly on the pieces of data that are stored on that machine. This saves a lot of network traffic. As we saw, each Mapper processes a portion of the input data, known as the input split, and by default Hadoop uses an HDFS block as the input split for each Mapper. It will try to make sure that a Mapper works on data on the same machine: if this green block, for example, needs processing, then the TaskTracker on this machine will likely be the one chosen to process that block.
>
> That won't always be possible, because the TaskTrackers on the three machines that have the green block could already be busy. In that case, a different node is chosen to process the green block, and the data is streamed to it over the network; this actually happens rather rarely.
>
> So the Mappers read their input data and produce intermediate data, which the Hadoop framework passes to the reducers (remember, that's the shuffle and sort). Then the reducers process that data and write their final output back to HDFS. (Udacity, Lesson 2)
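To make the Hadoop Streaming point above concrete, here is a minimal word-count pair in Python. Streaming pipes each input split to the mapper's stdin and feeds the reducer the shuffled, key-sorted intermediate pairs on its stdin; the file names mapper.py and reducer.py are just placeholders:

```python
#!/usr/bin/env python
# mapper.py: emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: stdin arrives sorted by key (the shuffle/sort step), so
# equal words are adjacent and a running per-key sum is enough.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

A handy property of Streaming is that the same pair can be tested without a cluster, with a plain sort standing in for the shuffle: `cat input.txt | python mapper.py | sort | python reducer.py`. On a cluster you submit it through the streaming jar, along the lines of `hadoop jar /path/to/hadoop-streaming-*.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar's location varies by Hadoop release).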
## Open questions
- What exactly is Hadoop's shuffle doing? (A conceptual sketch follows.)
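Toward the question above: conceptually, the shuffle is the step that moves each mapper's output to the right reducer, partitioning pairs by a hash of the key and then sorting each reducer's input so that equal keys arrive grouped together. Below is a single-process toy simulation of that idea, not Hadoop's actual implementation, which does this over the network with spills to disk and multi-pass merges:

```python
# Toy simulation of shuffle/sort, not Hadoop internals:
# 1) partition each mapper's (key, value) pairs by hash(key),
# 2) every reducer collects its partition from every mapper,
# 3) each reducer sorts its pairs so equal keys are adjacent.
from itertools import groupby
from operator import itemgetter

def shuffle(map_outputs, num_reducers):
    # map_outputs: one list of (key, value) pairs per mapper.
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:                      # the "copy" phase
        for key, value in pairs:
            # Python salts str hashes per run; Hadoop's partitioner is
            # stable across machines, but within one run this is fine.
            partitions[hash(key) % num_reducers].append((key, value))
    for part in partitions:                        # the "sort" phase
        part.sort(key=itemgetter(0))
    return partitions

map_outputs = [[("a", 1), ("b", 1)], [("b", 1), ("a", 1)]]
for i, part in enumerate(shuffle(map_outputs, 2)):
    # groupby now sees equal keys adjacently, just as a reducer does.
    for key, values in groupby(part, key=itemgetter(0)):
        print("reducer %d: %s -> %d" % (i, key, sum(v for _, v in values)))
```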