Hadoop MapReduce Data Flow (Part 1)

olylakers

Category: Hadoop & Hive

Tags: Mapreduce, Hadoop, Yahoo, Apache

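This post does not cover the principles of MapReduce. It only describes, at the source-code level, my understanding of how Hadoop MapReduce executes and how data flows through it.

First, here is a figure from the Yahoo Hadoop tutorial:

[Figure: Detailed Hadoop MapReduce data flow]

As the figure shows, before data reaches Map, the InputFormat reads the files stored in HDFS and splits them into task-specific InputSplits. A RecordReader then reads those splits and passes what it reads to the map function as input arguments. Below I follow the code path to see how data travels, step by step, from a file in HDFS to the map function. The Yahoo Hadoop tutorial already explains this process in detail, but I am a stickler for detail and wanted to trace it at the source-code level; only then do I feel I have really mastered it. So I read this part of the source carefully and wrote this post as a record for my own future reference.

First, the run method of the Mapper class calls the map function in a loop: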

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  // ...

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}

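In run, map is executed once for every call to context.nextKeyValue() that returns true. The context here is actually a MapContextImpl, which implements the Context interface (this can be seen in the run method of MultithreadedMapper). Its nextKeyValue, getCurrentKey, and getCurrentValue methods are: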

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }
  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey();
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }
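
The run loop and the delegation to a reader can be imitated without Hadoop. The following is a minimal sketch under stated assumptions: RunLoopSketch, SimpleReader, ListReader, SimpleContext, and demo are hypothetical stand-ins invented for illustration, not Hadoop classes, and the reader simply walks an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins for illustration; none of these are Hadoop classes. */
class RunLoopSketch {

    /** Minimal reader contract with the three methods that run() relies on. */
    interface SimpleReader<K, V> {
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
    }

    /** Reader over an in-memory list; the key is the element index. */
    static class ListReader implements SimpleReader<Integer, String> {
        private final List<String> items;
        private int index = -1;

        ListReader(List<String> items) {
            this.items = items;
        }

        @Override
        public boolean nextKeyValue() {
            if (index + 1 >= items.size()) {
                return false; // no more records
            }
            index++;
            return true;
        }

        @Override
        public Integer getCurrentKey() {
            return index;
        }

        @Override
        public String getCurrentValue() {
            return items.get(index);
        }
    }

    /** Context that forwards every read call to its reader. */
    static class SimpleContext<K, V> {
        private final SimpleReader<K, V> reader;
        final List<String> output = new ArrayList<>();

        SimpleContext(SimpleReader<K, V> reader) {
            this.reader = reader;
        }

        boolean nextKeyValue() { return reader.nextKeyValue(); }
        K getCurrentKey() { return reader.getCurrentKey(); }
        V getCurrentValue() { return reader.getCurrentValue(); }
        void write(String record) { output.add(record); }
    }

    /** Same shape as the run loop: one map call per record the reader yields. */
    static <K, V> List<String> run(SimpleContext<K, V> context) {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        return context.output;
    }

    /** Toy map function: emits "key=VALUE" with the value upper-cased. */
    static <K, V> void map(K key, V value, SimpleContext<K, V> context) {
        context.write(key + "=" + value.toString().toUpperCase());
    }

    /** Convenience entry point that wires a ListReader into a context. */
    static List<String> demo(List<String> items) {
        return run(new SimpleContext<Integer, String>(new ListReader(items)));
    }

    public static void main(String[] args) {
        System.out.println(demo(Arrays.asList("a", "b"))); // prints [0=A, 1=B]
    }
}
```

The point of the sketch is that the loop never touches the data source directly: swapping ListReader for another reader changes what map receives without changing run.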

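As these MapContextImpl methods show, the work of nextKeyValue is actually done by reader. reader is a RecordReader instance, and a RecordReader reads the splits assigned to each task and produces the input arguments of the map function. There are many RecordReader implementations, so which class is this reader an instance of? Let's look at where the context is created: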

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
          (inputFormat.createRecordReader(split, taskContext), reporter);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    // ...

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

The input in the code above is a NewTrackingRecordReader, which wraps the RecordReader returned by inputFormat.createRecordReader(split, taskContext). inputFormat is an instance of the InputFormat class, which defines how the input files are split and read:

public abstract class InputFormat<K, V> {

  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                                                        TaskAttemptContext context)
      throws IOException, InterruptedException;

}

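The actual reading is done by the RecordReader it creates. Hadoop ships with several input formats; for a detailed description of them, see the Yahoo Hadoop tutorial. JobContextImpl provides the getter getInputFormatClass, and the default is TextInputFormat, which reads a file line by line, with the byte offset of the line as the key and the content of the line as the value. To implement a custom input format, we can extend FileInputFormat and override isSplitable and createRecordReader (isSplitable is declared in FileInputFormat, not in InputFormat itself), then select it for the job with the setter setInputFormatClass on Job.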

  @SuppressWarnings("unchecked")
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>) 
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }
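
The "configured class, else TextInputFormat" lookup can be illustrated in plain Java. This is only a sketch under assumptions: InputFormatLookupSketch, Format, TextFormat, WholeFileFormat, and the key "sketch.inputformat.class" are all made up for illustration, and a plain Map stands in for the Hadoop configuration object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins for illustration; none of these are Hadoop classes. */
class InputFormatLookupSketch {

    /** Plays the role of InputFormat in this sketch. */
    interface Format {
        String describe();
    }

    /** Plays the role of the default, TextInputFormat. */
    static class TextFormat implements Format {
        @Override
        public String describe() { return "one record per line"; }
    }

    /** A made-up custom format that a job could opt into. */
    static class WholeFileFormat implements Format {
        @Override
        public String describe() { return "one record per file"; }
    }

    /** Made-up configuration key, standing in for INPUT_FORMAT_CLASS_ATTR. */
    static final String FORMAT_KEY = "sketch.inputformat.class";

    /** Returns the configured format class, or TextFormat when the key is unset. */
    static Class<? extends Format> getFormatClass(Map<String, String> conf) {
        String name = conf.get(FORMAT_KEY);
        if (name == null) {
            return TextFormat.class; // fall back to the default
        }
        try {
            return Class.forName(name).asSubclass(Format.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown format class: " + name, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getFormatClass(conf).getSimpleName()); // prints TextFormat

        conf.put(FORMAT_KEY, WholeFileFormat.class.getName());
        System.out.println(getFormatClass(conf).getSimpleName()); // prints WholeFileFormat
    }
}
```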

Since files are read through a RecordReader, let's see which RecordReader TextInputFormat uses:

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // ...
  }

}

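So the RecordReader created by TextInputFormat is a LineRecordReader. "textinputformat.record.delimiter" is the terminator that ends a record: reading of the current line stops when the byte sequence configured under textinputformat.record.delimiter is encountered. A custom terminator can be set with the set() method of Configuration. If textinputformat.record.delimiter is not set, Hadoop uses CR, LF, or CRLF as the terminator, as can be seen in the readDefaultLine method of LineReader. The implementation of LineRecordReader also shows why TextInputFormat was described above as using the line offset as the key and the line content as the value. Here are a few of its main methods: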

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    // ...
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
 

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    if (isCompressedInput()) {
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        if (null == this.recordDelimiterBytes){
          in = new LineReader(cIn, job);
        } else {
          in = new LineReader(cIn, job, this.recordDelimiterBytes);
        }

        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        if (null == this.recordDelimiterBytes) {
          in = new LineReader(codec.createInputStream(fileIn, decompressor),
              job);
        } else {
          in = new LineReader(codec.createInputStream(fileIn,
              decompressor), job, this.recordDelimiterBytes);
        }
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      if (null == this.recordDelimiterBytes){
        in = new LineReader(fileIn, job);
      } else {
        in = new LineReader(fileIn, job, this.recordDelimiterBytes);
      }

      filePosition = fileIn;
    }

  }
  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) {
      newSize = in.readLine(value, maxLineLength,
          Math.max(maxBytesToConsume(pos), maxLineLength));
      if (newSize == 0) {
        break;
      }
      pos += newSize;
      inputByteCounter.increment(newSize);
      if (newSize < maxLineLength) {
        break;
      }

    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

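First, initialize takes the FileSplit passed in, obtains the path of the file to read and the start position, and uses them to create in, the stream that actually reads the file. In nextKeyValue we can see that in reads the file and updates key and value: key is set to the current position pos before the read, and pos is then advanced by the number of bytes consumed.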
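
To make the "offset as key, line as value" behaviour concrete, here is a minimal sketch in plain Java. It is not LineRecordReader: LineOffsetSketch, Record, and readAll are hypothetical names, the whole input is held in a byte array, and split boundaries, compression, and maximum line length are ignored. It only mirrors the pattern of recording the position before a read and advancing it by the bytes consumed, with CR, LF, or CRLF ending a line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Hadoop LineRecordReader. */
class LineOffsetSketch {

    /** One record: byte offset of the line start, and the line without its terminator. */
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }

        @Override
        public String toString() {
            return offset + ":" + line;
        }
    }

    /** Reads every line of data, treating CR, LF, or CRLF as the line terminator. */
    static List<Record> readAll(byte[] data) {
        List<Record> records = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int lineStart = pos; // this becomes the key, like key.set(pos)

            // Scan forward to the first CR or LF, or to the end of the data.
            int i = pos;
            while (i < data.length && data[i] != '\n' && data[i] != '\r') {
                i++;
            }
            int lineEnd = i;

            // Consume the terminator: CRLF counts as two bytes, CR or LF as one.
            if (i < data.length) {
                boolean crlf = data[i] == '\r' && i + 1 < data.length && data[i + 1] == '\n';
                i += crlf ? 2 : 1;
            }

            String line = new String(data, lineStart, lineEnd - lineStart, StandardCharsets.UTF_8);
            records.add(new Record(lineStart, line));
            pos = i; // advance by the bytes consumed, like pos += newSize
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\r\ne".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(data)); // prints [0:ab, 3:cd, 7:e]
    }
}
```

Note that the keys are byte offsets, not line numbers, which is why the second record starts at 3 rather than 1.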

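At this point it is clear how Hadoop reads file data and in what form it is passed to the map function. It also gave me a better understanding of points made in the Yahoo Hadoop tutorial, such as the default implementation of FileInputFormat and how TextInputFormat builds its key-value pairs. The biggest benefit is that if I need to implement something custom, I now know how to change the code and where to plug my customization in.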
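
One small customization of this kind is a custom record delimiter. In a real job it would be configured by setting the textinputformat.record.delimiter key through Configuration's set() method; the sketch below only imitates the effect in plain Java. DelimiterSketch and split are hypothetical names, the input is a String rather than a byte stream, and the delimiter is matched as a whole sequence.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of record splitting on a custom delimiter; not Hadoop code. */
class DelimiterSketch {

    /** Splits data into records, ending each record at the next full delimiter match. */
    static List<String> split(String data, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            int hit = data.indexOf(delimiter, pos);
            if (hit < 0) {
                records.add(data.substring(pos)); // last record, no trailing delimiter
                break;
            }
            records.add(data.substring(pos, hit));
            pos = hit + delimiter.length(); // skip past the delimiter itself
        }
        return records;
    }

    public static void main(String[] args) {
        // With "||" as the delimiter, newlines stay inside the records.
        List<String> records = split("a\nb||c\nd||e", "||");
        System.out.println(records.size()); // prints 3
    }
}
```

With a custom delimiter, a newline is ordinary record content, so a single map call can receive a value that spans several text lines.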


2011-06-02 15:27

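#3 austincao 2012-07-23

Looks like I need to settle down and read more code myself. Thanks to the author for sharing such a good article; I learned a lot.

#2 waytofall 2011-06-05

Oh, never mind, I misread...
It turns out the code is in the other mapreduce package...
Nice article, and kudos for digging into the source.
(The split between the new and old APIs is a real trap.)

#1 waytofall 2011-06-05

Could you send me a copy of this source code?
waytofall916@gmail.com
The source and the jar in the hadoop package turn out to be different versions...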
