Introduction to Hadoop C++ Extension
肖 康
xiaokang@baidu.com
Outline
Big Picture
Why HCE
HCE Implementation
HCE Usage
HCE Reference
Other Works
Baidu Statistics
Current
>10 clusters, 4000 nodes
largest cluster: 1000 nodes
8 cores / 16 GB / 12 × 1 TB per node
data per day: >3PB
jobs per day: >30,000
Soon
>10000 nodes
data per day: 10PB
Big Picture
Computing Resource Management Layer
Communication Intensive – HPC … Data & Computing Intensive – DC...
Scheduling Layer
(HPC Scheduler, Agent)
Scheduling Layer
(DC Scheduler, Agent)
Classification
RegressionVector
Clustering
Computing Model
MapReduce DAG
Computing Model
Algorithm Description Layer SQL-like Representation Layer
Why HCE
Current API
Java
streaming/bistreaming
pipes
Why HCE
Java language efficiency
sort, compress/decompress
C++: 10% to 40% improvement
Java memory control
full-featured C++ API
HCE Implementation
韩富晟 hanfusheng@baidu.com
TaskTracker
Child
Child JVM
MapTask
or
ReduceTask
run
Launch
socket
C++ Wrapper Library
command Status/progress
C++ Map or Reduce
or Reader or
Writer or
Partitioner or
Combiner or
Committer class
TaskTracker Node
HCE
Data
Data
HCE Implementation
Java
RunTask
HceMapRunner
HDFS
①
HceOutputCommitter
HceNoJavaInputFormat
HceSubmitter
LineRecordReader
Mapper
MapOutputCollector
LocalFS IFileWriter
Status/Progress/Counters
HceMapRunner HceOutputCommitter
IFileReader
Reducer
LocalFS
ReduceInputReader
C++
LineRecordWriter
HadoopOutputCommitter
Shuffle & MSort
Hadoop
File.out map.out
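The map-side path above (Mapper, MapOutputCollector, IFileWriter) can be illustrated with a small sketch. It assumes, as in stock Hadoop, that map output is ordered by partition and then by key before it is written to local disk. `Record` and `SortingCollector` are hypothetical names for illustration, not HCE classes, and the real collector also handles spilling and the IFile format.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical in-memory record: the partition it belongs to plus its K/V pair.
struct Record {
    int partition;
    std::string key;
    std::string value;
};

// Hypothetical collector: buffers map output, then returns it in write order.
class SortingCollector {
public:
    void collect(int partition, std::string key, std::string value) {
        buffer_.push_back({partition, std::move(key), std::move(value)});
    }

    // Sort by (partition, key) and hand the records back, leaving the buffer empty.
    std::vector<Record> drainSorted() {
        std::stable_sort(buffer_.begin(), buffer_.end(),
                         [](const Record& a, const Record& b) {
                             return std::tie(a.partition, a.key) <
                                    std::tie(b.partition, b.key);
                         });
        std::vector<Record> out;
        out.swap(buffer_);
        return out;
    }

private:
    std::vector<Record> buffer_;
};
```

Sorting by partition first keeps each reducer's data contiguous in the output file, so a reducer can fetch its share with one sequential read.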
HCE Usage – basic interface
setup(), cleanup(), map(), and reduce() are not optional;
each returns 0 for success.
emit() outputs a K/V pair.
TaskContext gives access to conf and counters.
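The lifecycle described above can be sketched as follows. The names setup(), cleanup(), map(), emit() and TaskContext come from the slide. The class layout, method signatures, `PrefixFilterMapper`, `runMapTask` and the `demo.*` keys are assumptions made for illustration and are not the actual HCE headers.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TaskContext: job configuration plus counters.
class TaskContext {
public:
    void setConf(const std::string& key, const std::string& value) { conf_[key] = value; }
    std::string getConf(const std::string& key, const std::string& fallback) const {
        auto it = conf_.find(key);
        return it == conf_.end() ? fallback : it->second;
    }
    void incrementCounter(const std::string& name, int64_t delta) { counters_[name] += delta; }
    int64_t counter(const std::string& name) const {
        auto it = counters_.find(name);
        return it == counters_.end() ? 0 : it->second;
    }
private:
    std::map<std::string, std::string> conf_;
    std::map<std::string, int64_t> counters_;
};

// Hypothetical mapper base class: every hook must be implemented and returns 0 on success.
class Mapper {
public:
    virtual ~Mapper() = default;
    virtual int setup(TaskContext& ctx) = 0;
    virtual int map(const std::string& key, const std::string& value, TaskContext& ctx) = 0;
    virtual int cleanup(TaskContext& ctx) = 0;
    const std::vector<std::pair<std::string, std::string>>& output() const { return out_; }
protected:
    void emit(const std::string& key, const std::string& value) { out_.emplace_back(key, value); }
private:
    std::vector<std::pair<std::string, std::string>> out_;
};

// Example mapper: keeps values that start with a configured prefix, counts the rest.
class PrefixFilterMapper : public Mapper {
public:
    int setup(TaskContext& ctx) override {
        prefix_ = ctx.getConf("demo.prefix", "");
        return 0;
    }
    int map(const std::string& key, const std::string& value, TaskContext& ctx) override {
        if (value.compare(0, prefix_.size(), prefix_) == 0) {
            emit(key, value);
        } else {
            ctx.incrementCounter("demo.skipped", 1);
        }
        return 0;
    }
    int cleanup(TaskContext&) override { return 0; }
private:
    std::string prefix_;
};

// Hypothetical driver: setup once, map per record, cleanup once; stop on the first non-zero code.
int runMapTask(Mapper& mapper,
               const std::vector<std::pair<std::string, std::string>>& records,
               TaskContext& ctx) {
    if (int rc = mapper.setup(ctx)) return rc;
    for (const auto& record : records) {
        if (int rc = mapper.map(record.first, record.second, ctx)) return rc;
    }
    return mapper.cleanup(ctx);
}
```

In this sketch the driver plays the role of the framework: user code only fills in the three hooks and reports failure through the return value.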
HCE Usage – wordcount map
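A minimal sketch of what the map step of word count might look like. The `Emit` callback and the `wordCountMap` signature are hypothetical stand-ins for HCE's map() and emit(); only the return-0-on-success convention is taken from the slides.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Hypothetical emit callback: receives one output key/value pair.
using Emit = std::function<void(const std::string&, const std::string&)>;

// Split the input line on whitespace and emit (word, "1") for every word.
// The input key (typically the line offset) is ignored.
int wordCountMap(const std::string& /*key*/, const std::string& line, const Emit& emit) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) {
        emit(word, "1");
    }
    return 0;  // 0 signals success
}
```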
HCE Usage – wordcount reduce
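A minimal sketch of the matching reduce step, assuming the framework passes one key together with all of its values. `wordCountReduce` and `Emit` are hypothetical names, and the way HCE actually iterates over values may differ.

```cpp
#include <cstdint>
#include <exception>
#include <functional>
#include <string>
#include <vector>

// Hypothetical emit callback: receives one output key/value pair.
using Emit = std::function<void(const std::string&, const std::string&)>;

// Sum the textual counts for one word and emit (word, total).
// Returns non-zero if a value cannot be parsed as a number.
int wordCountReduce(const std::string& word,
                    const std::vector<std::string>& counts,
                    const Emit& emit) {
    int64_t total = 0;
    for (const auto& count : counts) {
        try {
            total += std::stoll(count);
        } catch (const std::exception&) {
            return 1;  // report failure to the caller
        }
    }
    emit(word, std::to_string(total));
    return 0;  // 0 signals success
}
```

Because addition is associative, the same function could also serve as a combiner in this sketch.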
HCE Usage – wordcount run
Partitioner Combiner OutputCommitter RecordReader
RecordWriter
$HADOOP_HOME/bin/hadoop hce \
-mapper wordcount-demo \
-reducer wordcount-demo \
-file ./wordcount-demo \
-jobconf mapred.reduce.tasks=1 \
-input /user/test/sample_input \
-output /user/test/sample_output
HCE Usage – advanced interface
Interface       Function
JobConf         Get job configuration from object
Counter         Allows user-defined counters
Partitioner     HashPartitioner by default
                Utility: IntHashPartitioner and MapIntPartitioner
Combiner        Allows a user-defined combiner
RecordReader    LineRecordReader by default
                SequenceRecordReader for SequenceFile
RecordWriter    LineRecordWriter by default
                SequenceRecordWriter for SequenceFile
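To illustrate the default partitioning idea, here is a minimal sketch of hash partitioning: the key is hashed and taken modulo the number of reduce tasks. `hashPartition` is a hypothetical free function; the hash function HCE's HashPartitioner actually uses is not stated in the slides, so `std::hash` is an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Map a key to a reduce task index in [0, numReduceTasks).
// Returns -1 if numReduceTasks is not positive.
int hashPartition(const std::string& key, int numReduceTasks) {
    if (numReduceTasks <= 0) {
        return -1;
    }
    const std::size_t hash = std::hash<std::string>{}(key);
    return static_cast<int>(hash % static_cast<std::size_t>(numReduceTasks));
}
```

All records with the same key land in the same partition, which is what lets a single reducer see every value for that key.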
HCE Reference
JIRA : MAPREDUCE-1270
https://issues.apache.org/jira/browse/MAPREDUCE-1270
patch
demo tarball
design doc
install manual
tutorial
performance test
Other Works
JobHistory Server
separate JobTracker & JobHistory
query job history from DB
TaskScheduler
queue schedule based on CapacityTaskScheduler
queue priority support
queue update at run time
preemption support
Other Works
Shuffle (planned)
problems
random IO
connection
shuffle uses a reduce slot
separate shuffle from tasktracker & reduce task
C++ implementation in HCE
random IO vs. sequential IO
pull vs. push model
data distribution service: shuffle, BT (BitTorrent)
Thanks