HCE: A MapReduce Framework
to Improve Resource Utilization
Yang Dong
yangdonglee@gmail.com
About Me
• Research Area
– Distributed Storage System
• HDFS
• Hypertable
– Distributed Computing System
• MapReduce
• DataStream
Agenda
• Background and Motivation
• Framework Model
• Evaluation
• Conclusion
• Q&A
Agenda
• Background and Motivation
– State of Art
– Challenge
– Solution
• Framework Model
• Evaluation
• Conclusion
• Q&A
State of Art
50000+ jobs
10000+ nodes
10P+ data processed per day
• How to improve the efficiency of clusters?
• How to improve development efficiency?
• How to satisfy customer requirements?
• How to control and maintain?
Challenge
• Resource Utilization
– Job optimization
• Resource Scheduling
• Dynamic Configuration
– Task optimization
• Framework optimization for small tasks
• User program optimization for big tasks
Challenge
• Cluster Status
– Most tasks are small
• 80% of map tasks run for less than 1 min
• Number of map tasks ≈ 2 × number of reduce tasks
Framework optimization is important
• MapReduce Users
– Streaming interface is popular
• Development efficiency, e.g. C++, scripts
• User programs are independent of the framework
User program optimization is needed
Challenge
• Hadoop MapReduce
– Task Runtime
• Java for cross-platform
• Multi-level handling for extensibility
• JNI-based compression
• User programs are independent of the framework
Solution
• Our Goal
– Improve resource utilization of clusters
• Optimizing the task framework
• Optimizing user programs
– Improve development efficiency of engineers
• Multiple programming interfaces
Agenda
• Background and Motivation
• Framework Model
– Overview
– Function Model
– Process Model
– Language Model
– HCE versus Hadoop
• Evaluation
• Conclusion
• Q&A
Overview
MapReduce Phases
• Map task
– Reading
– Map Processing
– Spilling
– Merging
– Committing
• Reduce task
– Shuffling
– Sorting
– Reduce Processing
– Writing
– Committing
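The phases listed above can be illustrated with a small in-memory word count. This is only a sketch for intuition, not HCE code: the function names `map_task` and `reduce_task` and the `spill_size` parameter are hypothetical, and the committing phases are left out because nothing is written to storage.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_task(lines, spill_size=4):
    """Read lines, map them to (word, 1) pairs, spill sorted runs, then merge."""
    spills, buffer = [], []
    for line in lines:                      # Reading
        for word in line.split():           # Map processing
            buffer.append((word, 1))
            if len(buffer) >= spill_size:   # Spilling: sort and flush a full buffer
                spills.append(sorted(buffer))
                buffer = []
    if buffer:
        spills.append(sorted(buffer))
    return list(heapq.merge(*spills))       # Merging: one sorted map output


def reduce_task(map_outputs):
    """Combine sorted map outputs and sum the values of each key."""
    merged = heapq.merge(*map_outputs)      # Shuffling and sorting
    return {key: sum(value for _, value in group)   # Reduce processing
            for key, group in groupby(merged, key=itemgetter(0))}


# Two map tasks feed one reduce task (hypothetical input).
outputs = [map_task(["a b a", "c a"]), map_task(["b c", "c"])]
counts = reduce_task(outputs)
```

Each map task produces one sorted output, so the reduce side only needs a merge rather than a full sort, which is the reason the spill and merge phases exist.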
Function Model
• APIs: Java, C++, Python, Streaming, PHP
• Translators Layer: Compiler Optimization, Code Translator
• Execution Layer: Reader, Writer, Partitioner, Combiner, Committer, Mapper, Reducer
• Access Layer: File Formats, Compress Libs, Storage APIs
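As an illustration of one execution-layer component, the sketch below shows what a partitioner does: it assigns every map output key to one reduce task so that equal keys always meet in the same reducer. The slides do not describe HCE's actual partitioner; the function names are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash.

```python
import zlib


def partition(key: bytes, num_reducers: int) -> int:
    """Map a key to a reducer index in [0, num_reducers)."""
    # crc32 is an assumed stand-in for whatever hash the framework uses.
    return zlib.crc32(key) % num_reducers


def partition_records(records, num_reducers):
    """Split (key, value) records into one list per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical map output with a repeated key.
records = [(b"hadoop", 1), (b"hce", 1), (b"hadoop", 1), (b"sse", 1)]
buckets = partition_records(records, 3)
```

Because the index depends only on the key, both `b"hadoop"` records land in the same bucket regardless of which map task emitted them.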
Process Model
• Data-Flow
[Figure: data flow of a job between the Java side and the C++ side, with HADOOP Shuffle & Sort between map and reduce. Components shown: RunTask, HceSubmitter; map side: HceMapRunner, HceInputFormat, LineRecordReader, Mapper, MapOutputCollector, IFileWriter, HceOutputCommitter; reduce side: HceReduceRunner, IFileReader, ReduceInputReader, Reducer, LineRecordWriter, HceOutputCommitter, HadoopOutputCommitter; storage: HDFS, LocalFS (file.out, map.out); Status/Progress/Counters are reported.]
Process Model
• Streaming Over HCE
[Figure: three task launch models on a TaskTracker node.]
– Hadoop Streaming: the TaskTracker starts a Child JVM that runs the MapTask or ReduceTask and launches the user process; input and output key/values pass through stdin and stdout.
– HCE: the Child JVM runs a proxy MapTask or ReduceTask and launches the user C++ process, which uses libhce; commands and status/progress travel over a socket, and the C++ process reads input key/values and writes output key/values.
– Streaming over HCE: the Child JVM runs a proxy MapTask or ReduceTask and launches the streaming C++ process; commands and status/progress travel over a socket, and key/values pass between the streaming C++ process and the user process through stdin and stdout.
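From the user's point of view, a streaming program only reads lines from stdin and writes tab-separated key/value lines to stdout, whichever process sits on the other end of the pipe. The sketch below is a minimal word count in that style; the function names and the `simulate` helper (which stands in for the framework's sort between map and reduce) are hypothetical and not part of Hadoop or HCE.

```python
import io
from itertools import groupby


def run_mapper(stdin, stdout):
    """Emit one 'word<TAB>1' line per word read from stdin."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def run_reducer(stdin, stdout):
    """Sum the counts of consecutive lines that share a key (input must be sorted)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


def simulate(text):
    """Run mapper, sort, and reducer in memory; the sort stands in for the framework."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    run_reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


result = simulate("b a\na\n")
```

In a real job the same two functions would be called with `sys.stdin` and `sys.stdout`, so the user program stays unchanged whether the parent is the Java streaming task or a C++ streaming process.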
Process Model
• Python Over HCE
[Figure: two task launch models on a TaskTracker node.]
– Python over HCE: the Child JVM runs a proxy MapTask or ReduceTask and launches the Python C++ process, which interprets the user Python file; commands and status/progress travel over a socket, and the process reads input key/values and writes output key/values.
– HCE: the Child JVM runs a proxy MapTask or ReduceTask and launches the user C++ process, which uses libhce; commands and status/progress travel over a socket, and the process reads input key/values and writes output key/values.
Process Model
• HCE Over SSE
– SSE Instruction Set
• Functions
– memcmp, strcmp, strncmp, strlen, strchr, memcpy, memmove, strcpy, memset …
• Performance
– CRC32 16x, memcmp 3.4x, strcmp 3.5x, strncmp 14x, strchr/strnchr 2.5x, strncpy 3x, memcpy 1.3x
– Optimization
• Framework 10%
• User programs
Language Model
Programming Interface | Optimization | Performance | Development | Usage
Java (Hadoop) | - | - | Java | Hive
Streaming (Hadoop) | Does not cover user programs | 5% vs. Java | Read or write stdin & stdout | Common jobs
C++ (HCE) | Framework optimization, user program optimization | 5%~30% vs. Java | Libhce (C++ library) | Data warehouse or big-task jobs
Streaming (HCE) | Framework optimization | 10%~30% vs. Streaming (Java) | Read or write stdin & stdout | Common jobs
Python (HCE) | Framework optimization, user program optimization | 10%~30% vs. Streaming (Java) | Libpyhce (Python library) | Python jobs
HCE versus Hadoop
• Hadoop
– 2 kinds of programming interface
– Difficult to support other storage systems
– JNI-based compression
– No compiler optimization for the MapReduce framework
– No compiler optimization for user programs
– Memory controlled by Java GC
– Quick Sort
– Difficult to implement a combiner with the streaming interface
• HCE
– 4-5 kinds of programming interface
– Easy to support other storage systems, e.g. Hypertable
– Direct native compression
– Static compiler optimization for the MapReduce framework
– Static compiler optimization for user programs
– Memory controlled by the framework
– Bucket Sort + Quick Sort
– Easy to implement a combiner
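The "Bucket Sort + Quick Sort" entry can be read as a two-stage sort: distribute keys into ordered buckets first, then quick-sort each small bucket. The slides do not say how HCE chooses its buckets, so the sketch below assumes one bucket per leading byte of the key; both function names are hypothetical.

```python
def quick_sort(items):
    """Plain recursive quick sort that returns a new sorted list."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)


def bucket_then_quick_sort(keys):
    """Bucket byte-string keys by first byte (assumed scheme), then sort each bucket."""
    # Bucket 0 holds empty keys; buckets 1..256 hold keys by leading byte value.
    buckets = [[] for _ in range(257)]
    for key in keys:
        index = key[0] + 1 if key else 0
        buckets[index].append(key)
    result = []
    for bucket in buckets:          # buckets are already in key order
        result.extend(quick_sort(bucket))
    return result


keys = [b"pear", b"apple", b"", b"fig", b"apricot", b"pea"]
ordered = bucket_then_quick_sort(keys)
```

Since byte strings compare lexicographically, concatenating the sorted buckets in index order yields the same result as sorting all keys at once, while each quick sort works on a much smaller input.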
Agenda
• Background and Motivation
• Framework Model
• Evaluation
– Benchmark
– Application
• Conclusion
• Q&A
Benchmark
• Map Task Timings
[Figure: Wordcount Map Timings. Stacked bars for Hadoop and HCE; y-axis: Execution Time (sec), 0 to 200; phases: SETUP, READ, MAP, COLLECT, CLEANUP, SPILL, MERGE.]
Benchmark
• Compression impacts
– 100 GB, 10 nodes
[Figure: WordCount Performance with Different Compression Strategies. Series: Hadoop Streaming, HCE Streaming; y-axis: Execution Time (sec), 0 to 1200.]
Bonus: SSE can add a further 10% improvement
Application
• Language impacts
[Figure: APP1. Bars for Hadoop Streaming, HCE Streaming, and HCE Python; y-axis: Execution Time (sec), 0 to 60.]
• SSE impacts
[Figure: APP2. Bars for Hadoop Streaming, HCE Streaming, and User SSE; y-axis: Execution Time (sec), 0 to 70.]
Agenda
• Background and Motivation
• Framework Model
• Evaluation
• Conclusion
– How to Optimize Jobs
– Contribution
• Q&A
How to Optimize Jobs
• Decrease the number of reduce tasks by using a combiner
• Use the C++ programming interface
• Improve the efficiency of tasks through compiler optimization
• Use an LZO/QuickLZ compression strategy for map tasks
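To make the combiner advice concrete, the sketch below pre-aggregates map output on the map side so that fewer records have to be spilled and shuffled. It is an illustration under simple assumptions (word count, summation as the combine function); `map_words` and `combine` are hypothetical names, not HCE APIs.

```python
from collections import Counter


def map_words(lines):
    """Emit one (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum values per key locally, before the data leaves the map task."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical input split for one map task.
pairs = map_words(["a a b", "a b"])
combined = combine(pairs)
```

Here five map output records shrink to two, and the reduce side still computes the same totals because summation is associative and commutative; a combiner is only safe for functions with that property.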
Contribution
• China
– Clusters
• Future
– All Hadoop Clusters in 2011
– Applications
• Jobs whose tasks are big
• MapReduce-based warehouse
HCE can save more than 10% of machines
Contribution
• Facebook
– Hive Over HCE
• Implementation
– HiveMapper and HiveReducer
– RC-File RecordReader and RecordWriter
• Performance
– CPU utilization 20%~50% improvement
• Patches to Apache Jira
– http://issues.apache.org/jira/browse/MAPREDUCE-1270
– https://issues.apache.org/jira/browse/MAPREDUCE-2446
Thanks for your Attention
Questions
Source: CSDN Big Data Application Conference slides, 01, Yang Dong: HCE, a MapReduce framework that improves resource utilization.