CSDN Big Data Application Conference PPT, 01, Yang Dong: HCE, a MapReduce Framework That Improves Resource Utilization


HCE: A MapReduce Framework towards Improving Resource Utilization
Yang Dong
yangdonglee@gmail.com

About Me
• Research Area
  – Distributed Storage System
    • HDFS
    • Hypertable
  – Distributed Computing System
    • MapReduce
    • DataStream

Agenda
• Background and Motivation
• Framework Model
• Evaluation
• Conclusion
• Q&A

Agenda
• Background and Motivation
  – State of the Art
  – Challenge
  – Solution
• Framework Model
• Evaluation
• Conclusion
• Q&A

State of the Art
• 50000+ jobs
• 10000+ nodes
• 10 PB+ data processed per day
• How to improve the efficiency of clusters?
• How to improve development efficiency?
• How to satisfy customer requirements?
• How to control and maintain?

Challenge
• Resource Utilization
  – Job optimization
    • Resource Scheduling
    • Dynamic Configuration
  – Task optimization
    • Framework optimization for small tasks
    • User program optimization for big tasks

Challenge
• Cluster Status
  – Most tasks are small
    • 80% of map tasks take < 1 min
    • Number of map tasks ≈ 2 × number of reduce tasks
  → Framework optimization is important
• MapReduce Users
  – Streaming interface is popular
    • Development efficiency, e.g. C++, scripts
    • User program is independent
  → User program optimization is needed

Challenge
• Hadoop MapReduce
  – Task Runtime
    • Java for cross-platform
    • Multi-level handling for extensibility
    • JNI-based compression
    • User programs are independent of framework

Solution
• Our Goal
  – Improve resource utilization of clusters
    • Optimizing task framework
    • Optimizing user programs
  – Improve development efficiency of engineers
    • Multiple programming interfaces

Agenda
• Background and Motivation
• Framework Model
  – Overview
  – Function Model
  – Process Model
  – Language Model
  – HCE versus Hadoop
• Evaluation
• Conclusion
• Q&A

Overview
[Figure: framework overview]

MapReduce Phases
• Map task
  – Reading
  – Map Processing
  – Spilling
  – Merging
  – Committing
• Reduce task
  – Shuffling
  – Sorting
  – Reduce Processing
  – Writing
  – Committing

MapReduce Phases
[Figure: MapReduce phases]

Function Model
[Diagram: layered function model]
• APIs: Java, C++, Python, Streaming, PHP
• Translators Layer: Compiler Optimization, Code Translator
• Execution Layer: Reader, Writer, Partitioner, Combiner, Committer, Mapper, Reducer
• Access Layer: File Formats, Compress Libs, Storage APIs

Process Model
• Data-Flow
[Diagram labels: Java, C++, HADOOP, RunTask, HceSubmitter, HceMapRunner, HceInputFormat, LineRecordReader, Mapper, MapOutputCollector, IFileWriter, HceReduceRunner, IFileReader, ReduceInputReader, Reducer, LineRecordWriter, HceOutputCommitter, HadoopOutputCommitter, Shuffle & Sort, Status/Progress/Counters, HDFS, LocalFS, file.out, map.out]

Process Model
• Streaming Over HCE
[Diagram, Hadoop Streaming: TaskTracker node; TaskTracker launches Child JVM; MapTask or ReduceTask; User Process; input and output key/values over stdin and stdout]
[Diagram, HCE: TaskTracker node; TaskTracker launches Child JVM (Proxy); MapTask or ReduceTask; User C++ Process with libhce; commands and status/progress over a socket; input and output key/values]
[Diagram, Streaming over HCE: TaskTracker node; TaskTracker launches Child JVM (Proxy); MapTask or ReduceTask; Streaming C++ Process; commands and status/progress over a socket; User Process; input and output key/values over stdin and stdout]

Process Model
• Python Over HCE
[Diagram, Python over HCE: TaskTracker node; TaskTracker launches Child JVM (Proxy); MapTask or ReduceTask; Python C++ Process interprets the user Python file; commands and status/progress over a socket; input and output key/values]
[Diagram, HCE: TaskTracker node; TaskTracker launches Child JVM (Proxy); MapTask or ReduceTask; User C++ Process with libhce; commands and status/progress over a socket; input and output key/values]

Process Model
• HCE Over SSE
  – SSE Instruction Set
    • Functions
      – memcmp, strcmp, strncmp, strlen, strchr, memcpy, memmove, strcpy, memset …
    • Performance
      – CRC32 16x, memcmp 3.4x, strcmp 3.5x, strncmp 14x, strchr/strnchr 2.5x, strncpy 3x, memcpy 1.3x
  – Optimization
    • Framework 10%
    • User programs

Language Model
Programming Interface | Optimization | Performance | Development | Usage
Java (Hadoop) | - | - | Java | Hive
Streaming (Hadoop) | Does not consider user programs | 5% vs. Java | Read or write stdin & stdout | Common jobs
C++ (HCE) | Framework optimization, user program optimization | 5%~30% vs. Java | Libhce (C++ library) | Data warehouse or big-task jobs
Streaming (HCE) | Framework optimization | 10%~30% vs. Streaming (Java) | Read or write stdin & stdout | Common jobs
Python (HCE) | Framework optimization, user program optimization | 10%~30% vs. Streaming (Java) | Libpyhce (Python library) | Python jobs

HCE versus Hadoop
• Hadoop
  – 2 kinds of programming interface
  – Difficult to support other storage systems
  – JNI-based compression
  – No compiler optimization for MapReduce framework
  – No compiler optimization for user program
  – Memory control by Java GC
  – Quick Sort
  – Difficult to implement combiner with streaming interface
• HCE
  – 4-5 kinds of programming interface
  – Easy to support other storage systems, e.g. Hypertable
  – Direct native compression
  – Static compiler optimization for MapReduce framework
  – Static compiler optimization for user program
  – Memory control by framework
  – Bucket Sort + Quick Sort
  – Easy to implement combiner

Agenda
• Background and Motivation
• Framework Model
• Evaluation
  – Benchmark
  – Application
• Conclusion
• Q&A

Benchmark
• Map Task Timings
[Figure: "Wordcount Map Timings", bar chart comparing Hadoop and HCE; y-axis: Execution Time (sec), 0 to 200; phases: MERGE, SPILL, CLEANUP, COLLECT, MAP, READ, SETUP]

Benchmark
• Compression impacts
  – 100 GB, 10 nodes
[Figure: "WordCount Performance with Different Compression Strategies", Hadoop Streaming vs. HCE Streaming; y-axis: Execution Time (sec), 0 to 1200]
Bonus! SSE can improve performance by a further 10%

Application
• Language impacts
[Figure: APP1, Hadoop Streaming vs. HCE Streaming vs. HCE Python; y-axis: Execution Time (sec), 0 to 60]
• SSE impacts
[Figure: APP2, Hadoop Streaming vs. HCE Streaming vs. User SSE; y-axis: Execution Time (sec), 0 to 70]

Agenda
• Background and Motivation
• Framework Model
• Evaluation
• Conclusion
  – How to Optimize Jobs
  – Contribution
• Q&A

How to Optimize Jobs
• Decrease the number of reduce tasks by combiner
• Use C++ programming interface
• Improve the efficiency of tasks by compiler
• Use LZO/QuickLZ compression strategy for map tasks

Contribution
• China
  – Clusters
    • Future
      – All Hadoop clusters in 2011
  – Applications
    • Jobs whose tasks are big
    • MapReduce-based warehouse
HCE can save more than 10% of machines

Contribution
• Facebook
  – Hive Over HCE
    • Implementation
      – HiveMapper and HiveReducer
      – RC-File RecordReader and RecordWriter
    • Performance
      – CPU utilization 20%~50% improvement
• Patches to Apache Jira
  – http://issues.apache.org/jira/browse/MAPREDUCE-1270
  – https://issues.apache.org/jira/browse/MAPREDUCE-2446

Thanks for your Attention

Questions
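
Example: streaming-style word count with a combiner
The Language Model comparison describes the streaming interface as reading and writing stdin and stdout, and "How to Optimize Jobs" recommends a combiner. The slides show no code, so the following is only a minimal Python sketch of that idea, assuming plain text lines as input and tab-separated "word, count" records as output. The names map_words, combine, reduce_counts, and run are illustrative and are not part of libhce, Libpyhce, or Hadoop Streaming.

```python
from collections import Counter
from itertools import groupby


def map_words(lines):
    """Map step: emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def combine(pairs):
    """Combiner: sum counts locally so fewer records leave the map task."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    # Sorted output mimics the key-ordered stream a reducer would receive.
    return sorted(totals.items())


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run(stdin, stdout):
    """Streaming-style driver (hypothetical): text in, 'word<TAB>count' out."""
    for word, count in reduce_counts(combine(map_words(stdin))):
        stdout.write(f"{word}\t{count}\n")
```

Calling run(sys.stdin, sys.stdout) would turn the sketch into a stdin/stdout filter. The point of the combiner is that five map records for two distinct words shrink to two records before anything is passed on to the reduce side.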

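Example: bucket sort followed by quick sort
The "HCE versus Hadoop" comparison lists "Bucket Sort + Quick Sort" for HCE against plain quick sort for Hadoop, but the slides do not describe how the two are combined. One common arrangement, shown here purely as an assumption, distributes keys into buckets by their first byte and then quick-sorts each bucket on its own. The function names are hypothetical and the sketch is not taken from the HCE source.

```python
def quick_sort(keys):
    """Plain (not in-place) quicksort on a list of byte strings."""
    if len(keys) <= 1:
        return list(keys)
    pivot = keys[len(keys) // 2]
    smaller = [k for k in keys if k < pivot]
    equal = [k for k in keys if k == pivot]
    larger = [k for k in keys if k > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)


def bucket_then_quick_sort(keys):
    """Bucket keys by first byte, then quick-sort inside each bucket."""
    buckets = [[] for _ in range(257)]  # bucket 0 holds empty keys
    for key in keys:
        index = key[0] + 1 if key else 0
        buckets[index].append(key)
    result = []
    for bucket in buckets:  # buckets are visited in ascending first-byte order
        result.extend(quick_sort(bucket))
    return result
```

Because byte strings compare lexicographically, concatenating the sorted buckets in first-byte order yields a fully sorted list, and each quick sort works on a much smaller input than the whole key set.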