Big Data Beyond Hadoop
Real-Time Analytical Processing (RTAP) Using
Spark and Shark
Jason Dai
Engineering Director & Principal Engineer
Intel Software and Services Group
CCF YOCSEF
Shanghai
Agenda
Big Data beyond Hadoop
Introduction to Spark and Shark
Case study: real-time analytical processing (RTAP)
Big Data beyond Hadoop
Big Data today
• The elephant is in the room
Big Data beyond Hadoop
• Real-time analytical processing (RTAP)
– Discover and explore data iteratively and interactively for real-time insights
• Advanced machine learning and data mining (MLDM)
– Graph-parallel predictive analytics (non-SQL)
• Distributed in-memory analytics
– Exploit available main memory in the entire cluster for >100x speedup
RTAP: Real-Time Analytical Processing
Real-Time Analytical Processing (RTAP)
• Data ingested & processed in a streaming fashion
• Real-time data queried and presented in an online fashion
• Real-time and historical data combined and mined interactively
• Predominantly RAM-based processing
Advanced, Graph-Parallel MLDM
Advanced machine learning and data mining (MLDM)
• Information retrieval (e.g., page rank)
• Recommendation engine (e.g., ALS)
• Social network analysis (e.g., clustering)
• Natural language processing (e.g., NER)
• …
Graph parallel computations
• A sparse graph G(V, E)
• A vertex program P runs on each vertex in parallel & repeatedly
• Vertices interact along edges
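The vertex-program model above can be sketched in a few lines. This is a minimal, single-process illustration using PageRank as the vertex program; the function name `pagerank`, the damping factor, and the three-vertex graph are assumptions made for the example, not anything taken from Pregel or GraphLab.

```python
from typing import Dict, List


def pagerank(out_edges: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 20) -> Dict[str, float]:
    """Run a PageRank vertex program in synchronous rounds.

    Assumes every edge target is also a key of `out_edges` and that no
    vertex is dangling (has zero out-edges); a real system must handle both.
    """
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Vertices interact along edges: each one sends an equal share
        # of its current rank to every neighbor it points to.
        inbox = {v: 0.0 for v in vertices}
        for v, targets in out_edges.items():
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
        # The vertex program itself: combine incoming messages into a new value.
        rank = {v: (1.0 - damping) / n + damping * inbox[v] for v in vertices}
    return rank


# Hypothetical sparse graph G(V, E) given as adjacency lists.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The loop here is bulk-synchronous for simplicity; dynamically prioritized scheduling, as in GraphLab, would instead update only the vertices whose inputs changed.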
Advanced, Graph-Parallel MLDM
Data-Parallel (MapReduce)
• Independent data
• Single-pass
• (Bulk) synchronous
Graph-Parallel (Pregel/GraphLab)
• (Sparse) data dependence
• Iterative
• Dynamically prioritized
10x~100x speedup
• Exploit graph structure to reduce computation & communications
• Efficient graph partition to balance computation/storage, and minimize network transfer
Distributed In-Memory Analytics
Memory is king
• 64GB/node mainstream, 192GB not uncommon, fast cheap NVRAM on the horizon
Hadoop is an inherently disk-based architecture
• A full table scan in Hive from RAM yields only ~40% speedup
• See the main-memory DB literature
Distributed in-memory analytics
• Efficient compute integrated with columnar compression
• Reliable RAM-oriented storage layer across the cluster
• Holistic allocation of memory in the cluster
– Inputs, intermediate results, temporary data, computation state, etc.
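One way to read "efficient compute integrated with columnar compression" is that a query can be answered on the compressed column without decompressing it first. The sketch below uses run-length encoding as an example codec; the helper names and the sample column are hypothetical, and this is not a description of how any particular engine stores its data.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Run = Tuple[str, int]  # (value, number of consecutive repetitions)


def rle_encode(column: Sequence[str]) -> List[Run]:
    """Compress a column into runs of identical adjacent values."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs: Sequence[Run]) -> List[str]:
    """Expand runs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


def count_where_equal(runs: Sequence[Run], target: str) -> int:
    """Answer COUNT(*) WHERE col = target directly on the compressed runs."""
    return sum(length for value, length in runs if value == target)


# Hypothetical column of country codes with long runs of repeated values.
column = ["cn"] * 3 + ["us"] * 2 + ["cn"]
runs = rle_encode(column)
matches = count_where_equal(runs, "cn")
```

The count touches one tuple per run rather than one value per row, which is where the saving in both memory and compute comes from.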
Project Overview
Research & open source projects initiated by AMPLab at UC Berkeley
• Leveraging existing SW stacks (e.g., HDFS, Hive, etc.)
• Moving beyond Hadoop w/ BDAS
– In-memory, real-time data analysis (Spark, Shark, Tachyon, etc.)
– Advanced, graph-parallel machine learning (GraphX, MLBase, etc.)
• Intel China collaborating with AMPLab on joint open source development
• Active communities and early adopters evolving
– Spark Apache incubator proposal @ https://wiki.apache.org/incubator/SparkProposal
Berkeley Data Analytics Stack (BDAS)
https://amplab.cs.berkeley.edu/
http://spark-project.org/
http://shark.cs.berkeley.edu/
What is Spark?
A distributed, in-memory, real-time data processing framework
• A general, efficient, Dryad-like engine
– A superset of MapReduce, compatible with Hadoop’s storage APIs, but up to 40x faster than Hadoop
– Avoid launching multiple chained MR jobs or storing intermediate results on HDFS
[Figure: example Spark job as a DAG of RDDs A through G, connected by map, union, groupBy, and join operations and split into Stages 1, 2, and 3. Legend: RDD (resilient distributed dataset); partition in an RDD; cached data partition.]
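The idea of chaining operators without materializing intermediate results, and of caching a working set in memory, can be sketched with a toy class. `ToyRDD` is a hypothetical, single-process stand-in written for this example; it is not Spark's API and it ignores partitioning, stages, and fault recovery.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """A lazy dataset that remembers how to compute itself (its lineage)."""

    def __init__(self, compute: Callable[[], List], name: str) -> None:
        self._compute = compute
        self.name = name
        self._cache_enabled = False
        self._cache: Optional[List] = None
        self.evaluations = 0  # how many times the lineage was actually run

    @classmethod
    def from_list(cls, data: Iterable, name: str = "source") -> "ToyRDD":
        items = list(data)
        return cls(lambda: list(items), name)

    def map(self, fn: Callable) -> "ToyRDD":
        # Nothing runs here; the new dataset only records its parent and fn.
        return ToyRDD(lambda: [fn(x) for x in self.collect()], f"map({self.name})")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)],
                      f"filter({self.name})")

    def cache(self) -> "ToyRDD":
        self._cache_enabled = True
        return self

    def collect(self) -> List:
        if self._cache is not None:
            return list(self._cache)
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cache = result
        return list(result)


squares = ToyRDD.from_list(range(5)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
first = evens.collect()     # triggers the whole chain once
second = squares.collect()  # served from the in-memory cache
```

Because each dataset keeps only a recipe, a lost result can be rebuilt by re-running that recipe, which is the intuition behind lineage-based recovery.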
What is Spark?
A distributed, in-memory, real-time data processing framework
• Extremely low latency
– Optimized for tasks as short as 100s of milliseconds
– Speed of MPP and/or in-memory databases (i.e., interactive queries), but with finer-grained fault recovery
• Efficient in-memory, real-time computing
– Allow working set to be cached in memory, with graceful degradation under low memory
– Efficient support for real-time and/or iterative data analysis
– Interactive, streaming, iterative, graph-parallel, etc.
What is Shark?
A Hive-compatible data warehouse on Spark
• Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
– Shark/Spark specific optimizations (hash- and memory-based shuffle, data co-partitioning, etc.)
– Up to 40x faster than Hive, and supports interactive queries
• Allows tables to be cached in memory for online & iterative mining
• Integration with Spark to combine SQL and machine learning algorithms
Use Cases
Ad-hoc & interactive queries
• Allow close to sub-second latency
– E.g., similar to Dremel & Impala (but with fine-grained fault-tolerance)
In-memory, real-time analysis
• Load data (reliably) in distributed memory for online analysis
– E.g., similar to PowerDrill
Iterative, graph-parallel analysis (esp. machine learning)
• Cache intermediate results in memory for iterative machine learning
• Graph-parallel computing (e.g., Pregel and GraphLab models) on Spark
Use Cases
Stream processing
• Spark streaming
– Run streaming computation as a series of very small, deterministic batch jobs
– As frequent as ~1/2 second
– Better fault tolerance, straggler handling & state consistency
– Potentially combine batch, interactive & streaming workloads
[Figure: Spark Streaming model. For each interval (time = 0 - 1, time = 1 - 2, …), the input stream delivers input as an immutable distributed dataset (replicated in memory); batch operations turn it into state / output, an immutable distributed dataset stored in memory as an RDD, forming the state stream.]
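Running a streaming computation as a series of small, deterministic batch jobs can be sketched as follows. This is a plain-Python illustration, not the Spark Streaming API; the `Event` shape, the helper names, and the half-second interval used in the example are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key), a hypothetical shape


def to_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Cut the input stream into consecutive fixed-length intervals."""
    by_index: Dict[int, List[str]] = {}
    for ts, key in events:
        by_index.setdefault(int(ts // interval), []).append(key)
    if not by_index:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [by_index.get(i, []) for i in range(min(by_index), max(by_index) + 1)]


def run_stream(batches: Iterable[List[str]]) -> List[Dict[str, int]]:
    """Apply one deterministic batch job per interval, carrying state forward."""
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        # New state depends only on the old state and this batch, so a lost
        # result can be recomputed from the same inputs.
        state = state + Counter(batch)
        snapshots.append(dict(state))
    return snapshots


events = [(0.1, "a"), (0.4, "b"), (0.7, "a"), (1.2, "a")]
batches = to_batches(events, interval=0.5)
snapshots = run_stream(batches)
```

Each snapshot is one element of the state stream; because old states are never mutated, batch and interactive jobs can read them without coordinating with the streaming job.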
RTAP Architecture
[Figure: RTAP architecture. Pipeline: Event Logs → Messaging / Queue → Stream Processing → RAM Store → Low Latency Query Engine → Online Analysis / Dashboard and Interactive Query / BI, backed by Persistent Storage (NoSQL, Data Warehouse). Labeled flows: data stream; denormalized, aggregated results; history table; real-time data; ad-hoc queries; lightweight queries. Latencies: data latency 5~10 seconds; online query latency sub-second; interactive query latency <5 seconds.]
We are partnering with several web sites on building the RTAP framework using Spark & Shark
RTAP Use Cases
Online dashboard
• Pages/Ads/Videos/Items: time-based aggregations, broken down by category/demographics
Interactive BI
• Combined with history & dimension data when necessary
– E.g., top 100 viewed videos under each category in the last month
[Figure: sample online dashboard]
Category: /vehicle/car
Name      …    Views in last 30s
Sports    …    500002
Jeep      …    430045
…         …    …
[Chart: "Top 10 Viewed Category", view count for Sports, Jeep, Family; time ranges: 30s, Minute, Hour]
RTAP Framework using Spark & Shark
[Figure: RTAP framework. Event Logs → Kafka (messaging / queue) → Spark Streaming (stream processing) → In-Memory Shark Table (RAM store) → Shark (query engine) → Online Analysis / Dashboard and Interactive Query / BI; Shark Tables on HDFS provide persistent storage.]
A work in progress
Real-Time Data Stream Processing
[Figure: Event Logs → Kafka (messaging / queue) → Spark Streaming (stream processing) → In-Memory Shark Table (RAM store); Shark Tables on HDFS (persistent storage).]
Logs streamed into Spark Streaming through Kafka in real time
Incoming logs processed by Spark Streaming in small batches (e.g., 5 seconds)
• Compute multiple aggregations over logs received in the last window
• Join logs and history tables when necessary
Plan to add streaming support directly in Shark
• Raw click stream
– 0.6.38.68 - - BAF42487E0C7076CE576FAAB0E1852EC [14/Dec/2012 8:21:16 -0] "GET ?video=8745 HTTP/1.1" 101 1345 http://www.foo.com/bar/?ivideo=8745 "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
• Compute page views in the last minute
– E.g., www.foo.com/bar/?video=8745, www.foo.com/bar, www.foo.com, etc.
• Compute category view counts in the last minute
– E.g., join logs and the video table (assuming video 8745 belongs to /vehicle/car/sports) for /vehicle, /vehicle/car, /vehicle/car/sports, etc.
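The two aggregations above can be sketched for a single window of already-parsed URLs. This is a plain-Python illustration under assumptions: the function names are hypothetical, the video table is modeled as a dict, and the second video and its category are invented sample data.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def url_rollups(url: str) -> List[str]:
    """Return the host, each path prefix, and the full URL with its query."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    keys = [parts.netloc]
    prefix = parts.netloc
    for segment in (s for s in parts.path.split("/") if s):
        prefix += "/" + segment
        keys.append(prefix)
    if parts.query:
        keys.append(f"{parts.netloc}{parts.path}?{parts.query}")
    return keys


def category_rollups(category: str) -> List[str]:
    """'/vehicle/car/sports' -> ['/vehicle', '/vehicle/car', '/vehicle/car/sports']."""
    segments = [s for s in category.split("/") if s]
    return ["/" + "/".join(segments[:i]) for i in range(1, len(segments) + 1)]


def video_id(url: str) -> Optional[str]:
    ids = parse_qs(urlsplit(url).query).get("video")
    return ids[0] if ids else None


def aggregate(urls: Iterable[str],
              video_category: Dict[str, str]) -> Tuple[Counter, Counter]:
    """Count page views and category views for one window of requests."""
    page_views: Counter = Counter()
    category_views: Counter = Counter()
    for url in urls:
        page_views.update(url_rollups(url))
        # The "join" with the video table is a dictionary lookup here.
        category = video_category.get(video_id(url) or "")
        if category:
            category_views.update(category_rollups(category))
    return page_views, category_views


window = ["http://www.foo.com/bar/?video=8745",
          "http://www.foo.com/bar/?video=8745",
          "http://www.foo.com/baz/?video=1"]
videos = {"8745": "/vehicle/car/sports", "1": "/vehicle/car/jeep"}
page_views, category_views = aggregate(window, videos)
```

Counts for "the last minute" would then come from summing the per-window counters of the most recent batches, for example twelve 5-second batches.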
Real-Time Data Store and Query Engine
[Figure: Spark Streaming (stream processing) → In-Memory Shark Table (RAM store) → Shark (query engine); Shark Tables on HDFS (persistent storage).]
Aggregation results written to Shark table cached in memory
• Currently output as cached RDD by Spark Streaming
– Requires Spark Streaming embedded in the Shark server JVM
• Plan to move to Tachyon for better sharing and fault tolerance
Both real-time aggregations and history data queried through Shark
• History data loaded into memory for iterative mining
• Working on query optimizations & standard SQL-92 support
Online and Interactive Queries
[Figure: In-Memory Shark Table (RAM store) and Shark Tables on HDFS (persistent storage) → Shark (query engine) → Interactive Query / BI and Online Analysis / Dashboard.]
Online analysis
• A lightweight UI in front of Shark for the online dashboard
• Mostly time-based lightweight queries (filtering, ordering, TopN, aggregations, etc.) with sub-second latency
Interactive query / BI
• Ad-hoc, (more) complex SQL queries (with <5 seconds latency)
• Heavily denormalized to eliminate joins as much as possible
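A time-based TopN query over a denormalized table might look like the sketch below. SQLite's in-memory mode is used purely as a runnable stand-in for a cached Shark table; the table name `category_views`, its columns, and the sample rows are all hypothetical, and HiveQL syntax may differ in detail.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical denormalized table: one row per (window, category), with no
# join needed at query time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category_views (window_start INTEGER, category TEXT, views INTEGER)"
)
conn.executemany(
    "INSERT INTO category_views VALUES (?, ?, ?)",
    [
        (0, "/vehicle/car/sports", 500),
        (0, "/vehicle/car/jeep", 430),
        (30, "/vehicle/car/sports", 520),
        (30, "/vehicle/car/jeep", 610),
        (30, "/vehicle/car/family", 90),
    ],
)


def top_categories(db: sqlite3.Connection, since: int, limit: int) -> List[Tuple[str, int]]:
    """Filter by time, aggregate, order, and keep the top N categories."""
    cursor = db.execute(
        "SELECT category, SUM(views) AS total "
        "FROM category_views "
        "WHERE window_start >= ? "
        "GROUP BY category "
        "ORDER BY total DESC, category "
        "LIMIT ?",
        (since, limit),
    )
    return cursor.fetchall()


latest_top2 = top_categories(conn, since=30, limit=2)
overall_top3 = top_categories(conn, since=0, limit=3)
```

Because the category path is stored on every row, the query is a single scan with a filter and a group-by, which is the kind of shape that keeps dashboard latency low.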
Summary
Big Data beyond Hadoop
• Real-Time Analytical Processing
• Graph-Parallel MLDM
• Distributed In-Memory Analysis
BDAS: one stack to rule them all!
• Intel China collaborating with UC Berkeley & web sites on production deployment
• Active communities and early adopters evolving (e.g., Spark Apache incubator proposal)
Call to action
• Work with us on next-gen Big Data beyond Hadoop using Spark/Shark
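2013 Intel® Software College Course Overview
Intel plans to hold a Big Data faculty workshop in September. Teachers interested in participating, please contact:
hai.shen@intel.com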