首页 数据存储和查找

数据存储和查找

举报
开通vip

数据存储和查找 THUIRDB :A Large-Scale, Highly- Efficient Index, Fast-Access Key-Value Store 10billion records,1bit index per record,1million/sec throughput in 1 machine 梁斌 清华大学计算机科学与技术系 http://www.thuir.org/thuirdb/ mgigabyte@gmail.com Structure of Report • Int...

数据存储和查找
THUIRDB :A Large-Scale, Highly- Efficient Index, Fast-Access Key-Value Store 10billion records,1bit index per record,1million/sec throughput in 1 machine 梁斌 清华大学计算机科学与技术系 http://www.thuir.org/thuirdb/ mgigabyte@gmail.com Structure of Report • Introduction • Requirement and motivation • Background review • Problems • Problems for binary-search • Problems for B+-tree • Problems for hash • Problems for building database • Rational • Separating key and value by location • Sorting then linear write • Data lay-out • Building the index bottom-up • Highly-efficiency compressing Requirement and motivation • A special requirement for key-value store: (1)Large scale, billions of record (2)Random query, hundreds of random query for a task (3)Static dataset, once built never changed, just read no update (4) Low cost, sometimes should be built in a machine • Pratical application situation: (1)log-based analysis systems (2)language-model based machine translation Backgroud Review • [Tobin 1986] issue an new data structure:T- tree . T-tree inherits from both B-tree and AVL-tree and is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster. • [Philip 2001]pk-Ttree and pkB-tree add index compression. • [1998 Rao Jun]Rao created CSS-tree, and developed to CSB+ tree and CSS+tree Background Review Google introduced LMS-tree and multi-level skip list based key-value store called LevelDB in 2011. http://blog.nosqlfan.com/html/3041.html From Alon Efrat’s ppt 关于艾滋病ppt课件精益管理ppt下载地图下载ppt可编辑假如ppt教学课件下载triz基础知识ppt of skip list Problems for binary-search • Importance of data is out of control – 9 is the key and index key, very important – 8 and 10 is only key, not important, but have been put into L1 cache with 9 unfortunately • Poor Data reference locality Problems for B+-tree (1)Each node stores child pointers and Each node 50% full. (2)Data write is low-efficiency. (3)Data write is not sequential Problems for Hash • High space overhead • Hard to compress • Can’t support ordered access • Cant’s support range-query Problems for building database • Both B+-tree-based and Hash-based databases are in trouble with memory-shortage, why? • Why them need so much memory in hand? Why not sync these dirty data to disk from buffer? Rational of THUIR-DB • (1) separating key and value by location • (2) sorting the key-value pairs, linear write them • (3) build the index bottom-up • (4) pointer eliminating • (5)highly-efficiency compressing Separation of index and data • Insert: location = store_s.put(value) search_s.insert(key,location) • query: location = search_s.find(key) value = store_s.get(location) • Separating index and data makes the index smaller, provides chances to apply different compression methods and some other tricks, such as data reordering. • In a word, Indirection provides flexibility Variable-length key and value • insert: location = store_s.put(key,value); inner_key = md5(key) search_s.insert(inner_key,location); • query: inner_key = md5(selected_key) locations = search_s.find(inner_key) array()=store_s.get(locations) for each item in arry if(keyi = selected_key) return valuei • Certainty is the base of optimizing linear inserting, building index bottom-up advantage: (1)important data constructed together(cache conscious) (2)write data sequentially (3)full-write each node (4) set the stage for compression Lay-out Compression of Integer Sequence • 1)if we known value-range in advance a give integer sequence S,all integer in the range [0,L],then it cost bits to save them all • 2)if we know it is ordered . use difference to make the smaller, such as 1,4,6,15,25,40, after differencing, we get 1,3,2,9,10,15,if all integer in the new sequence is in the range [0,L’],then it cost bits. • 3)sometimes there are very large number, such as 1,4,6,15,25,10000,100001, use exception block to save them, Pfordelta and New-Pfordelta solve the problem very well. 2* logS L   2* log 'S L   Adaptive-compression • If we have the integer sequence like these. 1 1 2 3 15 16 16 17 19 20 21 22… • Try to find a economical way to compress 1 1 2 3 [1bit gap] -> X bit /record 1 1 2 3 15 16 17.. [4bit gap]-> Y bit/record If X
本文档为【数据存储和查找】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑, 图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。
下载需要: 免费 已有0 人下载
最新资料
资料动态
专题动态
is_655024
暂无简介~
格式:pdf
大小:1MB
软件:PDF阅读器
页数:27
分类:互联网
上传时间:2012-11-03
浏览量:30