数据存储和查找下载_在线阅读_27

THUIRDB :A Large-Scale, Highly- Efficient Index, Fast-Access Key-Value Store 10billion records,1bit index per record,1million/sec throughput in 1 machine 梁斌清华大学计算机科学与技术系 http://www.thuir.org/thuirdb/ mgigabyte@gmail.com Structure of Report • Introduction • Requirement and motivation • Background review • Problems • Problems for binary-search • Problems for B+-tree • Problems for hash • Problems for building database • Rational • Separating key and value by location • Sorting then linear write • Data lay-out • Building the index bottom-up • Highly-efficiency compressing Requirement and motivation • A special requirement for key-value store：（1）Large scale, billions of record （2）Random query, hundreds of random query for a task （3）Static dataset, once built never changed, just read no update （4） Low cost, sometimes should be built in a machine • Pratical application situation：（1）log-based analysis systems （2）language-model based machine translation Backgroud Review • [Tobin 1986] issue an new data structure:T- tree . T-tree inherits from both B-tree and AVL-tree and is used by main-memory databases, such as Datablitz, eXtremeDB, MySQL Cluster. • [Philip 2001]pk-Ttree and pkB-tree add index compression. • [1998 Rao Jun]Rao created CSS-tree, and developed to CSB+ tree and CSS+tree Background Review Google introduced LMS-tree and multi-level skip list based key-value store called LevelDB in 2011. http://blog.nosqlfan.com/html/3041.html From Alon Efrat’s ppt of skip list Problems for binary-search • Importance of data is out of control – 9 is the key and index key, very important – 8 and 10 is only key, not important, but have been put into L1 cache with 9 unfortunately • Poor Data reference locality Problems for B+-tree (1)Each node stores child pointers and Each node 50% full. (2)Data write is low-efficiency. (3)Data write is not sequential Problems for Hash • High space overhead • Hard to compress • Can’t support ordered access • Cant’s support range-query Problems for building database • Both B+-tree-based and Hash-based databases are in trouble with memory-shortage, why? • Why them need so much memory in hand? Why not sync these dirty data to disk from buffer? Rational of THUIR-DB • (1) separating key and value by location • (2) sorting the key-value pairs, linear write them • (3) build the index bottom-up • (4) pointer eliminating • (5)highly-efficiency compressing Separation of index and data • Insert: location = store_s.put(value) search_s.insert(key,location) • query: location = search_s.find(key) value = store_s.get(location) • Separating index and data makes the index smaller, provides chances to apply different compression methods and some other tricks, such as data reordering. • In a word, Indirection provides flexibility Variable-length key and value • insert： location = store_s.put(key,value); inner_key = md5(key) search_s.insert(inner_key,location); • query： inner_key = md5(selected_key) locations = search_s.find(inner_key) array()=store_s.get(locations) for each item in arry if(keyi = selected_key) return valuei • Certainty is the base of optimizing linear inserting, building index bottom-up advantage：（1）important data constructed together(cache conscious) （2）write data sequentially （3）full-write each node （4） set the stage for compression Lay-out Compression of Integer Sequence • 1）if we known value-range in advance a give integer sequence S，all integer in the range [0，L],then it cost bits to save them all • 2）if we know it is ordered . use difference to make the smaller, such as 1,4,6,15,25,40, after differencing, we get 1,3,2,9,10,15，if all integer in the new sequence is in the range [0,L’],then it cost bits. • 3）sometimes there are very large number, such as 1,4,6,15,25,10000,100001, use exception block to save them, Pfordelta and New-Pfordelta solve the problem very well. 2* logS L   2* log 'S L   Adaptive-compression • If we have the integer sequence like these. 1 1 2 3 15 16 16 17 19 20 21 22… • Try to find a economical way to compress 1 1 2 3 [1bit gap] -> X bit /record 1 1 2 3 15 16 17.. [4bit gap]-> Y bit/record If X

                    本文档为【数据存储和查找】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

数据存储和查找

你可能还喜欢