OReilly.Cassandra.The.Definitive.Guide.2010.pdf

Uploaded by: ronnin.lee, 2012-02-14

Summary: This document is OReilly.Cassandra.The.Definitive.Guide.2010.pdf, filed under specialized technical topics. It contains Cassandra: The Definitive Guide by Eben Hewitt.

Cassandra: The Definitive Guide

Eben Hewitt

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Copyright © 2011 Eben Hewitt. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Holly Bauer
Copyeditor: Genevieve d’Entremont
Proofreader: Emily Quill
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
November 2010: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Cassandra: The Definitive Guide, the image of a Paradise flycatcher, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-1-449-39041-9
[M]
1289577822

This book is dedicated to my sweetheart, Alison Brown. I can hear the sound of violins, long before it begins.

Table of Contents

Foreword

Preface

1. Introducing Cassandra
    What’s Wrong with Relational Databases?
    A Quick Review of Relational Databases
    RDBMS: The Awesome and the Not-So-Much
    Web Scale
    The Cassandra Elevator Pitch
    Cassandra in 50 Words or Less
    Distributed and Decentralized
    Elastic Scalability
    High Availability and Fault Tolerance
    Tuneable Consistency
    Brewer’s CAP Theorem
    Row-Oriented
    Schema-Free
    High Performance
    Where Did Cassandra Come From?
    Use Cases for Cassandra
    Large Deployments
    Lots of Writes, Statistics, and Analysis
    Geographical Distribution
    Evolving Applications
    Who Is Using Cassandra?
    Summary

2. Installing Cassandra
    Installing the Binary
    Extracting the Download
    What’s In There?
    Building from Source
    Additional Build Targets
    Building with Maven
    Running Cassandra
    On Windows
    On Linux
    Starting the Server
    Running the Command-Line Client Interface
    Basic CLI Commands
    Help
    Connecting to a Server
    Describing the Environment
    Creating a Keyspace and Column Family
    Writing and Reading Data
    Summary

3. The Cassandra Data Model
    The Relational Data Model
    A Simple Introduction
    Clusters
    Keyspaces
    Column Families
    Column Family Options
    Columns
    Wide Rows, Skinny Rows
    Column Sorting
    Super Columns
    Composite Keys
    Design Differences Between RDBMS and Cassandra
    No Query Language
    No Referential Integrity
    Secondary Indexes
    Sorting Is a Design Decision
    Denormalization
    Design Patterns
    Materialized View
    Valueless Column
    Aggregate Key
    Some Things to Keep in Mind
    Summary

4. Sample Application
    Data Design
    Hotel App RDBMS Design
    Hotel App Cassandra Design
    Hotel Application Code
    Creating the Database
    Data Structures
    Getting a Connection
    Prepopulating the Database
    The Search Application
    Twissandra
    Summary

5. The Cassandra Architecture
    System Keyspace
    Peer-to-Peer
    Gossip and Failure Detection
    Anti-Entropy and Read Repair
    Memtables, SSTables, and Commit Logs
    Hinted Handoff
    Compaction
    Bloom Filters
    Tombstones
    Staged Event-Driven Architecture (SEDA)
    Managers and Services
    Cassandra Daemon
    Storage Service
    Messaging Service
    Hinted Handoff Manager
    Summary

6. Configuring Cassandra
    Keyspaces
    Creating a Column Family
    Transitioning from 0.6 to 0.7
    Replicas
    Replica Placement Strategies
    Simple Strategy
    Old Network Topology Strategy
    Network Topology Strategy
    Replication Factor
    Increasing the Replication Factor
    Partitioners
    Random Partitioner
    Order-Preserving Partitioner
    Collating Order-Preserving Partitioner
    Byte-Ordered Partitioner
    Snitches
    Simple Snitch
    PropertyFileSnitch
    Creating a Cluster
    Changing the Cluster Name
    Adding Nodes to a Cluster
    Multiple Seed Nodes
    Dynamic Ring Participation
    Security
    Using SimpleAuthenticator
    Programmatic Authentication
    Using MD5 Encryption
    Providing Your Own Authentication
    Miscellaneous Settings
    Additional Tools
    Viewing Keys
    Importing Previous Configurations
    Summary

7. Reading and Writing Data
    Query Differences Between RDBMS and Cassandra
    No Update Query
    Record-Level Atomicity on Writes
    No Server-Side Transaction Support
    No Duplicate Keys
    Basic Write Properties
    Consistency Levels
    Basic Read Properties
    The API
    Ranges and Slices
    Setup and Inserting Data
    Using a Simple Get
    Seeding Some Values
    Slice Predicate
    Getting Particular Column Names with Get Slice
    Getting a Set of Columns with Slice Range
    Getting All Columns in a Row
    Get Range Slices
    Multiget Slice
    Deleting
    Batch Mutates
    Batch Deletes
    Range Ghosts
    Programmatically Defining Keyspaces and Column Families
    Summary

8. Clients
    Basic Client API
    Thrift
    Thrift Support for Java
    Exceptions
    Thrift Summary
    Avro
    Avro Ant Targets
    Avro Specification
    Avro Summary
    A Bit of Git
    Connecting Client Nodes
    Client List
    Round-Robin DNS
    Load Balancer
    Cassandra Web Console
    Hector (Java)
    Features
    The Hector API
    HectorSharp (C#)
    Chirper
    Chiton (Python)
    Pelops (Java)
    Kundera (Java ORM)
    Fauna (Ruby)
    Summary

9. Monitoring
    Logging
    Tailing
    General Tips
    Overview of JMX and MBeans
    MBeans
    Integrating JMX
    Interacting with Cassandra via JMX
    Cassandra’s MBeans
    org.apache.cassandra.concurrent
    org.apache.cassandra.db
    org.apache.cassandra.gms
    org.apache.cassandra.service
    Custom Cassandra MBeans
    Runtime Analysis Tools
    Heap Analysis with JMX and JHAT
    Detecting Thread Problems
    Health Check
    Summary

10. Maintenance
    Getting Ring Information
    Info
    Ring
    Getting Statistics
    Using cfstats
    Using tpstats
    Basic Maintenance
    Repair
    Flush
    Cleanup
    Snapshots
    Taking a Snapshot
    Clearing a Snapshot
    Load-Balancing the Cluster
    loadbalance and streams
    Decommissioning a Node
    Updating Nodes
    Removing Tokens
    Compaction Threshold
    Changing Column Families in a Working Cluster
    Summary

11. Performance Tuning
    Data Storage
    Reply Timeout
    Commit Logs
    Memtables
    Concurrency
    Caching
    Buffer Sizes
    Using the Python Stress Test
    Generating the Python Thrift Interfaces
    Running the Python Stress Test
    Startup and JVM Settings
    Tuning the JVM
    Summary

12. Integrating Hadoop
    What Is Hadoop?
    Working with MapReduce
    Cassandra Hadoop Source Package
    Running the Word Count Example
    Outputting Data to Cassandra
    Hadoop Streaming
    Tools Above MapReduce
    Pig
    Hive
    Cluster Configuration
    Use Cases
    Raptr.com: Keith Thornhill
    Imagini: Dave Gardner
    Summary

Appendix: The Nonrelational Landscape

Glossary

Index

Foreword

Cassandra was open-sourced by Facebook in July 2008. This original version of Cassandra was written primarily by an ex-employee from Amazon and one from Microsoft. It was strongly influenced by Dynamo, Amazon’s pioneering distributed key/value database. Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.

I became involved in December of that year, when Rackspace asked me to build them a scalable database. This was good timing, because all of today’s important open source scalable databases were available for evaluation.
Despite initially having only a single major use case, Cassandra’s underlying architecture was the strongest, and I directed my efforts toward improving the code and building a community.

Cassandra was accepted into the Apache Incubator, and by the time it graduated in March 2010, it had become a true open source success story, with committers from Rackspace, Digg, Twitter, and other companies that wouldn’t have written their own database from scratch, but together built something important.

Today’s Cassandra is much more than the early system that powered (and still powers) Facebook’s inbox search; it has become “the hands down winner for transaction processing performance,” to quote Tony Bain, with a deserved reputation for reliability and performance at scale.

As Cassandra matured and began attracting more mainstream users, it became clear that there was a need for commercial support; thus, Matt Pfeil and I cofounded Riptano in April 2010. Helping drive Cassandra adoption has been very rewarding, especially seeing the uses that don’t get discussed in public.

Another need has been a book like this one. Like many open source projects, Cassandra’s documentation has historically been weak. And even when the documentation ultimately improves, a book-length treatment like this will remain useful.

Thanks to Eben for tackling the difficult task of distilling the art and science of developing against and deploying Cassandra. You, the reader, have the opportunity to learn these new concepts in an organized fashion.

Jonathan Ellis
Project Chair, Apache Cassandra, and Cofounder, Riptano

Preface

Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.

Cassandra first started as an incubation project at Apache in January of 2009.
Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.

Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a schema-free data model.

Is This Book for You?

This book is intended for a variety of audiences. It should be useful to you if you are:

  • A developer working with large-scale, high-volume websites, such as Web 2.0 social applications
  • An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores
  • A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store
  • A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy
  • A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options

This book is a technical guide. In many ways, Cassandra represents a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well-versed in thinking about data in purely relational or object-oriented terms.
Cassandra’s data model is very different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).

Using Cassandra does not mean that you have to be a Java developer. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Although it’s not strictly necessary to know Java, it can help you to better understand exceptions, how to build the source code, and how to use some of the popular clients. Many of the examples in this book are in Java. But because of the interface used to access Cassandra, you can use Cassandra from a wide variety of languages, including C#, Scala, Python, and Ruby.

Finally, it is assumed that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ivy is used to build Cassandra, and a popular client (Hector) is available via Git. In cases where I speculate that you’ll need to do a little setup of your own in order to work with the examples, I try to support that.

What’s in This Book?

This book is designed with the chapters acting, to a reasonable extent, as standalone guides. This is important for a book on Cassandra, which has a variety of audiences and is changing rapidly. To borrow from the software world, I wanted the book to be “modular” (sort of). If you’re new to Cassandra, it makes sense to read the book in order; if you’ve passed the introductory stages, you will still find value in later chapters, which you can read as standalone guides.
Here is how the book is organized:

Chapter 1, Introducing Cassandra
    This chapter introduces Cassandra and discusses what’s exciting and different about it, who is using it, and what its advantages are.

Chapter 2, Installing Cassandra
    This chapter walks you through installing Cassandra on a variety of platforms.

Chapter 3, The Cassandra Data Model
    Here we look at Cassandra’s data model to understand what columns, super columns, and rows are. Special care is taken to bridge the gap between the relational database world and Cassandra’s world.

Chapter 4, Sample Application
    This chapter presents a complete working application that translates from a relational model in a well-understood domain to Cassandra’s data model.

Chapter 5, The Cassandra Architecture
    This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.

Chapter 6, Configuring Cassandra
    This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices.

Chapter 7, Reading and Writing Data
    This is the moment we’ve been waiting for. We present an overview of what’s different about Cassandra’s model for querying and updating data, and then get to work using the API.

Chapter 8, Clients
    There are a variety of clients that third-party developers have created for many different languages, including Java, C#, Ruby, and Python, in order to abstract Cassandra’s lower-level API. We help you understand this landscape so you can choose one that’s right for you.
Chapter 9, Monitoring
    Once your cluster is up and running, you’ll want to monitor its usage, memory patterns, and thread patterns, and understand its general activity. Cassandra has a rich Java Management Extensions (JMX) interface baked in, which we put to use to monitor all of these and more.

Chapter 10, Maintenance
    The ongoing maintenance of a Cassandra cluster is made somewhat easier by some tools that ship with the server. We see how to decommission a node, load-balance the cluster, get statistics, and perform other routine operational tasks.

Chapter 11, Performance Tuning
    One of Cassandra’s most notable features is its speed: it’s very fast. But there are a number of things, including memory settings, data storage, hardware choices, caching, and buffer sizes, that you can tune to squeeze out even more performance.

Chapter 12, Integrating Hadoop
    In this chapter, written by Jeremy Hanna, we put Cassandra in a larger context and see how to integrate it with the popular implementation of Google’s Map/Reduce algorithm, Hadoop.

Appendix
    Many new databases have cropped up in response to the need to scale at Big Data levels, or to take advantage of a “schema-free” model, or to support more recent initiatives such as the Semantic Web. Here we contextualize Cassandra against a variety of the more popular nonrelational databases, examining document-oriented databases, distributed hashtables, and graph databases, to better understand Cassandra’s offerings.

Glossary
    It can be difficult to understand something that’s really new, and Cassandra has many terms that might be unfamiliar to developers or DBAs coming from the relational application development world, so I’ve includ
