54 · China Manufacturing Informatization (中国制造业信息化) · 2012, Issue 8
Digital Factory / Management
Thinker

6 Top Tools for Taming Big Data
■ Jakob Bjorklund
The industry now has a buzzword, "big data," for how we're
going to do something with the huge amount of information pil-
ing up. "Big data" is replacing "business intelligence," which sub-
sumed "reporting," which put a nicer gloss on "spreadsheets,"
which beat out the old-fashioned "printouts." Managers who long
ago studied printouts are now hiring mathematicians who claim
to be big data specialists to help them solve the same old problem:
What's selling and why?
It's not fair to suggest that these buzzwords are simple re-
placements for each other. Big data is a more complicated world
because the scale is much larger. The information is usually spread
out over a number of servers, and the work of compiling the data
must be coordinated among them. In the past, the work was largely
delegated to the database software, which would use its magical
JOIN mechanism to compile tables, then add up the columns
before handing off the rectangle of data to the reporting software
that would paginate it. This was often harder than it sounds. Da-
tabase programmers can tell you the stories about complicated
JOIN commands that would lock up their database for hours as it
tried to produce a report for the boss who wanted his columns
just so.
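The single-machine workflow described above can be sketched with Python's built-in sqlite3 module. The table and column names here are hypothetical, but the shape is the classic one: JOIN the tables, add up the columns, hand off a rectangle of data.

```python
import sqlite3

# In-memory database standing in for the old single-machine setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales VALUES (1, 9.99), (1, 19.99), (2, 5.00);
""")

# The "magical JOIN": compile the tables, then add up the columns.
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS orders, SUM(s.amount) AS revenue
    FROM products p JOIN sales s ON s.product_id = p.id
    GROUP BY p.name
    ORDER BY revenue DESC
""").fetchall()

for name, orders, revenue in rows:
    print(f"{name}: {orders} orders, ${revenue:.2f}")
```

On a few rows this is instant; on production tables with the wrong indexes, the same query is what locks up the database for hours.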
The game is much different now. Hadoop is a popular tool
for organizing the racks and racks of servers, and NoSQL data-
bases are popular tools for storing data on these racks. These
mechanisms can be much more powerful than the old single
machine, but they are far from being as polished as the old data-
base servers. Although SQL may be complicated, writing the JOIN
query for the SQL databases was often much simpler than gath-
ering information from dozens of machines and compiling it into
one coherent answer. Hadoop jobs are written in Java, and that
requires another level of sophistication. The tools for tackling big
data are just beginning to package this distributed computing
power in a way that's a bit easier to use.
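The map/shuffle/reduce pattern behind those Hadoop jobs can be shown in miniature. This is not a real Hadoop job (those are Java classes submitted to a cluster); it is a single-process sketch of the three phases, using a made-up sales dataset:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs -- here, one quantity per product sold.
    product, qty = record
    yield product, qty

def shuffle(pairs):
    # Group values by key: the step Hadoop performs between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Collapse each key's values into one result.
    return key, sum(values)

records = [("widget", 2), ("gadget", 1), ("widget", 5)]
pairs = [pair for rec in records for pair in map_phase(rec)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(totals)  # widget totals 7, gadget 1
```

On a real cluster, the map and reduce functions run on different machines and the shuffle moves data over the network, which is where most of the extra sophistication comes in.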
The biggest challenge may be dealing with the expecta-
tions built up by the major motion picture "Moneyball." All the
bosses have seen it and absorbed the message that some clever
statistics can turn a small-budget team into a World Series winner.
Never mind that the Oakland Athletics never won the World
Series during the "Moneyball" era. That's the magic of Michael
Lewis' prose. The bosses are all thinking, "Perhaps if I can get
some good stats, Hollywood will hire Brad Pitt to play me in the
movie version."
None of the software in this collection will come close to
luring Brad Pitt to ask his agent for a copy of the script for the
movie version of your Hadoop job. That has to come from within
you or the other humans working on the project. Understanding
the data and finding the right question to ask is often much more
complicated than getting your Hadoop job to run quickly. That's
really saying something because these tools are only half of the
job.
To get a handle on the promise of the field, I downloaded
some big data tools, mixed in data, then stared at the answers for
Einstein-grade insight. The information came from the log files of
the website that sells some of my books (wayner.org), and I was
looking for some idea of what was selling and why. So I un-
packed the software and asked the questions.
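The raw material for all of these questions is just text. A minimal sketch of pulling "what's selling" out of web server logs might look like this; the log lines are hypothetical Apache-style entries, not my actual traffic:

```python
import re
from collections import Counter

# Hypothetical Apache-style log lines; real ones come from the web server.
LOG = """\
10.0.0.1 - - [12/Mar/2012:10:02:11 +0000] "GET /books/big-data HTTP/1.1" 200 5120
10.0.0.2 - - [12/Mar/2012:10:05:40 +0000] "GET /books/java HTTP/1.1" 200 4800
10.0.0.1 - - [12/Mar/2012:11:17:03 +0000] "GET /books/big-data HTTP/1.1" 200 5120
"""

# Pull the requested path out of each line and count the hits per page.
pattern = re.compile(r'"GET (\S+) HTTP')
hits = Counter(pattern.search(line).group(1) for line in LOG.splitlines())
for path, count in hits.most_common():
    print(path, count)
```

The tools below do the same kind of aggregation, but at a scale and polish a three-line script cannot match.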
Big data tools: Jaspersoft BI Suite
The Jaspersoft package is one of the open source leaders
for producing reports from database columns. The software is
well-polished and already installed in many businesses turning
SQL tables into PDFs that everyone can scrutinize at meetings.
The company is jumping on the big data train, and this means
adding a software layer to connect its report generating software
to the places where big data gets stored. The JasperReports Server
now offers software to suck up data from many of the major
storage platforms, including MongoDB, Cassandra, Redis, Riak,
CouchDB, and Neo4j. Hadoop is also well-represented, with
JasperReports providing a Hive connector to reach inside
HBase.
This effort feels like it is still starting up -- many pages of
the documentation wiki are blank, and the tools are not fully
integrated. The visual query designer, for instance, doesn't work
yet with Cassandra's CQL. You get to type these queries out by
hand.
Once you get the data from these sources, Jaspersoft's server
will boil it down to interactive tables and graphs. The reports can
be quite sophisticated interactive tools that let you drill down into
various corners. You can ask for more and more details if you
need them.
This is a well-developed corner of the software world, and
Jaspersoft is expanding by making it easier to use these sophisti-
cated reports with newer sources of data. Jaspersoft isn't offering
particularly new ways to look at the data, just more sophisticated
ways to access data stored in new locations. I found this surpris-
ingly useful. The aggregation of my data was enough to make
basic sense of who was going to the website and when they were
going there.
Big data tools: Pentaho Business Analytics
Pentaho is another software platform that began as a report
generating engine; like Jaspersoft, it is branching into big data
by making it easier to absorb information from the new sources.
You can hook up Pentaho's tool to many of the most popular
NoSQL databases such as MongoDB and Cassandra. Once the
databases are connected, you can drag and drop the columns into
views and reports as if the information came from SQL
databases.
I found the classic sorting and sifting tables to be
extremely useful for understanding just who was spending
the most time at my website. Simply sorting by IP address
in the log files revealed what the heavy users were doing.
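That IP-address sort is simple enough to mimic directly. Here is a stand-in for what Pentaho's sortable table does under the hood, with made-up (ip, path) pairs in place of real log entries:

```python
from collections import Counter

# Hypothetical (ip, path) pairs extracted from a web server log.
entries = [
    ("10.0.0.1", "/books/big-data"),
    ("10.0.0.1", "/books/java"),
    ("10.0.0.2", "/books/big-data"),
    ("10.0.0.1", "/cart"),
]

# Count requests per IP; most_common() sorts the busiest visitor first.
by_ip = Counter(ip for ip, _ in entries)
heaviest = by_ip.most_common()
print(heaviest)
```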
Pentaho also provides software for drawing HDFS file data
and HBase data from Hadoop clusters. One of the more intrigu-
ing tools is the graphical programming interface known as either
Kettle or Pentaho Data Integration. It has a bunch of built-in
modules that you can drag and drop onto a picture, then connect
them. Pentaho has thoroughly integrated Hadoop and the other
sources into this, so you can write your code and send it out to
execute on the cluster.
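Conceptually, each icon on the Kettle canvas is a transform and each connecting line feeds one transform's rows into the next. The sketch below is my own minimal stand-in for that idea, not Kettle's actual API; the step names and row fields are invented:

```python
def keep_paid(rows):
    # Filter step: pass through only the rows marked as paid.
    for row in rows:
        if row["paid"]:
            yield row

def add_tax(rows, rate=0.08):
    # Calculator step: derive a new column from an existing one.
    for row in rows:
        yield {**row, "total": round(row["amount"] * (1 + rate), 2)}

def run_pipeline(rows, *steps):
    # Wire the steps together, like connecting icons on the canvas.
    for step in steps:
        rows = step(rows)
    return list(rows)

orders = [{"amount": 10.0, "paid": True}, {"amount": 5.0, "paid": False}]
result = run_pipeline(orders, keep_paid, add_tax)
print(result)
```

The appeal of the graphical version is that non-programmers can rewire the steps; the cost is that the picture hides exactly this kind of code.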
Big data tools: Karmasphere Studio and
Analyst
Many of the big data tools did not begin life as reporting
tools. Karmasphere Studio, for instance, is a set of plug-ins built
on top of Eclipse. It's a specialized IDE that makes it easier to
create and run Hadoop jobs.
I had a rare feeling of joy when I started configuring a
Hadoop job with this developer tool. There are a number of
stages in the life of a Hadoop job, and Karmasphere's tools
walk you through each step, showing the partial results along
the way. I guess debuggers have always made it possible for
us to peer into the mechanism as it does its work, but
Karmasphere Studio does something a bit better: As you set
up the workflow, the tools display the state of the test data at
each step. You see what the temporary data will look like as
it is cut apart, analyzed, then reduced.
Karmasphere also distributes a tool called Karmasphere
Analyst, which is designed to simplify the process of plowing
through all of the data in a Hadoop cluster. It comes with many
useful building blocks for programming a good Hadoop job, like
subroutines for uncompressing Zipped log files. Then it strings
them together and parameterizes the Hive calls to produce a table
of output for perusing.
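A building block like the log-uncompressing subroutine is easy to picture with Python's standard gzip module. This is a generic sketch, not Karmasphere's actual code; the sample log is compressed in memory to stand in for a zipped file on disk:

```python
import gzip
import io

# Compress a sample log in memory to stand in for a zipped log file.
raw = b"10.0.0.1 GET /books/big-data 200\n10.0.0.2 GET /books/java 200\n"
compressed = gzip.compress(raw)

# The building block: uncompress and iterate lines, ready for later stages.
with gzip.open(io.BytesIO(compressed), "rt") as fh:
    lines = [line.strip() for line in fh]
print(len(lines), "lines recovered")
```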
Big data tools: Talend Open Studio
Talend also offers an Eclipse-based IDE for stringing to-
gether data processing jobs with Hadoop. Its tools are designed
to help with data integration, data quality, and data management,
all with subroutines tuned to these jobs.
Talend Studio allows you to build up your jobs by dragging
and dropping little icons onto a canvas. If you want to get an RSS
feed, Talend's component will fetch the RSS and add proxying if
necessary. There are dozens of components for gathering infor-
mation and dozens more for doing things like a "fuzzy match."
Then you can output the results.
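A "fuzzy match" component decides whether two strings that are not identical refer to the same thing. Talend's implementation is its own; a rough stand-in using Python's difflib shows the idea, with an arbitrary similarity threshold:

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    # Ratio is 1.0 for identical strings, near 0.0 for unrelated ones.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(fuzzy_match("Acme Corp.", "ACME Corp"))   # near-duplicates match
print(fuzzy_match("Acme Corp.", "Zenith Ltd"))  # unrelated names do not
```

Matching near-duplicate customer names like these is a large part of the "data quality" work Talend's components are tuned for.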
Stringing together blocks visually can be simple after you
get a feel for what the components actually do and don't do. This
was easier for me to figure out when I started looking at the source
code being assembled behind the canvas. Talend lets you see this,
and I think it's an ideal compromise. Visual programming may
seem like a lofty goal, but I've found that the icons can never
represent the mechanisms with enough detail to make it possible
to understand what's going on. I need the source code.
Talend also maintains TalendForge, a collection of open
source extensions that make it easier to work with the company's
products. Most of the tools seem to be filters or libraries that link
Talend's software to other major products such as Salesforce.com
and SugarCRM. You can suck down information from these sys-
tems into your own projects, simplifying the integration.
Big data tools: Skytree Server
Not all of the tools are designed to make it easier to string
together code with visual mechanisms. Skytree offers a bundle
that performs many of the more sophisticated machine-learning
algorithms. All it takes is typing the right command into a com-
mand line.
Skytree is more focused on the guts than the shiny GUI.
Skytree Server is optimized to run a number of classic machine-
learning algorithms on your data using an implementation the
company claims can be 10,000 times faster than other packages.
It can search through your data looking for clusters of mathemati-
cally similar items, then invert this to identify outliers that may be
problems, opportunities, or both. The algorithms can be more
precise than humans, and they can search through vast quanti-
ties of data looking for the entries that are a bit out of the ordinary.
This may be fraud -- or a particularly good customer who will
spend and spend.
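Skytree's algorithms are proprietary and far more sophisticated, but the outlier idea itself can be shown with a toy z-score check: flag any value that sits unusually many standard deviations from the mean. The spending figures and the threshold here are invented:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.5):
    # Flag values more than `threshold` standard deviations from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# Daily purchase totals; one customer suddenly spends far more than usual.
spend = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9, 40.0, 41.5, 39.9, 400.0]
print(outliers(spend))  # only the 400.0 entry is flagged
```

Whether that flagged entry is fraud or a big spender is exactly the judgment the software leaves to the humans.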
The free version of the software offers the same algorithms
as the proprietary version, but it's limited to data sets of 100,000
rows. This should be sufficient to establish whether the software
is a good match.
Big data tools: Tableau Desktop and Server
Tableau Desktop is a visualization tool that makes it easy
to look at your data in new ways, then slice it up and look at it
in a different way. You can even mix the data with other data
and examine it in yet another light. The tool is optimized to
give you all the columns for the data and let you mix them
before stuffing it into one of the dozens of graphical templates
provided.
Tableau Software started embracing Hadoop several ver-
sions ago, and now you can treat Hadoop "just like you would
with any data connection." Tableau relies upon Hive to structure
the queries, then tries its best to cache as much information in
memory as possible so the tool stays interactive. While many of the
other reporting tools are built on a tradition of generating the
reports offline, Tableau wants to offer an interactive mechanism
so that you can slice and dice your data again and again. Caching
helps deal with some of the latency of a Hadoop cluster.
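The caching trick is general: pay the slow cluster round trip once per distinct query, then answer repeats from memory. A minimal sketch with functools.lru_cache, where hive_query is a hypothetical stand-in for the real Hive round trip:

```python
from functools import lru_cache

CALLS = {"count": 0}

@lru_cache(maxsize=128)
def hive_query(sql):
    # Stand-in for a slow round trip to a Hadoop cluster via Hive.
    CALLS["count"] += 1
    return f"rows for: {sql}"

hive_query("SELECT page, COUNT(*) FROM hits GROUP BY page")
hive_query("SELECT page, COUNT(*) FROM hits GROUP BY page")  # from cache
print(CALLS["count"], "cluster round trip(s)")
```

The second, identical slice of the data never touches the cluster, which is what makes the repeated reslicing feel interactive.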
The software is well-polished and aesthetically pleasing. I
often found myself reslicing the data just to see it in yet another
graph, even though there wasn't much new to be learned by
switching from a pie chart to a bar graph and beyond. The soft-
ware team clearly includes a number of people with some artistic
talent. (This article, "6 top tools for taming big data," was
originally published at InfoWorld.com.)