54 · China Manufacturing Informatization (中国制造业信息化) · 2012, Issue 8
Digital Factory / Management
Thinker

6 Top Tools for Taming Big Data
■ Jakob Bjorklund
The industry now has a buzzword, "big data," for how we're
going to do something with the huge amount of information pil-
ing up. "Big data" is replacing "business intelligence," which sub-
sumed "reporting," which put a nicer gloss on "spreadsheets,"
which beat out the old-fashioned "printouts." Managers who long
ago studied printouts are now hiring mathematicians who claim
to be big data specialists to help them solve the same old problem:
What's selling and why?
It's not fair to suggest that these buzzwords are simple re-
placements for each other. Big data is a more complicated world
because the scale is much larger. The information is usually spread
out over a number of servers, and the work of compiling the data
must be coordinated among them. In the past, the work was largely
delegated to the database software, which would use its magical
JOIN mechanism to compile tables, then add up the columns
before handing off the rectangle of data to the reporting software
that would paginate it. This was often harder than it sounds. Da-
tabase programmers can tell you the stories about complicated
JOIN commands that would lock up their database for hours as it
tried to produce a report for the boss who wanted his columns
just so.
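The single-machine workflow described above can be sketched with Python's built-in sqlite3 module. The table and column names here are hypothetical, but the shape is the classic one: JOIN the tables, add up the columns, hand off a rectangle of data.

```python
import sqlite3

# In-memory database standing in for the old single-machine setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales VALUES (1, 9.99), (1, 19.99), (2, 5.00);
""")

# The "magical JOIN": compile the tables, then add up the columns.
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS orders, SUM(s.amount) AS revenue
    FROM products p JOIN sales s ON s.product_id = p.id
    GROUP BY p.name
    ORDER BY revenue DESC
""").fetchall()

for name, orders, revenue in rows:
    print(f"{name}: {orders} orders, ${revenue:.2f}")
```

On a few rows this is instant; on production tables with the wrong indexes, the same query is what locks up the database for hours.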
The game is much different now. Hadoop is a popular tool
for organizing the racks and racks of servers, and NoSQL data-
bases are popular tools for storing data on these racks. These
mechanisms can be much more powerful than the old single
machine, but they are far from being as polished as the old data-
base servers. Although SQL may be complicated, writing the JOIN
query for the SQL databases was often much simpler than gath-
ering information from dozens of machines and compiling it into
one coherent answer. Hadoop jobs are written in Java, and that
requires another level of sophistication. The tools for tackling big
data are just beginning to package this distributed computing
power in a way that's a bit easier to use.
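The map/shuffle/reduce pattern behind those Hadoop jobs can be shown in miniature. This is not a real Hadoop job (those are Java classes submitted to a cluster); it is a single-process sketch of the three phases, using a made-up sales dataset:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs -- here, one quantity per product sold.
    product, qty = record
    yield product, qty

def shuffle(pairs):
    # Group values by key: the step Hadoop performs between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Collapse each key's values into one result.
    return key, sum(values)

records = [("widget", 2), ("gadget", 1), ("widget", 5)]
pairs = [pair for rec in records for pair in map_phase(rec)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(totals)  # widget totals 7, gadget 1
```

On a real cluster, the map and reduce functions run on different machines and the shuffle moves data over the network, which is where most of the extra sophistication comes in.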
The biggest challenge may be dealing with the expecta-
tions built up by the major motion picture "Moneyball." All the
bosses have seen it and absorbed the message that some clever
statistics can turn a small-budget team into a World Series winner.
Never mind that the Oakland Athletics never won the World
Series during the "Moneyball" era. That's the magic of Michael
Lewis' prose. The bosses are all thinking, "Perhaps if I can get
some good stats, Hollywood will hire Brad Pitt to play me in the
movie version."
None of the software in this collection will come close to
luring Brad Pitt to ask his agent for a copy of the script for the
movie version of your Hadoop job. That has to come from within
you or the other humans working on the project. Understanding
the data and finding the right question to ask is often much more
complicated than getting your Hadoop job to run quickly. That's
really saying something because these tools are only half of the
job.
To get a handle on the promise of the field, I downloaded
some big data tools, mixed in data, then stared at the answers for
Einstein-grade insight. The information came from the log files of
the website that sells some of my books (wayner.org), and I was
looking for some idea of what was selling and why. So I un-
packed the software and asked the questions.
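The raw material for all of these questions is just text. A minimal sketch of pulling "what's selling" out of web server logs might look like this; the log lines are hypothetical Apache-style entries, not my actual traffic:

```python
import re
from collections import Counter

# Hypothetical Apache-style log lines; real ones come from the web server.
LOG = """\
10.0.0.1 - - [12/Mar/2012:10:02:11 +0000] "GET /books/big-data HTTP/1.1" 200 5120
10.0.0.2 - - [12/Mar/2012:10:05:40 +0000] "GET /books/java HTTP/1.1" 200 4800
10.0.0.1 - - [12/Mar/2012:11:17:03 +0000] "GET /books/big-data HTTP/1.1" 200 5120
"""

# Pull the requested path out of each line and count the hits per page.
pattern = re.compile(r'"GET (\S+) HTTP')
hits = Counter(pattern.search(line).group(1) for line in LOG.splitlines())
for path, count in hits.most_common():
    print(path, count)
```

The tools below do the same kind of aggregation, but at a scale and polish a three-line script cannot match.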
Big data tools: Jaspersoft BI Suite
The Jaspersoft package is one of the open source leaders
for producing reports from database columns. The software is
well-polished and already installed in many businesses turning
SQL tables into PDFs that everyone can scrutinize at meetings.
The company is jumping on the big data train, and this means
adding a software layer to connect its report generating software
to the places where big data gets stored. The JasperReports Server
now offers software to suck up data from many of the major
storage platforms, including MongoDB, Cassandra, Redis, Riak,
CouchDB, and Neo4j. Hadoop is also well-represented, with
JasperReports providing a Hive connector to reach inside
HBase.
This effort feels like it is still starting up -- many pages of
the documentation wiki are blank, and the tools are not fully
integrated. The visual query designer, for instance, doesn't work
yet with Cassandra's CQL. You get to type these queries out by
hand.
Once you get the data from these sources, Jaspersoft's server
will boil it down to interactive tables and graphs. The reports can
be quite sophisticated interactive tools that let you drill down into
various corners. You can ask for more and more details if you
need them.
This is a well-developed corner of the software world, and
Jaspersoft is expanding by making it easier to use these sophisti-
cated reports with newer sources of data. Jaspersoft isn't offering
particularly new ways to look at the data, just more sophisticated
ways to access data stored in new locations. I found this surpris-
ingly useful. The aggregation of my data was enough to make
basic sense of who was going to the website and when they were
going there.
Big data tools: Pentaho Business Analytics
Pentaho is another software platform that began as a report
generating engine; like Jaspersoft, it is branching into big data
by making it easier to absorb information from the new sources.
You can hook up Pentaho's tool to many of the most popular
NoSQL databases such as MongoDB and Cassandra. Once the
databases are connected, you can drag and drop the columns into
views and reports as if the information came from SQL
databases.
I found the classic sorting and sifting tables to be
extremely useful for understanding just who was spending
the most time at my website. Simply sorting by IP address
in the log files revealed what the heavy users were doing.
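That IP-address sort is simple enough to mimic directly. Here is a stand-in for what Pentaho's sortable table does under the hood, with made-up (ip, path) pairs in place of real log entries:

```python
from collections import Counter

# Hypothetical (ip, path) pairs extracted from a web server log.
entries = [
    ("10.0.0.1", "/books/big-data"),
    ("10.0.0.1", "/books/java"),
    ("10.0.0.2", "/books/big-data"),
    ("10.0.0.1", "/cart"),
]

# Count requests per IP; most_common() sorts the busiest visitor first.
by_ip = Counter(ip for ip, _ in entries)
heaviest = by_ip.most_common()
print(heaviest)
```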
Pentaho also provides software for drawing HDFS file data
and HBase data from Hadoop clusters. One of the more intrigu-
ing tools is the graphical programming interface known as either
Kettle or Pentaho Data Integration. It has a bunch of built-in
modules that you can drag and drop onto a picture, then connect
them. Pentaho has thoroughly integrated Hadoop and the other
sources into this, so you can write your code and send it out to
execute on the cluster.
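Conceptually, each icon on the Kettle canvas is a transform and each connecting line feeds one transform's rows into the next. The sketch below is my own minimal stand-in for that idea, not Kettle's actual API; the step names and row fields are invented:

```python
def keep_paid(rows):
    # Filter step: pass through only the rows marked as paid.
    for row in rows:
        if row["paid"]:
            yield row

def add_tax(rows, rate=0.08):
    # Calculator step: derive a new column from an existing one.
    for row in rows:
        yield {**row, "total": round(row["amount"] * (1 + rate), 2)}

def run_pipeline(rows, *steps):
    # Wire the steps together, like connecting icons on the canvas.
    for step in steps:
        rows = step(rows)
    return list(rows)

orders = [{"amount": 10.0, "paid": True}, {"amount": 5.0, "paid": False}]
result = run_pipeline(orders, keep_paid, add_tax)
print(result)
```

The appeal of the graphical version is that non-programmers can rewire the steps; the cost is that the picture hides exactly this kind of code.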
Big data tools: Karmasphere Studio and
Analyst
Many of the big data tools did not begin life as reporting
tools. Karmasphere Studio, for instance, is a set of plug-ins built
on top of Eclipse. It's a specialized IDE that makes it easier to
create and run Hadoop jobs.
I had a rare feeling of joy when I started configuring a
Hadoop job with this developer tool. There are a number of
stages in the life of a Hadoop job, and Karmasphere's tools
walk you through each step, showing the partial results along
the way. I guess debuggers have always made it possible for
us to peer into the mechanism as it does its work, but
Karmasphere Studio does something a bit better: As you set
up the workflow, the tools display the state of the test data at
each step. You see what the temporary data will look like as
it is cut apart, analyzed, then reduced.
Karmasphere also distributes a tool called Karmasphere
Analyst, which is designed to simplify the process of plowing
through all of the data in a Hadoop cluster. It comes with many
useful building blocks for programming a good Hadoop job, like
subroutines for uncompressing Zipped log files. Then it strings
them together and parameterizes the Hive calls to produce a table
of output for perusing.
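A building block like the log-uncompressing subroutine is easy to picture with Python's standard gzip module. This is a generic sketch, not Karmasphere's actual code; the sample log is compressed in memory to stand in for a zipped file on disk:

```python
import gzip
import io

# Compress a sample log in memory to stand in for a zipped log file.
raw = b"10.0.0.1 GET /books/big-data 200\n10.0.0.2 GET /books/java 200\n"
compressed = gzip.compress(raw)

# The building block: uncompress and iterate lines, ready for later stages.
with gzip.open(io.BytesIO(compressed), "rt") as fh:
    lines = [line.strip() for line in fh]
print(len(lines), "lines recovered")
```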
Big data tools: Talend Open Studio
Talend also offers an Eclipse-based IDE for stringing to-
gether data processing jobs with Hadoop. Its tools are designed
to help with data integration, data quality, and data management,
all with subroutines tuned to these jobs.
Talend Studio allows you to build up your jobs by dragging
and dropping little icons onto a canvas. If you want to get an RSS
feed, Talend's component will fetch the RSS and add proxying if
necessary. There are dozens of components for gathering infor-
mation and dozens more for doing things like a "fuzzy match."
Then you can output the results.
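A "fuzzy match" component decides whether two strings that are not identical refer to the same thing. Talend's implementation is its own; a rough stand-in using Python's difflib shows the idea, with an arbitrary similarity threshold:

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.8):
    # Ratio is 1.0 for identical strings, near 0.0 for unrelated ones.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(fuzzy_match("Acme Corp.", "ACME Corp"))   # near-duplicates match
print(fuzzy_match("Acme Corp.", "Zenith Ltd"))  # unrelated names do not
```

Matching near-duplicate customer names like these is a large part of the "data quality" work Talend's components are tuned for.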
Stringing together blocks visually can be simple after you
get a feel for what the components actually do and don't do. This
was easier for me to figure out when I started looking at the source
code being assembled behind the canvas. Talend lets you see this,
and I think it's an ideal compromise. Visual programming may
seem like a lofty goal, but I've found that the icons can never
represent the mechanisms with enough detail to make it possible
to understand what's going on. I need the source code.
Talend also maintains TalendForge, a collection of open
source extensions that make it easier to work with the company's
products. Most of the tools seem to be filters or libraries that link
Talend's software to other major products such as Salesforce.com
and SugarCRM. You can suck down information from these sys-
tems into your own projects, simplifying the integration.
Big data tools: Skytree Server
Not all of the tools are designed to make it easier to string
together code with visual mechanisms. Skytree offers a bundle
that performs many of the more sophisticated machine-learning
algorithms. All it takes is typing the right command into a com-
mand line.
Skytree is more focused on the guts than the shiny GUI.
Skytree Server is optimized to run a number of classic machine-
learning algorithms on your data using an implementation the
company claims can be 10,000 times faster than other packages.
It can search through your data looking for clusters of mathemati-
cally similar items, then invert this to identify outliers that may be
problems, opportunities, or both. The algorithms can be more
precise than humans, and they can search through vast quanti-
ties of data looking for the entries that are a bit out of the ordinary.
This may be fraud -- or a particularly good customer who will
spend and spend.
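Skytree's algorithms are proprietary and far more sophisticated, but the outlier idea itself can be shown with a toy z-score check: flag any value that sits unusually many standard deviations from the mean. The spending figures and the threshold here are invented:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.5):
    # Flag values more than `threshold` standard deviations from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# Daily purchase totals; one customer suddenly spends far more than usual.
spend = [42.0, 39.5, 41.2, 40.8, 43.1, 38.9, 40.0, 41.5, 39.9, 400.0]
print(outliers(spend))  # only the 400.0 entry is flagged
```

Whether that flagged entry is fraud or a big spender is exactly the judgment the software leaves to the humans.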
The free version of the software offers the same algorithms
as the proprietary version, but it's limited to data sets of 100,000
rows. This should be sufficient to establish whether the software
is a good match.
Big data tools: Tableau Desktop and Server
Tableau Desktop is a visualization tool that makes it easy
to look at your data in new ways, then slice it up and look at it
in a different way. You can even mix the data with other data
and examine it in yet another light. The tool is optimized to
give you all the columns for the data and let you mix them
before stuffing it into one of the dozens of graphical templates
provided.
Tableau Software started embracing Hadoop several ver-
sions ago, and now you can treat Hadoop "just like you would
with any data connection." Tableau relies upon Hive to structure
the queries, then tries its best to cache as much information in
memory as possible so the tool stays interactive. While many of the
other reporting tools are built on a tradition of generating the
reports offline, Tableau wants to offer an interactive mechanism
so that you can slice and dice your data again and again. Caching
helps deal with some of the latency of a Hadoop cluster.
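The caching trick is general: pay the slow cluster round trip once per distinct query, then answer repeats from memory. A minimal sketch with functools.lru_cache, where hive_query is a hypothetical stand-in for the real Hive round trip:

```python
from functools import lru_cache

CALLS = {"count": 0}

@lru_cache(maxsize=128)
def hive_query(sql):
    # Stand-in for a slow round trip to a Hadoop cluster via Hive.
    CALLS["count"] += 1
    return f"rows for: {sql}"

hive_query("SELECT page, COUNT(*) FROM hits GROUP BY page")
hive_query("SELECT page, COUNT(*) FROM hits GROUP BY page")  # from cache
print(CALLS["count"], "cluster round trip(s)")
```

The second, identical slice of the data never touches the cluster, which is what makes the repeated reslicing feel interactive.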
The software is well-polished and aesthetically pleasing. I
often found myself reslicing the data just to see it in yet another
graph, even though there wasn't much new to be learned by
switching from a pie chart to a bar graph and beyond. The soft-
ware team clearly includes a number of people with some artistic
talent. (This article, "6 top tools for taming big data," was
originally published at InfoWorld.com.)