
Special Forum: Big Data

Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo
Department of Computing, Imperial College London

Model: Mathematical Representation of a Simplified Physical World
• Modelling is an essential and inseparable part of all scientific activity.
• A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
• To understand the world or an object (called a target T), we build a model M: a simplified mathematical representation of T.
• A model is the result of abstraction from the observations made, and it is used to make predictions.
(Diagram labels: Human / Sensor, Human / Machine, Human / Machine.)

No Model Is Perfect
• Inherent uncertainty: Targets consist of phenomena that are continuous in both time and space, and they typically produce rich signals. Because of this continuity, the signals are in principle infinite. Observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which introduces uncertainty.
• Overfitting or underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as the polynomial order K. Because the information in the observations is partial, it is hard to choose K perfectly. An imperfect choice causes model error: underfitting (small K) or overfitting (large K).
• Simplification: From observations of a multi-dimensional world we project a simplified model with significantly reduced dimensionality, focusing on the features or properties we are interested in.
(Figure: nonlinear regression with a K-order polynomial.)

George Box (statistician): "All models are wrong, but some are useful."
Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us.
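The underfitting and overfitting point made on the "No Model Is Perfect" slide can be illustrated with a small K-order polynomial fit. This is only a sketch under stated assumptions: the sine target, the nine sample points, the fixed +/-0.1 disturbance standing in for sensor noise, and the helper names (solve, fit_polynomial, mse) are all made up for illustration and do not come from the talk.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_polynomial(xs, ys, k):
    """Least-squares coefficients c[0..k] of a K-order polynomial (normal equations)."""
    a = [[sum(x ** (i + j) for x in xs) for j in range(k + 1)] for i in range(k + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k + 1)]
    return solve(a, b)

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    def predict(x):
        return sum(c * x ** j for j, c in enumerate(coeffs))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical observations: a smooth target sampled at 9 discrete points,
# plus a small fixed disturbance standing in for sensor noise.
xs = [-1.0 + 0.25 * i for i in range(9)]
ys = [math.sin(math.pi * x) + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]

# Noise-free target values midway between the sample points (held-out check).
mid_xs = [x + 0.125 for x in xs[:-1]]
mid_ys = [math.sin(math.pi * x) for x in mid_xs]

train_error = {}
heldout_error = {}
for k in (1, 3, 8):
    coeffs = fit_polynomial(xs, ys, k)
    train_error[k] = mse(coeffs, xs, ys)
    heldout_error[k] = mse(coeffs, mid_xs, mid_ys)
    print(k, train_error[k], heldout_error[k])
```

With K = 1 the fit cannot follow the curve (underfitting). With K = 8 the polynomial passes through all nine noisy samples almost exactly, so a training error near zero says nothing about how well it represents the target between the samples (overfitting). Comparing the two printed columns is one way to see the difference.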
(The George Box quotation above is dated 1980.)
Peter Norvig (Google), 2008: "All models are wrong, and increasingly you can succeed without them."
Chris Anderson (Wired), 2012: There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (The Data Deluge Makes the Scientific Method Obsolete)
So, Why Model?

The Google Argument
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.
Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Model Free Sensor Informatics: Query Driven
Database table raw-data (filled by the sensor network):
time    id    temp
10am    1     20
10am    2     21
..      ..    ..
10am    7     29
User workflow:
1. Extract all readings into a file
2. Run MATLAB/R/other data processing tools
3. Write output to a file/back to the database
4. Write data processing tools to process/aggregate the output (maybe using DB)
5. Decide new data to acquire
Repeat.
Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.

Wikisensing: A Model Free Sensor Informatics System Based on Big Data Architecture

Model Free Sensing is Super Inefficient
• Data misrepresentation without a model
• Latent information missing without a model
• High demand for computation/storage without a model
• Requires too much interoperability between sensors and analytics

Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!
Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).
If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.
However, suppose a is the statement "Britain sweeps the board at the 2012 London Olympics, winning more than 30 gold medals!", made before 28 July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement, depending on their specific knowledge of factors that might affect its likelihood.
Beliefs in the model changed daily, based on the performance data available each day. By 10 August, most people's belief in this model should have been almost 80%.
Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a. The expression P(a|K) thus represents a belief measure.
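The daily revision of belief described above can be sketched as repeated application of Bayes' rule. This is a minimal illustration, not the speaker's method: the "good"/"poor" day encoding, the two likelihood values, and the starting priors (standing in for Henry's and Marcel's different bodies of knowledge K) are all hypothetical.

```python
def update(prior, p_obs_if_true, p_obs_if_false):
    """One application of Bayes' rule: return P(a | observation)."""
    evidence = p_obs_if_true * prior + p_obs_if_false * (1.0 - prior)
    return p_obs_if_true * prior / evidence

# Assumed likelihoods (not from the talk): how probable a "good" day of
# results is if statement a will turn out true, and if it will not.
P_GOOD_IF_A = 0.7
P_GOOD_IF_NOT_A = 0.4

def belief_trajectory(prior, days):
    """Return the belief in a after each day's observation, starting from prior."""
    beliefs = [prior]
    for day in days:
        if day == "good":
            prior = update(prior, P_GOOD_IF_A, P_GOOD_IF_NOT_A)
        else:
            prior = update(prior, 1.0 - P_GOOD_IF_A, 1.0 - P_GOOD_IF_NOT_A)
        beliefs.append(prior)
    return beliefs

# The same observations, seen by two people with different starting beliefs.
observations = ["good", "good", "poor", "good", "good"]
henry = belief_trajectory(0.2, observations)
marcel = belief_trajectory(0.5, observations)

for day, (h, m) in enumerate(zip(henry, marcel)):
    print(day, round(h, 3), round(m, 3))
```

Each posterior becomes the prior for the next day, so the data arriving day by day keeps reshaping the belief. Both people see identical data yet end at different beliefs, because they started from different priors.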
Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference
• Bayes rule: interaction between data and model
• Learning as a sequence of interactions
P(θ | Y) = p(Y | θ) p(θ) / p(Y)

Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
• We need models: a model is the representation of our knowledge so far
• Data: the observations, which may revise our belief in the models we have
• Analysis: assessing our belief and updating our models to make them more believable
• Sensing: acquiring the data needed to update (enrich) models
• Models are learned from data (observations) by scientists (theoretical abstraction) or by machines (machine learning)
• Models are hypotheses (when making new observations)
• Models are knowledge (when belief is established)
Sensor informatics:
• Sensing management: managing the "neediness", i.e. when and where to sense
• Sensing analytics: managing model updating, i.e. how to enrich models with observations
• Reasoning: decision making based on the integration of trusted models
P(M | D) = P(D | M) P(M) / P(D)

Surprising Event: When an Observation Does Not Fit a Known Model
• A large divergence between the posterior P(M | D) and the prior P(M) means surprise.
• How large is large? Use a surprise threshold α.
• Measure: Kullback-Leibler divergence. Other methods: significance level, Chebyshev's theorem, …
• From the model we get C(A, B) (e.g. a multivariate Gaussian distribution):
  A: 100mm, B: 50mm -> model consistent
  A: 100mm, B: 500mm -> surprise!

Using Compressive Sensing Technology to Optimize Observations
• Camera example: Image -> Analog signal -> Digital data -> Compressed data -> Information
• Why sense so much data and then throw most of it away? Why not sense the information directly?
• Compressive sensing: take advantage of sparseness to solve an under-determined system with just a small number of measurements.
• Unobserved behavior (behavior not captured by the current model) is typically sparse.
• Reconstruction methods: L1-min, Bayesian CS.
• Sensing data is enough when we can recover the needed information through compressive sensing.
• Ψ: CS matrix built from the model
• Φ: placement matrix

How to Update the Model: Parameter Estimation
(Figure: nodal solution plot of TEMP (AVG), STEP=360, TIME=1800, values ranging from 131.03 to 646.41.)
Estimating parameter θ to maximize the likelihood of the data given the model.

Model: An Example in Digital City
Modelling city life via causality: C(eA, eB) is used to predict the current value at location A when the value at another location B is given.
• Location: physical / logical locations with causality (through sensory cortex) (city areas A, B)
• Relationship: topology (geo-topology between A and B: diffusion structure)
• Event: events, which are the dynamics of the observable signal, S = f(E) (e.g. heavy rainfall)

Digital City Model: Looking into the Details
• Ontologies are adopted to represent locations L, relationships R, events E, and signals S.
• Diffusion: an event e1 ∈ E in n1 causes another event e2 ∈ E in n2 when the two nodes n1, n2 in G are linked.
• System T = (L, R, E)
• Model M(T) = (G, ∅, B)
• Training for causality ∅: use a Bayesian network to represent the conditional independencies between cause and target variables:
  1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
  2. Gaussian Process with Bayesian inference

Model Driven Sensing: No Surprise!
• When the surprise > surprise threshold -> diversity detected -> identify the incorrect causality C(el, ep), which is sparse -> compressive sensing approach
• New observation -> measurement that could revise the model in model space to maximize the likelihood of the observations
• Focusing on diversity
(Diagram labels: Placement, Model Updating.)
• The dynamics of model update: Surprise -> Sensing -> Model Updating
• The goal of sensing: capturing surprise
• The goal of analysis: revising the model
• A model cannot overfit or underfit: when there is diversity, it can be updated -> consistent with the universe (target)

Model Update
It is Bayesian: P(M, θ | D) = P(D | M, θ) P(M, θ) / P(D)
T: target, M: model, θ: top-down parameter
• When θ is fixed: P(M | D) = P(D | M) P(M) / P(D)
  -> The variance between posterior and prior is "surprise"
  -> bottom-up attention -> model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle.
• When θ is updated: P(M, θ) = P(M | θ) P(θ)
  -> top-down attention (alertness) -> model update

Adaptive Observation: Sensing and Numerical Modelling
CityGML Ontology -> GIS -> Geometry mesh

Building an Initial Model and Making Predictions by Simulation
Setting up boundary conditions, numerical schemas, model parameters, etc.

Simulation
24-building case (fine mesh, 600,000 nodes): 20 processors

Simulation
Moving vehicles and scalar dispersion in street canyons

Using Sensors to Verify the Prediction Results of the Model
Sensing: acquiring data to obtain the posterior of the model, in order to validate the model (confirm it is consistent) or update it.
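The validate-or-update decision can be sketched as follows, assuming a finite set of candidate models and using the Kullback-Leibler divergence between posterior and prior as the surprise measure, compared with a threshold α. The model names, the probabilities, and the value of ALPHA are hypothetical, and the direction KL(posterior || prior) is an assumption, since the slides do not spell out the formula.

```python
import math

def posterior(prior, likelihood):
    """P(M | D) = P(D | M) P(M) / P(D) over a finite set of candidate models."""
    joint = {m: likelihood[m] * prior[m] for m in prior}
    p_data = sum(joint.values())
    return {m: joint[m] / p_data for m in joint}

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[m] * math.log(p[m] / q[m]) for m in p if p[m] > 0.0)

def assess(prior, likelihood, alpha):
    """Return (surprised, posterior, divergence) for one new observation D."""
    post = posterior(prior, likelihood)
    divergence = kl_divergence(post, prior)
    return divergence > alpha, post, divergence

ALPHA = 0.5  # hypothetical surprise threshold
prior = {"M1": 0.80, "M2": 0.15, "M3": 0.05}

# P(D | M) for two hypothetical observations.
consistent = {"M1": 0.30, "M2": 0.25, "M3": 0.20}   # D fits the favoured model
unexpected = {"M1": 0.001, "M2": 0.02, "M3": 0.40}  # D fits only the unlikely M3

calm, post_calm, d_calm = assess(prior, consistent, ALPHA)
alarm, post_alarm, d_alarm = assess(prior, unexpected, ALPHA)
print(calm, round(d_calm, 4))
print(alarm, round(d_alarm, 4))
```

When the divergence stays below the threshold, the observation validates the current belief and the posterior can simply replace the prior. When it exceeds the threshold, the observation counts as a surprise, which in the cycle described above would trigger further sensing and a model update.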
P(M | D) = P(D | M) P(M) / P(D)
(Diagram labels: Data, sensing, Model, validate, update.)

New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics
• Elastic sensing theory based on Bayesian inference
• Big Data architecture for large scale sensory data management
• Ontology for the background knowledge management
• Model driven adaptive observation support
• Digital City and digital life applications

The Architecture of the New WikiSensing System

Ontology Used to Organise Complex Knowledge Management
• Using ontology to represent the targets, signals, sensing methods, measurements, etc.
• Ontology to support flexible resolution
• Upper ontology for unified operation
• OntoSensor

Conclusion
• Big data offers great opportunity for building smart models
• Big data provides new methodology for model research
• New informatics comes from the closely coupled integration of the data and the model worlds
• Bayesian theory provides a natural foundation for such an integration
• Sensor informatics is a good example of such a paradigm
• A new uniform framework of sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
• We are developing the WikiSensing system to realise this paradigm

Thank you

Understanding Big Data
Haixun Wang

Data Explosion
• MB = 10^6 bytes: a typical book in text format
• GB = 10^9 bytes: a one-hour video is about 1 GB; data produced by a biology experiment in one day
• TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)

The Arecibo Telescope
• World's largest radio telescope
• Diameter: 305 m (1,000 ft); area: 18 acres
• Location: Arecibo, Puerto Rico
• http://www.naic.edu
• The P-ALFA surveys: 800 terabytes in 5 years

Software Driven Telescope
• From few, large, expensive, directional dishes to many, small, cheap, omnidirectional antennae
• A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area of 340 km in diameter)

Challenge 1: It's the data, stupid!
(Diagram labels: data size, data complexity; key/value store, column store, document store, graph systems.)

Big data drives tomorrow's economy.
• The value of big data lies in its degree of connectedness.
• Existing systems cannot handle the rich connectedness of big data.

RDBMS and Rich Relationships
• Performance of multi-way joins is very poor in RDBMS
• Managing data of rich connectedness requires multi-way joins in RDBMS

Trinity
• A general purpose, distributed, in-memory graph system
• Online graph query processing
• Offline graph analytics

Trinity Performance Highlight
• Online query processing:
  – visiting 2.2 million users (3-hop neighborhood) on Facebook: <= 100 ms
  – foundation for graph-based services, e.g., entity search
• Offline graph analytics:
  – one iteration on a 1 billion node graph: <= 60 sec
  – foundation for analytics, e.g., social analytics

People Search Demo

Multi-way Join vs. Graph Traversal
(Diagram: Company, Incident, and Problem tables linked through ID columns in an RDBMS, shown next to the same data in Trinity.)

Challenge 2: Interpretation of Big Data
• IBM Watson: runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power
• A human brain: runs on a tuna fish sandwich and a glass of water

The Eternal Quest
(Diagram: systems arranged by "understanding the question" and "answering the question". Scale labels: unconstrained natural language, domain specific language; inferencing & reasoning, simple calculation. Systems shown: Human (Turing Test), SIRI, Watson, Wolfram Alpha, Google/Bing?, SQL, calculator.)

Turning the Web into a Database

What you see when you look at my homepage …
Haixun Wang
Microsoft Research Asia
Email: haixunw @ microsoft . com
Tel: +86-10-58963289
Tel: +1-914-902-0749
I joined Microsoft Research Asia in 2009. I was with IBM T. J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from the University of California, Los Angeles, in June 2000.

What a machine sees when it looks at my homepage …
• A JPEG image (a jpeg file)
• Text in a big bold font
• 4 lines of text
• Another dozen lines of text with two embedded URLs

Semantic Web?
• "Number 1 trend in 2008" (Richard MacManus)
• "The infrastructure to power the Semantic Web is already here." (Tim Berners-Lee)
• "Unstructured information will give way to structured information, paving the road to intelligent computing." (Alex Iskold)

More data beats better algorithms
Banko and Brill 2001

(Chart: English-Spanish translation quality on Microsoft technical texts, 2001 to 2007. Y axis: mean translation quality, 1 = incomprehensible, 4 = perfect. Series: Systran, MSRMT, Logos. Annotations: "Improve algorithms, scale system, and add data!"; rule-based system with expensive customizations for Microsoft; off-the-shelf rule-based system.)
From Rick Rashid's talk: It's a data driven world – get over it!

Probase
• isA (concepts, entities)
• isPropertyOf (attributes)
• Co-occurrence (isCEOof, LocatedIn, etc.)
• Concepts ("Spanish Artists")
• Entities ("Pablo Picasso")

Explicit vs. Latent Knowledge
• Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information.
• Human language has evolved over millennia to have words for the important concepts; let's use them.
Halevy, Norvig, Pereira, "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009.

What is interpretation?

Add Common Sense to Computing
(Slide labels: Pablo Picasso, 25 Oct 1881, Spanish.)

Which is "kiki" and which is "bouba"?
(Slide labels: sound, shape, zigzaggedness.)
(Slide labels: China, India, Brazil, country, emerging market.)
(Slide labels: body, taste, smell, wine.)
The engineer is eating an apple. (Slide labels: IT company, fruit.)

Multiple Concepts
Obama's real-estate policy
• president, politician
• investment, property, asset, plan, document
• president, politician, investment, property, asset, plan, document

Multiple Concepts
• apple: software company, brand, fruit, juice
• adobe: brand, software company, material
(Other label groups on the slide: "software company, software manufacturer, brand"; "juice, material"; "brand, company, fruit".)

Multiple Concepts
Obama's real-estate policy
• president, politician
• investment, property, asset, plan, document
• president, politician, investment, property, example, plan, document
• thing, issue, term, asset

Example (from B. Dolan): Who assassinated Abraham Lincoln?

The Far-Reaching Implications
Scientific Method

Scientific Method

What really counts is understanding, or a mastery of some common vocabulary.

How can big data help?
• A much more rapid cycle of hypothesis generation and testing
• General access to knowledge in science
• Autonomous experimentation, with an 'active learning' model

Technological Singularity
If machines could even slightly surpass human intellect, they could improve their own designs in ways unforeseen by their designers, and thus recursively augment themselves into far greater intelligences.

Thanks

Big Data Platforms and Internet Application Services

Agenda
• Current problems and challenges
• Solutions from companies in China and abroad
• Tencent's approach to big data

Agenda
Part 1: Current problems and challenges

Big Data Challenge (1): Massive Data Storage Technology?
1. As data grows from the PB scale toward the ZB scale, how can storage and computing costs be reduced? (Data volume: 46 PB; number of machines: 5,600.)
2. The rapid growth of industrial-grade services poses new challenges for the timeliness and reliability of big data computing.

Big Data Challenge (2): Data Is Hard to Put to Use

Big Data Challenge (3): Precise Recommendation Is Hard
1. Information overload from enterprises (across the whole Internet)
2. Low recommendation precision
3. How to evaluate recommendation results effectively
4. How to collect data on users' active behavior effectively

Agenda
Part 2: Solutions from companies in China and abroad

Case in China: Alipay Big Data Platform
Alipay's Hadoop-related application services
• Hadoop open-source products: HBase, Mahout, Hive/Pig
• Dolphin technology (海豚技术): Fur Seal (海狗), Octopus (章鱼), Starfish (海星), Swordfish (剑鱼), Blue Whale (蓝鲸), …
• Massive computing: a Hadoop-based massive storage and computing cluster, with one-stop management of computing and storage resources
• Distributed data mining: distributed data mining based on Mahout
• Data distribution center: batch data extraction and loading, plus near-real-time message and log distribution (using client pull)
• Real-time search over massive data: based on HBase and Solr integration, offering real-time query and full-text search over data at the hundred-billion-record scale
• Stream computing framework: a streaming framework similar to M/R that lets applications be implemented quickly and provides online data processing services
• Massive data query: based on Hive and Pig, offering a web-based visual query service over massive data

Cases Abroad: Precise Recommendation
Applications of precise recommendation are a hot topic in data mining.
• Online news: Google News reports that recommendations increase articles viewed by 38% (Das et al. 2007).
• Movies: Netflix reports that over 60% of their rentals originate from recommendations (Thompson 2008).
• Amazon, which sells music, books, and movies: 35% of sales are reported to originate from recommendations (Lamere & Green 2008).
• Video: on YouTube, 60% of all homepage video clicks are recommendations (Davidson et al. 2010).
• How about Tencent?

Facebook
• 800mil: 800 million active users, 50% DAU
• 100bn: 100 billion online social connections
• 30bn: 30 billion pieces of content/items shared per month
Twitter
• 300mil: nearly 300 million active users
• 6.5mil: 6.5 million posts per day
YouTube
• 2bn: nearly 2 billion video views per day
• 1mil: nearly 1 million hours of video uploaded per month
Tencent QQ
• 700mil …
Format: ppt
Size: 12 MB
Software: PowerPoint
Category: Internet
Upload date: 2014-04-27