NodeMD: Diagnosing Node-Level Faults in Remote Wireless Sensor Systems

Veljko Krunic, Eric Trumpler, Richard Han
Department of Computer Science, University of Colorado at Boulder
krunic@ieee.org, Eric.Trumpler@colorado.edu, Richard.Han@colorado.edu

ABSTRACT

Software failures in wireless sensor systems are notoriously difficult to debug. Resource constraints in wireless deployments substantially restrict visibility into the root causes of node-level system and application faults. At the same time, the high costs of deployment of wireless sensor systems often far exceed the cumulative costs of all other sensor hardware, so that software failures that completely disable a node are prohibitively expensive to repair in real world applications, e.g. by on-site visits to replace or reset nodes. We describe NodeMD, a deployment management system that successfully implements lightweight run-time detection, logging, and notification of software faults on wireless mote-class devices. NodeMD introduces a debug mode that catches a failure before it completely disables a node and drops the node into a stable state that enables further diagnosis and correction, thus avoiding on-site redeployment. We analyze the performance of NodeMD on a real world application of wireless sensor systems.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging—diagnostics, distributed debugging, error handling and recovery, tracing; C.2.1 [Computer-Communication Networks]: Network Architecture and Design—wireless communication

General Terms
Design, Experimentation, Management, Performance, Reliability

Keywords
Diagnosis, Software Fault, Wireless Sensor Networks, Deployment

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MobiSys’07, June 11–13, 2007, San Juan, Puerto Rico, USA. Copyright 2007 ACM 978-1-59593-614-1/07/0006 ...$5.00.

1. INTRODUCTION

The vision of wireless sensor networks (WSNs) typically consists of a large number of very low cost sensor nodes that can be spread over a wide area to collect environmental data and relay that data back to a remote database or server via a self-organizing wireless mesh network. WSNs are often deployed in distant rugged environments, e.g. Great Duck Island off the coast of Maine [2], around wildfires in the Bitterroot National Forest in Idaho [3], and surrounding an active volcano in Ecuador [4]. These types of deployments are expensive and sometimes even dangerous to deployment personnel. For example, in the FireWxNet [3] deployment, a helicopter was used by fire personnel to deploy nodes on three different mountains, in some cases requiring the firefighters to climb down the mountain to place the nodes.

Compounding the difficulty of WSN deployments is that software bugs are inevitably encountered in the field, following a familiar theme that has been experienced all too commonly in other deployed software systems. Commercial applications and operating systems typically have large quality-control resources devoted to testing of software prior to deployment, yet still encounter software bugs in the field that require frequent patching.
Despite exhaustive testing, commercial handheld devices with embedded software such as cell phones and wireless PDAs continue to suffer from software glitches during operation. As some well-publicized software failures during space missions have shown (e.g. Mars Pathfinder [15, 28]), software errors are a fact of life even for NASA, which has considerable resources at its disposal for testing prior to launching a mission. Our expectation is that WSN applications will face similar difficulties with software bugs that occur in the field. Moreover, we expect these problems to be exacerbated in WSNs by two factors: WSN systems typically have much scarcer resources available for testing than commercial and NASA-funded systems; and the data-driven nature of WSNs can create an unexpected fault-inducing combination of inputs that is difficult to forecast during limited lab testing. Indeed, our own experiences deploying FireWxNet confirmed that software bugs arose during our deployment despite our best efforts to eliminate errors through lab testing.

The cost of repairing a node that has been crippled by a software failure is especially high in WSN applications, due to the time, money, and effort required to revisit a node deployed in such remote rugged terrain. Solutions available in other domains to address software failures do not easily apply to the case of WSNs, due to these extreme conditions of deployment as well as the extreme resource constraints characteristic of WSNs. For example, to achieve the vision of many low cost sensor nodes, today’s sensor motes typically have extremely limited memory available, e.g. 4 KB of RAM and 128 KB of flash on MICA2 [5] class sensor motes. The embedded controllers characteristic of these sensor nodes also typically lack hardware memory protection and MMUs. Given these substantial hardware limitations, i.e. up to six orders of magnitude less RAM for WSN systems than for PC systems, we expect desktop-class solutions for detecting and repairing software faults to be too expensive to apply directly to the resource-constrained domain of WSNs. Embedded systems such as cell phones are closer in resources to WSN systems, e.g. tens of MBs of RAM, but even here their solutions do not necessarily apply. For example, when cell phones/PDAs become unresponsive due to faulty embedded software, their owners can often fix the problem by manually resetting and/or power cycling the device. Manual reset is a prohibitively expensive option in remote wireless sensor deployments, requiring on-site visitation.

A system that could catch a software fault before it completely disables a remote sensor node, and could provide diagnostic information to remotely troubleshoot the root cause of the fault, would be invaluable to in situ WSN deployments. The typical behavior after encountering a run-time software fault is for a remote node to enter a bad/unresponsive state that looks like a “black hole”: the fault is detected retroactively, by what information we don’t receive, and the node is completely disabled and needs to be redeployed. Even if this situation occurs in the lab during testing, the ability to provide more information than just a “black hole of silence” is clearly beneficial. Such a diagnostic system would thus be useful not only for in situ applications but also for troubleshooting errors during the testing phase.
Our goal in this paper is to offer a diagnostic system, NodeMD, capable of (1) catching run-time software faults as they occur, before they completely disable a remote node, and (2) remotely diagnosing the root cause of the fault, thereby substantially reducing the need for costly redeployment of nodes through on-site visits. Our solution must be tailored for WSNs, i.e. it must be lightweight and have a small footprint appropriate for the sensor network environment.

A medical analogy can provide some insight into the state of the art in sensor node debugging. Visiting a failed node in the field is similar to a country doctor who must travel to a remote area to treat a sick patient: for both the doctor’s house call and the on-site repair of a failed remote sensor node, the cost of the visit is prohibitively expensive. The WSN community has proposed a variety of approaches to mitigate these costs. SOS [8] provides the ability to remotely patch a sensor OS, and can be seen as analogous to a mail-order pharmacy that remotely provides medicine to alleviate a sickness. Marionette [14] and Nucleus [13] provide the ability to remotely query a node for run-time state information, and are analogous to a doctor using the telephone to ask a sick patient about their health. t-kernel [23] provides a general framework that seeks to prevent certain software faults like livelock, but not others such as stack overflow, and can be seen as vaccinating a patient against certain diseases but not others. Nucleus also provides an event log in flash that can be recovered after a node has died, which is analogous to post-mortem analysis.

Given all these pieces of the puzzle, we are still missing effective tools equivalent to a patient proactively reporting the rapid onset and current symptoms of an illness, as well as the history of behavior that led up to it, before that illness completely incapacitates the patient. There is no equivalent, in the suite of tools available to the WSN community, of a human patient who picks up the phone and reports “Doctor, I am not feeling well, these are the symptoms and this is what I did in the last few days”. Given today’s WSN debugging tools, a node can still fail without reporting any information about the failure at the time of the failure. As a result, today’s WSN community still cannot completely avoid the need for the equivalent of in-home visits.

NodeMD is the last piece of the puzzle necessary to realize the equivalent of a fully capable “remote doctor” in the world of WSNs, thereby drastically reducing the need for on-site visits. With NodeMD providing the missing link, we can envision a complete system based on keeping the “human in the loop”, in which problems with the software are brought to the attention of the programmer immediately, before they disable a node; good diagnostic tools are provided for timely diagnosis of the problem; and the appropriate remedy can be applied by remotely updating the sensor node with debugged code. Ultimately, the goal of our system is to bring node debugging in these challenging, resource-constrained, remote wireless environments to a level as useful as what exists in modern desktop computing systems.
The main contributions of this paper comprise the following: building a fault management system for WSNs that is capable of detecting a broad spectrum of software faults at run-time; introducing a recovery/debug mode that catches those faults so as not to completely disable the afflicted node; timely notification of the fault along with a brief diagnostic history of the events that led up to it; continued interaction with the halted node to close the loop on the debugging cycle by including a human programmer; resource-constrained solutions to all of the above; and a proof-of-concept implementation on a real world sensor application. The techniques proposed in this paper are designed to be generalizable across many different systems, and we foresee future implementations of NodeMD being used in a wide range of embedded operating systems.

In Section 2, we discuss related work in fault management in WSNs. Section 3 presents the unified system architecture of NodeMD. Section 4 introduces our suite of algorithms for detecting faults at run-time, including stack overflow, deadlock, livelock, and application-specific faults. Section 5 discusses our solution for entering the recovery/debug mode upon a detected fault and providing notification via a compressed history of the events leading up to the fault. Section 6 closes the loop on fault management by allowing interactive debugging of the remote node, in its halted mode, by a human. Finally, Section 7 provides a detailed analysis of the current implementation in Mantis OS [7] for several real world sensor applications.

2. RELATED WORK

Sensor network debugging today usually begins with staring at a set of blinking LEDs. JTAG interfaces on sensor boards provide increased visibility into faults, but only for nodes directly connected to a wired network. For wireless sensor nodes in either an in situ wireless deployment or a testbed environment, some systems are emerging that provide limited visibility into fault behavior. The Sympathy system [12] focuses on debugging networking faults, providing periodic reporting of various networking metrics to diagnose the reason behind reduced network throughput. The approach is somewhat limited by its periodic reporting, though the period can be adjusted, and it does not focus on detecting application and OS software failures on a node.

Nucleus [13], a deployment debugging system, was developed to resolve a lack of information when live deployments fail. Its primary features are a robust logging system and on-demand requests for information from nodes in the network. One essential theme we share is that debugging methods must persist even when the application fails. Nucleus stores “printf”-style messages in a limited buffer within main memory, and also writes them to flash memory to act as a sensor node “black box”. Such messages are inefficient to store in main memory, because the amount of information logged is sparse relative to the storage it consumes. Also, the slow storage of messages in flash may affect timing in the program if log operations are called within timing-sensitive code. Additionally, once a node has failed, such information is only available after the node has been retrieved.

Recent work on t-kernel [23], a reliable OS kernel, takes an approach that ensures the system is always able to retake control from an application: at a low level, each branch instruction first jumps to the system for verification before jumping back to the target address.
In fact, this preemption technique would be useful to support some of the techniques proposed by NodeMD. t-kernel provides a “safe execution environment” that allows the system to recover from problems such as deadlock or livelock. However, t-kernel is designed for reliability rather than debugging, and only ensures that the system can always execute. It does not react to the onset of the faults it can circumvent, i.e. deadlock and livelock, nor does it address how to detect other types of faults, such as stack overflow, or how to efficiently provide useful information for fault diagnosis.

Marionette [14] provides a mechanism to query the memory in nodes for their state. It is specific to TinyOS, and does not focus on detection, preemption, and notification of faults as they occur.

A variety of approaches for remote code updates in WSNs have been proposed, and are summarized in [6]. These approaches can be roughly divided into a networking component that achieves reliable code propagation, e.g. Deluge [9] and Aqueduct [10], and an operating system component that enables efficient update of code images on a sensor node, e.g. SOS [8] or the ELF loader [24]. Our fault management system is agnostic to the particular combination of mechanisms chosen for remote code updates; in theory any of them could be reused in NodeMD’s architecture. For example, the ELF dynamic modules loader [24] was recently implemented inside of MOS, the same platform upon which NodeMD is implemented, to enable efficient code updates. Our focus in this paper is not on these mechanisms, but on our innovations in automated fault detection, notification, and diagnosis, the missing links in fault management for WSN systems.

3. SYSTEM ARCHITECTURE AND DESIGN GOALS

NodeMD’s fault management system consists of three main subsystems, corresponding to the system shown in Figure 1. These subsystems are combined under a single unified architecture to provide an expansive solution to node-level fault diagnosis in deployed WSNs.

• The fault detection subsystem monitors the health of the system and catches software faults such as stack overflow, livelock, deadlock, and application-defined faults as they occur, signified by the ’X’ of the failed node in the figure.

• The fault notification or reporting subsystem is responsible for constant system-oriented logging, in a space- and time-efficient manner, of the sequence of events occurring in the system. This compressed event trace, kept in the form of a circular bit vector, is conveyed in a notification message back to the human user immediately after a fault.

• The fault diagnosis subsystem essentially closes the loop on the “debugging” cycle, halting the node and dropping it into a safe debug or error recovery mode wherein interactive queries can be accepted from a remote human user for more detailed diagnostic information, and remote code updates can also be accepted.

Figure 1: System architecture of NodeMD.

NodeMD must accomplish the above diagnostic features while achieving a variety of other design goals. First, it is essential that fault detection and notification be extremely memory-efficient and impose low overhead in terms of CPU and radio bandwidth, to fit within the extreme resource constraints demanded by deployed sensor nodes. This has strong implications, for example, on streamlining the design of the event logging in main memory; one plausible realization of such a compact event log is sketched below.
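The following sketch (not NodeMD’s actual code) illustrates how a compressed circular event trace of the kind described above might be laid out, assuming 4-bit event codes packed two per byte into a small static buffer; all names here are hypothetical.

    #include <stdint.h>

    /* Hypothetical compressed circular event log: each event is a 4-bit
     * code, so a 64-byte buffer retains the last 128 events leading up
     * to a fault. Logging costs a few instructions and no flash I/O. */
    #define LOG_BYTES 64
    #define EVENT_MAX (LOG_BYTES * 2)    /* two 4-bit codes per byte */

    static uint8_t  log_buf[LOG_BYTES];
    static uint16_t log_head;            /* next 4-bit slot; wraps around */

    void log_event(uint8_t code)         /* caller passes a 4-bit event code */
    {
        uint8_t *byte = &log_buf[log_head / 2];
        if (log_head & 1)                /* odd slot: high nibble */
            *byte = (uint8_t)((*byte & 0x0F) | ((code & 0x0F) << 4));
        else                             /* even slot: low nibble */
            *byte = (uint8_t)((*byte & 0xF0) | (code & 0x0F));
        log_head = (uint16_t)((log_head + 1) % EVENT_MAX);
    }

On a fault, the entire buffer plus the head index fits in a single radio message, which is what makes immediate notification practical.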
Second, the design of NodeMD should afford the human user flexibility to extend and customize its diagnostic capabilities, e.g. in pursuit of a particular bug or class of bugs. For example, NodeMD allows users to define their own application-specific conditions for triggering the detection of a “fault” and the subsequent halting of the node (one plausible shape for such a check is sketched at the end of Section 4.1 below). Users can further request more detailed diagnostic information when a node is in the halted but stable/responsive debug mode. NodeMD also allows programmers to customize event logging by adding custom events to the history trace. Third, our goal is to introduce algorithms and solutions that are generally applicable to a wide range of embedded systems. For example, the stack overflow detection algorithm applies not just to thread-based systems like MOS, but also to event-driven single-stack systems like TinyOS.

4. FAULT DETECTION

Detecting faults that can potentially disable a node is not a fully resolved problem in the context of WSNs. This section presents work towards identifying fault-prone conditions and implementing detection algorithms that prevent such conditions from paralyzing the node.

Our system currently identifies three generic classes of high-risk application faults that are of especial interest in concurrent sensor operating systems: stack overflow, livelock and deadlock, and application-specific faults. NodeMD is architected so that other detectors, such as detection of out-of-bounds memory writes, can be added to the system, but at present we have focused first on these three general classes of faults.

While many WSN operating systems follow event-driven models, some fault classes of event-driven and concurrent systems are mutually exclusive. Typical problems in event-driven programming concern the need for non-blocking concurrency and run-to-completion code segments, which are implicitly addressed by multithreaded scheduling. While our detection system is designed for the prominent issues in multithreaded systems, detection of some faults, e.g. stack overflow, also applies to event-driven models.

4.1 Stack Overflow

Due to the extremely limited memory available, e.g. 4 KB of RAM on MICA [5] class sensor motes, we have identified stack overflow as a key suspect in software failure. Although stack usage can be estimated by the static analysis used in some approaches [21, 25], data dependencies common in WSNs make it difficult to choose a stack size that is minimal yet guaranteed never to be exceeded. In addition, errors in the code can invalidate static analysis. By comparison, if static analysis is useful for finding a “ballpark” stack size, stack overflow detection in NodeMD catches at run time the cases where that estimate proves wrong.
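To make the run-time check concrete, the following is a minimal sketch of stack overflow detection in a multithreaded system with downward-growing stacks; the thread-descriptor fields, read_sp(), and the invocation point (procedure entry or context switch) are illustrative assumptions rather than NodeMD’s exact mechanism.

    #include <stdint.h>

    /* Hypothetical per-thread bookkeeping: each thread's stack occupies
     * [stack_base, stack_base + stack_size) and grows downward, as on
     * AVR-class MCUs, so overflow means SP dips below stack_base. */
    struct thread {
        uint8_t  *stack_base;   /* lowest legal address of this stack */
        uint16_t  stack_size;
    };

    extern struct thread *current_thread;     /* maintained by the scheduler */
    extern uintptr_t read_sp(void);           /* platform-specific SP read */
    extern void fault_handler(uint8_t code);  /* drops the node into debug mode */

    #define FAULT_STACK_OVERFLOW 1

    /* Called at procedure entry (e.g. compiler-inserted) or on a context
     * switch; 'frame_size' is the stack space about to be consumed. */
    void check_stack(uint16_t frame_size)
    {
        uintptr_t sp = read_sp();
        if (sp - frame_size < (uintptr_t)current_thread->stack_base)
            fault_handler(FAULT_STACK_OVERFLOW);
    }

The point of checking before the frame is pushed, rather than after memory has been corrupted, is that the node can still drop into the debug mode with its state intact.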
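Similarly, the application-defined fault conditions described in Section 3 could take the shape of an assert-style macro that funnels into the same fault handler and event log; this again is a hypothetical sketch, not NodeMD’s published interface.

    #include <stdint.h>

    extern void log_event(uint8_t code);       /* from the event-log sketch above */
    extern void fault_handler(uint8_t code);   /* drops the node into debug mode */

    #define FAULT_APP_ASSERT 2
    #define EV_BAD_READING   0x0A              /* hypothetical custom event code */
    #define ADC_MAX_VALUE    1023              /* assuming a 10-bit ADC */

    /* Application-defined fault check: on failure, record a custom event
     * and halt into debug mode rather than continuing in a corrupt state. */
    #define APP_ASSERT(cond, event_code)          \
        do {                                      \
            if (!(cond)) {                        \
                log_event(event_code);            \
                fault_handler(FAULT_APP_ASSERT);  \
            }                                     \
        } while (0)

    /* Usage: a sensing thread validating a reading before acting on it. */
    void handle_reading(uint16_t reading)
    {
        APP_ASSERT(reading <= ADC_MAX_VALUE, EV_BAD_READING);
        /* ... process the reading ... */
    }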