NodeMD: Diagnosing Node-Level Faults in Remote
Wireless Sensor Systems
Veljko Krunic, Eric Trumpler, Richard Han
Department of Computer Science
University of Colorado at Boulder
krunic@ieee.org, Eric.Trumpler@colorado.edu, Richard.Han@colorado.edu
ABSTRACT
Software failures in wireless sensor systems are notoriously diffi-
cult to debug. Resource constraints in wireless deployments sub-
stantially restrict visibility into the root causes of node-level system
and application faults. At the same time, the high costs of deploy-
ment of wireless sensor systems often far exceed the cumulative
costs of all other sensor hardware, so that software failures that
completely disable a node are prohibitively expensive to repair in
real world applications, e.g. by on-site visits to replace or reset
nodes. We describe NodeMD, a deployment management system
that successfully implements lightweight run-time detection, log-
ging, and notification of software faults on wireless mote-class de-
vices. NodeMD introduces a debug mode that catches a failure be-
fore it completely disables a node and drops the node into a stable
state that enables further diagnosis and correction, thus avoiding
on-site redeployment. We analyze the performance of NodeMD on
a real world application of wireless sensor systems.
Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging—diagnos-
tics, distributed debugging, error handling and recovery, tracing;
C.2.1 [Computer-Communication Networks]: Network Archi-
tecture and Design—wireless communication
General Terms
Design, Experimentation, Management, Performance, Reliability
Keywords
Diagnosis, Software Fault, Wireless Sensor Networks, Deployment
1. INTRODUCTION
The vision of wireless sensor networks (WSNs) typically con-
sists of a large number of very low cost sensor nodes that can be
spread over a wide area to collect environmental data and relay that
data back to a remote database or server via a self-organizing wire-
less mesh network. WSNs are often deployed in distant rugged
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MobiSys’07, June 11–13, 2007, San Juan, Puerto Rico, USA.
Copyright 2007 ACM 978-1-59593-614-1/07/0006 ...$5.00.
environments, e.g. Great Duck Island off the coast of Maine [2],
around wildfires in the Bitterroot National Forest in Idaho [3], and
surrounding an active volcano in Ecuador [4]. These types of de-
ployments are expensive and sometimes even dangerous to deploy-
ment personnel. For example, in the FireWxNet [3] deployment, a
helicopter was used by fire personnel to deploy nodes on three dif-
ferent mountains, in some cases requiring the firefighters to climb
down the mountain to place the nodes.
Compounding the difficulty of WSN deployments is that soft-
ware bugs are inevitably encountered in the field, following a fa-
miliar theme that has been experienced all too commonly in other
deployed software systems. Commercial applications and operat-
ing systems typically have large quality-control resources devoted
to testing of software prior to deployment, yet still encounter soft-
ware bugs in the field that require frequent patching. Despite ex-
haustive testing, commercial handheld devices with embedded soft-
ware such as cell phones and wireless PDAs continue to suffer
from software glitches during operation. As some well publicized
software failures during space missions are showing (e.g. Mars
PathFinder [15, 28]), software errors are a fact of life even for
NASA, which has considerable resources at its disposal for testing
prior to launching a mission. Our expectation is that WSN appli-
cations will face similar difficulties with software bugs that occur
in the field. Moreover, we expect these problems to be exacer-
bated in WSNs by two factors: WSN systems typically are limited
by having much scarcer resources available for testing than com-
mercial and NASA-funded systems; and the data-driven nature of
WSNs can create an unexpected fault-inducing combination of in-
puts that is difficult to forecast during limited lab testing. Indeed,
our own experiences deploying FireWxNet confirmed that software
bugs arose during our deployment despite our best efforts to elimi-
nate errors through lab testing.
The cost of repairing a node that has been crippled due to a soft-
ware failure is especially high in WSN applications, due to the time,
money, and effort required to revisit a node deployed in such remote
rugged terrain. Solutions available in other domains to address soft-
ware failures do not easily apply to the case of WSNs, due to these
extreme conditions of deployment as well as the extreme resource
constraints characteristic of WSNs. For example, to achieve the vi-
sion of many low cost sensor nodes, today’s sensor motes typically
have extremely limited memory available, e.g. 4 KB of RAM and
128 KB of flash on MICA2 [5] class sensor motes. The embedded
controllers characteristic of these sensor nodes also typically lack
hardware memory protection and MMU units. Given these sub-
stantial hardware limitations, i.e. up to six orders of magnitude less
RAM for WSN systems than for PC systems, we expect desktop-
class solutions for detecting and repairing software faults to be too
expensive to directly apply to the resource-constrained domain of
WSNs. Embedded systems such as cell phones are closer in re-
sources to WSN systems, e.g. tens of MBs of RAM, but even here
their solutions do not necessarily apply. For example, when cell
phones/PDAs become unresponsive due to faulty embedded soft-
ware, their owners can often fix the problem by manually resetting
and/or power cycling the device. Manual reset is a prohibitively
expensive option in remote wireless sensor deployments, requiring
on-site visitation.
A system that could catch a software fault before it completely
disables a remote sensor node, and can provide diagnostic informa-
tion to remotely troubleshoot the root cause of the fault, would be
invaluable to in situ WSN deployments. The typical behavior after
encountering a run-time software fault is for a remote node to enter
a bad/unresponsive state that looks like a “black hole”. The fault is
detected retroactively by what information we don’t receive. The
node is completely disabled and needs to be redeployed. Even if
this situation occurs in the lab during testing, the ability to provide
more information than just a “black hole of silence” is clearly bene-
ficial. Such a diagnostic system would be useful not only for in situ
applications but also for troubleshooting errors during the testing
phase.
Our goal in this paper is to offer a diagnostic system, NodeMD,
capable of (1) catching run-time software faults as they occur and
before they completely disable a remote node, and (2) remotely di-
agnosing the root cause of the fault, thereby substantially reducing
the need for costly redeployment of nodes through on-site visits.
Our solution must be tailored for WSNs, i.e. it must be lightweight
and have a small footprint appropriate for the sensor network envi-
ronment.
A medical analogy can provide some insight into the state of the
art with respect to current methods of sensor node debugging. Vis-
iting a failed node in the field is similar to a country doctor that
needs to visit a remote area to treat a sick patient. For both a doc-
tor’s in-home visit and on-site repair of failed remote sensor nodes,
the cost of the visit is prohibitively expensive. The WSN commu-
nity has proposed a variety of approaches to mitigate these costs.
SOS [8] provides an ability to remotely patch a sensor OS, and
can be seen as analogous to a mail-order pharmacy that remotely
provides medicine to alleviate a sickness. Marionette [14] and Nu-
cleus [13] provide the ability to remotely query a node for run-time
state information, and are analogous to a doctor using the telephone
to query a sick patient as to their health. t-kernel [23] provides a
general framework that seeks to prevent certain software faults like
livelock, but not others such as stack overflow, and can be seen as
vaccinating a patient against certain diseases but not others. Nu-
cleus also provides an event log in flash that can be recovered after
a node has died, and is analogous to providing post-mortem analy-
sis.
Given all these pieces of the puzzle, we are still missing effective
tools that are equivalent to a patient proactively reporting the rapid
onset and current symptoms of an illness, as well as their history
of behavior that led up to that illness, before that illness completely
incapacitates that patient. There is no equivalent ability, in the suite
of tools available to the WSN community, to a human patient that
picks up the phone and reports “Doctor, I am not feeling well, these
are the symptoms and this is what I did in the last few days”. Given
today’s WSN debugging tools, a node can still fail without report-
ing any information about the failure at the time of the failure. As
a result, today’s WSN community still cannot completely avoid a
need for the equivalent of in-home visits.
NodeMD is the last piece of the puzzle necessary to realize the
equivalent of a fully capable "remote doctor" in the world of WSNs,
thereby drastically reducing the need for on-site visits.
With NodeMD providing the missing link, we can envision a com-
plete system based on keeping the “human in the loop”, in which
problems with the software are brought immediately to the atten-
tion of the programmer before they disable a node, good diagnos-
tic tools are provided for timely diagnosis of the problem, and the
appropriate remedy can be applied by remotely updating a sensor
node with debugged code. Ultimately the goal of our system is to
bring node debugging in these challenging, resource-constrained,
remote wireless environments to a level that is as useful as what
exists in modern desktop computing systems.
The main contributions of this paper comprise the following:
building a fault management system for WSNs that is capable of
detecting a broad spectrum of software faults at run-time; intro-
ducing a recovery/debug mode that catches those faults so as not
to completely disable the afflicted node; timely notification of the
fault along with a brief diagnostic history of the events that led up
to the fault; continued interaction with the halted node to close the
loop on the debugging cycle by including a human programmer;
resource-constrained solutions to all of the above; and proof-of-
concept implementation on a real world sensor application. The
techniques proposed in this paper are designed to be generalizable
across many different systems, and we foresee future implementa-
tions of NodeMD being used in a wide context of embedded oper-
ating systems.
In Section 2, we discuss related work in fault management in
WSNs. Section 3 presents the unified system architecture of NodeMD.
Section 4 introduces our suite of algorithms for detecting faults at
run-time, including stack overflow, deadlock, livelock, and application-
specific faults. Section 5 discusses our solution for entering the
recovery/debug mode upon a detected fault and providing notifica-
tion via a compressed history of the events leading up to the fault.
Section 6 closes the loop on fault management by allowing interac-
tive debugging by a human of the remote node in the halted mode.
Finally, section 7 provides a detailed analysis of the current imple-
mentation in Mantis OS [7] for several real world sensor applica-
tions.
2. RELATED WORK
Sensor network debugging today usually begins with staring at
a set of blinking LEDs. JTAG interfaces on sensor boards pro-
vide increased visibility into faults, but only for nodes directly con-
nected to a wired network. For wireless sensor nodes in either an in
situ wireless deployment or testbed environment, some systems are
emerging that provide limited visibility into fault behavior. The
Sympathy system [12] focuses on debugging networking faults,
providing periodic reporting of various networking metrics to diag-
nose the reason behind reduced network throughput. The approach
is somewhat limited in its periodic reporting, though the period can
be adjusted, and does not focus on detecting application and OS
software failures on a node.
Nucleus [13], a deployment debugging system, was developed
to resolve a lack of information when live deployments fail. Its pri-
mary features are a robust logging system and on-demand requests
for information from nodes in the network. One essential theme
we share is that our debugging methods must persist even when
the application fails. Nucleus stores “printf” style messages in a
limited buffer within main memory, and also writes them to flash
memory to act as a sensor node "black box". Such messages are an
inefficient use of main memory, since the ratio of information logged
to storage consumed is low. Also, the slow storage of messages in
flash may affect timing in the program if log operations are called
within timing sensitive code. Additionally, once a node has failed
such information is only available after the node has been retrieved.
Recent work done in t-kernel [23], a reliable OS kernel, takes
an approach that ensures the system is always able to retake control
from an application. At a low level, each branch instruction first
jumps to the system for verification before jumping back to the
target address. In fact, this preemption technique would be useful
to support some of the techniques proposed by NodeMD. t-kernel
provides a “safe execution environment” that allows the system to
recover from problems such as deadlock or livelock. However, t-
kernel is designed for reliability rather than debugging, and only
ensures that the system can always execute. It does not react to the
onset of the faults it circumvents, i.e. deadlock and livelock.
Nor does it address how to detect other types of faults, such as stack
overflow, or how to efficiently provide useful information for fault
diagnosis.
Marionette [14] provides a mechanism to query the memory in
nodes for their state. It is specific to TinyOS, and does not focus on
detection, preemption, and notification of faults as they occur.
A variety of approaches for remote code updates in WSNs have
been proposed, and are summarized in [6]. These approaches can
be roughly divided into a networking component that achieves reli-
able code propagation, e.g. Deluge [9] and Aqueduct [10], and an
operating system component that enables efficient update of code
images on a sensor node, e.g. SOS [8] or the ELF loader [24]. Our
fault management system is agnostic to the particular combination
of mechanisms chosen for remote code updates. In theory any of
them could be reused in NodeMD’s architecture. For example, the
ELF dynamic modules loader [24] was recently implemented in-
side of MOS to enable efficient code updates, the same platform
upon which NodeMD is implemented. Our focus in this paper is
not on these mechanisms, but instead is on our innovation in auto-
mated fault detection, notification, and diagnosis, the missing links
in fault management for WSN systems.
3. SYSTEM ARCHITECTURE AND DESIGN GOALS
NodeMD’s fault management system consists of three main sub-
systems that correspond to the system shown in Figure 1. These
subsystems are combined under a single unified architecture to pro-
vide an expansive solution to node-level fault diagnosis in deployed
WSNs.
• The fault detection subsystem is designed for monitoring the
health of the system and catching software faults such as
stack overflow, livelock, deadlock, and application-defined
faults as they occur, signified by the ’X’ of the failed node in
the figure.
• The fault notification or reporting subsystem is responsible
for constantly logging, in a space- and time-efficient manner,
the sequence of events occurring in the system. This com-
pressed event trace, in the form of a circular bit
vector is then conveyed in a notification message back to the
human user immediately after a fault.
• The fault diagnosis subsystem essentially closes the loop on
the “debugging” cycle, halting the node and dropping it into a
safe debug or error recovery mode wherein interactive queries
can be accepted from a remote human user for more detailed
diagnostic information, and remote code updates can also be
accepted.
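The compressed circular event trace described above can be sketched in C as follows. This is a minimal illustration, assuming a fixed event-code width and a tiny wrap-around byte buffer; the buffer size, code width, and function names are our own assumptions, not NodeMD's actual encoding.

```c
#include <stdint.h>

/* Illustrative circular bit-vector event log: each event is packed as
 * a small fixed-width code into a byte buffer that wraps around, so
 * the most recent history is always retained in very little RAM. */

#define LOG_BYTES  16    /* tiny buffer, sized for mote-class RAM */
#define EVENT_BITS 4     /* 16 distinct event codes               */

static uint8_t  log_buf[LOG_BYTES];
static uint16_t bit_pos = 0;   /* next write position, in bits */

/* Append one EVENT_BITS-wide code, wrapping when the buffer is full. */
void log_event(uint8_t code) {
    for (int i = EVENT_BITS - 1; i >= 0; i--) {
        uint16_t byte = bit_pos / 8;
        uint16_t bit  = 7 - (bit_pos % 8);
        if ((code >> i) & 1)
            log_buf[byte] |=  (uint8_t)(1u << bit);
        else
            log_buf[byte] &= (uint8_t)~(1u << bit);
        bit_pos = (uint16_t)((bit_pos + 1) % (LOG_BYTES * 8));
    }
}

/* Read back the EVENT_BITS-wide code starting at bit offset 'start';
 * a notification message would ship the raw buffer instead. */
uint8_t read_event(uint16_t start) {
    uint8_t code = 0;
    for (int i = 0; i < EVENT_BITS; i++) {
        uint16_t p = (uint16_t)((start + i) % (LOG_BYTES * 8));
        code = (uint8_t)((code << 1) | ((log_buf[p / 8] >> (7 - p % 8)) & 1));
    }
    return code;
}
```

Packing events at sub-byte granularity is what makes the trace cheap enough to keep resident in main memory and small enough to fit in a single notification radio message.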
NodeMD must accomplish the above diagnostic features while
achieving a variety of other design goals.

Figure 1: System architecture of NodeMD.

First, it is essential that
fault detection and notification be extremely memory-efficient and
low overhead in terms of CPU and radio bandwidth, to fit within the
extreme resource constraints demanded by deployed sensor nodes.
This has strong implications, for example on streamlining the de-
sign of the event logging in main memory. Second, the design of
NodeMD should afford the human user flexibility to extend and
customize its diagnostic capabilities, i.e. in pursuit of a particular
bug or class of bugs. For example, NodeMD allows a user to define
their own application-specific conditions for triggering the detec-
tion of a “fault” and the subsequent halting of the node. Users can
further request more detailed diagnostic information when a node
is in the halted but stable/responsive debug mode. NodeMD also
allows programmers to customize event logging by adding custom
events to the history trace. Third, our goal is to introduce algo-
rithms and solutions that are generally applicable to a wide range
of embedded systems. For example, the stack overflow detection
algorithm is applicable not just on thread-based systems like MOS,
but also to event-driven single-stack systems like TinyOS.
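As a rough illustration of the application-defined fault conditions described above, the following C sketch shows how a user-supplied check might trigger the halt into debug mode. The names here (NODE_ASSERT, node_halt_debug) are hypothetical placeholders, not NodeMD's actual API.

```c
/* Hedged sketch of an application-defined fault hook: when a
 * user-specified condition is violated at run time, the node is
 * dropped into a stable debug/recovery mode rather than left to
 * crash silently. Names are illustrative, not NodeMD's API. */

static int         halted = 0;
static const char *fault_reason = 0;

/* Stand-in for entering the stable debug mode: record the reason,
 * stop application threads, and await remote diagnostic queries. */
void node_halt_debug(const char *reason) {
    halted = 1;
    fault_reason = reason;
}

/* Application-specific fault condition: if 'cond' is violated,
 * treat it as a fault and halt into debug mode. */
#define NODE_ASSERT(cond, reason) \
    do { if (!(cond)) node_halt_debug(reason); } while (0)

int sensor_reading_ok(int raw) {
    /* Example check: readings outside a 10-bit ADC's range suggest
     * a driver or memory fault rather than valid data. */
    NODE_ASSERT(raw >= 0 && raw <= 1023, "ADC reading out of range");
    return !halted;
}
```

The point of such a hook is that the programmer, not the OS, decides what "healthy" means for a given application, while the halt-and-notify machinery stays common to all detectors.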
4. FAULT DETECTION
Detecting faults that can potentially disable a node is not a fully
resolved problem in the context of WSNs. This section presents
work towards identifying fault-prone conditions and implementing
detection algorithms to prevent such conditions from paralyzing the
node.
Our system currently identifies three generic classes of high-risk
application faults that are of particular interest in concurrent sen-
sor operating systems: stack overflow, livelock and deadlock, and
application-specific faults. NodeMD is architected so that other
detectors can be added to our system, such as detection of out-
of-bounds memory writes, but at present we have focused first on
detecting these three general classes of faults.
While many WSN operating systems follow event-driven mod-
els, some fault classes are exclusive to either event-driven or
concurrent systems. Typical problems in event-driven pro-
gramming concern the need for non-blocking concurrency and run-
to-completion code segments, which are implicitly addressed by
multithreaded scheduling. While our detection system is designed
for the prominent issues in multithreaded systems, detection of
some faults also applies to event-driven models, i.e. stack overflow.
4.1 Stack Overflow
Due to the extremely limited memory available, e.g. 4 KB of
RAM on MICA [5] class sensor motes, we have identified stack
overflow as a key suspect in software failure. Although stack usage
can be estimated by static analysis used in some approaches [21,
25], data dependencies common in WSNs make it difficult to choose
a stack size that is minimal yet guaranteed never to be exceeded.
In addition, errors in the code can make static analysis invalid. By
comparison, if static analysis is useful for finding a “ballpark” stack
size, stack overflow detection in Node
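A minimal sketch of such a run-time stack bound check, assuming a per-thread record of each stack region and a downward-growing stack (as on AVR-class microcontrollers); the structure and field names are illustrative, not MOS's actual thread table.

```c
#include <stdint.h>

/* Illustrative stack-bound check: at each context switch (or other
 * checkpoint) the kernel compares a thread's saved stack pointer
 * against the base of its allocated stack region. Stacks grow
 * downward here, so falling below stack_base means overflow. */

typedef struct {
    uintptr_t stack_base;  /* lowest legal address of this stack */
    uintptr_t stack_size;  /* bytes allocated for this stack     */
    uintptr_t sp;          /* saved stack pointer                */
} thread_t;

/* Returns nonzero if the thread has grown past the bottom of its
 * stack region; the kernel would then halt into debug mode. */
int stack_overflowed(const thread_t *t) {
    return t->sp < t->stack_base;
}
```

Because the check is a single comparison against precomputed bounds, it is cheap enough to run on every context switch, which is what makes run-time detection practical where static analysis alone is not.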