The Mars Pathfinder mission was widely proclaimed as "flawless" in the
early days after its July 4th, 1997 landing on the Martian surface.
Successes included its unconventional "landing" -- bouncing onto the
Martian surface surrounded by airbags, deploying the Sojourner rover,
and
gathering and transmitting voluminous data back to Earth, including
the
panoramic pictures that were such a hit on the Web. But a few
days into
the mission, not long after Pathfinder started gathering meteorological
data, the spacecraft began experiencing total system resets, each resulting
in losses of data. The press reported these failures in terms
such as
"software glitches" and "the computer was trying to do too many things
at
once".
This week at the IEEE Real-Time Systems Symposium I heard a fascinating
keynote address by David Wilner, Chief Technical Officer of Wind River
Systems. Wind River makes VxWorks, the real-time embedded systems
kernel
that was used in the Mars Pathfinder mission. In his talk, he
explained in
detail the actual software problems that caused the total system resets
of
the Pathfinder spacecraft, how they were diagnosed, and how they were
solved. I wanted to share his story with each of you.
VxWorks provides preemptive priority scheduling of threads. Tasks
on the
Pathfinder spacecraft were executed as threads with priorities that
were
assigned in the usual manner reflecting the relative urgency of these
tasks.
Pathfinder contained an "information bus", which you can think of as
a
shared memory area used for passing information between different
components of the spacecraft. A bus management task ran frequently
with
high priority to move certain kinds of data in and out of the information
bus. Access to the bus was synchronized with mutual exclusion
locks
(mutexes).
The meteorological data gathering task ran as an infrequent, low priority
thread, and used the information bus to publish its data. When
publishing
its data, it would acquire a mutex, do writes to the bus, and release
the
mutex. If an interrupt caused the information bus thread to be
scheduled
while this mutex was held, and if the information bus thread then attempted
to acquire this same mutex in order to retrieve published data, this
would
cause it to block on the mutex, waiting until the meteorological thread
released the mutex before it could continue. The spacecraft also
contained
a communications task that ran with medium priority.
Most of the time this combination worked fine. However, very infrequently
it was possible for an interrupt to occur that caused the (medium priority)
communications task to be scheduled during the short interval while
the
(high priority) information bus thread was blocked waiting for the
(low
priority) meteorological data thread. In this case, the long-running
communications task, having higher priority than the meteorological
task,
would prevent it from running, consequently preventing the blocked
information bus task from running. After some time had passed,
a watchdog
timer would go off, notice that the data bus task had not been executed
for
some time, conclude that something had gone drastically wrong, and
initiate
a total system reset. This scenario is a classic case of priority inversion.
HOW WAS THIS DEBUGGED?
VxWorks can be run in a mode where it records a total trace of all
interesting system events, including context switches, uses of
synchronization objects, and interrupts. After the failure, JPL
engineers
spent hours and hours running the system on the exact spacecraft replica
in
their lab with tracing turned on, attempting to replicate the precise
conditions under which they believed that the reset occurred.
Early in the
morning, after all but one engineer had gone home, the engineer finally
reproduced a system reset on the replica. Analysis of the trace
revealed
the priority inversion.
HOW WAS THE PROBLEM CORRECTED?
When created, a VxWorks mutex object accepts a boolean parameter that
indicates whether priority inheritance should be performed by the mutex.
The mutex in question had been initialized with the parameter off;
had it
been on, the low-priority meteorological thread would have inherited
the
priority of the high-priority data bus thread blocked on it while it
held
the mutex, causing it to be scheduled with higher priority than the
medium-priority communications task, thus preventing the priority
inversion. Once diagnosed, it was clear to the JPL engineers that using
priority inheritance would prevent the resets they were seeing.
VxWorks contains a C language interpreter intended to allow developers
to
type in C expressions and functions to be executed on the fly during
system
debugging. The JPL engineers fortuitously decided to launch the
spacecraft
with this feature still enabled. By coding convention, the initialization
parameter for the mutex in question (and those for two others which
could
have caused the same problem) were stored in global variables, whose
addresses were in symbol tables also included in the launch software,
and
available to the C interpreter. A short C program was uploaded
to the
spacecraft, which when interpreted, changed the values of these variables
from FALSE to TRUE. No more system resets occurred.
ANALYSIS AND LESSONS
First and foremost, diagnosing this problem as a black box would have
been
impossible. Only detailed traces of actual system behavior enabled
the
faulty execution sequence to be captured and identified.
Secondly, leaving the "debugging" facilities in the system saved the
day.
Without the ability to modify the system in the field, the problem
could
not have been corrected.
Finally, the engineer's initial analysis that "the data bus task executes
very frequently and is time-critical -- we shouldn't spend the extra
time
in it to perform priority inheritance" was exactly wrong. It
is precisely
in such time critical and important situations where correctness is
essential, even at some additional performance cost.
HUMAN NATURE, DEADLINE PRESSURES
David told us that the JPL engineers later confessed that one or two
system
resets had occurred in their months of pre-flight testing. They
had never
been reproducible or explainable, and so the engineers, in a very
human-nature response of denial, decided that they probably weren't
important, using the rationale "it was probably caused by a hardware
glitch".
Part of it too was the engineers' focus. They were extremely focused
on
ensuring the quality and flawless operation of the landing software.
Should it have failed, the mission would have been lost. It is
entirely
understandable for the engineers to discount occasional glitches in
the
less-critical land-mission software, particularly given that a spacecraft
reset was a viable recovery strategy at that phase of the mission.
THE IMPORTANCE OF GOOD THEORY/ALGORITHMS
David also said that some of the real heroes of the situation were some
people from CMU who had published a paper he'd heard presented many
years
ago who first identified the priority inversion problem and proposed
the
solution. He apologized for not remembering the precise details
of the
paper or who wrote it. Bringing things full circle, it turns
out that the
three authors of this result were all in the room, and at the end of
the
talk were encouraged by the program chair to stand and be acknowledged.
They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was
the last time
you saw a room of people cheer a group of computer science theorists
for
their significant practical contribution to advancing human knowledge?
:-)
It was quite a moment.
POSTLUDE
For the record, the paper was:
L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols:
An
Approach to Real-Time Synchronization. In IEEE Transactions on Computers,
vol. 39, pp. 1175-1185, Sep. 1990.