Compiling an application for use in highly radioactive environments

Having worked for about 4-5 years in software/firmware development and environmental testing of miniaturized satellites*, I would like to share my experience here.

*(Miniaturized satellites are a lot more prone to single-event upsets than bigger satellites, due to the relatively small, limited sizes of their electronic components.)

To be very concise and direct: there is no mechanism by which software/firmware can recover from a detectable, erroneous situation on its own without at least one copy of a minimum working version of the software/firmware kept somewhere for recovery purposes - and without hardware that supports the recovery (i.e. is still functional).

Now, this situation is normally handled at both the hardware and the software level. Here, as you request, I will share what we can do at the software level.

  1. ...recovery purposes... Provide the ability to update/recompile/reflash your software/firmware in the real environment. This is an almost must-have feature for any software/firmware in a highly ionized environment. Without it, you could have as much redundant software/hardware as you want, but at some point it is all going to blow up. So, prepare this feature!

  2. ...minimum working version... Keep multiple, responsive copies of a minimum version of the software/firmware. This is like Safe Mode in Windows. Instead of having only one fully functional version of your software, have multiple copies of a minimum version of it. The minimum copy will usually be much smaller than the full copy and almost always has only the following two or three features:

    1. capable of listening to commands from the external system,
    2. capable of updating the current software/firmware,
    3. capable of monitoring the basic operation's housekeeping data.
  3. ...copy... somewhere... Have redundant software/firmware somewhere.

    1. You could, with or without redundant hardware, try to have redundant software/firmware in your ARM uC. This is normally done by having two or more identical software/firmware images at separate addresses which send heartbeats to each other - but only one is active at a time. If one or more images is known to be unresponsive, switch to the other (a sketch of such a switch appears after this list). The benefit of this approach is that a functional replacement is available immediately after an error occurs - without any contact with whatever external system/party is responsible for detecting and repairing the error (in the satellite case, it is usually the Mission Control Centre (MCC)).

      Strictly speaking, without redundant hardware the disadvantage of this approach is that you cannot eliminate all single points of failure. At the very least, you will still have one: the switch itself (or often the beginning of the code). Nevertheless, for a device limited by size in a highly ionized environment (such as pico/femto satellites), reducing the single points of failure to one point without additional hardware is still worth considering. Moreover, the piece of code doing the switching would certainly be much smaller than the code for the whole program, significantly reducing the risk of a Single Event in it.

    2. But if you are not doing this, you should have at least one copy in your external system which can come into contact with the device and update the software/firmware (in the satellite case, it is again the mission control centre).

    3. You could also keep the copy in permanent memory storage on your device, from which a restore of the running system's software/firmware can be triggered.
  4. ...detectable erroneous situation... The error must be detectable, usually by a hardware error correction/detection circuit or by a small piece of code for error correction/detection. It is best to keep such code small, in multiple independent copies, and separate from the main software/firmware; its only task is checking/correcting. If the hardware circuit/firmware is reliable (for example, more radiation-hardened than the rest, or backed by multiple circuits/logics), then you might consider doing error correction with it. If it is not, it is better to limit it to error detection and leave the correction to an external system/device. For error correction, you could consider a basic algorithm like Hamming or Golay(23), because they can be implemented fairly easily in both the circuit and the software. But it ultimately depends on your team's capability. For error detection, CRC is normally used (see the sketch after this list).

  5. ...hardware supporting the recovery... Now we come to the most difficult aspect of this issue. Ultimately, recovery requires the hardware responsible for the recovery to be at least functional. If the hardware is permanently broken (which normally happens after its total ionizing dose reaches a certain level), then there is (sadly) no way for the software to help with recovery. Hardware is thus rightly the utmost concern for a device exposed to a high radiation level (such as a satellite).
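As a minimal sketch of points 3.1 and 4 combined, here is roughly what a CRC-checked image switch could look like. The addresses, image size, and jump_to_image() are hypothetical placeholders; in a real system they come from your linker script and MCU documentation:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Hypothetical flash layout for two identical firmware images;
// each 0x40000-byte slot stores its CRC-32 in the last 4 bytes.
constexpr uintptr_t IMAGE_A_BASE = 0x08004000;
constexpr uintptr_t IMAGE_B_BASE = 0x08044000;
constexpr size_t    IMAGE_SIZE   = 0x3FFFC;

// CRC-32 (reflected, polynomial 0xEDB88320), bit-by-bit to keep
// the checker itself tiny - the less code here, the smaller the
// target for a Single Event.
static uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static bool image_valid(uintptr_t base) {
    const uint8_t* p = reinterpret_cast<const uint8_t*>(base);
    uint32_t stored;                     // CRC appended at image end
    std::memcpy(&stored, p + IMAGE_SIZE, sizeof stored);
    return crc32(p, IMAGE_SIZE) == stored;
}

// MCU-specific: set the stack pointer and vector table, then jump.
extern void jump_to_image(uintptr_t base);

void select_image() {
    if (image_valid(IMAGE_A_BASE))      jump_to_image(IMAGE_A_BASE);
    else if (image_valid(IMAGE_B_BASE)) jump_to_image(IMAGE_B_BASE);
    // Neither image intact: fall back to the minimal recovery image
    // (point 2 above) and wait for a re-flash command.
}
```

Note how little code sits in front of the switch; keeping this path tiny is exactly what makes the remaining single point of failure tolerable.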

In addition to the above suggestions for anticipating firmware errors due to single-event upsets, I would also suggest that you have:

  1. An error detection and/or error correction algorithm in the inter-subsystem communication protocol. This is another almost must-have in order to avoid incomplete/wrong signals being received from other systems.

  2. A filter on your ADC readings. Do not use an ADC reading directly. Filter it with a median filter, a mean filter, or any other filter - never trust a single reading value. Sample more, not less - reasonably. (A median filter is sketched below.)
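For illustration, a median-of-five filter might look like the following; read_adc() is a stand-in for whatever your platform's sampling routine is:

```cpp
#include <cstdint>
#include <algorithm>

extern uint16_t read_adc();   // platform-specific sampling routine

// Median-of-N filter: a single corrupted sample cannot drag the
// result, unlike a plain average. N is odd so the median is unique.
uint16_t read_adc_filtered() {
    constexpr int N = 5;
    uint16_t s[N];
    for (int i = 0; i < N; ++i) s[i] = read_adc();
    std::sort(s, s + N);      // N is tiny, a full sort is fine
    return s[N / 2];          // middle element = median
}
```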


You may also be interested in the rich literature on the subject of algorithmic fault tolerance. This includes the old assignment: write a sort that correctly sorts its input when a constant number of comparisons fail (or, the slightly more evil version, when the asymptotic number of failed comparisons scales as log(n) for n comparisons).

A place to start reading is Huang and Abraham's 1984 paper, "Algorithm-Based Fault Tolerance for Matrix Operations". Their idea is vaguely similar to homomorphically encrypted computation (but it is not really the same, since they are attempting error detection/correction at the operation level).
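To make the idea concrete, here is a toy version of their row/column-checksum scheme on a 2x2 multiply; the reporting at the end is my own illustration, not the paper's:

```cpp
#include <cstdio>

int main() {
    const int n = 2;
    // Append a checksum row to A (column sums) and a checksum
    // column to B (row sums); the product then carries checksums
    // that must remain consistent, so a single corrupted element
    // of C is detectable - and locatable by its failing row and
    // column - after the multiplication itself.
    double A[3][2] = {{1, 2}, {3, 4}, {0, 0}};
    double B[2][3] = {{5, 6, 0}, {7, 8, 0}};
    for (int j = 0; j < n; ++j) A[n][j] = A[0][j] + A[1][j];
    for (int i = 0; i < n; ++i) B[i][n] = B[i][0] + B[i][1];

    double C[3][3] = {};
    for (int i = 0; i <= n; ++i)
        for (int j = 0; j <= n; ++j)
            for (int k = 0; k < n; ++k)
                C[i][j] += A[i][k] * B[k][j];

    // Exact comparison is fine for these small integers; real code
    // would compare against a tolerance.
    for (int i = 0; i < n; ++i)
        if (C[i][0] + C[i][1] != C[i][n]) printf("row %d corrupted\n", i);
    for (int j = 0; j < n; ++j)
        if (C[0][j] + C[1][j] != C[n][j]) printf("column %d corrupted\n", j);
    return 0;
}
```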

A more recent descendant of that paper is Bosilca, Delmas, Dongarra, and Langou's "Algorithm-based fault tolerance applied to high performance computing".


Here are some thoughts and ideas:

Use ROM more creatively.

Store anything you can in ROM. Instead of calculating things, store look-up tables in ROM. (Make sure your compiler is outputting your look-up tables to the read-only section! Print out memory addresses at runtime to check!) Store your interrupt vector table in ROM. Of course, run some tests to see how reliable your ROM is compared to your RAM.
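A quick sketch of the idea: declare the table const so the toolchain can place it in the read-only section, then print addresses at runtime and compare them against your linker map (the table contents here are placeholders):

```cpp
#include <cstdint>
#include <cstdio>

// const data is eligible for the read-only (flash/ROM) section.
static const uint16_t square_table[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225
};

static uint16_t ram_variable;   // ordinary RAM data, for contrast

int main() {
    // The table's address should fall in the flash range of your
    // memory map, the variable's in the RAM range.
    printf("table at %p, RAM variable at %p\n",
           (const void*)square_table, (void*)&ram_variable);
    return 0;
}
```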

Use your best RAM for the stack.

SEUs in the stack are probably the most likely source of crashes, because it is where things like index variables, status variables, return addresses, and pointers of various sorts typically live.

Implement timer-tick and watchdog timer routines.

You can run a "sanity check" routine every timer tick, as well as a watchdog routine to handle the system locking up. Your main code could also periodically increment a counter to indicate progress, and the sanity-check routine could ensure this has occurred.
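A sketch of the progress-counter pattern, assuming a hypothetical watchdog_kick() wrapping your MCU's watchdog-refresh mechanism:

```cpp
#include <cstdint>

extern void watchdog_kick();   // MCU-specific watchdog refresh

static volatile uint32_t progress;        // written by the main loop
static uint32_t          last_progress;   // remembered by the tick

void main_loop_body() {
    // ... real work ...
    ++progress;                // "I made it through one more pass"
}

void timer_tick_isr() {        // runs on every timer tick
    if (progress != last_progress) {
        last_progress = progress;
        watchdog_kick();       // main loop is alive, postpone reset
    }
    // Otherwise withhold the kick: the hardware watchdog expires
    // and resets the system out of its locked-up state.
}
```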

Implement error-correcting-codes in software.

You can add redundancy to your data to be able to detect and/or correct errors. This will add processing time, potentially leaving the processor exposed to radiation for a longer time, thus increasing the chance of errors, so you must consider the trade-off.
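As one concrete option, here is a classic Hamming(7,4) code in software: 4 data bits are protected by 3 parity bits, and any single bit flip in the 7-bit codeword is located and corrected:

```cpp
#include <cstdint>

// Encode a nibble into a 7-bit codeword. Bit i of the result is
// codeword position i+1 in the classic 1-based construction.
uint8_t hamming74_encode(uint8_t nibble) {
    unsigned d1 = nibble & 1, d2 = (nibble >> 1) & 1,
             d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    unsigned p1 = d1 ^ d2 ^ d4;   // covers positions 1,3,5,7
    unsigned p2 = d1 ^ d3 ^ d4;   // covers positions 2,3,6,7
    unsigned p3 = d2 ^ d3 ^ d4;   // covers positions 4,5,6,7
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

// Decode, correcting a single flipped bit if present.
uint8_t hamming74_decode(uint8_t cw) {
    auto bit = [cw](int pos) { return (cw >> (pos - 1)) & 1; };
    // The three re-checked parities spell out the 1-based position
    // of a single flipped bit (0 means no error detected).
    int s = (bit(1) ^ bit(3) ^ bit(5) ^ bit(7))
          | (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) << 1
          | (bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) << 2;
    if (s) cw ^= (uint8_t)(1u << (s - 1));    // correct the flip
    return (uint8_t)(((cw >> 2) & 1) | (((cw >> 4) & 1) << 1) |
                     (((cw >> 5) & 1) << 2) | (((cw >> 6) & 1) << 3));
}
```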

Remember the caches.

Check the sizes of your CPU caches. Data that you have accessed or modified recently will probably be within a cache. I believe you can disable at least some of the caches (at a big performance cost); you should try this to see how susceptible the caches are to SEUs. If the caches are hardier than RAM, then you could regularly read and re-write critical data to make sure it stays in the cache and to bring the RAM copy back into line.

Use page-fault handlers cleverly.

If you mark a memory page as not-present, the CPU will issue a page fault when you try to access it. You can create a page-fault handler that does some checking before servicing the read request. (PC operating systems use this to transparently load pages that have been swapped to disk.)
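As a rough POSIX illustration of the trick (written for Linux; on an MMU-less microcontroller the equivalent would be the MPU fault handler, and the actual verification is left as a comment). Note that calling mprotect from a signal handler is not guaranteed async-signal-safe by POSIX, although it works in practice on Linux:

```cpp
#include <csignal>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

static char*  guarded;   // page we deliberately made inaccessible
static size_t page;

static void on_fault(int, siginfo_t* si, void*) {
    char* addr = (char*)si->si_addr;
    if (addr >= guarded && addr < guarded + page) {
        // ... verify a checksum of the page's backing copy here ...
        mprotect(guarded, page, PROT_READ | PROT_WRITE); // unlock, retry
    } else {
        _exit(1);        // a genuine crash - do not loop forever
    }
}

int main() {
    page = (size_t)sysconf(_SC_PAGESIZE);
    guarded = (char*)mmap(nullptr, page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    strcpy(guarded, "hello");

    struct sigaction sa = {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    mprotect(guarded, page, PROT_NONE);  // arm the trap
    printf("%s\n", guarded);  // faults, handler unlocks, access retries
    return 0;
}
```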

Use assembly language for critical things (which could be everything).

With assembly language, you know what is in registers and what is in RAM; you know what special RAM tables the CPU is using, and you can design things in a roundabout way to keep your risk down.

Use objdump to actually look at the generated assembly language, and work out how much code each of your routines takes up.

If you are using a big OS like Linux then you are asking for trouble; there is just so much complexity and so many things to go wrong.

Remember it is a game of probabilities.

A commenter said

"Every routine you write to catch errors will be subject to failing itself from the same cause."

While this is true, the chance of an error in the (say) 100 bytes of code and data required for a check routine to function correctly is much smaller than the chance of errors elsewhere. If your ROM is pretty reliable and almost all the code/data is actually in ROM then your odds are even better.

Use redundant hardware.

Use 2 or more identical hardware setups with identical code. If the results differ, a reset should be triggered. With 3 or more devices you can use a "voting" system to try to identify which one has been compromised.
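A bitwise 2-of-3 voter is tiny, which is a large part of its appeal as a software fallback:

```cpp
#include <cstdint>

// Each output bit takes the value that at least two of the three
// inputs agree on, masking a fault in any single copy.
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

// Any disagreement at all is worth flagging: the odd one out is
// the candidate for reset/re-sync.
bool disagreement(uint32_t a, uint32_t b, uint32_t c) {
    return !(a == b && b == c);
}
```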


NASA has a paper on radiation-hardened software. It describes three main tasks:

  1. regular monitoring of memory for errors, then scrubbing out those errors,
  2. robust error recovery mechanisms, and
  3. the ability to reconfigure if something no longer works.

Note that the memory scan rate should be frequent enough that multi-bit errors rarely occur, as most ECC memory can recover from single-bit errors, not multi-bit errors.
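A minimal scrub pass, assuming hardware ECC RAM that corrects single-bit errors on read (the region bounds are placeholders, and some memory controllers scrub autonomously, making a software loop unnecessary):

```cpp
#include <cstdint>

// Walk a RAM region, reading each word and writing it back. The
// read lets the ECC logic deliver a corrected value; the write-back
// stores that corrected value, clearing the latent single-bit error
// before a second hit in the same word makes it uncorrectable.
static void scrub(volatile uint32_t* begin, volatile uint32_t* end) {
    for (volatile uint32_t* p = begin; p != end; ++p) {
        uint32_t v = *p;   // corrected on read (single-bit errors)
        *p = v;            // store the corrected value back
    }
}
```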

Robust error recovery includes control flow transfer (typically restarting a process at a point before the error), resource release, and data restoration.

Their main recommendation for data restoration is to avoid the need for it, by treating intermediate data as temporary, so that restarting before the error also rolls back the data to a reliable state. This sounds similar to the concept of "transactions" in databases.
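A sketch of that transaction flavour, using a checkpoint and a scratch copy; State, process_into and looks_sane are illustrative placeholders, not from the NASA paper:

```cpp
#include <csetjmp>

struct State { char data[64]; };
extern void process_into(State& scratch, const State& committed);
extern bool looks_sane(const State& s);

static State   committed;    // the only copy that is ever trusted
static jmp_buf checkpoint;

void run_step() {
    setjmp(checkpoint);             // a restart lands back here
    State scratch = committed;      // work on a temporary copy
    process_into(scratch, committed);
    if (!looks_sane(scratch))
        longjmp(checkpoint, 1);     // retry from the checkpoint
    committed = scratch;            // commit only a verified result
}
```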

They discuss techniques particularly suitable for object-oriented languages such as C++. For example:

  1. Software-based ECCs for contiguous memory objects
  2. Programming by Contract: verifying preconditions and postconditions, then checking the object to verify it is still in a valid state.
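A sketch of what point 2 could look like; the RingBuffer class is illustrative rather than from the paper, and flight code would trigger a recovery path instead of assert:

```cpp
#include <cassert>

class RingBuffer {
    unsigned head = 0, tail = 0, count = 0;
    static const unsigned CAP = 32;
    int buf[CAP];

    // Invariant: must hold between all operations; a bit flip in
    // any member is likely to break it.
    bool invariant() const {
        return head < CAP && tail < CAP && count <= CAP;
    }
public:
    void push(int v) {
        assert(invariant());               // object not corrupted
        assert(count < CAP);               // precondition: not full
        buf[tail] = v;
        tail = (tail + 1) % CAP;
        ++count;
        assert(invariant() && count > 0);  // postcondition
    }
};
```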

And, it just so happens, NASA has used C++ for major projects such as the Mars Rover.

C++ class abstraction and encapsulation enabled rapid development and testing among multiple projects and developers.

They avoided certain C++ features that could create problems:

  1. Exceptions
  2. Templates
  3. Iostream (no console)
  4. Multiple inheritance
  5. Operator overloading (other than new and delete)
  6. Dynamic allocation (used a dedicated memory pool and placement new to avoid the possibility of system heap corruption).
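A sketch of that last point - a static pool plus placement new, with illustrative sizes and types:

```cpp
#include <cstddef>
#include <new>

struct Telemetry { float temp; float voltage; };

// All storage is reserved at build time: the system heap is never
// touched, so it can never be corrupted or exhausted mid-mission.
alignas(Telemetry) static unsigned char pool[16 * sizeof(Telemetry)];
static std::size_t used = 0;

Telemetry* alloc_telemetry() {
    if (used + sizeof(Telemetry) > sizeof(pool))
        return nullptr;               // pool exhausted: report, no UB
    void* slot = pool + used;
    used += sizeof(Telemetry);
    return new (slot) Telemetry{};    // construct in place in the pool
}
```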