Resilience in Space: Designing Radiation-Tolerant Systems

By Troy Jones, Xilinx

Responding to our coverage of Xilinx's expansion of its adaptive compute acceleration platform (ACAP) family, a reader suggested a deeper dive. Here it is.

Space is easily the most challenging environment for IC designers. Without Earth’s atmosphere to protect them, electronic systems are vulnerable to high-energy (ionizing) radiation, including alpha and beta particles, gamma rays and X-rays, as well as galactic cosmic radiation.

Ionizing radiation has enough energy to remove an electron from its orbit. When the resulting charge disturbs a node that holds a bit in memory or a value on a bus interface, that value can be changed, or “flipped.” Such events are collectively known as single-event effects (SEEs); specific types include the single-event upset (SEU) and single-event latch-up (SEL). Whatever the name, if the wrong bit is flipped, such as a bit in an instruction of the application code or a control bit in a register, the entire system could fail.
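To make the consequence of a single flipped bit concrete, here is a minimal sketch in Python. The 8-bit control register and the meaning assigned to its bits are hypothetical and chosen purely for illustration; they do not describe any real device.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted (a simulated upset)."""
    return value ^ (1 << bit)

# Hypothetical 8-bit control register (an assumption for illustration):
# bit 0 = heater enable, bit 7 = safe-mode flag.
control = 0b0000_0001
upset = flip_bit(control, 7)

print(f"before: {control:08b}")
print(f"after:  {upset:08b}")
```

One inverted bit is enough to change what the register means to the rest of the system, which is why the location of the upset matters far more than the fact that only one bit changed.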

Radiation-tolerant vs. radiation-hardened

To operate in space, electronic systems need protection against radiation-based events. Some IC manufacturers offer “hardened” components that use, for example, an insulating substrate in place of the standard semiconductor wafer. Hardened ICs are more resistant to radiation-based events but not immune to them. In addition, hardened ICs are significantly more expensive because of their more complex design requirements and lower production volumes.

Among the factors deterring spacecraft designers from choosing hardened ICs is the lag time before a hardened component enters production, assuming the desired component can be designed as a hardened IC at all.

Rather than attempting to prevent ionizing radiation effects through radiation-hardened-by-design methods, designers can instead utilize devices and design techniques intended to detect and correct them when they happen.

This is known as radiation tolerance.

A key advantage of this approach is that many components can be made radiation tolerant. For example, many memory technologies employ error-correcting codes (ECC) to detect and correct bit flips in memory.
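As an illustration of how an error-correcting code can locate and repair a flipped bit, the following is a minimal sketch of the classic Hamming(7,4) code in Python. It is a textbook example, not the scheme used by any particular memory device; production memories typically protect much wider words. The function names are hypothetical.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c: list[int]) -> tuple[list[int], int]:
    """Return (data bits after correction, 1-based position of the flipped bit, or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # binary position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1  # simulate an upset at position 5
recovered, position = hamming74_correct(word)
print(recovered, position)
```

This code corrects any single flipped bit in the 7-bit word. Two flips in the same word would be miscorrected, which is why practical schemes add further parity to at least detect double errors.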

Triple modular redundancy

Consider the complexity required to detect a bit flip in a register, or to detect that data retrieved from memory had a bit flipped during its transfer over the bus interface. Developers commonly detect and correct events of this nature by using triple modular redundancy (TMR). With TMR, key circuits are implemented identically three times in parallel, and a “voting” circuit compares the outputs of these identical paths and chooses the majority answer. (See Figure 1.)
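The voting step can be sketched in a few lines of Python. This is a software model of a bitwise two-of-three majority voter, written on the assumption that at most one of the three copies is upset at a time; the function names are hypothetical and this is not vendor code.

```python
from typing import Optional

def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit is the value at least two inputs share."""
    return (a & b) | (a & c) | (b & c)

def dissenting_module(a: int, b: int, c: int) -> Optional[int]:
    """Return the index (0, 1 or 2) of the first copy that differs from the vote, or None."""
    voted = majority_vote(a, b, c)
    for index, output in enumerate((a, b, c)):
        if output != voted:
            return index
    return None

good = 0xA5
outputs = (good, good ^ 0x08, good)  # copy 1 suffers a single-bit upset
print(hex(majority_vote(*outputs)), dissenting_module(*outputs))
```

Because the vote is taken bit by bit, the correct word is recovered whenever the copies do not have the same bit corrupted in two places at once.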

If one of the circuits experiences an event that affects the output, that output will differ from the outputs of the other two circuits. If just two identical circuits were used and compared, differing outputs would show that an event occurred but not which circuit it occurred in.

Which one is correct? With three circuits, the correct output can be determined (based on the reasonable assumption that the odds of identical SEEs occurring in two of the circuits are effectively zero).
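A short calculation shows why that assumption is reasonable. The sketch below assumes each copy is upset independently with probability p during some interval; the value of p is hypothetical, chosen only to show the scaling. It computes the chance that two or more copies are upset, which is an upper bound on the case described above, since the two upsets would also have to be identical to mislead the voter.

```python
def tmr_failure_probability(p: float) -> float:
    """Probability that at least two of three independent copies are upset."""
    exactly_two = 3 * p**2 * (1 - p)  # three ways to choose which pair is upset
    all_three = p**3
    return exactly_two + all_three

p = 1e-6  # assumed per-copy upset probability in the interval (illustrative only)
print(f"single copy: {p:.3e}")
print(f"TMR (two or more upset): {tmr_failure_probability(p):.3e}")
```

For small p the result is approximately 3p², so the voted output fails far less often than any single copy.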

Developers can then accept the majority output or reevaluate the operation. Many OEMs utilize custom ICs for their designs, so to achieve TMR, they place three copies of the IC in parallel on the board with an added voter IC.

Figure 1: With triple modular redundancy, three identical circuits in parallel are evaluated with a “voting” circuit to ensure the circuit produces the correct (majority) output.

Mission-critical TMR

TMR provides a high level of reliability with minimal impact (i.e., latency) on system performance. However, this reliability clearly comes at a cost, increasing system footprint, power consumption and expense. Given that not all circuits are equally important, developers ideally want to implement TMR only where necessary.

Consider a temperature sensor. An infrequent data point error won’t affect overall monitoring as samples can be averaged over time. Thus, there is no need to bear the additional expense of three sensors or three monitoring circuits.

An alternative to placing three copies of a circuit on a board is to implement the circuits in space-grade programmable devices like the Xilinx XQR Versal ACAP or adaptable SoCs. The Xilinx integrated programmable logic approach enables designers to implement complex TMR in a single chip. Instead of placing three ICs in parallel, a single programmable logic device holds the three circuits and the voting circuit all in one. (See Figure 2.)

Figure 2: The Xilinx XQR space-grade device shown here allows mission-critical circuits to be implemented using TMR, all in a single chip.

A major advantage of using programmable logic is that designers can implement TMR only where needed. In this way, mission-critical blocks can be implemented with the highest reliability without duplicating less important blocks, which would drive up cost and power consumption.
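The trade-off can be illustrated with a rough resource count. Every figure in this Python sketch is hypothetical: the block names, their logic-cell counts, and the voter overhead are invented for illustration and are not taken from any real design or device.

```python
# Hypothetical design blocks: (name, logic cells for one copy, mission-critical?)
BLOCKS = [
    ("attitude_control", 4000, True),
    ("command_decoder", 2500, True),
    ("telemetry_formatter", 3000, False),
    ("temperature_monitor", 500, False),
]
VOTER_CELLS = 100  # assumed voter overhead per triplicated block

def total_cells(blocks, triplicate_all: bool) -> int:
    """Sum logic cells when TMR is applied to every block or only to critical ones."""
    total = 0
    for _name, cells, critical in blocks:
        if triplicate_all or critical:
            total += 3 * cells + VOTER_CELLS  # three copies plus a voter
        else:
            total += cells  # single, unprotected copy
    return total

print("TMR everywhere:   ", total_cells(BLOCKS, triplicate_all=True))
print("TMR critical only:", total_cells(BLOCKS, triplicate_all=False))
```

With these assumed numbers, protecting only the two critical blocks saves roughly a quarter of the logic that full triplication would consume, while the blocks that matter most remain fully protected.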

In addition, because an adaptive system in an ACAP or FPGA is not fixed in functionality like a custom IC, designers can introduce new features without the delay or cost of spinning a new IC.

Adaptive flexibility is becoming increasingly important as evolving AI and machine learning technologies become integral to electronic systems. Hardware systems can be updated with new AI inference models in much the same way as a software update. These updates can also be applied to systems already in orbit, improving their efficiency and performance even after deployment, something that was not possible until recently.

Scrubbing

One difference between programmable logic and a custom IC is that an ACAP/FPGA utilizes a configuration. This configuration defines how the programmable device will function and is stored in SRAM-based cells, commonly called a configuration RAM, or CRAM. As a result, the CRAM can be affected by a radiation-based event, potentially changing the desired “personality” of the programmable device.

Scrubbing is a methodology employed for protecting the configuration memory cells. A dedicated portion of the device constantly checks the CRAM using checksum analysis on each frame. If an event is detected, a reconfiguration is initiated. The device “scrubs” (i.e., reloads) the configuration frame that was corrupted by the ionizing radiation. With the event corrected, processing can continue.
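The detect-and-reload cycle can be modeled in Python as follows. This is a simplified software sketch, not a description of the device's actual mechanism: the frame size, the use of CRC-32 as the per-frame checksum, and the "golden" reference copy held in separate storage are all assumptions made for illustration.

```python
import zlib

FRAME_BYTES = 16  # assumed frame size, illustrative only

def make_golden(num_frames: int) -> list[bytes]:
    """Build an arbitrary known-good configuration image, one bytes object per frame."""
    return [bytes((f * 31 + i) % 256 for i in range(FRAME_BYTES)) for f in range(num_frames)]

def scrub(cram: list[bytearray], golden: list[bytes], checksums: list[int]) -> list[int]:
    """Check each frame's CRC-32 and reload mismatching frames; return repaired indices."""
    repaired = []
    for index, frame in enumerate(cram):
        if zlib.crc32(frame) != checksums[index]:
            cram[index][:] = golden[index]  # reload only the corrupted frame
            repaired.append(index)
    return repaired

golden = make_golden(4)
checksums = [zlib.crc32(frame) for frame in golden]
cram = [bytearray(frame) for frame in golden]

cram[2][5] ^= 0x10  # simulate a single-bit upset in frame 2
repaired = scrub(cram, golden, checksums)
print("repaired frames:", repaired)
```

Only the frame whose checksum no longer matches is rewritten; the other frames are read but left untouched.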

Note that only the affected frame requires scrubbing, while the rest of the system continues to operate without interruption. Alternatively, the ACAP/FPGA can employ a “blind scrub.” Instead of checking for an event, the device regularly rewrites its configuration to guarantee it is in a known-good state. This approach is quite robust since it forces a refresh of the CRAM even if unnecessary.
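A blind scrub is simpler to model, because there is no detection step at all. The Python sketch below assumes a known-good reference image is available and shows a single pass; in a real system the pass would be repeated on a schedule. The names and frame contents are hypothetical.

```python
def blind_scrub(cram: list[bytearray], golden: list[bytes]) -> None:
    """Rewrite every frame from the known-good image without checking it first."""
    for index, frame in enumerate(golden):
        cram[index][:] = frame

# Arbitrary three-frame reference image (illustrative contents only).
golden = [bytes([value] * 8) for value in (0x11, 0x22, 0x33)]
cram = [bytearray(frame) for frame in golden]

cram[0][1] ^= 0x01  # two simulated upsets in different frames
cram[2][7] ^= 0x80

blind_scrub(cram, golden)
print(all(bytes(frame) == ref for frame, ref in zip(cram, golden)))
```

The cost of this simplicity is that every frame is rewritten on every pass, whether or not it was corrupted.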

In previous generations, Xilinx CRAM single-event mitigation via scrubbing was implemented in an external IC. Now it is an integrated function within either the programmable logic of an FPGA or a dedicated processor in an ACAP.

By their nature, electronics do not have inherent radiation resistance. Through advanced design approaches, systems can identify and correct radiation-based events, increasing overall system tolerance to radiation and significantly improving reliability and resilience. By working with adaptive platforms, designers can optimize system costs and real estate as well as power consumption by applying triple modular redundancy and scrubbing techniques.

This article was originally published on EE Times.

Troy Jones is a space systems architect for the aerospace and defense marketing team at Xilinx Inc.
