How to design electronics to last 40 years or more?

You asked about Space probes specifically, but your question also had a more general flavour. I've addressed "how to make things last" generally. In space the AC mains aspects, for example, are vanishingly unlikely to be relevant - but power supply issues still are.
This answer is necessarily incomplete and overlaps other comments and answers in some areas. These are "out of my head". I may come back and add more later. Or not :-)

Longish ago I set out to build portable solar lights, mass manufactured in China, with a target lifetime of 20 years. That's what the client wanted. The client, the manufacturers and Murphy conspired against me at every turn. I failed. But managed to make some seriously robust products in the process. One of these days ... :-).

Not all of the following derives from the above experience. But, a fair amount is "informed" by it.

Do not use wet Aluminum electrolytic caps.

Do not use Tantalum caps.

  • OK - you CAN use Tantalum caps if you REALLY know what you are doing.
    As a starting point, do not use Tantalum caps.

Look to see if Rad Hard is liable to help (even if not in a radiation intense environment).

Temperature derate to take advantage of (or avoid) Arrhenius multiplier.

Use a superb conformal coating.

  • A conformal coating MUST have low to no voids at PCBA surface, low dissolved water, low degradation in applicable environment, not produce damaging degradation products and/or scavenge degradation products.

  • ALL coatings pass water vapour - having an essentially void free surface against the PCBA and minimal water in the coating means that the concentration of water reaching the surface is very low and reaction rates are accordingly reduced.

  • As an example of degradation products and scavenging. Glass fronted PV (solar) panels have minimal water transmission through the glass (no surprise). The industry standard bonding material is EVA plastic which is heat and pressure polymerised to form an essentially clear void free adhesive layer between glass and PV cells. Over a decade plus gradual UV attack produces products which enhance cell corrosion. Modern glass front sheets contain scavengers to absorb these reaction products. Lifetimes of 30+ years are "easily enough" obtained. [I have an old tired but still operating BP 50 Watt PV panel more than 40 years old].

  • Parylene is king but not the only answer. Use the right Parylene - it's a family and some suit some areas better than others.
    Dow Corning* 1-2577 and family are "pretty good".

Do not rely on bonding agents to hold things together or in place.

  • Acid-free-cure Silicone Rubbers give 20+ years service if properly matched to surfaces. They may last 30 or 40 years, or more. Do you trust anyone to guarantee this to be the case?
    Surface materials matter - experts will tell you what's needed for tricky surfaces.
    But, not relying on bonding agents is better.

Vibration protect appropriately.

  • Be aware that while ferrous materials have a lower stress limit below which fatigue failure does not occur, non-ferrous metals have NO LOWER STRESS LIMIT below which fatigue failure will not ultimately occur. So eg an Aluminium bracket that is stressed to well below its tensile limit may still fail after say 35 years if it is repeatedly cycled to that lower stress level.

Voltage derate excessively in areas where appropriate.
DO NOT voltage derate where inappropriate.

  • eg the wet Al ecaps that you are NOT using should not be run vastly below voltage spec.

Be aware of ceramic cap attributes that may hurt you.
eg voltage ringing on voltage steps, microphonics, and major voltage spikes from apposite vibrations.

Be aware of corrosion mechanisms.

  • Some coatings provide electrochemical sacrificial protection of underlying metals.
    Some don't.
    Some are worse long term than no coating! eg zinc "galvanised" coatings protect underlying iron/steel by being more active electrochemically.

But eg Nickel (or the now far less often seen tin) do NOT provide electrochemical protection - rather just the opposite. These coatings provide mechanical barriers to corrosion products. If / once / when the coating is breached over a small area an electrochemical cell is formed that selectively targets the underlying layer and the small area exposed means the corrosion rate is higher than if the whole item was NOT plated (!).

In any case - DO NOT USE TIN COATINGS - see below

Do not use Tin coatings

  • Tin is nowadays renowned for growing whiskers on surfaces - sometimes at fast rates and sometimes with astounding lengths. In some cases whisker growth takes decades and is unimportant. In other cases failures can occur in very short periods (say under one year).
    At least one communication satellite is believed to have been lost due to tin whiskers.
  • I have some extremely old relays. Some of their metal surfaces are smooth to the touch. Other portions are extremely rough and the sprouting tin whiskers are clearly visible.

Be aware that EMI matters.

  • EMI (electromagnetic interference) at usual levels can be formally designed against. If you know with certainty that nobody is going to operate a 1 kW linear amplifier, unshielded Magnetron, high energy spark source, .... within a critical distance of your product for the next 40 or 50 years then you may decide to not protect against such. If you are not certain of this then protection may be in order.

Be aware of worst worst worst case mains and power supply issues.

  • A very long life device will usually have external energy supply. Typically mains AC, or a battery charged from some external source, maybe solar. Just maybe thermal, radioactive, ... .

  • If your mains input at eg 110 VAC or 230 VAC will NEVER have an 11 kV line dropped onto its feeder in the next 40 years then you may not wish to protect against such a possibility. I occasionally hear of telephones leaping off walls or houses bursting into flames when this happens. It's rare. It happens. There is a limit to what you can choose to protect against. You have to choose what the limit is.

  • Lightning happens. In two years I lost 2 multifunction printers to lightning strike nearby in a residential area not known for overly much lightning activity. After the second I decided that having a fax line connected to my printer was overrated. No telephones were damaged.

  • Mains energy spikes can be "very enthusiastic". There are standards to be met to protect against such. Murphy does not care about standards.

Use only utterly reliable suppliers and ensure provenance for all parts sources.

  • These overlap. In some cases you may be dealing directly with suppliers or middlemen.

  • Be sure you know the standing of the entity you are dealing with. In Asia a supplier purporting to be the manufacturer may in fact be reselling product from elsewhere.

  • Factory visits help, but, do not be fooled. (I have been). And ensure that products which come from a given source continue to come from that source.

  • Name brand products with a good reputation will often be counterfeited. Be sure that what you receive IS from the claimed manufacturer. [eg GP (Goldpeak) AA NiMH (and other) batteries are relatively unknown by that name in the West - but GP are one of the largest battery makers in China. So much so that pirate GP lookalikes abound.]

  • You do not HAVE to buy from a supplier who jealously defends their reputation (Digikey, Mouser, ....) or products from manufacturers of impeccable standing, but it certainly helps.

  • If you have to source a product and do not have time for adequate due diligence or source checking, if Panasonic make it, buy Panasonic. (That's sort of with a :-) - but I'm also serious. I have zero financial or business links with Panasonic, but I don't recall them ever doing other than superbly in any area they choose to touch).

Learn how Murphy works.

  • If something can go wrong it will.
    If you know that something can't go wrong Murphy will do his utmost to prove that your knowledge is false. Look at every possible multi-factor failure mode, and as many impossible ones as you can manage.

Impossible series of faults or conditions are not as impossible as we'd like

  • A large proportion of major disasters occur when 3 or 4 or 5 almost impossible events occur simultaneously. This happens often enough that 'you'd think that people may have noticed' - but people seem not to.

In a nutshell, quality, quality, quality.

The first thing you do is to use high reliability parts. NASA specifies 4 quality levels starting with commercial (the lowest grade), moving to '883B (a mil standard); then QML level Q, and finally QML level V. With each step up in level, the screening requirements become more stringent; the paper trail more onerous; and the cost ever increasing.

With increasing quality levels come lower predicted failure rates. This means that when you do your reliability prediction (or more accurately, your probability of mission success), your Ps increases with better quality parts.

Adequate derating also plays into this, particularly with new technology or new parts for which there is no history. We are sometimes told to use a 100 V MOSFET for a 20V application because of this.

Redundancy helps a lot. But with redundancy comes added complexity and more parts, which actually degrades the serial failure rate.
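As a rough illustration of how Ps, serial failure rate and redundancy interact, here is a minimal sketch. It assumes a constant failure rate (exponential) model and fully independent redundant strings; both are my simplifying assumptions, not something the parts standards dictate, and the function names and failure rates are made up for the example.

```python
import math

HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 h


def series_failure_rate(rates_per_hour):
    """Parts in series: any one failing kills the function, so rates add."""
    return sum(rates_per_hour)


def prob_success(rate_per_hour, years):
    """Ps for a constant failure rate over the mission (assumed model)."""
    return math.exp(-rate_per_hour * years * HOURS_PER_YEAR)


def redundant_pair(ps_single):
    """Ps when either of two independent, identical strings is enough."""
    return 1.0 - (1.0 - ps_single) ** 2


if __name__ == "__main__":
    # Hypothetical per-part failure rates in failures per hour.
    string_rates = [50e-9, 20e-9, 100e-9]
    lam = series_failure_rate(string_rates)
    ps_one = prob_success(lam, years=10)
    print(f"single string: Ps = {ps_one:.4f}")
    print(f"redundant pair: Ps = {redundant_pair(ps_one):.4f}")
    # Twice the parts means roughly twice the rate of *some* part failing,
    # even though the mission survives the first such failure.
    print(f"any-part failure rate, pair: {2 * lam:.2e} per hour")
```

Better quality parts show up here as smaller entries in `string_rates`; any cross-strapping or switch-over circuitry would be extra series terms that the sketch leaves out.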

With any hi-rel design, you need to do an analysis to identify and mitigate, to the extent possible, any single point failures (SPFs). An SPF is a failure that would degrade or cause the loss of the entire function, or mission. SPF analysis is particularly important when redundancy is employed because you do not want a single failure to cause both the primary and redundant set of hardware to not work.

Finally, on those Voyager missions, I'll bet they were designed for an 8 or 10 year mission life, not 40.

Edit 1:

While you cannot test your way into a highly reliable system, testing plays a big part in weeding out marginal parts. All of our assemblies go through some type of environmental stress screening, which includes functional testing over the expected temperature range, and temperature cycling, both powered and un-powered. Systems destined for space go through testing in a thermal vacuum (TVAC) chamber. There also may be vibration or shock testing, but these are usually done on a test article.

EDIT 2 8/6/2020 - Added blurb on temperature swings

Several who have responded to this question mentioned temperature and its effects on reliability. So I thought I would expound on this a bit more.

Semiconductors exhibit a failure rate that approximately doubles for every 10 deg C increase in temperature. There are papers out there arguing whether 2X is the right value; that maybe it should be 1.8, or 2.5, or some other amount. But for purposes of this discussion I’ll use 2X as it is a value that’s “accepted” by industry, the government, and the reliability disciplines.

With that out of the way, it makes sense that, from a reliability standpoint, you want to keep your electronics as cool as possible. 85 deg C operating temperature is better than 95 deg C, and 75 deg C is better than 85 deg C.
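To put numbers on that rule of thumb, a minimal sketch. The 2X-per-10-deg-C factor is the one quoted above and is left as a parameter, since other values are argued for; the function name is just for illustration.

```python
def failure_rate_multiplier(t_from_c, t_to_c, factor_per_10c=2.0):
    """Relative failure rate when moving from t_from_c to t_to_c (deg C).

    Uses the rule of thumb that the rate scales by factor_per_10c for
    every 10 deg C of temperature increase.
    """
    return factor_per_10c ** ((t_to_c - t_from_c) / 10.0)


if __name__ == "__main__":
    for t in (65, 75, 85, 95):
        m = failure_rate_multiplier(85, t)
        print(f"{t} deg C vs 85 deg C: failure rate x {m:g}")
```

Passing `factor_per_10c=1.8` or `2.5` shows how sensitive a prediction is to which value you believe.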

But in addition to the operating temperature, be it average or peak, there is also the temperature swing, or variation. Temperature swings are bad from a reliability standpoint in that it is temperature changes that stress interconnects, particularly those involving IC’s or even discrete semiconductors. These temperature changes induce a stress on the interconnect between the component and the board due to the differences in Coefficients of Thermal Expansion (CTEs) between the component and the board. For example a typical FR4 PCB has a CTE of ~15 ppm/deg C, while a BGA package might have a CTE closer to 6 ppm/deg C. These differences in CTE cause a stress to be exerted on the solder joints that attach the part to the board as the temperature changes. These stresses are proportional to the changes in temperature and the size of the package and over time, given enough temperature cycles, can lead to a fracture of the solder joint or attachment to the board.
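A quick sketch of the size of the effect, using the CTE figures above. The 60 deg C swing and the 10 mm distance from package centre to the outermost joint are assumed values for illustration, and the calculation gives only the free differential expansion the joint has to absorb, not the resulting stress.

```python
def cte_mismatch_um(cte_board_ppm, cte_part_ppm, delta_t_c, distance_mm):
    """Differential expansion, in micrometres, between board and part.

    cte_*_ppm   : CTE in ppm per deg C
    delta_t_c   : temperature change in deg C
    distance_mm : distance from the package neutral point to the joint
    """
    delta_ppm = abs(cte_board_ppm - cte_part_ppm)
    mismatch_mm = delta_ppm * 1e-6 * delta_t_c * distance_mm
    return mismatch_mm * 1000.0  # mm -> micrometres


if __name__ == "__main__":
    # FR4 ~15 ppm, BGA ~6 ppm (from the text); swing and size are assumed.
    shift = cte_mismatch_um(15.0, 6.0, delta_t_c=60.0, distance_mm=10.0)
    print(f"corner joint is sheared by about {shift:.1f} um per cycle")
```

The result scales linearly with both the temperature swing and the package size, which is the proportionality described above.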

Leaded parts, such as the old 14/16/20 pin flat packs are much more forgiving in this environment than are rigidly attached packages such as Ball Grid Arrays (BGAs) because the leads of the former provide a significant amount of compliance that reduces the stress on the solder joint.

The reason for bringing all this up is that what we usually care about is the reliability of the system as a whole, or more properly the Probability of Mission Success (Ps) of the system. Because of how temperature changes and the average operating temperature affect various aspects of the system’s reliability, it may turn out that it’s better to operate a system at a constant higher temperature (say 85 deg C) as opposed to letting the temperature swing from 10 deg C to 70 deg C on a regular basis.

I recall a business called Continental Testing Laboratories. They had the first computer I ever used. Punch cards existed for EACH resistor, capacitor, transistor, diode, that went thru the test/heat/test/heat/test/heat, where the parameters were examined for DRIFTING.

Components that DRIFTED differently from the other components were discarded.
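In modern terms that screening step might look something like the sketch below. The statistic (median and median absolute deviation of the fractional drift) and the threshold `k` are my own assumptions for illustration; I do not know what criteria the lab actually applied, and the part IDs and readings are invented.

```python
from statistics import median


def drift_outliers(readings, k=5.0):
    """Return IDs of parts whose drift differs from the rest of the lot.

    readings : dict mapping part ID -> list of measurements taken after
               each test/heat cycle (first entry is the initial value)
    k        : how many median-absolute-deviations count as "different"
    """
    # Fractional drift from first to last measurement for every part.
    drift = {pid: (vals[-1] - vals[0]) / vals[0]
             for pid, vals in readings.items()}
    centre = median(drift.values())
    mad = median(abs(d - centre) for d in drift.values())
    limit = k * mad
    return sorted(pid for pid, d in drift.items() if abs(d - centre) > limit)


if __name__ == "__main__":
    lot = {
        "R1": [1000.0, 1000.1, 1000.2],
        "R2": [1000.0, 1000.1, 1000.1],
        "R3": [1000.0, 1000.2, 1000.3],
        "R4": [1000.0, 1003.0, 1006.0],  # drifts far more than its peers
    }
    print("discard:", drift_outliers(lot))
```

The point is that parts are judged against the behaviour of their own lot rather than against an absolute limit.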

They also XRAYed the components to look for voids and for foreign particles.

With all this, a 1 cent resistor becomes 100 cents, and has a small tag attached, so the final circuit documentation describes the recorded parameters for each component.

Transistors are assumed to track if they are differential_pairs inside a 6-lead metal case (such as 2n2020, if I recall rightly). Thus neutron bombardment in orbit is assumed to degrade the Beta of each transistor equally, and the "matching" is maintained.

The V_base_emitter is assumed to not perfectly drift, thus a design margin for offset voltage becomes part of your worst-case-design analysis (slide rules were used).

Implementing anything better than about 8 bit or 10 bit ADCs is not possible using the allowed discretes. I think Burr-Brown or TRW may have produced metal_hybrid DAC networks that were shown adequately stable for decades.

Additionally, the team I worked in was allocated a THERMAL ENGINEER; he used finite element methods (on an IBM 1630) to model the heat flows.

Since the applications were space-borne, the allocated powers were SMALL, and a simple ground plane to a mounting stud (or 4 or 6 of them, to handle shock) was all that was needed to let heat flow out of the circuit/PCB/module to the spacecraft chassis and then radiate out into space.

To prevent freezing the spacecraft, I've heard shutters are used, these regulating how much of the spacecraft is actually exposed to the coldness of space.

================================ Aug 6, 2020

Specifically following up on answer of SteveSh and the paragraphs on heat

  • Having embedded PLANES for Ground(s) and for VDD(s) is excellent.

  • Though FR-4 is a poor heat conductor (it's glass and glue), adjacent layers of Planes will easily exchange heat, especially if the board has 4 layers or 6 layers in a total 1/16" thickness. Thus the Power planes become as useful as the Ground planes for heat removal.

  • You can use thicker copper foil, to drop foil's R_thermal by 2:1 or 4:1

  • Example: 1 watt (MCU?) of size 1 centimeter, in middle of 9_cm board with copper beryllium card_cage_slides on 2 of the 4 edges, to remove heat. That 1cm^2 footprint has EIGHT surrounding 1cm squares (in a 3 * 3 grid). If the only plane is Ground (example), then removing the heat will be at most 70 ° C per watt per square divided by 8 (the EIGHT heat exit paths), or about 9 ° C per watt.

  • But the heat is not yet at the edge of the PCB (where the card_edge_slides will remove heat to the spacecraft chassis).

  • This is a 9cm by 9cm PCB. Model it as large squares, each 3 * 3 cm; thus we now have a new grid, fully filling the PCB. The middle square is our heat source. Assuming the heat flows left and right to the card_slides, we can use SIX of the 8 large squares as heat removal (the 2 squares at center top and bottom do not contact the card_slides). With 6 squares of heat removal, the additional heat flow resistance is 70/6 = 11 ° C per watt.

  • Thus the R_thermal, from the 1cm MCU to the edges of the 9cm by 9cm PCB, is 9 + 11 ° C per watt, or 20. This assumes the MCU easily dumps heat into at least one PLANE.

  • Thicker foil will drop this. More planes will drop this.
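The squares arithmetic above can be written out as a small sketch. The 70 ° C per watt per square figure and the 8 and 6 parallel paths come from the example; treating foil thickness as a simple divisor and ignoring extra planes, vias and the FR-4 itself are simplifying assumptions, and the function names are only for illustration.

```python
R_SQUARE = 70.0  # deg C per watt per square of standard foil (from above)


def parallel_squares(n_paths, r_square=R_SQUARE):
    """Thermal resistance of n identical squares of foil acting in parallel."""
    return r_square / n_paths


def board_r_thermal(foil_multiple=1.0):
    """MCU footprint to card slides, in deg C per watt.

    foil_multiple : foil thickness relative to the standard foil
                    (2.0 halves the per-square resistance, and so on)
    """
    r_square = R_SQUARE / foil_multiple
    inner = parallel_squares(8, r_square)  # 1 cm ring around the MCU
    outer = parallel_squares(6, r_square)  # 3 cm squares reaching the slides
    return inner + outer


if __name__ == "__main__":
    for mult in (1.0, 2.0, 4.0):
        r = board_r_thermal(mult)
        print(f"foil x{mult:g}: {r:.1f} deg C/W, "
              f"so a 1 W part rises about {r * 1.0:.1f} deg C")
```

Without rounding the two terms are 8.75 and about 11.7, so the total comes out near 20.4 ° C per watt, in line with the figure of 20 above.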