What is the difference between fault tolerance and fault resilience?

Fault Tolerance means the ability of an architecture to survive (tolerate) a misbehaving environment by taking corrective action, e.g., surviving a server crash or preventing a misbehaving API from bringing down the whole system. Fault Resilience is probably the capacity to recover from these types of scenarios quickly.

After further reading of Netflix blogs and wikis, it seemed the terms Fault Resilience and Fault Tolerance were used interchangeably.


Fault tolerance: The user does not see any impact except for some delay during which failover occurs.
Fault resilience: Failure is observed in some services, but the rest of the system continues to function normally.
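
To make the distinction concrete, here is a minimal Python sketch, not a production pattern. All names (`ReplicaDown`, `call_with_failover`, `render_page`, `broken`, `healthy`) are made up for illustration. The first function fails over between replicas, so the caller sees only a delay. The second lets one service fail visibly while the rest of the page still renders.

```python
class ReplicaDown(Exception):
    """Raised by a hypothetical replica or service that cannot serve a request."""


def call_with_failover(replicas, request):
    """Fault tolerance: try each replica in turn; the caller only sees the
    extra delay of the failed attempts, never the fault itself."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as err:
            last_error = err  # fail over to the next replica
    raise RuntimeError("all replicas failed") from last_error


def render_page(services, request):
    """Fault resilience: a failing service is reported as unavailable,
    while the other services still return their results."""
    page = {}
    for name, service in services.items():
        try:
            page[name] = service(request)
        except ReplicaDown:
            page[name] = "unavailable"  # the fault is visible, but contained
    return page


def broken(request):
    # Stand-in for a crashed server (assumption for the demo).
    raise ReplicaDown("primary is down")


def healthy(request):
    # Stand-in for a working server (assumption for the demo).
    return f"ok: {request}"


print(call_with_failover([broken, healthy], "profile"))
print(render_page({"profile": healthy, "recommendations": broken}, "u1"))
```

In the first call the user gets a normal answer from the second replica. In the second call the "recommendations" entry shows the failure, but "profile" is unaffected.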


  • Fault Tolerance: no user of the service observes any fault (observing delays is normal).

  • Fault Resilience: a fault may be observed, but only in uncommitted data (for example, the database may respond with an error to an attempt to commit a transaction).
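
The "only in uncommitted data" case can be sketched as follows. This is a toy in-memory store, assumed purely for illustration (`TinyStore`, `CommitFailed`, and the `faulty` flag are hypothetical and not any real database API). A fault during commit is reported to the client, while previously committed data stays intact.

```python
class CommitFailed(Exception):
    """Error surfaced to the client when a commit cannot complete."""


class TinyStore:
    """Hypothetical in-memory store with staged (uncommitted) writes."""

    def __init__(self):
        self.committed = {}
        self.staged = {}
        self.faulty = False  # flip to simulate a fault during commit

    def write(self, key, value):
        # Writes are only staged until commit() succeeds.
        self.staged[key] = value

    def commit(self):
        if self.faulty:
            self.staged.clear()  # uncommitted work is lost
            raise CommitFailed("commit failed, please retry")
        self.committed.update(self.staged)
        self.staged.clear()


store = TinyStore()
store.write("balance", 100)
store.commit()  # succeeds: balance=100 is now committed

store.faulty = True  # simulate a fault
store.write("balance", 50)
try:
    store.commit()
except CommitFailed as err:
    print("client sees:", err)

print(store.committed)
```

The client observes the fault as an error on the second commit, but the committed value of `balance` is still 100. A fault-tolerant system, by contrast, would hide the failure entirely (for example, by retrying on a replica).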

[Reference]