Distributed Systems

https://github.com/sahil-time/cpp/blob/main/README.md

Failure

Failure is an intrinsic and unavoidable part of distributed systems. The goal, therefore, is to design a system where, even when some parts of it fail, the system as a whole can still make progress, much like wait-free programming. In other words, the goal is to make the system tolerant to failures.

This is in stark contrast to systems that are not designed to tolerate failures, e.g. a single computer or a high-performance computing system.

System Design w.r.t Failure

So basically there are 2 philosophies in designing systems w.r.t failure:

  • Cloud Computing Philosophy – Here we treat partial failures as something the system overall needs to tolerate. Unlike the HPC philosophy, failures are an unavoidable part of these massive systems: something is always failing somewhere, so the system must be designed to tolerate it.

  • High-Performance Computing [ HPC ] Philosophy – Here we treat partial failures as total failures, and we either roll back to an older checkpoint or reboot and start over, e.g. PCs, routers and other small systems. Failure is not expected here, so the system is not designed to tolerate it.

Basically a distributed system is composed of thousands of HPC-style systems that are not individually fault-tolerant, but the system as a whole needs to be. Small components of a system must work well almost all the time, with a negligible failure rate [ which is why each one does not need to be fault-tolerant ]. But when hundreds of thousands of these small components are pieced together, that negligible failure rate adds up to something big enough to be dealt with. Therefore, in big systems, things fail all the time [ Werner Vogels ]. The system as a whole needs to be designed to be fault-tolerant [ the job of the software architects ] whilst also making the individual components more robust and long-lasting [ the job of the OEMs ].
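To make the "negligible failure rate adds up" point concrete, here is a rough back-of-the-envelope sketch in C++. The 99.9% per-node daily reliability and the fleet sizes are illustrative assumptions, not real measurements, and failures are assumed independent for simplicity:

```cpp
#include <cmath>
#include <cstdio>

// If each node independently survives a given day with probability p, the
// chance that ALL n nodes survive is p^n. The per-node reliability and the
// fleet sizes below are assumptions for illustration only.
int main() {
    const double p = 0.999;                       // assumed per-node daily reliability
    const int fleets[] = {1, 100, 10'000, 100'000};

    for (int n : fleets) {
        double all_ok   = std::pow(p, n);         // probability no node fails today
        double any_fail = 1.0 - all_ok;           // probability at least one node fails
        std::printf("%7d nodes: P(at least one failure today) = %.4f\n", n, any_fail);
    }
    // With 100,000 nodes, the probability of "at least one failure today" is
    // essentially 1.0, i.e. partial failure is the normal state of a large system.
}
```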

Time

Physical Clocks

Physical clocks are used either to measure the current time or to measure time durations:

  • Measuring current time – Clocks used for this are called “time-of-day” clocks. They tell you the time of day and are used in error logs, scheduling tasks and all modern devices. They are generally synchronized between machines using protocols like NTP. They can jump forwards or backwards, which makes them a poor choice for measuring durations or intervals.

  • Measuring time durations – Clocks used for this are called “monotonic” clocks. These clocks only move forward, like the real arrow of time. They exist per machine but are not synchronized between machines, so they cannot be used to synchronize anything; they can, however, be used to measure intervals/durations on a single machine [ see the sketch below ].

READ: Someone used a “time-of-day” clock instead of a “monotonic” clock – https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
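As a concrete illustration, here is a small C++ sketch using std::chrono, where system_clock plays the role of the “time-of-day” clock and steady_clock the “monotonic” clock. The 50 ms of “work” is just a stand-in; the point is only which clock fits which job:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// system_clock is the "time-of-day" clock: good for timestamps and logging,
// but it can jump forwards or backwards when NTP adjusts it or a leap second
// is applied. steady_clock is the "monotonic" clock: it only moves forward,
// so it is the right tool for measuring durations on a single machine.
int main() {
    using namespace std::chrono;

    auto wall_start = system_clock::now();   // timestamp for a log line
    auto mono_start = steady_clock::now();   // start of an interval measurement

    std::this_thread::sleep_for(milliseconds(50));   // stand-in for real work

    auto elapsed = steady_clock::now() - mono_start; // safe: cannot go backwards
    std::cout << "elapsed: "
              << duration_cast<milliseconds>(elapsed).count() << " ms\n";

    // Anti-pattern: system_clock::now() - wall_start could even come out
    // negative if the wall clock was stepped backwards in between, which is
    // essentially what caused the Cloudflare leap-second incident linked above.
    std::cout << "wall timestamp: "
              << duration_cast<seconds>(wall_start.time_since_epoch()).count()
              << " s since epoch\n";
}
```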

Logical Clocks

As far as real-time programming is concerned, it uses “time-of-day” clocks extensively, while “monotonic” clocks are used to measure time intervals, e.g. how long an API call takes to execute. Distributed systems, however, discard both of these physical clocks and instead use the notion of logical clocks, which measure the order of events.

Logical clocks deal with causality. Given two events A and B, we say A -> B, i.e. A happens before B, if any of the following holds [ a minimal Lamport clock sketch follows the list ]:

  • A and B occur on the same process, with A occurring before B
  • A is a send event and B is the corresponding receive event
  • Transitivity: A -> C and C -> B, therefore A -> B
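Here is a minimal Lamport clock sketch in C++ that respects these rules; the struct and its method names are illustrative, not a real library API:

```cpp
#include <algorithm>
#include <cstdint>

// Minimal Lamport clock: a logical clock that respects the happens-before
// rules above (same-process order, send/receive, transitivity).
struct LamportClock {
    std::uint64_t time = 0;

    // Rule 1: a local event on this process advances the clock.
    std::uint64_t local_event() { return ++time; }

    // Rule 2 (send side): tick, then attach the timestamp to the message.
    std::uint64_t on_send() { return ++time; }

    // Rule 2 (receive side): jump past the sender's timestamp, then tick.
    // Rule 3 (transitivity) follows because timestamps only ever increase
    // along any chain of events.
    std::uint64_t on_receive(std::uint64_t msg_time) {
        time = std::max(time, msg_time) + 1;
        return time;
    }
};

int main() {
    LamportClock a, b;
    a.local_event();            // A: t = 1
    auto sent = a.on_send();    // A: t = 2, message carries 2
    b.local_event();            // B: t = 1
    b.on_receive(sent);         // B: t = max(1, 2) + 1 = 3, so send -> receive
}
```

Note that a Lamport timestamp preserves causality in one direction only: A -> B implies a smaller timestamp, but a smaller timestamp does not by itself imply happens-before.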

System Design w.r.t Time

There are 3 system models w.r.t time that are commonly used:

  • Synchronous Model – This model assumes bounded network delay, bounded process pauses and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means there is an upper bound on these errors and delays. It is not a realistic model of most practical systems, as this is a very restrictive assumption.

  • Partially Synchronous Model – The system behaves like a synchronous system most of the time, but sometimes exceeds its bounds. This is a realistic model for most real-world systems: for the most part the network and processes are very well behaved, but they occasionally violate those bounds [ see the timeout sketch after this list ].

  • Asynchronous Model – Here the algorithms make no timing assumptions, and we do not even have a clock. This is a very relaxed model, but it cannot really be used to model most real-world systems because we take a massive hit on performance.
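To see why the distinction matters in practice, here is a rough C++ sketch of a timeout-based liveness check. Picking any finite timeout implicitly assumes network delay is usually bounded, i.e. at least partial synchrony; in a purely asynchronous model no timeout can distinguish a crashed node from a slow one. The 500 ms bound and all names are illustrative assumptions:

```cpp
#include <chrono>
#include <iostream>

// Sketch of a timeout-based "is this peer alive?" check. It only makes sense
// under a (partially) synchronous model, because the timeout encodes an
// assumed upper bound on message delay. Under partial synchrony the answer
// is occasionally wrong (a slow peer gets suspected), so callers must treat
// it as a hint, not a fact.
using Clock = std::chrono::steady_clock;   // monotonic: correct for intervals

struct PeerState {
    Clock::time_point last_heartbeat = Clock::now();
};

bool suspected_down(const PeerState& peer,
                    std::chrono::milliseconds timeout = std::chrono::milliseconds(500)) {
    return Clock::now() - peer.last_heartbeat > timeout;
}

int main() {
    PeerState peer;                                                // heartbeat "just received"
    std::cout << std::boolalpha << suspected_down(peer) << "\n";   // prints: false
}
```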

Summary

When designing a system, making it distributed must be the last resort. It must be avoided until it simply cannot be avoided. It is similar to synchronization in multi-processor HPC systems –

The best synchronization is the one that never takes place, and the best distributed system is the one that isn’t

References

  • Designing Data-Intensive Applications – Martin Kleppmann
  • CSE138 – Lindsey Kuper