
Distributed systems

This conceptual guide explains the fundamental principles that apply when working with distributed systems, such as eventual consistency and the CAP theorem.

Definition

A distributed system is a system composed of multiple software components, running on different networked computers, that communicate and collaborate by passing messages to each other.

A distributed system consists of the following components:

  • Node: an individual computer in a distributed system, together with the software components running on it
  • Network: the underlying data communication technology and protocols over which nodes communicate
  • Message: a piece of information that one node sends to one or more other nodes

For a more formal definition and introduction, check out this paper.
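To make these terms concrete, here is a minimal TypeScript sketch in which two nodes exchange a message over a simulated network. All type and class names are illustrative only and not part of any Actyx API.

```typescript
// Illustrative types; names are hypothetical and not taken from any real library.
type NodeId = string;

interface Message {
  from: NodeId;
  to: NodeId;
  payload: string;
}

// A "network" that simply forwards each message to the addressed node.
class Network {
  private nodes = new Map<NodeId, Node>();

  register(node: Node): void {
    this.nodes.set(node.id, node);
  }

  deliver(msg: Message): void {
    this.nodes.get(msg.to)?.receive(msg);
  }
}

class Node {
  constructor(readonly id: NodeId, private network: Network) {
    network.register(this);
  }

  send(to: NodeId, payload: string): void {
    this.network.deliver({ from: this.id, to, payload });
  }

  receive(msg: Message): void {
    console.log(`${this.id} received "${msg.payload}" from ${msg.from}`);
  }
}

const net = new Network();
const a = new Node("node-a", net);
const b = new Node("node-b", net);
a.send("node-b", "hello"); // node-b received "hello" from node-a
```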

Benefits

Distributed systems offer several benefits that make them very useful in certain scenarios.

Scalability

In a distributed system, computation on one node happens independently of computation on other nodes. This makes it easy to scale both performance and functionality: in either case, you scale the system by adding nodes that duplicate existing functionality or implement new functionality.
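As a rough illustration, the following hypothetical TypeScript sketch shows a cluster whose throughput grows simply by registering additional worker nodes; the class names are made up for this example.

```typescript
// Hypothetical sketch: scaling out by adding worker nodes.
type Task = { id: number };

class WorkerNode {
  constructor(readonly name: string) {}
  process(task: Task): void {
    console.log(`${this.name} processing task ${task.id}`);
  }
}

class Cluster {
  private workers: WorkerNode[] = [];
  private next = 0;

  // Scaling the system means simply adding more nodes.
  addNode(worker: WorkerNode): void {
    this.workers.push(worker);
  }

  // Round-robin dispatch: each added node increases the available capacity.
  dispatch(task: Task): void {
    const worker = this.workers[this.next % this.workers.length];
    this.next += 1;
    worker.process(task);
  }
}

const cluster = new Cluster();
cluster.addNode(new WorkerNode("node-1"));
cluster.addNode(new WorkerNode("node-2")); // scale out by adding a node
[1, 2, 3, 4].forEach((id) => cluster.dispatch({ id }));
```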

Reliability

Because computations are split over multiple nodes rather than running on a single computer, distributed systems can be significantly more reliable. When faults occur (and they do), only the functionality provided by the affected nodes becomes unavailable. Other nodes can continue working; there is no single point of failure that can take down the entire system.

Performance

Distributed systems allow you to split your workload into multiple tasks and have nodes work on them in parallel, which increases the overall performance of your system. Additionally, you benefit from improved communication performance, as individual groups of nodes communicate over their local sub-network without affecting communication in other parts of the network.

Challenges

Distributed systems pose a number of challenges to system architects and developers building software for them.

Coordination

Because computation is split amongst multiple nodes, additional effort needs to go into coordinating who does what and when. This is especially challenging because individual nodes may be disconnected or in a failure state at any time. A coordinator needs to be able to deal with all of these situations.

Latency volatility

With increased communication (some of it pure overhead) and very dynamic system behavior, communication between nodes may at times be very fast and at other times slow. This happens, for instance, when nodes coordinate complex tasks or synchronize with each other after downtime.

Monitoring

The behavior of a distributed system is significantly harder to reason about than that of a centralized, single-node system. Nodes may be up or down at any time, communication may be faster or slower, and so on. This makes monitoring a distributed system significantly harder than monitoring a centralized one.

Distributed vs. decentralized

A distributed system may still be coordinated by a central node. Consider, for example, how map-reduce jobs are usually launched by a central node: the work is computed in a distributed fashion, and the results are then reduced back at the central node.
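The following TypeScript sketch illustrates this shape of coordination with a toy word-count job; it is a simplified, hypothetical example rather than a real map-reduce framework.

```typescript
// Hypothetical sketch: a central coordinator fans work out to nodes and reduces the results.
type WordCount = Map<string, number>;

// "Map" step, executed independently on each node.
function mapOnNode(chunk: string[]): WordCount {
  const counts: WordCount = new Map();
  for (const word of chunk) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// "Reduce" step, executed back on the central node.
function reduceAtCoordinator(partials: WordCount[]): WordCount {
  const total: WordCount = new Map();
  for (const partial of partials) {
    for (const [word, count] of partial) {
      total.set(word, (total.get(word) ?? 0) + count);
    }
  }
  return total;
}

// The central node launches the job and collects the results.
const chunks = [["a", "b", "a"], ["b", "c"]]; // each chunk is processed by one node
const partials = chunks.map(mapOnNode);       // distributed computation
console.log(reduceAtCoordinator(partials));   // Map { a: 2, b: 2, c: 1 }
```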

Decentralized systems are a special form of distributed system in which there is no such central coordination authority. Actyx is a distributed and, moreover, a completely decentralized system, with no central authority that distributes and coordinates work.

Relevance to Actyx

By building apps for Actyx you are building a decentralized, and thus also a distributed, system. Edge devices provide computing power and access to the underlying network. Apps run on nodes and communicate by passing messages (see Event Streams).

With Actyx you get all the benefits of distributed systems: scalability, reliability, and performance. Actyx also tries to reduce the challenges associated with a distributed system to a minimum.

In addition, Actyx offers machine-runner, a framework running on top of Actyx that enables easy choreography of a distributed and heterogeneous swarm to execute a sequential workflow without complex and expensive coordination of state.

Actyx and the CAP theorem

As discussed above, systems that span more than one node (computer, tablet, etc.) connected over a network are called distributed systems. More formally, a system is called distributed if parts of it can fail while others continue to operate, i.e. it exhibits so-called partial failure. Such systems suffer from a host of problems unknown to systems that run on a single machine. A nice introduction to this topic can be found in the fallacies of distributed computing. The first fallacy is the claim that the network is reliable, and it is very false indeed: computer networks are not very reliable. Every device connected over a network can therefore experience temporary disconnects, even on an otherwise ideally functioning network. Network partition is the term we use for the state in which one or more nodes of a cohesive distributed system are temporarily disconnected.

A network partition results in a situation where a node's interactions with its surroundings (e.g. input from or output to the user) become disconnected from the interactions of the remaining part of the system. Not only does this node become oblivious to what is going on in the rest of the system, but the rest of the system is also oblivious to the state of this particular node. This is critical, because the user of the partitioned node can make progress unaware of the situation in the remaining part of the system.

We now have two options: either we allow the user to make progress during the partition, thus keeping the whole system available, or we keep the system consistent (i.e. we do not allow the operator to make progress while they do not have a full view of what is happening in the whole system).

Close up on consistency

What exactly is consistency? Let's look at an example. Imagine a banking system that keeps track of account balances, with a constraint that a customer is not allowed to have a negative balance (i.e. owe money to the bank). Let us call the customer Alice. Alice lives in London and her bank balance is 1000 dollars. Now the system gets partitioned and the New York node does not see the rest of the system. Alice withdraws 500 dollars from the account in London, then boards a fast plane and off she goes shopping in New York! When she arrives, the system is still partitioned and Alice's balance on the New York node is still 1000 dollars. Yay! She buys some stuff that costs her 900 dollars and pays with a card. The transaction goes through because, from the perspective of the New York node, this leaves her with 100 dollars still in the account. Then suddenly, network connectivity is restored and Alice ends up with an account balance of -400 dollars, which violates the no-negative-balance rule.
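A small TypeScript simulation of this scenario, using hypothetical class names, shows how two replicas diverge during the partition and violate the global constraint once it heals:

```typescript
// Hypothetical simulation of the Alice example: two replicas diverge during a partition.
class BankReplica {
  constructor(readonly site: string, private balance: number) {}

  // During a partition each replica only sees its own writes.
  withdraw(amount: number): boolean {
    if (this.balance - amount < 0) return false; // local "no negative balance" check
    this.balance -= amount;
    return true;
  }

  localBalance(): number {
    return this.balance;
  }
}

const london = new BankReplica("London", 1000);
const newYork = new BankReplica("New York", 1000); // partitioned: still sees the old balance

london.withdraw(500);  // succeeds, London now sees 500
newYork.withdraw(900); // also succeeds locally, because New York still saw 1000

console.log(london.localBalance(), newYork.localBalance()); // 500 and 100: each side looks fine

// When the partition heals, both withdrawals apply to the one shared account:
console.log(1000 - 500 - 900); // -400, the global constraint is violated
```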

In allowing the transaction to proceed on the partitioned New York node, we violated the consistency condition, which says that every read receives the most recent write or an error. The New York node had not seen the most recent write that happened in London, and thus the global constraint could not be upheld. Instead, the system chose availability, which says that every request receives a (non-error) response, without the guarantee that it contains the most recent write.

Consistency allows the builders of systems to maintain the illusion of a "system as a whole": all parts of the system have exactly the same information about its global state, which allows the system to uphold global constraints. Consistency is a natural descendant of centralized systems, where all processing is done by a single node. It also makes writing programs easier.

But is consistency essential? We would say the answer is "no". Some problems can be naturally re-stated without the requirement of consistency. Factory shop floor processes are naturally localized; they happen at workstations or islands that are closely knit together. Traditionally, banking has been done by means of branches (the SWIFT transfer code still expects a branch designator; most banks nowadays provide just one "global" branch, but all this is relatively recent). So in the example above, Alice would go to her London branch and ask them to make 500 dollars available to her at the New York branch. Her balance in London would then (temporarily) be reduced to 500 dollars and her balance in New York would be the other 500 dollars (the total balance would still read 1000 dollars). The system can then become partitioned, but Alice will not be able to overspend in New York. The same principle is still employed for ATM operations, as the machines can be partitioned fairly frequently: if a machine cannot contact the bank, it can still choose to offer a withdrawal of a small amount of cash. The same goes for offline card transactions.
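The following hypothetical TypeScript sketch captures that branch-based idea: funds are allocated per branch while the branches are connected, so no partition can ever make the global balance go negative.

```typescript
// Hypothetical sketch: funds are pre-allocated per branch, so a partition
// cannot violate the "no negative balance" rule.
class Branch {
  constructor(readonly city: string, private allocated: number) {}

  // Move part of the local allocation to another branch (only possible while connected).
  transferTo(other: Branch, amount: number): void {
    if (amount > this.allocated) throw new Error("insufficient local allocation");
    this.allocated -= amount;
    other.allocated += amount;
  }

  // During a partition each branch can only hand out what was allocated to it locally.
  withdraw(amount: number): boolean {
    if (amount > this.allocated) return false;
    this.allocated -= amount;
    return true;
  }
}

const londonBranch = new Branch("London", 1000);
const newYorkBranch = new Branch("New York", 0);

// Before travelling, Alice moves 500 dollars to the New York branch.
londonBranch.transferTo(newYorkBranch, 500);

// Now the network partitions. New York can hand out at most the 500 it holds.
console.log(newYorkBranch.withdraw(900)); // false: overspending is impossible
console.log(newYorkBranch.withdraw(400)); // true
```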

The CAP theorem

Now we are finally ready to discuss the so-called CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, who first stated it as a loose conjecture; it was proven in 2002 by Seth Gilbert and Nancy Lynch. The theorem states that it is impossible for a distributed application to simultaneously provide more than two out of the following three guarantees:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
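To see how the first two guarantees trade off during a partition, here is a small hypothetical TypeScript sketch contrasting the CP choice (refuse to answer) with the AP choice (answer with a possibly stale value); the names are invented for this example.

```typescript
// Hypothetical sketch: how a CP-style and an AP-style read behave during a partition.
type ReadResult = { value: number } | { error: "unavailable" };

class Replica {
  private lastKnownValue = 1000;
  private partitioned = false;

  setPartitioned(p: boolean): void {
    this.partitioned = p;
  }

  // CP choice: return an error rather than risk serving a stale value.
  readConsistent(): ReadResult {
    return this.partitioned ? { error: "unavailable" } : { value: this.lastKnownValue };
  }

  // AP choice: always answer, possibly with a value that misses recent writes.
  readAvailable(): ReadResult {
    return { value: this.lastKnownValue };
  }
}

const replica = new Replica();
replica.setPartitioned(true);
console.log(replica.readConsistent()); // { error: "unavailable" }  -> consistent, not available
console.log(replica.readAvailable());  // { value: 1000 }           -> available, maybe stale
```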

No distributed system can eschew partition tolerance (only centralized systems can, and this is why they can be CA: consistent and available at the same time). For a distributed system, the choice is between CP (consistent and partition tolerant) and AP (available and partition tolerant).

The CAP theorem is also very intuitive. Given the very definition of consistency, a system that wants to stay consistent cannot make any progress (i.e. be available to its users) during a partition, because there might be writes on some other node that the given node does not know about. Conversely, if it makes progress during a partition (i.e. is available), then it will not be consistent, because the read that was used to make progress might not be based on the most recent write.

The world is not black and white

The CAP theorem speaks only about systems that are consistent in a global manner: for a system to be considered consistent, every read needs to observe all writes, even the ones it does not depend on. Imagine a system that processes users' accounts, where the users are partitioned across nodes by the first letter of their surname. This system could still give consistent outcomes (i.e. the same outcomes as a consistent system would), despite reads not observing writes globally, because to work correctly a read for a particular person's account only needs to observe the writes for that same account, which occur on the respective node. Thus no network partition will render the system inconsistent, though the system will also not be fully available for certain operators (the ones on the other side of the partition).
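A minimal TypeScript sketch of that surname-based partitioning, with made-up node names, shows why a partition between the two nodes never produces inconsistent answers for an account:

```typescript
// Hypothetical sketch: accounts are partitioned across nodes by surname initial,
// so a read only needs the writes that happened on "its" node.
const nodeFor = (surname: string): string =>
  surname[0].toUpperCase() <= "M" ? "node-A-M" : "node-N-Z";

class AccountNode {
  private balances = new Map<string, number>();

  write(surname: string, balance: number): void {
    this.balances.set(surname, balance);
  }

  read(surname: string): number | undefined {
    return this.balances.get(surname);
  }
}

const nodes = new Map<string, AccountNode>([
  ["node-A-M", new AccountNode()],
  ["node-N-Z", new AccountNode()],
]);

// All reads and writes for "Baker" go to node-A-M. A partition between the two
// nodes never makes answers about "Baker" inconsistent; it only makes the
// accounts held on the other node unavailable from this side.
nodes.get(nodeFor("Baker"))!.write("Baker", 1000);
console.log(nodes.get(nodeFor("Baker"))!.read("Baker")); // 1000
```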

We can take this further, and research is ongoing. One example is to allow reads to declare which writes they causally depend on; if a read can observe all the writes it needs, it can still make progress, even though the system is not consistent in the strong sense used by the CAP theorem. One example of this line of work is the CALM approach.
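As a rough illustration of that idea, the following hypothetical TypeScript sketch lets a read name the write IDs it depends on and answers only once all of them have been observed locally; it is a simplification, not an implementation of CALM.

```typescript
// Hypothetical sketch: a read declares the write IDs it causally depends on and
// proceeds only if all of them have already been observed locally.
class CausalStore {
  private observedWrites = new Set<string>();
  private values = new Map<string, number>();

  applyWrite(writeId: string, key: string, value: number): void {
    this.values.set(key, value);
    this.observedWrites.add(writeId);
  }

  // The read succeeds as soon as its declared dependencies are satisfied,
  // regardless of unrelated writes elsewhere in the system.
  read(key: string, dependsOn: string[]): number | undefined {
    const satisfied = dependsOn.every((id) => this.observedWrites.has(id));
    return satisfied ? this.values.get(key) : undefined;
  }
}

const store = new CausalStore();
store.applyWrite("w1", "alice", 1000);
console.log(store.read("alice", ["w1"]));       // 1000, dependency satisfied
console.log(store.read("alice", ["w1", "w2"])); // undefined, w2 not yet seen locally
```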

Learn more

Here are several resources that you can check out to dive deeper into distributed systems and the CAP theorem: