Designing for States that can Cycle

A state that can cycle is modelled by a finite state machine that is not monotonic: It can go back and forth repeatedly between different states. A simple example is a worker doing an activity: The worker may start and stop working (taking breaks) repeatedly, switching between states idle and working.

However, when working with Actyx Pond, this conceptionally cyclical state machine must be translated into a collection of monotonic ones: In the example, every workStarted event that puts the state into working should kick off a discrete working-state with a unique ID, and workStopped events should explicitly designate which working-states they terminate, by referring to the IDs.

A simple example of how this looks in practice:

type WorkStarted = {
eventType: 'workStarted'
workerId: string
activityId: string
type WorkStopped = {
eventType: 'workStopped'
workerId: string
activityId: string
endedWorkIds: string[]
type WorkStatusChanged = WorkStarted | WorkStopped
const WorkerTag = Tag('worker')
const ActivityTag = Tag('activity')
const WorkStatusChangedTag = Tag<WorkStatusChanged>('workStatusChanged')
type State = {
openWorkIds: string[]
const onEvent: Reduce<State, WorkStatusChanged> = (state, event, metadata) => {
if (event.type === 'workStarted') {
// Use the event id as unique id for the work
} else {
// Remove all work ids that were ended by this event
state.openWorkIds = state.openWorkIds.filter(x => !event.endedWorkIds.includes(x))
const endAllOpenWork = (pond, workerId, activityId) =>
mkWorkerActivityFish(workerId, activityId),
// End all locally known work assignments
(state, enqueue) => enqueue(
{ eventType: 'workStopped', endedWorkIds: state.openWorkIds }

So why this complication? Actyx uses a special algorithm to sort events: Lamport clock. The advantage of a Lamport clock is that it is guaranteed to capture causality even when device clocks are wrong: E.g. a given workStopped event will always come after the workStarted events it refers to, even if the device clock on the node is accidentally years in the past.

In this example, the clock on node B is wrong, but the events are still sorted correctly, thanks to the Lamport clock:

nodeeventlamportwall clock

The disadvantage of Lamport clocks is that between nodes that cannot communicate (network partition) it will order causally unrelated events arbitrarily. Consider this example, which is totally possible even with correct device clocks:

nodeeventlamportwall clock
(network partition starts)108:00
(network partition heals)2216:30

When the partition started, at 08:00 in the morning, both nodes agreed on lamport=1. However, the Lamport clock then progressed faster for node B. So the events that it logged at 10:00 and 11:00 in the morning were sorted after node A’s events from later in the same day.

This is the reason that workStopped events should explicitly declare the workStarted events they refer to. Consider this example:

nodeeventlamportwall clock
(network partition starts)108:00
(network partition heals)2216:30

Without the explicit endedWorkIds=[BAR], the partition healing at 16:30 would mistakenly set the state to idle, even though there actually is ongoing work that started at 16:00.

In case a network partition causes multiple open work IDs, they can still be closed within one workStopped, if all knowledge is available:

nodeeventlamportwall clock
(network partition starts)108:00
(network partition heals)2016:30

If you are familiar with Conflict-free Replicated Data Types, you will know a very similar concept: The Add-wins set.