# When QUORUM Isn't Enough: How Distributed Clock Skew Silently Discards Data


TL;DR: I debugged intermittent silent failures in a ScyllaDB/Cassandra cluster where UPDATE queries were ignored despite QUORUM consistency. The root cause was clock skew on a single node caused by a failed NTP daemon. Since distributed databases use “Last-Write-Wins” (LWW) resolution, the drifting node assigned timestamps from the “past”, causing valid updates to be silently discarded.

Background

It started as a typical debugging session. I was investigating an issue where an update query would occasionally fail silently. No exceptions were raised, and the code proceeded as if the update was successful. Downstream features would then fail because the expected data update never happened.

To make things more interesting, the issue had a specific, reproducible pattern:

  • It happened only 10-20% of the time.
  • It occurred only in the staging environment.
  • Production and testing environments (running the exact same code) were fine.

Infrastructure

The architecture consists of several FastAPI services and Celery workers, backed by a 3-node ScyllaDB cluster. (Note: The logic below applies equally to Apache Cassandra).

The Stack

  • Database: ScyllaDB (3 Nodes).
  • Replication Factor: 3.
  • Consistency Level: QUORUM (2 nodes must acknowledge).
  • Driver: scylla-driver with GeventConnection.
    • Reason: The Celery workers are I/O heavy and run with gevent concurrency. The connection class must match this concurrency model.

Key Configuration

The system uses the CQLEngine ORM for API operations and Prepared Statements for high-performance worker tasks. One specific configuration choice became relevant later: client-side timestamps are disabled, so write timestamps are assigned server-side by whichever node coordinates the query.
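Here is a minimal sketch of the relevant driver setup. It is illustrative, not our exact production config: hostnames, the keyspace name, and the exact timestamp option vary by scylla-driver version.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.io.geventreactor import GeventConnection

# Assumes gevent monkey-patching is already applied (as in our Celery workers).
profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
cluster = Cluster(
    contact_points=["node-a", "node-b", "node-c"],  # hypothetical hostnames
    connection_class=GeventConnection,              # must match the gevent concurrency model
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Client-side timestamps stay disabled, so the coordinator's clock stamps every
# write. (Newer driver versions control this via Cluster(timestamp_generator=...).)
session.use_client_timestamp = False
```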

The Investigation

1. The Usual Suspects

Naturally, my first hunch was an environment issue. Since Production and Testing were fine, Staging must be broken, right?

I ran a full health check on the Staging cluster. I verified the node configurations, checked network latency, and compared driver versions. Everything matched Production perfectly. The infrastructure was healthy.

2. The Code Execution Flow

If the infrastructure is healthy, the bug must be in the code. I traced the specific flow that was failing:

  1. Write A: INSERT a row into table1.
  2. Read A: SELECT the row to verify it exists. (Success)
  3. Batch: Execute a LOGGED BATCH of inserts into table2.
  4. Write B: UPDATE the row from Step 1 in table1.

The Failure: The UPDATE in Step 4 acts like a ghost. It executes without error, but the data remains unchanged (stuck at the state from Write A).
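In driver terms, the failing flow looks roughly like this (a sketch with hypothetical table and column names, using the session from the setup above):

```python
import uuid

from cassandra.query import BatchStatement, BatchType

row_id = uuid.uuid4()

# Step 1 (Write A): create the row.
insert_stmt = session.prepare("INSERT INTO table1 (id, status) VALUES (?, ?)")
session.execute(insert_stmt, (row_id, "pending"))

# Step 2 (Read A): verify it exists. This succeeds.
select_stmt = session.prepare("SELECT id, status FROM table1 WHERE id = ?")
assert session.execute(select_stmt, (row_id,)).one() is not None

# Step 3 (Batch): a LOGGED BATCH of inserts into table2.
batch = BatchStatement(batch_type=BatchType.LOGGED)
insert2_stmt = session.prepare("INSERT INTO table2 (id, item) VALUES (?, ?)")
for item in ["item-1", "item-2"]:
    batch.add(insert2_stmt, (row_id, item))
session.execute(batch)

# Step 4 (Write B): the "ghost" update. No error is raised, yet the row never changes.
update_stmt = session.prepare("UPDATE table1 SET status = ? WHERE id = ?")
session.execute(update_stmt, ("processed", row_id))
```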

3. Debugging Silent Failures

Was the driver swallowing exceptions? I knew that session.execute_async() could hide errors if you don’t check the result:

```python
future = session.execute_async("SELECT * FROM table")
try:
    result = future.result()  # errors surface here
except Exception as e:
    print(f"Query failed: {e}")
```

However, I was using the synchronous session.execute(), which raises exceptions immediately. Just to be sure, I attached custom callback and errback handlers to the driver and enabled verbose logging.
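The instrumentation looked roughly like this (handler names are mine; update_stmt and row_id come from the flow sketch above):

```python
def on_success(result):
    print(f"Query succeeded: {result}")

def on_error(exc):
    print(f"Query failed: {exc}")

future = session.execute_async(update_stmt, ("processed", row_id))
future.add_callbacks(on_success, on_error)  # fire on completion or failure
future.result()  # the synchronous path; this would also re-raise any error
```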

Result: Nothing. Every query returned SUCCESS.

I even restarted the application to force all Prepared Statements to be re-prepared on every node, just in case of a stale state. The issue persisted.

4. Verified Data Integrity

Was the application sending old data? I modified the code to fetch the writetimes of the updated columns before and after the update.

  • Expected: writetime increases after Write B.
  • Actual: writetime remained identical to Write A.

This confirmed that the database did not apply the update. It wasn’t a silent application error; the database itself was discarding the write.
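The check itself is simple with CQL's WRITETIME() function (again reusing the hypothetical names from the flow sketch):

```python
wt_stmt = session.prepare("SELECT WRITETIME(status) AS wt FROM table1 WHERE id = ?")

before = session.execute(wt_stmt, (row_id,)).one().wt
session.execute(update_stmt, ("processed", row_id))  # Write B
after = session.execute(wt_stmt, (row_id,)).one().wt

print(before, after)  # expected: after > before; observed: after == before
```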

5. The Smoking Gun: Trace Patterns

I decided to enable tracing for every query to see exactly which nodes were involved.

```python
session.execute(query, trace=True)
```
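The driver exposes the trace on the result set, including which node acted as coordinator. A sketch:

```python
result = session.execute(update_stmt, ("processed", row_id), trace=True)
trace = result.get_query_trace()  # a cassandra.query.QueryTrace object

print("coordinator:", trace.coordinator)
for event in trace.events:
    print(event.source, event.description)
```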

After running the reproduction script multiple times, a distinct pattern emerged. The silent failure only happened when Node B was the Coordinator for the update.

  • Coordinator A -> Success.
  • Coordinator C -> Success.
  • Coordinator B -> Silent Failure.

Something was wrong with Node B. But what?

The Mechanics: Inside the Write Path

To understand why “Node B” matters, we need to look at the life of a write query in distributed databases like ScyllaDB and Cassandra.

1. The Setup (Coordinator)

When a write query arrives, the receiving node becomes the Coordinator.

  1. Replica Selection: It determines which nodes are replicas for that data using the Token Ring.
  2. The Timestamp Step: The Coordinator generates a mutation. Since client-side timestamps are disabled (the configuration choice from earlier), the Coordinator assigns the timestamp using its own local clock.

Why Timestamps? In leaderless distributed databases, there’s no single node that decides “this write is correct.” When the same data is updated concurrently on different nodes, the database needs a deterministic way to resolve conflicts. Timestamps provide that mechanism: the write with the latest timestamp wins (“Last Write Wins”).
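Conceptually, the merge rule for a single cell is just a max() over timestamps. A toy illustration, not the real implementation:

```python
def lww_merge(cell_a, cell_b):
    # Each cell is a (value, timestamp) pair; the highest timestamp wins.
    # On a timestamp tie, Cassandra/ScyllaDB compare the values themselves.
    (val_a, ts_a), (val_b, ts_b) = cell_a, cell_b
    if ts_a != ts_b:
        return cell_a if ts_a > ts_b else cell_b
    return cell_a if val_a >= val_b else cell_b

# Replicas can disagree; the later timestamp wins regardless of arrival order.
print(lww_merge(("pending", 1_000_000), ("processed", 1_150_000)))  # processed wins
```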

2. The Persistence (Replica Nodes)

The Coordinator sends the mutation to all replicas (including itself, if it is a replica). Each replica performs these steps:

  1. Mutation Validation: Checks schema and limits.
  2. CommitLog: Appends to the Write-Ahead Log (WAL) and fsyncs to disk for durability.
  3. Memtable: Writes to the in-memory data structure for fast access.
  4. Ack: Sends acknowledgment to the Coordinator.

Once QUORUM is met (2 nodes), the client gets a Success response.

What if a node is down? If a third node fails to acknowledge, the Coordinator stores a Hint to replay the mutation later. The client still receives a success response because QUORUM was met.

Note: Eventually, the in-memory Memtables are flushed to disk as immutable SSTables (Sorted String Tables). Background processes then compact (merge) these SSTables to discard old data (like overwritten updates) and improve read performance.

3. The Retrieval (Read Path)

When we read that data back:

  1. The Coordinator requests data from QUORUM replicas (2 nodes).

    Note: Whether the read requests go to 2 nodes (digest read) or 3 nodes depends on the read_repair setting.

  2. Each replica then:
    1. Reads from the Memtable.
    2. Reads from SSTables.
    3. Merge: Merges results from these sources using timestamps; the data with the latest timestamp wins.
    4. Returns the merged result to the Coordinator, along with its timestamp.
  3. The Coordinator Merge: The Coordinator compares results from the replicas.
    • If identical: Returns the result to the application.
    • If different: Merges them using timestamps (“Last Write Wins”) and returns the latest data.
    • It may optionally trigger an asynchronous Read Repair to update the stale replicas.

Connecting the Dots

From the Write/Read path, we see that the timestamp is the single source of truth for conflict resolution.

  1. Read A: Returns data from WRITE A.
  2. Read B: ALSO returns data from WRITE A (ignoring WRITE B).

According to the “Last Write Wins” rule, this has only one mathematical explanation:

Timestamp(Write_B) < Timestamp(Write_A)

Wait, how can the second write have an earlier timestamp?

  • WRITE A used Node A’s clock.
  • WRITE B used Node B’s clock.

If Node B’s clock is slow, it assigns a timestamp from the past.

The Theory: Distributed Time

To understand exactly how this happened, we need to step back and look at how time works in a distributed cluster.

Physical vs. Logical Time

In modern distributed systems, time is not perceived as a single, absolute “now.” Each node has its own independent hardware (crystal oscillator), causing clocks to drift apart.

  • Physical Time (Wall-clock): Tracks UTC. Unreliable for exact ordering across machines, because clocks drift (run at slightly different rates) and skew (accumulate an offset from one another).
  • Logical Time (Causal Ordering): Tracks the sequence of events rather than wall time (e.g., Lamport Clocks), ensuring that if Event A causes Event B, A gets a lower timestamp (see the toy sketch below).
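To make logical time concrete, here is a toy Lamport clock (a sketch, not production code):

```python
class LamportClock:
    # Counts events instead of reading a wall clock.
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance our counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # Message receive: jump past the sender's timestamp to preserve causality.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
msg_ts = a.tick()   # A sends a message at logical time 1
b.receive(msg_ts)   # B's clock becomes 2: the effect is ordered after the cause
```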

Synchronization Layers

To bridge the gap between physical drift and logical consistency, systems rely on synchronization protocols:

  1. NTP (Network Time Protocol): The default for most Linux systems. It synchronizes clocks by exchanging time-stamped messages with upstream time servers.
  2. PTP (Precision Time Protocol): A hardware-assisted protocol for sub-microsecond accuracy.
  3. HLC (Hybrid Logical Clock): A hybrid approach combining physical time with logical sequencing.

The Root Cause: Clock Skew

Microsecond Precision

ScyllaDB and Cassandra use microseconds since the Unix epoch for write timestamps. Even though we humans think in milliseconds or seconds, at this precision even sub-millisecond skew between node clocks can reorder writes, and under Last-Write-Wins a misordered write is a discarded write.
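In Python terms, a coordinator-style timestamp is just microseconds since the epoch, so a 200 ms skew shifts it by 200,000 ticks (a sketch):

```python
import time

def coordinator_timestamp(skew_seconds=0.0):
    # Microseconds since the Unix epoch, as read from a (possibly skewed) node clock.
    return int((time.time() + skew_seconds) * 1_000_000)

healthy = coordinator_timestamp()
slow = coordinator_timestamp(skew_seconds=-0.2)  # a node running 200 ms behind
print(healthy - slow)  # roughly 200_000 microseconds
```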

The Simulation

Let’s verify my “Node B is slow” theory with a simplified simulation. We’ll use a baseline of 1,000,000 microseconds.

Scenario 1: Healthy Cluster (Expected Flow)

  • All Nodes Clock: 1,000,000 (Synced)
| Step | Time (Real) | Event   | Coordinator | Assigned Timestamp | Result            |
|------|-------------|---------|-------------|--------------------|-------------------|
| 1    | t=0ms       | WRITE A | Node A      | 1,000,000          | Written           |
| 2    | t=50ms      | READ A  | Node A      | -                  | Returns A         |
| 3    | t=150ms     | WRITE B | Node B      | 1,150,000          | Written (Success) |
| 4    | t=200ms     | READ B  | Node A      | -                  | Returns B         |

Scenario 2: Node B is 200ms Slow (Actual Flow)

  • Node A Clock: 1,000,000 (Healthy)
  • Node B Clock: 800,000 (200ms behind)
| Step | Time (Real) | Event   | Coordinator | Assigned Timestamp | Result      |
|------|-------------|---------|-------------|--------------------|-------------|
| 1    | t=0ms       | WRITE A | Node A      | 1,000,000          | Written     |
| 2    | t=50ms      | READ A  | Node A      | -                  | Returns A   |
| 3    | t=150ms     | WRITE B | Node B      | 950,000*           | Written (?) |
| 4    | t=200ms     | READ B  | Node A      | -                  | Returns A   |

*Node B Calculation: 1,000,000 (base) + 150,000 (elapsed: 150 ms) - 200,000 (skew: 200 ms) = 950,000.

The Database Decision: When comparing 1,000,000 (Write A) vs 950,000 (Write B), the database correctly concludes that Write A is “newer”, even though Write B happened later in physical time. Write B is silently discarded.
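The whole failure fits in a few lines. A toy model using the numbers from the tables above:

```python
BASE = 1_000_000                        # cluster-wide "true" time at t=0, in microseconds
SKEW = {"A": 0, "B": -200_000, "C": 0}  # Node B is 200 ms behind

def assign_timestamp(coordinator, elapsed_us):
    # The coordinator stamps the mutation with its OWN clock reading.
    return BASE + elapsed_us + SKEW[coordinator]

def lww(*versions):
    # Last-Write-Wins: the version with the highest timestamp is "newest".
    return max(versions, key=lambda v: v[1])

write_a = ("state-A", assign_timestamp("A", 0))        # t=0ms   via Node A -> 1,000,000
write_b = ("state-B", assign_timestamp("B", 150_000))  # t=150ms via Node B ->   950,000

print(lww(write_a, write_b))  # ('state-A', 1000000): Write B is silently discarded
```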

The Smoking Gun: NTP Status

I decided to check the NTP synchronization status on Node B using chronyc tracking.

```
$ chronyc tracking
Reference ID    : 00000000 ()
Stratum         : 0
Ref time (UTC)  : Thu Jan 01 00:00:00 1970
System time     : 0.200453112 seconds slow of NTP time
Last offset     : 0.000000000 seconds
RMS offset      : 0.000000000 seconds
```

Discovery: The output confirmed the theory exactly. The System time was 0.200 seconds (200 ms) slow, and the zeroed Reference ID and 1970 reference time show that the daemon had no synchronization source at all.

Upon further investigation, I found that the chronyd daemon on Node B had been silently killed, leaving the clock to drift unchecked. Once I restarted chronyd, the nodes synchronized, and the silent failures vanished immediately.

Conclusion: When Last Write Wins, Time Matters

This investigation reinforced a critical lesson: In leaderless distributed databases, Time isn’t just metadata—it’s logic.

Key Takeaways

  1. Monitor Your Clocks: Don’t assume NTP is working. Monitor clock drift (e.g., via Prometheus node_exporter) and set alerts for skew > 50ms.

  2. Client-Side Timestamps: If your application logic requires strict causal ordering across multiple writes, consider using client-side timestamps with a monotonic generator. This shifts the source of truth to the client, removing the dependency on individual node clocks (see the sketch after this list).

  3. Lightweight Transactions (LWT): For absolute correctness where “Last Write Wins” is insufficient, use LWT (UPDATE ... IF ...). This uses Paxos to guarantee order regardless of clocks, though it comes with a performance cost.
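Here is a sketch of both mitigations with the Python driver (hypothetical names again; check the docs for your driver version):

```python
import uuid

from cassandra.cluster import Cluster
from cassandra.timestamps import MonotonicTimestampGenerator

# Takeaway 2: let the client stamp writes with a monotonic generator,
# removing the dependency on each coordinator's wall clock.
cluster = Cluster(
    contact_points=["node-a", "node-b", "node-c"],
    timestamp_generator=MonotonicTimestampGenerator(),
)
session = cluster.connect("my_keyspace")

# Takeaway 3: LWT. Paxos decides the outcome, so clocks no longer matter.
row_id = uuid.uuid4()
result = session.execute(
    "UPDATE table1 SET status = 'processed' WHERE id = %s IF status = 'pending'",
    (row_id,),
)
print(result.was_applied)  # False means the condition failed: no more silent losses
```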

In Distributed Systems, if you can’t trust your clocks, you can’t trust your data.

References & Further Reading

For those who want to dive deeper into distributed systems theory and the specific technologies mentioned:

  1. Distributed Systems Course (Martin Kleppmann)

    • YouTube Playlist – The entire series is essential viewing, but Lectures 3.1-4.1 (Time), 5.1-5.2 (Replication), and 7.1 (Consistency) are particularly relevant here.
  2. Designing Data-Intensive Applications (DDIA)

    • Martin Kleppmann – Chapter 8 (“The Trouble with Distributed Systems”), specifically the section on Unreliable Clocks.
  3. Amazon Dynamo Paper

  4. Logical Clocks

  5. Apache Cassandra Architecture
