Blog - andy tran

What Object Storage Native Looks Like

With regard to performance, commitment, effort, dedication, there is no middle ground. Or you do something very well or not at all.
- Ayrton Senna

If you inspect modern databases, you will see that they typically store their data in three places:

DRAM
Local NVME SSD
Object Storage

When we say object storage native databases, what we really mean is that their persistence layer is object storage and that they navigate this storage hierarchy with elegance. The fun of this strategy is that it places data according to how frequently it is used. Keep hot data in memory, warm data on disk, and cold data in object storage; this is the game.

Many databases elect to optimize within one particular niche of the compute storage hierarchy. Instead, we argue for hot-warm-cold harmony. The requirements coalesce in a way that appoints object storage as the sole dependency for durability. In pursuit of achieving this objective, we will identify the key conditions that make this architecture possible along the way.

- Compute Storage Flexible
- Object Storage
- Manifest
- Cold to Hot
- Consistency
  - Strong Consistency
  - Eventual Consistency
- Conclusion

Compute Storage Flexible

On the path to object storage native, we will come across a new way to describe the architecture: compute-storage flexible. The convenience of bootstrapping onto object storage like S3 is that it already offers around eleven 9s of durability. Durability used to require a lot of consensus, but in recent years we arrived at an inflection point that allows object storage to be the sole dependency for durability. As a result, we can think of local NVMe SSDs and DRAM as a general cache layer.

Flexibility shows up in different ways. Compute nodes serving the cache layer can simply turn off: as long as a write has been acknowledged by object storage, all working data in memory and on disk can die. One could also scale the compute nodes. This decoupling of compute from storage is what makes the architecture flexible.

Object Storage

The object storage layer is one end of the continuum. Its defining characteristic is that it provides cheap and durable storage, but the round trip latency is slow (~100ms). S3 Standard, for example, costs roughly $0.02/GB/month. This is the economic appeal and tradeoff of object storage.

When a write (PUT) is acknowledged on object storage, it becomes a durable object. Strategies typically vary on how to organize data after the fact. Think of the background maintenance as compaction (to appeal to space) or indexing (to appeal to structure).

A simple way to handle this territory is to push that work off until "later" or wait until certain thresholds are reached. This is usually not worth it in the long run and often calls for passive ways to compact and index. This is unsolved territory, and the variance usually depends on the workload that the database aims to support.

GET and PUT requests to object storage are special because they can drive a lot of concurrency, meaning many requests can be in flight at the same time.

Write Ahead Log (WAL)

A traditional WAL on disk supports frequent, small, sequential writes. You can think of each append as costing ~1 op, while an fsync costs k ops. An fsync ensures that an append on disk becomes durable. You can also amortize one fsync across a batch of appends. Let's take a closer look at an example of a WAL on disk visually scaled to 100ms -> 1s.

[Interactive demo: WAL on disk. Counters: appends (~2 µs each) and fsyncs (~100 ms each). Target: log (disk, unsorted appends).]

The fsync operation itself costs ~1ms, but it's common to see folks configure the fsync interval to be as long as 1s so that it doesn't contend for IOPS.

It's useful to keep fast & simple writes, but it must be justified with responsibility. On disk, the bulk of the ops are used to append to the log with sequential write speed, then promise durability with fsync.
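
To make the append-then-fsync pattern concrete, here is a minimal sketch of a batch of appends covered by a single fsync. The function name `append_batch`, the record format, and the temporary file path are assumptions made for illustration, not part of any particular database.

```python
import os
import tempfile


def append_batch(path, records):
    """Append every record, then make the whole batch durable with one fsync."""
    with open(path, "ab") as log:
        for record in records:
            log.write(record + b"\n")   # cheap sequential append
        log.flush()                      # push Python's buffer to the OS
        os.fsync(log.fileno())           # one fsync covers all appends above
    return len(records)


# Hypothetical log location, chosen only so the example is self-contained.
wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
acked = append_batch(wal_path, [b"id=0", b"id=1", b"id=2"])

with open(wal_path, "rb") as f:
    lines = f.read().splitlines()
```

Nothing in the batch should be acknowledged to a client before `os.fsync` returns, which is why the count is returned only at the end.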

Obtaining durability on object storage looks similar, but at the object storage end of the storage continuum, the writes are of a different nature:

what is the dollar cost of the operation?
what is the latency cost of the operation?

The standard rate on S3 for a write (PUT) is $0.005 per 1000 PUTs (or $0.000005/PUT). We will make an attempt to show the same WAL-on-Disk strategy applied to this object storage pricing model.

Let's place the WAL in object storage and naively mark 1 append = 1 PUT request.

[Interactive demo: one append = one PUT = ~100 ms, written to object storage (durable). Counters: PUTs and cost.]

The first issue you might notice is that object storage is latency bound. We could not possibly match the total volume of appends we were making on disk. A PUT request is ~100ms in latency.

Appending naively like this wastes a huge opportunity on object storage.

Instead, we can pool writes on an in-memory op buffer and let a single batched PUT request write the whole buffer as one WAL entry in object storage.

[Interactive demo: ops accumulate in memory (~0.5 ns each), and one PUT (~100 ms) writes them to the WAL in object storage. Counters: PUTs, durable ops, and cost.]

This is the same maneuver as the fsync on a WAL on disk. Let the writes accumulate, then commit them in one batch.
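
The batching idea can be sketched in a few lines. This is an illustration under stated assumptions: `FakeObjectStore` is an in-memory stand-in for object storage, `WalBuffer` and the `wal/` key layout are hypothetical names, and the flush is triggered by an op count so the example stays deterministic (a real system would likely also flush on a timer).

```python
class FakeObjectStore:
    """In-memory stand-in for object storage; counts PUT requests."""

    def __init__(self):
        self.objects = {}
        self.put_count = 0

    def put(self, key, body):
        self.objects[key] = body
        self.put_count += 1


class WalBuffer:
    """Accumulates ops in memory and flushes them as one WAL object."""

    def __init__(self, store, max_ops):
        self.store = store
        self.max_ops = max_ops
        self.pending = []
        self.next_seq = 0

    def append(self, op):
        self.pending.append(op)
        if len(self.pending) >= self.max_ops:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        key = f"wal/{self.next_seq:08d}"
        self.store.put(key, list(self.pending))  # one PUT for the whole batch
        self.pending.clear()                     # these ops are durable only now
        self.next_seq += 1


store = FakeObjectStore()
wal = WalBuffer(store, max_ops=1000)
for i in range(2500):
    wal.append(("set", i))
wal.flush()  # flush the remaining 500 ops
```

Here 2,500 appends become 3 PUT requests. Ops still sitting in `pending` are not durable, so a write should only be acknowledged after the PUT that carries it succeeds.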

[Interactive demo: ops accumulate in memory (~0.5 ns each), and one PUT every ~1 s writes them to the WAL in object storage. Counters: PUTs, durable ops, and cost.]

There's a strong case to amortize the latency and the cost across all of the buffered writes. This is a key contribution towards the economic responsibility of object storage native.
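
A back-of-the-envelope calculation shows how much batching changes the bill. It uses the PUT price and ~100 ms latency quoted above; the workload of one million appends and the batch size of 10,000 are assumptions picked for illustration, and the time figure assumes PUTs are issued strictly one after another.

```python
PUT_PRICE = 0.005 / 1000   # dollars per PUT, the rate quoted above
PUT_LATENCY_S = 0.1        # ~100 ms per PUT, as above


def wal_cost(total_appends, appends_per_put):
    """Return (PUT count, dollars, serial seconds) to persist total_appends."""
    puts = -(-total_appends // appends_per_put)  # ceiling division
    return puts, puts * PUT_PRICE, puts * PUT_LATENCY_S


# Hypothetical workload: one million appends.
naive_puts, naive_cost, naive_seconds = wal_cost(1_000_000, 1)
batched_puts, batched_cost, batched_seconds = wal_cost(1_000_000, 10_000)
```

Under these assumptions the naive layout needs 1,000,000 PUTs (about $5 and about 100,000 seconds of serial round trips), while the batched layout needs 100 PUTs (about $0.0005 and about 10 seconds).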

One of the challenges with a WAL on disk is that it tangles the read and write paths. Writes are kept dumb in the interest of sequential appends to the log, but reads, and the mechanisms to pamper the reads (like compaction & sorting), end up contending for the same IOPS on a machine.

By placing the WAL on object storage, the architecture becomes extremely tolerant of what goes on in the memory and disk layers. As long as an object gets acked by the WAL, compute nodes can die. This is a useful property. The ways that object storage databases exploit it vary, but the deployment typically looks like one of the following two shapes, both of which are simple in their own right.

One could dedicate writer nodes and reader nodes.

       
 ┏━━━━━━━━┓   ┏━ writer node ━━━━┓  ┏━━━━━━━━━┓
 ┃        ┃   ┃memory ███▓       ┃  ┃         ┃
 ┃        ┣┅┅┅╋▶                 ┣━━┫         ┃
 ┃        ┃   ┃NVME   ████████▓  ┃  ┃         ┃
 ┃        ┃   ┗━━━━━━━━━━━━━━━━━━┛  ┃ object  ┃
 ┃ router ┃                         ┃ storage ┃
 ┃        ┃   ┏━ reader node ━━━━┓  ┃         ┃
 ┃        ┃   ┃memory ███▓       ┃  ┃         ┃
━━▶       ┣┅┅┅╋▶                 ┣━━┫         ┃
 ┃        ┃   ┃NVME   ████████▓  ┃  ┃         ┃
 ┃        ┃   ┗━━━━━━━━━━━━━━━━━━┛  ┃         ┃
 ┃        ┃                         ┃         ┃
 ┃        ┃   ┏━ ....... ━━━━━━━━┓  ┃         ┃
 ┃        ┃   ┃                  ┃  ┃         ┃
 ┃        ┣┅┅┅┫ ....             ┣━━┫         ┃
 ┃        ┃   ┃                  ┃  ┃         ┃
 ┗━━━━━━━━┛   ┗━━━━━━━━━━━━━━━━━━┛  ┗━━━━━━━━━┛

With asymmetric read and write loads, this can be an attractive scheme; simply add more reader or writer nodes to meet demand.

Another method is to deploy generic nodes that serve both read and write work.

  
         ╔════════════════════╗ ┏━ object storage ━━━━━┓
━━━━━━━┓ ║ ┏━ R/W node  ━━━━┓ ║ ┃                      ┃ 
       ┃ ║ ┃memory ███▓     ┃ ║ ┃/{org_id}/{namespace}/┃
router ├─┼▶┃                ┃ ╠─┼▶                     ┃
       ┃ ║ ┃NVME   ████████▓┃ ║ ┃/{org_id}/{namespace}/┃
▶      ┃ ║ ┗━━━━━━━━━━━━━━━━┛ ║ ┃                      ┃
       ┃ ║                    ║ ┃/{org_id}/{namespace}/┃
       ┃ ║ ┏━ .......  ━━━━━┓ ║ ┃                      ┃
       ├─┼▶┃                ┃ ║ ┗━━━━━━━━━━━━━━━━━━━━━━┛
       ┃ ║ ┃ ...            ┃ ║ 
       ┃ ║ ┃                ┃ ║           
━━━━━━━┛ ║ ┗━━━━━━━━━━━━━━━━┛ ║ 
         ╚════════════════════╝

One could deploy a literal GCP or EC2 instance per namespace. It's also common to see a single R/W node handling multiple namespaces.

There are strong cases to make about any of these choices.

Notice how this substantiates the strategy of compute storage flexibility. Compute nodes died? Durability is covered by object storage. Want to scale compute? Add more VMs in the direction demanded from the system.

Manifest

So far we have discussed the write ideology that shuffles up from memory to object storage. Reads follow a similar trajectory, involving a search-freshest-data-first philosophy. When R/W take these trajectories, they are called read-through and write-through.

For example, in a traditional LSM, a typical read follows

memtable╮                ╭▶L1╮    ╭▶LN  
        ╰▶frozen memtable╯   ╰▶...╯
            (pre-flush)    └─────────┘
                            SSTables

├───────────── read path ─────────────▶

Object storage native employs the same principle, but it looks different. We will start with a naive picture of the architecture, then narratively bolt on key mechanisms, ultimately to arrive at a coherent picture of both the read and write paths.

Below is a writer node appending ID 0, ID 1, ID 2, ..., ID N to an "in-memory buffer" with the intent of accumulating appends that get batched into a single PUT request and land as a durable object in the WAL in object storage. Next to it is a reader node repeatedly reading ID 0.

[Interactive demo]

Here the object storage starts empty, then a stream of PUT requests batches the in-memory op buffer into the WAL. Now that ID 0 is in the WAL, it's ready to be read. Think of LSM files (L1, L2, .., LN) as a compaction layer. They aim to organize WAL files and clean up scenarios where multiple versions of an ID exist, sorting along the way. Any data structure could live here, but it is typically a self-organizing structure like an LSM.

As the system appends IDs [0, N) at ~0.5 ns per append, we want a search-freshest-data-first strategy, just like a traditional LSM on disk.

The appeal of this strategy is that it handles the scenario where ID 0 gets overwritten:

           ID 0 (v2)
memtable╮           ╲    ╭▶L1╮    ╭▶LN  
   ╲    ╰▶frozen memtable╯   ╰▶...╯
    ID 0 (v3)                    ╲
                                  ID 0 (v1)

├───────────── read path ─────────────▶

As soon as the read finds an ID 0, it stops; it has found the freshest copy of ID 0. In the object storage demo, however, ID 0 never even got updated, yet each repeated read does a lot of work. This mechanism is particularly helpful on disk, but since the LSM is on object storage, each check is its own GET request, and those requests dominate the added latency.

What is missing from this story is that a proper LSM has a MANIFEST file. The manifest is a small file that tracks which file covers which range of keys.

[Interactive demo]

With a traditional LSM on disk, this manifest file is often overlooked because reads happen much faster in memory/disk. On object storage, the unit of work is a ~100ms roundtrip, so the manifest becomes a very important detail and is key to this object storage native architecture.

Without the manifest, the read path produces unpredictable behavior:

mem╮     ╭▶WAL╮    ╭▶... = ???
   ╰▶nvme╯    ╰▶...╯

By referring to the manifest, you can establish predictability:


mem╮     ╭▶manifest╮    = 2RT
   ╰▶nvme╯         ╰▶[]

where RT1 is to fetch the manifest and RT2 to retrieve ID 0.
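
The two round trips can be sketched as follows. Everything here is hypothetical: `FakeObjectStore` is an in-memory stand-in that counts GETs, and the manifest layout (a sorted list of non-overlapping key ranges, each pointing at one file) is an assumption chosen for illustration rather than any real database's format.

```python
import bisect


class FakeObjectStore:
    """In-memory stand-in for object storage; counts GET requests."""

    def __init__(self, objects):
        self.objects = objects
        self.get_count = 0

    def get(self, key):
        self.get_count += 1          # each GET models one ~100 ms round trip
        return self.objects[key]


# Assumed manifest layout: sorted (first_key, last_key, file_name) entries.
store = FakeObjectStore({
    "manifest": [(0, 99, "L2/a.sst"), (100, 199, "L2/b.sst")],
    "L2/a.sst": {0: "v1", 7: "v1"},
    "L2/b.sst": {150: "v1"},
})


def read(store, key):
    manifest = store.get("manifest")                   # RT1
    starts = [first for first, _, _ in manifest]
    i = bisect.bisect_right(starts, key) - 1           # last range starting <= key
    if i < 0 or key > manifest[i][1]:
        return None                                    # no file covers this key
    return store.get(manifest[i][2]).get(key)          # RT2


value = read(store, 0)
gets_for_hit = store.get_count
```

Reading ID 0 costs exactly two GETs no matter how many files exist, and a key that no file covers costs only the manifest fetch.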

Cold to Hot

It appears that reading ID 0 is quite popular, but it's wedged in the cold depths of object storage. The game is hot-warm-cold harmony, so how does it get to the cache layer?

[Interactive demo]

The first read for ID 0 does 2 RTs on object storage:

RT1 = GET the manifest (sees ID 0 lives in L2)
RT2 = GET the L2 file.

We call this a cold query because it's slow, but on the way back, a copy of that L2 file is warmed onto NVMe. Subsequent warm queries are served from NVMe with no roundtrip to object storage. And because ID 0 keeps getting hit, its file is eventually promoted into memory, where it's served hot.
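
A minimal sketch of that cold, warm, hot progression is below. The class `TieredReader`, the plain dictionaries standing in for each tier, and the `promote_after` threshold are all assumptions for illustration; real systems use their own admission and eviction policies.

```python
class TieredReader:
    """Serves a file from memory, then NVMe, then object storage."""

    def __init__(self, object_store, promote_after=3):
        self.object_store = object_store   # dict standing in for object storage
        self.nvme = {}                     # warm tier
        self.memory = {}                   # hot tier
        self.hits = {}                     # warm-hit counter per file
        self.promote_after = promote_after

    def get_file(self, name):
        if name in self.memory:
            return "hot", self.memory[name]
        if name in self.nvme:
            self.hits[name] = self.hits.get(name, 0) + 1
            if self.hits[name] >= self.promote_after:
                self.memory[name] = self.nvme[name]   # promote warm -> hot
            return "warm", self.nvme[name]
        data = self.object_store[name]     # cold: round trip to object storage
        self.nvme[name] = data             # warm the file on the way back
        return "cold", data


reader = TieredReader({"L2/a.sst": {0: "v1"}}, promote_after=3)
tiers = [reader.get_file("L2/a.sst")[0] for _ in range(6)]
# tiers == ["cold", "warm", "warm", "warm", "hot", "hot"]
```

This sketch never evicts anything and ignores the manifest fetch, so it only shows the direction of travel: one cold read, a few warm ones, then hot.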

Different databases will have their own policies on the cache hierarchy / cache efficiency. This territory has tons of potential for improvement. What matters in the abstract is that the file can now be accessed at the cache layer.

Consistency

What if ID 0 got overwritten? What if it got overwritten multiple times: ID 0 v2,..., ID 0 vN?

[Interactive demo]

The system has no current way of checking if the cache has the freshest ID 0.

In this experiment, the cache is populated with a file from L2. After many reads, the writer node initiates a PUT request, which contains an ID 0 v2, which in turn means all subsequent reads on the cache layer are now stale!

How can the system be aware of this?

In brief, there are really only two ways to deal with this. When a write comes in, is it available to the next read? If the answer is yes, this is called Strong Consistency. If the write is not available to the next read but becomes available eventually, this is called Eventual Consistency. Each mode affects how the system performs, so it's important to distinguish the two.

Strong Consistency

Strong consistency requires a mechanism involving the manifest, with some cooperation on both the read and write paths.

Every written object gets stamped with an ETag. The manifest is an object too, so it gets one.

On a cold query, RT1 fetches the manifest into memory, and its ETag comes with it. Every subsequent read is packaged with a conditional GET that asks object storage whether the manifest's current ETag is the same as the one in cache. If nothing changed, the conditional GET response says nothing was modified.

[Interactive demo]

You can think of the ETag as a commit hash. In this scenario, the initial ETag was a1b2. A WAL object was created and the manifest's ETag changed to c3d4. The next read's conditional GET catches the new ETag and pulls the new manifest. The manifest has the most updated pointers to where the freshest data lives, so the reader initiates another GET request to satisfy the read request.

This works because object storage vendors like Azure, GCP, and AWS keep a metadata layer that tracks each object's ETag. The conditional GET checks the manifest's ETag there in the metadata plane, not needing to look inside its contents. If the ETag hasn't changed, a 304 is returned.
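
The conditional GET exchange can be sketched against a simulated store. This is not a vendor API: `FakeObjectStore`, its `if_none_match` parameter, and the choice to derive the ETag from a hash of the body are assumptions that only mimic the 200/304 behavior described above.

```python
import hashlib


class FakeObjectStore:
    """In-memory stand-in; the ETag here is a hash of the body (an assumption)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body: bytes):
        etag = hashlib.sha256(body).hexdigest()
        self.objects[key] = (etag, body)
        return etag

    def get(self, key, if_none_match=None):
        etag, body = self.objects[key]
        if if_none_match == etag:
            return 304, etag, None        # unchanged: no body is sent back
        return 200, etag, body


class Reader:
    def __init__(self, store):
        self.store = store
        self.etag = None
        self.manifest = None

    def current_manifest(self):
        status, etag, body = self.store.get("manifest", if_none_match=self.etag)
        if status == 200:                  # first read, or the manifest moved
            self.etag, self.manifest = etag, body
        return status, self.manifest


store = FakeObjectStore()
store.put("manifest", b"wal: []")
reader = Reader(store)
first = reader.current_manifest()                 # cold: 200 plus the manifest
second = reader.current_manifest()                # steady state: 304 from cache
store.put("manifest", b"wal: [wal/00000000]")     # a writer commits
third = reader.current_manifest()                 # the change is caught: 200
```

The reader pays for the metadata check on every read, but it only downloads the manifest body when the ETag has actually moved.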

The metadata check runs about 5-18ms depending on the vendor. This is the latency price every read must pay to stay current (strongly consistent).

[Interactive demo]

The issue here is that multiple objects can be in the WAL.

To find the fresh values, the reader has to exhaustively scan every WAL object: one GET per object. The bill is 1 GET for the ETag plus 1 GET per relevant changed object.

The above exhaustive search looks brutally slow, but why fire the GET requests in series?

Fire them in parallel! This makes the scan appear as one round trip of latency for the same $.

[Interactive demo]

This works because the requests are from the same ETag instance. The hope is that all the requests happen in a reasonable time. The exhaustive search is bounded by the slowest request.
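
Here is a small sketch of the parallel scan using a thread pool. The `WAL` dictionary, the `get_object` helper, and the shortened 50 ms sleep are stand-ins for real GET requests; they are assumptions made so the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Five hypothetical WAL objects, each holding version 2 of one key.
WAL = {f"wal/{i:08d}": {i: "v2"} for i in range(1, 6)}


def get_object(key):
    time.sleep(0.05)        # simulated round trip (shortened from ~100 ms)
    return WAL[key]


def scan_wal(keys):
    """Issue every GET at once; the wait is roughly the slowest request."""
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return list(pool.map(get_object, keys))   # results keep input order


start = time.perf_counter()
objects = scan_wal(sorted(WAL))
elapsed = time.perf_counter() - start
```

Issued in series, the five simulated GETs would take at least 0.25 s; issued together, the elapsed time should land close to a single 0.05 s round trip. The request count, and therefore the dollar cost, is the same either way.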

Here is what the demo is doing. The cache is serving keys 1 through 5 at version 1. The writer overwrites all five, so the WAL now holds version 2 of each, and the cache does not know yet. The next read's conditional GET catches the moved ETag and goes to fetch the fresh values. Instead of pulling the five WAL objects one at a time, it fires all five GETs at once. They land together, version 2 of each, and version 2 wins over the version 1 the cache had. Five requests, one roundtrip of waiting.

So what does a strongly consistent read actually return? The index holds keys 1-5 at v1. The WAL holds v2 for the keys that got overwritten (here, 2 and 4). The read consults the index and scans the WAL at the same time, then merges them. For each key the newest value wins: 2 and 4 come back v2 from the WAL, the rest stay v1 from the index. Reading the index alone would have missed the two fresh writes. The merge is what makes it consistent.
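
The merge step can be written down directly. The data shapes below are assumptions for illustration: each entry is a `(version, value)` pair, the index is one dictionary, and each WAL object is a small dictionary of overwritten keys.

```python
def merged_read(index, wal_objects):
    """Overlay WAL entries on the index; the higher version wins per key."""
    result = dict(index)
    for wal in wal_objects:
        for key, (version, value) in wal.items():
            if key not in result or version > result[key][0]:
                result[key] = (version, value)
    return result


index = {k: (1, f"value-{k}-v1") for k in range(1, 6)}           # keys 1-5 at v1
wal_objects = [{2: (2, "value-2-v2")}, {4: (2, "value-4-v2")}]   # fresh writes
view = merged_read(index, wal_objects)
versions = {k: v[0] for k, v in view.items()}
# versions == {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}
```

Because the comparison is on the version number, the order in which the WAL objects arrive does not matter, which fits the parallel GETs landing in any order.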

[Interactive demo]

Eventual Consistency

A lot of mechanics were required to explain Strong Consistency. Thankfully, the guide to Eventual Consistency is quite brief in comparison.

Turn off Conditional GET.

This removes the latency floor of checking the metadata layer, a 5-18ms saving, and the system will just serve from cache.

The cache only ever changes by a GET. Any policy that satisfies the "eventually" clause will work here. For example, a periodic timer which GETs the new manifest and data and updates the cache. So when the writer commits v2, the reader keeps serving stale v1 until the next refresh runs.
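
One possible refresh policy is sketched below. Instead of a background timer, this version checks the elapsed time lazily on each read, which is a simplification; `EventualReader`, the dictionary standing in for object storage, and the injected clock are all assumptions made to keep the example deterministic.

```python
class EventualReader:
    """Serves from cache and refreshes once refresh_every seconds have passed."""

    def __init__(self, store, refresh_every, clock):
        self.store = store                  # dict standing in for object storage
        self.refresh_every = refresh_every
        self.clock = clock                  # injected so the example is deterministic
        self.cache = dict(store)
        self.last_refresh = clock()

    def read(self, key):
        if self.clock() - self.last_refresh >= self.refresh_every:
            self.cache = dict(self.store)   # stands in for GET manifest + data
            self.last_refresh = self.clock()
        return self.cache[key]              # no conditional GET on this path


now = [0.0]
store = {"id0": "v1"}
reader = EventualReader(store, refresh_every=5.0, clock=lambda: now[0])

store["id0"] = "v2"          # the writer commits v2
stale = reader.read("id0")   # still "v1": the cache has not refreshed
now[0] = 5.0                 # the refresh interval elapses
fresh = reader.read("id0")   # "v2" after the refresh
```

The refresh interval is the staleness bound: shorter intervals cost more GETs, longer ones serve older data.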

[Interactive demo]

Conclusion?

Enjoy these closing remarks in the form of bullet points. (i'm too dumb to make it complicated)

- You can learn a lot about databases thinking in latency & tradeoffs
- The game is hot data in memory, warm data in NVME, cold data in object storage.
- memory fast! / disk fast / object storage slow
- memory $$$ / disk $$ / object storage $
- Consensus on object storage used to require a lot of dependencies. An inflection point occurred. Now it doesn't. It's the sole dependency of durability.
- object storage native ≈ compute storage flexible
- Cold queries are slow. Subsequent queries are fast.
- You can just batch more.
- You can make computers do anything.

Demo Y: the whole loop, single writer

One writer, no CAS. It buffers a write, PUTs it to the WAL, and the manifest's ETag bumps a1b2→c3d4. Meanwhile the reader does a conditional GET on every read, a cheap 304 from cache in steady state. When the write lands, the next read catches it (200), GETs the fresh object, and the cache is current again. One writer means no race, so there's no compare-and-swap.

[Interactive demo]

Demo Z: the whole loop, R/W fleet

A fleet of generic nodes that both read and write. Reads work exactly as in Y (conditional GET, 304s, catch-on-change), but because many nodes write at once, commits use compare-and-swap: two nodes both PUT if-match a1b2, one wins (→c3d4), the other gets a 412, re-reads, and retries (→e5f6). Same object storage, same read path; the only addition is CAS so concurrent writers don't clobber each other.
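
The compare-and-swap commit can be sketched with a simulated store. Nothing here is a vendor API: `FakeObjectStore`, `put_if_match`, the counter-based ETags, and the `commit` helper with its `snapshot` argument are hypothetical names that only mimic the If-Match and 412 behavior described above.

```python
class FakeObjectStore:
    """In-memory stand-in with a conditional PUT (If-Match on the ETag)."""

    def __init__(self, body):
        self.body = body
        self.version = 0

    def etag(self):
        return f"etag-{self.version}"

    def get(self):
        return self.etag(), list(self.body)

    def put_if_match(self, expected_etag, body):
        if expected_etag != self.etag():
            return 412                      # precondition failed: another node won
        self.body = body
        self.version += 1
        return 200


def commit(store, wal_key, snapshot=None, max_retries=5):
    """Append wal_key to the manifest with compare-and-swap; retry on 412."""
    attempts = 0
    etag, manifest = snapshot or store.get()
    while attempts < max_retries:
        attempts += 1
        if store.put_if_match(etag, manifest + [wal_key]) == 200:
            return attempts
        etag, manifest = store.get()        # re-read the manifest, then retry
    raise RuntimeError("too much contention")


store = FakeObjectStore([])
snap_a = store.get()          # both nodes read the same manifest version
snap_b = store.get()
attempts_a = commit(store, "wal/a", snapshot=snap_a)   # wins on the first try
attempts_b = commit(store, "wal/b", snapshot=snap_b)   # gets a 412, then retries
```

Node A commits in one attempt and node B in two, and the final manifest lists both WAL objects, so neither write clobbers the other.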

[Interactive demo]
