May 5, 2018

Towards a Better NBD Server

I’ve had an idea floating around in the back of my mind since working on flexnbd that there ought to be a better way to do block storage. This is, admittedly, a fairly niche interest.

FlexNBD is an NBD server which allows you to live-migrate the storage to a new instance on a different machine. In a VM hosting environment, this is extremely handy: if you know your storage back-end is flakey, you can evacuate before you actually lose any data. It’s good for ops in general, in fact. If there’s any reason you need to take a physical storage box out of service, you’ve got a path to get there without the VM guests running off block stores on that machine noticing.

You point your running instance of flexnbd at a new instance on another machine. It toddles off and duplicates its current backing store to that fresh instance, along with any writes that come in while that’s happening. Once it knows that all the data is on the new instance, it shuts down without accepting any more writes from the client. The client can then reconnect to the new instance (and when I was working on this, we had to get a change into qemu to make that reconnection happen… but I digress) without the guest OS being any the wiser.

When we first looked at this problem, there was already an NBD server that did live data migration: XNBD. However, looking at it in detail, XNBD seemed to have a rather critical flaw: the replication was pull-based, not push-based. That left a potential race condition: the client could send a write after the destination thought it had all the data, and the destination server would never see that write. This seemed bad, so we wrote a new server. NBD is a teeny tiny protocol, extremely well suited to this sort of experimentation.
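
To give you a sense of just how tiny: once the handshake is out of the way, the whole transmission phase boils down to a fixed-size request (optionally followed by data) and a fixed-size reply. Here’s a rough Go sketch of those messages, written from memory of the spec rather than lifted from flexnbd, so treat the names and field widths as my reading of it rather than gospel:

```go
// A sketch of the NBD transmission-phase messages, from memory of the
// spec rather than any implementation; check the protocol docs before
// trusting the field widths. Everything is big-endian on the wire.
package nbd

const (
	RequestMagic = 0x25609513 // precedes every client request
	ReplyMagic   = 0x67446698 // precedes every simple server reply

	CmdRead  = 0
	CmdWrite = 1
	CmdDisc  = 2 // disconnect
	CmdFlush = 3
	CmdTrim  = 4
)

// Request is what the client sends. A write request is followed by
// Length bytes of data; a read request is not.
type Request struct {
	Magic  uint32
	Flags  uint16 // command flags (new-style)
	Type   uint16 // CmdRead, CmdWrite, ...
	Handle uint64 // opaque cookie, echoed back in the reply
	Offset uint64
	Length uint32
}

// SimpleReply is what the server sends back. A reply to a read is
// followed by the data; everything else is just these three fields.
type SimpleReply struct {
	Magic  uint32
	Error  uint32 // 0 on success
	Handle uint64
}
```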

FlexNBD doesn’t do any replication itself. We relied on hardware RAID to solve that for us.

What I realised a fair distance into the project, well after we had FlexNBD in production, was that this separation of concerns was upside-down. By putting replication below FlexNBD, we restricted ourselves to having all replicas of the data on the same machine. This same design feature committed us to needing a lengthy data transfer process for evacuating a storage server. Even when you’re on a nice and snappy local network, shipping terabytes of data around isn’t something you can just do on a whim, so in practice the ability to live migrate data wasn’t quite as useful as it could have been.

(As an aside, I’ve not had access to that system in production for some time now, so it’s entirely possible that the ops team have found a far better solution for the problems I observed than writing a new server.)

I figured that it was better to already have the data on more than one machine, rather than having to wait to move it around. That would mean that there would be no need to evacuate an individual server. You could just switch it off.

This series of blog posts will be me figuring out how to build that system. I am probably very wrong about some aspects of it, but after all this time the only way I’m actually going to figure this out is by building the damn thing.

I should point out that Ceph comes very close to being this system, but for one detail: it’s operationally complex. It tries to solve many different problems, and the block storage system is layered on top of its object store system. What I want is a system where the file on disc is just the disc image. It should be possible to identify precisely which VMs are represented on a given storage server, and I should be able to give one of the backing files directly to qemu to boot from, or to process to recover data from. I don’t want to have to build further tools over and above those already available to get basic tasks done.

Some design decisions

1. Never Lose Data.

Kinda obvious, but worth stating: once a write has been acknowledged, the data is on disc. I’ll sort out flushing later. (There’s a rough sketch of what this looks like in code at the end of these design notes.)

Note that, at least in the first instance, this rules out any sort of writeback cache. I can get away with (read: can’t get away without) a readthrough cache, but writeback introduces too many Things That Can Go Wrong for me.

The general version of this principle is: design out ways to fail.

2. NBD only.

As Simple As Possible, but no simpler. I’ll use new-style NBD, but minimally. Each process will only know about a single disc, to avoid nasty problems where one client’s writes end up on another’s disc, or clients interact badly due to both stressing the same garbage collector.

3. Easy to operate.

Installation and operation must be straightforward. This is something we got right with flexnbd. Having the tool built from simple command-line utilities makes it easy to test, and easy to bend to whatever installation layout suits your use case.

4. Be pedantically reliable.

Without a quorum, we’re read-only. With a quorum, we can accept writes. This avoids total service loss without risking inconsistency.

5. Only one writer at a time.

Again a bit obvious, but worth stating. The system needs guards designed in against writes coming from more than one source.

6. It must be practical.

If I can’t sustain a solid 3 digits of MB/s to the client, I’ve done something wrong. “Something” may, in this case, have been to ask something of physics that isn’t actually possible. I’m choosing to believe this isn’t true, for the moment.
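
As promised under principle 1, here’s the sort of shape I have in mind for handling a write, sketched in Go under plenty of assumptions (invented names, no handshake, no error mapping): the single backing file this process owns is written before any reply goes back to the client.

```go
// A sketch of principle 1: the reply to a write is only sent after
// WriteAt has returned, so an acknowledged write has at least been
// handed to the kernel. Flushing, as noted above, comes later.
package nbdserver

import (
	"encoding/binary"
	"io"
	"os"
)

const replyMagic = 0x67446698

// handleWrite reads the payload for one NBD write request off the client
// connection, writes it into the single backing file this process owns,
// and only then acknowledges it.
func handleWrite(conn io.ReadWriter, disc *os.File, handle, offset uint64, length uint32) error {
	buf := make([]byte, length)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return err
	}

	// The data reaches the backing file before any reply goes out.
	if _, err := disc.WriteAt(buf, int64(offset)); err != nil {
		return err
	}

	// Simple reply: magic, errno (0 for success), then the client's handle.
	reply := make([]byte, 16)
	binary.BigEndian.PutUint32(reply[0:4], replyMagic)
	binary.BigEndian.PutUint32(reply[4:8], 0)
	binary.BigEndian.PutUint64(reply[8:16], handle)
	_, err := conn.Write(reply)
	return err
}
```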

Once more into the breach…

I struggled for a long time to understand how to correctly design the state machine that would drive the replication system, and how to verify it was correct. Then I realised: the disc image is a serialisation of a log, the log of all writes. This is exactly the problem that Raft was invented for, and Raft has already been validated.

There’s a great big proviso to that: Raft depends on being able to replicate the serialised state from one node to another in certain failure recovery conditions, not least of which is “recovering” from adding a new node. We know that moving disc images between nodes is expensive, and could take a long time in comparison to a Raft timeout. I also don’t want to force disc replication through the client: given existing nodes A and B and a new node C, it’s got to be source-to-sink only, but for load distribution reasons it probably wants to be A-C and B-C simultaneously. I don’t yet know whether lengthy state synchronisation breaks Raft, or whether I need to be clever about introducing messages into the log that do something other than write. I’ll have to solve that as part of this project.
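
What actually goes into the log, though, is about as simple as it gets: each client write becomes an entry, and “applying” an entry just means writing its payload at the right offset in the image file. A sketch, with an encoding I’ve invented purely for illustration:

```go
// A sketch of the "disc image is a serialised log" idea. The entry
// encoding here (an 8-byte big-endian offset, then the payload) is
// invented for illustration: replaying every entry in order against an
// empty file reproduces the image.
package imagelog

import (
	"encoding/binary"
	"os"
)

// WriteEntry is the only kind of command the log needs to carry (so far).
type WriteEntry struct {
	Offset uint64
	Data   []byte
}

// Encode packs an entry into the opaque []byte a consensus log carries.
func (w WriteEntry) Encode() []byte {
	buf := make([]byte, 8+len(w.Data))
	binary.BigEndian.PutUint64(buf[:8], w.Offset)
	copy(buf[8:], w.Data)
	return buf
}

// DecodeWriteEntry is the inverse of Encode.
func DecodeWriteEntry(buf []byte) WriteEntry {
	return WriteEntry{
		Offset: binary.BigEndian.Uint64(buf[:8]),
		Data:   buf[8:],
	}
}

// Apply replays one entry against the image file. The image *is* the
// state machine's state: apply the same log in the same order on every
// node and you end up with the same disc.
func Apply(img *os.File, e WriteEntry) error {
	_, err := img.WriteAt(e.Data, int64(e.Offset))
	return err
}
```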

How?

The choice of what to write this system in is an interesting one. C is right out: it’s far too low-level. Having been round the houses in C on flexnbd, I can honestly say that whatever C bought us in performance and proximity to the POSIX API was outweighed by the slowdown of choosing threading as the primary client-handling abstraction: I measured a 30% threading overhead against single-threaded NBD implementations. While I probably haven’t written my last C, I don’t see a convincing argument for using it here.

Erlang or Elixir is a very tempting option. I’m building a distributed system, and using BEAM would remove a whole mess of protocol design: I could just use ordinary messaging. Arguing against it is that right now I can’t find a suitable Raft implementation, so I’d be writing my own. Raft is simple enough that this isn’t a total blocker, but if I can avoid it, I will. There’s Riak’s multi-paxos implementation, but I don’t understand multi-paxos, so I’m even further from understanding what I shouldn’t do with it than I am with Raft. Bytecode on BEAM also isn’t that fast, but I don’t think that’s likely to be a problem for me: the vast majority of what the system will be doing is just shuffling data around.

There’s a further counter-argument: I don’t yet fully understand how to prevent BEAM’s GC from hanging on to binary data in flight. I’m sure that’s solvable; I just don’t want a hot code update to accidentally force me to learn about it at the worst possible moment.

There are two big arguments in favour of using Go. The first is that deployment is nice and simple: build a binary, copy it to a server, run it. I’m sure I can build a pipeline to make Elixir deployment that straightforward, but I don’t have that yet. The second argument is Hashicorp’s Raft implementation. Having that available means a large part of the problem Goes Away. Buying into it would mean I’d need to sort out the “distributed” part on my own, though: no epmd to help here.
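
For a flavour of what “a large part of the problem Goes Away” means: hashicorp/raft asks you to hand it an FSM, and it calls Apply on every committed log entry. Only the FSM interface in the sketch below is the library’s; the entry layout (offset, then payload, as sketched earlier) and the rest are my invention, and snapshotting a whole disc image is exactly the open question from the previous section.

```go
// A sketch of plugging a disc image into hashicorp/raft. Only the FSM
// interface (Apply/Snapshot/Restore) is the library's; the entry layout
// and the types here are invented for illustration.
package replication

import (
	"encoding/binary"
	"errors"
	"io"
	"os"

	"github.com/hashicorp/raft"
)

// DiscFSM applies committed log entries to the single backing file this
// process owns.
type DiscFSM struct {
	img *os.File
}

// Apply is called by the library once an entry is committed on a quorum.
// The entry is an offset followed by the bytes to write there.
func (f *DiscFSM) Apply(l *raft.Log) interface{} {
	offset := binary.BigEndian.Uint64(l.Data[:8])
	_, err := f.img.WriteAt(l.Data[8:], int64(offset))
	return err
}

// Snapshot is where the awkward question from the previous section lives:
// a "snapshot" of this state machine is the entire disc image.
func (f *DiscFSM) Snapshot() (raft.FSMSnapshot, error) {
	return nil, errors.New("snapshotting a whole disc image: still to be worked out")
}

// Restore replaces the image with a snapshot shipped from another node.
func (f *DiscFSM) Restore(r io.ReadCloser) error {
	defer r.Close()
	if _, err := f.img.Seek(0, io.SeekStart); err != nil {
		return err
	}
	_, err := io.Copy(f.img, r)
	return err
}
```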

I don’t have confidence in learning enough Rust to be useful; C++ is a footgun; Haskell is tempting but I’d have too much to learn to avoid laziness causing me the same problems binary GC would in BEAM; I don’t know enough about JVM tuning to go that way; and nothing interpreted seems like a good plan except for raw prototyping off the critical path.

Given these choices, I’m going to start with Go and see where I run into difficulties. I am definitely going to borrow Erlang’s crash-first philosophy, and something I’m going to need very early is some way to do no-downtime upgrades, another thing I’d get for free on BEAM.

And so to work…

On a meta note: this idea has been sitting in my head for at least 4 years. This series of blog posts is my way of forcing it out of my head, to make it come into contact with some sort of reality so that at worst I can at least say “well, I tried it and it didn’t work.” In the best case we get something useful out of it. In the worst case I (optimistically “we”) learn something, and I get some brain space back.

And, of course, since writing all the above, it’s come to my attention that Zab might be a more appropriate protocol than Raft. However, given that I can pretty much grab the latter off the shelf, I think it’s worth starting there and seeing how far I get, while keeping the rest of the project isolated from the details of the replication protocol in case swapping it out ends up looking like a good plan.

Want to hear more?

If this sort of thing floats your boat and you want to hear more, let me know. If you give me your email address, I’ll be able to let you know when I publish more in this series, or when I’m doing something relevant that might interest you.

That’s all I’ll use it for: just nice, juicy, relevant emails.