May 31, 2018

Building a Server

There are a whole load of moving parts in the system I want to build. For the most part, I know what they need to do at a high level, but there’s still a load of exploratory code I’m going to need to write to understand the nitty gritty.

In this post I’ll sketch out the various parts of the system, and start to break it down so that I can usefully spike each area before tying everything together.

NBD

It’s probably worth going into a little detail about what the NBD protocol looks like.

NBD has an interesting history, in that for a long time it had a single version, and that version was specifically designed for simplicity and ease of implementation. In that version, the client connects to the server, the server responds with a hello message containing the size of the device, then the client starts sending Read messages (with an offset and a length) or Write messages (with an offset, a length, and a load of bytes). The server responds to a Read with either an OK plus the data, or an Error message. It responds to a Write with either an OK or an Error. Each Read or Write message carries an 8 byte “handle” which the OK and Error have to echo back, so requests can in theory be interleaved.
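To make that concrete, here’s roughly how I picture those messages as Go types. This is a sketch of the fields the protocol carries, not the exact wire encoding; the real byte layout, magic numbers and command values all come from the NBD spec.

    // Sketch of the simple ("oldstyle") message shapes. The numeric
    // values here are placeholders; the real command numbers and wire
    // layout are defined by the NBD spec.
    type CommandType uint32

    const (
        CmdRead CommandType = iota
        CmdWrite
        CmdDisconnect
        CmdFlush
        CmdTrim
        CmdWriteZeroes
    )

    // Request is what the client sends. The Handle is opaque to the
    // server and is echoed back in the reply, which is what lets
    // requests be interleaved.
    type Request struct {
        Type   CommandType
        Handle [8]byte
        Offset uint64
        Length uint32
        Data   []byte // only populated for Writes
    }

    // Reply is what the server sends back: an error code (zero for OK),
    // the same handle, and the data for a successful Read.
    type Reply struct {
        Error  uint32
        Handle [8]byte
        Data   []byte // only populated for successful Reads
    }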

That’s all there was to it originally. A few other message types were added later: Flush, Trim, Write Zeros, and Disconnect. Flush tells the server to make sure everything outstanding is written to disc. Trim reclaims large runs of zeros so they don’t take up space.

The interesting thing about this set of commands is that beyond Read and Write, you can largely pretend the extras don’t exist. You can ignore Flush without breaking things if you’re already ensuring that a Write doesn’t return before the data is on disc. That’s slow, but safe. Trim is just a space optimisation: since it doesn’t change the data returned by a Read at those offsets (thanks to sparse files underneath), you can again ignore it if you don’t care about reclaiming the space.
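As a sketch of that strategy, here’s what a request handler might look like with a plain file as the backing store, reusing the Request/Reply shapes from above. The error value of 1 is a placeholder; real error codes come from the spec.

    // handle serves a single request against a local file. Writes are
    // synced before the reply goes out, which is what makes ignoring
    // Flush safe, and Trim is acknowledged without doing any work.
    func handle(req Request, backend *os.File) Reply {
        fail := Reply{Error: 1, Handle: req.Handle}
        switch req.Type {
        case CmdRead:
            buf := make([]byte, req.Length)
            if _, err := backend.ReadAt(buf, int64(req.Offset)); err != nil {
                return fail
            }
            return Reply{Handle: req.Handle, Data: buf}
        case CmdWrite:
            if _, err := backend.WriteAt(req.Data, int64(req.Offset)); err != nil {
                return fail
            }
            // Slow but safe: don't acknowledge until the data is on disc.
            if err := backend.Sync(); err != nil {
                return fail
            }
            return Reply{Handle: req.Handle}
        case CmdFlush, CmdTrim:
            // Everything is already synced, so Flush has nothing to do,
            // and Trim is a space optimisation we can skip for now.
            return Reply{Handle: req.Handle}
        }
        return fail
    }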

I’m going to break ground by using this simple version of the protocol.

There’s a newer version of the protocol (“newstyle”) that I’ll expand to support later. Among other things, it introduces the idea that responses to a Read message can be split across more than one reply, a format called a Structured Reply. That opens up the idea that a Read can be farmed out to more than one backing node without having to splice the responses back together again. You could do this without structured replies, but you’d need to allow for getting the second half of the data before the first, which would mean caching a potentially large response.

Top-Level Architecture

There will be two types of OS process: the proxy, and the server. The proxy will run on the same machine as the client. An instance of the server process will run on each of several storage machines. In the first implementation, each server process will be responsible for one copy of the data.

The proxy translates NBD requests into state transitions for the Raft protocol. The Raft protocol itself will be executed by the server processes. It will be the proxy’s job to connect to the Raft leader. Critically, the decision of whether to accept a write or not, in the case of a cluster without a quorum, must sit with the server, not the proxy. The proxy will never acknowledge a write that hasn’t been committed to the Raft log.
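In code terms, the rule for the proxy might look something like this. RaftLeader, Propose and encodeWrite are hypothetical names standing in for whatever the real Raft plumbing ends up being; the point is only the shape of the acknowledgement logic.

    // RaftLeader is the proxy's view of the cluster leader.
    type RaftLeader interface {
        // Propose blocks until the entry is committed to the Raft log,
        // or fails (no quorum, or we're no longer talking to the leader).
        Propose(entry []byte) error
    }

    // proxyWrite only acknowledges the NBD write once the corresponding
    // entry has been committed. No commit, no acknowledgement: the
    // accept/reject decision lives with the servers, not the proxy.
    func proxyWrite(req Request, leader RaftLeader) Reply {
        entry := encodeWrite(req.Offset, req.Data) // hypothetical serialiser
        if err := leader.Propose(entry); err != nil {
            return Reply{Error: 1, Handle: req.Handle}
        }
        return Reply{Handle: req.Handle}
    }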

Note that this doesn’t in itself mean the write must have been written to the disc image (although for simplicity, that’s where I’ll start). As above, supporting Flush and FUA (Force Unit Access) properly is where this gets optimised.

For performance reasons, the proxy will manage a readthrough cache. I’ll have to do some measurements to figure out how to control how big this is. “Fixed size” is simple but might be inefficient.

It’s important that, as far as possible, any OS process should be able to crash and restart without human intervention. Any of the server processes should be able to crash and restart without the client noticing.

How to Test It

I’ve got a workhorse VM: a Debian Jessie image with shutdown -r now in /etc/rc.local. It boots, it shuts down. I can use that as a sanity check, and as a “real world use” benchmark. I can also make a VM that boots straight into a bonnie++ run for some more artificial throughput measurement.

On top of that I’ll need some basic protocol tests and a fuzzer.

What to build

This description immediately lends itself to a certain breakdown. I’m in Go, so goroutines and channels are first-class citizens. I can use channels as boundaries between Things I Need To Care About.

NBD Protocol Talking Thing

I’ll need a component to translate the on-the-wire NBD protocol into messages on a channel.

To properly test this, I’ll also need something local that can understand these messages and just read/write to a local file.
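A first sketch of that plumbing, again reusing the Request/Reply shapes and the file-backed handle() from earlier; decodeRequest is a hypothetical helper that does the byte-level parsing, and net.Conn / os.File come from the standard library.

    // readRequests decodes NBD requests off the wire and pushes them
    // onto a channel, so everything downstream only ever sees Request
    // values rather than raw bytes.
    func readRequests(conn net.Conn, out chan<- Request) {
        defer close(out)
        for {
            req, err := decodeRequest(conn) // hypothetical wire parser
            if err != nil {
                return // EOF or a protocol error; the caller decides what happens next
            }
            out <- req
        }
    }

    // serveFile is the "just read/write a local file" backend for
    // testing: it drains the request channel and pushes replies onto
    // another channel for whatever is writing the connection.
    func serveFile(backend *os.File, in <-chan Request, out chan<- Reply) {
        defer close(out)
        for req := range in {
            out <- handle(req, backend)
        }
    }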

Readthrough cache

Can I take the NBD Protocol Talking Thing and insert an in-memory cache between it and the local file writer without breaking things?

Note that at this point I’m not really expecting a change in performance: I ought to be duplicating the job of the OS file cache here. If anything, things might go slower, because I’ve introduced an extra step along the data path. The point is more “does it work” than “does it work fast”.
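A minimal sketch of what that experiment might look like: a fixed-size, deliberately naive block cache keyed by offset, sitting between the request channel and the file. Overlapping ranges and sensible eviction are exactly the sort of thing the measurements will have to sort out.

    // blockCache is a naive readthrough cache over a backing file.
    type blockCache struct {
        max   int
        cache map[uint64][]byte // offset -> block; overlapping ranges aren't handled here
    }

    func newBlockCache(max int) *blockCache {
        return &blockCache{max: max, cache: make(map[uint64][]byte)}
    }

    func (c *blockCache) read(offset uint64, length uint32, backend *os.File) ([]byte, error) {
        if data, ok := c.cache[offset]; ok && uint32(len(data)) == length {
            return data, nil // cache hit: skip the file entirely
        }
        buf := make([]byte, length)
        if _, err := backend.ReadAt(buf, int64(offset)); err != nil {
            return nil, err
        }
        if len(c.cache) >= c.max {
            for k := range c.cache { // evict an arbitrary entry; fine for a spike
                delete(c.cache, k)
                break
            }
        }
        c.cache[offset] = buf
        return buf, nil
    }

    func (c *blockCache) write(offset uint64, data []byte, backend *os.File) error {
        if _, err := backend.WriteAt(data, int64(offset)); err != nil {
            return err
        }
        c.cache[offset] = append([]byte(nil), data...) // keep the cache coherent with writes
        return nil
    }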

Goroutine management

At this point I’ve got at least three goroutines all doing different things in the same OS process. What happens when one of them panics? How do I want to handle that? Am I safe to drop everything on the floor and restart the whole thing? Does that incur horrible performance penalties? Am I safe not to scrap everything and restart? Are there states where just restarting one of the goroutines can cause data loss?
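One possible shape for that experiment is a small supervisor that recovers a panic and restarts the worker. Whether that’s actually safe here, or whether the whole process should just die and be restarted by the OS, is the question this step needs to answer.

    import (
        "fmt"
        "log"
        "time"
    )

    // supervise runs work in its own goroutine, recovering panics and
    // restarting the worker instead of letting it take the process down.
    func supervise(name string, work func() error) {
        go func() {
            for {
                err := runRecovering(name, work)
                if err == nil {
                    return // clean shutdown: don't restart
                }
                log.Printf("%s stopped: %v; restarting", name, err)
                time.Sleep(time.Second) // crude backoff
            }
        }()
    }

    // runRecovering turns a panic in the worker into an ordinary error.
    func runRecovering(name string, work func() error) (err error) {
        defer func() {
            if r := recover(); r != nil {
                err = fmt.Errorf("%s panicked: %v", name, r)
            }
        }()
        return work()
    }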

Speaking Raft

Spin up a quorum, start talking, make some big state files, and see what state replication looks like when recovering from an outage.

One interesting question here: what does replacing a node for planned maintenance look like? If I want to go from 3 nodes to 4 then back to 3, is that significantly painful compared to going from 3 to 2 to 3?
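In terms of a hypothetical Cluster interface (no particular Raft library implied), the add-then-remove ordering is the whole point: going via 4 nodes still tolerates a node failure during the change, while going via 2 means any single failure loses quorum.

    // Cluster is a hypothetical handle on cluster membership.
    type Cluster interface {
        AddNode(addr string) error    // one membership change
        RemoveNode(addr string) error // another membership change
    }

    // replaceNode brings the new node in (and lets it catch up on the
    // log) before the old one leaves, so the cluster never drops below
    // three voting members.
    func replaceNode(c Cluster, oldAddr, newAddr string) error {
        if err := c.AddNode(newAddr); err != nil {
            return err
        }
        return c.RemoveNode(oldAddr)
    }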

This step should also inform what the final command-line UI will have to look like. I’m aiming for consistency and ease of use, but fundamentally I can’t get away with simply masking complexity. If there’s inherent complexity, it’s better to reveal it so that it can be managed rather than hide it under a leaky abstraction. Of course, if I can find an abstraction that’s particularly non-leaky, so much the better.

ALARM!

How do we tell the operators when something’s gone wrong? My gut instinct is just to put messages into syslog and let existing systems handle that, but we’ve got to make sure there’s enough information there to help the operators properly manage the situation. Which situations warrant messages? We don’t want to be too chatty, but we also don’t want to hide precursor warnings. There’s also a question of which messages about internal cluster state should be passed back for the proxy to log on the client’s machine.
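As a sketch of the syslog instinct, using the standard library’s log/syslog package (the tag and messages here are made up, and picking the right severities is part of the open question):

    import (
        "log"
        "log/syslog"
    )

    func alarmSketch() {
        // "blockstore" is a placeholder tag for whatever the daemon
        // ends up being called.
        w, err := syslog.New(syslog.LOG_DAEMON|syslog.LOG_WARNING, "blockstore")
        if err != nil {
            log.Fatal(err)
        }
        defer w.Close()

        // Precursor warnings and hard failures get different severities
        // so existing monitoring can route them sensibly.
        w.Warning("node 10.0.0.3 missed 3 heartbeats; quorum still intact")
        w.Err("lost quorum: refusing writes until a majority returns")
    }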

Putting it all together

Once I’ve answered all these questions (a piffling trifle, I’m sure you’ll agree), I should be in a position to stick everything together and make a practical package. I can only go so far testing locally, so as a final step I’m going to try running the system in three scenarios: a local network drive backing up to a big file somewhere cloudy; a data share between three Raspberry Pis on a wet-piece-of-string local wifi network; and backing a proper VM host on proper hardware in a proper DC. I’m hoping that will give me a good spread of understanding of where the weak points are.

At least, that’s the plan. The plan will change. Some, none, or all of the above will be wrong.

Want to hear more?

If this sort of thing floats your boat and you want to hear more, let me know. If you give me your email address, I’ll be able to let you know when I publish more in this series, or when I’m doing something relevant that you’ll be interested in.

That’s all I’ll use it for, just nice, juicy, relevant emails.