Better Code

Swimming in Someone Else's Pool

TDD and Simplicity

| Comments

Oh, I didn’t write tests for that. It’s too hard to test.

Heard that before? I’ve made this argument myself in the past, but it’s only recently that I’ve figured out precisely why it’s a stupid argument.

On the surface, and in context, it can seem reasonable. The code in question is often either interacting with the OS via system calls, or is involved in multi-process coordination, os else involves fiddly thread interaction. These can all be notoriously tricky to get right, and are often just hacked into shape until they “work”.

Nothing I’ve just said justifies not writing unit tests. Quite the reverse.

One of the benefits which proponents of TDD claim is that the code you end up with is less coupled, more clearly factored, and in general easier to understand than otherwise. In general, I’ve found this to be broadly true. I think this happens because TDD guides you towards teasing the problem apart into simple parts which are individually easy to reason about.

With this in mind, it’s clear that an argument for skipping tests when the interactions are complex is an argument to make code which is difficult to get right harder to understand. This does not strike me as a sane rationale.

If you look back at the situations where this argument raises its head, you’ll note that they all involve at least two separate issues: firstly discovering the state of “the system” (where the “system” is either the OS, a bundle of processes, or a bundle of threads), and secondly applying application logic to decide how to change that state. All the cases I’ve seen which provoke the “it’s too hard” response have, to borrow an expression from Rich Hickey, complected the interactions with the system with the decision logic behind the interactions.

Time for an example. In this code, which roughly paraphrases some production code I refactored recently, we’re connecting to a remote host with an SSH master socket, then giving the user a console over that socket. We need to do this so that we can distinguish between a failure to connect to the remote host causing a time-out, and the user successfully connecting and using the console.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
class SshConsole

  def initialize( user, host )
    @user = user
    @host = host
  end


  def connect(timeout_time=Time.now+20)
    pid = fork{ exec "ssh -fMN -S/tmp/foo.sock #{@user}@#{@host}" }

    until Time.now > timeout_time
      done_pid, status = Process.waitpid2( pid, Process::WNOHANG )
      unless done_pid
        sleep 1
        next
      end

      # Still here? Then we're done, find out what happened

      # Return if the master socket process was killed:
      return false if status && !status.exited?
      # Return if there was a problem setting up the master socket
      return false unless status && status.exitstatus == 0

      # Still here? Then the master socket setup worked.
      begin
        # Launch the ssh console
        system "ssh -S/tmp/foo.sock #{@user}@{@host}"
      ensure
        # Kill the master socket when we're done
        system "ssh -NOexit -S/tmp/foo.sock #{@user}@#{@host}"
      end
      return true
    end

    # If we fall through to here, the master socket process failed to
    # connect before timeout_time.

    Process.kill("TERM", pid)
    Process.waitpid2( pid )
    return false
  end


end

Hopefully you can see how this is not code that would be simple to unit test. We have a fair amount of system interaction which would necessitate either a lot of messy, fragile mocking, or actual forking of SSH processes and faking of network failures to ensure coverage. Neither of these is ideal, and the reason this is hard is because the code above mixes up discovering the state of the system with acting on that state. I rewrote it to look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
class SshMasterProcess
  def initialize( user, host )
    @user = user
    @host = host
    @pid = nil
    @status=nil
  end

  def launch
    @pid = fork{ exec "ssh -fMN -S#{socket} #{@user}@#{@host}" }
  end

  def quit
    system "ssh -NOexit -S#{socket} #{@user}@#{@host}"
  end

  def finished?
    waited_pid, @status = Process.waitpid2( pid, Process::WNOHANG )
    !!waited_pid
  end

  def killed?
    @status && !@status.exited?
  end

  def succeeded?
    @status && @status.exitstatus == 0
  end

  def kill
    Process.kill(9, @pid)
    Process.waitpid2( @pid )
  end

  def socket
    "/tmp/foo.sock"
  end
end


class SshConsole

  def initialize( user, host, ps_builder=SshMasterProcess )
    @user = user
    @host = host
    @ps_builder = ps_builder
  end


  def connect( timeout_time = Time.now + 20 )
    master = @ps_builder.new( @user, @host )
    master.launch

    until Time.now > timeout_time
      unless master.finished?
        sleep 1
        next
      end

      return false if master.killed?
      return false unless master.succeeded?

      begin
        system "ssh -S#{master.socket} #{@user}@#{@host}"
      ensure
        master.quit
      end
      return true
    end

    master.kill
    return false
  end
end

In a way, this is a fairly typical refactoring: I’ve extracted behaviour into a class to enable dependency injection. The reason it’s interesting in this context is because I’ve isolated all the functionality related to identifying and modifying the state of the SSH master process away from the logic which drives that state. This means that the logic can be tested without going outside the test process. The structure of the logic hasn’t changed at all, but by adding a layer of indirection in the form of the SshMasterProcess class, the code has become simpler and much easier to test.

The take-away message from this is that when code looks hard to test, you should search for overlapping concerns, and keep an eye out for overlap between “system” state and modifications of that state.

Comments