Where’s all the magic? In the linker…

I have been trying to make a post per week about Go, but that requires learning something interesting during the week. I’m currently cycling between several little Go toys as I get the time. One is to make Go on raw hardware more useful/interesting. Another is a clone of the console server from conserver.com written in Go. Neither one of those little projects is at a point where I can really explain much about it, but not for want of trying… This week my Go console server project taught me that netchan cannot send channels (OK, I wasn’t really shocked at this, but I was hoping that it might work), and so I’ll need to make my own protocol, and include some proxy channel reader/writers at each end of the TCP connection to send the data into the channel like I want. On the raw-hardware side, I got stuck on code that ends up incorrect once it ends up in the ELF file. Go figure.

So though I don’t have the final answer yet, even what I know so far is at least a bit interesting, and that will be the topic of this week’s post. (Hopefully next week’s will be “I GOT IT! Here’s the solution…”. but don’t hold your breath.)

When you set GOOS to “tiny”, and then rebuild the runtime, you get a runtime system that knows how to run Go code on raw hardware with no underlying OS (see $GOROOT/src/pkg/runtime/tiny/README for some more info). Of course, if you try to compile something that reads a file, it will fail: there’s no kernel, let alone a filesystem. But you can write to the screen, which is nifty. I though it would be cool to add some more stuff to Go’s tiny runtime, for example a way to receive hardware interrupts over a channel. What’s really interesting about Go on raw hardware is that Go already provides a few of the things you need to write an OS (threads, scheduling, bounded arrays — a form of memory protection). I’m not proposing to write “the Go OS”. But I think it would be interesting to be able to run some kinds of servers on raw hardware; if you don’t need/want an OS under you, why should you have to have one?

So far, I’ve got a whole heap of code to compile when GOOS=tiny, including exp/draw and all its dependencies. My current goal is to put a white rectangle onto the VGA screen using exp/draw and a new backend that I’m calling exp/draw/svga.go. But something is going wrong, I’m getting a page fault from running off into bad memory. To debug that, it’s time to set up proper trap handlers. A nifty way to do that, borrowed from Plan 9, looks like this:

TEXT  intr0(SB),7,$0
  PUSHL $0
  PUSHL $0
  JMP intrcommon
TEXT  intr1(SB),7,$0
  PUSHL $0
  PUSHL $1
  JMP intrcommon
...
intrcommon:
  PUSHL DS
  PUSHL ES
  PUSHL FS
  PUSHL GS
  PUSHAL
  ...
  IRETL

What, you don’t know assembly? Please, I think you can figure out what’s happening here, but I’ll help anyway. For each type of interrupt, we emit a bit of code that sets up the stack right, then uses a common piece of code. This is something you can do in assembly that you can’t do in Go: jumping from inside of one routine to a label inside of another one. Also, you can’t futz with specific registers and the hardware stack, which is what you have to do in an interrupt service routine, and which explains why this code is assembler and not Go code.

When you run this code in Bochs and single step it, you find out what’s actually running on the processor is something more like this:

TEXT  intr0(SB),7,$0
  PUSHL $0
  PUSHL $0
  PUSHL DS
  PUSHL ES
  PUSHL FS
  PUSHL GS
  PUSHAL
  ... (rest of intrcommon here) ...
  IRETL
TEXT  intr1(SB),7,$0
  PUSHL $0
  PUSHL $1
  ...
 loop:
  JMP loop

Thus if you happen to hit interrupt 0, things work ok. But if you happen to hit interrupt 1, you are stuck in endless loop hell.

It seemed to me like 8a was incorrectly compiling this. So I added a bunch of debugging code to it, and I convinced myself that it was working right. One problem is that there is no debugging flag to help you understand the output .8 file you get after doing “8a test.s”. If 8a is not compiling it wrong, who else could be? Well, my interrupt service routines had to make it through one other piece of code before getting on the raw hardware: 8l. The problem must be in there, but I haven’t found it yet.

There’s way more than my bug in 8l, however. This is really where a lot of the magic of the Go system come from. It implements the split stacks Go uses to support thousands of threads. It also implements dependency tracking, so that the right object files are pulled in to satisfy all the imports. In contrast, the C linker is traditionally implemented as a rather stupid thing that just gathers .o files together. One thing you find in libraries meant for wide reuse is that all the functions are in their own file, because cc and ld are not smart enough to exclude unneeded code from the final binary except on the level of the .o file. But not all linkers are stupid: The GNU tool chain is smarter than this, at least on some platforms. Take a look at ld-s -gc-sections options if you are interested. And this article in MSDN talks about how .NET uses link time code generation for some optimizations.

I’ve confirmed that the instruction stream is coming into 8l correctly. The undocumented “-W” flag tells 8l to show you the contents of every .8 file it is reading. Take this little sample .s file:

TEXT foo1(SB), 7, $0
	NOP
	JMP bar
TEXT foo2(SB), 7, $0
	NOP
bar:
	NOP
	RET

Compile it with “8a test.s”, then look at it using “8l -W test.8”:

        ANAME   /
        ANAME   home
        ANAME   jra
        ANAME   go
        ANAME   src
        ANAME   cmd
        ANAME   8a
        ANAME   foo.s
(1)     HISTORY ,
(9)     HISTORY ,
        ANAME   foo1
(1)     TEXT    foo1+0(SB),7,$0-0
(2)     NOP     ,
        ANAME   bar
(3)     JMP     ,5(PC)
        ANAME   foo2
(4)     TEXT    foo2+0(SB),7,$0-0
(5)     NOP     ,
(7)     NOP     ,
(8)     RET     ,
(8)     END     ,
... and then 1000's of lines of dumps from other files here...

The output from -W shows that test.8 knows it is supposed to be jumping up into the next routine. My current guess at what’s happening is that 8l processes things one function at a time, never expecting dependencies between functions like this (since it is most definitely not something it should expect to be coming from 8g).

Bonus trivia nugget: The Go system defines a set of flags on functions that lets 8g and 8a communicate with 8l and tell it what to do. If a routine cannot, or should not check the stack size, then it needs to use a text line with 7 in it. If you leave out the 7, 8l will add in the stack checking code (which would be an extremely crash-worthy idea in an ISR, don’t you think?). We’ll talk more about stack checking one week when I dig into how split stacks work. For now, you can read Ian’s article on them.

Who said life is fair? The Go scheduler certainly didn’t…

In my last post, I showed a program that had a strange behavior that caught my eye. I was trying to look at how Go handles shared access to globals, but the program also had the unintended effect of measuring the “fairness” of the Go scheduler. Here’s the program again, for context:

package main

import "runtime"

var x int
var ch = make(chan bool)

func f(inc int) {
	for {
		println("f: ", inc)
		x += inc
		ch <- true
	}
}

func main() {
	go f(1)
	go f(-1)
	runtime.Gosched()

	for {
		_ = <- ch
		_ = <- ch
		println("main sees x = ", x)
	}
}

When you run this with GOMAXPROCS=3, and you are on a machine with at least three cores, you get precisely what you'd expect: three system threads, each one running one goroutine, and total fairness: f(1) happens once for every f(-1), and the long-run value of x is around 0, but sometimes higher or lower.

When you run this with GOMAXPROCS=1, you see something else. x goes to infinity because the goroutines are called in a unchanging cycle like this:

* f(1)
* f(1)
* f(-1)
* main

Talk about two steps forward, one step back!

Now, no one ever said life was fair, nor that the Go scheduler is fair. But in the spirit of digging down into Go more, let's go find out why not.

Taking a look at the assembly for main, we can see that each read on the channel results in a call into runtime.readchan1. It is this call into the runtime that causes main() to yield control to the scheduler and let the f(1), f(1), f(-1) cycle begin again, so it should be interesting. If we go take a look at that function (see $GOROOT/src/pkg/runtime/chan.c) it's not easy to understand right away how marking the sending goroutine as unblocked and ready results in the behavior we see. But this does give us a chance to talk about the Go runtime's debugging feature, which is useful.

This snippet from chan.c shows how the Go runtime does debugging:

static  int32 debug = 0;

...

void
chanrecv(Hchan* c, byte *ep, bool* pres)
{
	...
	f(debug)
    printf("chanrecv: chan=%p\n", c);

Not exactly earth-shattering, I know, but this is what's nice about Go's code: it is the simplest possible thing that work.

If you change debug = 0 to debug = 1, then recompile (gomake -C $GOMAKE/src/pkg/runtime clean install) your runtime will now spam out messages about channels. This pattern works for other modules, like slices and the scheduler (found in proc.c). Once we have debugging turned on in our runtime, we recompile and relink the test program, and now we have this output:

 1: makechan: chan=0xb7f1d000; elemsize=1; elemalg=0; elemalign=1; dataqsiz=0
 2: f:  1
 3: chansend: chan=0xb7f1d000; elem=1
 4: f:  -1
 5: chansend: chan=0xb7f1d000; elem=1
 6: chanrecv: chan=0xb7f1d000
 7: chanrecv: chan=0xb7f1d000
 8: main sees x =  0
 9: chanrecv: chan=0xb7f1d000
10: f:  1
11: chansend: chan=0xb7f1d000; elem=1
12: f:  1
13: chansend: chan=0xb7f1d000; elem=1
14: f:  -1
15: chansend: chan=0xb7f1d000; elem=1
16: chanrecv: chan=0xb7f1d000
17: main sees x =  1
18: chanrecv: chan=0xb7f1d000
19: chanrecv: chan=0xb7f1d000

Hidden in there is the answer to our question, but we'll have to take it line by line to find it. I have added line numbers to help talk about the output.

In lines 2-5, f(1) and f(-1) do their thing and then block themselves by writing to the channel. Once they are both blocked, and now that the channel has got two values waiting on it, main wakes up and receives them both (lines 6-8). In line 9, main's just gone to sleep waiting on the channel at the top of its loop, on the second time through the loop.

Now that main is asleep, the scheduler looks around for something to do. f(1) and f(-1) are runnable because they are no longer blocked on their channel writes. In lines 10 and 11, f(1) gets a chance to increment x, and then sends on the channel. What we expect to happen now (on average, anyway) is that f(-1) will get a chance to run. Instead f(1) gets to run again (lines 12-13). Why? Because main was already sleeping on the channel when f(1) wrote to it. Instead of being put to sleep, f(1) completes it's write and keeps running. On the second write to the channel, it ends up asleep.

This points out the key lesson about unbuffered channels from this post: if the receiver is blocked on a read, and the sender does a send, he zips right through the send without yielding control to the Go scheduler.

After f(1) takes the second trip through the loop and finds that the reader is now waiting to wake up on the value he wrote last time through, now he finally goes to sleep. In $GOROOT/src/pkg/runtime/chan.c, in function chansend, the difference between the first time through f(1) and the second time through f(1) is the difference between the dequeue returns non-nil case and the code that follows, where a new pseudo-g is allocated and filled in, then gosched() is called, resulting in the f(1) goroutine yielding.

The Go runtime needs to choose the next goroutine to run. It has now got two choices, main and f(-1). It takes the routine that was marked runnable longest ago and runs it; in this case f(-1). Why? Probably to ensure that goroutines cannot be starved for time to run. f(-1) decrements and then tries to send on the channel, but is discovers that there's no waiting channel reader (remember, f(1) filled up main's outstanding read request before) and goes to sleep. main gets to run and reads f(1)'s first value (line 9, which happens after line 15), then it's second (line 16). It prints, loops, finds the message from f(-1) and then goes to sleep again.

So it's clear that due to the behavior of writes onto channels with pending readers, this program is not doing what we wanted at all. A solution is to use two channels instead of one shared channel:

package main

import "runtime"

var x int

func f(inc int, ch chan bool) {
	for {
		println("f: ", inc)
		x += inc
		ch <- true
	}
}

func main() {
	ch1 := make(chan bool)
	ch2 := make(chan bool)

	go f(1, ch1)
	go f(-1, ch2)
	runtime.Gosched()

	for {
		_ = <- ch1
		_ = <- ch2
		println("main sees x = ", x)
	}
}

The results are satisfying:

makechan: chan=0xb7e47000; elemsize=1; elemalg=0; elemalign=1; dataqsiz=0
makechan: chan=0xb7e477c0; elemsize=1; elemalg=0; elemalign=1; dataqsiz=0
f:  1
chansend: chan=0xb7e47000; elem=1
f:  -1
chansend: chan=0xb7e477c0; elem=1
chanrecv: chan=0xb7e47000
chanrecv: chan=0xb7e477c0
main sees x =  0
chanrecv: chan=0xb7e47000
f:  1
chansend: chan=0xb7e47000; elem=1
f:  1
chansend: chan=0xb7e47000; elem=1
f:  -1
chansend: chan=0xb7e477c0; elem=1
chanrecv: chan=0xb7e477c0
main sees x =  1
chanrecv: chan=0xb7e47000
chanrecv: chan=0xb7e477c0
f:  -1
chansend: chan=0xb7e477c0; elem=1
f:  -1
chansend: chan=0xb7e477c0; elem=1
f:  1
chansend: chan=0xb7e47000; elem=1
main sees x =  0

Note that there's a new interesting pattern, f(1), f(1), f(-1), main, f(-1), f(-1), f(1), main. But that one will have to wait for another post, the baby wants to play... 🙂

What’s happening here? And when?

A while ago, I posted to the Go users list about what seemed like a problem in how Go was choosing registers versus global variables. Roger’s answer was “go RTFM“, which was precisely the right thing to do. However, it took reading it twice (I’d read it before) and some hard pondering to connect what I was reading to what I was seeing.

In order to save you, the reader, from the same experience, here’s a more detailed explanation of how “happens before” applies to programs where coroutines are writing to and reading from globals. Disclaimer: You really shouldn’t be doing this. In Go, you “share memory by communicating, not communicate by sharing memory”. Asking questions about “what happens when two coroutines write to a global?” means you are already thinking about communicating by sharing, and you are in for a world of hurt. But in this case, I was exploring the concurrency model in coroutines in the absence of calls into the runtime (i.e. cooperative or not?) and so I wanted to avoid channels.

OK, on to the (broken) code:

package main
import "runtime"

var i int

func f(inc int) {
        for {
                i += inc
        }
}

func main() {
        runtime.GOMAXPROCS(3)
        go f(1)
        go f(-1)
        runtime.Gosched()

        for j := 0; ; j++ {
                if (j % 1e9) == 0 {
                        print(i, "\n")
                }
        }
} 

The idea here is to get two uncooperative goroutines (i.e. grabbing the CPU and never giving it back) onto two cores, and then watch what happens from the third goroutine. “Watch what happens” is secret code for “communicate by sharing memory” and we’ve already established that’s a bad idea. Thus it should be no surprise to the reader at this point that the output of main is all zeros, instead of the occasional -1, -2 or 100 or whatever you might have expected.

Looking at the assembly (using “8g -S”) we can see why main() never sees the results from f()’s work:

--- prog list "f" ---
0000 (hog.go:7) TEXT    f+0(SB),$0-4
0001 (hog.go:7) MOVL    inc+0(FP),CX
0002 (hog.go:7) MOVL    i+0(SB),AX
0003 (hog.go:8) JMP     ,5
0004 (hog.go:8) JMP     ,7
0005 (hog.go:9) ADDL    CX,AX
0006 (hog.go:8) JMP     ,5
0007 (hog.go:11) RET     ,

The problem is that the global i is loaded into AX and incremented there, without ever being written back into the memory location where i came from. So main() can’t see it going up and down, because i exists in three different places, the AX for f(1), the AX for f(-1) and the global that main() is looking at.

Is this a bug in Go’s register allocation? No, it is the result of the “happens before” analysis described in The Go Memory Model. Go read that now, you need to understand that to understand the rest…

The examples in The Go Memory Model do not clearly cover this case, and I found it a bit hard to reason about it until I started unrolling the loops in my head so that the program looks like this:

package main
var i int

func f(inc int) {
  i += inc  // 1
  i += inc  // 2
  i += inc  // 3
  ...
}

func main() {
  go f(1);
  go f(-1);

  // in practice, you'd need this to get them running, but it's
  // not important for this analysis
  //runtime.Gosched()

  println(i)  // 4
  println(i)  // 5
  ...
}

The happens-before relation for line 1 and line 2 is governed by the rule that inside a goroutine, program order determines happens-before. The happens-before relation for 4 and 5 are likewise defined. The problem comes when you consider the happens-before order of lines 1 and 4. It is not defined, because the only definition of happens-before that spans goroutines is that implied by sends and receives on a channel.

In fact, by adding a channel to the program, you can make the register assignment I originally complained about go away because you are fixing the happens-before order to be something that the Go compiler can only guarantee by writing to a global and not to a register:

package main
import "runtime"

var x int
var ch = make(chan bool)

func f(inc int) {
  for {
    x += inc
    ch <- true
    println("f: ", inc)
  }
}

func main() {
  go f(1)
  go f(-1)
  runtime.Gosched()

  for {
    _ = <- ch
    println(x)
  }
}

Looking at f() in assembly now gives the right answer:

--- prog list "f" ---
0000 (hog8.go:8) TEXT    f+0(SB),$24-4
0001 (hog8.go:9) JMP     ,3
0002 (hog8.go:9) JMP     ,23
0003 (hog8.go:10) MOVL    inc+0(FP),BX
0004 (hog8.go:10) ADDL    BX,x+0(SB)
... send on channel is here, not interesting for our needs ...
0022 (hog8.go:9) JMP     ,3
0023 (hog8.go:14) RET     ,

Notice how the compiler realized that, because there is a channel send coming up, that the add must happen out in memory (ADDL BX,x+0(SB)) and not in the register as before.

This program uncovers a new strange behavior, however. With GOMAXPROCS=1, it runs f(1),f(1),f(-1),main in a loop, resulting in i incrementing without bound. With GOMAXPROCS=3, it does the expected thing, where i stays around zero, but sometimes goes up and down. I will try looking into that little mystery another time, stay tuned!