A trip down the (split) rabbithole

Note: This post is out of date, and will become increasingly out of date when Go’s new contiguous stacks are implemented. I’m leaving it here because it is still interesting, even if out of date.

Go uses split stacks (also called segmented stacks in the literature) in order to allow thousands of stacks in the space that would normally be taken by hundreds of C-style contiguous stacks. There’s a discussion of how to add split stacks to GCC here. Note: it was written by the author of gccgo, around the time he started porting Go to Gcc, so it’s clear that he’s adding split stacks to Gcc to help Go. But if it’s helpful to C programmers as well, then so much the better!

Ian’s article is dedicated to how he planned to add split stacks to GCC. I want to dig down to the lowest level and trace every single step of the process of maintaining split stacks in the native Go runtime. I hope you, dear reader, do too. Here we go…

First, we need a program that’s going to use up all it’s stack. Since the point of this article is not to learn about Go’s stack-based object allocation we’ll use up the stack with return addresses, specifically to main:

package main

func main() {
	main()
}

Now, don’t go running that just yet, or else you’ll have the out of memory killer on your back. Instead, go run that under gdb, setting a breakpoint on runtime.morestack:

$ gdb 8.out
(gdb) b runtime.morestack
Breakpoint 1 at 0x80494de: file /home/jra/go/src/pkg/runtime/386/asm.s, line 150.
(gdb) r
Starting program: /home/jra/go-stuff/8.out

Breakpoint 1, runtime.morestack ()
    at /home/jra/go/src/pkg/runtime/386/asm.s:150
150		get_tls(CX)
Current language:  auto; currently asm
(gdb) bt
#0  runtime.morestack () at /home/jra/go/src/pkg/runtime/386/asm.s:150
#1  0x08048c17 in main.main () at /home/jra/go-stuff/stack.go:3
#2  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#3  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#4  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
... same thing for lots and lots of pages ...
#1014 0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#1015 0x08049473 in runtime.mainstart ()
    at /home/jra/go/src/pkg/runtime/386/asm.s:85
#1016 0x0804e5cd in runtime.initdone ()
    at /home/jra/go/src/pkg/runtime/proc.c:145

Interesting stuff happens before and after this, so let’s try to sort through it. First, why did morestack get called? Because we are out of stack space. But how did we know? Ian’s explanation of split stacks lays this out pretty well: on the entry into every subroutine, we need to check how much space is left, and if it’s “not enough”, get a new stack to run on.

When you take a look at the assembly listing for main as it leaves 8g, there’s nothing to see:

$ 8g -S stack.go | head -10
--- prog list "main" ---
0000 (stack.go:3) TEXT    main+0(SB),$0-0
0001 (stack.go:4) CALL    ,main+0(SB)
0002 (stack.go:5) RET     ,

But when you take a look at the assembly listing for main generated by 8l, you see something else:

$ 8l -a stack.8 | head -8
codeblk [0x8048c00,0x805332a) at offset 0xc00
8048c00	main.main            | (3)	TEXT	main.main+0(SB),$0
8048c00	658b0d00000000       | (3)	MOVL	(GS),CX
8048c07	8b49f8               | (3)	MOVL	-8(CX),CX
8048c0a	3b21                 | (3)	CMPL	SP,(CX)
8048c0c	7709                 | (3)	JHI	,8048c17
8048c0e	31d2                 | (3)	MOVL	$0,DX
8048c10	31c0                 | (3)	MOVL	$0,AX
8048c12	e8c7080000           | (3)	CALL	,80494de+runtime.morestack
8048c17	e8e4ffffff           | (4)	CALL	,8048c00+main.main
8048c1c	c3                   | (5)	RET	,

So already we’ve found something interesting, which is that part of split stacks is implemented in the linker, not in the compiler. I don’t know why this is, but it’s probably another nice example of Broken Abstractions in Go.

And what’s that code doing? It says, “find the current goroutine’s struct G, and compare the current stack pointer to g.stackguard”. In the Go runtime (Linux, x86), the GS segment register points to a segment descriptor table entry which was set using modify_ldt(2). The segment descriptor arranges that references like 0(GS) refer to the base of the struct Gobuf, which in turn has a pointer to the current struct G for the goroutine. The comparison is checking to see if we are about to run off the end of the stack. But what are the actual contents of these things? Let’s have a look, with a newly started copy of the program:

$ gdb 8.out
(gdb) b main.main
Breakpoint 1 at 0x8048c00: file /home/jra/go-stuff/stack.go, line 3.
(gdb) r
Starting program: /home/jra/go-stuff/8.out

Breakpoint 1, main.main () at /home/jra/go-stuff/stack.go:3
3	func main() {
Current language:  auto; currently minimal
(gdb) disas
Dump of assembler code for function main.main:
0x08048c00 :	mov    %gs:0x0,%ecx
0x08048c07 :	mov    -0x8(%ecx),%ecx
0x08048c0a :	cmp    (%ecx),%esp
0x08048c0c :	ja     0x8048c17
0x08048c0e :	xor    %edx,%edx
0x08048c10 :	xor    %eax,%eax
0x08048c12 :	call   0x80494de
0x08048c17 :	call   0x8048c00
0x08048c1c :	ret
End of assembler dump.
(gdb) si
0x08048c07	3	func main() {
(gdb) si
0x08048c0a	3	func main() {
(gdb) info reg
eax            0x0	0
ecx            0xb7dc3000	-1210306560
edx            0x1	1
ebx            0x806a55c	134653276
esp            0xb7dc50d4	0xb7dc50d4
ebp            0xbf8feedc	0xbf8feedc
esi            0xb7dc3060	-1210306464
edi            0xb7dd2100	-1210244864
eip            0x8048c0a	0x8048c0a
eflags         0x200296	[ PF AF SF IF ID ]
cs             0x73	115
ss             0x7b	123
ds             0x7b	123
es             0x7b	123
fs             0x0	0
gs             0x3f	63
(gdb) x 0xb7dc3000              <-- find out what (cx) is (i.e. stack guard)
0xb7dc3000:	0xb7dc4100

Voila! This time through, we are not going to end up in runtime.morestack because the current stack pointer (0xb7dc50d4) is greater than the stack guard (0xb7dc4100).

And how far above the end of the stack is the stack guard? This is explained in a giant comment in runtime.h, at least giant for the Go team’s standards, which is “less is better, even for comments”. :) We can take a look at it for ourselves as well. According to the definition of struct G, the stack base is the next pointer down from the stackguard:

(gdb) x 0xb7dc3004               <-- cx + 4
0xb7dc3004:	0xb7dc50dc   <-- the stack base

Now, stack base in this context is “the place where the stack starts growing down from”. So we have the following interesting places:

0xb7dc50dc: stack base
0xb7dc50d4: current stack pointer
0xb7dc4100: stack guard
0xb7dc4000: end of the stack (see the constant StackGuard defined in runtime.h)

If we fast forward a few thousand iterations of main, the current stack pointer is going to move down, one return address at a time, until the current stack pointer is less than the stack guard. At that point, we know we need more stack. We still have 128 bytes to work with, but some routines in the Go runtime can’t or don’t want to check for stack underflow, so we set the stack guard high enough to leave room for them to run. On Windows, the user-space exception handler could always get called, so the stack guard is much higher (2048 bytes) to leave room for it to work, since we don’t get a chance to hook it and teach it about split stacks.

As an aside, when you are reading the sources and you see “#pragma textflag 7” or when you see an assembly routine that starts like “TEXT runtimeÂ·exit(SB),7,$0”, it’s a routine that will not be checking for possible stack underflow. So it, and any sub-calls, had better require a maximum of 128 bytes of stack. This “7 flag” is a note from the compiler (via the assembler) to the linker asking it to refrain from adding the stack underflow check.

To take a look at the next step of the process, we need to turn our attention to morestack, which is found in runtime/386/asm.s. Things get complicated here, but there’s two steps that are important for our investigation: first, morestack puts some information away in the current struct M so that it can use it later including the size of the current stack frame and argument list of the current function call and the return address of the caller of the function that just got pre-empted. It is going to need this information to link the new stack to the old one.

Note that morestack is in runtime/386/asm.s. It is processor-dependent, because it is grabbing stuff off the stack in a processor-dependent way. But it is not OS-dependent, which is interesting. On the other hand, the code that sets the segment registers (mentioned above) is OS dependent, because changing the global descriptor table requires ring-0 privileges.

When morestack has saved away what it needs, it changes stacks to the scheduler’s stack, and then transfers control to runtime.newstack. Allocating a new stack is a sizeable chunk of work, because stacks are allocated using exactly the same mechanism as any other piece of memory allocated by the Go runtime. So we need to make sure we have a full stack available for runtime.mallocgc to run on. Note that morestack leaves via a CALL, but kills itself if it returns, because it knows that newstack is going to retun in a funky way (keep reading…). We’ll have to remember to take a look at newstack to find out how the return address pushed by CALL gets cleaned up.

Newstack has some complexity related to reflection, so before we start let’s just agree to ignore that ok? We’ll come back to reflection low-level details some other day. Once we ignore the reflection stuff, newstack ends up relatively simple: allocate a new range of memory, decorate the top of it with something called a “struct Stktop”, then exit in a funky way.

The allocation step is fairly straightforward: calculate the size of the new stack, then . The size calculation is a bit interesting: the new stack will be the maximum of StackBig or of the amount of space needed by this function. StackBig is 4096 (8192 on Windows, see the comments for why). So if you’ve got one function in a call-chain that wants more stack space than the stack currently has, and even more than the stack is normally extended, it will get a stack of it’s own, and then when it calls another function, it may cause a second trip through morestack. Some magic happens inside of stackalloc to decide if it is currently safe to allocate a new stack with mallocgc, or if the new stack should come from a fixed-size allocator instead. Also, since the most commonly used stack is StackBig + StackExtra bytes long, there’s a shurtcut to avoid the heavier mallocgc route for that as well. So while the shorthand explanation of split stacks is that it is a technique to make stacks as easy to deal with as garbage collected objects, it’s not true that all stacks come directly from and go back into, the garbage collected pool.

Once the new stack is allocated, newstack “decorates” the top of it. It fills in a struct Stktop with information that will be used later when we are cleaning up the stack and transferring stacks from this one back to the last one. Next, newstack moves the frame and the arguments from the call that was preempted onto the new stack. With that we are ready to put the new stack to use.

This is where the funky exit comes. newstack uses gogocall, which is basically how the Go scheduler implements context switches. It switches onto the new stack and then arranges that the next RET that we hit (i.e. the RET at the end of the function that was preempted) will result in a jump to a certain function, called runtime.lessstack. gogocall exits with a JMP to the preempted function, so as not to mess up the stack it’s just put in order, and the preempted function starts running again on it’s new stack!

lessstack, as the name implies, is the counterpart to morestack. It runs in the context of the old stack, but its job is to save things and prepare for the call to oldstack (the counterpart to newstack). It needs to save the return value, then switch onto the scheduling stack, for the same reason that morestack needed to. oldstack, as the counterpart to newstack does what you’d expect: it gets us off of the current stack (freeing it) and starts executing on the old stack. We need to carry two pieces of state from the returning function, the return value (in AX) and the “return arguments”, i.e. the results of the computation that was left on the stack. I haven’t looked into the Go call semantics enough to understand the difference between AX and the stuff returned on the stack, but regardless of wether I understand it or not, oldstack saves it. Then oldstack pulls a trick like newstack, exiting via gogo, which moves to the old stack by setting SP and PC to the values that morestack staved from the caller way back when.

Open question: if morestack is called twice before oldstack is called, why isn’t the original caller’s SP/PC stomped?

So, now you’ve been all the way down the rabbit hole with me. Did you have fun? Got questions, or answers to my open question?