A trip down the (split) rabbithole

Note: This post is out of date, and will become increasingly out of date when Go’s new contiguous stacks are implemented. I’m leaving it here because it is still interesting, even if out of date.

Go uses split stacks (also called segmented stacks in the literature) in order to allow thousands of stacks in the space that would normally be taken by hundreds of C-style contiguous stacks. There’s a discussion of how to add split stacks to GCC here. Note: it was written by the author of gccgo, around the time he started porting Go to Gcc, so it’s clear that he’s adding split stacks to Gcc to help Go. But if it’s helpful to C programmers as well, then so much the better!

Ian’s article is dedicated to how he planned to add split stacks to GCC. I want to dig down to the lowest level and trace every single step of the process of maintaining split stacks in the native Go runtime. I hope you, dear reader, do too. Here we go…

First, we need a program that’s going to use up all it’s stack. Since the point of this article is not to learn about Go’s stack-based object allocation we’ll use up the stack with return addresses, specifically to main:

package main

func main() {
	main()
}

Now, don’t go running that just yet, or else you’ll have the out of memory killer on your back. Instead, go run that under gdb, setting a breakpoint on runtime.morestack:

$ gdb 8.out
(gdb) b runtime.morestack
Breakpoint 1 at 0x80494de: file /home/jra/go/src/pkg/runtime/386/asm.s, line 150.
(gdb) r
Starting program: /home/jra/go-stuff/8.out 

Breakpoint 1, runtime.morestack ()
    at /home/jra/go/src/pkg/runtime/386/asm.s:150
150		get_tls(CX)
Current language:  auto; currently asm
(gdb) bt
#0  runtime.morestack () at /home/jra/go/src/pkg/runtime/386/asm.s:150
#1  0x08048c17 in main.main () at /home/jra/go-stuff/stack.go:3
#2  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#3  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#4  0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
... same thing for lots and lots of pages ...
#1014 0x08048c1c in main.main () at /home/jra/go-stuff/stack.go:4
#1015 0x08049473 in runtime.mainstart ()
    at /home/jra/go/src/pkg/runtime/386/asm.s:85
#1016 0x0804e5cd in runtime.initdone ()
    at /home/jra/go/src/pkg/runtime/proc.c:145

Interesting stuff happens before and after this, so let’s try to sort through it. First, why did morestack get called? Because we are out of stack space. But how did we know? Ian’s explanation of split stacks lays this out pretty well: on the entry into every subroutine, we need to check how much space is left, and if it’s “not enough”, get a new stack to run on.

When you take a look at the assembly listing for main as it leaves 8g, there’s nothing to see:

$ 8g -S stack.go | head -10
--- prog list "main" ---
0000 (stack.go:3) TEXT    main+0(SB),$0-0
0001 (stack.go:4) CALL    ,main+0(SB)
0002 (stack.go:5) RET     ,

But when you take a look at the assembly listing for main generated by 8l, you see something else:

$ 8l -a stack.8 | head -8
codeblk [0x8048c00,0x805332a) at offset 0xc00
8048c00	main.main            | (3)	TEXT	main.main+0(SB),$0
8048c00	658b0d00000000       | (3)	MOVL	(GS),CX
8048c07	8b49f8               | (3)	MOVL	-8(CX),CX
8048c0a	3b21                 | (3)	CMPL	SP,(CX)
8048c0c	7709                 | (3)	JHI	,8048c17
8048c0e	31d2                 | (3)	MOVL	$0,DX
8048c10	31c0                 | (3)	MOVL	$0,AX
8048c12	e8c7080000           | (3)	CALL	,80494de+runtime.morestack
8048c17	e8e4ffffff           | (4)	CALL	,8048c00+main.main
8048c1c	c3                   | (5)	RET	,

So already we’ve found something interesting, which is that part of split stacks is implemented in the linker, not in the compiler. I don’t know why this is, but it’s probably another nice example of Broken Abstractions in Go.

And what’s that code doing? It says, “find the current goroutine’s struct G, and compare the current stack pointer to g.stackguard”. In the Go runtime (Linux, x86), the GS segment register points to a segment descriptor table entry which was set using modify_ldt(2). The segment descriptor arranges that references like 0(GS) refer to the base of the struct Gobuf, which in turn has a pointer to the current struct G for the goroutine. The comparison is checking to see if we are about to run off the end of the stack. But what are the actual contents of these things? Let’s have a look, with a newly started copy of the program:

$ gdb 8.out
(gdb) b main.main
Breakpoint 1 at 0x8048c00: file /home/jra/go-stuff/stack.go, line 3.
(gdb) r
Starting program: /home/jra/go-stuff/8.out 

Breakpoint 1, main.main () at /home/jra/go-stuff/stack.go:3
3	func main() {
Current language:  auto; currently minimal
(gdb) disas
Dump of assembler code for function main.main:
0x08048c00 :	mov    %gs:0x0,%ecx
0x08048c07 :	mov    -0x8(%ecx),%ecx
0x08048c0a :	cmp    (%ecx),%esp
0x08048c0c :	ja     0x8048c17
0x08048c0e :	xor    %edx,%edx
0x08048c10 :	xor    %eax,%eax
0x08048c12 :	call   0x80494de
0x08048c17 :	call   0x8048c00
0x08048c1c :	ret
End of assembler dump.
(gdb) si
0x08048c07	3	func main() {
(gdb) si
0x08048c0a	3	func main() {
(gdb) info reg
eax            0x0	0
ecx            0xb7dc3000	-1210306560
edx            0x1	1
ebx            0x806a55c	134653276
esp            0xb7dc50d4	0xb7dc50d4
ebp            0xbf8feedc	0xbf8feedc
esi            0xb7dc3060	-1210306464
edi            0xb7dd2100	-1210244864
eip            0x8048c0a	0x8048c0a
eflags         0x200296	[ PF AF SF IF ID ]
cs             0x73	115
ss             0x7b	123
ds             0x7b	123
es             0x7b	123
fs             0x0	0
gs             0x3f	63
(gdb) x 0xb7dc3000              <-- find out what (cx) is (i.e. stack guard)
0xb7dc3000:	0xb7dc4100

Voila! This time through, we are not going to end up in runtime.morestack because the current stack pointer (0xb7dc50d4) is greater than the stack guard (0xb7dc4100).

And how far above the end of the stack is the stack guard? This is explained in a giant comment in runtime.h, at least giant for the Go team’s standards, which is “less is better, even for comments”. 🙂 We can take a look at it for ourselves as well. According to the definition of struct G, the stack base is the next pointer down from the stackguard:

(gdb) x 0xb7dc3004               <-- cx + 4
0xb7dc3004:	0xb7dc50dc   <-- the stack base

Now, stack base in this context is “the place where the stack starts growing down from”. So we have the following interesting places:

  • 0xb7dc50dc: stack base
  • 0xb7dc50d4: current stack pointer
  • 0xb7dc4100: stack guard
  • 0xb7dc4000: end of the stack (see the constant StackGuard defined in runtime.h)

If we fast forward a few thousand iterations of main, the current stack pointer is going to move down, one return address at a time, until the current stack pointer is less than the stack guard. At that point, we know we need more stack. We still have 128 bytes to work with, but some routines in the Go runtime can’t or don’t want to check for stack underflow, so we set the stack guard high enough to leave room for them to run. On Windows, the user-space exception handler could always get called, so the stack guard is much higher (2048 bytes) to leave room for it to work, since we don’t get a chance to hook it and teach it about split stacks.

As an aside, when you are reading the sources and you see “#pragma textflag 7” or when you see an assembly routine that starts like “TEXT runtime·exit(SB),7,$0”, it’s a routine that will not be checking for possible stack underflow. So it, and any sub-calls, had better require a maximum of 128 bytes of stack. This “7 flag” is a note from the compiler (via the assembler) to the linker asking it to refrain from adding the stack underflow check.

To take a look at the next step of the process, we need to turn our attention to morestack, which is found in runtime/386/asm.s. Things get complicated here, but there’s two steps that are important for our investigation: first, morestack puts some information away in the current struct M so that it can use it later including the size of the current stack frame and argument list of the current function call and the return address of the caller of the function that just got pre-empted. It is going to need this information to link the new stack to the old one.

Note that morestack is in runtime/386/asm.s. It is processor-dependent, because it is grabbing stuff off the stack in a processor-dependent way. But it is not OS-dependent, which is interesting. On the other hand, the code that sets the segment registers (mentioned above) is OS dependent, because changing the global descriptor table requires ring-0 privileges.

When morestack has saved away what it needs, it changes stacks to the scheduler’s stack, and then transfers control to runtime.newstack. Allocating a new stack is a sizeable chunk of work, because stacks are allocated using exactly the same mechanism as any other piece of memory allocated by the Go runtime. So we need to make sure we have a full stack available for runtime.mallocgc to run on. Note that morestack leaves via a CALL, but kills itself if it returns, because it knows that newstack is going to retun in a funky way (keep reading…). We’ll have to remember to take a look at newstack to find out how the return address pushed by CALL gets cleaned up.

Newstack has some complexity related to reflection, so before we start let’s just agree to ignore that ok? We’ll come back to reflection low-level details some other day. Once we ignore the reflection stuff, newstack ends up relatively simple: allocate a new range of memory, decorate the top of it with something called a “struct Stktop”, then exit in a funky way.

The allocation step is fairly straightforward: calculate the size of the new stack, then . The size calculation is a bit interesting: the new stack will be the maximum of StackBig or of the amount of space needed by this function. StackBig is 4096 (8192 on Windows, see the comments for why). So if you’ve got one function in a call-chain that wants more stack space than the stack currently has, and even more than the stack is normally extended, it will get a stack of it’s own, and then when it calls another function, it may cause a second trip through morestack. Some magic happens inside of stackalloc to decide if it is currently safe to allocate a new stack with mallocgc, or if the new stack should come from a fixed-size allocator instead. Also, since the most commonly used stack is StackBig + StackExtra bytes long, there’s a shurtcut to avoid the heavier mallocgc route for that as well. So while the shorthand explanation of split stacks is that it is a technique to make stacks as easy to deal with as garbage collected objects, it’s not true that all stacks come directly from and go back into, the garbage collected pool.

Once the new stack is allocated, newstack “decorates” the top of it. It fills in a struct Stktop with information that will be used later when we are cleaning up the stack and transferring stacks from this one back to the last one. Next, newstack moves the frame and the arguments from the call that was preempted onto the new stack. With that we are ready to put the new stack to use.

This is where the funky exit comes. newstack uses gogocall, which is basically how the Go scheduler implements context switches. It switches onto the new stack and then arranges that the next RET that we hit (i.e. the RET at the end of the function that was preempted) will result in a jump to a certain function, called runtime.lessstack. gogocall exits with a JMP to the preempted function, so as not to mess up the stack it’s just put in order, and the preempted function starts running again on it’s new stack!

lessstack, as the name implies, is the counterpart to morestack. It runs in the context of the old stack, but its job is to save things and prepare for the call to oldstack (the counterpart to newstack). It needs to save the return value, then switch onto the scheduling stack, for the same reason that morestack needed to. oldstack, as the counterpart to newstack does what you’d expect: it gets us off of the current stack (freeing it) and starts executing on the old stack. We need to carry two pieces of state from the returning function, the return value (in AX) and the “return arguments”, i.e. the results of the computation that was left on the stack. I haven’t looked into the Go call semantics enough to understand the difference between AX and the stuff returned on the stack, but regardless of wether I understand it or not, oldstack saves it. Then oldstack pulls a trick like newstack, exiting via gogo, which moves to the old stack by setting SP and PC to the values that morestack staved from the caller way back when.

Open question: if morestack is called twice before oldstack is called, why isn’t the original caller’s SP/PC stomped?

So, now you’ve been all the way down the rabbit hole with me. Did you have fun? Got questions, or answers to my open question?

How to control your HTTP transactions in Go

The Go http pacakge has http.Get and http.Post, which make it easy to do GET and POST operations. They are meant for client use. They implement things from the point of view of a naïve client, one that just wants to give a URL and get back the results. They don’t want to chase redirects, they don’t want to set their headers specially, they just want to, in one line, get the results.

When you are writing an HTTP proxy server, you can’t be so naïve. For example, the HTTP client in a proxy server must not do redirect chasing. Instead, it needs to get the redirect status code and Location header, and then send it back down to the proxy’s client. If it wants to chase the redirect, fine.

If you read the guts of http.Get and http.Post, they prepare http.Request structures with the right stuff inside, then pass them to an internal function called send. So you might think you should do something like this as well, for instance writing something like what’s inside the http.send function yourself. However, this is the wrong way to go, there’s a better way.

The trick is to make an http.ClientConn, then feed the http.Request into it, and read the http.Response out of it. If you want to keep the connection open, you can (set req.Close to false), then you can do more HTTP requests on the same connection later. The ServeHTTP function in my proxy demonstrates how to do this, in the case where you close the connection. Creating a connection pool is on the to-do list, though my first guess on how to do it is to let a single goroutine act as the warehouse of open connections, checking them out and back in to the central cache in response to messages on a channel.

An HTTP proxy needs to copy through the headers from the proxy client to the origin server. However, some headers, like those related to the transfer format and connection closure should not be copied through. This is sort of a weakness in the HTTP protocol, that it confuses transport and content like this, but it is what it is and we’ve got to deal with it.

One thing that gave me a bit of trouble was oreq.RawURL. I originally set it to creq.RawURL, then used http.ParseURL(oreq.RawURL) to fill in oreq.URL (with an error check, of course). The problem is that http.Request.Write() will prefer the contents of RawURL to URL. That means that you send the full URL to the origin server, instead of only the path part. With some servers, they don’t care. With others, it results in wildly incorrect redirects, for instance fetching “http://blog.nella.org” (no trailing slash: should result in a redirect to /) gives a Location of “blog.nella.org/”, which in turn results in your browser “http://blog.nella.org/blog.nella.org/”. Redundantly, repetitively false and incorrect.

I also wondered if it even makes sense to re-parse the URL

With this addition to the proxy code mentioned yesterday, you’ve got a basic HTTP proxy that I’ve used to surf many different static and dynamic sites. It seems to work, at least well enough as a base to experiment with. Next stop, the cachability check, and adding the right code to ServeHTTP to redirect cachable requests to the object on disk instead of getting it from the origin server.

A rate-limiting HTTP proxy in Go

Hello Go-fans. Missed a week due to a nice little ski vacation, but I promise I was dreaming of Go while riding the ski lifts, so I’ve got something interesting to share with you this week.

I’ve worked in Africa and Indonesia in the past. There, I saw first-hand the possibilities of the Internet, but also the difficulties of using it in remote areas, over slow links, etc. I came up with ideas years ago for what I’d like in a proxy in the field that would make the limited bandwidth on a satellite connection go further. I’ve never had a chance to implement it until now, because hacking on Squid or other proxies just involved too much C hacking and core dumping to suit my patience, especially for a prototype to test ideas on. But Go, on the other hand; it’s just made for playing with this kind of thing!

So today what I have to show you is an HTTP proxy that simulates what happens when all the requests are squeezed through a tiny little channel. It does not simulate things down to the TCP level (i.e. congestion detection and backoff); nor does it simulate the long round-trip times that make satellite connections difficult to optimize. But it simulates low bandwidth, and when I later add features related to semi-offline use, it will help me show the benefits of them better, I think.

The HTTP package in Go seems to be a just a web client and a web server. The tutorials show how to use it for both. Making a server is easy: you call http.ListenAndServe(“:1234”, nil) and you end up with a web server on port 1234. But did you ever wonder what that extra argument is for? The nil at the end? It’s the secret to how to make a proxy, and it’s the first stop on our journey.

http.Handler

The second argument is an http.Handler. When you set it to nil, the http package provides a default, called http.ServeMux. It is designed to let you register callbacks for various sub-URLs inside one server. There’s also the http.FileServer handler, which finds files on disk matching the URL, and serves them up.

The trick to writing an HTTP proxy with the Go http package is to make your own http.Handler. Here’s something like what my first shot at it was:

type Proxy struct {
}

func NewProxy() *Proxy { return &Proxy{} }

func (p *Proxy) ServeHTTP(wr http.ResponseWriter, r *http.Request) {
  var resp *http.Response
  var err os.Error

  switch r.Method {
  default: {
    log.Print("Cannot handle method ", r.Method)
    http.Error(wr, "501 I only handle GET and POST", http.StatusNotImplemented)
    return
  }
  case "GET": {
    log.Printf("getting %v", r.RawURL)
    resp, _, err = http.Get(r.RawURL)
  }
  case "POST": {
    resp, err = http.Post(r.RawURL, r.Header["Content-Type"], r.Body)
    r.Body.Close()
  }
  }

  // combined for GET/POST
  if err != nil {
    http.Error(wr, err.String(), http.StatusInternalServerError)
    loghit(r, http.StatusInternalServerError, false)
    return
  }
  wr.SetHeader("Content-Type", resp.Header["Content-Type"])
  wr.WriteHeader(resp.StatusCode)

  io.Copy(wr, resp.Body)

  resp.Body.Close()
  loghit(r, resp.StatusCode, false)
}

And I use it like this:

func main() {
  proxy := NewProxy()
  err := http.ListenAndServe(":12345", proxy)
  if err != nil {
    log.Exit("ListenAndServe: ", err.String())
  }
}

Note that my new type, Proxy, doesn’t have anything in it. Though it might pick up more stuff later, for now, it’s job is to have the ServeHTTP method attached to it, so that *Proxy can satisfy the http.Handler interface. The real proxy behavior happens in ServeHTTP, where we turn incoming GETs and POSTs into outgoing GETs and POSTs, then arrange that the answers from the origin server go back to the requester. The key line that does that is io.Copy, where we copy the bytes arriving on the response from the origin server onto the request that came from the HTTP client (i.e. the web browser configured with localhost:12345 as it’s proxy server).

Now, it doesn’t take a rocket scientist to see that we can implement a caching proxy by looking at the request arriving in the http.Request, and either take something off of disk or get something from the network to fill in the http.ResponseWriter. That’s one direction we can expand this proxy, and I hope to do that in coming weeks, to implement a proxy that can use packages of offline content to dramatically reduce the amount of network traffic.

It does, however, take a rocket scientist to get HTTP proxying and caching right. For caching, there are several headers that have to be parsed and consulted, including Expires, E-tag, Cache-control and If-modified-since. Even getting proxying right, without caching is not so easy. To get anything working at all, the code above had to copy across the Content-Type header. But by not copying headers from the client inbound to the proxy, and by not copying all the headers back from the origin server, we are breaking cookies and HTTP redirect.

In fact, this issue of broken redirects lead me on a big side-trip where I learned more about the http package, and improved the proxy. I’ll write about that in a second post, so that we can keep going here to the next interesting thing.

You can see the source to the proxy here, but beware, ServeHTTP has morphed quite a bit from the super-simple GET/POST thing above. But you’ll have to read the next article to find out why.

linkio

The next thing I have to show you, the reader, is something called linkio. This is a package I wrote which wraps an io.Reader so that when you read from it, you get the bytes back at the same speed they would arrive if they were coming over a fixed capacity link to you, sharing bandwidth with all the other traffic on the link.

linkio shows two interesting Go design patterns. First, it is a wrapper for the io.Reader interface. The constructor takes an io.Reader as input, then stores it away, along with some other information. The object returned by the constructor satisfies io.Reader, so it can be plugged in exactly where the previous io.Reader was. Take a look at the code that needs to change to make use of linkio:

import (
  ...
  "jra-go.googlecode.com/hg/linkio"
)

var gLink *linkio.Link

func init() {
  gLink = linkio.NewLink(56 /* kbps */)
}

...

// then in ServeHttp, change this:
  io.Copy(wr, resp.Body)
// to this:
  io.Copy(wr, gLink.NewLinkReader(resp.Body))

Note how the real change in the program is one line! That’s because all the timing stuff, and the way to multiplex all the transactions onto a single link is put inside of linkio, and the goroutine launched inside the NewLink constructor.

The second interesting thing is that linkio uses a buffered channel to simulate the link itself. Each LinkReader wants to release it’s data in dribs and drabs to the reader, according to the current shared state of the link. The defining characteristic of a link is that it can send data for only one stream at a time (when we zoom in the the level of individual packets, anyway). So we simulate it with a goroutine that loops on the input channel (representing the interleaved flow of packets) and sleeps according to how long each packet would occupy the link in real life. If the link is unoccupied, then the loop gets blocked on the channel read. Once a simulated packet arrives (called a linkRequest in the code), the loop then becomes blocked on the sleep. When the sleep is done, the link goroutine needs to tell the stream that requested that this packet be simulated that the sleep is done. It does that by sending a bool on the channel that arrived with the simulated packet. (This pattern of sending a channel with a request to another goroutine is the Go idiom for asynchronous notification. A very useful trick: learn it!)

This scheme only works if the LinkReaders let loose their data a little at a time. If one sent in a simulated packet that was 10 megabytes, it would ruin the effect. That one “packet” would hog the link to the exclusion of all others, making that one stream perceive a bit rate of 100% of the link, instead of it’s equal share. The place where the LinkReaders enforce this fairness is in func (* linkio.LinkReader)Read. The prototype is Read([]byte) n int, os.Error err. When len(buf) is 1000, there’s no requirement to actually return 1000 bytes. You can return any number of bytes less than or equal to cap(buf), using n to tell the caller how many of the bytes in buf are valid, new bytes that they should read.

So, there you have it. A nice little example of a multi-threaded, rate-limiting HTTP proxy in Go. Stay tuned to the next article, where I explain why http.Get and http.Post are not really what you want in an HTTP proxy, and how to get what you DO want from the http package.