A simpler way to embed data

In my post about how to efficiently put data into a Go binary, I mentioned that strings are immutable, and can be accessed without causing the Go runtime to copy them. This turns out to be the key to a simpler way to achieve what I wanted to do.

By simpler I mean “no cgo”. That’s a nice simplification, because until recently your final static binary image linked to the cgo code dynamically, and that made my technique impossible to use in the Tiny runtime, where there is no dynamic linker. cgo has changed recently, but in the meantime I’ve discovered how to use native strings to do what I want, so let’s see how it works.

I shied away from strings at first because I understood them to be “Unicode strings”, and thus unable to hold arbitrary bytes (i.e. bytes that do not decode to a valid Unicode rune). That’s not true at all. In Go, a string is effectively an immutable sequence of bytes, something like a read-only [...]byte. True, many of the functions that operate on strings expect the contents to be valid UTF-8, and might malfunction if you give them random bytes. But there’s nothing to keep you from putting invalid UTF-8 in, then never using the functions that expect valid UTF-8.
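Here is a minimal sketch of that claim, not taken from the original program: the constant name raw and its contents are made up for illustration. It stores four bytes that are not valid UTF-8 in a string and reads them back untouched.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// raw is a hypothetical payload: 0xff and 0xfe can never appear in valid UTF-8.
const raw = "\xff\xfe\x00\x01"

func main() {
	fmt.Println(len(raw))              // 4: len counts bytes, not runes
	fmt.Println(utf8.ValidString(raw)) // false: the contents are not valid UTF-8
	fmt.Printf("%#x\n", raw[0])        // 0xff: indexing yields the raw byte
}
```

The string holds the bytes exactly as written. Only code that tries to decode them as UTF-8 (ranging over the string by rune, for example) would notice anything odd.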

So, that’s the plan. We put our individual bytes into the string, then we do brain surgery to turn them into a []byte:

package main

import (
        "reflect"
        "unsafe"
)

var empty [0]byte
var str1 string = "the string"
var str2 string = "another string"

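func fix(s string) (b []byte) {
        sx := (*reflect.StringHeader)(unsafe.Pointer(&s)) // view s as its (pointer, length) header
        b = empty[:]                                      // start from a valid zero-length slice
        bx := (*reflect.SliceHeader)(unsafe.Pointer(&b))  // view b as its (pointer, len, cap) header
        bx.Data = sx.Data                                 // point b at the string's bytes; nothing is copied
        bx.Len = len(s)
        bx.Cap = len(s)
        return
}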

func main() {
        b := fix(str1)
        println(b[0])
        b = fix(str2)
        println(b[0])
        b[0] = 'x'              // crash: write to ro segment
}

When you read the assembly of that program, there’s not a memcpy to be seen. The []byte you get points directly at the original bytes. You could also see that by comparing the data pointer in the string’s header with the address of b[0] (Go won’t let you write &str1[0] directly) and seeing that they are the same byte in memory.
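That address check can be sketched as follows. The fix function is the same header trick as in the program above; sameBytes is a helper invented for this example, and the append call is only there to produce an ordinary copy for contrast.

```go
package main

import (
	"reflect"
	"unsafe"
)

var str1 = "the string"

// fix points a []byte at the string's bytes without copying them,
// as in the program above.
func fix(s string) (b []byte) {
	sx := (*reflect.StringHeader)(unsafe.Pointer(&s))
	bx := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	bx.Data = sx.Data
	bx.Len = len(s)
	bx.Cap = len(s)
	return
}

// sameBytes (hypothetical helper) reports whether b starts at the same
// address as the data of s.
func sameBytes(s string, b []byte) bool {
	sx := (*reflect.StringHeader)(unsafe.Pointer(&s))
	return len(b) > 0 && sx.Data == uintptr(unsafe.Pointer(&b[0]))
}

func main() {
	println(sameBytes(str1, fix(str1)))                     // true: shared storage
	println(sameBytes(str1, append([]byte(nil), str1...))) // false: append made a copy
}
```

Newer Go releases also provide unsafe.StringData and unsafe.Slice, which can express the same conversion without going through the reflect header types.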

The last line of main shows why Go goes to so much trouble to prevent me from doing this: the memory now underlying my []byte is read-only. At link time, the linker put it into a read-only segment, and when I write to it, I get this (the equivalent of a segfault in Go):

unexpected fault address 0x80640f8
throw: fault

panic PC=0xf765b048
runtime.throw+0x3e /home/jeffall/go/src/pkg/runtime/runtime.c:73
	runtime.throw(0x80a3916, 0x80640f8)
runtime.sigpanic+0xc7 /home/jeffall/go/src/pkg/runtime/linux/thread.c:288
	runtime.sigpanic()
main.main+0xd8 /home/jeffall/go-stuff/str.go:27
	main.main()
runtime.mainstart+0xf /home/jeffall/go/src/pkg/runtime/386/asm.s:84
	runtime.mainstart()
runtime.goexit /home/jeffall/go/src/pkg/runtime/proc.c:148
	runtime.goexit()

Working on this has made me ask myself a few times, “why am I so intent on turning read-only memory into a []byte, thereby corrupting Go’s type safety?” I’m still grappling with that; stay tuned. (One reason is that this whole idea came from working in the Tiny Go environment, where there’s currently almost no memory protection offered anyway. But that’s a dumb reason; if the non-existent OS can’t save you from yourself, you certainly should NOT stop the compiler from saving you!) Maybe there’s a third version coming which manages to keep it type safe and still do what I want. I suspect it’s going to have something to do with changing the interface of my filesystem object to keep the string itself internal, and only expose a method that returns an io.Reader.
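A rough sketch of what that might look like, with everything in it hypothetical: the fileSystem type, its files map, and the Open method are names invented here, not the real interface of my filesystem object.

```go
package main

import (
	"io"
	"os"
	"strings"
)

// fileSystem is a hypothetical stand-in for the filesystem object. It keeps
// each file's contents as a string and never hands the string out.
type fileSystem struct {
	files map[string]string
}

// Open returns a reader over the named file, or false if there is no such file.
// strings.NewReader reads from the string in place; bytes are only copied
// into whatever buffer the caller passes to Read.
func (fs *fileSystem) Open(name string) (io.Reader, bool) {
	s, ok := fs.files[name]
	if !ok {
		return nil, false
	}
	return strings.NewReader(s), true
}

func main() {
	fs := &fileSystem{files: map[string]string{"hello.txt": "the string\n"}}
	if r, ok := fs.Open("hello.txt"); ok {
		io.Copy(os.Stdout, r) // the caller can read, but has nothing to write through
	}
}
```

The trade-off is that the caller copies bytes into its own buffer on every Read, but no []byte ever aliases the read-only segment, so there is no write that can fault.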