Go occupies a middle ground between C and Python: it is compiled and fast, yet garbage collected and reasonably high level. This has made it attractive to programmers looking for a fast but fairly high-level language. However, there is no free lunch. Go’s abstractions, especially those around allocation, come with a cost. This article will show ways to measure and reduce that cost.
Measuring
On posts about performance on StackOverflow and similar forums, you will often hear some variation of the Knuth quote on optimization: “premature optimization is the root of all evil.” This simplification leaves out some important context present in the original quote:
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
Knuth’s point is not that optimization is evil, but that we should skip the small efficiencies that hurt readability for negligible gain (the 97%) and instead concentrate on the critical 3% of the code, where the real improvements live. And how do we identify that critical code? With profiling tools. Fortunately, Go has better profiling tools than most languages. For instance, let’s say we have the following (somewhat contrived) code:
package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
    "math/rand"
    "strconv"
    "strings"
)

func foo(n int) string {
    var buf bytes.Buffer
    for i := 0; i < 100000; i++ {
        buf.WriteString(strconv.Itoa(n))
    }
    sum := sha256.Sum256(buf.Bytes())
    var b []byte
    for i := 0; i < int(sum[0]); i++ {
        x := sum[(i*7+1)%len(sum)] ^ sum[(i*5+3)%len(sum)]
        c := strings.Repeat("abcdefghijklmnopqrstuvwxyz", 10)[x]
        b = append(b, c)
    }
    return string(b)
}

func main() {
    // ensure function output is accurate
    if foo(12345) == "aajmtxaattdzsxnukawxwhmfotnm" {
        fmt.Println("Test PASS")
    } else {
        fmt.Println("Test FAIL")
    }
    for i := 0; i < 100; i++ {
        foo(rand.Int())
    }
}
To add CPU profiling, we just need to change it to
package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
    "math/rand"
    "os"
    "runtime/pprof"
    "strconv"
    "strings"
)

func foo(n int) string {
    var buf bytes.Buffer
    for i := 0; i < 100000; i++ {
        buf.WriteString(strconv.Itoa(n))
    }
    sum := sha256.Sum256(buf.Bytes())
    var b []byte
    for i := 0; i < int(sum[0]); i++ {
        x := sum[(i*7+1)%len(sum)] ^ sum[(i*5+3)%len(sum)]
        c := strings.Repeat("abcdefghijklmnopqrstuvwxyz", 10)[x]
        b = append(b, c)
    }
    return string(b)
}

func main() {
    cpufile, err := os.Create("cpu.pprof")
    if err != nil {
        panic(err)
    }
    err = pprof.StartCPUProfile(cpufile)
    if err != nil {
        panic(err)
    }
    defer cpufile.Close()
    defer pprof.StopCPUProfile()
    // ensure function output is accurate
    if foo(12345) == "aajmtxaattdzsxnukawxwhmfotnm" {
        fmt.Println("Test PASS")
    } else {
        fmt.Println("Test FAIL")
    }
    for i := 0; i < 100; i++ {
        foo(rand.Int())
    }
}
After compiling and running this program, the profile will be written to ./cpu.pprof. We can read this file using go tool pprof:
$ go tool pprof cpu.pprof
We are now in the pprof interactive tool. We can see what our program is spending most of its time on with top10 (top1, top2, top99, …, topn also work). top10 gives us:
Showing nodes accounting for 2010ms, 86.27% of 2330ms total
Dropped 48 nodes (cum <= 11.65ms)
Showing top 10 nodes out of 54
flat flat% sum% cum cum%
930ms 39.91% 39.91% 930ms 39.91% crypto/sha256.block
360ms 15.45% 55.36% 720ms 30.90% strconv.formatBits
180ms 7.73% 63.09% 390ms 16.74% runtime.mallocgc
170ms 7.30% 70.39% 170ms 7.30% runtime.memmove
100ms 4.29% 74.68% 100ms 4.29% runtime.memclrNoHeapPointers
80ms 3.43% 78.11% 340ms 14.59% bytes.(*Buffer).WriteString
60ms 2.58% 80.69% 60ms 2.58% runtime.nextFreeFast (inline)
50ms 2.15% 82.83% 360ms 15.45% runtime.slicebytetostring
40ms 1.72% 84.55% 2070ms 88.84% main.foo
40ms 1.72% 86.27% 760ms 32.62% strconv.FormatInt
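Within the same interactive session, the list command is also worth knowing: list foo annotates each source line of foo with its flat and cumulative time, which makes it easy to see exactly which statements are expensive (it needs access to the source files on disk):
(pprof) list foo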
[Note: “Allocations” in this article refers to heap allocations. Stack allocations are allocations too, but they are far cheaper and rarely matter for performance.]
It looks like we are spending a lot of time in sha256, in strconv, and in the runtime’s allocation and garbage-collection machinery (runtime.mallocgc and friends). Now we know what we need to improve. Since we are not doing any sort of complex computation (except maybe sha256), it seems likely that most of our performance problems are caused by heap allocations. We can measure allocations precisely by replacing
for i := 0; i < 100; i++ {
    foo(rand.Int())
}
with
fmt.Println("Allocs:", int(testing.AllocsPerRun(100, func() {
foo(rand.Int())
})))
and importing the testing package. Now, when we run the program, we should see output like:
Test PASS
Allocs: 100158
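An equivalent and arguably more idiomatic way to track this number over time is a Go benchmark with allocation reporting. The sketch below is ours, not part of the original program; save it as a _test.go file in the same package and run go test -bench=Foo -benchmem:
package main

import "testing"

// reports ns/op plus B/op and allocs/op for foo
func BenchmarkFoo(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        foo(12345)
    }
}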
A few months ago, my employer challenged me to get this number down to 0. That seemed very daunting at first, because 100k is a lot of allocations. In the rest of this article, I’ll walk through the thought process I followed while reducing the number of allocations in this code.
Pareto Principle
Observe that the number of allocations is approximately 100,000. We can recall from the code
for i := 0; i < 100000; i++ {
    buf.WriteString(strconv.Itoa(n))
}
that we write the integer argument n as a string to the buffer 100,000 times. The profile also indicated that strconv.formatBits was taking a lot of time. But the value of strconv.Itoa(n) does not change between iterations. We can simply calculate it once:
x := strconv.Itoa(n)
for i := 0; i < 100000; i++ {
    buf.WriteString(x)
}
This brings our allocations down significantly:
Test PASS
Allocs: 159
Finding opportunities like this should be a top priority when optimizing. We have gotten rid of most of the allocations, but 159 is still a lot for such a simple function. Also, the CPU profile shows we have only improved run time by about 50%. If we could eliminate the remaining allocations, we could probably get the run time under 1s (we started at ~2.3s and are currently at ~1.2s, at least on my system).
Gradual Improvement
We’ve made a lot of progress, but we still have some ways to go. Let’s continue to look at foo from top to bottom.
sum := sha256.Sum256(buf.Bytes())
If we look at the sha256 package, it would seem that Sum256 is some kind of shortcut function. But if we look at the implementation:
func Sum256(data []byte) [Size]byte {
    var d digest
    d.Reset()
    d.Write(data)
    return d.checkSum()
}
It seems we would have to do all of this anyway if we were to use the New function to create a hash object ourselves. If there were a way to reuse the digest object, that could be useful (spoiler: there is), but for now there is probably lower-hanging fruit available.
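As a preview of that spoiler: the hash.Hash returned by sha256.New can be Reset and reused, so one digest object can serve many calls. A minimal sketch, with names of our own choosing (and not safe for concurrent use as written):
package main

import (
    "crypto/sha256"
    "fmt"
    "hash"
)

// a single digest object, reused across calls
var d hash.Hash = sha256.New()

func hashOnce(data []byte) []byte {
    d.Reset()         // clear state left over from the previous call
    d.Write(data)     // hash.Hash's Write never returns an error
    return d.Sum(nil) // Sum appends to its argument; nil still allocates a fresh slice
}

func main() {
    fmt.Printf("%x\n", hashOnce([]byte("hello")))
}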
If we continue on to the loop,
var b []byte
for i := 0; i < int(sum[0]); i++ {
    x := sum[(i*7+1)%len(sum)] ^ sum[(i*5+3)%len(sum)]
    c := strings.Repeat("abcdefghijklmnopqrstuvwxyz", 10)[x]
    b = append(b, c)
}
we can see that allocations are probably occurring when b is appended to (each time its backing array must grow) and when strings.Repeat is called, since building the repeated string involves allocating and copying. The line assigning x is definitely not allocating: it merely reads individual bytes from the array and does some arithmetic on them. Let’s see if there’s anything we can do about the strings.Repeat. Well, we are repeating a constant, so we could just write out the repetition, like:
c := "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"[x]
But that feels hacky. What if there were a more fundamental approach? Let’s say x equals 53. Counting from one, the 53rd character of the repeated string is ‘a’ and the 54th is ‘b’. The 79th, the 105th, and in general the (1+26k)th are ‘a’ too. Notice the pattern? Since the alphabet has 26 characters, indexing with x mod 26 yields the same byte as indexing with x:
c := "abcdefghijklmnopqrstuvwxyz"[x%26]
Now we are down to 23 allocations.
Test PASS
Allocs: 23
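If you want to convince yourself that the two expressions always agree, a throwaway check (this snippet is ours, not part of foo; it assumes strings is imported) does the trick:
s := strings.Repeat("abcdefghijklmnopqrstuvwxyz", 10)
for x := 0; x < len(s); x++ {
    if s[x] != "abcdefghijklmnopqrstuvwxyz"[x%26] {
        panic("mismatch")
    }
}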
Let’s look back at the byte slice b. It starts out empty and is appended to repeatedly, and every time the backing array runs out of space, a new, larger one has to be allocated. Allocating the needed capacity once, up front, avoids those repeated allocations:
b := make([]byte, 0, int(sum[0]))
for i := 0; i < int(sum[0]); i++ {
    x := sum[(i*7+1)%len(sum)] ^ sum[(i*5+3)%len(sum)]
    c := "abcdefghijklmnopqrstuvwxyz"[x%26]
    b = append(b, c)
}
We’re down to 19 allocations.
All that’s left now is the return string(b) statement. Since this copies the byte slice into a new string, it allocates. There’s a trick we can use to avoid the copy:
return *(*string)(unsafe.Pointer(&b))
This brings us to 18 allocations (note that this requires importing unsafe).
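As an aside, if you are on Go 1.20 or newer, the unsafe package has dedicated helpers for this conversion that do not depend on the memory layout of string and slice headers. A sketch of the same trick in that style:
// unsafe.SliceData returns a pointer to b's backing array, and
// unsafe.String builds a string over those bytes without copying
return unsafe.String(unsafe.SliceData(b), len(b))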
Looking at the Function Again
Now that we’ve done our initial pass through the function, let’s look at it again to see if there are any other allocations we can get rid of.
var buf bytes.Buffer
x := strconv.Itoa(n)
for i := 0; i < 100000; i++ {
    buf.WriteString(x)
}
Instead of creating a bytes.Buffer, we could create a byte slice with the full capacity up front and append to it. That way we allocate a fixed amount once instead of dealing with the repeated regrowth we mentioned earlier. So we can replace this with:
x := strconv.Itoa(n)
buf := make([]byte, 0, 100000*len(x))
for i := 0; i < 100000; i++ {
    buf = append(buf, x...)
}
sum := sha256.Sum256(buf)
Only 3 allocations now. If I had to guess, these would be the creation of buf, the creation of b, and the creation of the digest object inside sha256.Sum256. So how do we get rid of these last, seemingly intractable, allocations? Enter sync.Pool.
sync.Pool
[Note: The following section borders on what Knuth would probably describe as excessive. It is primarily meant as a demonstration of sync.Pool. It is, however, worth noting that while the difference between 3 allocations and 0 may not seem large, in practice it can be, because it means the garbage collector will never have to run as a result of this function.]
sync.Pool is a standard-library type that lets you amortize allocation costs. With our current code, the buffers we allocate live only for one call to foo. But what if we could allocate a buffer once, store it in a global variable, and just reuse it indefinitely, resetting the contents every time while keeping the capacity? Doing this manually would be a little complicated, and it also wouldn’t work concurrently. That’s what sync.Pool is for: it lets you Get an already-allocated object and Put it back when you are done with it. If no objects are available in the pool when you request one, it will call the New function you define to allocate one.
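In isolation, pool usage follows a simple Get, reset, use, Put rhythm. A minimal sketch with names of our choosing (assuming bytes and sync are imported):
var scratch = sync.Pool{
    // New is called only when Get finds the pool empty
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func work() {
    b := scratch.Get().(*bytes.Buffer)
    defer scratch.Put(b) // return the object for later reuse
    b.Reset()            // contents are whatever the previous user left behind
    b.WriteString("hello")
}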
Let’s create one pool for the hash object and one for a reusable byte buffer (a 256-byte slice). Put the following above foo:
var bufPool = sync.Pool{
    New: func() interface{} {
        // 256 bytes is plenty: a sha256 sum is only 32 bytes,
        // and the decimal digits of n take even less space
        b := make([]byte, 256)
        // return a pointer: storing a slice header directly in an
        // interface{} would itself allocate on every Put
        return &b
    },
}

var hashPool = sync.Pool{
    New: func() interface{} {
        return sha256.New()
    },
}
Now we need to update our code to use these instead of the objects we create inside the function. Add the following to the top of foo (note that this also requires importing hash and sync):
// get buffer from pool
bufptr := bufPool.Get().(*[]byte)
defer bufPool.Put(bufptr)
buf := *bufptr
// reset buf
buf = buf[:0]
// get hash object from pool
h := hashPool.Get().(hash.Hash)
defer hashPool.Put(h)
h.Reset()
We use sync.Pool.Get to retrieve an object from the pool and, via defer, put it back with sync.Pool.Put when the function returns. Since we now have a hash object, we can write to it directly instead of going through an intermediate buffer. Ideally, we could do something like
x := strconv.Itoa(n)
for i := 0; i < 100000; i++ {
    h.Write(x) // does not compile: Write takes a []byte, not a string
}
Unfortunately, hash objects in Go do not have a WriteString method and Write takes a []byte, so we need strconv.AppendInt to get a byte slice. Using AppendInt also saves us an allocation, since it writes into the buffer we got from bufPool instead of allocating a new string the way strconv.Itoa does:
x := strconv.AppendInt(buf, int64(n), 10)
for i := 0; i < 100000; i++ {
    h.Write(x)
}
Now we can get the hash and put it in buf. Note that h.Sum appends the 32-byte checksum to the slice it is given and returns the result, so with our zero-length, 256-capacity buffer, no new allocation is needed:
// reset whatever strconv.AppendInt put in the buf
buf = buf[:0]
sum := h.Sum(buf)
In the next for loop, we iterate from 0 to sum[0], perform some calculations, and put the result in b. Since sum[0] is a byte, which ranges from 0 to 255, it can never exceed 255, so we can simply say
b := make([]byte, 0, 256)
and leave the loop contents the same. The compiler can reason about the constant 256 at compile time, but it can’t do that with sum[0]; a constant capacity lets the slice live on the stack. This saves us another allocation.
We’re finally down to 1 allocation. Replace the final return statement with the following:
sum = sum[:0] // reset sum
sum = append(sum, b...)
return *(*string)(unsafe.Pointer(&sum))
And you should see
Test PASS
Allocs: 0
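For reference, here is foo with all of the changes applied, to be used together with the bufPool and hashPool definitions from earlier (imports needed: crypto/sha256, hash, strconv, sync, unsafe). The article presented these edits piecemeal, so this assembled version is our reconstruction; one caveat is worth a comment at the end:
func foo(n int) string {
    // get buffer from pool
    bufptr := bufPool.Get().(*[]byte)
    defer bufPool.Put(bufptr)
    buf := *bufptr
    // reset buf, keeping its 256-byte capacity
    buf = buf[:0]
    // get hash object from pool
    h := hashPool.Get().(hash.Hash)
    defer hashPool.Put(h)
    h.Reset()
    // render the digits of n once, then hash them 100,000 times
    x := strconv.AppendInt(buf, int64(n), 10)
    for i := 0; i < 100000; i++ {
        h.Write(x)
    }
    // reset whatever strconv.AppendInt put in the buf
    buf = buf[:0]
    sum := h.Sum(buf)
    // constant capacity, so b does not escape to the heap
    b := make([]byte, 0, 256)
    for i := 0; i < int(sum[0]); i++ {
        x := sum[(i*7+1)%len(sum)] ^ sum[(i*5+3)%len(sum)]
        c := "abcdefghijklmnopqrstuvwxyz"[x%26]
        b = append(b, c)
    }
    // copy b into the pooled buffer and reinterpret it as a string;
    // the result aliases pool memory, so callers must be done with it
    // before the buffer is handed out again
    sum = sum[:0]
    sum = append(sum, b...)
    return *(*string)(unsafe.Pointer(&sum))
}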
Why does this work? Well, building what we had originally with verbose escape analysis information may be helpful:
$ go build -gcflags='-m -m' a.go
...
./a.go:55:11: make([]byte, 0, 256) escapes to heap:
./a.go:55:11: flow: b = &{storage for make([]byte, 0, 256)}:
./a.go:55:11: from make([]byte, 0, 256) (spill) at ./a.go:55:11
./a.go:55:11: from b := make([]byte, 0, 256) (assign) at ./a.go:55:4
./a.go:55:11: flow: ~r1 = b:
./a.go:55:11: from &b (address-of) at ./a.go:65:35
./a.go:55:11: from *(*string)(unsafe.Pointer(&b)) (indirection) at ./a.go:65:9
./a.go:55:11: from return *(*string)(unsafe.Pointer(&b)) (return) at ./a.go:65:2
./a.go:55:11: make([]byte, 0, 256) escapes to heap
...
The new code’s output:
...
./a.go:55:11: make([]byte, 0, 256) does not escape
...
If we keep the slice inside the function and do not return it, it can be stack allocated. As any C programmer knows, it is not possible to return stack-allocated memory, so when the Go compiler sees the return statement, it makes the slice heap allocated instead. But that doesn’t really explain why transferring the contents to sum doesn’t allocate. Isn’t sum also heap allocated? Yes, but sync.Pool already performed that heap allocation for us.
Takeaways/Tips
- the unsafe trick for converting between strings and byte slices:
  - string -> byte slice: *(*[]byte)(unsafe.Pointer(&s))
    - the resulting bytes must never be modified: string data can live in read-only memory, and writing to it will crash the program
  - byte slice -> string: *(*string)(unsafe.Pointer(&buf))
- do not run allocation-heavy code in a loop if it does not depend on state updated in the loop
- use sync.Pool when working with large allocated objects
  - for byte slices, bytebufferpool may be simpler to use and more performant
- reuse buffers if you can
  - reset them with bytes.Buffer.Reset or buf = buf[:0] instead of allocating fresh ones
- prefer allocating slices and buffers at their full size up front rather than leaving the size unspecified and relying on the growth factor
- make sure to actually benchmark your code to see whether a change improves performance, and weigh any gains against readability concerns
Conclusion
Performance does matter, and it is not always hard to improve. Although this code was somewhat artificial, there are several lessons that can be learned from improving its performance.