I was recently benchmarking various implementations of an algorithm and noticed that the recursive implementation performed worse than its inline equivalent. I didn’t know whether it made sense to attribute this overhead to the additional function calls in the recursive implementation. I set out to see if I could look behind the scenes of a Go function call and determine just how expensive each call is.
TL;DR: The cost of a function call? I still don’t know. The best I can offer is that it depends. That, and an exploration of Go assembler.
In this post, I take you through the analysis I did and highlight what I learned along the way.
A simple program
Rather than use a complex algorithm, I sought to use the simplest possible program that demonstrated the behaviour I wanted to investigate. This Go program takes an int, in this case 1000, and increments it by 1 inside a loop that executes 1000 iterations. The result, 2000, is printed to the screen.
package main
func main() {
n := inc(1000)
println(n)
}
func inc(n int) int {
for i := 0; i < 1000; i++ {
n = n + 1
}
return n
}
$ go run inline.go
2000
I included the loop to slow the function down and give me something measurable to benchmark. Without the loop, benchmark results are in the low single-digit nanoseconds, which falls within the margin of error of simple benchmarks.
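For reference, the benchmarks referred to throughout look something like the sketch below, placed in a _test.go file alongside the programme. This is my reconstruction rather than a listing from the repository; the result sink variable is an assumption, a common way to stop the compiler optimising the benchmarked call away.
package main

import "testing"

// result is a package-level sink; assigning the benchmark output to it
// helps prevent the compiler from treating the call as dead code.
var result int

func BenchmarkInc(b *testing.B) {
	var n int
	for i := 0; i < b.N; i++ {
		n = inc(1000)
	}
	result = n
}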
$ go test -run=X -bench=.
goos: darwin
goarch: amd64
pkg: github.com/billglover/recursion/inline
BenchmarkInc-8 5000000 331 ns/op
PASS
ok github.com/billglover/recursion/inline 1.971s
Adding function calls
Now that I had a baseline, I wanted to see how the addition of a single function call would change the benchmark. I modified the inc(n int) int function as shown below.
package main
func main() {
n := inc(1000)
println(n)
}
func inc(n int) int {
for i := 0; i < 1000; i++ {
n = add(n)
}
return n
}
func add(n int) int {
return n + 1
}
Instead of incrementing the value of n directly, we do so via an additional function call to add(). With the additional function call, I expected to see an increase in execution time. The benchmark comparison between the two approaches is shown below.
$ go test -run=X -bench=. ./... | grep -i bench
BenchmarkIncFunction-8 5000000 289 ns/op
BenchmarkIncInline-8 5000000 289 ns/op
I ran the benchmark several more times and the results were consistent to within a couple of nanoseconds. Given the differences in recursive function performance I’d been seeing, I expected the additional function call to result in a noticeable change in performance. In these examples it appeared to make no difference at all. It was time to dig deeper.
Behind the scenes
I had a suspicion that the compiler was performing some optimisation on my example code and that the resulting code was similar, if not identical, between the two examples. Luckily, the Go toolchain provides an easy way to view the output of the compiler.
$ go tool compile -S inline.go
Note: this produces an intermediate assembly language and is not the exact code that ends up running on the CPU. I ignored this technicality as I continued to explore.
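If you do want to inspect the machine code in the final binary, the toolchain can also disassemble a compiled program. The commands below are a sketch; the -s flag filters by symbol name, and this assumes the symbol survives inlining.
$ go build -o inline inline.go
$ go tool objdump -s main.inc inline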
Looking at the assembly output for the first example, it was fairly easy to trace through the function execution. Helpfully, the assembler output includes references back to the file and line number of the original Go source code.
0x0000 00000 (inline.go:8) TEXT "".inc(SB), NOSPLIT, $0-16
0x0000 00000 (inline.go:8) FUNCDATA $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (inline.go:8) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (inline.go:8) FUNCDATA $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (inline.go:9) PCDATA $2, $0
0x0000 00000 (inline.go:9) PCDATA $0, $0
0x0000 00000 (inline.go:9) MOVQ "".n+8(SP), AX
0x0005 00005 (inline.go:9) XORL CX, CX
0x0007 00007 (inline.go:9) JMP 15
0x0009 00009 (inline.go:9) INCQ CX
0x000c 00012 (inline.go:10) INCQ AX
0x000f 00015 (inline.go:9) CMPQ CX, $1000
0x0016 00022 (inline.go:9) JLT 9
0x0018 00024 (inline.go:12) MOVQ AX, "".~r1+16(SP)
0x001d 00029 (inline.go:12) RET
The FUNCDATA and PCDATA instructions relate to garbage collection and are inserted by the compiler. I’ve removed these from the following (and future) listings to aid readability. The TEXT directive defines the function name and indicates the stack allocation required for the function to execute. This one line took the most time to analyse, so I’ve broken out an explanation of each of its components below.
00000 TEXT "".inc(SB), NOSPLIT, $0-16
""
is a placeholder that will be filled in later by the linker.incInline
is a symbol used to reference the function location in memory.NOSPLIT
I don’t understand this fully, but I think this means that some stack management code can be left out.$0
indicates that no additional space on the stack is required for this function.$16
indicates that 16-bytes are required for the function parameters and return values (in this case; one 64-bit integer parameter, and one 64-bit integer return value).
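For contrast, when a function does need its own frame, the first number is non-zero. The noinline version of inc that appears later in this post carries the directive below: a 32-byte stack frame on top of the 16 bytes of parameters and return values.
00000 TEXT "".inc(SB), $32-16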
A simplified view of the assembly language version of our first example is shown below.
; inc(n int) int
00000 TEXT "".inc(SB), NOSPLIT, $0-16 ; function definition
00000 MOVQ "".n+8(SP), AX ; move parameter from stack to AX
00005 XORL CX, CX ; zero out register CX
00007 JMP 15 ; go to line 15
00009 INCQ CX ; increment register CX
00012 INCQ AX ; increment register AX
00015 CMPQ CX, $1000 ; compare register CX with 1000
00022 JLT 9 ; go to line 9
00024 MOVQ AX, "".~r1+16(SP) ; move AX to stack return value
00029 RET ; return from function
A quick glance at the documentation (X86 Architecture) for Intel CPUs tells us a little more about the registers being used here.
- AX: Accumulator register – used in arithmetic operations.
- CX: Counter register – used in shift/rotate instructions and loops.
The Go documentation (A Quick Guide to Go’s Assembler) defines the following pseudo-registers. These aren’t hardware registers but virtual registers maintained by the Go toolchain. I’ll treat them as if they were hardware registers for this investigation.
- SB: Static base pointer – used to reference global symbols.
- SP: Stack pointer – used to reference the top of the stack.
This shows that register CX is being used to hold the value of our loop variable i, and register AX is being used to hold the value of our local variable n.
Now that I understood the inline version of our programme, I took a look at the output of the compiler for our second example. Again, I’ve removed the FUNCDATA and PCDATA instructions and added comments to each line.
; inc(n int) int
00000 TEXT "".inc(SB), NOSPLIT, $0-16 ; function definition
00000 MOVQ "".n+8(SP), AX ; move parameter from stack to AX
00005 XORL CX, CX ; zero out register CX
00007 JMP 15 ; go to line 15
00009 INCQ CX ; increment register CX
00012 INCQ AX ; increment register AX
00015 CMPQ CX, $1000 ; compare register CX with 1000
00022 JLT 9 ; go to line 9
00024 MOVQ AX, "".~r1+16(SP) ; move AX to stack return value
00029 RET ; return from function
; add(n int) int
00000 TEXT "".add(SB), NOSPLIT, $0-16 ; function definition
00000 MOVQ "".n+8(SP), AX ; move parameter from stack to AX
00005 INCQ AX ; increment register AX
00008 MOVQ AX, "".~r1+16(SP) ; move AX to stack return value
00013 RET ; return from function
In this listing we can clearly see the two function definitions. I’ll leave you to walk through the implementation of the add(n int) int function. Far more interesting is that the listing for the inc(n int) int function is identical to the listing in our first example. Although we have defined a new function, that function is never called. The end result is that the code path followed in the two examples is identical, which explains the benchmark comparisons seen earlier.
The compiler has determined that it can move the body of add() inline and save on a function call. This is known as “inlining” and is a common compiler trick used to improve performance at the expense of binary size (see Wikipedia). It explains why I was seeing identical benchmarks for both approaches, but not why my recursive functions were slower than their inline counterparts.
It turns out that the compiler can’t move functionality inline in all cases. One example where it is unable to do so is recursion, as it doesn’t know at compile time how many times the recursive function will be called.
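As an aside, the compiler can be asked to report its inlining decisions with the -m flag. I haven’t reproduced the output here as the exact wording varies between Go versions, but for the second example it notes that add can be inlined and that the call inside inc was inlined; for a recursive function it reports that it cannot inline it.
$ go build -gcflags='-m' inline.go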
I was now convinced that these function calls would demonstrate a performance overhead (otherwise why would they be optimised out), but I was no closer to being able to measure what that overhead was. I needed a way to view the cost of function calls without the compiler optimising them away.
After some digging, I discovered a little-known Go pragma directive that disables inlining for particular functions. It should never be used in normal code, but it is very useful here for demonstrating the cost of function calls.
package main
func main() {
n := inc(1000)
println(n)
}
func inc(n int) int {
for i := 0; i < 1000; i++ {
n = add(n)
}
return n
}
//go:noinline
func add(n int) int {
return n + 1
}
With the addition of the noinline pragma directive, I re-ran the benchmark.
$ go test -run=X -bench=. -benchmem ./... | grep -i bench
BenchmarkIncFunction-8 1000000 2289 ns/op
BenchmarkIncInline-8 5000000 286 ns/op
This shows the inline version running roughly eight times faster than the approach using function calls. But, before you panic and start avoiding function calls altogether, remember that I’m making 1,000 function calls during each benchmark operation. The overhead of any individual function call is tiny.
Now that I was able to reproduce the overhead associated with function calls, I went back to the assembly to see what was going on behind the scenes.
; inc(n int) int
00000 TEXT "".inc(SB), $32-16 ; function definition
00000 MOVQ (TLS), CX ; move the g pointer from TLS to CX
00009 CMPQ SP, 16(CX) ; compare SP with the stack guard in the g struct
00013 JLS 88 ; go to line 88 if 'lower or same'
00015 SUBQ $32, SP ; increase stack size by 32-bytes
00019 MOVQ BP, 24(SP) ; move BP to the stack
00024 LEAQ 24(SP), BP ; update BP to new location
; --- ---
00029 XORL AX, AX ; zero out register AX
00031 MOVQ "".n+40(SP), CX ; move parameter from stack to CX
00036 JMP 65 ; go to line 65
00038 MOVQ AX, "".i+16(SP) ; move AX to the stack
00043 MOVQ CX, (SP) ; move CX to the stack
00047 CALL "".add(SB) ; call the add(int) function
00052 MOVQ "".i+16(SP), AX ; move stack value to AX
00057 INCQ AX ; increment AX
00060 MOVQ 8(SP), CX ; move stack value to CX
00065 CMPQ AX, $1000 ; compare register AX with 1000
00071 JLT 38 ; go to line 38
00073 MOVQ CX, "".~r1+48(SP) ; move CX to stack return value
00078 MOVQ 24(SP), BP ; restore BP from the stack
00083 ADDQ $32, SP ; reduce stack size by 32-bytes
00087 RET ; return from function
; --- ---
00088 NOP ; do nothing
00088 CALL runtime.morestack_noctxt(SB); ask runtime for more stack
00093 JMP 0 ; go to line 0
; add(n int) int
00000 TEXT "".add(SB), NOSPLIT, $0-16 ; function definition
00000 MOVQ "".n+8(SP), AX ; move parameter from stack to AX
00005 INCQ AX ; increment register AX
00008 MOVQ AX, "".~r1+16(SP) ; move AX to stack return value
00013 RET ; return from function
The good news is that the inc() function now makes a call to add() on line 47. However, I was not expecting the additional complexity before and after the main function body. There appears to be a lot more going on than a simple CALL to add(). Much of this additional complexity comes from managing values on the stack, and it was becoming increasingly difficult to trace how values moved between registers as the functions were called. I turned to pen and paper to work it through.
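Before reaching for paper, one thing can be read straight off the listing above: the 32-byte frame declared in the TEXT directive is fully accounted for by the stack offsets used in the function body. Four 8-byte slots make up the 32 bytes.
(SP)    ; argument n passed to add
8(SP)   ; return value written by add
16(SP)  ; spilled loop variable i
24(SP)  ; saved base pointer BP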
If you want to follow this through yourself, grab a large mug of your favourite brew and several sheets of paper; it’ll take a while.
I traced through the execution of the programme and learned the following:
- Values are passed to functions via the stack.
- Values are returned from functions via the stack.
- Calling functions are responsible for reserving the required stack space.
- Function calls grow the stack by one word to accommodate the return address.
- Functions make calls to the runtime to grow the stack as necessary.
I now understood a little more about why function calls incur a cost. Allocating stack space and moving parameters and return values around all costs CPU cycles. But I still wasn’t able to articulate how big that cost was.
So how expensive is a function call?
Comparing our two approaches through benchmarking, we can see that the solution involving a function call took 2289 ns/op whereas the inline solution took only 286 ns/op. You could argue that adding the function call costs an additional 2003 ns/op. With 1,000 function calls in each benchmark operation, this equates to roughly 2 ns per function call.
An alternative way of looking at it is to consider the increase in the number of instructions executed to make our function call. Our inline method requires 10 lines of assembler; our method involving a function call requires the execution of an additional 20 instructions. Figuring out how long each instruction takes to execute is non-trivial, so I decided to stop digging. If you have any thoughts on how I might approximate this, I’d love to hear from you.
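A third option would be to benchmark a non-inlined function directly and let the testing package do the averaging. The sketch below is illustrative rather than taken from the repository: BenchmarkAdd is a name of my own choosing, and result is again a package-level sink to keep the compiler from discarding the call.
package main

import "testing"

var result int

//go:noinline
func add(n int) int {
	return n + 1
}

// BenchmarkAdd times one non-inlined call per iteration, so ns/op
// approximates the per-call overhead plus a single increment.
func BenchmarkAdd(b *testing.B) {
	n := 0
	for i := 0; i < b.N; i++ {
		n = add(n)
	}
	result = n
}
Even then, the measured time includes the loop and the increment itself, so it bounds the call overhead rather than isolating it.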
In conclusion
Whilst I can’t definitively articulate the cost of a function call in Go, I do have a better understanding of why it incurs a cost. To some, this may be obvious, but the journey above taught me a lot more about Go, the associated toolchain, and general programme execution.
I’m still of the opinion that the best way to determine the difference between two approaches is to benchmark them. If you find yourself questioning the results of your benchmarks, it’s time to dig further. The Go toolchain provides a number of options for further investigation.
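If you do find yourself comparing many benchmark runs, benchstat (from the golang.org/x/perf repository, not the core toolchain) is a useful companion; the package paths below are illustrative.
$ go test -run=X -bench=. -count=10 ./inline/ > old.txt
$ go test -run=X -bench=. -count=10 ./function/ > new.txt
$ benchstat old.txt new.txt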