04/23/2020 profiling pprof benchmark ab cpu allocation remote
Go has had rich profiling tools from the very beginning — the pprof package and go tool pprof.
Let's discuss why profiling is useful, how to work with it, and what is new here.
With a debugger one can find logic mistakes — that is very basic debugging. But a program can be logically correct and still behave badly in other ways: it can consume too much CPU, use too much memory, or allocate excessively.
None of this is a bug as long as the program keeps functioning. But high resource consumption is expensive for the service owner — supporting such a service requires buying new hardware from time to time.
If the external load on the service grows, these weak points can become the cause of an unexpected failure: if the service runs out of memory it will simply exit; if there is not enough CPU, the server may become unavailable even for ssh connections.
A profiler is extremely helpful for inspecting a program and finding such issues.
Go's standard profiling tool, pprof, is perfectly good — so let's use it.
Profiling is actually two activities: collecting the profiling data and analyzing it.
To collect the profiling data one can either:
1. call the pprof package directly from the code, or
2. enable the net/http/pprof endpoint, which exports the profiling data over HTTP.

To start with, it is simpler to use option 2.
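Still, for reference, option 1 boils down to starting and stopping the CPU profiler around the code of interest via the runtime/pprof package. A minimal sketch, with an arbitrary output file name:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Create a file to write the CPU profile into (the name is arbitrary).
	f, err := os.Create("cpu.profile")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start CPU profiling and make sure it is stopped before exit.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... the actual work of the program goes here ...
}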
To start the profiling endpoint one should:
1. add import _ "net/http/pprof" to the program code;
2. start a net/http server in the program — nothing more is required:

go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()
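Putting the two steps together, a minimal service with the profiling endpoint enabled might look roughly like this (the "/" handler and the :8090 address are placeholders, not the real service code):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	// Profiling endpoint, reachable only from localhost.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// The actual service on :8090, on its own mux so it does not expose pprof.
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})
	log.Fatal(http.ListenAndServe(":8090", mux))
}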
Looking at the net/http/pprof source, it becomes clear what that import does:
func init() {
	http.HandleFunc("/debug/pprof/", Index)
	http.HandleFunc("/debug/pprof/cmdline", Cmdline)
	http.HandleFunc("/debug/pprof/profile", Profile)
	http.HandleFunc("/debug/pprof/symbol", Symbol)
	http.HandleFunc("/debug/pprof/trace", Trace)
}
I'm going to profile a web service running at localhost:8090. I've imported net/http/pprof into it and added the HTTP server startup.
First, I'm going to put the service under load with the Apache Benchmark (ab) tool — 1000 requests with a concurrency of 10:
ab -n 1000 -c10 http://localhost:8090/
ab prints its own statistics about request times, but I don't need them right now — I'm going to analyze the profiling data first.
While the service is under load, I request CPU profiling data for the next 5 seconds:
curl 'http://localhost:6060/debug/pprof/profile?seconds=5' > ./profile
As a result, the profile file is downloaded. That is all it takes to collect profiling data. Let's analyze it.
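As a side note, go tool pprof can also fetch the profile directly from the running endpoint, without the intermediate file (assuming the same localhost:6060 address):

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=5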
Let's open the profiling data I've just downloaded with go tool pprof:
go tool pprof ./profile
Type: cpu
Time: Apr 19, 2020 at 9:28pm (+03)
Duration: 5.40s, Total samples = 400ms ( 7.41%)
Entering interactive mode (type "help" for commands, "o" for options)
We can render the data as an image:
(pprof) web
This requires the graphviz tool to be installed.
As a result I get a graph of functions, where the longest-running functions are drawn larger than the others.
As we can see, most of the time is spent in html/template package calls, which takes us into the Go sources. That is the profiler's standard behaviour — it does not matter whether a function comes from the application code or from the Go sources.
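By the way, the same graph can also be exported to an image file without entering the interactive mode (graphviz is still required); for example:

go tool pprof -png ./profile > profile.png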
Let's get back to the pprof terminal and print the top 15 functions by time spent:
(pprof) top15
Showing nodes accounting for 0, 0% of 400ms total
Showing top 15 nodes out of 147
      flat  flat%   sum%        cum   cum%
         0     0%     0%      260ms 65.00%  net/http.(*conn).serve
         0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).ServeHTTP
         0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).ServeHTTP.func1
         0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).add.func1
         0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*context).Render
         0     0%     0%      240ms 60.00%  github.com/labstack/echo/middleware.AddTrailingSlashWithConfig.func1.1
         0     0%     0%      240ms 60.00%  github.com/labstack/echo/middleware.CORSWithConfig.func1.1
         0     0%     0%      240ms 60.00%  net/http.serverHandler.ServeHTTP
         0     0%     0%      220ms 55.00%  html/template.(*Template).Execute
         0     0%     0%      220ms 55.00%  text/template.(*Template).Execute
         0     0%     0%      220ms 55.00%  text/template.(*Template).execute
         0     0%     0%      220ms 55.00%  text/template.(*state).walk
         0     0%     0%      220ms 55.00%  text/template.(*state).walkTemplate
Template rendering is at the top here as well.
We can print the source of any function to see how much time each part of it takes:
(pprof) list text/template.\(\*state\).walk$
Total: 400ms
ROUTINE ======================== text/template.(*state).walk in /usr/local/Cellar/go/1.14/libexec/src/text/template/exec.go
         0      830ms (flat, cum) 207.50% of Total
         .          .    250:	s.at(node)
         .          .    251:	switch node := node.(type) {
         .          .    252:	case *parse.ActionNode:
         .          .    253:		// Do not pop variables so they persist until next end.
         .          .    254:		// Also, if the action declares variables, don't print the result.
         .      160ms    255:		val := s.evalPipeline(dot, node.Pipe)
         .          .    256:		if len(node.Pipe.Decl) == 0 {
         .       20ms    257:			s.printValue(node, val)
         .          .    258:		}
         .          .    259:	case *parse.IfNode:
         .       40ms    260:		s.walkIfOrWith(parse.NodeIf, dot, node.Pipe, node.List, node.ElseList)
         .          .    261:	case *parse.ListNode:
         .          .    262:		for _, node := range node.Nodes {
         .      220ms    263:			s.walk(dot, node)
         .          .    264:		}
         .          .    265:	case *parse.RangeNode:
         .      170ms    266:		s.walkRange(dot, node)
         .          .    267:	case *parse.TemplateNode:
         .      220ms    268:		s.walkTemplate(dot, node)
         .          .    269:	case *parse.TextNode:
         .          .    270:		if _, err := s.wr.Write(node.Text); err != nil {
         .          .    271:			s.writeError(err)
         .          .    272:		}
         .          .    273:	case *parse.WithNode:
Conclusion: the time goes to template rendering, which internally relies on reflect package calls, and those are considered to be "heavy".
I can't change the Go source code, but I can change the code of my service that calls the standard packages.
Currently my service renders the template on every client request. The templates themselves are static and constant; the output of a template call depends only on the request data. So I'm going to cache the rendering result — the template will be rendered only once for each combination of request data.
On the first request there is no data in the cache, so the template is rendered and the result is saved into the cache under a key generated from the request data. On every subsequent request the service checks whether the cache already holds data for that key and uses the cached result if it does.
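A minimal sketch of this caching idea is below. The template, the key format and the function names here are illustrative assumptions, not the actual service code:

package main

import (
	"bytes"
	"fmt"
	"html/template"
	"sync"
)

// Illustrative template; the real service parses its templates from files.
var tmpl = template.Must(template.New("page").Parse("<h1>{{.Title}}</h1>"))

var (
	cacheMu  sync.RWMutex
	rendered = map[string][]byte{} // cache key built from request data -> rendered page
)

// renderCached executes the template only on a cache miss;
// otherwise it returns the previously rendered bytes.
func renderCached(key string, data interface{}) ([]byte, error) {
	cacheMu.RLock()
	page, ok := rendered[key]
	cacheMu.RUnlock()
	if ok {
		return page, nil
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return nil, err
	}

	cacheMu.Lock()
	rendered[key] = buf.Bytes()
	cacheMu.Unlock()
	return buf.Bytes(), nil
}

func main() {
	// The first call renders the template, the second one hits the cache.
	page, _ := renderCached("title=pprof", map[string]string{"Title": "pprof"})
	fmt.Println(string(page))
	page, _ = renderCached("title=pprof", map[string]string{"Title": "pprof"})
	fmt.Println(string(page))
}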
I changed the service code and ran the profiling again. The results are below:
(pprof) top15
Showing nodes accounting for 90ms, 50.00% of 180ms total
Showing top 15 nodes out of 78
      flat  flat%   sum%        cum   cum%
         0     0%     0%      100ms 55.56%  net/http.(*conn).serve
      90ms 50.00% 50.00%       90ms 50.00%  syscall.syscall
         0     0% 50.00%       60ms 33.33%  bufio.(*Writer).Flush
         0     0% 50.00%       60ms 33.33%  internal/poll.(*FD).Write
         0     0% 50.00%       60ms 33.33%  net.(*conn).Write
         0     0% 50.00%       60ms 33.33%  net.(*netFD).Write
         0     0% 50.00%       60ms 33.33%  net/http.(*response).finishRequest
         0     0% 50.00%       60ms 33.33%  net/http.checkConnErrorWriter.Write
         0     0% 50.00%       60ms 33.33%  syscall.Write
         0     0% 50.00%       60ms 33.33%  syscall.write
         0     0% 50.00%       40ms 22.22%  runtime.findrunnable
         0     0% 50.00%       40ms 22.22%  runtime.mcall
         0     0% 50.00%       40ms 22.22%  runtime.schedule
         0     0% 50.00%       30ms 16.67%  github.com/labstack/echo.(*Echo).Start
As we can see, syscall writes are at the top now, and the template calls are gone from it. This is exactly what we want: the goal of such changes is to make the application logic as fast as possible, so that basic OS operations — reading and sending data over the network — end up at the top.
Now let's compare the ab results — the "Requests per second" row:
Before: Requests per second: 14.13 [#/sec] (mean)
After: Requests per second: 638.38 [#/sec] (mean)
As we can see, the service is performing much faster now.
One can also open the net/http/pprof endpoint in a browser — it is the same endpoint we curl'ed to get the profiling data. The page looks like this:
/debug/pprof/

Types of profiles available:

Count	Profile
9	allocs
0	block
0	cmdline
9	goroutine
9	heap
0	mutex
0	profile
18	threadcreate
0	trace
From this page we can grab other metrics as well — heap, allocs, goroutine, block, mutex and threadcreate profiles, plus traces. The most useful here are CPU and memory profiling.
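For example, a current heap (memory) profile can be fetched and opened in exactly the same way as the CPU profile, assuming the same localhost:6060 endpoint:

curl http://localhost:6060/debug/pprof/heap > ./heap
go tool pprof ./heap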
The pprof profiling endpoint we've enabled above can also be made accessible from the external network:
// the server listens on every network interface (not only localhost), port 6060
log.Println(http.ListenAndServe(":6060", nil))
But one should be careful to restrict it to developers only! Profiling production services is very useful, but a profiling endpoint that is reachable by external clients is a vulnerability — profiler calls slow down the service.
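A minimal sketch of one way to limit exposure is to register the pprof handlers on a dedicated mux and serve it on an address reachable only from an internal network (the 10.0.0.5 address below is a made-up example):

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	// A dedicated mux so the pprof handlers never end up on the public server.
	internal := http.NewServeMux()
	internal.HandleFunc("/debug/pprof/", pprof.Index)
	internal.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	internal.HandleFunc("/debug/pprof/profile", pprof.Profile)
	internal.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	internal.HandleFunc("/debug/pprof/trace", pprof.Trace)

	// Serve only on an internal interface (made-up address).
	log.Println(http.ListenAndServe("10.0.0.5:6060", internal))
}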
The way profiling data is exposed covers almost any situation and is unlikely to change. It is the analysis of that data that keeps improving — a lot of visualization tools have been released. To me the most helpful one is the flame graph.
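In particular, recent versions of go tool pprof can serve an interactive web UI that includes a flame graph view (the port is arbitrary):

go tool pprof -http=:8081 ./profile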
GoLand IDE also has facilities for displaying Go profiling data.
go tool pprof itself also provides rich filtering and sorting functionality. That functionality is not covered in this article, but there are some examples in the Go blog.