Profiling

04/23/2020 profiling pprof benchmark ab cpu allocation remote



Go has had rich profiling tools from the very beginning — the pprof package and go tool pprof.
Let's discuss why profiling is useful, how to work with it, and what is new in this area.

Why profiling is good

With a debugger one can find logic mistakes. That is very basic debugging.

A program may be logically correct, yet still show the following side effects:

  • long execution time in some cases;
  • high memory consumption;
  • high CPU load.

None of these is a bug as long as the program keeps functioning. But high resource consumption is expensive for the service owner — to keep such a service running, new hardware resources are required from time to time.

If the external load on the service increases, any of these "weak points" can become the cause of an unexpected failure — if the service runs out of memory it will simply exit; if there is not enough CPU, the server may become unavailable even for ssh connections.

A profiler is extremely helpful for inspecting the program and finding such issues.

What tool is good for profiling

Go's standard profiling tool — pprof — is perfectly good. Let's use pprof then.

How to profile

Profiling actually consists of two activities:

  • 1. collecting profiling data from the service;
  • 2. visualizing and analyzing the data.

In order to collect the profiling data one should either:

  • implement manual metrics collection and export with the pprof package, or
  • start a net/http/pprof endpoint, which exports the profiling data.

To start with, it is simpler to use option 2 from above.

To start the profiling endpoint one should:

  • add the import _ "net/http/pprof" to the program code;
  • if the program already starts a net/http server — nothing more is required;
  • if not, add the server start code:
go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()
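
Putting these pieces together, a minimal stand-alone example might look like the sketch below; the hello handler and the :8090 service port are just placeholders for the real service:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Profiling endpoint on localhost:6060; the nil handler means
	// http.DefaultServeMux, where net/http/pprof has registered its handlers.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// A trivial stand-in for the actual service; it uses its own mux,
	// so the profiling handlers stay on port 6060 only.
	serviceMux := http.NewServeMux()
	serviceMux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})
	log.Fatal(http.ListenAndServe(":8090", serviceMux))
}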

Looking at the net/http/pprof source, it becomes obvious what that import does:

func init() {
	http.HandleFunc("/debug/pprof/", Index)
	http.HandleFunc("/debug/pprof/cmdline", Cmdline)
	http.HandleFunc("/debug/pprof/profile", Profile)
	http.HandleFunc("/debug/pprof/symbol", Symbol)
	http.HandleFunc("/debug/pprof/trace", Trace)
}

Profiling

I'm going to profile a web service running on my localhost:8090.

I've imported net/http/pprof and added the http server start.

Getting profile data

First, I'm going to put my service under load using the Apache Benchmark (ab) tool:

ab -n 1000 -c10 http://localhost:8090/

ab prints its own statistics about request times, but I don't need them right now — I'm going to analyze the profiling data instead.

While the service is under load, I request CPU profiling data for the next 5 seconds:

curl http://localhost:6060/debug/pprof/profile?seconds=5 > ./profile

As a result, the profile file is downloaded. That is all it takes to get the profiling data. Let's analyze it.
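
Alternatively, go tool pprof can fetch the profile over HTTP by itself, skipping the intermediate file:

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=5

In that case pprof downloads the profile and drops straight into the same interactive mode described below.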

CPU profile analysis

Let's open the profiling data I've just downloaded with go tool pprof:

go tool pprof ./profile
Type: cpu
Time: Apr 19, 2020 at 9:28pm (+03)
Duration: 5.40s, Total samples = 400ms ( 7.41%)
Entering interactive mode (type "help" for commands, "o" for options)

We can render the data as an image:

(pprof) web

This requires the graphviz tool to be installed.

As a result, I got a graph of functions, where the functions that take the most time are drawn larger than the others.

As we can see, most of the time is spent in html/template package calls, which leads us into the Go sources.
That is the profiler's standard behaviour — it doesn't matter whether a function comes from application code or from the Go sources.

Let's get back to the pprof terminal and print the top 15 functions by time spent:

(pprof) top15
Showing nodes accounting for 0, 0% of 400ms total
Showing top 15 nodes out of 147
flat  flat%   sum%        cum   cum%
 0     0%     0%      260ms 65.00%  net/http.(*conn).serve
 0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).ServeHTTP
 0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).ServeHTTP.func1
 0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*Echo).add.func1
 0     0%     0%      240ms 60.00%  github.com/labstack/echo.(*context).Render
 0     0%     0%      240ms 60.00%  github.com/labstack/echo/middleware.AddTrailingSlashWithConfig.func1.1
 0     0%     0%      240ms 60.00%  github.com/labstack/echo/middleware.CORSWithConfig.func1.1
 0     0%     0%      240ms 60.00%  net/http.serverHandler.ServeHTTP
 0     0%     0%      220ms 55.00%  html/template.(*Template).Execute
 0     0%     0%      220ms 55.00%  text/template.(*Template).Execute
 0     0%     0%      220ms 55.00%  text/template.(*Template).execute
 0     0%     0%      220ms 55.00%  text/template.(*state).walk
 0     0%     0%      220ms 55.00%  text/template.(*state).walkTemplate

Template rendering is also at the top.

We can print the source of any function to see which part of it takes how much time:

(pprof) list text/template.\(\*state\).walk$
Total: 400ms
ROUTINE ======================== text/template.(*state).walk in /usr/local/Cellar/go/1.14/libexec/src/text/template/exec.go
         0      830ms (flat, cum) 207.50% of Total
         .          .    250:	s.at(node)
         .          .    251:	switch node := node.(type) {
         .          .    252:	case *parse.ActionNode:
         .          .    253:		// Do not pop variables so they persist until next end.
         .          .    254:		// Also, if the action declares variables, don't print the result.
         .      160ms    255:		val := s.evalPipeline(dot, node.Pipe)
         .          .    256:		if len(node.Pipe.Decl) == 0 {
         .       20ms    257:			s.printValue(node, val)
         .          .    258:		}
         .          .    259:	case *parse.IfNode:
         .       40ms    260:		s.walkIfOrWith(parse.NodeIf, dot, node.Pipe, node.List, node.ElseList)
         .          .    261:	case *parse.ListNode:
         .          .    262:		for _, node := range node.Nodes {
         .      220ms    263:			s.walk(dot, node)
         .          .    264:		}
         .          .    265:	case *parse.RangeNode:
         .      170ms    266:		s.walkRange(dot, node)
         .          .    267:	case *parse.TemplateNode:
         .      220ms    268:		s.walkTemplate(dot, node)
         .          .    269:	case *parse.TextNode:
         .          .    270:		if _, err := s.wr.Write(node.Text); err != nil {
         .          .    271:			s.writeError(err)
         .          .    272:		}
         .          .    273:	case *parse.WithNode:

Conclusions:

  • there is recursion here (line 263);
  • there are reflect package calls, which are considered to be "heavy".

I can't change the Go source code.
But I can change my service code that calls the Go standard packages.
Currently my service renders the template on every client request. But the templates themselves are static and constant.
Only the output of the template call depends on the request data.

I'm going to cache the template call result — rendering will be executed once for each combination of request data.
On the first request there is no data in the cache, so the template call is executed and its result is saved into the cache under a key generated from the request data.
On every request the service checks whether there is data in the cache for that key, and uses the cached data if there is. A minimal sketch of this approach is shown below.
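
Here is such a sketch — assuming a hypothetical pageData struct standing in for the real request data, a sync.Map as the cache, and rendering into a bytes.Buffer:

package main

import (
	"bytes"
	"fmt"
	"html/template"
	"sync"
)

// The template is static and constant, so it is parsed once.
var tmpl = template.Must(template.New("page").Parse(`<h1>Hello, {{.Name}}!</h1>`))

// renderCache maps a key derived from the request data to the rendered bytes.
var renderCache sync.Map

type pageData struct {
	Name string
}

// renderCached executes the template only on a cache miss; subsequent requests
// with the same data get the cached result.
func renderCached(data pageData) ([]byte, error) {
	key := data.Name // key generated from the request data

	if cached, ok := renderCache.Load(key); ok {
		return cached.([]byte), nil
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return nil, err
	}
	renderCache.Store(key, buf.Bytes())
	return buf.Bytes(), nil
}

func main() {
	out, _ := renderCached(pageData{Name: "world"}) // first call renders and caches
	fmt.Println(string(out))
	out, _ = renderCached(pageData{Name: "world"}) // second call hits the cache
	fmt.Println(string(out))
}

If two concurrent first requests arrive with the same key, both may render once and the later Store wins; that is harmless here because the rendered output is deterministic.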

I changed the service code and am going to run the profiling again. The results are below:

(pprof) top15
Showing nodes accounting for 90ms, 50.00% of 180ms total
Showing top 15 nodes out of 78
  flat  flat%   sum%        cum   cum%
         0     0%     0%      100ms 55.56%  net/http.(*conn).serve
      90ms 50.00% 50.00%       90ms 50.00%  syscall.syscall
         0     0% 50.00%       60ms 33.33%  bufio.(*Writer).Flush
         0     0% 50.00%       60ms 33.33%  internal/poll.(*FD).Write
         0     0% 50.00%       60ms 33.33%  net.(*conn).Write
         0     0% 50.00%       60ms 33.33%  net.(*netFD).Write
         0     0% 50.00%       60ms 33.33%  net/http.(*response).finishRequest
         0     0% 50.00%       60ms 33.33%  net/http.checkConnErrorWriter.Write
         0     0% 50.00%       60ms 33.33%  syscall.Write
         0     0% 50.00%       60ms 33.33%  syscall.write
         0     0% 50.00%       40ms 22.22%  runtime.findrunnable
         0     0% 50.00%       40ms 22.22%  runtime.mcall
         0     0% 50.00%       40ms 22.22%  runtime.schedule
         0     0% 50.00%       30ms 16.67%  github.com/labstack/echo.(*Echo).Start

As we can see, syscall writes are at the top now, and template calls are no longer in the top.

This is perfect! The goal of our changes is to make the application-level work as fast as possible, so that the basic OS operations — reading and sending data over the network — end up at the top.

Now let's compare the ab results — the "Requests per second" row:
Before: Requests per second: 14.13 [#/sec] (mean)
After: Requests per second: 638.38 [#/sec] (mean)

As we can see, the service is performing much faster now.

Conclusion

One can also open the net/http/pprof endpoint in a browser. It is the same endpoint we curl'ed to get the profiling data. The page looks as follows:

/debug/pprof/

Types of profiles available:
Count	Profile
9	allocs
0	block
0	cmdline
9	goroutine
9	heap
0	mutex
0	profile
18	threadcreate
0	trace

We can grab some other metrics as well:

  • memory allocation in every function;
  • sync.Mutex calls;
  • new thread creation;
  • and many others.

The most useful here are the CPU and memory profiles.
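
For example, a heap profile can be fetched and analyzed in exactly the same way as the CPU profile:

curl http://localhost:6060/debug/pprof/heap > ./heap
go tool pprof ./heap

In the interactive mode, top then shows the functions that account for the most memory.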

Remote profiling

The pprof profiling endpoint we enabled above can also be made accessible from the external network:

// the server listens on every network interface (not only localhost), port 6060
log.Println(http.ListenAndServe(":6060", nil))

But one should be careful to restrict access to developers only!

It is very useful to profile production services, but a profiling endpoint available to external client requests is a vulnerability — profiler calls slow down the service.
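
One way to keep the endpoint away from external clients is to register the pprof handlers on a dedicated mux bound to an internal-only address. A sketch; the 10.0.0.5 address is just a hypothetical internal interface:

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	// Dedicated mux, so the pprof handlers are not attached to the public server's mux.
	// Note: importing net/http/pprof still registers handlers on http.DefaultServeMux
	// via init(), so the public server should use its own mux, not the default one.
	debugMux := http.NewServeMux()
	debugMux.HandleFunc("/debug/pprof/", pprof.Index)
	debugMux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	debugMux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	debugMux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	debugMux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	// Bind the profiling server to the internal interface only.
	go func() {
		log.Println(http.ListenAndServe("10.0.0.5:6060", debugMux))
	}()

	// ... the public service server would be started here ...
	select {}
}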

What's new in the profiling topic

The approach of exposing profiling data covers practically every situation and is unlikely to change.

It is the analysis of the data that can be improved. A lot of visualization tools have been released.

To me, the most helpful is the flame graph.
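
For instance, recent versions of go tool pprof can serve an interactive web UI for a profile, no separate tool needed:

go tool pprof -http=:8081 ./profile

This opens a browser page where a flame graph view is available alongside the graph and top views.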

There are also capabilities for displaying Go profiling data in the GoLand IDE.
But go tool pprof itself provides rich filtering and sorting functionality. That functionality is not covered in this article; there are some examples in the Go blog.
