Jun 22, 2025

Go Profiling 教程：以优化 VictoriaMetrics 为例

VictoriaMetrics 是一个全球化的团队，我们的同事居家办公于世界各地，所以大部分技术交流都在线上进行。这篇博客是团队内一次 Go Profiling 教学的实录，来自于 Aliaksandr Valialkin（GitHub ID: valyala）。

在 VictoriaMetrics 开发者笔记说到，我们调研以 VictoriaLogs 为基础构建 Distributed Tracing 存储，效果还算理想，不过它的性能还没有经过任何优化。

现在，我们以 commit 42c931b 为基础，介绍如何对它进行 Profiling 和优化。截至这个 commit，VictoriaLogs 已经提供了 /insert/opentelemetry/v1/traces 接口，接收 OTLP 请求，写入 Traces 数据。我们的目标是对这个接口的性能进行优化。

这会是一次完全可重复、可实操的课程，我们欢迎读者坐到电脑前，动起手来一起进行尝试。

1. Preparation

我们假定你已经有一台带有 Go 开发环境的电脑，并且已经有准备好了 pprof。你可以通过如下命令检查它们：

# jiekun@jiekun:~/repo$ go version
go version go1.24.4 linux/amd64
# jiekun@jiekun:~/repo$ go tool | grep pprof
pprof

接下来，Clone VictoriaMetrics 项目，切换到待优化的 commit，并尝试启动 VictoriaLogs：

# jiekun@jiekun:~/repo$ git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
# jiekun@jiekun:~/repo$ cd VictoriaMetrics
# jiekun@jiekun:~/repo/VictoriaMetrics$ git checkout 42c931b8cca12964cb8d7269c59a8756ee9b21cd

# jiekun@jiekun:~/repo/VictoriaMetrics$ go run app/victoria-logs/main.go

如果一切顺利，你将会看到一些启动的日志输出，显示它监听在 :9428 端口，并提示你它的 pprof 地址：

2025-06-22T05:47:15.818Z        info    lib/httpserver/httpserver.go:144        started server at http://0.0.0.0:9428/
2025-06-22T05:47:15.818Z        info    lib/httpserver/httpserver.go:146        pprof handlers are exposed at http://0.0.0.0:9428/debug/pprof/

最后，还需要准备一个 Traces 数据的生成器，给 VictoriaLogs 施加压力。在这一步我们有很多选择，包括：

这里我们选择使用 tracegen，它的运行需要 Docker，这个 docker-compose.yaml 将与宿主机共用网络运行 tracegen，并且启用 4 个 Worker 持续 5 分钟向 VictoriaLogs 发送 Traces 数据。

services:
  tracegen:
    network_mode: "host"
    image: jaegertracing/jaeger-tracegen:latest@sha256:5beb38414f914a1894e72997dc2da491a32590a8e4bb59a29e9c3e1e182a7bc4
    environment:
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://127.0.0.1:9428/insert/opentelemetry/v1/traces
    command: ["-duration", "300s", "-workers", "4", "-pause", "1ms"]

在执行 docker compose up 后，如果没有意外，你将能从 VictoriaLogs 的 UI 上观察到有数据涌入：

准备阶段到此就告一段落了，接下来我们开始采集 Profiles 进行分析。

2. Profiling & Optimization

我们需要通过 pprof 接口获取 heap 和 profile，其中，heap 是实时的内存分配情况，而 profile 指的是 cpu profile，是描述一段时间内 cpu 都在做些什么。

我们应该确保 VictoriaLogs 正在接收来自 tracegen 的请求，然后采集 profile，在 profile 结束时再采集瞬时的 heap:

# jiekun@jiekun:~/repo$ curl http://127.0.0.1:9428/debug/pprof/profile > cpu.profile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10524    0 10524    0     0    350      0 --:--:--  0:00:30 --:--:--  2647

# jiekun@jiekun:~/repo$ curl http://127.0.0.1:9428/debug/pprof/heap > mem.heap
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29896    0 29896    0     0  9581k      0 --:--:-- --:--:-- --:--:-- 9731k

接下来我们看看这些文件都能给我们什么提示，先从 cpu profile 开始。执行 pprof cpu.profile，然后输入 top 回车：

# jiekun@jiekun:~/repo$ pprof cpu.profile 
File: main
Build ID: e5619416752e0f6f0534ed55ecd5e80c2877a7a7
Type: cpu
Time: 2025-06-22 15:12:28 CST
Duration: 30.07s, Total samples = 5.05s (16.80%)
Entering interactive mode (type "help" for commands, "o" for options)

# (pprof) top
Showing nodes accounting for 2410ms, 47.72% of 5050ms total
Dropped 205 nodes (cum <= 25.25ms)
Showing top 10 nodes out of 233
      flat  flat%   sum%        cum   cum%
     560ms 11.09% 11.09%      580ms 11.49%  runtime.cgocall
     500ms  9.90% 20.99%      500ms  9.90%  runtime.memclrNoHeapPointers
     290ms  5.74% 26.73%      290ms  5.74%  runtime.memmove
     240ms  4.75% 31.49%      760ms 15.05%  runtime.scanobject
     170ms  3.37% 34.85%      230ms  4.55%  runtime.typePointers.next
     160ms  3.17% 38.02%      160ms  3.17%  github.com/cespare/xxhash/v2.Sum64
     150ms  2.97% 40.99%      200ms  3.96%  runtime.findObject
     130ms  2.57% 43.56%      130ms  2.57%  runtime.madvise
     110ms  2.18% 45.74%      110ms  2.18%  internal/runtime/syscall.Syscall6
     100ms  1.98% 47.72%      100ms  1.98%  runtime.nextFreeFast

flat: 在这个 function 中执行所耗费的时间。
cum: 在这个 function 及其调用的 function 中执行所耗费的总时间。

关于 `flat` 和 `cum`

我们先借一个例子解释一下 top 中每一列的含义。假设有如下 foo function：

func foo() {
	bar() // foo 调用了 bar，bar 执行耗费 1 秒
	...   // foo 做了一些事情（假定其中不存在进一步的 function 调用），耗费 2 秒
	baz() // foo 调用 baz，baz 执行耗费 3 秒
}

这个例子中，foo 的 flat 就是它做事的 2 秒，而 cum 就是 foo 本身做事的耗时以及它调用其他 function 的耗时，共计 6 秒。

我们可以看到，没有占用大量时间的 low-hanging fruit（容易实现的目标），也就是想通过优化一个 function 就能大幅提升性能是不太可能的。出现最频繁的是 runtime.*，而不是应用的逻辑代码。这说明应用的逻辑已经写得很完美，无须优化。

开玩笑的，我们继续来解释怎么理解这些数字。runtime.cgocall 的 flat% 是 11.09%，说明它的执行时间占了总样本时间（5.05s）的 11.09%。（理论上）如果我们把这个方法优化到 0%，那恭喜你，CPU 使用率最多可以降低 11%。

在 runtime.cgocall 之下的 4 行，是与垃圾回收相关的内容，可以看到它们一共占了 9.9% + 5.74% + 4.75% + 3.37% = 23.76% 时间，类似地，表示优化最多可以使 CPU 使用率降低 23%。

      flat  flat%   sum%        cum   cum%
     500ms  9.90% 20.99%      500ms  9.90%  runtime.memclrNoHeapPointers
     290ms  5.74% 26.73%      290ms  5.74%  runtime.memmove
     240ms  4.75% 31.49%      760ms 15.05%  runtime.scanobject
     170ms  3.37% 34.85%      230ms  4.55%  runtime.typePointers.next

// 施工中

1. Preparation

2. Profiling & Optimization

See Also