github.com/hodgesds/perf-utils (Go package)
Version: 0.7.0
Repository: https://github.com/hodgesds/perf-utils.git
Documentation: pkg.go.dev

# README

Perf


This package is a Go library for interacting with the perf subsystem in Linux. I had trouble finding a Go perf library, so I decided to write one, using Linux's perf tool as a reference. This library allows you to do things like see (roughly) how many CPU instructions a function takes, profile a process for various hardware events, and other interesting things.

Because the Go scheduler can run a goroutine on many different OS threads, it is rather difficult to get an exact profile of an individual goroutine. However, a few tricks help. First, call runtime.LockOSThread to lock the current goroutine to an OS thread. Second, call unix.SchedSetaffinity with a CPU set mask. Note that if the pid argument is set to 0, the calling thread is used (the thread that was just locked).

Before using this library you should probably read the perf_event_open man page, which this library uses heavily. See this kernel guide for a tutorial on how to use perf and some of its limitations.

Use Cases

If you are looking to interact with the perf subsystem directly through the perf_event_open syscall, then this library is most likely for you. Many of the utility methods in this package should only be used for testing and/or debugging performance issues. The reason is that the Go runtime is extremely tricky to profile at the goroutine level, except for a long-running worker goroutine locked to an OS thread. Eventually this library could be used to implement many of the features of perf, but in pure Go. Currently this library is used in node_exporter as well as perf_exporter, which is a Prometheus exporter for perf-related metrics.

Caveats

  • Some utility functions call runtime.LockOSThread for you and unlock the thread again after profiling. Note that these utility functions incur significant overhead (~0.4ms per call).
  • Overflow handling is not implemented.

Setup

Most likely you will need to tweak some system settings unless you are running as root. From man perf_event_open:

   perf_event related configuration files
       Files in /proc/sys/kernel/

           /proc/sys/kernel/perf_event_paranoid
                  The perf_event_paranoid file can be set to restrict access to the performance counters.

                  2   allow only user-space measurements (default since Linux 4.6).
                  1   allow both kernel and user measurements (default before Linux 4.6).
                  0   allow access to CPU-specific data but not raw tracepoint samples.
                  -1  no restrictions.

                  The existence of the perf_event_paranoid file is the official method for determining if a kernel supports perf_event_open().

           /proc/sys/kernel/perf_event_max_sample_rate
                  This sets the maximum sample rate.  Setting this too high can allow users to sample at a rate that impacts overall machine performance and potentially lock up the machine.  The default value is 100000 (samples per second).

           /proc/sys/kernel/perf_event_max_stack
                  This file sets the maximum depth of stack frame entries reported when generating a call trace.

           /proc/sys/kernel/perf_event_mlock_kb
                  Maximum number of pages an unprivileged user can mlock(2).  The default is 516 (kB).

Example

Say you wanted to see how many CPU instructions a particular function took:

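package main

import (
	"fmt"
	"log"

	perf "github.com/hodgesds/perf-utils"
)

// sink keeps foo's result observable so the loop is not optimized away.
var sink int

// foo is the function being measured.
func foo() error {
	var total int
	for i := 0; i < 1000; i++ {
		total++
	}
	sink = total
	return nil
}

func main() {
	// CPUInstructions runs foo and returns the number of CPU instructions.
	profileValue, err := perf.CPUInstructions(foo)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("CPU instructions: %+v\n", profileValue)
}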
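The value printed above is a ProfileValue. When more events are open than the PMU has counters, the kernel multiplexes them. perf_event_open(2) describes estimating the full count as value * time_enabled / time_running in that case. The sketch below assumes ProfileValue carries Value, TimeEnabled and TimeRunning fields, so check the struct in your version. profileValue and scaled are local, hypothetical stand-ins.

```go
package main

import "fmt"

// profileValue mirrors the shape assumed for perf.ProfileValue: a raw count
// plus the time_enabled / time_running values from the perf read format.
type profileValue struct {
	Value       uint64
	TimeEnabled uint64
	TimeRunning uint64
}

// scaled estimates the count over the whole enabled interval when the
// counter was multiplexed. ok is false if the counter never ran.
func scaled(p profileValue) (estimate float64, ok bool) {
	if p.TimeRunning == 0 {
		return 0, false
	}
	if p.TimeRunning == p.TimeEnabled {
		return float64(p.Value), true // not multiplexed: use the raw count
	}
	return float64(p.Value) * float64(p.TimeEnabled) / float64(p.TimeRunning), true
}

func main() {
	// Hypothetical reading: the counter ran for half of its enabled time.
	p := profileValue{Value: 1000, TimeEnabled: 2000, TimeRunning: 1000}
	if est, ok := scaled(p); ok {
		fmt.Printf("estimated count: %.0f\n", est) // prints "estimated count: 2000"
	}
}
```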

Benchmarks

To profile a single function call there is an overhead of ~0.4ms.

$ go test  -bench=BenchmarkCPUCycles .
goos: linux
goarch: amd64
pkg: github.com/hodgesds/perf-utils
BenchmarkCPUCycles-8        3000            397924 ns/op              32 B/op          1 allocs/op
PASS
ok      github.com/hodgesds/perf-utils  1.255s

The Profiler interface has low overhead and is suitable for many use cases:

$ go test  -bench=BenchmarkProfiler .
goos: linux
goarch: amd64
pkg: github.com/hodgesds/perf-utils
BenchmarkProfiler-8      3000000               488 ns/op              32 B/op          1 allocs/op
PASS
ok      github.com/hodgesds/perf-utils  1.981s

The RunBenchmarks helper function can be used to run a function as a benchmark and report results from PerfEventAttrs:

package perf

import (
	"testing"

	"golang.org/x/sys/unix"
)

func BenchmarkRunBenchmarks(b *testing.B) {
	eventAttrs := []unix.PerfEventAttr{
		CPUInstructionsEventAttr(),
		CPUCyclesEventAttr(),
	}
	RunBenchmarks(
		b,
		func(b *testing.B) {
			// Run exactly b.N iterations of the measured loop.
			for n := 0; n < b.N; n++ {
				a := 42
				for i := 0; i < 1000; i++ {
					a += i
				}
			}
		},
		BenchLock|BenchStrict,
		eventAttrs...,
	)
}

$ go test -bench=BenchmarkRunBenchmarks
goos: linux
goarch: amd64
pkg: github.com/hodgesds/iouring-go/go/src/github.com/hodgesds/perf-utils
BenchmarkRunBenchmarks-8         3119304               388 ns/op              1336 hw_cycles/op             3314 hw_instr/op            0 B/op          0 allocs/op
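Columns such as hw_cycles/op and hw_instr/op have the shape of custom benchmark metrics. The standard library supports those through testing.B.ReportMetric (Go 1.13+). Whether RunBenchmarks uses exactly this mechanism is an assumption. The stdlib-only sketch below reports a hypothetical calls/op metric. It runs the benchmark with testing.Benchmark outside go test, which requires calling testing.Init first.

```go
package main

import (
	"fmt"
	"testing"
)

// calls counts invocations of work; a stand-in for a real event counter.
var calls int

func work() { calls++ }

// benchWork reports a custom per-op column, in the same style as hw_cycles/op.
func benchWork(b *testing.B) {
	calls = 0
	for n := 0; n < b.N; n++ {
		work()
	}
	// Per-iteration metrics must be divided by b.N by the caller.
	b.ReportMetric(float64(calls)/float64(b.N), "calls/op")
}

func main() {
	testing.Init() // registers testing flags; needed outside "go test"
	res := testing.Benchmark(benchWork)
	fmt.Println(res.String())
}
```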

If you want to benchmark tracepoints (i.e. the events listed by perf list or in /sys/kernel/debug/tracing/available_events), you can use the BenchmarkTracepoints helper:

package perf

import (
	"testing"

	"golang.org/x/sys/unix"
)

func BenchmarkBenchmarkTracepoints(b *testing.B) {
	tracepoints := []string{
		"syscalls:sys_enter_getrusage",
	}
	BenchmarkTracepoints(
		b,
		func(b *testing.B) {
			// Each iteration issues one getrusage(2) syscall.
			for n := 0; n < b.N; n++ {
				unix.Getrusage(unix.RUSAGE_SELF, &unix.Rusage{})
			}
		},
		BenchLock|BenchStrict,
		tracepoints...,
	)
}

$ go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/hodgesds/perf-utils
BenchmarkProfiler-8                              1983320               596 ns/op              32 B/op          1 allocs/op
BenchmarkCPUCycles-8                                2335            484068 ns/op              32 B/op          1 allocs/op
BenchmarkThreadLocking-8                        253319848                4.70 ns/op            0 B/op          0 allocs/op
BenchmarkRunBenchmarks-8                         1906320               627 ns/op              1023 hw_cycles/op       3007 hw_instr/op
BenchmarkRunBenchmarksLocked-8                   1903527               632 ns/op              1025 hw_cycles/op       3007 hw_instr/op
BenchmarkBenchmarkTracepointsLocked-8             986607              1221 ns/op                 2.00 syscalls:sys_enter_getrusage/op          0 B/op          0 allocs/op
BenchmarkBenchmarkTracepoints-8                   906022              1258 ns/op                 2.00 syscalls:sys_enter_getrusage/op          0 B/op          0 allocs/op

BPF Support

BPF is supported through the BPFProfiler, which is available via the ProfileTracepoint function. To use BPF, first create the BPF program, then call AttachBPF with the BPF program's file descriptor.

Misc

Originally I set out to use go generate to build Go structs that were compatible with perf, and I found a really good article on how to do so. Eventually, after digging through some of the x/sys/unix code, I found pretty much what I needed. However, if you are interested in interacting with the kernel, I think the article is a worthwhile read.


# Functions

AlignmentFaults is used to profile a function and return the number of alignment faults.
AlignmentFaultsEventAttr returns a unix.PerfEventAttr configured for AlignmentFaults.
AvailableEvents returns a mapping of available subsystems and their corresponding list of available events.
AvailablePMUs returns a mapping of available PMUs from /sys/bus/event_source/devices to the PMU event type (number).
AvailableSubsystems returns a slice of available subsystems.
AvailableTracers returns the list of available tracers.
BenchmarkTracepoints runs a benchmark and counts the specified tracepoint events.
BPU is used to profile a function for the Branch Predictor Unit.
BPUEventAttr returns a unix.PerfEventAttr configured for BPU events.
BusCycles is used to profile a function and return the number of bus cycles.
BusCyclesEventAttr returns a unix.PerfEventAttr configured for BusCycles.
CacheMiss is used to profile a function and return the number of cache misses.
CacheMissEventAttr returns a unix.PerfEventAttr configured for CacheMisses.
CacheRef is used to profile a function and return the number of cache references.
CacheRefEventAttr returns a unix.PerfEventAttr configured for CacheRef.
ContextSwitches is used to profile a function and return the number of context switches.
ContextSwitchesEventAttr returns a unix.PerfEventAttr configured for ContextSwitches.
CPUClock is used to profile a function and return the CPU clock timer.
CPUClockEventAttr returns a unix.PerfEventAttr configured for CPUClock.
CPUCycles is used to profile a function and return the number of CPU cycles.
CPUCyclesEventAttr returns a unix.PerfEventAttr configured for CPUCycles.
CPUInstructions is used to profile a function and return the number of CPU instructions.
CPUInstructionsEventAttr returns a unix.PerfEventAttr configured for CPUInstructions.
CPUMigrations is used to profile a function and return the number of times the thread has been migrated to a new CPU.
CPUMigrationsEventAttr returns a unix.PerfEventAttr configured for CPUMigrations.
CPURefCycles is used to profile a function and return the number of CPU reference cycles, which are not affected by frequency scaling.
CPURefCyclesEventAttr returns a unix.PerfEventAttr configured for CPURefCycles.
CPUTaskClock is used to profile a function and return the CPU clock timer for the running task.
CPUTaskClockEventAttr returns a unix.PerfEventAttr configured for CPUTaskClock.
CurrentTracer returns the current tracer.
DataTLB is used to profile the data TLB.
DataTLBEventAttr returns a unix.PerfEventAttr configured for DataTLB.
DebugFSMount returns the first found mount point of a debugfs file system.
EmulationFaults is used to profile a function and return the number of emulation faults.
EmulationFaultsEventAttr returns a unix.PerfEventAttr configured for EmulationFaults.
EventAttrString returns a short string representation of a unix.PerfEventAttr.
GetFSMount is a helper function to get the mount point of a file system type.
GetTracepointConfig is used to get the configuration for a trace event.
InstructionTLB is used to profile the instruction TLB.
InstructionTLBEventAttr returns a unix.PerfEventAttr configured for InstructionTLB.
L1Data is used to profile a function for L1 data cache faults.
L1DataEventAttr returns a unix.PerfEventAttr configured for L1Data.
L1Instructions is used to profile a function for the instruction level L1 cache.
L1InstructionsEventAttr returns a unix.PerfEventAttr configured for L1Instructions.
LLCache is used to profile a function and return the number of last-level cache events, using PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_OP_WRITE, or PERF_COUNT_HW_CACHE_OP_PREFETCH for the op and PERF_COUNT_HW_CACHE_RESULT_ACCESS or PERF_COUNT_HW_CACHE_RESULT_MISS for the result.
LLCacheEventAttr returns a unix.PerfEventAttr configured for LLCache.
LockThread locks a goroutine to an OS thread and then sets the affinity of the thread to a processor core.
MajorPageFaults is used to profile a function and return the number of major page faults.
MajorPageFaultsEventAttr returns a unix.PerfEventAttr configured for MajorPageFaults.
MaxOpenFiles returns the RLIMIT_NOFILE from getrlimit.
MinorPageFaults is used to profile a function and return the number of minor page faults.
MinorPageFaultsEventAttr returns a unix.PerfEventAttr configured for MinorPageFaults.
MSRPaths returns the set of MSR paths.
MSRs attempts to return all available MSRs.
NewAlignFaultsProfiler returns a Profiler that profiles the number of alignment faults.
NewBPUProfiler returns a Profiler that profiles the BPU (branch prediction unit).
NewBranchInstrProfiler returns a Profiler that profiles branch instructions.
NewBranchMissesProfiler returns a Profiler that profiles branch misses.
NewBusCyclesProfiler returns a Profiler that profiles bus cycles.
NewCacheMissesProfiler returns a Profiler that profiles cache misses.
NewCacheProfiler returns a new cache profiler.
NewCacheRefProfiler returns a Profiler that profiles cache references.
NewCPUClockProfiler returns a Profiler that profiles the CPU clock (a high-resolution per-CPU timer).
NewCPUCycleProfiler returns a Profiler that profiles CPU cycles.
NewCPUMigrationsProfiler returns a Profiler that profiles the number of times the process has migrated to a new CPU.
NewCtxSwitchesProfiler returns a Profiler that profiles the number of context switches.
NewDataTLBProfiler returns a Profiler that profiles the data TLB.
NewEmulationFaultsProfiler returns a Profiler that profiles the number of emulation faults.
NewGroupProfiler returns a GroupProfiler.
NewHardwareProfiler returns a new hardware profiler.
NewInstrProfiler returns a Profiler that profiles CPU instructions.
NewInstrTLBProfiler returns a Profiler that profiles the instruction TLB.
NewL1DataProfiler returns a Profiler that profiles L1 cache data.
NewL1InstrProfiler returns a Profiler that profiles L1 instruction data.
NewLLCacheProfiler returns a Profiler that profiles last level cache.
NewMajorFaultsProfiler returns a Profiler that profiles the number of major page faults.
NewMinorFaultsProfiler returns a Profiler that profiles the number of minor page faults.
NewMSR returns a MSR.
NewNodeCacheProfiler returns a Profiler that profiles the node cache accesses.
NewPageFaultProfiler returns a Profiler that profiles the number of page faults.
NewProfiler creates a new hardware profiler.
NewRefCPUCyclesProfiler returns a Profiler that profiles reference CPU cycles, which are not affected by frequency scaling.
NewSoftwareProfiler returns a new software profiler.
NewStalledCyclesBackProfiler returns a Profiler that profiles stalled backend cycles.
NewStalledCyclesFrontProfiler returns a Profiler that profiles stalled frontend cycles.
NewTaskClockProfiler returns a Profiler that profiles clock count of the running task.
NodeCache is used to profile a function for NUMA operations.
NodeCacheEventAttr returns a unix.PerfEventAttr configured for NUMA cache operations.
PageFaults is used to profile a function and return the number of page faults.
PageFaultsEventAttr returns a unix.PerfEventAttr configured for PageFaults.
ProfileTracepoint is used to profile a kernel tracepoint event for a specific PID.
RunBenchmarks runs a series of benchmarks for a set of PerfEventAttrs.
StalledBackendCycles is used to profile a function and return the number of stalled backend cycles.
StalledBackendCyclesEventAttr returns a unix.PerfEventAttr configured for StalledBackendCycles.
StalledFrontendCycles is used to profile a function and return the number of stalled frontend cycles.
StalledFrontendCyclesEventAttr returns a unix.PerfEventAttr configured for StalledFrontendCycles.
TraceFSMount returns the first found mount point of a tracefs file system.
TracepointEventAttr is used to return a PerfEventAttr for a trace event.

# Constants

AllCacheProfilers is used to try to configure all cache profilers.
BenchLock is used to lock a benchmark to a goroutine.
BenchStrict is used to fail a benchmark if one or more events cannot be profiled.
BPUReadHit is a constant...
BPUReadMiss is a constant...
DataTLBReadHit is a constant...
DataTLBReadMiss is a constant...
DataTLBWriteHit is a constant...
DataTLBWriteMiss is a constant...
DebugFS is the filesystem type for debugfs.
InstrTLBReadHit is a constant...
InstrTLBReadMiss is a constant...
L1DataReadHit is a constant...
L1DataReadMiss is a constant...
L1DataWriteHit is a constant...
L1InstrReadHit is a constant...
L1InstrReadMiss is a constant...
LLReadHit is a constant...
LLReadMiss is a constant...
LLWriteHit is a constant...
LLWriteMiss is a constant...
MSRBaseDir is the base dir for MSRs.
NodeCacheReadHit is a constant...
NodeCacheReadMiss is a constant...
NodeCacheWriteHit is a constant...
NodeCacheWriteMiss is a constant...
PERF_IOC_FLAG_GROUP is not defined in x/sys/unix.
PERF_SAMPLE_IDENTIFIER is not defined in x/sys/unix.
PERF_TYPE_TRACEPOINT is a kernel tracepoint.
PerfMaxContexts is a sysfs mount that contains the max perf contexts.
PerfMaxStack is the mount point for the max perf event size.
ProcMounts is the mount point for file systems in procfs.
SyscallsDir is a constant of the default tracing event syscalls directory.
TraceFS is the filesystem type for tracefs.
TracingDir is a constant of the default tracing directory.

# Variables

ErrNoLeader is returned when a leader of a GroupProfiler is not defined.
ErrNoMount is returned when there is no such mount.
ErrNoProfiler is returned when no profiler is available for profiling.
EventAttrSize is the size of a PerfEventAttr.
ProfileValuePool is a sync.Pool of ProfileValue structs.
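ProfileValuePool lets callers reuse ProfileValue structs instead of allocating one per read. The usual sync.Pool pattern is sketched below with a local profileValue type. Two points are assumptions to check against your version: that the pool stores *ProfileValue, and that values should be cleared before reuse.

```go
package main

import (
	"fmt"
	"sync"
)

// profileValue is a local stand-in for perf.ProfileValue.
type profileValue struct {
	Value, TimeEnabled, TimeRunning uint64
}

// pool hands out *profileValue, allocating only when it is empty.
var pool = sync.Pool{New: func() any { return new(profileValue) }}

// acquire takes a value from the pool.
func acquire() *profileValue { return pool.Get().(*profileValue) }

// release clears stale counts and returns the value for reuse.
func release(p *profileValue) {
	*p = profileValue{}
	pool.Put(p)
}

func main() {
	p := acquire()
	p.Value = 42 // pretend a profiler filled this in
	fmt.Println("value:", p.Value)
	release(p)
}
```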

# Structs

CacheProfile is returned by a CacheProfiler.
GroupProfileValue is returned from a GroupProfiler.
HardwareProfile is returned by a HardwareProfiler.
MSR represents a Model Specific Register.
ProfileValue is a value returned by a profiler.
SoftwareProfile is returned by a SoftwareProfiler.

# Interfaces

BPFProfiler is a Profiler that allows attaching a Berkeley Packet Filter (BPF) program to an existing kprobe tracepoint event.
CacheProfiler is a cache profiler.
GroupProfiler is used to setup a group profiler.
HardwareProfiler is a hardware profiler.
Profiler is a profiler.
SoftwareProfiler is a software profiler.

# Type aliases

BenchOpt is a benchmark option.