perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. The data includes execution
time in cycles. For loads and stores it also includes the data address, cache miss events, and data
origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
perturbation is also added to the sampling interval by default.
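
To illustrate why a perturbation of the interval is useful, the following Python sketch samples a
loop of fixed length at a fixed interval, with and without a random offset. It is a toy model, not a
description of the hardware: the function name, the loop model and the uniform perturbation range
are all assumptions made for illustration.

```python
import random

def sampled_positions(loop_len, interval, n_samples, jitter=0, seed=0):
    """Return the set of loop positions hit by a toy sampler.

    Hypothetical model: each operation in a loop of `loop_len` operations
    counts as one step, and a sample is taken every `interval` steps plus a
    uniform random perturbation in [0, jitter]. Real SPE hardware may
    perturb the interval differently.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    position = 0
    seen = set()
    for _ in range(n_samples):
        step = interval + (rng.randint(0, jitter) if jitter else 0)
        position = (position + step) % loop_len
        seen.add(position)
    return seen

# The interval is a multiple of the loop length, so the same operation is
# hit every time.
fixed = sampled_positions(loop_len=100, interval=1000, n_samples=1000)

# A small perturbation spreads the samples across the loop.
jittered = sampled_positions(loop_len=100, interval=1000, n_samples=1000, jitter=63)

print(len(fixed), len(jittered))
```

In this model the unperturbed sampler only ever observes one position in the loop, while the
perturbed one observes many.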

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.
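
The keep-or-discard decision can be pictured as a predicate applied to each record. The Python
sketch below is a simplified model only: the `Record` type, its field names and `keep_record` are
hypothetical, it handles a single operation-type filter, and it assumes that every event bit set in
the filter mask must also be set in the record and that the latency threshold is inclusive.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    op_type: str   # "load", "store", "branch" or "other" (illustrative labels)
    events: int    # bitmask of events observed for the operation
    latency: int   # total latency in cycles

def keep_record(rec: Record, op_filter: Optional[str] = None,
                event_mask: int = 0, min_latency: int = 0) -> bool:
    """Return True if the record passes all configured filters (toy model)."""
    # Operation-type filter: when set, only that type of operation is kept.
    if op_filter is not None and rec.op_type != op_filter:
        return False
    # Event filter (assumed semantics): every bit in the mask must be set.
    if (rec.events & event_mask) != event_mask:
        return False
    # Latency filter: keep records at or above the threshold.
    return rec.latency >= min_latency

records = [
    Record("load", events=0b1010, latency=40),
    Record("load", events=0b0010, latency=5),
    Record("store", events=0b1010, latency=80),
]
kept = [r for r in records if keep_record(r, op_filter="load", min_latency=10)]
print(len(kept))
```

With a load filter and a minimum latency of 10, only the first record survives: the second is below
the latency threshold and the third is a store.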

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data has been done by either the kernel or Perf. Decoding
happens only when the recorded file is opened with 'perf report' or 'perf script'. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling had been done by Perf without using SPE,
although they may have more attributes associated with them. For example, a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
   hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution of data, including the full PC of the instruction and the data
   virtual and physical addresses.

 - Allows correlation between an instruction and events, such as TLB and cache misses. (Data source
   indicates which particular cache was hit, but the meaning is implementation defined because
   different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
dropped samples that would have made it through the filter. It can still serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.
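
As a rough illustration of such weighting, the Python sketch below divides raw sample counts by the
number of micro-operations each instruction is split into, and estimates the average number of
sampled operations per retired instruction from the two PMU counts. The function names and all
figures are made up for illustration; real micro-operation counts are implementation specific and
have to come from knowledge of the particular CPU.

```python
def weight_samples(raw_counts, uops_per_instruction):
    """Scale raw sample counts down by the micro-op count of each instruction.

    Instructions missing from `uops_per_instruction` are assumed to map to a
    single operation.
    """
    return {
        insn: count / uops_per_instruction.get(insn, 1)
        for insn, count in raw_counts.items()
    }

def ops_per_instruction(sample_pop, inst_retired):
    """Coarse estimate of sampled operations per retired instruction."""
    return sample_pop / inst_retired

# Instruction A is split into two micro-ops (A0, A1), so it shows up twice as
# often in the raw samples even though A and B execute equally often.
raw = {"A": 200, "B": 100}
weighted = weight_samples(raw, {"A": 2, "B": 1})
print(weighted)

# Hypothetical counter values read from the PMU.
ratio = ops_per_instruction(sample_pop=1500, inst_retired=1000)
print(ratio)
```

After weighting, A and B carry the same weight, which matches how often each instruction actually
executed in this made-up example.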

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled but is still enabled, SPE setup fails with the console
message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set with the -c option. Because the minimum interval is used by default, it is
recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and are comma separated, for example '-e
arm_spe/load_filter=1,min_latency=10/'.

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
than only the execution latency.
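
When perf is driven from a script, the event string can be assembled from the parameters rather than
typed by hand. The Python sketch below shows one way to do that; `spe_event` is a hypothetical
helper, not part of perf, and it does no validation of parameter names or values.

```python
def spe_event(**params):
    """Build an arm_spe event string from keyword parameters.

    Parameters are joined with commas and placed between the two slashes,
    in the order they were passed.
    """
    body = ",".join(f"{name}={value}" for name, value in params.items())
    return f"arm_spe/{body}/"

event = spe_event(load_filter=1, min_latency=10)
print(event)

# Argument vector for the recording command, ready for subprocess.run().
argv = ["perf", "record", "-e", event, "--", "./mybench"]
print(" ".join(argv))
```

Calling the helper with no parameters yields the plain `arm_spe//` event.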

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench
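
The mask values used above follow directly from the bit positions. The Python sketch below computes
them; the `EVENT_BITS` table and its key names are illustrative labels for the bits listed in this
section, not identifiers understood by perf.

```python
# Bit positions of the filterable events listed above (key names are illustrative).
EVENT_BITS = {
    "retired": 1,
    "l1d_refill": 3,
    "tlb_refill": 5,
    "mispredict": 7,
    "misaligned": 11,
}

def event_filter_mask(*names):
    """Return the event_filter mask with the bit for each named event set."""
    mask = 0
    for name in names:
        mask |= 1 << EVENT_BITS[name]
    return mask

print(hex(event_filter_mask("retired")))                # 0x2
print(hex(event_filter_mask("mispredict")))             # 0x80
print(hex(event_filter_mask("retired", "l1d_refill")))  # 0xa
```

The first two results match the 'event_filter=2' and 'event_filter=0x80' examples. The third shows a
mask with two bits set; how the hardware combines several bits is not covered here and should be
checked against the architecture documentation.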

Viewing the data
~~~~~~~~~~~~~~~~

By default 'perf report' and 'perf script' assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique. For example, 'perf report' shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled (see above), or running on a VM.

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets, but these only increase the accuracy of
   assigning PIDs to kernel samples. For userspace sampling this can be ignored.

 - Excessively large perf.data file size

   Increase the sampling interval (see above).


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]