perf-arm-spe(1)
================

NAME
----
perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools

SYNOPSIS
--------
[verse]
'perf record' -e arm_spe//

DESCRIPTION
-----------

The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
events down to individual instructions. Rather than being interrupt-driven, it picks an
instruction to sample and then captures data for it during execution. Data includes execution time
in cycles. For loads and stores it also includes data address, cache miss events, and data origin.

The sampling has 5 stages:

  1. Choose an operation
  2. Collect data about the operation
  3. Optionally discard the record based on a filter
  4. Write the record to memory
  5. Interrupt when the buffer is full

Choose an operation
~~~~~~~~~~~~~~~~~~~

The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice
of all architectural instructions or all micro-ops. Sampling happens at a programmable interval.
The architecture provides a mechanism for the SPE driver to infer the minimum interval at which it
should sample. This minimum interval is used by the driver if no interval is specified. A
pseudo-random perturbation is also added to the sampling interval by default.

Collect data about the operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Program counter, PMU events, timings and data addresses related to the operation are recorded.
Sampling ensures that only one sampled operation is in flight at a time.

Optionally discard the record based on a filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Based on programmable criteria, choose whether to keep the record or discard it. If the record is
discarded then the flow stops here for this sample.

Write the record to memory
~~~~~~~~~~~~~~~~~~~~~~~~~~

The record is appended to a memory buffer.

Interrupt when the buffer is full
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
Perf saves the raw data in the perf.data file.

Opening the file
----------------

Up until this point no decoding of the SPE data was done by either the kernel or Perf.
Only when the
recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
the data, Perf generates "synthetic samples" as if these were generated at the time of the
recording. These samples are the same as if normal sampling was done by Perf without using SPE,
although they may have more attributes associated with them. For example a normal sample may have
just the instruction pointer, but an SPE sample can have data addresses and latency attributes.

Why Sampling?
-------------

 - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
   hardware. Only one sampled operation is in flight at a time.

 - Allows precise attribution data, including: Full PC of instruction, data virtual and physical
   addresses.

 - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source
   indicates which particular cache was hit, but the meaning is implementation defined because
   different implementations can have different cache configurations.)

However, SPE does not provide any call-graph information, and relies on statistical methods.

Collisions
----------

When an operation is sampled while a previous sampled operation has not finished, a collision
occurs. The new sample is dropped.
Collisions affect the integrity of the data, so the sample rate
should be set to avoid collisions.

The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of
samples dropped that would have made it through the filter. It can still serve as a rough guide.

The effect of microarchitectural sampling
-----------------------------------------

If an implementation samples micro-operations instead of instructions, the results of sampling must
be weighted accordingly.

For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
becomes twice as likely to appear in the sample population.

The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
estimated from the 'sample_pop' and 'inst_retired' PMU events.

Kernel Requirements
-------------------

The ARM_SPE_PMU config must be set to build as either a module or statically.

Depending on CPU model, the kernel may need to be booted with page table isolation disabled
(kpti=off). If KPTI needs to be disabled and is not, this will fail with the console message
"profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".
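
The kernel requirements above can be checked from a shell before recording. The following is a
minimal sketch, not part of the perf tools: the helper name spe_config_state is hypothetical, the
config file locations (/proc/config.gz and /boot/config-<release>) are assumptions that vary by
distribution, and the arm_spe_* sysfs naming is the one the SPE driver is expected to use once it
has probed successfully.

```shell
#!/bin/sh
# Sketch only: report how CONFIG_ARM_SPE_PMU is set in kernel config text
# read from stdin. Prints "builtin", "module" or "missing".
# spe_config_state is a hypothetical helper name, not a perf command.
spe_config_state() {
    case "$(grep '^CONFIG_ARM_SPE_PMU=' | head -n 1)" in
        CONFIG_ARM_SPE_PMU=y) echo builtin ;;
        CONFIG_ARM_SPE_PMU=m) echo module ;;
        *) echo missing ;;
    esac
}

# Assumed config locations; which one exists depends on the distribution.
if [ -r /proc/config.gz ]; then
    zcat /proc/config.gz | spe_config_state
elif [ -r "/boot/config-$(uname -r)" ]; then
    spe_config_state < "/boot/config-$(uname -r)"
else
    echo "kernel config not found"
fi

# After a successful probe the PMU is expected to appear in sysfs (for example arm_spe_0).
ls -d /sys/bus/event_source/devices/arm_spe_* 2>/dev/null || echo "no arm_spe PMU registered"
```

If the config reports "module" or "builtin" but no arm_spe PMU is registered, the kernel log is the
next place to look, for example for the kpti=off message quoted above.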

Capturing SPE with perf command-line tools
------------------------------------------

You can record a session with SPE samples:

  perf record -e arm_spe// -- ./mybench

The sample period is set from the -c option, and because the minimum interval is used by default
it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.

Config parameters
~~~~~~~~~~~~~~~~~

These are placed between the // in the event and comma separated. For example '-e
arm_spe/load_filter=1,min_latency=10/'

  branch_filter=1     - collect branches only (PMSFCR.B)
  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
  jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
  load_filter=1       - collect loads only (PMSFCR.LD)
  min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
  pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
  pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
  store_filter=1      - collect stores only (PMSFCR.ST)
  ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)

+++*+++ Latency is the total latency from the point at which sampling started
on that instruction, rather
than only the execution latency.

Only some events can be filtered on; these include:

  bit 1     - instruction retired (i.e. omit speculative instructions)
  bit 3     - L1D refill
  bit 5     - TLB refill
  bit 7     - mispredict
  bit 11    - misaligned access

So to sample just retired instructions:

  perf record -e arm_spe/event_filter=2/ -- ./mybench

or just mispredicted branches:

  perf record -e arm_spe/event_filter=0x80/ -- ./mybench

Viewing the data
~~~~~~~~~~~~~~~~

By default perf report and perf script will assign samples to separate groups depending on the
attributes/events of the SPE record. Because instructions can have multiple events associated with
them, the samples in these groups are not necessarily unique.
For example perf report shows these
groups:

  Available samples
  0 arm_spe//
  0 dummy:u
  21 l1d-miss
  897 l1d-access
  5 llc-miss
  7 llc-access
  2 tlb-miss
  1K tlb-access
  36 branch-miss
  0 remote-access
  900 memory

The arm_spe// and dummy:u events are implementation details and are expected to be empty.

To get a full list of unique samples that are not sorted into groups, set the itrace option to
generate 'instruction' samples. The period option is also taken into account, so set it to 1
instruction unless you want to further downsample the already sampled SPE data:

  perf report --itrace=i1i

Memory access details are also stored on the samples and this can be viewed with:

  perf report --mem-mode

Common errors
~~~~~~~~~~~~~

 - "Cannot find PMU `arm_spe'. Missing kernel support?"

   Module not built or loaded, KPTI not disabled (see above), or running on a VM

 - "Arm SPE CONTEXT packets not found in the traces."

   Root privilege is required to collect context packets.
   However, these only increase the accuracy of assigning PIDs to kernel samples, so for
   userspace sampling this message can be ignored.

 - Excessively large perf.data file size

   Increase sampling interval (see above)


SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-inject[1]