perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with the thresholding feature. On AMD, the tool uses the IBS op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs). On
Arm64, it uses SPE to sample load and store operations, so hardware and
kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of Arm SPE sampling, not every memory operation
will be sampled.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses.

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.


RECORD OPTIONS
--------------
-e::
--event=::
        Select the PMU event. Use 'perf c2c record -e list'
        to list available events.

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-l::
--ldlat::
        Configure mem-loads latency. Supported on Intel and Arm64 processors
        only. Ignored on other archs.

-k::
--all-kernel::
        Configure all used events to run in kernel space.

-u::
--all-user::
        Configure all used events to run in user space.
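
As a rough illustration of how the record options above fit into the
record/report workflow, the following Python sketch only assembles the two
command lines. It is not part of perf: the helper names and the './my_app'
workload are hypothetical, and only options documented in this page are used.

```python
import shlex


def c2c_record_argv(command, ldlat=None, all_user=False, record_opts=()):
    """Build an argv list for 'perf c2c record' (hypothetical helper)."""
    argv = ["perf", "c2c", "record"]
    if ldlat is not None:
        # -l/--ldlat: mem-loads latency (Intel and Arm64 only)
        argv += ["--ldlat", str(ldlat)]
    if all_user:
        # -u/--all-user: run all used events in user space
        argv.append("--all-user")
    if record_opts:
        # plain 'perf record' options go after the '--' mark
        argv.append("--")
        argv += list(record_opts)
    argv += list(command)
    return argv


def c2c_report_argv(stdio=True):
    """Build an argv list for 'perf c2c report' (hypothetical helper)."""
    argv = ["perf", "c2c", "report"]
    if stdio:
        argv.append("--stdio")
    return argv


# './my_app' is a placeholder workload, not a real program.
record = c2c_record_argv(["./my_app"], ldlat=50, all_user=True)
print(shlex.join(record))
print(shlex.join(c2c_report_argv()))
```

The resulting lists can be passed to subprocess.run() on a machine where perf
is installed and the required hardware events are available.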

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
        vmlinux pathname

-v::
--verbose::
        Be more verbose (show counter open errors, etc).

-i::
--input::
        Specify the input file to process.

-N::
--node-info::
        Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
        Specify sorting fields for single cacheline display.
        The following fields are available: tid,pid,iaddr,dso
        (see COALESCE)

-g::
--call-graph::
        Set up callchain parameters.
        Please refer to the perf-report man page for details.

--stdio::
        Force the stdio output (see STDIO OUTPUT)

--stats::
        Display only statistic tables and force stdio mode.

--full-symbols::
        Display full length of symbols.

--no-source::
        Do not display Source:Line column.

--show-all::
        Show all captured HITM lines, with no regard to HITM % 0.0005 limit.

-f::
--force::
        Don't do ownership validation.

-d::
--display::
        Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
        and sort on. Total HITMs (tot) is the default, except on Arm64, which
        uses peer mode as the default.

--stitch-lbr::
        Show callgraph with stitched LBRs, which may have more complete
        callgraph. The perf.data file must have been obtained using
        perf c2c record --call-graph lbr.
        Disabled by default. In common cases with call stack overflows,
        it can recreate better call stacks than the default lbr call stack
        output. But this approach is not foolproof. There can be cases
        where it creates incorrect call stacks from incorrect matches.
        Known limitations include exception handling, such as
        setjmp/longjmp, where calls and returns will not match.

--double-cl::
        Group the detection of shared cacheline events into double cacheline
        granularity. Some architectures have an Adjacent Cacheline Prefetch
        feature, which causes cacheline sharing to behave as if the cacheline
        size were doubled.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example
(to enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).
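
The report is organized around cacheline addresses and the offsets accessed
within each cacheline. The Python sketch below illustrates that mapping for a
handful of made-up data addresses. It is an illustration only, not perf's
implementation: the 64-byte cacheline size, the function names and the sample
addresses are assumptions, and the 'double_cl' flag merely mimics the
granularity change described for the --double-cl option.

```python
CACHELINE_SIZE = 64  # assumption: 64-byte cachelines


def cacheline_of(addr, double_cl=False):
    """Return (cacheline base address, offset within it) for a data address."""
    size = CACHELINE_SIZE * 2 if double_cl else CACHELINE_SIZE
    base = addr & ~(size - 1)  # clear the low bits to align down
    return base, addr - base


def group_by_cacheline(addrs, double_cl=False):
    """Group sampled data addresses by the cacheline they fall into."""
    lines = {}
    for addr in addrs:
        base, offset = cacheline_of(addr, double_cl)
        lines.setdefault(base, []).append(offset)
    return lines


# Made-up sample addresses, for illustration only.
samples = [0x1000, 0x1008, 0x1040, 0x1078]
for base, offsets in sorted(group_by_cacheline(samples).items()):
    print(hex(base), [hex(o) for o in offsets])
```

With double cacheline granularity, all four sample addresses fall into the
same 128-byte group starting at 0x1000.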

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general, the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote load from peer cache or DRAM

  Total records
  - sum of all cachelines accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses whose memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm neoverse cores, RmtHit is used to account remote accesses,
    includes remote DRAM or any upward cache level in remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for given offset within cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1 and N/A (not available) memory
    level for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays nodes that access given cacheline
offset.
Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores} (Display with HITM types)
      Node{cpus %peers %stores} (Display with peer type)
  - node IDs with list of affected CPUs, in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on c2c tool for detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]