=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because the VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex. This document describes memcg's internal behavior.
Please note that implementation details can change.

(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst

0. How to record usage ?
========================

   Two objects are used.

   page_cgroup ... an object per page.

        Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

        Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, so double counting against a page_cgroup
   never occurs. swap_cgroup is used only when a charged page is swapped out.

1. Charge
=========

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

        mem_cgroup_try_charge()

2. Uncharge
===========

   A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

        mem_cgroup_uncharge()
          Called when a page's refcount goes down to 0.

        mem_cgroup_uncharge_swap()
          Called when swp_entry's refcnt goes down to 0. A charge against swap
          disappears.

3. charge-commit-cancel
=======================

   Memcg pages are charged in two steps:

        - mem_cgroup_try_charge()
        - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

   At try_charge(), there are no flags to say "this page is charged".
   At this point, usage += PAGE_SIZE.

   At commit(), the page is associated with the memcg.

   At cancel(), simply usage -= PAGE_SIZE.

In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

   An anonymous page is newly allocated at

        - page fault into MAP_ANONYMOUS mapping.
        - Copy-On-Write.

   4.1 Swap-in.
   At swap-in, the page is taken from swap-cache. There are 2 cases.

   (a) If the SwapCache is newly allocated and read, it has no charges.
   (b) If the SwapCache has been mapped by processes, it has been
       charged already.

   4.2 Swap-out.
   At swap-out, the typical state transition is as follows.

   (a) add to swap cache. (marked as SwapCache)
       swp_entry's refcnt += 1.
   (b) fully unmapped.
       swp_entry's refcnt += # of ptes.
   (c) write back to swap.
   (d) delete from swap cache. (remove from SwapCache)
       swp_entry's refcnt -= 1.


   Finally, at task exit,
   (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.

5. Page Cache
=============

   Page Cache is charged at

   - add_to_page_cache_locked().

   The logic is very clear. (About migration, see below)

   Note:
     __remove_from_page_cache() is called by remove_from_page_cache()
     and __remove_mapping().

6. Shmem(tmpfs) Page Cache
===========================

   The best way to understand shmem's page state transition is to read
   mm/shmem.c.

   But a brief explanation of the behavior of memcg around shmem will be
   helpful to understand the logic.

   Shmem's page (just leaf page, not direct/indirect block) can be on

        - radix-tree of shmem's inode.
        - SwapCache.
        - Both on radix-tree and SwapCache. This happens at swap-in
          and swap-out.

   It's charged when...

   - A new page is added to shmem's radix-tree.
   - A swp page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration
=================

   mem_cgroup_migrate()

8. LRU
======

   Each memcg has its own private LRU. Now, its handling is under global
   VM's control (meaning that it's handled under the global pgdat->lru_lock).
   Almost all routines around memcg's LRU are called by the global LRU's
   list management functions under pgdat->lru_lock.

   A special function is mem_cgroup_isolate_pages(). This scans
   memcg's private LRU and calls __isolate_lru_page() to extract a page
   from the LRU.

   (By __isolate_lru_page(), the page is removed from both the global and
   private LRU.)


9. Typical Tests.
=================

 Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

   When you test racy cases, it's a good test to set memcg's limit
   to be very small rather than GB. Many races were found in tests under
   xKB or xxMB limits.

   (Memory behavior under GB and memory behavior under MB show very
   different situations.)

9.2 Shmem
---------

   Historically, memcg's shmem handling was poor and we saw some amount
   of trouble here. This is because shmem is page-cache but can be
   SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

   For NUMA, migration is another special case. For easy testing, cpuset
   is useful. The following is a sample script to do migration::

        mount -t cgroup -o cpuset none /opt/cpuset

        mkdir /opt/cpuset/01
        echo 1 > /opt/cpuset/01/cpuset.cpus
        echo 0 > /opt/cpuset/01/cpuset.mems
        echo 1 > /opt/cpuset/01/cpuset.memory_migrate
        mkdir /opt/cpuset/02
        echo 1 > /opt/cpuset/02/cpuset.cpus
        echo 1 > /opt/cpuset/02/cpuset.mems
        echo 1 > /opt/cpuset/02/cpuset.memory_migrate

   In the above setup, when you move a task from 01 to 02, page migration
   from node 0 to node 1 will occur. The following is a script to migrate
   all tasks under a cpuset::

        # G1 and G2 are the source and destination cpuset directories,
        # e.g. /opt/cpuset/01 and /opt/cpuset/02 from the setup above.
        move_task()
        {
        for pid in $1
        do
                /bin/echo $pid >$2/tasks 2>/dev/null
                echo -n $pid
                echo -n " "
        done
        echo END
        }

        G1_TASK=`cat ${G1}/tasks`
        G2_TASK=`cat ${G2}/tasks`
        move_task "${G1_TASK}" ${G2} &

9.4 Memory hotplug
------------------

   A memory hotplug test is also a good test.

   To offline memory, do the following::

        # echo offline > /sys/devices/system/memory/memoryXXX/state

   (XXX is the number of the memory block)

   This is an easy way to test page migration, too.

9.5 mkdir/rmdir
---------------

   When using hierarchy, mkdir/rmdir test should be done.
   Use tests like the following::

        echo 1 >/opt/cgroup/01/memory.use_hierarchy
        mkdir /opt/cgroup/01/child_a
        mkdir /opt/cgroup/01/child_b

        set limit to 01.
        add limit to 01/child_b
        run jobs under child_a and child_b

   Create/delete the following groups at random while jobs are running::

        /opt/cgroup/01/child_a/child_aa
        /opt/cgroup/01/child_b/child_bb
        /opt/cgroup/01/child_c

   Running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

   Mounting with other subsystems is a good test because there are
   race and lock dependencies with other cgroup subsystems.

   Example::

        # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

   and do task move, mkdir, rmdir etc. under this.

9.7 swapoff
-----------

   Besides management of swap being one of the complicated parts of memcg,
   the call path of swap-in at swapoff is not the same as the usual swap-in
   path. It's worth testing explicitly.

   For example, a test like the following is good:

   (Shell-A)::

        # mount -t cgroup none /cgroup -o memory
        # mkdir /cgroup/test
        # echo 40M > /cgroup/test/memory.limit_in_bytes
        # echo 0 > /cgroup/test/tasks

   Run a malloc(100M) program under this. You'll see 60M of swap.

   (Shell-B)::

        # move all tasks in /cgroup/test to /cgroup
        # /sbin/swapoff -a
        # rmdir /cgroup/test
        # kill malloc task.

   Of course, a tmpfs vs. swapoff test should be done, too.

9.8 OOM-Killer
--------------

   Out-of-memory caused by memcg's limit will kill tasks under
   the memcg. When hierarchy is used, a task under the hierarchy
   will be killed by the kernel.

   In this case, panic_on_oom shouldn't be invoked and tasks
   in other groups shouldn't be killed.

   It's not difficult to cause OOM under memcg, as follows.

   Case A) when you can swapoff::

        #swapoff -a
        #echo 50M > /memory.limit_in_bytes

        run 51M of malloc

   Case B) when you use mem+swap limitation::

        #echo 50M > memory.limit_in_bytes
        #echo 50M > memory.memsw.limit_in_bytes

        run 51M of malloc

9.9 Move charges at task migration
----------------------------------

   Charges associated with a task can be moved along with task migration.

   (Shell-A)::

        #mkdir /cgroup/A
        #echo $$ >/cgroup/A/tasks

   Run some programs which use some amount of memory in /cgroup/A.

   (Shell-B)::

        #mkdir /cgroup/B
        #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
        #echo "pid of the program running in group A" >/cgroup/B/tasks

   You can see that charges have been moved by reading ``*.usage_in_bytes`` or
   memory.stat of both A and B.

   See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for what value
   should be written to move_charge_at_immigrate.
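
   The check in 9.9 can be scripted. The following is only a minimal sketch,
   assuming the memory controller is mounted at /cgroup as in the steps of
   9.9. The names CG_ROOT, usage_of and report_move are hypothetical helpers
   made up for illustration; only memory.usage_in_bytes and the tasks file
   come from the text of 9.9.

```shell
#!/bin/sh
# Hypothetical helpers for checking a charge move; not part of the kernel tree.
# CG_ROOT is assumed to be the mount point of the memory cgroup hierarchy.
CG_ROOT=${CG_ROOT:-/cgroup}

# Print the current usage (in bytes) of group $1.
usage_of()
{
	cat "${CG_ROOT}/$1/memory.usage_in_bytes"
}

# Move pid $1 from group $2 to group $3 and print the usage of both
# groups before and after the move.
report_move()
{
	pid=$1 src=$2 dst=$3
	echo "before: ${src}=$(usage_of "${src}") ${dst}=$(usage_of "${dst}")"
	echo "${pid}" > "${CG_ROOT}/${dst}/tasks"
	echo "after:  ${src}=$(usage_of "${src}") ${dst}=$(usage_of "${dst}")"
}
```

   For example, ``report_move <pid> A B`` would be run after
   memory.move_charge_at_immigrate has been set in B. If charges were moved,
   the usage of A should drop and the usage of B should grow by a similar
   amount.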

9.10 Memory thresholds
----------------------

   The memory controller implements memory thresholds using the cgroups
   notification API. You can use tools/cgroup/cgroup_event_listener.c to
   test it.

   (Shell-A) Create cgroup and run event listener::

        # mkdir /cgroup/A
        # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

   (Shell-B) Add task to cgroup and try to allocate and free memory::

        # echo $$ >/cgroup/A/tasks
        # a="$(dd if=/dev/zero bs=1M count=10)"
        # a=

   You will see a message from cgroup_event_listener every time you cross
   the threshold.

   Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

   It's a good idea to test the root cgroup as well.
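
   The listener invocations in 9.10 can be wrapped so that the usage and
   memsw thresholds are watched together. The following is only a sketch
   under assumptions: LISTENER and start_listeners are hypothetical names,
   and LISTENER is assumed to point at a cgroup_event_listener binary built
   from tools/cgroup. The memsw file is skipped when it does not exist.

```shell
#!/bin/sh
# Hypothetical wrapper around the listener from tools/cgroup;
# not part of the kernel tree.
LISTENER=${LISTENER:-./cgroup_event_listener}

# Start one background listener per usage file found in group
# directory $1, each with threshold $2 (for example 5M).
start_listeners()
{
	grp=$1 thr=$2
	for f in memory.usage_in_bytes memory.memsw.usage_in_bytes
	do
		# Skip files this kernel configuration does not provide.
		[ -e "${grp}/${f}" ] || continue
		"${LISTENER}" "${grp}/${f}" "${thr}" &
	done
}
```

   For example, ``start_listeners /cgroup/A 5M`` followed by
   ``start_listeners /cgroup 5M`` would cover both group A and the root
   cgroup. The listeners run in the background until they are killed.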