=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because VM is getting complex (one of the reasons is memcg...), memcg's behavior
is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on API should be in Documentation/admin-guide/cgroup-v1/memory.rst

0. How to record usage ?
========================

   2 objects are used.

   page_cgroup ....an object per page.

        Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

        Allocated at swapon(). Freed at swapoff().

   The page_cgroup has a USED bit, and double counting against a page_cgroup
   never occurs. swap_cgroup is used only when a charged page is swapped-out.

1. Charge
=========

   A page/swp_entry may be charged (usage += PAGE_SIZE) at

        mem_cgroup_try_charge()

2. Uncharge
===========

   A page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

        mem_cgroup_uncharge()
          Called when a page's refcount goes down to 0.

        mem_cgroup_uncharge_swap()
          Called when swp_entry's refcnt goes down to 0. A charge against swap
          disappears.

3. charge-commit-cancel
=======================

   Memcg pages are charged in two steps:

        - mem_cgroup_try_charge()
        - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

   At try_charge(), there are no flags to say "this page is charged".
   At this point, usage += PAGE_SIZE.

   At commit(), the page is associated with the memcg.

   At cancel(), simply usage -= PAGE_SIZE.

In the explanation below, we assume CONFIG_SWAP=y.

4. Anonymous
============

   An anonymous page is newly allocated at

        - page fault into MAP_ANONYMOUS mapping.
        - Copy-On-Write.

   4.1 Swap-in.
   At swap-in, the page is taken from swap-cache. There are 2 cases.

   (a) If the SwapCache is newly allocated and read, it has no charges.
   (b) If the SwapCache has been mapped by processes, it has been
       charged already.

   4.2 Swap-out.
   At swap-out, the typical state transition is as below.

   (a) add to swap cache. (marked as SwapCache)
       swp_entry's refcnt += 1.
   (b) fully unmapped.
       swp_entry's refcnt += # of ptes.
   (c) write back to swap.
   (d) delete from swap cache. (remove from SwapCache)
       swp_entry's refcnt -= 1.


   Finally, at task exit,
   (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.

5. Page Cache
=============

   Page Cache is charged at

        - filemap_add_folio().

   The logic is very clear. (About migration, see below)

   Note:
     __remove_from_page_cache() is called by remove_from_page_cache()
     and __remove_mapping().

6. Shmem(tmpfs) Page Cache
===========================

   The best way to understand shmem's page state transition is to read
   mm/shmem.c.

   But a brief explanation of the behavior of memcg around shmem will be
   helpful to understand the logic.

   Shmem's page (just leaf page, not direct/indirect block) can be on

        - radix-tree of shmem's inode.
        - SwapCache.
        - Both on radix-tree and SwapCache. This happens at swap-in
          and swap-out.

   It's charged when...

        - A new page is added to shmem's radix-tree.
        - A swp page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration
=================

   mem_cgroup_migrate()

8. LRU
======
   Each memcg has its own vector of LRUs (inactive anon, active anon,
   inactive file, active file, unevictable) of pages from each node,
   each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================

   Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

   When testing racy cases, it is a good idea to set the memcg's limit
   to be very small rather than in the GB range. Many races were found
   in tests under xKB or xxMB limits.

   (Memory behavior under a GB limit and memory behavior under a MB limit
   show very different situations.)
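
As a rough illustration of the small-limit setup described in 9.1, the
sketch below creates a group and writes a small limit to it. The helper
name ``make_small_memcg`` and the mount point ``/cgroup`` used in the
usage note are assumptions for illustration only; the sketch relies only on
the ``memory.limit_in_bytes`` file used elsewhere in this document.

```shell
# Hypothetical helper (not part of any kernel tool): create a memcg
# directory under an already mounted memory cgroup hierarchy and give
# it a small limit.
#   $1: mount point of the memory cgroup hierarchy (assumed, e.g. /cgroup)
#   $2: name of the group to create
#   $3: limit to write, e.g. 4M
make_small_memcg() {
    mkdir -p "$1/$2" || return 1
    # The limit is written as given; the kernel parses suffixes such as K or M.
    echo "$3" > "$1/$2/memory.limit_in_bytes" || return 1
    # Print the path of the new group so the caller can attach tasks to it.
    echo "$1/$2"
}
```

A possible use, assuming the hierarchy is mounted at /cgroup, would be
``group=$(make_small_memcg /cgroup small 4M)`` followed by
``echo $$ > "$group/tasks"`` before starting the test workload.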

9.2 Shmem
---------

   Historically, memcg's shmem handling was poor and we saw a fair amount
   of trouble here. This is because shmem is page-cache but can also be
   SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

   For NUMA, migration is another special case. For easy testing, cpuset
   is useful. The following is a sample script to do migration::

        mount -t cgroup -o cpuset none /opt/cpuset

        mkdir /opt/cpuset/01
        echo 1 > /opt/cpuset/01/cpuset.cpus
        echo 0 > /opt/cpuset/01/cpuset.mems
        echo 1 > /opt/cpuset/01/cpuset.memory_migrate
        mkdir /opt/cpuset/02
        echo 1 > /opt/cpuset/02/cpuset.cpus
        echo 1 > /opt/cpuset/02/cpuset.mems
        echo 1 > /opt/cpuset/02/cpuset.memory_migrate

   With the above setup, when you move a task from 01 to 02, page migration
   from node 0 to node 1 will occur.
   The following is a script to migrate all tasks
   under a cpuset::

        --
        # G1 and G2 must point at the source and destination cpuset directories.
        move_task()
        {
        # $1: list of pids, $2: destination cpuset directory
        for pid in $1
        do
                /bin/echo "$pid" >"$2/tasks" 2>/dev/null
                echo -n "$pid"
                echo -n " "
        done
        echo END
        }

        G1_TASK=$(cat "${G1}/tasks")
        G2_TASK=$(cat "${G2}/tasks")
        move_task "${G1_TASK}" "${G2}" &
        --

9.4 Memory hotplug
------------------

   A memory hotplug test is also a good test.

   To offline memory, do the following::

        # echo offline > /sys/devices/system/memory/memoryXXX/state

   (XXX is the place of memory)

   This is an easy way to test page migration, too.

9.5 nested cgroups
------------------

   Use tests like the following for testing nested cgroups::

        mkdir /opt/cgroup/01/child_a
        mkdir /opt/cgroup/01/child_b

        set limit to 01.
        add limit to 01/child_b
        run jobs under child_a and child_b

   Create/delete the following groups at random while jobs are running::

        /opt/cgroup/01/child_a/child_aa
        /opt/cgroup/01/child_b/child_bb
        /opt/cgroup/01/child_c

   Running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

   Mounting with other subsystems is a good test because there is a
   race and lock dependency with other cgroup subsystems.

   Example::

        # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

   Then do task move, mkdir, rmdir etc. under this.

9.7 swapoff
-----------

   Besides the fact that management of swap is one of the complicated parts
   of memcg, the call path of swap-in at swapoff is not the same as the usual
   swap-in path. It is worth testing explicitly.

   For example, a test like the following is good:

   (Shell-A)::

        # mount -t cgroup none /cgroup -o memory
        # mkdir /cgroup/test
        # echo 40M > /cgroup/test/memory.limit_in_bytes
        # echo 0 > /cgroup/test/tasks

   Run a malloc(100M) program under this. You'll see 60M of swap.

   (Shell-B)::

        # move all tasks in /cgroup/test to /cgroup
        # /sbin/swapoff -a
        # rmdir /cgroup/test
        # kill malloc task.

   Of course, tmpfs vs. swapoff should be tested, too.

9.8 OOM-Killer
--------------

   Out-of-memory caused by memcg's limit will kill tasks under
   the memcg. When hierarchy is used, a task under the hierarchy
   will be killed by the kernel.

   In this case, panic_on_oom shouldn't be invoked and tasks
   in other groups shouldn't be killed.

   It's not difficult to cause OOM under memcg, as follows.

   Case A) when you can swapoff::

        #swapoff -a
        #echo 50M > /memory.limit_in_bytes

        run 51M of malloc

   Case B) when you use mem+swap limitation::

        #echo 50M > memory.limit_in_bytes
        #echo 50M > memory.memsw.limit_in_bytes

        run 51M of malloc

9.9 Move charges at task migration
----------------------------------

   Charges associated with a task can be moved along with task migration.

   (Shell-A)::

        #mkdir /cgroup/A
        #echo $$ >/cgroup/A/tasks

   Run some programs which use some amount of memory in /cgroup/A.

   (Shell-B)::

        #mkdir /cgroup/B
        #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
        #echo "pid of the program running in group A" >/cgroup/B/tasks

   You can see that charges have been moved by reading ``*.usage_in_bytes``
   or memory.stat of both A and B.

   See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what
   value should be written to move_charge_at_immigrate.
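
To compare the two groups before and after the move in 9.9, a small helper
along the following lines may be convenient. The function name
``show_usage`` is made up for this sketch; it only reads the
``memory.usage_in_bytes`` file of each directory passed to it.

```shell
# Hypothetical helper: print "<dir> <usage in bytes>" for each memcg
# directory given on the command line.
show_usage() {
    for dir in "$@"
    do
        printf '%s %s\n' "$dir" "$(cat "$dir/memory.usage_in_bytes")"
    done
}
```

Running ``show_usage /cgroup/A /cgroup/B`` once before and once after
writing the pid to /cgroup/B/tasks should show usage shrinking in A and
growing in B, provided move_charge_at_immigrate was set as described above.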

9.10 Memory thresholds
----------------------

   The memory controller implements memory thresholds using the cgroups
   notification API. You can use tools/cgroup/cgroup_event_listener.c to
   test it.

   (Shell-A) Create cgroup and run event listener::

        # mkdir /cgroup/A
        # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

   (Shell-B) Add task to cgroup and try to allocate and free memory::

        # echo $$ >/cgroup/A/tasks
        # a="$(dd if=/dev/zero bs=1M count=10)"
        # a=

   You will see a message from cgroup_event_listener every time you cross
   the thresholds.

   Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.

   It's a good idea to test the root cgroup as well.
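
When checking the threshold test in 9.10 by hand, it can help to compare
the current usage with the threshold yourself. The sketch below is one way
to do that in shell; ``to_bytes`` and ``above_threshold`` are hypothetical
helper names, and the conversion assumes the K/M/G suffixes are powers of
1024.

```shell
# Hypothetical helper: convert a size such as 512K, 5M or 1G into bytes.
# A plain number is printed unchanged.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# Hypothetical helper: succeed if the usage of a memcg is above a threshold.
#   $1: memcg directory (e.g. /cgroup/A), $2: threshold (e.g. 5M)
above_threshold() {
    usage=$(cat "$1/memory.usage_in_bytes") || return 2
    [ "$usage" -gt "$(to_bytes "$2")" ]
}
```

For instance, ``above_threshold /cgroup/A 5M && echo above`` can be run in
another shell while the allocation in Shell-B is in progress.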