==============
Page migration
==============

Page migration allows moving the physical location of pages between
nodes in a NUMA system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the
system rearranges the physical location of those pages.

Also see Documentation/mm/hmm.rst for migrating pages to or from device
private memory.

The main intent of page migration is to reduce the latency of memory accesses
by moving pages near to the processor where the process accessing that memory
is running.

Page migration allows a process to manually relocate the node on which its
pages are located through the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL options while
setting a new memory policy via mbind(). The pages of a process can also be
relocated from another process using the sys_migrate_pages() system call. The
migrate_pages() system call takes two sets of nodes and moves pages of a
process that are located on the source nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required; get it from
https://github.com/numactl/numactl.git). numactl provides libnuma,
which offers an interface for page migration similar to its other NUMA
functionality. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
pages of a process are located. See also the numa_maps documentation in the
proc(5) man page.

Manual migration is useful if, for example, the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
nearer to the new processor. The kernel itself only provides
manual page migration support. Automatic page migration may be implemented
through user space processes that move pages. A special function call,
move_pages(), allows the moving of individual pages within a process.
For example, a NUMA profiler may obtain a log showing frequent off-node
accesses and may use the result to move pages to more advantageous
locations.

Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See
:ref:`CPUSETS <cpusets>`).
Cpusets allow the automation of process locality. If a task is moved to
a new cpuset, then all its pages are moved with it so that the
performance of the process does not drop dramatically. The pages
of processes in a cpuset are also moved if the allowed memory nodes of the
cpuset are changed.
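
As a rough illustration of the move_pages() call mentioned above, the
following userspace sketch asks the kernel which node currently backs a given
address. It is not taken from numactl: the helper names ``page_start()`` and
``node_of_address()`` are hypothetical, and the example assumes a Linux system
whose C library exposes ``SYS_move_pages`` through ``syscall()``, so that it
builds without libnuma. It only queries; it does not move anything.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Round an address down to the start of the page that contains it. */
void *page_start(const void *addr, long page_size)
{
    return (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));
}

/*
 * Hypothetical helper: report the NUMA node backing the page at addr.
 *
 * Passing nodes == NULL turns move_pages() into a pure query: no page is
 * moved and status[] receives the node number, or a negative errno value
 * (for example if the page is not present).
 *
 * Returns the node number, or -1 if the call failed, is not permitted,
 * or the page has no node assigned yet.
 */
int node_of_address(const void *addr)
{
    void *pages[1] = { page_start(addr, sysconf(_SC_PAGESIZE)) };
    int status[1] = { -1 };
    long rc = syscall(SYS_move_pages, 0 /* current process */, 1UL,
                      pages, NULL, status, 0);

    if (rc != 0 || status[0] < 0)
        return -1;
    return status[0];
}
```

To actually relocate pages, the same call takes an array of target nodes in
place of ``NULL`` together with the MPOL_MF_MOVE flag; libnuma wraps this in
its own ``move_pages()`` function, which is usually the more convenient entry
point.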

For all migration techniques, page migration preserves the relative location
of pages within a group of nodes, so that a particular memory allocation
pattern is retained even after a process has been migrated. This is necessary
in order to preserve the memory latencies. Processes will run with similar
performance after migration.

Page migration occurs in several steps. First comes a high-level
description for those trying to use migrate_pages() from the kernel
(for userspace usage see Andi Kleen's numactl package mentioned above),
and then a low-level description of how the details work.

In kernel use of migrate_pages()
================================

1. Remove pages from the LRU.

   Lists of pages to be migrated are generated by scanning over
   pages and moving them into lists. This is done by
   calling isolate_lru_page().
   Calling isolate_lru_page() increases the references to the page
   so that it cannot vanish while the page migration occurs.
   It also prevents the swapper or other scans from encountering
   the page.

2. We need to have a function of type new_folio_t that can be
   passed to migrate_pages(). This function should figure out
   how to allocate the correct new folio given the old folio.

3. The migrate_pages() function is called, which attempts
   to do the migration. It will call the function to allocate
   the new folio for each folio that is considered for moving.

How migrate_pages() works
=========================

migrate_pages() does several passes over its list of pages. A page is moved
if all references to a page are removable at the time. The page has
already been removed from the LRU via isolate_lru_page() and the refcount
is increased so that the page cannot be freed while page migration occurs.

Steps:

1. Lock the page to be migrated.

2. Ensure that writeback is complete.

3. Lock the new page that we want to move to. It is locked so that accesses to
   this (not yet up-to-date) page immediately block while the move is in progress.

4. All the page table references to the page are converted to migration
   entries. This decreases the mapcount of a page. If the resulting
   mapcount is not zero then we do not migrate the page. All user space
   processes that attempt to access the page will now wait on the page lock
   or wait for the migration page table entry to be removed.

5. The i_pages lock is taken. This will cause all processes trying
   to access the page via the mapping to block on the spinlock.

6. The refcount of the page is examined and we back out if references remain.
   Otherwise, we know that we are the only one referencing this page.

7. The radix tree is checked and if it does not contain the pointer to this
   page then we back out because someone else modified the radix tree.

8. The new page is prepped with some settings from the old page so that
   accesses to the new page will discover a page with the correct settings.

9. The radix tree is changed to point to the new page.

10. The reference count of the old page is dropped because the address space
    reference is gone. A reference to the new page is established because
    the new page is referenced by the address space.

11. The i_pages lock is dropped. With that, lookups in the mapping
    become possible again. Processes will move from spinning on the lock
    to sleeping on the locked new page.

12. The page contents are copied to the new page.

13. The remaining page flags are copied to the new page.

14. The old page flags are cleared to indicate that the page does
    not provide any information anymore.

15. Queued up writeback on the new page is triggered.

16. If migration entries were inserted into the page table, then replace them
    with real ptes. Doing so will enable access for user space processes not
    already waiting for the page lock.

17. The page locks are dropped from the old and new page.
    Processes waiting on the page lock will redo their page faults
    and will reach the new page.

18. The new page is moved to the LRU and can be scanned by the swapper,
    etc. again.

Non-LRU page migration
======================

Although migration originally aimed at reducing the latency of memory
accesses for NUMA, compaction also uses migration to create high-order
pages. For compaction purposes, it is also useful to be able to move
non-LRU pages, such as zsmalloc and virtio-balloon pages.

If a driver wants to make its pages movable, it should define a struct
movable_operations. It then needs to call __SetPageMovable() on each
page that it may be able to move. This uses the ``page->mapping`` field,
so this field is not available for the driver to use for other purposes.

Monitoring Migration
=====================

The following events (counters) can be used to monitor page migration.

1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a
   page was migrated. If the page was a non-THP and non-hugetlb page, then
   this counter is increased by one. If the page was a THP or hugetlb, then
   this counter is increased by the number of THP or hugetlb subpages.
   For example, migration of a single 2MB THP that has 4KB-size base pages
   (subpages) will cause this counter to increase by 512.

2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for
   PGMIGRATE_SUCCESS, above: this will be increased by the number of subpages,
   if it was a THP or hugetlb.

3. THP_MIGRATION_SUCCESS: A THP was migrated without being split.

4. THP_MIGRATION_FAIL: A THP could neither be migrated nor be split.

5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had
   to be split. After splitting, a migration retry was used for its sub-pages.

THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or
PGMIGRATE_FAIL events. For example, a THP migration failure will cause both
THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase.

Christoph Lameter, May 8, 2006.
Minchan Kim, Mar 28, 2016.

.. kernel-doc:: include/linux/migrate.h
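
As an illustration of the counters described under "Monitoring Migration", the
events are exported in ``/proc/vmstat`` under lower-case names such as
``pgmigrate_success`` and ``thp_migration_fail``. The sketch below shows one
way to pick a single counter out of vmstat-style text, and repeats the subpage
arithmetic from the PGMIGRATE_SUCCESS description. The function names
``vmstat_lookup()`` and ``subpages_per_huge_page()`` are hypothetical helpers
for this example, not kernel or libnuma interfaces.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Number of base pages (subpages) covered by one huge page. */
unsigned long subpages_per_huge_page(unsigned long huge_bytes,
                                     unsigned long base_bytes)
{
    return huge_bytes / base_bytes;
}

/*
 * Find one "name value" line in vmstat-style text (one counter per line).
 * Returns 0 and stores the value in *out on success, -1 if the counter
 * is not present or its value cannot be parsed.
 */
int vmstat_lookup(const char *text, const char *name,
                  unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require the full counter name followed by a space. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next line */
    }
    return -1;
}
```

In practice the text would come from reading ``/proc/vmstat`` into a buffer.
Because the counters only ever increase, taking two samples and subtracting
them gives the number of migrations in the interval. With 2MB huge pages and
4KB base pages, ``subpages_per_huge_page(2UL << 20, 4096UL)`` evaluates to 512,
which is the amount by which a single migrated THP raises PGMIGRATE_SUCCESS.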