.. _split_page_table_lock:

=====================
Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. But this approach leads to poor page fault scalability of
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With the split page table lock we have a separate per-table lock to serialize
access to the table. At the moment we use the split lock for PTE and PMD
tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:

 - pte_offset_map_lock()
        maps the PTE and takes the PTE table lock, returns a pointer to the
        PTE together with a pointer to the taken lock;
 - pte_unmap_unlock()
        unlocks and unmaps the PTE table;
 - pte_alloc_map_lock()
        allocates the PTE table if needed and takes the lock, returns a
        pointer to the PTE together with a pointer to the taken lock, or
        NULL if the allocation failed;
 - pte_lockptr()
        returns a pointer to the PTE table lock;
 - pmd_lock()
        takes the PMD table lock, returns a pointer to the taken lock;
 - pmd_lockptr()
        returns a pointer to the PMD table lock.

Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If the split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it is enabled for PTE
tables and the architecture supports it (see below).

Hugetlb and split page table lock
=================================

Hugetlb can support several page sizes. We use the split lock only for the
PMD level, but not for PUD.

Hugetlb-specific helpers:

 - huge_pte_lock()
        takes the PMD split lock for a PMD_SIZE page, mm->page_table_lock
        otherwise;
 - huge_pte_lockptr()
        returns a pointer to the table lock.

Support of split page table lock by an architecture
===================================================

No special enabling is needed for the PTE split page table lock: everything
required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages.
This field shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.
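
The constructor/destructor pairing for PTE tables described above can be
sketched as a small user-space model. This is only an illustration, not
architecture code: every model_* name, the 512-entry table size and the
failure knob are assumptions made up for the example, and calloc() merely
stands in for whatever allocator a real architecture uses. The one point it
tries to show is that the constructor runs after allocation and may fail, in
which case the allocation has to be undone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for the struct page of a PTE table. */
struct model_table_page {
	unsigned long *table;	/* the page table entries themselves */
	bool ptl_ready;		/* set while the split lock is initialised */
};

/* Test knob (an assumption of this model): force the constructor to fail. */
static bool model_fail_ctor;

/* Models pgtable_pte_page_ctor(): prepares the lock and may fail. */
static inline bool model_pte_page_ctor(struct model_table_page *page)
{
	if (model_fail_ctor)
		return false;
	page->ptl_ready = true;
	return true;
}

/* Models pgtable_pte_page_dtor(): tears the lock down again. */
static inline void model_pte_page_dtor(struct model_table_page *page)
{
	page->ptl_ready = false;
}

/* Allocation path: construct after allocating, undo everything on failure. */
static inline struct model_table_page *model_pte_alloc_one(void)
{
	struct model_table_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	/* calloc() only keeps the model runnable; the text above rules out
	 * slab for real page tables. */
	page->table = calloc(512, sizeof(*page->table));
	if (!page->table || !model_pte_page_ctor(page)) {
		free(page->table);
		free(page);
		return NULL;
	}
	return page;
}

/* Freeing path: destruct before the memory is released. */
static inline void model_pte_free(struct model_table_page *page)
{
	assert(page->ptl_ready);
	model_pte_page_dtor(page);
	free(page->table);
	free(page);
}
```

A caller would pair model_pte_alloc_one() with model_pte_free() and treat a
NULL return as an allocation failure, whether it came from the allocator or
from the constructor.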

Enabling the PMD split lock requires a pgtable_pmd_page_ctor() call on PMD
table allocation and a pgtable_pmd_page_dtor() call on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail; the
failure must be handled properly.

page->ptl
=========

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best performance,
we use a trick:

 - if spinlock_t fits into a long, we use page->ptl as the spinlock itself,
   so we can avoid indirect access and save a cache line;
 - if the size of spinlock_t is bigger than the size of a long, we use
   page->ptl as a pointer to spinlock_t and allocate it dynamically.
   This allows the split lock to be used with DEBUG_SPINLOCK or
   DEBUG_LOCK_ALLOC enabled, but costs one more cache line for indirect
   access.

The spinlock_t is allocated in pgtable_pte_page_ctor() for a PTE table and in
pgtable_pmd_page_ctor() for a PMD table.

Please never access page->ptl directly; use the appropriate helper.
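
The storage trick described above, and the reason for going through a helper,
can be illustrated with a user-space model. It is a sketch under stated
assumptions rather than the kernel's implementation: model_lock_t,
struct model_page, the MODEL_DEBUG_LOCK switch and all model_* functions are
hypothetical names invented for this example, and a C11 atomic_flag stands in
for spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for spinlock_t. Defining MODEL_DEBUG_LOCK grows it,
 * loosely imitating what debug options do to the real type. */
typedef struct {
	atomic_flag locked;
#ifdef MODEL_DEBUG_LOCK
	const void *owner;
	unsigned long magic;
#endif
} model_lock_t;

#ifdef MODEL_DEBUG_LOCK
#define MODEL_ALLOC_PTL 1	/* lock too big: keep only a pointer */
#else
#define MODEL_ALLOC_PTL 0	/* lock small: embed it directly */
static_assert(sizeof(model_lock_t) <= sizeof(long),
	      "inline lock must fit into a long");
#endif

/* Hypothetical stand-in for the part of struct page that holds ptl. */
struct model_page {
	union {
		unsigned long priv;	/* other user of the same storage */
#if MODEL_ALLOC_PTL
		model_lock_t *ptl;	/* pointer to a separately allocated lock */
#else
		model_lock_t ptl;	/* the lock itself */
#endif
	};
};

/* The helper hides which of the two layouts is in use. */
static inline model_lock_t *model_ptlock_ptr(struct model_page *page)
{
#if MODEL_ALLOC_PTL
	return page->ptl;
#else
	return &page->ptl;
#endif
}

/* Constructor part: allocate the lock if needed, then initialise it. */
static inline bool model_ptlock_init(struct model_page *page)
{
#if MODEL_ALLOC_PTL
	page->ptl = calloc(1, sizeof(*page->ptl));
	if (!page->ptl)
		return false;
#endif
	atomic_flag_clear(&model_ptlock_ptr(page)->locked);
	return true;
}

/* Destructor part: release the separately allocated lock, if any. */
static inline void model_ptlock_free(struct model_page *page)
{
#if MODEL_ALLOC_PTL
	free(page->ptl);
	page->ptl = NULL;
#else
	(void)page;
#endif
}

/* Minimal spin lock and unlock on the model lock. */
static inline void model_lock(model_lock_t *lock)
{
	while (atomic_flag_test_and_set_explicit(&lock->locked,
						 memory_order_acquire))
		;
}

static inline void model_unlock(model_lock_t *lock)
{
	atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Building the model with -DMODEL_DEBUG_LOCK switches it to the pointer
variant; code that only uses model_ptlock_ptr() does not have to change,
which is the point of not touching the field directly.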