.. _split_page_table_lock:

=====================
Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. This approach leads to poor page fault scalability for
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With the split page table lock, each table has its own lock to serialize
access to that table. At the moment the split lock is used for PTE and PMD
tables. Access to higher-level tables is protected by mm->page_table_lock.

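The choice between a per-table lock and the mm-wide lock can be pictured with
a small userspace sketch. This is not kernel code: model_lock, model_mm,
model_table and model_lockptr() are hypothetical stand-ins invented for the
illustration, and the lock is a plain atomic flag rather than a real
spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for spinlock_t; not the kernel type. */
struct model_lock {
	atomic_bool held;
};

/* Returns true only if the lock was free before the exchange. */
static inline bool model_trylock(struct model_lock *l)
{
	return !atomic_exchange(&l->held, true);
}

static inline void model_unlock(struct model_lock *l)
{
	assert(atomic_load(&l->held));	/* unlocking a free lock is a bug */
	atomic_store(&l->held, false);
}

enum model_level { MODEL_PTE, MODEL_PMD, MODEL_PUD };

/* One lock for the whole address space, as with mm->page_table_lock. */
struct model_mm {
	struct model_lock page_table_lock;
};

struct model_table {
	enum model_level level;
	struct model_lock ptl;	/* per-table lock, used for PTE and PMD tables */
};

/* Pick the lock that serializes access to one table. */
static inline struct model_lock *model_lockptr(struct model_mm *mm,
					       struct model_table *t)
{
	if (t->level == MODEL_PTE || t->level == MODEL_PMD)
		return &t->ptl;
	return &mm->page_table_lock;
}
```

In this model, two threads working on different PTE tables take different
locks and do not contend, while all higher-level tables still share one lock.
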
There are helpers to lock/unlock a table and other accessor functions:

 - pte_offset_map_lock()
	maps the PTE and takes the PTE table lock, returns a pointer to the
	PTE and passes back a pointer to the taken lock;
 - pte_unmap_unlock()
	unlocks and unmaps the PTE table;
 - pte_alloc_map_lock()
	allocates the PTE table if needed and takes the lock, returns a
	pointer to the PTE and passes back a pointer to the taken lock, or
	returns NULL if the allocation failed;
 - pte_lockptr()
	returns a pointer to the PTE table lock;
 - pmd_lock()
	takes the PMD table lock, returns a pointer to the taken lock;
 - pmd_lockptr()
	returns a pointer to the PMD table lock;

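The usual calling pattern for the PTE helpers is "map and lock, modify the
entry, unlock and unmap". The userspace sketch below imitates that pattern
under simplifying assumptions: the model_* names, the 512-entry table and the
busy-wait lock are all hypothetical, and nothing is really mapped or
unmapped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTRS_PER_PTE 512	/* assumed table size, for illustration */

/* Hypothetical stand-in for spinlock_t. */
struct model_lock {
	atomic_bool held;
};

struct model_pte_table {
	struct model_lock ptl;			/* lock of this table only */
	unsigned long entries[MODEL_PTRS_PER_PTE];
};

/* Like pte_lockptr(): returns the lock but does not take it. */
static inline struct model_lock *model_pte_lockptr(struct model_pte_table *t)
{
	return &t->ptl;
}

/*
 * Like pte_offset_map_lock(): takes the table lock, returns a pointer to
 * the entry and stores the taken lock in *ptlp for the later unlock.
 */
static inline unsigned long *
model_pte_offset_map_lock(struct model_pte_table *t, size_t index,
			  struct model_lock **ptlp)
{
	struct model_lock *ptl = model_pte_lockptr(t);

	assert(index < MODEL_PTRS_PER_PTE);
	while (atomic_exchange(&ptl->held, true))
		;	/* spin until the lock is free */
	*ptlp = ptl;
	return &t->entries[index];
}

/* Like pte_unmap_unlock(): releases the lock taken above. */
static inline void model_pte_unmap_unlock(unsigned long *pte,
					  struct model_lock *ptl)
{
	(void)pte;	/* nothing to unmap in this model */
	atomic_store(&ptl->held, false);
}
```

A caller keeps the lock pointer it was handed and passes it back on unlock,
so it never needs to know where the lock actually lives.
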
Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If the split lock is disabled, all tables are guarded by
mm->page_table_lock.

Split page table lock for PMD tables is enabled if it is enabled for PTE
tables and the architecture supports it (see below).

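The two enabling conditions can be written down as simple predicates. In the
kernel they are evaluated at build time; the sketch below uses run-time
functions with hypothetical names and takes the configuration values as
parameters, purely to make the logic explicit.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * PTE split lock: enabled when CONFIG_SPLIT_PTLOCK_CPUS <= NR_CPUS.
 * Both values are passed in as parameters in this model.
 */
static inline bool model_use_split_pte_ptlocks(int nr_cpus,
					       int split_ptlock_cpus)
{
	assert(nr_cpus > 0 && split_ptlock_cpus > 0);
	return nr_cpus >= split_ptlock_cpus;
}

/*
 * PMD split lock: needs the PTE split lock and architecture support
 * (CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK, modelled here as a bool).
 */
static inline bool model_use_split_pmd_ptlocks(int nr_cpus,
					       int split_ptlock_cpus,
					       bool arch_enable_split_pmd)
{
	return model_use_split_pte_ptlocks(nr_cpus, split_ptlock_cpus) &&
	       arch_enable_split_pmd;
}
```

For example, with the usual threshold of 4, a build for 2 CPUs gets neither
split lock, regardless of architecture support.
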
Hugetlb and split page table lock
=================================

Hugetlb can support several page sizes. The split lock is used only at the
PMD level, not at the PUD level.

Hugetlb-specific helpers:

 - huge_pte_lock()
	takes the PMD split lock for a PMD_SIZE page, mm->page_table_lock
	otherwise;
 - huge_pte_lockptr()
	returns a pointer to the table lock;

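The lock selection done by the hugetlb helpers can be sketched in userspace
as follows. The model_* names are hypothetical, and the page sizes (4 KiB
base pages, 2 MiB PMD-sized and 1 GiB PUD-sized huge pages) are an assumption
chosen for the example; real values depend on the architecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE (1UL << 12)	/* assumed 4 KiB */
#define MODEL_PMD_SIZE  (1UL << 21)	/* assumed 2 MiB */
#define MODEL_PUD_SIZE  (1UL << 30)	/* assumed 1 GiB */

/* Hypothetical stand-in for spinlock_t. */
struct model_lock {
	atomic_bool held;
};

struct model_mm {
	struct model_lock page_table_lock;
};

struct model_pmd_table {
	struct model_lock ptl;	/* PMD split lock */
};

/* Like huge_pte_lockptr(): choose the lock, do not take it. */
static inline struct model_lock *
model_huge_pte_lockptr(unsigned long huge_page_size, struct model_mm *mm,
		       struct model_pmd_table *pmd_table)
{
	assert(huge_page_size > MODEL_PAGE_SIZE);
	if (huge_page_size == MODEL_PMD_SIZE)
		return &pmd_table->ptl;
	/* Any other huge page size falls back to the mm-wide lock. */
	return &mm->page_table_lock;
}

/* Like huge_pte_lock(): choose the lock, take it and return it. */
static inline struct model_lock *
model_huge_pte_lock(unsigned long huge_page_size, struct model_mm *mm,
		    struct model_pmd_table *pmd_table)
{
	struct model_lock *ptl =
		model_huge_pte_lockptr(huge_page_size, mm, pmd_table);

	while (atomic_exchange(&ptl->held, true))
		;	/* spin until the lock is free */
	return ptl;
}
```
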
Support of split page table lock by an architecture
===================================================

No special enabling is needed for the PTE split page table lock: everything
required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages.
This field shares storage with page->ptl.

The PMD split lock only makes sense if you have more than two page table
levels.

Enabling the PMD split lock requires a pgtable_pmd_page_ctor() call on PMD
table allocation and a pgtable_pmd_page_dtor() call on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- this
must be handled properly.

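Handling a constructor failure usually means undoing the allocation and
reporting the failure to the caller. The following userspace sketch shows
that shape only; it is not an architecture's pmd_alloc_one(). All model_*
names are hypothetical, malloc-family calls stand in for page allocation,
and model_ctor_should_fail is a test hook added for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_page {
	bool ptl_ready;	/* set by the constructor, cleared by the destructor */
	void *table;	/* memory backing the page table */
};

/* Test hook: when true, the constructor reports failure. */
static bool model_ctor_should_fail;

/* Stand-in for pgtable_pmd_page_ctor(): may fail. */
static inline bool model_pmd_page_ctor(struct model_page *page)
{
	if (model_ctor_should_fail)
		return false;
	page->ptl_ready = true;
	return true;
}

/* Stand-in for pgtable_pmd_page_dtor(). */
static inline void model_pmd_page_dtor(struct model_page *page)
{
	assert(page->ptl_ready);
	page->ptl_ready = false;
}

static inline struct model_page *model_pmd_alloc_one(void)
{
	struct model_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	page->table = calloc(512, sizeof(unsigned long));
	if (!page->table) {
		free(page);
		return NULL;
	}
	if (!model_pmd_page_ctor(page)) {
		/* Never hand out a table whose lock was not set up. */
		free(page->table);
		free(page);
		return NULL;
	}
	return page;
}

static inline void model_pmd_free(struct model_page *page)
{
	model_pmd_page_dtor(page);	/* pairs with the constructor */
	free(page->table);
	free(page);
}
```
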
page->ptl
=========

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best performance,
we use a trick:

 - if spinlock_t fits into a long, we use page->ptl as the spinlock, so we
   can avoid indirect access and save a cache line;
 - if the size of spinlock_t is bigger than the size of a long, we use
   page->ptl as a pointer to spinlock_t and allocate it dynamically. This
   allows the split lock to be used with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
   enabled, but costs one more cache line for indirect access.

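The two layouts can be compared side by side in a userspace sketch. This is
only a model: the kernel picks one layout at build time, whereas the sketch
defines both as separate hypothetical types (model_page_inline and
model_page_indirect), and the two lock structs are invented stand-ins for a
small and a debug-sized spinlock_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for a spinlock_t that fits into a long. */
struct model_small_lock {
	unsigned int raw;
};

/* Stand-in for a debug spinlock_t that is larger than a long. */
struct model_debug_lock {
	unsigned int raw;
	void *owner;
	long magic;
};

/* Layout 1: the lock is embedded and shares storage with 'private'. */
struct model_page_inline {
	union {
		unsigned long private;
		struct model_small_lock ptl;
	};
};

/* Layout 2: only a pointer is embedded; the lock lives elsewhere. */
struct model_page_indirect {
	union {
		unsigned long private;
		struct model_debug_lock *ptl;
	};
};

static inline bool model_ptlock_init_inline(struct model_page_inline *page)
{
	page->ptl.raw = 0;	/* nothing to allocate, cannot fail */
	return true;
}

static inline struct model_small_lock *
model_ptlock_ptr_inline(struct model_page_inline *page)
{
	return &page->ptl;	/* direct access, no extra cache line */
}

static inline bool model_ptlock_init_indirect(struct model_page_indirect *page)
{
	page->ptl = calloc(1, sizeof(*page->ptl));
	return page->ptl != NULL;	/* dynamic allocation can fail */
}

static inline struct model_debug_lock *
model_ptlock_ptr_indirect(struct model_page_indirect *page)
{
	assert(page->ptl);
	return page->ptl;	/* one extra pointer dereference */
}

static inline void model_ptlock_free_indirect(struct model_page_indirect *page)
{
	free(page->ptl);
	page->ptl = NULL;
}
```

Callers that go through an accessor such as model_ptlock_ptr_inline() or
model_ptlock_ptr_indirect() do not need to know which layout is in use.
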
The spinlock_t is allocated in pgtable_pte_page_ctor() for PTE tables and in
pgtable_pmd_page_ctor() for PMD tables.

Please, never access page->ptl directly -- use the appropriate helper.