================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; leaving it ownerless would
break the PI boosting logic [see rt-mutex-design.txt]. For the
purposes of brevity, this action will be referred to as "requeue_pi"
throughout this document. Priority inheritance is abbreviated
throughout as "PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

        /* caller must lock mutex */
        pthread_cond_wait(cond, mutex)
        {
                lock(cond->__data.__lock);
                unlock(mutex);
                do {
                        unlock(cond->__data.__lock);
                        futex_wait(cond->__data.__futex);
                        lock(cond->__data.__lock);
                } while (...);
                unlock(cond->__data.__lock);
                lock(mutex);
        }

        pthread_cond_broadcast(cond)
        {
                lock(cond->__data.__lock);
                unlock(cond->__data.__lock);
                futex_requeue(cond->__data.__futex, cond->mutex);
        }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex.
The glibc implementation
would be modified as follows::

        /* caller must lock mutex */
        pthread_cond_wait_pi(cond, mutex)
        {
                lock(cond->__data.__lock);
                unlock(mutex);
                do {
                        unlock(cond->__data.__lock);
                        futex_wait_requeue_pi(cond->__data.__futex);
                        lock(cond->__data.__lock);
                } while (...);
                unlock(cond->__data.__lock);
                /* the kernel acquired the mutex for us */
        }

        pthread_cond_broadcast_pi(cond)
        {
                lock(cond->__data.__lock);
                unlock(cond->__data.__lock);
                futex_requeue_pi(cond->__data.__futex, cond->mutex);
        }

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter.
If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
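
The argument rules above can be illustrated with a minimal user-space
sketch. It is not the glibc implementation: the wrapper names
(wait_requeue_pi(), cmp_requeue_pi(), requeue_pi_nr_requeue()) are
hypothetical, and the code assumes a Linux system whose headers provide
SYS_futex and the two operation codes in <linux/futex.h>. The argument
order follows the raw futex system call, where nr_requeue travels in the
slot otherwise used for the timeout pointer.

```c
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>	/* FUTEX_WAIT_REQUEUE_PI, FUTEX_CMP_REQUEUE_PI */
#include <stdint.h>
#include <sys/syscall.h>	/* SYS_futex */
#include <time.h>		/* struct timespec */
#include <unistd.h>		/* syscall() */

/* Hypothetical helper: nr_requeue is INT_MAX for broadcast, 0 for signal. */
int requeue_pi_nr_requeue(int broadcast)
{
	return broadcast ? INT_MAX : 0;
}

/*
 * Hypothetical waiter-side wrapper. Blocks on cond_futex as long as it
 * still contains "expected" and waits to be requeued to pi_mutex.
 * A timeout of NULL waits indefinitely. On a return of 0 the caller
 * already owns pi_mutex.
 */
long wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
		     const struct timespec *timeout, uint32_t *pi_mutex)
{
	/* Assumption: source and target must be two distinct futexes. */
	assert(cond_futex != pi_mutex);
	return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
		       expected, timeout, pi_mutex, 0);
}

/*
 * Hypothetical waker-side wrapper. nr_wake is fixed at 1 as the
 * operation requires; "expected" is compared against *cond_futex
 * before any task is requeued.
 */
long cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
		    uint32_t *pi_mutex, int broadcast)
{
	long nr_requeue = requeue_pi_nr_requeue(broadcast);

	assert(cond_futex != pi_mutex);
	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       1, (void *)nr_requeue, pi_mutex, expected);
}
```

Under these assumptions, a broadcast would call
cmp_requeue_pi(&futex, seen, &mutex, 1) and a signal would pass 0 as
the last argument, in both cases with the mutex already held by the
caller so that every waiter is requeued rather than woken.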