================
Futex Requeue PI
================

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic (see rt-mutex-design.txt).  For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document.  Priority inheritance is abbreviated throughout as
"PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation.  An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls::

	/* caller must lock mutex */
	pthread_cond_wait(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while (...);
		unlock(cond->__data.__lock);
		lock(mutex);
	}

	pthread_cond_broadcast(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue(cond->__data.__futex, cond->mutex);
	}

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters.  Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space.  This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes.  This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex.  The glibc implementation
would be modified as follows::

	/* caller must lock mutex */
	pthread_cond_wait_pi(cond, mutex)
	{
		lock(cond->__data.__lock);
		unlock(mutex);
		do {
			unlock(cond->__data.__lock);
			futex_wait_requeue_pi(cond->__data.__futex);
			lock(cond->__data.__lock);
		} while (...);
		unlock(cond->__data.__lock);
		/* the kernel acquired the mutex for us */
	}

	pthread_cond_broadcast_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		futex_requeue_pi(cond->__data.__futex, cond->mutex);
	}

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases.  Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code and the waiting code to be
able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex, as that would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run.  This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex operations provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex.  The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

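As a rough illustration of the waiter side, the operation can be reached
from user space through the raw futex() system call.  The sketch below
assumes Linux with <linux/futex.h> available; wait_requeue_pi() and its
parameter names are hypothetical and are not glibc or kernel functions.
Per futex(2), the timeout for this operation is an absolute time, and the
second futex address names the PI futex the task expects to be requeued to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: block on the condvar futex until this task is
 * requeued to, and granted, the PI futex at pi_mutex.
 *
 * expected    - value *cond_futex must still hold for the wait to start
 * abs_timeout - absolute timeout, or NULL to wait indefinitely
 */
static long wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			    const struct timespec *abs_timeout,
			    uint32_t *pi_mutex)
{
	/* The source and target futexes must be distinct words. */
	assert(cond_futex != pi_mutex);

	return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
		       expected, abs_timeout, pi_mutex, 0);
}
```

A return of 0 means the task came back to user space owning the PI futex.
A return of -1 with errno set to EAGAIN means *cond_futex no longer held
the expected value, so the caller never blocked.
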
FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks.  Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1).  Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter.  If it can, this waiter is
woken.  futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex.  It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

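The requeue flow can be pictured with a small user-space toy model.  This
is only a sketch of the idea, not the kernel's code: struct toy_rt_mutex,
toy_start_proxy_lock() and toy_requeue_pi() are hypothetical names, tasks
are plain integers, and the separate "try the top waiter first" step is
folded into the same loop for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 8

/* Toy stand-in for an rt_mutex: an owner plus a queue of blocked tasks. */
struct toy_rt_mutex {
	int owner;			/* task id, or -1 if unowned */
	int waiters[TOY_MAX_WAITERS];
	size_t nr_waiters;
};

/*
 * Toy analogue of rt_mutex_start_proxy_lock(): act on behalf of task.
 * Returns 1 if the lock was free and is now owned by task, 0 if task
 * was enqueued as a waiter behind the current owner.
 */
static int toy_start_proxy_lock(struct toy_rt_mutex *lock, int task)
{
	if (lock->owner < 0) {
		lock->owner = task;
		return 1;
	}
	assert(lock->nr_waiters < TOY_MAX_WAITERS);
	lock->waiters[lock->nr_waiters++] = task;
	return 0;
}

/*
 * Toy analogue of the requeue_pi path: move up to limit tasks from the
 * source queue (assumed sorted, highest priority first) to the lock.
 * Returns how many tasks were woken, i.e. handed the lock directly.
 */
static int toy_requeue_pi(const int *src, size_t nr_src, size_t limit,
			  struct toy_rt_mutex *lock)
{
	int woken = 0;
	size_t i;

	for (i = 0; i < nr_src && i < limit; i++)
		woken += toy_start_proxy_lock(lock, src[i]);

	/* The invariant requeue_pi exists to protect: waiters imply an owner. */
	assert(lock->nr_waiters == 0 || lock->owner >= 0);
	return woken;
}
```

In this model, requeueing onto a lock that is already held wakes nobody and
queues every task, while requeueing onto a free lock hands ownership to the
first task and queues the rest.  Either way the lock never ends up with
waiters and no owner.
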
FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters.  futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks.  It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call.  FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
signal.
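
The argument convention above might look as follows when the operation is
issued through the raw futex() system call.  This is a minimal sketch
assuming Linux with <linux/futex.h>; cmp_requeue_pi(),
broadcast_requeue_pi() and signal_requeue_pi() are hypothetical helper
names, not glibc or kernel functions.  As with FUTEX_CMP_REQUEUE,
nr_requeue is passed in the argument slot otherwise used for the timeout,
and the last argument is the value the condvar futex is expected to hold.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: requeue waiters from cond_futex to pi_mutex. */
static long cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			   uint32_t *pi_mutex, int nr_requeue)
{
	/* The source and target futexes must be distinct words. */
	assert(cond_futex != pi_mutex);

	/* nr_wake is fixed at 1, as FUTEX_CMP_REQUEUE_PI requires. */
	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       1, (long)nr_requeue, pi_mutex, expected);
}

/* Broadcast: consider every waiter (nr_wake=1, nr_requeue=INT_MAX). */
static long broadcast_requeue_pi(uint32_t *cond_futex, uint32_t expected,
				 uint32_t *pi_mutex)
{
	return cmp_requeue_pi(cond_futex, expected, pi_mutex, INT_MAX);
}

/* Signal: consider a single waiter (nr_wake=1, nr_requeue=0). */
static long signal_requeue_pi(uint32_t *cond_futex, uint32_t expected,
			      uint32_t *pi_mutex)
{
	return cmp_requeue_pi(cond_futex, expected, pi_mutex, 0);
}
```

If *cond_futex does not hold the expected value when the kernel checks it,
the call fails with -1 and errno set to EAGAIN, and the caller would
normally re-read the futex word and retry.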