.. SPDX-License-Identifier: GPL-2.0

===============================
Kernel level exception handling
===============================

Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>

When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler::

  void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.

do_page_fault first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, was write protected, or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587::

        get_user(c, buf);

The preprocessor output (edited to become somewhat readable)::

  (
    {
      long __gu_err = - 14 , __gu_val = 0;
      const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
      if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
        (((sizeof(*(buf))) <= 0xC0000000UL) &&
        ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
        do {
          __gu_err  = 0;
          switch ((sizeof(*(buf)))) {
            case 1:
              __asm__ __volatile__(
                "1:      mov" "b" " %2,%" "b" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "b" " %" "b" "1,%" "b" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
                break;
            case 2:
              __asm__ __volatile__(
                "1:      mov" "w" " %2,%" "w" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "w" " %" "w" "1,%" "w" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
                break;
            case 4:
              __asm__ __volatile__(
                "1:      mov" "l" " %2,%" "" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "l" " %" "" "1,%" "" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
                break;
            default:
              (__gu_val) = __get_user_bad();
          }
        } while (0) ;
      ((c)) = (__typeof__(*((buf))))__gu_val;
      __gu_err;
    }
  );

WOW! Black GCC/assembly magic. This is impossible to follow, so let's
see what code gcc generates::

 >         xorl %edx,%edx
 >         movl current_set,%eax
 >         cmpl $24,788(%eax)
 >         je .L1424
 >         cmpl $-1073741825,64(%esp)
 >         ja .L1423
 > .L1424:
 >         movl %edx,%eax
 >         movl 64(%esp),%ebx
 > #APP
 > 1:      movb (%ebx),%dl                /* this is the actual user access */
 > 2:
 > .section .fixup,"ax"
 > 3:      movl $-14,%eax
 >         xorb %dl,%dl
 >         jmp 2b
 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b
 > .text
 > #NO_APP
 > .L1423:
 >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?

To understand this we have to look at the final kernel::

 > objdump --section-headers vmlinux
 >
 > vmlinux:     file format elf32-i386
 >
 > Sections:
 > Idx Name          Size      VMA       LMA       File off  Algn
 >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
 >                   CONTENTS, ALLOC, LOAD, DATA
 >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
 >                   ALLOC
 >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
 >                   CONTENTS, READONLY
 >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
 >                   CONTENTS, READONLY

There are obviously two non-standard ELF sections in the generated object
file. But first we want to find out what happened to our code in the
final kernel executable::

 > objdump --disassemble --section=.text vmlinux
 >
 > c017e785 <do_con_write+c1> xorl   %edx,%edx
 > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
 > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
 > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
 > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
 > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
 > c017e79f <do_con_write+db> movl   %edx,%eax
 > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
 > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi

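The compare-and-branch pairs at the top of this listing are what remains
of the address check in the macro. As a rough sketch in plain C, the second
comparison amounts to the function below. The names user_range_ok and
USER_LIMIT are hypothetical and chosen only for illustration; the limit
0xC0000000 is the constant seen in the preprocessor output above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; the value is the limit seen in the preprocessor output. */
#define USER_LIMIT 0xC0000000UL

/*
 * Illustrative stand-in for the range test inside get_user: the access
 * [addr, addr + size) is allowed only if it ends at or below USER_LIMIT.
 * Subtracting size from the limit (instead of adding it to addr) avoids
 * wrapping around at the top of the 32-bit address space.
 */
static int user_range_ok(uint32_t addr, uint32_t size)
{
	return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}
```

With a size of 1 the test reduces to addr <= 0xBFFFFFFF, which matches the
cmpl $0xbfffffff / ja pair in the disassembly.
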
The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file::

 > objdump --disassemble --section=.fixup vmlinux
 >
 > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
 > c0199ffa <.fixup+10ba> xorb   %dl,%dl
 > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>

And finally::

 > objdump --full-contents --section=__ex_table vmlinux
 >
 >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
 >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
 >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order::

 >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^^^^^^^^^^^^^^^^^
                               this is the interesting part!
 >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................

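The step from the raw dump to the human readable one is just a little-endian
decode of each 4-byte group. A small sketch in C, where le32_decode is a
hypothetical helper written for this illustration and not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 32-bit value from four bytes stored
 * least significant byte first, which is the order in which
 * objdump --full-contents prints them for this i386 image.
 */
static uint32_t le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```

Applied to the bytes a5 e7 17 c0 and f5 9f 19 c0 from the second row, this
yields c017e7a5 and c0199ff5, the pair marked as interesting above.
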
What happened? The assembly directives::

  .section .fixup,"ax"
  .section __ex_table,"a"

told the assembler to move the following code to the specified
sections in the ELF object file. So the instructions::

  3:      movl $-14,%eax
          xorb %dl,%dl
          jmp 2b

ended up in the .fixup section of the object file and the addresses::

        .long 1b,3b

ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5::

  the original assembly code: > 1:      movb (%ebx),%dl
  and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

The local label 3b (backward again) is the address of the code to handle
the fault, in our case the actual value is c0199ff5::

  the original assembly code: > 3:      movl $-14,%eax
  and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax

If the fixup was able to handle the exception, control flow may be returned
to the instruction after the one that triggered the fault, i.e. local label 2b.

The assembly code::

 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b

becomes the value pair::

 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^this is ^this is
                               1b       3b

c017e7a5,c0199ff5 in the exception table of the kernel.

So, what actually happens if a fault from kernel mode with no suitable
vma occurs?

#. access to invalid address::

    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

#. MMU generates exception
#. CPU calls do_page_fault
#. do_page_fault calls search_exception_table (regs->eip == c017e7a5);
#. search_exception_table looks up the address c017e7a5 in the
   exception table (i.e. the contents of the ELF section __ex_table)
   and returns the address of the associated fault handling code c0199ff5.
#. do_page_fault modifies its own return address to point to the fault
   handling code and returns.
#. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14)
   b) DL  becomes zero (the value we "read" from user space)
   c) execution continues at local label 2 (address of the
      instruction immediately after the faulting user access).

The steps 8a to 8c in a certain way emulate the faulting instruction.

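The sequence above can be mimicked in user space. The following is only a toy
model under stated assumptions: struct toy_regs, toy_search_extable() and
toy_page_fault() are made-up names, the table holds just the one pair from
our example, and the effect of the fixup code (steps 8a to 8c) is written
out in C rather than executed as machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the saved registers; not the kernel's struct pt_regs. */
struct toy_regs {
	uint32_t eip;	/* address of the faulting instruction */
	int32_t  eax;	/* holds the get_user error code in our example */
	uint8_t  dl;	/* holds the byte "read" from user space */
};

struct toy_extable_entry {
	uint32_t insn;	/* label 1b: instruction that may fault */
	uint32_t fixup;	/* label 3b: where to continue instead */
};

/* The single pair from the example; a real table has many entries. */
static const struct toy_extable_entry toy_extable[] = {
	{ 0xc017e7a5u, 0xc0199ff5u },
};

/* Steps 4 and 5: return the fixup address for addr, or 0 if none. */
static uint32_t toy_search_extable(uint32_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(toy_extable) / sizeof(toy_extable[0]); i++)
		if (toy_extable[i].insn == addr)
			return toy_extable[i].fixup;
	return 0;
}

/* Steps 6 to 8: redirect execution, then model what the fixup code does. */
static int toy_page_fault(struct toy_regs *regs)
{
	uint32_t fixup = toy_search_extable(regs->eip);

	if (!fixup)
		return -1;		/* no entry: this fault cannot be fixed up */

	regs->eip = fixup;		/* step 6: resume in .fixup */
	regs->eax = -14;		/* step 8a: movl $-14,%eax (-EFAULT) */
	regs->dl = 0;			/* step 8b: xorb %dl,%dl */
	regs->eip = 0xc017e7a7u;	/* step 8c: jmp 2b */
	return 0;
}
```

Starting from eip == c017e7a5, the model ends with EAX holding -14, DL holding
zero and eip at c017e7a7, the instruction right after the user access.
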
That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
get_user macro actually returns a value: 0 if the user access was
successful, -EFAULT on failure. Our original code did not test this
return value; however, the inline assembly code in get_user tries to
return -EFAULT. GCC selected EAX to return this value.

NOTE:
Due to the way that the exception table is built and needs to be ordered,
only use exceptions for code in the .text section.  Any other section
will cause the exception table to not be sorted correctly, and the
exceptions will fail.

Things changed when 64-bit support was added to x86 Linux. Rather than
double the size of the exception table by expanding the two entries
from 32 bits to 64 bits, a clever trick was used to store addresses
as relative offsets from the table itself. The assembly code changed
from::

    .long 1b,3b

to::

    .long (from) - .
    .long (to) - .

and the C-code that uses these values converts back to absolute addresses
like this::

	unsigned long ex_insn_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->insn + x->insn;
	}

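The round trip between the two forms can be tried out in user space. This is
a sketch under assumptions, not kernel code: struct rel_entry, struct
toy_image and the rel_*() helpers are hypothetical names, and the "code" is
just a byte array placed next to the table so that every offset is guaranteed
to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout idea as the table entries: two 32-bit relative offsets. */
struct rel_entry {
	int32_t insn;
	int32_t fixup;
};

/*
 * Toy "kernel image": some bytes standing in for code, followed by a
 * table. Keeping both in one object keeps the offsets small.
 */
struct toy_image {
	unsigned char text[64];
	struct rel_entry table[1];
};

/* What ".long (from) - ." computes: target minus the field's own address. */
static int32_t rel_encode(const void *target, const int32_t *field)
{
	return (int32_t)((intptr_t)target - (intptr_t)field);
}

/* Mirrors the conversion shown above for the insn field. */
static uintptr_t rel_insn_addr(const struct rel_entry *x)
{
	return (uintptr_t)&x->insn + x->insn;
}

/* The same conversion applied to the second field. */
static uintptr_t rel_fixup_addr(const struct rel_entry *x)
{
	return (uintptr_t)&x->fixup + x->fixup;
}
```

Each offset is relative to the address of its own field, so an entry stays
at 8 bytes for two addresses no matter how wide a pointer is.
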
In v4.6 the exception table entry was expanded with a new field "handler".
This is also 32 bits wide and contains a third relative function
pointer which points to one of:

1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
     This is the legacy case that just jumps to the fixup code.

2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
     This case provides the fault number of the trap that occurred at
     entry->insn. It is used to distinguish page faults from machine
     checks.

More functions can easily be added.

CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post
link of the kernel image, via a host utility scripts/sorttable. It will set the
symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section
at boot time. With the exception table sorted, at runtime when an exception
occurs we can quickly look up the __ex_table entry via binary search.

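Why sorting matters can be seen in a small user-space sketch. It uses the C
library's qsort() and bsearch() in place of the kernel's own routines, and
absolute 32-bit addresses as in the earlier i386 example; struct abs_entry,
sort_table() and lookup_fixup() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified entry with absolute addresses, as in the 32-bit example. */
struct abs_entry {
	uint32_t insn;
	uint32_t fixup;
};

/* Order entries by the address of the instruction that may fault. */
static int cmp_entry(const void *a, const void *b)
{
	const struct abs_entry *x = a, *y = b;

	return (x->insn > y->insn) - (x->insn < y->insn);
}

/* Stand-in for what a build-time or boot-time sort does. */
static void sort_table(struct abs_entry *tbl, size_t n)
{
	qsort(tbl, n, sizeof(*tbl), cmp_entry);
}

/* Binary search a sorted table; returns the fixup address or 0. */
static uint32_t lookup_fixup(const struct abs_entry *tbl, size_t n,
			     uint32_t addr)
{
	struct abs_entry key = { addr, 0 };
	const struct abs_entry *hit;

	hit = bsearch(&key, tbl, n, sizeof(*tbl), cmp_entry);
	return hit ? hit->fixup : 0;
}
```

The binary search only gives correct answers on a sorted table, which is why
the sort has to happen either at build time or before the first lookup.
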
This is not just a boot time optimization; some architectures require this
table to be sorted in order to handle exceptions relatively early in the boot
process. For example, i386 makes use of this form of exception handling before
paging support is even enabled!