1e1051a39Sopenharmony_ci.explicit 2e1051a39Sopenharmony_ci.text 3e1051a39Sopenharmony_ci.ident "ia64.S, Version 2.1" 4e1051a39Sopenharmony_ci.ident "IA-64 ISA artwork by Andy Polyakov <appro@openssl.org>" 5e1051a39Sopenharmony_ci 6e1051a39Sopenharmony_ci// Copyright 2001-2018 The OpenSSL Project Authors. All Rights Reserved. 7e1051a39Sopenharmony_ci// 8e1051a39Sopenharmony_ci// Licensed under the Apache License 2.0 (the "License"). You may not use 9e1051a39Sopenharmony_ci// this file except in compliance with the License. You can obtain a copy 10e1051a39Sopenharmony_ci// in the file LICENSE in the source distribution or at 11e1051a39Sopenharmony_ci// https://www.openssl.org/source/license.html 12e1051a39Sopenharmony_ci 13e1051a39Sopenharmony_ci// 14e1051a39Sopenharmony_ci// ==================================================================== 15e1051a39Sopenharmony_ci// Written by Andy Polyakov <appro@openssl.org> for the OpenSSL 16e1051a39Sopenharmony_ci// project. 17e1051a39Sopenharmony_ci// 18e1051a39Sopenharmony_ci// Rights for redistribution and usage in source and binary forms are 19e1051a39Sopenharmony_ci// granted according to the License. Warranty of any kind is disclaimed. 20e1051a39Sopenharmony_ci// ==================================================================== 21e1051a39Sopenharmony_ci// 22e1051a39Sopenharmony_ci// Version 2.x is Itanium2 re-tune. Few words about how Itanium2 is 23e1051a39Sopenharmony_ci// different from Itanium to this module viewpoint. Most notably, is it 24e1051a39Sopenharmony_ci// "wider" than Itanium? Can you experience loop scalability as 25e1051a39Sopenharmony_ci// discussed in commentary sections? Not really:-( Itanium2 has 6 26e1051a39Sopenharmony_ci// integer ALU ports, i.e. it's 2 ports wider, but it's not enough to 27e1051a39Sopenharmony_ci// spin twice as fast, as I need 8 IALU ports. Amount of floating point 28e1051a39Sopenharmony_ci// ports is the same, i.e. 2, while I need 4. In other words, to this 29e1051a39Sopenharmony_ci// module Itanium2 remains effectively as "wide" as Itanium. Yet it's 30e1051a39Sopenharmony_ci// essentially different in respect to this module, and a re-tune was 31e1051a39Sopenharmony_ci// required. Well, because some instruction latencies has changed. Most 32e1051a39Sopenharmony_ci// noticeably those intensively used: 33e1051a39Sopenharmony_ci// 34e1051a39Sopenharmony_ci// Itanium Itanium2 35e1051a39Sopenharmony_ci// ldf8 9 6 L2 hit 36e1051a39Sopenharmony_ci// ld8 2 1 L1 hit 37e1051a39Sopenharmony_ci// getf 2 5 38e1051a39Sopenharmony_ci// xma[->getf] 7[+1] 4[+0] 39e1051a39Sopenharmony_ci// add[->st8] 1[+1] 1[+0] 40e1051a39Sopenharmony_ci// 41e1051a39Sopenharmony_ci// What does it mean? You might ratiocinate that the original code 42e1051a39Sopenharmony_ci// should run just faster... Because sum of latencies is smaller... 43e1051a39Sopenharmony_ci// Wrong! Note that getf latency increased. This means that if a loop is 44e1051a39Sopenharmony_ci// scheduled for lower latency (as they were), then it will suffer from 45e1051a39Sopenharmony_ci// stall condition and the code will therefore turn anti-scalable, e.g. 46e1051a39Sopenharmony_ci// original bn_mul_words spun at 5*n or 2.5 times slower than expected 47e1051a39Sopenharmony_ci// on Itanium2! What to do? Reschedule loops for Itanium2? But then 48e1051a39Sopenharmony_ci// Itanium would exhibit anti-scalability. So I've chosen to reschedule 49e1051a39Sopenharmony_ci// for worst latency for every instruction aiming for best *all-round* 50e1051a39Sopenharmony_ci// performance. 51e1051a39Sopenharmony_ci 52e1051a39Sopenharmony_ci// Q. How much faster does it get? 53e1051a39Sopenharmony_ci// A. Here is the output from 'openssl speed rsa dsa' for vanilla 54e1051a39Sopenharmony_ci// 0.9.6a compiled with gcc version 2.96 20000731 (Red Hat 55e1051a39Sopenharmony_ci// Linux 7.1 2.96-81): 56e1051a39Sopenharmony_ci// 57e1051a39Sopenharmony_ci// sign verify sign/s verify/s 58e1051a39Sopenharmony_ci// rsa 512 bits 0.0036s 0.0003s 275.3 2999.2 59e1051a39Sopenharmony_ci// rsa 1024 bits 0.0203s 0.0011s 49.3 894.1 60e1051a39Sopenharmony_ci// rsa 2048 bits 0.1331s 0.0040s 7.5 250.9 61e1051a39Sopenharmony_ci// rsa 4096 bits 0.9270s 0.0147s 1.1 68.1 62e1051a39Sopenharmony_ci// sign verify sign/s verify/s 63e1051a39Sopenharmony_ci// dsa 512 bits 0.0035s 0.0043s 288.3 234.8 64e1051a39Sopenharmony_ci// dsa 1024 bits 0.0111s 0.0135s 90.0 74.2 65e1051a39Sopenharmony_ci// 66e1051a39Sopenharmony_ci// And here is similar output but for this assembler 67e1051a39Sopenharmony_ci// implementation:-) 68e1051a39Sopenharmony_ci// 69e1051a39Sopenharmony_ci// sign verify sign/s verify/s 70e1051a39Sopenharmony_ci// rsa 512 bits 0.0021s 0.0001s 549.4 9638.5 71e1051a39Sopenharmony_ci// rsa 1024 bits 0.0055s 0.0002s 183.8 4481.1 72e1051a39Sopenharmony_ci// rsa 2048 bits 0.0244s 0.0006s 41.4 1726.3 73e1051a39Sopenharmony_ci// rsa 4096 bits 0.1295s 0.0018s 7.7 561.5 74e1051a39Sopenharmony_ci// sign verify sign/s verify/s 75e1051a39Sopenharmony_ci// dsa 512 bits 0.0012s 0.0013s 891.9 756.6 76e1051a39Sopenharmony_ci// dsa 1024 bits 0.0023s 0.0028s 440.4 376.2 77e1051a39Sopenharmony_ci// 78e1051a39Sopenharmony_ci// Yes, you may argue that it's not fair comparison as it's 79e1051a39Sopenharmony_ci// possible to craft the C implementation with BN_UMULT_HIGH 80e1051a39Sopenharmony_ci// inline assembler macro. But of course! Here is the output 81e1051a39Sopenharmony_ci// with the macro: 82e1051a39Sopenharmony_ci// 83e1051a39Sopenharmony_ci// sign verify sign/s verify/s 84e1051a39Sopenharmony_ci// rsa 512 bits 0.0020s 0.0002s 495.0 6561.0 85e1051a39Sopenharmony_ci// rsa 1024 bits 0.0086s 0.0004s 116.2 2235.7 86e1051a39Sopenharmony_ci// rsa 2048 bits 0.0519s 0.0015s 19.3 667.3 87e1051a39Sopenharmony_ci// rsa 4096 bits 0.3464s 0.0053s 2.9 187.7 88e1051a39Sopenharmony_ci// sign verify sign/s verify/s 89e1051a39Sopenharmony_ci// dsa 512 bits 0.0016s 0.0020s 613.1 510.5 90e1051a39Sopenharmony_ci// dsa 1024 bits 0.0045s 0.0054s 221.0 183.9 91e1051a39Sopenharmony_ci// 92e1051a39Sopenharmony_ci// My code is still way faster, huh:-) And I believe that even 93e1051a39Sopenharmony_ci// higher performance can be achieved. Note that as keys get 94e1051a39Sopenharmony_ci// longer, performance gain is larger. Why? According to the 95e1051a39Sopenharmony_ci// profiler there is another player in the field, namely 96e1051a39Sopenharmony_ci// BN_from_montgomery consuming larger and larger portion of CPU 97e1051a39Sopenharmony_ci// time as keysize decreases. I therefore consider putting effort 98e1051a39Sopenharmony_ci// to assembler implementation of the following routine: 99e1051a39Sopenharmony_ci// 100e1051a39Sopenharmony_ci// void bn_mul_add_mont (BN_ULONG *rp,BN_ULONG *np,int nl,BN_ULONG n0) 101e1051a39Sopenharmony_ci// { 102e1051a39Sopenharmony_ci// int i,j; 103e1051a39Sopenharmony_ci// BN_ULONG v; 104e1051a39Sopenharmony_ci// 105e1051a39Sopenharmony_ci// for (i=0; i<nl; i++) 106e1051a39Sopenharmony_ci// { 107e1051a39Sopenharmony_ci// v=bn_mul_add_words(rp,np,nl,(rp[0]*n0)&BN_MASK2); 108e1051a39Sopenharmony_ci// nrp++; 109e1051a39Sopenharmony_ci// rp++; 110e1051a39Sopenharmony_ci// if (((nrp[-1]+=v)&BN_MASK2) < v) 111e1051a39Sopenharmony_ci// for (j=0; ((++nrp[j])&BN_MASK2) == 0; j++) ; 112e1051a39Sopenharmony_ci// } 113e1051a39Sopenharmony_ci// } 114e1051a39Sopenharmony_ci// 115e1051a39Sopenharmony_ci// It might as well be beneficial to implement even combaX 116e1051a39Sopenharmony_ci// variants, as it appears as it can literally unleash the 117e1051a39Sopenharmony_ci// performance (see comment section to bn_mul_comba8 below). 118e1051a39Sopenharmony_ci// 119e1051a39Sopenharmony_ci// And finally for your reference the output for 0.9.6a compiled 120e1051a39Sopenharmony_ci// with SGIcc version 0.01.0-12 (keep in mind that for the moment 121e1051a39Sopenharmony_ci// of this writing it's not possible to convince SGIcc to use 122e1051a39Sopenharmony_ci// BN_UMULT_HIGH inline assembler macro, yet the code is fast, 123e1051a39Sopenharmony_ci// i.e. for a compiler generated one:-): 124e1051a39Sopenharmony_ci// 125e1051a39Sopenharmony_ci// sign verify sign/s verify/s 126e1051a39Sopenharmony_ci// rsa 512 bits 0.0022s 0.0002s 452.7 5894.3 127e1051a39Sopenharmony_ci// rsa 1024 bits 0.0097s 0.0005s 102.7 2002.9 128e1051a39Sopenharmony_ci// rsa 2048 bits 0.0578s 0.0017s 17.3 600.2 129e1051a39Sopenharmony_ci// rsa 4096 bits 0.3838s 0.0061s 2.6 164.5 130e1051a39Sopenharmony_ci// sign verify sign/s verify/s 131e1051a39Sopenharmony_ci// dsa 512 bits 0.0018s 0.0022s 547.3 459.6 132e1051a39Sopenharmony_ci// dsa 1024 bits 0.0051s 0.0062s 196.6 161.3 133e1051a39Sopenharmony_ci// 134e1051a39Sopenharmony_ci// Oh! Benchmarks were performed on 733MHz Lion-class Itanium 135e1051a39Sopenharmony_ci// system running Redhat Linux 7.1 (very special thanks to Ray 136e1051a39Sopenharmony_ci// McCaffity of Williams Communications for providing an account). 137e1051a39Sopenharmony_ci// 138e1051a39Sopenharmony_ci// Q. What's the heck with 'rum 1<<5' at the end of every function? 139e1051a39Sopenharmony_ci// A. Well, by clearing the "upper FP registers written" bit of the 140e1051a39Sopenharmony_ci// User Mask I want to excuse the kernel from preserving upper 141e1051a39Sopenharmony_ci// (f32-f128) FP register bank over process context switch, thus 142e1051a39Sopenharmony_ci// minimizing bus bandwidth consumption during the switch (i.e. 143e1051a39Sopenharmony_ci// after PKI operation completes and the program is off doing 144e1051a39Sopenharmony_ci// something else like bulk symmetric encryption). Having said 145e1051a39Sopenharmony_ci// this, I also want to point out that it might be good idea 146e1051a39Sopenharmony_ci// to compile the whole toolkit (as well as majority of the 147e1051a39Sopenharmony_ci// programs for that matter) with -mfixed-range=f32-f127 command 148e1051a39Sopenharmony_ci// line option. No, it doesn't prevent the compiler from writing 149e1051a39Sopenharmony_ci// to upper bank, but at least discourages to do so. If you don't 150e1051a39Sopenharmony_ci// like the idea you have the option to compile the module with 151e1051a39Sopenharmony_ci// -Drum=nop.m in command line. 152e1051a39Sopenharmony_ci// 153e1051a39Sopenharmony_ci 154e1051a39Sopenharmony_ci#if defined(_HPUX_SOURCE) && !defined(_LP64) 155e1051a39Sopenharmony_ci#define ADDP addp4 156e1051a39Sopenharmony_ci#else 157e1051a39Sopenharmony_ci#define ADDP add 158e1051a39Sopenharmony_ci#endif 159e1051a39Sopenharmony_ci#ifdef __VMS 160e1051a39Sopenharmony_ci.alias abort, "decc$abort" 161e1051a39Sopenharmony_ci#endif 162e1051a39Sopenharmony_ci 163e1051a39Sopenharmony_ci#if 1 164e1051a39Sopenharmony_ci// 165e1051a39Sopenharmony_ci// bn_[add|sub]_words routines. 166e1051a39Sopenharmony_ci// 167e1051a39Sopenharmony_ci// Loops are spinning in 2*(n+5) ticks on Itanium (provided that the 168e1051a39Sopenharmony_ci// data reside in L1 cache, i.e. 2 ticks away). It's possible to 169e1051a39Sopenharmony_ci// compress the epilogue and get down to 2*n+6, but at the cost of 170e1051a39Sopenharmony_ci// scalability (the neat feature of this implementation is that it 171e1051a39Sopenharmony_ci// shall automagically spin in n+5 on "wider" IA-64 implementations:-) 172e1051a39Sopenharmony_ci// I consider that the epilogue is short enough as it is to trade tiny 173e1051a39Sopenharmony_ci// performance loss on Itanium for scalability. 174e1051a39Sopenharmony_ci// 175e1051a39Sopenharmony_ci// BN_ULONG bn_add_words(BN_ULONG *rp, BN_ULONG *ap, BN_ULONG *bp,int num) 176e1051a39Sopenharmony_ci// 177e1051a39Sopenharmony_ci.global bn_add_words# 178e1051a39Sopenharmony_ci.proc bn_add_words# 179e1051a39Sopenharmony_ci.align 64 180e1051a39Sopenharmony_ci.skip 32 // makes the loop body aligned at 64-byte boundary 181e1051a39Sopenharmony_cibn_add_words: 182e1051a39Sopenharmony_ci .prologue 183e1051a39Sopenharmony_ci .save ar.pfs,r2 184e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,4,12,0,16 185e1051a39Sopenharmony_ci cmp4.le p6,p0=r35,r0 };; 186e1051a39Sopenharmony_ci{ .mfb; mov r8=r0 // return value 187e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 188e1051a39Sopenharmony_ci 189e1051a39Sopenharmony_ci{ .mib; sub r10=r35,r0,1 190e1051a39Sopenharmony_ci .save ar.lc,r3 191e1051a39Sopenharmony_ci mov r3=ar.lc 192e1051a39Sopenharmony_ci brp.loop.imp .L_bn_add_words_ctop,.L_bn_add_words_cend-16 193e1051a39Sopenharmony_ci } 194e1051a39Sopenharmony_ci{ .mib; ADDP r14=0,r32 // rp 195e1051a39Sopenharmony_ci .save pr,r9 196e1051a39Sopenharmony_ci mov r9=pr };; 197e1051a39Sopenharmony_ci .body 198e1051a39Sopenharmony_ci{ .mii; ADDP r15=0,r33 // ap 199e1051a39Sopenharmony_ci mov ar.lc=r10 200e1051a39Sopenharmony_ci mov ar.ec=6 } 201e1051a39Sopenharmony_ci{ .mib; ADDP r16=0,r34 // bp 202e1051a39Sopenharmony_ci mov pr.rot=1<<16 };; 203e1051a39Sopenharmony_ci 204e1051a39Sopenharmony_ci.L_bn_add_words_ctop: 205e1051a39Sopenharmony_ci{ .mii; (p16) ld8 r32=[r16],8 // b=*(bp++) 206e1051a39Sopenharmony_ci (p18) add r39=r37,r34 207e1051a39Sopenharmony_ci (p19) cmp.ltu.unc p56,p0=r40,r38 } 208e1051a39Sopenharmony_ci{ .mfb; (p0) nop.m 0x0 209e1051a39Sopenharmony_ci (p0) nop.f 0x0 210e1051a39Sopenharmony_ci (p0) nop.b 0x0 } 211e1051a39Sopenharmony_ci{ .mii; (p16) ld8 r35=[r15],8 // a=*(ap++) 212e1051a39Sopenharmony_ci (p58) cmp.eq.or p57,p0=-1,r41 // (p20) 213e1051a39Sopenharmony_ci (p58) add r41=1,r41 } // (p20) 214e1051a39Sopenharmony_ci{ .mfb; (p21) st8 [r14]=r42,8 // *(rp++)=r 215e1051a39Sopenharmony_ci (p0) nop.f 0x0 216e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_add_words_ctop };; 217e1051a39Sopenharmony_ci.L_bn_add_words_cend: 218e1051a39Sopenharmony_ci 219e1051a39Sopenharmony_ci{ .mii; 220e1051a39Sopenharmony_ci(p59) add r8=1,r8 // return value 221e1051a39Sopenharmony_ci mov pr=r9,0x1ffff 222e1051a39Sopenharmony_ci mov ar.lc=r3 } 223e1051a39Sopenharmony_ci{ .mbb; nop.b 0x0 224e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 225e1051a39Sopenharmony_ci.endp bn_add_words# 226e1051a39Sopenharmony_ci 227e1051a39Sopenharmony_ci// 228e1051a39Sopenharmony_ci// BN_ULONG bn_sub_words(BN_ULONG *rp, BN_ULONG *ap, BN_ULONG *bp,int num) 229e1051a39Sopenharmony_ci// 230e1051a39Sopenharmony_ci.global bn_sub_words# 231e1051a39Sopenharmony_ci.proc bn_sub_words# 232e1051a39Sopenharmony_ci.align 64 233e1051a39Sopenharmony_ci.skip 32 // makes the loop body aligned at 64-byte boundary 234e1051a39Sopenharmony_cibn_sub_words: 235e1051a39Sopenharmony_ci .prologue 236e1051a39Sopenharmony_ci .save ar.pfs,r2 237e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,4,12,0,16 238e1051a39Sopenharmony_ci cmp4.le p6,p0=r35,r0 };; 239e1051a39Sopenharmony_ci{ .mfb; mov r8=r0 // return value 240e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 241e1051a39Sopenharmony_ci 242e1051a39Sopenharmony_ci{ .mib; sub r10=r35,r0,1 243e1051a39Sopenharmony_ci .save ar.lc,r3 244e1051a39Sopenharmony_ci mov r3=ar.lc 245e1051a39Sopenharmony_ci brp.loop.imp .L_bn_sub_words_ctop,.L_bn_sub_words_cend-16 246e1051a39Sopenharmony_ci } 247e1051a39Sopenharmony_ci{ .mib; ADDP r14=0,r32 // rp 248e1051a39Sopenharmony_ci .save pr,r9 249e1051a39Sopenharmony_ci mov r9=pr };; 250e1051a39Sopenharmony_ci .body 251e1051a39Sopenharmony_ci{ .mii; ADDP r15=0,r33 // ap 252e1051a39Sopenharmony_ci mov ar.lc=r10 253e1051a39Sopenharmony_ci mov ar.ec=6 } 254e1051a39Sopenharmony_ci{ .mib; ADDP r16=0,r34 // bp 255e1051a39Sopenharmony_ci mov pr.rot=1<<16 };; 256e1051a39Sopenharmony_ci 257e1051a39Sopenharmony_ci.L_bn_sub_words_ctop: 258e1051a39Sopenharmony_ci{ .mii; (p16) ld8 r32=[r16],8 // b=*(bp++) 259e1051a39Sopenharmony_ci (p18) sub r39=r37,r34 260e1051a39Sopenharmony_ci (p19) cmp.gtu.unc p56,p0=r40,r38 } 261e1051a39Sopenharmony_ci{ .mfb; (p0) nop.m 0x0 262e1051a39Sopenharmony_ci (p0) nop.f 0x0 263e1051a39Sopenharmony_ci (p0) nop.b 0x0 } 264e1051a39Sopenharmony_ci{ .mii; (p16) ld8 r35=[r15],8 // a=*(ap++) 265e1051a39Sopenharmony_ci (p58) cmp.eq.or p57,p0=0,r41 // (p20) 266e1051a39Sopenharmony_ci (p58) add r41=-1,r41 } // (p20) 267e1051a39Sopenharmony_ci{ .mbb; (p21) st8 [r14]=r42,8 // *(rp++)=r 268e1051a39Sopenharmony_ci (p0) nop.b 0x0 269e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_sub_words_ctop };; 270e1051a39Sopenharmony_ci.L_bn_sub_words_cend: 271e1051a39Sopenharmony_ci 272e1051a39Sopenharmony_ci{ .mii; 273e1051a39Sopenharmony_ci(p59) add r8=1,r8 // return value 274e1051a39Sopenharmony_ci mov pr=r9,0x1ffff 275e1051a39Sopenharmony_ci mov ar.lc=r3 } 276e1051a39Sopenharmony_ci{ .mbb; nop.b 0x0 277e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 278e1051a39Sopenharmony_ci.endp bn_sub_words# 279e1051a39Sopenharmony_ci#endif 280e1051a39Sopenharmony_ci 281e1051a39Sopenharmony_ci#if 0 282e1051a39Sopenharmony_ci#define XMA_TEMPTATION 283e1051a39Sopenharmony_ci#endif 284e1051a39Sopenharmony_ci 285e1051a39Sopenharmony_ci#if 1 286e1051a39Sopenharmony_ci// 287e1051a39Sopenharmony_ci// BN_ULONG bn_mul_words(BN_ULONG *rp, BN_ULONG *ap, int num, BN_ULONG w) 288e1051a39Sopenharmony_ci// 289e1051a39Sopenharmony_ci.global bn_mul_words# 290e1051a39Sopenharmony_ci.proc bn_mul_words# 291e1051a39Sopenharmony_ci.align 64 292e1051a39Sopenharmony_ci.skip 32 // makes the loop body aligned at 64-byte boundary 293e1051a39Sopenharmony_cibn_mul_words: 294e1051a39Sopenharmony_ci .prologue 295e1051a39Sopenharmony_ci .save ar.pfs,r2 296e1051a39Sopenharmony_ci#ifdef XMA_TEMPTATION 297e1051a39Sopenharmony_ci{ .mfi; alloc r2=ar.pfs,4,0,0,0 };; 298e1051a39Sopenharmony_ci#else 299e1051a39Sopenharmony_ci{ .mfi; alloc r2=ar.pfs,4,12,0,16 };; 300e1051a39Sopenharmony_ci#endif 301e1051a39Sopenharmony_ci{ .mib; mov r8=r0 // return value 302e1051a39Sopenharmony_ci cmp4.le p6,p0=r34,r0 303e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 304e1051a39Sopenharmony_ci 305e1051a39Sopenharmony_ci{ .mii; sub r10=r34,r0,1 306e1051a39Sopenharmony_ci .save ar.lc,r3 307e1051a39Sopenharmony_ci mov r3=ar.lc 308e1051a39Sopenharmony_ci .save pr,r9 309e1051a39Sopenharmony_ci mov r9=pr };; 310e1051a39Sopenharmony_ci 311e1051a39Sopenharmony_ci .body 312e1051a39Sopenharmony_ci{ .mib; setf.sig f8=r35 // w 313e1051a39Sopenharmony_ci mov pr.rot=0x800001<<16 314e1051a39Sopenharmony_ci // ------^----- serves as (p50) at first (p27) 315e1051a39Sopenharmony_ci brp.loop.imp .L_bn_mul_words_ctop,.L_bn_mul_words_cend-16 316e1051a39Sopenharmony_ci } 317e1051a39Sopenharmony_ci 318e1051a39Sopenharmony_ci#ifndef XMA_TEMPTATION 319e1051a39Sopenharmony_ci 320e1051a39Sopenharmony_ci{ .mmi; ADDP r14=0,r32 // rp 321e1051a39Sopenharmony_ci ADDP r15=0,r33 // ap 322e1051a39Sopenharmony_ci mov ar.lc=r10 } 323e1051a39Sopenharmony_ci{ .mmi; mov r40=0 // serves as r35 at first (p27) 324e1051a39Sopenharmony_ci mov ar.ec=13 };; 325e1051a39Sopenharmony_ci 326e1051a39Sopenharmony_ci// This loop spins in 2*(n+12) ticks. It's scheduled for data in Itanium 327e1051a39Sopenharmony_ci// L2 cache (i.e. 9 ticks away) as floating point load/store instructions 328e1051a39Sopenharmony_ci// bypass L1 cache and L2 latency is actually best-case scenario for 329e1051a39Sopenharmony_ci// ldf8. The loop is not scalable and shall run in 2*(n+12) even on 330e1051a39Sopenharmony_ci// "wider" IA-64 implementations. It's a trade-off here. n+24 loop 331e1051a39Sopenharmony_ci// would give us ~5% in *overall* performance improvement on "wider" 332e1051a39Sopenharmony_ci// IA-64, but would hurt Itanium for about same because of longer 333e1051a39Sopenharmony_ci// epilogue. As it's a matter of few percents in either case I've 334e1051a39Sopenharmony_ci// chosen to trade the scalability for development time (you can see 335e1051a39Sopenharmony_ci// this very instruction sequence in bn_mul_add_words loop which in 336e1051a39Sopenharmony_ci// turn is scalable). 337e1051a39Sopenharmony_ci.L_bn_mul_words_ctop: 338e1051a39Sopenharmony_ci{ .mfi; (p25) getf.sig r36=f52 // low 339e1051a39Sopenharmony_ci (p21) xmpy.lu f48=f37,f8 340e1051a39Sopenharmony_ci (p28) cmp.ltu p54,p50=r41,r39 } 341e1051a39Sopenharmony_ci{ .mfi; (p16) ldf8 f32=[r15],8 342e1051a39Sopenharmony_ci (p21) xmpy.hu f40=f37,f8 343e1051a39Sopenharmony_ci (p0) nop.i 0x0 };; 344e1051a39Sopenharmony_ci{ .mii; (p25) getf.sig r32=f44 // high 345e1051a39Sopenharmony_ci .pred.rel "mutex",p50,p54 346e1051a39Sopenharmony_ci (p50) add r40=r38,r35 // (p27) 347e1051a39Sopenharmony_ci (p54) add r40=r38,r35,1 } // (p27) 348e1051a39Sopenharmony_ci{ .mfb; (p28) st8 [r14]=r41,8 349e1051a39Sopenharmony_ci (p0) nop.f 0x0 350e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_mul_words_ctop };; 351e1051a39Sopenharmony_ci.L_bn_mul_words_cend: 352e1051a39Sopenharmony_ci 353e1051a39Sopenharmony_ci{ .mii; nop.m 0x0 354e1051a39Sopenharmony_ci.pred.rel "mutex",p51,p55 355e1051a39Sopenharmony_ci(p51) add r8=r36,r0 356e1051a39Sopenharmony_ci(p55) add r8=r36,r0,1 } 357e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 358e1051a39Sopenharmony_ci nop.f 0x0 359e1051a39Sopenharmony_ci nop.b 0x0 } 360e1051a39Sopenharmony_ci 361e1051a39Sopenharmony_ci#else // XMA_TEMPTATION 362e1051a39Sopenharmony_ci 363e1051a39Sopenharmony_ci setf.sig f37=r0 // serves as carry at (p18) tick 364e1051a39Sopenharmony_ci mov ar.lc=r10 365e1051a39Sopenharmony_ci mov ar.ec=5;; 366e1051a39Sopenharmony_ci 367e1051a39Sopenharmony_ci// Most of you examining this code very likely wonder why in the name 368e1051a39Sopenharmony_ci// of Intel the following loop is commented out? Indeed, it looks so 369e1051a39Sopenharmony_ci// neat that you find it hard to believe that it's something wrong 370e1051a39Sopenharmony_ci// with it, right? The catch is that every iteration depends on the 371e1051a39Sopenharmony_ci// result from previous one and the latter isn't available instantly. 372e1051a39Sopenharmony_ci// The loop therefore spins at the latency of xma minus 1, or in other 373e1051a39Sopenharmony_ci// words at 6*(n+4) ticks:-( Compare to the "production" loop above 374e1051a39Sopenharmony_ci// that runs in 2*(n+11) where the low latency problem is worked around 375e1051a39Sopenharmony_ci// by moving the dependency to one-tick latent integer ALU. Note that 376e1051a39Sopenharmony_ci// "distance" between ldf8 and xma is not latency of ldf8, but the 377e1051a39Sopenharmony_ci// *difference* between xma and ldf8 latencies. 378e1051a39Sopenharmony_ci.L_bn_mul_words_ctop: 379e1051a39Sopenharmony_ci{ .mfi; (p16) ldf8 f32=[r33],8 380e1051a39Sopenharmony_ci (p18) xma.hu f38=f34,f8,f39 } 381e1051a39Sopenharmony_ci{ .mfb; (p20) stf8 [r32]=f37,8 382e1051a39Sopenharmony_ci (p18) xma.lu f35=f34,f8,f39 383e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_mul_words_ctop };; 384e1051a39Sopenharmony_ci.L_bn_mul_words_cend: 385e1051a39Sopenharmony_ci 386e1051a39Sopenharmony_ci getf.sig r8=f41 // the return value 387e1051a39Sopenharmony_ci 388e1051a39Sopenharmony_ci#endif // XMA_TEMPTATION 389e1051a39Sopenharmony_ci 390e1051a39Sopenharmony_ci{ .mii; nop.m 0x0 391e1051a39Sopenharmony_ci mov pr=r9,0x1ffff 392e1051a39Sopenharmony_ci mov ar.lc=r3 } 393e1051a39Sopenharmony_ci{ .mfb; rum 1<<5 // clear um.mfh 394e1051a39Sopenharmony_ci nop.f 0x0 395e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 396e1051a39Sopenharmony_ci.endp bn_mul_words# 397e1051a39Sopenharmony_ci#endif 398e1051a39Sopenharmony_ci 399e1051a39Sopenharmony_ci#if 1 400e1051a39Sopenharmony_ci// 401e1051a39Sopenharmony_ci// BN_ULONG bn_mul_add_words(BN_ULONG *rp, BN_ULONG *ap, int num, BN_ULONG w) 402e1051a39Sopenharmony_ci// 403e1051a39Sopenharmony_ci.global bn_mul_add_words# 404e1051a39Sopenharmony_ci.proc bn_mul_add_words# 405e1051a39Sopenharmony_ci.align 64 406e1051a39Sopenharmony_ci.skip 48 // makes the loop body aligned at 64-byte boundary 407e1051a39Sopenharmony_cibn_mul_add_words: 408e1051a39Sopenharmony_ci .prologue 409e1051a39Sopenharmony_ci .save ar.pfs,r2 410e1051a39Sopenharmony_ci{ .mmi; alloc r2=ar.pfs,4,4,0,8 411e1051a39Sopenharmony_ci cmp4.le p6,p0=r34,r0 412e1051a39Sopenharmony_ci .save ar.lc,r3 413e1051a39Sopenharmony_ci mov r3=ar.lc };; 414e1051a39Sopenharmony_ci{ .mib; mov r8=r0 // return value 415e1051a39Sopenharmony_ci sub r10=r34,r0,1 416e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 417e1051a39Sopenharmony_ci 418e1051a39Sopenharmony_ci{ .mib; setf.sig f8=r35 // w 419e1051a39Sopenharmony_ci .save pr,r9 420e1051a39Sopenharmony_ci mov r9=pr 421e1051a39Sopenharmony_ci brp.loop.imp .L_bn_mul_add_words_ctop,.L_bn_mul_add_words_cend-16 422e1051a39Sopenharmony_ci } 423e1051a39Sopenharmony_ci .body 424e1051a39Sopenharmony_ci{ .mmi; ADDP r14=0,r32 // rp 425e1051a39Sopenharmony_ci ADDP r15=0,r33 // ap 426e1051a39Sopenharmony_ci mov ar.lc=r10 } 427e1051a39Sopenharmony_ci{ .mii; ADDP r16=0,r32 // rp copy 428e1051a39Sopenharmony_ci mov pr.rot=0x2001<<16 429e1051a39Sopenharmony_ci // ------^----- serves as (p40) at first (p27) 430e1051a39Sopenharmony_ci mov ar.ec=11 };; 431e1051a39Sopenharmony_ci 432e1051a39Sopenharmony_ci// This loop spins in 3*(n+10) ticks on Itanium and in 2*(n+10) on 433e1051a39Sopenharmony_ci// Itanium 2. Yes, unlike previous versions it scales:-) Previous 434e1051a39Sopenharmony_ci// version was performing *all* additions in IALU and was starving 435e1051a39Sopenharmony_ci// for those even on Itanium 2. In this version one addition is 436e1051a39Sopenharmony_ci// moved to FPU and is folded with multiplication. This is at cost 437e1051a39Sopenharmony_ci// of propagating the result from previous call to this subroutine 438e1051a39Sopenharmony_ci// to L2 cache... In other words negligible even for shorter keys. 439e1051a39Sopenharmony_ci// *Overall* performance improvement [over previous version] varies 440e1051a39Sopenharmony_ci// from 11 to 22 percent depending on key length. 441e1051a39Sopenharmony_ci.L_bn_mul_add_words_ctop: 442e1051a39Sopenharmony_ci.pred.rel "mutex",p40,p42 443e1051a39Sopenharmony_ci{ .mfi; (p23) getf.sig r36=f45 // low 444e1051a39Sopenharmony_ci (p20) xma.lu f42=f36,f8,f50 // low 445e1051a39Sopenharmony_ci (p40) add r39=r39,r35 } // (p27) 446e1051a39Sopenharmony_ci{ .mfi; (p16) ldf8 f32=[r15],8 // *(ap++) 447e1051a39Sopenharmony_ci (p20) xma.hu f36=f36,f8,f50 // high 448e1051a39Sopenharmony_ci (p42) add r39=r39,r35,1 };; // (p27) 449e1051a39Sopenharmony_ci{ .mmi; (p24) getf.sig r32=f40 // high 450e1051a39Sopenharmony_ci (p16) ldf8 f46=[r16],8 // *(rp1++) 451e1051a39Sopenharmony_ci (p40) cmp.ltu p41,p39=r39,r35 } // (p27) 452e1051a39Sopenharmony_ci{ .mib; (p26) st8 [r14]=r39,8 // *(rp2++) 453e1051a39Sopenharmony_ci (p42) cmp.leu p41,p39=r39,r35 // (p27) 454e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_mul_add_words_ctop};; 455e1051a39Sopenharmony_ci.L_bn_mul_add_words_cend: 456e1051a39Sopenharmony_ci 457e1051a39Sopenharmony_ci{ .mmi; .pred.rel "mutex",p40,p42 458e1051a39Sopenharmony_ci(p40) add r8=r35,r0 459e1051a39Sopenharmony_ci(p42) add r8=r35,r0,1 460e1051a39Sopenharmony_ci mov pr=r9,0x1ffff } 461e1051a39Sopenharmony_ci{ .mib; rum 1<<5 // clear um.mfh 462e1051a39Sopenharmony_ci mov ar.lc=r3 463e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 464e1051a39Sopenharmony_ci.endp bn_mul_add_words# 465e1051a39Sopenharmony_ci#endif 466e1051a39Sopenharmony_ci 467e1051a39Sopenharmony_ci#if 1 468e1051a39Sopenharmony_ci// 469e1051a39Sopenharmony_ci// void bn_sqr_words(BN_ULONG *rp, BN_ULONG *ap, int num) 470e1051a39Sopenharmony_ci// 471e1051a39Sopenharmony_ci.global bn_sqr_words# 472e1051a39Sopenharmony_ci.proc bn_sqr_words# 473e1051a39Sopenharmony_ci.align 64 474e1051a39Sopenharmony_ci.skip 32 // makes the loop body aligned at 64-byte boundary 475e1051a39Sopenharmony_cibn_sqr_words: 476e1051a39Sopenharmony_ci .prologue 477e1051a39Sopenharmony_ci .save ar.pfs,r2 478e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,0,0,0 479e1051a39Sopenharmony_ci sxt4 r34=r34 };; 480e1051a39Sopenharmony_ci{ .mii; cmp.le p6,p0=r34,r0 481e1051a39Sopenharmony_ci mov r8=r0 } // return value 482e1051a39Sopenharmony_ci{ .mfb; ADDP r32=0,r32 483e1051a39Sopenharmony_ci nop.f 0x0 484e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 485e1051a39Sopenharmony_ci 486e1051a39Sopenharmony_ci{ .mii; sub r10=r34,r0,1 487e1051a39Sopenharmony_ci .save ar.lc,r3 488e1051a39Sopenharmony_ci mov r3=ar.lc 489e1051a39Sopenharmony_ci .save pr,r9 490e1051a39Sopenharmony_ci mov r9=pr };; 491e1051a39Sopenharmony_ci 492e1051a39Sopenharmony_ci .body 493e1051a39Sopenharmony_ci{ .mib; ADDP r33=0,r33 494e1051a39Sopenharmony_ci mov pr.rot=1<<16 495e1051a39Sopenharmony_ci brp.loop.imp .L_bn_sqr_words_ctop,.L_bn_sqr_words_cend-16 496e1051a39Sopenharmony_ci } 497e1051a39Sopenharmony_ci{ .mii; add r34=8,r32 498e1051a39Sopenharmony_ci mov ar.lc=r10 499e1051a39Sopenharmony_ci mov ar.ec=18 };; 500e1051a39Sopenharmony_ci 501e1051a39Sopenharmony_ci// 2*(n+17) on Itanium, (n+17) on "wider" IA-64 implementations. It's 502e1051a39Sopenharmony_ci// possible to compress the epilogue (I'm getting tired to write this 503e1051a39Sopenharmony_ci// comment over and over) and get down to 2*n+16 at the cost of 504e1051a39Sopenharmony_ci// scalability. The decision will very likely be reconsidered after the 505e1051a39Sopenharmony_ci// benchmark program is profiled. I.e. if performance gain on Itanium 506e1051a39Sopenharmony_ci// will appear larger than loss on "wider" IA-64, then the loop should 507e1051a39Sopenharmony_ci// be explicitly split and the epilogue compressed. 508e1051a39Sopenharmony_ci.L_bn_sqr_words_ctop: 509e1051a39Sopenharmony_ci{ .mfi; (p16) ldf8 f32=[r33],8 510e1051a39Sopenharmony_ci (p25) xmpy.lu f42=f41,f41 511e1051a39Sopenharmony_ci (p0) nop.i 0x0 } 512e1051a39Sopenharmony_ci{ .mib; (p33) stf8 [r32]=f50,16 513e1051a39Sopenharmony_ci (p0) nop.i 0x0 514e1051a39Sopenharmony_ci (p0) nop.b 0x0 } 515e1051a39Sopenharmony_ci{ .mfi; (p0) nop.m 0x0 516e1051a39Sopenharmony_ci (p25) xmpy.hu f52=f41,f41 517e1051a39Sopenharmony_ci (p0) nop.i 0x0 } 518e1051a39Sopenharmony_ci{ .mib; (p33) stf8 [r34]=f60,16 519e1051a39Sopenharmony_ci (p0) nop.i 0x0 520e1051a39Sopenharmony_ci br.ctop.sptk .L_bn_sqr_words_ctop };; 521e1051a39Sopenharmony_ci.L_bn_sqr_words_cend: 522e1051a39Sopenharmony_ci 523e1051a39Sopenharmony_ci{ .mii; nop.m 0x0 524e1051a39Sopenharmony_ci mov pr=r9,0x1ffff 525e1051a39Sopenharmony_ci mov ar.lc=r3 } 526e1051a39Sopenharmony_ci{ .mfb; rum 1<<5 // clear um.mfh 527e1051a39Sopenharmony_ci nop.f 0x0 528e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 529e1051a39Sopenharmony_ci.endp bn_sqr_words# 530e1051a39Sopenharmony_ci#endif 531e1051a39Sopenharmony_ci 532e1051a39Sopenharmony_ci#if 1 533e1051a39Sopenharmony_ci// Apparently we win nothing by implementing special bn_sqr_comba8. 534e1051a39Sopenharmony_ci// Yes, it is possible to reduce the number of multiplications by 535e1051a39Sopenharmony_ci// almost factor of two, but then the amount of additions would 536e1051a39Sopenharmony_ci// increase by factor of two (as we would have to perform those 537e1051a39Sopenharmony_ci// otherwise performed by xma ourselves). Normally we would trade 538e1051a39Sopenharmony_ci// anyway as multiplications are way more expensive, but not this 539e1051a39Sopenharmony_ci// time... Multiplication kernel is fully pipelined and as we drain 540e1051a39Sopenharmony_ci// one 128-bit multiplication result per clock cycle multiplications 541e1051a39Sopenharmony_ci// are effectively as inexpensive as additions. Special implementation 542e1051a39Sopenharmony_ci// might become of interest for "wider" IA-64 implementation as you'll 543e1051a39Sopenharmony_ci// be able to get through the multiplication phase faster (there won't 544e1051a39Sopenharmony_ci// be any stall issues as discussed in the commentary section below and 545e1051a39Sopenharmony_ci// you therefore will be able to employ all 4 FP units)... But these 546e1051a39Sopenharmony_ci// Itanium days it's simply too hard to justify the effort so I just 547e1051a39Sopenharmony_ci// drop down to bn_mul_comba8 code:-) 548e1051a39Sopenharmony_ci// 549e1051a39Sopenharmony_ci// void bn_sqr_comba8(BN_ULONG *r, BN_ULONG *a) 550e1051a39Sopenharmony_ci// 551e1051a39Sopenharmony_ci.global bn_sqr_comba8# 552e1051a39Sopenharmony_ci.proc bn_sqr_comba8# 553e1051a39Sopenharmony_ci.align 64 554e1051a39Sopenharmony_cibn_sqr_comba8: 555e1051a39Sopenharmony_ci .prologue 556e1051a39Sopenharmony_ci .save ar.pfs,r2 557e1051a39Sopenharmony_ci#if defined(_HPUX_SOURCE) && !defined(_LP64) 558e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,2,1,0,0 559e1051a39Sopenharmony_ci addp4 r33=0,r33 560e1051a39Sopenharmony_ci addp4 r32=0,r32 };; 561e1051a39Sopenharmony_ci{ .mii; 562e1051a39Sopenharmony_ci#else 563e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,2,1,0,0 564e1051a39Sopenharmony_ci#endif 565e1051a39Sopenharmony_ci mov r34=r33 566e1051a39Sopenharmony_ci add r14=8,r33 };; 567e1051a39Sopenharmony_ci .body 568e1051a39Sopenharmony_ci{ .mii; add r17=8,r34 569e1051a39Sopenharmony_ci add r15=16,r33 570e1051a39Sopenharmony_ci add r18=16,r34 } 571e1051a39Sopenharmony_ci{ .mfb; add r16=24,r33 572e1051a39Sopenharmony_ci br .L_cheat_entry_point8 };; 573e1051a39Sopenharmony_ci.endp bn_sqr_comba8# 574e1051a39Sopenharmony_ci#endif 575e1051a39Sopenharmony_ci 576e1051a39Sopenharmony_ci#if 1 577e1051a39Sopenharmony_ci// I've estimated this routine to run in ~120 ticks, but in reality 578e1051a39Sopenharmony_ci// (i.e. according to ar.itc) it takes ~160 ticks. Are those extra 579e1051a39Sopenharmony_ci// cycles consumed for instructions fetch? Or did I misinterpret some 580e1051a39Sopenharmony_ci// clause in Itanium µ-architecture manual? Comments are welcomed and 581e1051a39Sopenharmony_ci// highly appreciated. 582e1051a39Sopenharmony_ci// 583e1051a39Sopenharmony_ci// On Itanium 2 it takes ~190 ticks. This is because of stalls on 584e1051a39Sopenharmony_ci// result from getf.sig. I do nothing about it at this point for 585e1051a39Sopenharmony_ci// reasons depicted below. 586e1051a39Sopenharmony_ci// 587e1051a39Sopenharmony_ci// However! It should be noted that even 160 ticks is darn good result 588e1051a39Sopenharmony_ci// as it's over 10 (yes, ten, spelled as t-e-n) times faster than the 589e1051a39Sopenharmony_ci// C version (compiled with gcc with inline assembler). I really 590e1051a39Sopenharmony_ci// kicked compiler's butt here, didn't I? Yeah! This brings us to the 591e1051a39Sopenharmony_ci// following statement. It's damn shame that this routine isn't called 592e1051a39Sopenharmony_ci// very often nowadays! According to the profiler most CPU time is 593e1051a39Sopenharmony_ci// consumed by bn_mul_add_words called from BN_from_montgomery. In 594e1051a39Sopenharmony_ci// order to estimate what we're missing, I've compared the performance 595e1051a39Sopenharmony_ci// of this routine against "traditional" implementation, i.e. against 596e1051a39Sopenharmony_ci// following routine: 597e1051a39Sopenharmony_ci// 598e1051a39Sopenharmony_ci// void bn_mul_comba8(BN_ULONG *r, BN_ULONG *a, BN_ULONG *b) 599e1051a39Sopenharmony_ci// { r[ 8]=bn_mul_words( &(r[0]),a,8,b[0]); 600e1051a39Sopenharmony_ci// r[ 9]=bn_mul_add_words(&(r[1]),a,8,b[1]); 601e1051a39Sopenharmony_ci// r[10]=bn_mul_add_words(&(r[2]),a,8,b[2]); 602e1051a39Sopenharmony_ci// r[11]=bn_mul_add_words(&(r[3]),a,8,b[3]); 603e1051a39Sopenharmony_ci// r[12]=bn_mul_add_words(&(r[4]),a,8,b[4]); 604e1051a39Sopenharmony_ci// r[13]=bn_mul_add_words(&(r[5]),a,8,b[5]); 605e1051a39Sopenharmony_ci// r[14]=bn_mul_add_words(&(r[6]),a,8,b[6]); 606e1051a39Sopenharmony_ci// r[15]=bn_mul_add_words(&(r[7]),a,8,b[7]); 607e1051a39Sopenharmony_ci// } 608e1051a39Sopenharmony_ci// 609e1051a39Sopenharmony_ci// The one below is over 8 times faster than the one above:-( Even 610e1051a39Sopenharmony_ci// more reasons to "combafy" bn_mul_add_mont... 611e1051a39Sopenharmony_ci// 612e1051a39Sopenharmony_ci// And yes, this routine really made me wish there were an optimizing 613e1051a39Sopenharmony_ci// assembler! It also feels like it deserves a dedication. 614e1051a39Sopenharmony_ci// 615e1051a39Sopenharmony_ci// To my wife for being there and to my kids... 616e1051a39Sopenharmony_ci// 617e1051a39Sopenharmony_ci// void bn_mul_comba8(BN_ULONG *r, BN_ULONG *a, BN_ULONG *b) 618e1051a39Sopenharmony_ci// 619e1051a39Sopenharmony_ci#define carry1 r14 620e1051a39Sopenharmony_ci#define carry2 r15 621e1051a39Sopenharmony_ci#define carry3 r34 622e1051a39Sopenharmony_ci.global bn_mul_comba8# 623e1051a39Sopenharmony_ci.proc bn_mul_comba8# 624e1051a39Sopenharmony_ci.align 64 625e1051a39Sopenharmony_cibn_mul_comba8: 626e1051a39Sopenharmony_ci .prologue 627e1051a39Sopenharmony_ci .save ar.pfs,r2 628e1051a39Sopenharmony_ci#if defined(_HPUX_SOURCE) && !defined(_LP64) 629e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,0,0,0 630e1051a39Sopenharmony_ci addp4 r33=0,r33 631e1051a39Sopenharmony_ci addp4 r34=0,r34 };; 632e1051a39Sopenharmony_ci{ .mii; addp4 r32=0,r32 633e1051a39Sopenharmony_ci#else 634e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,0,0,0 635e1051a39Sopenharmony_ci#endif 636e1051a39Sopenharmony_ci add r14=8,r33 637e1051a39Sopenharmony_ci add r17=8,r34 } 638e1051a39Sopenharmony_ci .body 639e1051a39Sopenharmony_ci{ .mii; add r15=16,r33 640e1051a39Sopenharmony_ci add r18=16,r34 641e1051a39Sopenharmony_ci add r16=24,r33 } 642e1051a39Sopenharmony_ci.L_cheat_entry_point8: 643e1051a39Sopenharmony_ci{ .mmi; add r19=24,r34 644e1051a39Sopenharmony_ci 645e1051a39Sopenharmony_ci ldf8 f32=[r33],32 };; 646e1051a39Sopenharmony_ci 647e1051a39Sopenharmony_ci{ .mmi; ldf8 f120=[r34],32 648e1051a39Sopenharmony_ci ldf8 f121=[r17],32 } 649e1051a39Sopenharmony_ci{ .mmi; ldf8 f122=[r18],32 650e1051a39Sopenharmony_ci ldf8 f123=[r19],32 };; 651e1051a39Sopenharmony_ci{ .mmi; ldf8 f124=[r34] 652e1051a39Sopenharmony_ci ldf8 f125=[r17] } 653e1051a39Sopenharmony_ci{ .mmi; ldf8 f126=[r18] 654e1051a39Sopenharmony_ci ldf8 f127=[r19] } 655e1051a39Sopenharmony_ci 656e1051a39Sopenharmony_ci{ .mmi; ldf8 f33=[r14],32 657e1051a39Sopenharmony_ci ldf8 f34=[r15],32 } 658e1051a39Sopenharmony_ci{ .mmi; ldf8 f35=[r16],32;; 659e1051a39Sopenharmony_ci ldf8 f36=[r33] } 660e1051a39Sopenharmony_ci{ .mmi; ldf8 f37=[r14] 661e1051a39Sopenharmony_ci ldf8 f38=[r15] } 662e1051a39Sopenharmony_ci{ .mfi; ldf8 f39=[r16] 663e1051a39Sopenharmony_ci// -------\ Entering multiplier's heaven /------- 664e1051a39Sopenharmony_ci// ------------\ /------------ 665e1051a39Sopenharmony_ci// -----------------\ /----------------- 666e1051a39Sopenharmony_ci// ----------------------\/---------------------- 667e1051a39Sopenharmony_ci xma.hu f41=f32,f120,f0 } 668e1051a39Sopenharmony_ci{ .mfi; xma.lu f40=f32,f120,f0 };; // (*) 669e1051a39Sopenharmony_ci{ .mfi; xma.hu f51=f32,f121,f0 } 670e1051a39Sopenharmony_ci{ .mfi; xma.lu f50=f32,f121,f0 };; 671e1051a39Sopenharmony_ci{ .mfi; xma.hu f61=f32,f122,f0 } 672e1051a39Sopenharmony_ci{ .mfi; xma.lu f60=f32,f122,f0 };; 673e1051a39Sopenharmony_ci{ .mfi; xma.hu f71=f32,f123,f0 } 674e1051a39Sopenharmony_ci{ .mfi; xma.lu f70=f32,f123,f0 };; 675e1051a39Sopenharmony_ci{ .mfi; xma.hu f81=f32,f124,f0 } 676e1051a39Sopenharmony_ci{ .mfi; xma.lu f80=f32,f124,f0 };; 677e1051a39Sopenharmony_ci{ .mfi; xma.hu f91=f32,f125,f0 } 678e1051a39Sopenharmony_ci{ .mfi; xma.lu f90=f32,f125,f0 };; 679e1051a39Sopenharmony_ci{ .mfi; xma.hu f101=f32,f126,f0 } 680e1051a39Sopenharmony_ci{ .mfi; xma.lu f100=f32,f126,f0 };; 681e1051a39Sopenharmony_ci{ .mfi; xma.hu f111=f32,f127,f0 } 682e1051a39Sopenharmony_ci{ .mfi; xma.lu f110=f32,f127,f0 };;// 683e1051a39Sopenharmony_ci// (*) You can argue that splitting at every second bundle would 684e1051a39Sopenharmony_ci// prevent "wider" IA-64 implementations from achieving the peak 685e1051a39Sopenharmony_ci// performance. Well, not really... The catch is that if you 686e1051a39Sopenharmony_ci// intend to keep 4 FP units busy by splitting at every fourth 687e1051a39Sopenharmony_ci// bundle and thus perform these 16 multiplications in 4 ticks, 688e1051a39Sopenharmony_ci// the first bundle *below* would stall because the result from 689e1051a39Sopenharmony_ci// the first xma bundle *above* won't be available for another 3 690e1051a39Sopenharmony_ci// ticks (if not more, being an optimist, I assume that "wider" 691e1051a39Sopenharmony_ci// implementation will have same latency:-). This stall will hold 692e1051a39Sopenharmony_ci// you back and the performance would be as if every second bundle 693e1051a39Sopenharmony_ci// were split *anyway*... 694e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f40 695e1051a39Sopenharmony_ci xma.hu f42=f33,f120,f41 696e1051a39Sopenharmony_ci add r33=8,r32 } 697e1051a39Sopenharmony_ci{ .mfi; xma.lu f41=f33,f120,f41 };; 698e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f50 699e1051a39Sopenharmony_ci xma.hu f52=f33,f121,f51 } 700e1051a39Sopenharmony_ci{ .mfi; xma.lu f51=f33,f121,f51 };; 701e1051a39Sopenharmony_ci{ .mfi; st8 [r32]=r16,16 702e1051a39Sopenharmony_ci xma.hu f62=f33,f122,f61 } 703e1051a39Sopenharmony_ci{ .mfi; xma.lu f61=f33,f122,f61 };; 704e1051a39Sopenharmony_ci{ .mfi; xma.hu f72=f33,f123,f71 } 705e1051a39Sopenharmony_ci{ .mfi; xma.lu f71=f33,f123,f71 };; 706e1051a39Sopenharmony_ci{ .mfi; xma.hu f82=f33,f124,f81 } 707e1051a39Sopenharmony_ci{ .mfi; xma.lu f81=f33,f124,f81 };; 708e1051a39Sopenharmony_ci{ .mfi; xma.hu f92=f33,f125,f91 } 709e1051a39Sopenharmony_ci{ .mfi; xma.lu f91=f33,f125,f91 };; 710e1051a39Sopenharmony_ci{ .mfi; xma.hu f102=f33,f126,f101 } 711e1051a39Sopenharmony_ci{ .mfi; xma.lu f101=f33,f126,f101 };; 712e1051a39Sopenharmony_ci{ .mfi; xma.hu f112=f33,f127,f111 } 713e1051a39Sopenharmony_ci{ .mfi; xma.lu f111=f33,f127,f111 };;// 714e1051a39Sopenharmony_ci//-------------------------------------------------// 715e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f41 716e1051a39Sopenharmony_ci xma.hu f43=f34,f120,f42 } 717e1051a39Sopenharmony_ci{ .mfi; xma.lu f42=f34,f120,f42 };; 718e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f60 719e1051a39Sopenharmony_ci xma.hu f53=f34,f121,f52 } 720e1051a39Sopenharmony_ci{ .mfi; xma.lu f52=f34,f121,f52 };; 721e1051a39Sopenharmony_ci{ .mfi; getf.sig r17=f51 722e1051a39Sopenharmony_ci xma.hu f63=f34,f122,f62 723e1051a39Sopenharmony_ci add r25=r25,r24 } 724e1051a39Sopenharmony_ci{ .mfi; xma.lu f62=f34,f122,f62 725e1051a39Sopenharmony_ci mov carry1=0 };; 726e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r25,r24 727e1051a39Sopenharmony_ci xma.hu f73=f34,f123,f72 } 728e1051a39Sopenharmony_ci{ .mfi; xma.lu f72=f34,f123,f72 };; 729e1051a39Sopenharmony_ci{ .mfi; st8 [r33]=r25,16 730e1051a39Sopenharmony_ci xma.hu f83=f34,f124,f82 731e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 732e1051a39Sopenharmony_ci{ .mfi; xma.lu f82=f34,f124,f82 };; 733e1051a39Sopenharmony_ci{ .mfi; xma.hu f93=f34,f125,f92 } 734e1051a39Sopenharmony_ci{ .mfi; xma.lu f92=f34,f125,f92 };; 735e1051a39Sopenharmony_ci{ .mfi; xma.hu f103=f34,f126,f102 } 736e1051a39Sopenharmony_ci{ .mfi; xma.lu f102=f34,f126,f102 };; 737e1051a39Sopenharmony_ci{ .mfi; xma.hu f113=f34,f127,f112 } 738e1051a39Sopenharmony_ci{ .mfi; xma.lu f112=f34,f127,f112 };;// 739e1051a39Sopenharmony_ci//-------------------------------------------------// 740e1051a39Sopenharmony_ci{ .mfi; getf.sig r18=f42 741e1051a39Sopenharmony_ci xma.hu f44=f35,f120,f43 742e1051a39Sopenharmony_ci add r17=r17,r16 } 743e1051a39Sopenharmony_ci{ .mfi; xma.lu f43=f35,f120,f43 };; 744e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f70 745e1051a39Sopenharmony_ci xma.hu f54=f35,f121,f53 } 746e1051a39Sopenharmony_ci{ .mfi; mov carry2=0 747e1051a39Sopenharmony_ci xma.lu f53=f35,f121,f53 };; 748e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f61 749e1051a39Sopenharmony_ci xma.hu f64=f35,f122,f63 750e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 } 751e1051a39Sopenharmony_ci{ .mfi; add r18=r18,r17 752e1051a39Sopenharmony_ci xma.lu f63=f35,f122,f63 };; 753e1051a39Sopenharmony_ci{ .mfi; getf.sig r26=f52 754e1051a39Sopenharmony_ci xma.hu f74=f35,f123,f73 755e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 756e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r18,r17 757e1051a39Sopenharmony_ci xma.lu f73=f35,f123,f73 758e1051a39Sopenharmony_ci add r18=r18,carry1 };; 759e1051a39Sopenharmony_ci{ .mfi; 760e1051a39Sopenharmony_ci xma.hu f84=f35,f124,f83 761e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 762e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r18,carry1 763e1051a39Sopenharmony_ci xma.lu f83=f35,f124,f83 };; 764e1051a39Sopenharmony_ci{ .mfi; st8 [r32]=r18,16 765e1051a39Sopenharmony_ci xma.hu f94=f35,f125,f93 766e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 767e1051a39Sopenharmony_ci{ .mfi; xma.lu f93=f35,f125,f93 };; 768e1051a39Sopenharmony_ci{ .mfi; xma.hu f104=f35,f126,f103 } 769e1051a39Sopenharmony_ci{ .mfi; xma.lu f103=f35,f126,f103 };; 770e1051a39Sopenharmony_ci{ .mfi; xma.hu f114=f35,f127,f113 } 771e1051a39Sopenharmony_ci{ .mfi; mov carry1=0 772e1051a39Sopenharmony_ci xma.lu f113=f35,f127,f113 773e1051a39Sopenharmony_ci add r25=r25,r24 };;// 774e1051a39Sopenharmony_ci//-------------------------------------------------// 775e1051a39Sopenharmony_ci{ .mfi; getf.sig r27=f43 776e1051a39Sopenharmony_ci xma.hu f45=f36,f120,f44 777e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 } 778e1051a39Sopenharmony_ci{ .mfi; xma.lu f44=f36,f120,f44 779e1051a39Sopenharmony_ci add r26=r26,r25 };; 780e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f80 781e1051a39Sopenharmony_ci xma.hu f55=f36,f121,f54 782e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 783e1051a39Sopenharmony_ci{ .mfi; xma.lu f54=f36,f121,f54 };; 784e1051a39Sopenharmony_ci{ .mfi; getf.sig r17=f71 785e1051a39Sopenharmony_ci xma.hu f65=f36,f122,f64 786e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 } 787e1051a39Sopenharmony_ci{ .mfi; xma.lu f64=f36,f122,f64 788e1051a39Sopenharmony_ci add r27=r27,r26 };; 789e1051a39Sopenharmony_ci{ .mfi; getf.sig r18=f62 790e1051a39Sopenharmony_ci xma.hu f75=f36,f123,f74 791e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 792e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r27,r26 793e1051a39Sopenharmony_ci xma.lu f74=f36,f123,f74 794e1051a39Sopenharmony_ci add r27=r27,carry2 };; 795e1051a39Sopenharmony_ci{ .mfi; getf.sig r19=f53 796e1051a39Sopenharmony_ci xma.hu f85=f36,f124,f84 797e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 798e1051a39Sopenharmony_ci{ .mfi; xma.lu f84=f36,f124,f84 799e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,carry2 };; 800e1051a39Sopenharmony_ci{ .mfi; st8 [r33]=r27,16 801e1051a39Sopenharmony_ci xma.hu f95=f36,f125,f94 802e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 803e1051a39Sopenharmony_ci{ .mfi; xma.lu f94=f36,f125,f94 };; 804e1051a39Sopenharmony_ci{ .mfi; xma.hu f105=f36,f126,f104 } 805e1051a39Sopenharmony_ci{ .mfi; mov carry2=0 806e1051a39Sopenharmony_ci xma.lu f104=f36,f126,f104 807e1051a39Sopenharmony_ci add r17=r17,r16 };; 808e1051a39Sopenharmony_ci{ .mfi; xma.hu f115=f36,f127,f114 809e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 } 810e1051a39Sopenharmony_ci{ .mfi; xma.lu f114=f36,f127,f114 811e1051a39Sopenharmony_ci add r18=r18,r17 };;// 812e1051a39Sopenharmony_ci//-------------------------------------------------// 813e1051a39Sopenharmony_ci{ .mfi; getf.sig r20=f44 814e1051a39Sopenharmony_ci xma.hu f46=f37,f120,f45 815e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 816e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r18,r17 817e1051a39Sopenharmony_ci xma.lu f45=f37,f120,f45 818e1051a39Sopenharmony_ci add r19=r19,r18 };; 819e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f90 820e1051a39Sopenharmony_ci xma.hu f56=f37,f121,f55 } 821e1051a39Sopenharmony_ci{ .mfi; xma.lu f55=f37,f121,f55 };; 822e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f81 823e1051a39Sopenharmony_ci xma.hu f66=f37,f122,f65 824e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 825e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r19,r18 826e1051a39Sopenharmony_ci xma.lu f65=f37,f122,f65 827e1051a39Sopenharmony_ci add r20=r20,r19 };; 828e1051a39Sopenharmony_ci{ .mfi; getf.sig r26=f72 829e1051a39Sopenharmony_ci xma.hu f76=f37,f123,f75 830e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 831e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r20,r19 832e1051a39Sopenharmony_ci xma.lu f75=f37,f123,f75 833e1051a39Sopenharmony_ci add r20=r20,carry1 };; 834e1051a39Sopenharmony_ci{ .mfi; getf.sig r27=f63 835e1051a39Sopenharmony_ci xma.hu f86=f37,f124,f85 836e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 837e1051a39Sopenharmony_ci{ .mfi; xma.lu f85=f37,f124,f85 838e1051a39Sopenharmony_ci cmp.ltu p7,p0=r20,carry1 };; 839e1051a39Sopenharmony_ci{ .mfi; getf.sig r28=f54 840e1051a39Sopenharmony_ci xma.hu f96=f37,f125,f95 841e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 842e1051a39Sopenharmony_ci{ .mfi; st8 [r32]=r20,16 843e1051a39Sopenharmony_ci xma.lu f95=f37,f125,f95 };; 844e1051a39Sopenharmony_ci{ .mfi; xma.hu f106=f37,f126,f105 } 845e1051a39Sopenharmony_ci{ .mfi; mov carry1=0 846e1051a39Sopenharmony_ci xma.lu f105=f37,f126,f105 847e1051a39Sopenharmony_ci add r25=r25,r24 };; 848e1051a39Sopenharmony_ci{ .mfi; xma.hu f116=f37,f127,f115 849e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 } 850e1051a39Sopenharmony_ci{ .mfi; xma.lu f115=f37,f127,f115 851e1051a39Sopenharmony_ci add r26=r26,r25 };;// 852e1051a39Sopenharmony_ci//-------------------------------------------------// 853e1051a39Sopenharmony_ci{ .mfi; getf.sig r29=f45 854e1051a39Sopenharmony_ci xma.hu f47=f38,f120,f46 855e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 856e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r26,r25 857e1051a39Sopenharmony_ci xma.lu f46=f38,f120,f46 858e1051a39Sopenharmony_ci add r27=r27,r26 };; 859e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f100 860e1051a39Sopenharmony_ci xma.hu f57=f38,f121,f56 861e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 862e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r27,r26 863e1051a39Sopenharmony_ci xma.lu f56=f38,f121,f56 864e1051a39Sopenharmony_ci add r28=r28,r27 };; 865e1051a39Sopenharmony_ci{ .mfi; getf.sig r17=f91 866e1051a39Sopenharmony_ci xma.hu f67=f38,f122,f66 867e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 868e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r28,r27 869e1051a39Sopenharmony_ci xma.lu f66=f38,f122,f66 870e1051a39Sopenharmony_ci add r29=r29,r28 };; 871e1051a39Sopenharmony_ci{ .mfi; getf.sig r18=f82 872e1051a39Sopenharmony_ci xma.hu f77=f38,f123,f76 873e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 874e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p6,p0=r29,r28 875e1051a39Sopenharmony_ci xma.lu f76=f38,f123,f76 876e1051a39Sopenharmony_ci add r29=r29,carry2 };; 877e1051a39Sopenharmony_ci{ .mfi; getf.sig r19=f73 878e1051a39Sopenharmony_ci xma.hu f87=f38,f124,f86 879e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 880e1051a39Sopenharmony_ci{ .mfi; xma.lu f86=f38,f124,f86 881e1051a39Sopenharmony_ci cmp.ltu p6,p0=r29,carry2 };; 882e1051a39Sopenharmony_ci{ .mfi; getf.sig r20=f64 883e1051a39Sopenharmony_ci xma.hu f97=f38,f125,f96 884e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 885e1051a39Sopenharmony_ci{ .mfi; st8 [r33]=r29,16 886e1051a39Sopenharmony_ci xma.lu f96=f38,f125,f96 };; 887e1051a39Sopenharmony_ci{ .mfi; getf.sig r21=f55 888e1051a39Sopenharmony_ci xma.hu f107=f38,f126,f106 } 889e1051a39Sopenharmony_ci{ .mfi; mov carry2=0 890e1051a39Sopenharmony_ci xma.lu f106=f38,f126,f106 891e1051a39Sopenharmony_ci add r17=r17,r16 };; 892e1051a39Sopenharmony_ci{ .mfi; xma.hu f117=f38,f127,f116 893e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 } 894e1051a39Sopenharmony_ci{ .mfi; xma.lu f116=f38,f127,f116 895e1051a39Sopenharmony_ci add r18=r18,r17 };;// 896e1051a39Sopenharmony_ci//-------------------------------------------------// 897e1051a39Sopenharmony_ci{ .mfi; getf.sig r22=f46 898e1051a39Sopenharmony_ci xma.hu f48=f39,f120,f47 899e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 900e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r18,r17 901e1051a39Sopenharmony_ci xma.lu f47=f39,f120,f47 902e1051a39Sopenharmony_ci add r19=r19,r18 };; 903e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f110 904e1051a39Sopenharmony_ci xma.hu f58=f39,f121,f57 905e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 906e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r19,r18 907e1051a39Sopenharmony_ci xma.lu f57=f39,f121,f57 908e1051a39Sopenharmony_ci add r20=r20,r19 };; 909e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f101 910e1051a39Sopenharmony_ci xma.hu f68=f39,f122,f67 911e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 912e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r20,r19 913e1051a39Sopenharmony_ci xma.lu f67=f39,f122,f67 914e1051a39Sopenharmony_ci add r21=r21,r20 };; 915e1051a39Sopenharmony_ci{ .mfi; getf.sig r26=f92 916e1051a39Sopenharmony_ci xma.hu f78=f39,f123,f77 917e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 918e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r21,r20 919e1051a39Sopenharmony_ci xma.lu f77=f39,f123,f77 920e1051a39Sopenharmony_ci add r22=r22,r21 };; 921e1051a39Sopenharmony_ci{ .mfi; getf.sig r27=f83 922e1051a39Sopenharmony_ci xma.hu f88=f39,f124,f87 923e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 924e1051a39Sopenharmony_ci{ .mfi; cmp.ltu p7,p0=r22,r21 925e1051a39Sopenharmony_ci xma.lu f87=f39,f124,f87 926e1051a39Sopenharmony_ci add r22=r22,carry1 };; 927e1051a39Sopenharmony_ci{ .mfi; getf.sig r28=f74 928e1051a39Sopenharmony_ci xma.hu f98=f39,f125,f97 929e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 930e1051a39Sopenharmony_ci{ .mfi; xma.lu f97=f39,f125,f97 931e1051a39Sopenharmony_ci cmp.ltu p7,p0=r22,carry1 };; 932e1051a39Sopenharmony_ci{ .mfi; getf.sig r29=f65 933e1051a39Sopenharmony_ci xma.hu f108=f39,f126,f107 934e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 } 935e1051a39Sopenharmony_ci{ .mfi; st8 [r32]=r22,16 936e1051a39Sopenharmony_ci xma.lu f107=f39,f126,f107 };; 937e1051a39Sopenharmony_ci{ .mfi; getf.sig r30=f56 938e1051a39Sopenharmony_ci xma.hu f118=f39,f127,f117 } 939e1051a39Sopenharmony_ci{ .mfi; xma.lu f117=f39,f127,f117 };;// 940e1051a39Sopenharmony_ci//-------------------------------------------------// 941e1051a39Sopenharmony_ci// Leaving multiplier's heaven... Quite a ride, huh? 942e1051a39Sopenharmony_ci 943e1051a39Sopenharmony_ci{ .mii; getf.sig r31=f47 944e1051a39Sopenharmony_ci add r25=r25,r24 945e1051a39Sopenharmony_ci mov carry1=0 };; 946e1051a39Sopenharmony_ci{ .mii; getf.sig r16=f111 947e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 948e1051a39Sopenharmony_ci add r26=r26,r25 };; 949e1051a39Sopenharmony_ci{ .mfb; getf.sig r17=f102 } 950e1051a39Sopenharmony_ci{ .mii; 951e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 952e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 953e1051a39Sopenharmony_ci add r27=r27,r26 };; 954e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 } 955e1051a39Sopenharmony_ci{ .mii; 956e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 957e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,r26 958e1051a39Sopenharmony_ci add r28=r28,r27 };; 959e1051a39Sopenharmony_ci{ .mii; getf.sig r18=f93 960e1051a39Sopenharmony_ci add r17=r17,r16 961e1051a39Sopenharmony_ci mov carry3=0 } 962e1051a39Sopenharmony_ci{ .mii; 963e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 964e1051a39Sopenharmony_ci cmp.ltu p6,p0=r28,r27 965e1051a39Sopenharmony_ci add r29=r29,r28 };; 966e1051a39Sopenharmony_ci{ .mii; getf.sig r19=f84 967e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 } 968e1051a39Sopenharmony_ci{ .mii; 969e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 970e1051a39Sopenharmony_ci cmp.ltu p6,p0=r29,r28 971e1051a39Sopenharmony_ci add r30=r30,r29 };; 972e1051a39Sopenharmony_ci{ .mii; getf.sig r20=f75 973e1051a39Sopenharmony_ci add r18=r18,r17 } 974e1051a39Sopenharmony_ci{ .mii; 975e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 976e1051a39Sopenharmony_ci cmp.ltu p6,p0=r30,r29 977e1051a39Sopenharmony_ci add r31=r31,r30 };; 978e1051a39Sopenharmony_ci{ .mfb; getf.sig r21=f66 } 979e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 980e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,r17 981e1051a39Sopenharmony_ci add r19=r19,r18 } 982e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 } 983e1051a39Sopenharmony_ci{ .mii; 984e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 985e1051a39Sopenharmony_ci cmp.ltu p6,p0=r31,r30 986e1051a39Sopenharmony_ci add r31=r31,carry2 };; 987e1051a39Sopenharmony_ci{ .mfb; getf.sig r22=f57 } 988e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 989e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,r18 990e1051a39Sopenharmony_ci add r20=r20,r19 } 991e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 } 992e1051a39Sopenharmony_ci{ .mii; 993e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 994e1051a39Sopenharmony_ci cmp.ltu p6,p0=r31,carry2 };; 995e1051a39Sopenharmony_ci{ .mfb; getf.sig r23=f48 } 996e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 997e1051a39Sopenharmony_ci cmp.ltu p7,p0=r20,r19 998e1051a39Sopenharmony_ci add r21=r21,r20 } 999e1051a39Sopenharmony_ci{ .mii; 1000e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1001e1051a39Sopenharmony_ci{ .mfb; st8 [r33]=r31,16 };; 1002e1051a39Sopenharmony_ci 1003e1051a39Sopenharmony_ci{ .mfb; getf.sig r24=f112 } 1004e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1005e1051a39Sopenharmony_ci cmp.ltu p7,p0=r21,r20 1006e1051a39Sopenharmony_ci add r22=r22,r21 };; 1007e1051a39Sopenharmony_ci{ .mfb; getf.sig r25=f103 } 1008e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1009e1051a39Sopenharmony_ci cmp.ltu p7,p0=r22,r21 1010e1051a39Sopenharmony_ci add r23=r23,r22 };; 1011e1051a39Sopenharmony_ci{ .mfb; getf.sig r26=f94 } 1012e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1013e1051a39Sopenharmony_ci cmp.ltu p7,p0=r23,r22 1014e1051a39Sopenharmony_ci add r23=r23,carry1 };; 1015e1051a39Sopenharmony_ci{ .mfb; getf.sig r27=f85 } 1016e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1017e1051a39Sopenharmony_ci cmp.ltu p7,p8=r23,carry1};; 1018e1051a39Sopenharmony_ci{ .mii; getf.sig r28=f76 1019e1051a39Sopenharmony_ci add r25=r25,r24 1020e1051a39Sopenharmony_ci mov carry1=0 } 1021e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r23,16 1022e1051a39Sopenharmony_ci (p7) add carry2=1,carry3 1023e1051a39Sopenharmony_ci (p8) add carry2=0,carry3 };; 1024e1051a39Sopenharmony_ci 1025e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 } 1026e1051a39Sopenharmony_ci{ .mii; getf.sig r29=f67 1027e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 1028e1051a39Sopenharmony_ci add r26=r26,r25 };; 1029e1051a39Sopenharmony_ci{ .mfb; getf.sig r30=f58 } 1030e1051a39Sopenharmony_ci{ .mii; 1031e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1032e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 1033e1051a39Sopenharmony_ci add r27=r27,r26 };; 1034e1051a39Sopenharmony_ci{ .mfb; getf.sig r16=f113 } 1035e1051a39Sopenharmony_ci{ .mii; 1036e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1037e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,r26 1038e1051a39Sopenharmony_ci add r28=r28,r27 };; 1039e1051a39Sopenharmony_ci{ .mfb; getf.sig r17=f104 } 1040e1051a39Sopenharmony_ci{ .mii; 1041e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1042e1051a39Sopenharmony_ci cmp.ltu p6,p0=r28,r27 1043e1051a39Sopenharmony_ci add r29=r29,r28 };; 1044e1051a39Sopenharmony_ci{ .mfb; getf.sig r18=f95 } 1045e1051a39Sopenharmony_ci{ .mii; 1046e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1047e1051a39Sopenharmony_ci cmp.ltu p6,p0=r29,r28 1048e1051a39Sopenharmony_ci add r30=r30,r29 };; 1049e1051a39Sopenharmony_ci{ .mii; getf.sig r19=f86 1050e1051a39Sopenharmony_ci add r17=r17,r16 1051e1051a39Sopenharmony_ci mov carry3=0 } 1052e1051a39Sopenharmony_ci{ .mii; 1053e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1054e1051a39Sopenharmony_ci cmp.ltu p6,p0=r30,r29 1055e1051a39Sopenharmony_ci add r30=r30,carry2 };; 1056e1051a39Sopenharmony_ci{ .mii; getf.sig r20=f77 1057e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 1058e1051a39Sopenharmony_ci add r18=r18,r17 } 1059e1051a39Sopenharmony_ci{ .mii; 1060e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1061e1051a39Sopenharmony_ci cmp.ltu p6,p0=r30,carry2 };; 1062e1051a39Sopenharmony_ci{ .mfb; getf.sig r21=f68 } 1063e1051a39Sopenharmony_ci{ .mii; st8 [r33]=r30,16 1064e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 };; 1065e1051a39Sopenharmony_ci 1066e1051a39Sopenharmony_ci{ .mfb; getf.sig r24=f114 } 1067e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1068e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,r17 1069e1051a39Sopenharmony_ci add r19=r19,r18 };; 1070e1051a39Sopenharmony_ci{ .mfb; getf.sig r25=f105 } 1071e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1072e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,r18 1073e1051a39Sopenharmony_ci add r20=r20,r19 };; 1074e1051a39Sopenharmony_ci{ .mfb; getf.sig r26=f96 } 1075e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1076e1051a39Sopenharmony_ci cmp.ltu p7,p0=r20,r19 1077e1051a39Sopenharmony_ci add r21=r21,r20 };; 1078e1051a39Sopenharmony_ci{ .mfb; getf.sig r27=f87 } 1079e1051a39Sopenharmony_ci{ .mii; (p7) add carry3=1,carry3 1080e1051a39Sopenharmony_ci cmp.ltu p7,p0=r21,r20 1081e1051a39Sopenharmony_ci add r21=r21,carry1 };; 1082e1051a39Sopenharmony_ci{ .mib; getf.sig r28=f78 1083e1051a39Sopenharmony_ci add r25=r25,r24 } 1084e1051a39Sopenharmony_ci{ .mib; (p7) add carry3=1,carry3 1085e1051a39Sopenharmony_ci cmp.ltu p7,p8=r21,carry1};; 1086e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r21,16 1087e1051a39Sopenharmony_ci (p7) add carry2=1,carry3 1088e1051a39Sopenharmony_ci (p8) add carry2=0,carry3 } 1089e1051a39Sopenharmony_ci 1090e1051a39Sopenharmony_ci{ .mii; mov carry1=0 1091e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 1092e1051a39Sopenharmony_ci add r26=r26,r25 };; 1093e1051a39Sopenharmony_ci{ .mfb; getf.sig r16=f115 } 1094e1051a39Sopenharmony_ci{ .mii; 1095e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1096e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 1097e1051a39Sopenharmony_ci add r27=r27,r26 };; 1098e1051a39Sopenharmony_ci{ .mfb; getf.sig r17=f106 } 1099e1051a39Sopenharmony_ci{ .mii; 1100e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1101e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,r26 1102e1051a39Sopenharmony_ci add r28=r28,r27 };; 1103e1051a39Sopenharmony_ci{ .mfb; getf.sig r18=f97 } 1104e1051a39Sopenharmony_ci{ .mii; 1105e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1106e1051a39Sopenharmony_ci cmp.ltu p6,p0=r28,r27 1107e1051a39Sopenharmony_ci add r28=r28,carry2 };; 1108e1051a39Sopenharmony_ci{ .mib; getf.sig r19=f88 1109e1051a39Sopenharmony_ci add r17=r17,r16 } 1110e1051a39Sopenharmony_ci{ .mib; 1111e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1112e1051a39Sopenharmony_ci cmp.ltu p6,p0=r28,carry2 };; 1113e1051a39Sopenharmony_ci{ .mii; st8 [r33]=r28,16 1114e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1115e1051a39Sopenharmony_ci 1116e1051a39Sopenharmony_ci{ .mii; mov carry2=0 1117e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 1118e1051a39Sopenharmony_ci add r18=r18,r17 };; 1119e1051a39Sopenharmony_ci{ .mfb; getf.sig r24=f116 } 1120e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1121e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,r17 1122e1051a39Sopenharmony_ci add r19=r19,r18 };; 1123e1051a39Sopenharmony_ci{ .mfb; getf.sig r25=f107 } 1124e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1125e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,r18 1126e1051a39Sopenharmony_ci add r19=r19,carry1 };; 1127e1051a39Sopenharmony_ci{ .mfb; getf.sig r26=f98 } 1128e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1129e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,carry1};; 1130e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r19,16 1131e1051a39Sopenharmony_ci (p7) add carry2=1,carry2 } 1132e1051a39Sopenharmony_ci 1133e1051a39Sopenharmony_ci{ .mfb; add r25=r25,r24 };; 1134e1051a39Sopenharmony_ci 1135e1051a39Sopenharmony_ci{ .mfb; getf.sig r16=f117 } 1136e1051a39Sopenharmony_ci{ .mii; mov carry1=0 1137e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 1138e1051a39Sopenharmony_ci add r26=r26,r25 };; 1139e1051a39Sopenharmony_ci{ .mfb; getf.sig r17=f108 } 1140e1051a39Sopenharmony_ci{ .mii; 1141e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1142e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 1143e1051a39Sopenharmony_ci add r26=r26,carry2 };; 1144e1051a39Sopenharmony_ci{ .mfb; nop.m 0x0 } 1145e1051a39Sopenharmony_ci{ .mii; 1146e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1147e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,carry2 };; 1148e1051a39Sopenharmony_ci{ .mii; st8 [r33]=r26,16 1149e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1150e1051a39Sopenharmony_ci 1151e1051a39Sopenharmony_ci{ .mfb; add r17=r17,r16 };; 1152e1051a39Sopenharmony_ci{ .mfb; getf.sig r24=f118 } 1153e1051a39Sopenharmony_ci{ .mii; mov carry2=0 1154e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 1155e1051a39Sopenharmony_ci add r17=r17,carry1 };; 1156e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1157e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,carry1};; 1158e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r17 1159e1051a39Sopenharmony_ci (p7) add carry2=1,carry2 };; 1160e1051a39Sopenharmony_ci{ .mfb; add r24=r24,carry2 };; 1161e1051a39Sopenharmony_ci{ .mib; st8 [r33]=r24 } 1162e1051a39Sopenharmony_ci 1163e1051a39Sopenharmony_ci{ .mib; rum 1<<5 // clear um.mfh 1164e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 1165e1051a39Sopenharmony_ci.endp bn_mul_comba8# 1166e1051a39Sopenharmony_ci#undef carry3 1167e1051a39Sopenharmony_ci#undef carry2 1168e1051a39Sopenharmony_ci#undef carry1 1169e1051a39Sopenharmony_ci#endif 1170e1051a39Sopenharmony_ci 1171e1051a39Sopenharmony_ci#if 1 1172e1051a39Sopenharmony_ci// It's possible to make it faster (see comment to bn_sqr_comba8), but 1173e1051a39Sopenharmony_ci// I reckon it doesn't worth the effort. Basically because the routine 1174e1051a39Sopenharmony_ci// (actually both of them) practically never called... So I just play 1175e1051a39Sopenharmony_ci// same trick as with bn_sqr_comba8. 1176e1051a39Sopenharmony_ci// 1177e1051a39Sopenharmony_ci// void bn_sqr_comba4(BN_ULONG *r, BN_ULONG *a) 1178e1051a39Sopenharmony_ci// 1179e1051a39Sopenharmony_ci.global bn_sqr_comba4# 1180e1051a39Sopenharmony_ci.proc bn_sqr_comba4# 1181e1051a39Sopenharmony_ci.align 64 1182e1051a39Sopenharmony_cibn_sqr_comba4: 1183e1051a39Sopenharmony_ci .prologue 1184e1051a39Sopenharmony_ci .save ar.pfs,r2 1185e1051a39Sopenharmony_ci#if defined(_HPUX_SOURCE) && !defined(_LP64) 1186e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,2,1,0,0 1187e1051a39Sopenharmony_ci addp4 r32=0,r32 1188e1051a39Sopenharmony_ci addp4 r33=0,r33 };; 1189e1051a39Sopenharmony_ci{ .mii; 1190e1051a39Sopenharmony_ci#else 1191e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,2,1,0,0 1192e1051a39Sopenharmony_ci#endif 1193e1051a39Sopenharmony_ci mov r34=r33 1194e1051a39Sopenharmony_ci add r14=8,r33 };; 1195e1051a39Sopenharmony_ci .body 1196e1051a39Sopenharmony_ci{ .mii; add r17=8,r34 1197e1051a39Sopenharmony_ci add r15=16,r33 1198e1051a39Sopenharmony_ci add r18=16,r34 } 1199e1051a39Sopenharmony_ci{ .mfb; add r16=24,r33 1200e1051a39Sopenharmony_ci br .L_cheat_entry_point4 };; 1201e1051a39Sopenharmony_ci.endp bn_sqr_comba4# 1202e1051a39Sopenharmony_ci#endif 1203e1051a39Sopenharmony_ci 1204e1051a39Sopenharmony_ci#if 1 1205e1051a39Sopenharmony_ci// Runs in ~115 cycles and ~4.5 times faster than C. Well, whatever... 1206e1051a39Sopenharmony_ci// 1207e1051a39Sopenharmony_ci// void bn_mul_comba4(BN_ULONG *r, BN_ULONG *a, BN_ULONG *b) 1208e1051a39Sopenharmony_ci// 1209e1051a39Sopenharmony_ci#define carry1 r14 1210e1051a39Sopenharmony_ci#define carry2 r15 1211e1051a39Sopenharmony_ci.global bn_mul_comba4# 1212e1051a39Sopenharmony_ci.proc bn_mul_comba4# 1213e1051a39Sopenharmony_ci.align 64 1214e1051a39Sopenharmony_cibn_mul_comba4: 1215e1051a39Sopenharmony_ci .prologue 1216e1051a39Sopenharmony_ci .save ar.pfs,r2 1217e1051a39Sopenharmony_ci#if defined(_HPUX_SOURCE) && !defined(_LP64) 1218e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,0,0,0 1219e1051a39Sopenharmony_ci addp4 r33=0,r33 1220e1051a39Sopenharmony_ci addp4 r34=0,r34 };; 1221e1051a39Sopenharmony_ci{ .mii; addp4 r32=0,r32 1222e1051a39Sopenharmony_ci#else 1223e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,0,0,0 1224e1051a39Sopenharmony_ci#endif 1225e1051a39Sopenharmony_ci add r14=8,r33 1226e1051a39Sopenharmony_ci add r17=8,r34 } 1227e1051a39Sopenharmony_ci .body 1228e1051a39Sopenharmony_ci{ .mii; add r15=16,r33 1229e1051a39Sopenharmony_ci add r18=16,r34 1230e1051a39Sopenharmony_ci add r16=24,r33 };; 1231e1051a39Sopenharmony_ci.L_cheat_entry_point4: 1232e1051a39Sopenharmony_ci{ .mmi; add r19=24,r34 1233e1051a39Sopenharmony_ci 1234e1051a39Sopenharmony_ci ldf8 f32=[r33] } 1235e1051a39Sopenharmony_ci 1236e1051a39Sopenharmony_ci{ .mmi; ldf8 f120=[r34] 1237e1051a39Sopenharmony_ci ldf8 f121=[r17] };; 1238e1051a39Sopenharmony_ci{ .mmi; ldf8 f122=[r18] 1239e1051a39Sopenharmony_ci ldf8 f123=[r19] } 1240e1051a39Sopenharmony_ci 1241e1051a39Sopenharmony_ci{ .mmi; ldf8 f33=[r14] 1242e1051a39Sopenharmony_ci ldf8 f34=[r15] } 1243e1051a39Sopenharmony_ci{ .mfi; ldf8 f35=[r16] 1244e1051a39Sopenharmony_ci 1245e1051a39Sopenharmony_ci xma.hu f41=f32,f120,f0 } 1246e1051a39Sopenharmony_ci{ .mfi; xma.lu f40=f32,f120,f0 };; 1247e1051a39Sopenharmony_ci{ .mfi; xma.hu f51=f32,f121,f0 } 1248e1051a39Sopenharmony_ci{ .mfi; xma.lu f50=f32,f121,f0 };; 1249e1051a39Sopenharmony_ci{ .mfi; xma.hu f61=f32,f122,f0 } 1250e1051a39Sopenharmony_ci{ .mfi; xma.lu f60=f32,f122,f0 };; 1251e1051a39Sopenharmony_ci{ .mfi; xma.hu f71=f32,f123,f0 } 1252e1051a39Sopenharmony_ci{ .mfi; xma.lu f70=f32,f123,f0 };;// 1253e1051a39Sopenharmony_ci// Major stall takes place here, and 3 more places below. Result from 1254e1051a39Sopenharmony_ci// first xma is not available for another 3 ticks. 1255e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f40 1256e1051a39Sopenharmony_ci xma.hu f42=f33,f120,f41 1257e1051a39Sopenharmony_ci add r33=8,r32 } 1258e1051a39Sopenharmony_ci{ .mfi; xma.lu f41=f33,f120,f41 };; 1259e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f50 1260e1051a39Sopenharmony_ci xma.hu f52=f33,f121,f51 } 1261e1051a39Sopenharmony_ci{ .mfi; xma.lu f51=f33,f121,f51 };; 1262e1051a39Sopenharmony_ci{ .mfi; st8 [r32]=r16,16 1263e1051a39Sopenharmony_ci xma.hu f62=f33,f122,f61 } 1264e1051a39Sopenharmony_ci{ .mfi; xma.lu f61=f33,f122,f61 };; 1265e1051a39Sopenharmony_ci{ .mfi; xma.hu f72=f33,f123,f71 } 1266e1051a39Sopenharmony_ci{ .mfi; xma.lu f71=f33,f123,f71 };;// 1267e1051a39Sopenharmony_ci//-------------------------------------------------// 1268e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f41 1269e1051a39Sopenharmony_ci xma.hu f43=f34,f120,f42 } 1270e1051a39Sopenharmony_ci{ .mfi; xma.lu f42=f34,f120,f42 };; 1271e1051a39Sopenharmony_ci{ .mfi; getf.sig r16=f60 1272e1051a39Sopenharmony_ci xma.hu f53=f34,f121,f52 } 1273e1051a39Sopenharmony_ci{ .mfi; xma.lu f52=f34,f121,f52 };; 1274e1051a39Sopenharmony_ci{ .mfi; getf.sig r17=f51 1275e1051a39Sopenharmony_ci xma.hu f63=f34,f122,f62 1276e1051a39Sopenharmony_ci add r25=r25,r24 } 1277e1051a39Sopenharmony_ci{ .mfi; mov carry1=0 1278e1051a39Sopenharmony_ci xma.lu f62=f34,f122,f62 };; 1279e1051a39Sopenharmony_ci{ .mfi; st8 [r33]=r25,16 1280e1051a39Sopenharmony_ci xma.hu f73=f34,f123,f72 1281e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 } 1282e1051a39Sopenharmony_ci{ .mfi; xma.lu f72=f34,f123,f72 };;// 1283e1051a39Sopenharmony_ci//-------------------------------------------------// 1284e1051a39Sopenharmony_ci{ .mfi; getf.sig r18=f42 1285e1051a39Sopenharmony_ci xma.hu f44=f35,f120,f43 1286e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1287e1051a39Sopenharmony_ci{ .mfi; add r17=r17,r16 1288e1051a39Sopenharmony_ci xma.lu f43=f35,f120,f43 1289e1051a39Sopenharmony_ci mov carry2=0 };; 1290e1051a39Sopenharmony_ci{ .mfi; getf.sig r24=f70 1291e1051a39Sopenharmony_ci xma.hu f54=f35,f121,f53 1292e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 } 1293e1051a39Sopenharmony_ci{ .mfi; xma.lu f53=f35,f121,f53 };; 1294e1051a39Sopenharmony_ci{ .mfi; getf.sig r25=f61 1295e1051a39Sopenharmony_ci xma.hu f64=f35,f122,f63 1296e1051a39Sopenharmony_ci add r18=r18,r17 } 1297e1051a39Sopenharmony_ci{ .mfi; xma.lu f63=f35,f122,f63 1298e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 };; 1299e1051a39Sopenharmony_ci{ .mfi; getf.sig r26=f52 1300e1051a39Sopenharmony_ci xma.hu f74=f35,f123,f73 1301e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,r17 } 1302e1051a39Sopenharmony_ci{ .mfi; xma.lu f73=f35,f123,f73 1303e1051a39Sopenharmony_ci add r18=r18,carry1 };; 1304e1051a39Sopenharmony_ci//-------------------------------------------------// 1305e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r18,16 1306e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 1307e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,carry1 };; 1308e1051a39Sopenharmony_ci 1309e1051a39Sopenharmony_ci{ .mfi; getf.sig r27=f43 // last major stall 1310e1051a39Sopenharmony_ci(p7) add carry2=1,carry2 };; 1311e1051a39Sopenharmony_ci{ .mii; getf.sig r16=f71 1312e1051a39Sopenharmony_ci add r25=r25,r24 1313e1051a39Sopenharmony_ci mov carry1=0 };; 1314e1051a39Sopenharmony_ci{ .mii; getf.sig r17=f62 1315e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 1316e1051a39Sopenharmony_ci add r26=r26,r25 };; 1317e1051a39Sopenharmony_ci{ .mii; 1318e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1319e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 1320e1051a39Sopenharmony_ci add r27=r27,r26 };; 1321e1051a39Sopenharmony_ci{ .mii; 1322e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1323e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,r26 1324e1051a39Sopenharmony_ci add r27=r27,carry2 };; 1325e1051a39Sopenharmony_ci{ .mii; getf.sig r18=f53 1326e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1327e1051a39Sopenharmony_ci cmp.ltu p6,p0=r27,carry2 };; 1328e1051a39Sopenharmony_ci{ .mfi; st8 [r33]=r27,16 1329e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1330e1051a39Sopenharmony_ci 1331e1051a39Sopenharmony_ci{ .mii; getf.sig r19=f44 1332e1051a39Sopenharmony_ci add r17=r17,r16 1333e1051a39Sopenharmony_ci mov carry2=0 };; 1334e1051a39Sopenharmony_ci{ .mii; getf.sig r24=f72 1335e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,r16 1336e1051a39Sopenharmony_ci add r18=r18,r17 };; 1337e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1338e1051a39Sopenharmony_ci cmp.ltu p7,p0=r18,r17 1339e1051a39Sopenharmony_ci add r19=r19,r18 };; 1340e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1341e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,r18 1342e1051a39Sopenharmony_ci add r19=r19,carry1 };; 1343e1051a39Sopenharmony_ci{ .mii; getf.sig r25=f63 1344e1051a39Sopenharmony_ci (p7) add carry2=1,carry2 1345e1051a39Sopenharmony_ci cmp.ltu p7,p0=r19,carry1};; 1346e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r19,16 1347e1051a39Sopenharmony_ci (p7) add carry2=1,carry2 } 1348e1051a39Sopenharmony_ci 1349e1051a39Sopenharmony_ci{ .mii; getf.sig r26=f54 1350e1051a39Sopenharmony_ci add r25=r25,r24 1351e1051a39Sopenharmony_ci mov carry1=0 };; 1352e1051a39Sopenharmony_ci{ .mii; getf.sig r16=f73 1353e1051a39Sopenharmony_ci cmp.ltu p6,p0=r25,r24 1354e1051a39Sopenharmony_ci add r26=r26,r25 };; 1355e1051a39Sopenharmony_ci{ .mii; 1356e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1357e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,r25 1358e1051a39Sopenharmony_ci add r26=r26,carry2 };; 1359e1051a39Sopenharmony_ci{ .mii; getf.sig r17=f64 1360e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 1361e1051a39Sopenharmony_ci cmp.ltu p6,p0=r26,carry2 };; 1362e1051a39Sopenharmony_ci{ .mii; st8 [r33]=r26,16 1363e1051a39Sopenharmony_ci(p6) add carry1=1,carry1 } 1364e1051a39Sopenharmony_ci 1365e1051a39Sopenharmony_ci{ .mii; getf.sig r24=f74 1366e1051a39Sopenharmony_ci add r17=r17,r16 1367e1051a39Sopenharmony_ci mov carry2=0 };; 1368e1051a39Sopenharmony_ci{ .mii; cmp.ltu p7,p0=r17,r16 1369e1051a39Sopenharmony_ci add r17=r17,carry1 };; 1370e1051a39Sopenharmony_ci 1371e1051a39Sopenharmony_ci{ .mii; (p7) add carry2=1,carry2 1372e1051a39Sopenharmony_ci cmp.ltu p7,p0=r17,carry1};; 1373e1051a39Sopenharmony_ci{ .mii; st8 [r32]=r17,16 1374e1051a39Sopenharmony_ci (p7) add carry2=1,carry2 };; 1375e1051a39Sopenharmony_ci 1376e1051a39Sopenharmony_ci{ .mii; add r24=r24,carry2 };; 1377e1051a39Sopenharmony_ci{ .mii; st8 [r33]=r24 } 1378e1051a39Sopenharmony_ci 1379e1051a39Sopenharmony_ci{ .mib; rum 1<<5 // clear um.mfh 1380e1051a39Sopenharmony_ci br.ret.sptk.many b0 };; 1381e1051a39Sopenharmony_ci.endp bn_mul_comba4# 1382e1051a39Sopenharmony_ci#undef carry2 1383e1051a39Sopenharmony_ci#undef carry1 1384e1051a39Sopenharmony_ci#endif 1385e1051a39Sopenharmony_ci 1386e1051a39Sopenharmony_ci#if 1 1387e1051a39Sopenharmony_ci// 1388e1051a39Sopenharmony_ci// BN_ULONG bn_div_words(BN_ULONG h, BN_ULONG l, BN_ULONG d) 1389e1051a39Sopenharmony_ci// 1390e1051a39Sopenharmony_ci// In the nutshell it's a port of my MIPS III/IV implementation. 1391e1051a39Sopenharmony_ci// 1392e1051a39Sopenharmony_ci#define AT r14 1393e1051a39Sopenharmony_ci#define H r16 1394e1051a39Sopenharmony_ci#define HH r20 1395e1051a39Sopenharmony_ci#define L r17 1396e1051a39Sopenharmony_ci#define D r18 1397e1051a39Sopenharmony_ci#define DH r22 1398e1051a39Sopenharmony_ci#define I r21 1399e1051a39Sopenharmony_ci 1400e1051a39Sopenharmony_ci#if 0 1401e1051a39Sopenharmony_ci// Some preprocessors (most notably HP-UX) appear to be allergic to 1402e1051a39Sopenharmony_ci// macros enclosed to parenthesis [as these three were]. 1403e1051a39Sopenharmony_ci#define cont p16 1404e1051a39Sopenharmony_ci#define break p0 // p20 1405e1051a39Sopenharmony_ci#define equ p24 1406e1051a39Sopenharmony_ci#else 1407e1051a39Sopenharmony_cicont=p16 1408e1051a39Sopenharmony_cibreak=p0 1409e1051a39Sopenharmony_ciequ=p24 1410e1051a39Sopenharmony_ci#endif 1411e1051a39Sopenharmony_ci 1412e1051a39Sopenharmony_ci.global abort# 1413e1051a39Sopenharmony_ci.global bn_div_words# 1414e1051a39Sopenharmony_ci.proc bn_div_words# 1415e1051a39Sopenharmony_ci.align 64 1416e1051a39Sopenharmony_cibn_div_words: 1417e1051a39Sopenharmony_ci .prologue 1418e1051a39Sopenharmony_ci .save ar.pfs,r2 1419e1051a39Sopenharmony_ci{ .mii; alloc r2=ar.pfs,3,5,0,8 1420e1051a39Sopenharmony_ci .save b0,r3 1421e1051a39Sopenharmony_ci mov r3=b0 1422e1051a39Sopenharmony_ci .save pr,r10 1423e1051a39Sopenharmony_ci mov r10=pr };; 1424e1051a39Sopenharmony_ci{ .mmb; cmp.eq p6,p0=r34,r0 1425e1051a39Sopenharmony_ci mov r8=-1 1426e1051a39Sopenharmony_ci(p6) br.ret.spnt.many b0 };; 1427e1051a39Sopenharmony_ci 1428e1051a39Sopenharmony_ci .body 1429e1051a39Sopenharmony_ci{ .mii; mov H=r32 // save h 1430e1051a39Sopenharmony_ci mov ar.ec=0 // don't rotate at exit 1431e1051a39Sopenharmony_ci mov pr.rot=0 } 1432e1051a39Sopenharmony_ci{ .mii; mov L=r33 // save l 1433e1051a39Sopenharmony_ci mov r25=r0 // needed if abort is called on VMS 1434e1051a39Sopenharmony_ci mov r36=r0 };; 1435e1051a39Sopenharmony_ci 1436e1051a39Sopenharmony_ci.L_divw_shift: // -vv- note signed comparison 1437e1051a39Sopenharmony_ci{ .mfi; (p0) cmp.lt p16,p0=r0,r34 // d 1438e1051a39Sopenharmony_ci (p0) shladd r33=r34,1,r0 } 1439e1051a39Sopenharmony_ci{ .mfb; (p0) add r35=1,r36 1440e1051a39Sopenharmony_ci (p0) nop.f 0x0 1441e1051a39Sopenharmony_ci(p16) br.wtop.dpnt .L_divw_shift };; 1442e1051a39Sopenharmony_ci 1443e1051a39Sopenharmony_ci{ .mii; mov D=r34 1444e1051a39Sopenharmony_ci shr.u DH=r34,32 1445e1051a39Sopenharmony_ci sub r35=64,r36 };; 1446e1051a39Sopenharmony_ci{ .mii; setf.sig f7=DH 1447e1051a39Sopenharmony_ci shr.u AT=H,r35 1448e1051a39Sopenharmony_ci mov I=r36 };; 1449e1051a39Sopenharmony_ci{ .mib; cmp.ne p6,p0=r0,AT 1450e1051a39Sopenharmony_ci shl H=H,r36 1451e1051a39Sopenharmony_ci(p6) br.call.spnt.clr b0=abort };; // overflow, die... 1452e1051a39Sopenharmony_ci 1453e1051a39Sopenharmony_ci{ .mfi; fcvt.xuf.s1 f7=f7 1454e1051a39Sopenharmony_ci shr.u AT=L,r35 };; 1455e1051a39Sopenharmony_ci{ .mii; shl L=L,r36 1456e1051a39Sopenharmony_ci or H=H,AT };; 1457e1051a39Sopenharmony_ci 1458e1051a39Sopenharmony_ci{ .mii; nop.m 0x0 1459e1051a39Sopenharmony_ci cmp.leu p6,p0=D,H;; 1460e1051a39Sopenharmony_ci(p6) sub H=H,D } 1461e1051a39Sopenharmony_ci 1462e1051a39Sopenharmony_ci{ .mlx; setf.sig f14=D 1463e1051a39Sopenharmony_ci movl AT=0xffffffff };; 1464e1051a39Sopenharmony_ci/////////////////////////////////////////////////////////// 1465e1051a39Sopenharmony_ci{ .mii; setf.sig f6=H 1466e1051a39Sopenharmony_ci shr.u HH=H,32;; 1467e1051a39Sopenharmony_ci cmp.eq p6,p7=HH,DH };; 1468e1051a39Sopenharmony_ci{ .mfb; 1469e1051a39Sopenharmony_ci(p6) setf.sig f8=AT 1470e1051a39Sopenharmony_ci(p7) fcvt.xuf.s1 f6=f6 1471e1051a39Sopenharmony_ci(p7) br.call.sptk b6=.L_udiv64_32_b6 };; 1472e1051a39Sopenharmony_ci 1473e1051a39Sopenharmony_ci{ .mfi; getf.sig r33=f8 // q 1474e1051a39Sopenharmony_ci xmpy.lu f9=f8,f14 } 1475e1051a39Sopenharmony_ci{ .mfi; xmpy.hu f10=f8,f14 1476e1051a39Sopenharmony_ci shrp H=H,L,32 };; 1477e1051a39Sopenharmony_ci 1478e1051a39Sopenharmony_ci{ .mmi; getf.sig r35=f9 // tl 1479e1051a39Sopenharmony_ci getf.sig r31=f10 };; // th 1480e1051a39Sopenharmony_ci 1481e1051a39Sopenharmony_ci.L_divw_1st_iter: 1482e1051a39Sopenharmony_ci{ .mii; (p0) add r32=-1,r33 1483e1051a39Sopenharmony_ci (p0) cmp.eq equ,cont=HH,r31 };; 1484e1051a39Sopenharmony_ci{ .mii; (p0) cmp.ltu p8,p0=r35,D 1485e1051a39Sopenharmony_ci (p0) sub r34=r35,D 1486e1051a39Sopenharmony_ci (equ) cmp.leu break,cont=r35,H };; 1487e1051a39Sopenharmony_ci{ .mib; (cont) cmp.leu cont,break=HH,r31 1488e1051a39Sopenharmony_ci (p8) add r31=-1,r31 1489e1051a39Sopenharmony_ci(cont) br.wtop.spnt .L_divw_1st_iter };; 1490e1051a39Sopenharmony_ci/////////////////////////////////////////////////////////// 1491e1051a39Sopenharmony_ci{ .mii; sub H=H,r35 1492e1051a39Sopenharmony_ci shl r8=r33,32 1493e1051a39Sopenharmony_ci shl L=L,32 };; 1494e1051a39Sopenharmony_ci/////////////////////////////////////////////////////////// 1495e1051a39Sopenharmony_ci{ .mii; setf.sig f6=H 1496e1051a39Sopenharmony_ci shr.u HH=H,32;; 1497e1051a39Sopenharmony_ci cmp.eq p6,p7=HH,DH };; 1498e1051a39Sopenharmony_ci{ .mfb; 1499e1051a39Sopenharmony_ci(p6) setf.sig f8=AT 1500e1051a39Sopenharmony_ci(p7) fcvt.xuf.s1 f6=f6 1501e1051a39Sopenharmony_ci(p7) br.call.sptk b6=.L_udiv64_32_b6 };; 1502e1051a39Sopenharmony_ci 1503e1051a39Sopenharmony_ci{ .mfi; getf.sig r33=f8 // q 1504e1051a39Sopenharmony_ci xmpy.lu f9=f8,f14 } 1505e1051a39Sopenharmony_ci{ .mfi; xmpy.hu f10=f8,f14 1506e1051a39Sopenharmony_ci shrp H=H,L,32 };; 1507e1051a39Sopenharmony_ci 1508e1051a39Sopenharmony_ci{ .mmi; getf.sig r35=f9 // tl 1509e1051a39Sopenharmony_ci getf.sig r31=f10 };; // th 1510e1051a39Sopenharmony_ci 1511e1051a39Sopenharmony_ci.L_divw_2nd_iter: 1512e1051a39Sopenharmony_ci{ .mii; (p0) add r32=-1,r33 1513e1051a39Sopenharmony_ci (p0) cmp.eq equ,cont=HH,r31 };; 1514e1051a39Sopenharmony_ci{ .mii; (p0) cmp.ltu p8,p0=r35,D 1515e1051a39Sopenharmony_ci (p0) sub r34=r35,D 1516e1051a39Sopenharmony_ci (equ) cmp.leu break,cont=r35,H };; 1517e1051a39Sopenharmony_ci{ .mib; (cont) cmp.leu cont,break=HH,r31 1518e1051a39Sopenharmony_ci (p8) add r31=-1,r31 1519e1051a39Sopenharmony_ci(cont) br.wtop.spnt .L_divw_2nd_iter };; 1520e1051a39Sopenharmony_ci/////////////////////////////////////////////////////////// 1521e1051a39Sopenharmony_ci{ .mii; sub H=H,r35 1522e1051a39Sopenharmony_ci or r8=r8,r33 1523e1051a39Sopenharmony_ci mov ar.pfs=r2 };; 1524e1051a39Sopenharmony_ci{ .mii; shr.u r9=H,I // remainder if anybody wants it 1525e1051a39Sopenharmony_ci mov pr=r10,0x1ffff } 1526e1051a39Sopenharmony_ci{ .mfb; br.ret.sptk.many b0 };; 1527e1051a39Sopenharmony_ci 1528e1051a39Sopenharmony_ci// Unsigned 64 by 32 (well, by 64 for the moment) bit integer division 1529e1051a39Sopenharmony_ci// procedure. 1530e1051a39Sopenharmony_ci// 1531e1051a39Sopenharmony_ci// inputs: f6 = (double)a, f7 = (double)b 1532e1051a39Sopenharmony_ci// output: f8 = (int)(a/b) 1533e1051a39Sopenharmony_ci// clobbered: f8,f9,f10,f11,pred 1534e1051a39Sopenharmony_cipred=p15 1535e1051a39Sopenharmony_ci// This snippet is based on text found in the "Divide, Square 1536e1051a39Sopenharmony_ci// Root and Remainder" section at 1537e1051a39Sopenharmony_ci// http://www.intel.com/software/products/opensource/libraries/num.htm. 1538e1051a39Sopenharmony_ci// Yes, I admit that the referred code was used as template, 1539e1051a39Sopenharmony_ci// but after I realized that there hardly is any other instruction 1540e1051a39Sopenharmony_ci// sequence which would perform this operation. I mean I figure that 1541e1051a39Sopenharmony_ci// any independent attempt to implement high-performance division 1542e1051a39Sopenharmony_ci// will result in code virtually identical to the Intel code. It 1543e1051a39Sopenharmony_ci// should be noted though that below division kernel is 1 cycle 1544e1051a39Sopenharmony_ci// faster than Intel one (note commented splits:-), not to mention 1545e1051a39Sopenharmony_ci// original prologue (rather lack of one) and epilogue. 1546e1051a39Sopenharmony_ci.align 32 1547e1051a39Sopenharmony_ci.skip 16 1548e1051a39Sopenharmony_ci.L_udiv64_32_b6: 1549e1051a39Sopenharmony_ci frcpa.s1 f8,pred=f6,f7;; // [0] y0 = 1 / b 1550e1051a39Sopenharmony_ci 1551e1051a39Sopenharmony_ci(pred) fnma.s1 f9=f7,f8,f1 // [5] e0 = 1 - b * y0 1552e1051a39Sopenharmony_ci(pred) fmpy.s1 f10=f6,f8;; // [5] q0 = a * y0 1553e1051a39Sopenharmony_ci(pred) fmpy.s1 f11=f9,f9 // [10] e1 = e0 * e0 1554e1051a39Sopenharmony_ci(pred) fma.s1 f10=f9,f10,f10;; // [10] q1 = q0 + e0 * q0 1555e1051a39Sopenharmony_ci(pred) fma.s1 f8=f9,f8,f8 //;; // [15] y1 = y0 + e0 * y0 1556e1051a39Sopenharmony_ci(pred) fma.s1 f9=f11,f10,f10;; // [15] q2 = q1 + e1 * q1 1557e1051a39Sopenharmony_ci(pred) fma.s1 f8=f11,f8,f8 //;; // [20] y2 = y1 + e1 * y1 1558e1051a39Sopenharmony_ci(pred) fnma.s1 f10=f7,f9,f6;; // [20] r2 = a - b * q2 1559e1051a39Sopenharmony_ci(pred) fma.s1 f8=f10,f8,f9;; // [25] q3 = q2 + r2 * y2 1560e1051a39Sopenharmony_ci 1561e1051a39Sopenharmony_ci fcvt.fxu.trunc.s1 f8=f8 // [30] q = trunc(q3) 1562e1051a39Sopenharmony_ci br.ret.sptk.many b6;; 1563e1051a39Sopenharmony_ci.endp bn_div_words# 1564e1051a39Sopenharmony_ci#endif 1565