162306a36Sopenharmony_ci========================== 262306a36Sopenharmony_ciNAND Error-correction Code 362306a36Sopenharmony_ci========================== 462306a36Sopenharmony_ci 562306a36Sopenharmony_ciIntroduction 662306a36Sopenharmony_ci============ 762306a36Sopenharmony_ci 862306a36Sopenharmony_ciHaving looked at the linux mtd/nand Hamming software ECC engine driver 962306a36Sopenharmony_ciI felt there was room for optimisation. I bashed the code for a few hours 1062306a36Sopenharmony_ciperforming tricks like table lookup removing superfluous code etc. 1162306a36Sopenharmony_ciAfter that the speed was increased by 35-40%. 1262306a36Sopenharmony_ciStill I was not too happy as I felt there was additional room for improvement. 1362306a36Sopenharmony_ci 1462306a36Sopenharmony_ciBad! I was hooked. 1562306a36Sopenharmony_ciI decided to annotate my steps in this file. Perhaps it is useful to someone 1662306a36Sopenharmony_cior someone learns something from it. 1762306a36Sopenharmony_ci 1862306a36Sopenharmony_ci 1962306a36Sopenharmony_ciThe problem 2062306a36Sopenharmony_ci=========== 2162306a36Sopenharmony_ci 2262306a36Sopenharmony_ciNAND flash (at least SLC one) typically has sectors of 256 bytes. 2362306a36Sopenharmony_ciHowever NAND flash is not extremely reliable so some error detection 2462306a36Sopenharmony_ci(and sometimes correction) is needed. 2562306a36Sopenharmony_ci 2662306a36Sopenharmony_ciThis is done by means of a Hamming code. I'll try to explain it in 2762306a36Sopenharmony_cilaymans terms (and apologies to all the pro's in the field in case I do 2862306a36Sopenharmony_cinot use the right terminology, my coding theory class was almost 30 2962306a36Sopenharmony_ciyears ago, and I must admit it was not one of my favourites). 3062306a36Sopenharmony_ci 3162306a36Sopenharmony_ciAs I said before the ecc calculation is performed on sectors of 256 3262306a36Sopenharmony_cibytes. This is done by calculating several parity bits over the rows and 3362306a36Sopenharmony_cicolumns. The parity used is even parity which means that the parity bit = 1 3462306a36Sopenharmony_ciif the data over which the parity is calculated is 1 and the parity bit = 0 3562306a36Sopenharmony_ciif the data over which the parity is calculated is 0. So the total 3662306a36Sopenharmony_cinumber of bits over the data over which the parity is calculated + the 3762306a36Sopenharmony_ciparity bit is even. (see wikipedia if you can't follow this). 3862306a36Sopenharmony_ciParity is often calculated by means of an exclusive or operation, 3962306a36Sopenharmony_cisometimes also referred to as xor. In C the operator for xor is ^ 4062306a36Sopenharmony_ci 4162306a36Sopenharmony_ciBack to ecc. 4262306a36Sopenharmony_ciLet's give a small figure: 4362306a36Sopenharmony_ci 4462306a36Sopenharmony_ci========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== 4562306a36Sopenharmony_cibyte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 4662306a36Sopenharmony_cibyte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 4762306a36Sopenharmony_cibyte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 4862306a36Sopenharmony_cibyte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 4962306a36Sopenharmony_cibyte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 5062306a36Sopenharmony_ci... 5162306a36Sopenharmony_cibyte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 5262306a36Sopenharmony_cibyte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 5362306a36Sopenharmony_ci cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 5462306a36Sopenharmony_ci cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 5562306a36Sopenharmony_ci cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 5662306a36Sopenharmony_ci========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== 5762306a36Sopenharmony_ci 5862306a36Sopenharmony_ciThis figure represents a sector of 256 bytes. 5962306a36Sopenharmony_cicp is my abbreviation for column parity, rp for row parity. 6062306a36Sopenharmony_ci 6162306a36Sopenharmony_ciLet's start to explain column parity. 6262306a36Sopenharmony_ci 6362306a36Sopenharmony_ci- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. 6462306a36Sopenharmony_ci 6562306a36Sopenharmony_ci so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. 6662306a36Sopenharmony_ci 6762306a36Sopenharmony_ciSimilarly cp1 is the sum of all bit1, bit3, bit5 and bit7. 6862306a36Sopenharmony_ci 6962306a36Sopenharmony_ci- cp2 is the parity over bit0, bit1, bit4 and bit5 7062306a36Sopenharmony_ci- cp3 is the parity over bit2, bit3, bit6 and bit7. 7162306a36Sopenharmony_ci- cp4 is the parity over bit0, bit1, bit2 and bit3. 7262306a36Sopenharmony_ci- cp5 is the parity over bit4, bit5, bit6 and bit7. 7362306a36Sopenharmony_ci 7462306a36Sopenharmony_ciNote that each of cp0 .. cp5 is exactly one bit. 7562306a36Sopenharmony_ci 7662306a36Sopenharmony_ciRow parity actually works almost the same. 7762306a36Sopenharmony_ci 7862306a36Sopenharmony_ci- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) 7962306a36Sopenharmony_ci- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) 8062306a36Sopenharmony_ci- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... 8162306a36Sopenharmony_ci (so handle two bytes, then skip 2 bytes). 8262306a36Sopenharmony_ci- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) 8362306a36Sopenharmony_ci- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. 8462306a36Sopenharmony_ci 8562306a36Sopenharmony_ci so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) 8662306a36Sopenharmony_ci- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. 8762306a36Sopenharmony_ci 8862306a36Sopenharmony_ciThe story now becomes quite boring. I guess you get the idea. 8962306a36Sopenharmony_ci 9062306a36Sopenharmony_ci- rp6 covers 8 bytes then skips 8 etc 9162306a36Sopenharmony_ci- rp7 skips 8 bytes then covers 8 etc 9262306a36Sopenharmony_ci- rp8 covers 16 bytes then skips 16 etc 9362306a36Sopenharmony_ci- rp9 skips 16 bytes then covers 16 etc 9462306a36Sopenharmony_ci- rp10 covers 32 bytes then skips 32 etc 9562306a36Sopenharmony_ci- rp11 skips 32 bytes then covers 32 etc 9662306a36Sopenharmony_ci- rp12 covers 64 bytes then skips 64 etc 9762306a36Sopenharmony_ci- rp13 skips 64 bytes then covers 64 etc 9862306a36Sopenharmony_ci- rp14 covers 128 bytes then skips 128 9962306a36Sopenharmony_ci- rp15 skips 128 bytes then covers 128 10062306a36Sopenharmony_ci 10162306a36Sopenharmony_ciIn the end the parity bits are grouped together in three bytes as 10262306a36Sopenharmony_cifollows: 10362306a36Sopenharmony_ci 10462306a36Sopenharmony_ci===== ===== ===== ===== ===== ===== ===== ===== ===== 10562306a36Sopenharmony_ciECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 10662306a36Sopenharmony_ci===== ===== ===== ===== ===== ===== ===== ===== ===== 10762306a36Sopenharmony_ciECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 10862306a36Sopenharmony_ciECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 10962306a36Sopenharmony_ciECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 11062306a36Sopenharmony_ci===== ===== ===== ===== ===== ===== ===== ===== ===== 11162306a36Sopenharmony_ci 11262306a36Sopenharmony_ciI detected after writing this that ST application note AN1823 11362306a36Sopenharmony_ci(http://www.st.com/stonline/) gives a much 11462306a36Sopenharmony_cinicer picture.(but they use line parity as term where I use row parity) 11562306a36Sopenharmony_ciOh well, I'm graphically challenged, so suffer with me for a moment :-) 11662306a36Sopenharmony_ci 11762306a36Sopenharmony_ciAnd I could not reuse the ST picture anyway for copyright reasons. 11862306a36Sopenharmony_ci 11962306a36Sopenharmony_ci 12062306a36Sopenharmony_ciAttempt 0 12162306a36Sopenharmony_ci========= 12262306a36Sopenharmony_ci 12362306a36Sopenharmony_ciImplementing the parity calculation is pretty simple. 12462306a36Sopenharmony_ciIn C pseudocode:: 12562306a36Sopenharmony_ci 12662306a36Sopenharmony_ci for (i = 0; i < 256; i++) 12762306a36Sopenharmony_ci { 12862306a36Sopenharmony_ci if (i & 0x01) 12962306a36Sopenharmony_ci rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; 13062306a36Sopenharmony_ci else 13162306a36Sopenharmony_ci rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; 13262306a36Sopenharmony_ci if (i & 0x02) 13362306a36Sopenharmony_ci rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; 13462306a36Sopenharmony_ci else 13562306a36Sopenharmony_ci rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; 13662306a36Sopenharmony_ci if (i & 0x04) 13762306a36Sopenharmony_ci rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; 13862306a36Sopenharmony_ci else 13962306a36Sopenharmony_ci rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; 14062306a36Sopenharmony_ci if (i & 0x08) 14162306a36Sopenharmony_ci rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; 14262306a36Sopenharmony_ci else 14362306a36Sopenharmony_ci rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; 14462306a36Sopenharmony_ci if (i & 0x10) 14562306a36Sopenharmony_ci rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; 14662306a36Sopenharmony_ci else 14762306a36Sopenharmony_ci rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; 14862306a36Sopenharmony_ci if (i & 0x20) 14962306a36Sopenharmony_ci rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; 15062306a36Sopenharmony_ci else 15162306a36Sopenharmony_ci rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; 15262306a36Sopenharmony_ci if (i & 0x40) 15362306a36Sopenharmony_ci rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; 15462306a36Sopenharmony_ci else 15562306a36Sopenharmony_ci rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; 15662306a36Sopenharmony_ci if (i & 0x80) 15762306a36Sopenharmony_ci rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; 15862306a36Sopenharmony_ci else 15962306a36Sopenharmony_ci rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; 16062306a36Sopenharmony_ci cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; 16162306a36Sopenharmony_ci cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; 16262306a36Sopenharmony_ci cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; 16362306a36Sopenharmony_ci cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 16462306a36Sopenharmony_ci cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 16562306a36Sopenharmony_ci cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 16662306a36Sopenharmony_ci } 16762306a36Sopenharmony_ci 16862306a36Sopenharmony_ci 16962306a36Sopenharmony_ciAnalysis 0 17062306a36Sopenharmony_ci========== 17162306a36Sopenharmony_ci 17262306a36Sopenharmony_ciC does have bitwise operators but not really operators to do the above 17362306a36Sopenharmony_ciefficiently (and most hardware has no such instructions either). 17462306a36Sopenharmony_ciTherefore without implementing this it was clear that the code above was 17562306a36Sopenharmony_cinot going to bring me a Nobel prize :-) 17662306a36Sopenharmony_ci 17762306a36Sopenharmony_ciFortunately the exclusive or operation is commutative, so we can combine 17862306a36Sopenharmony_cithe values in any order. So instead of calculating all the bits 17962306a36Sopenharmony_ciindividually, let us try to rearrange things. 18062306a36Sopenharmony_ciFor the column parity this is easy. We can just xor the bytes and in the 18162306a36Sopenharmony_ciend filter out the relevant bits. This is pretty nice as it will bring 18262306a36Sopenharmony_ciall cp calculation out of the for loop. 18362306a36Sopenharmony_ci 18462306a36Sopenharmony_ciSimilarly we can first xor the bytes for the various rows. 18562306a36Sopenharmony_ciThis leads to: 18662306a36Sopenharmony_ci 18762306a36Sopenharmony_ci 18862306a36Sopenharmony_ciAttempt 1 18962306a36Sopenharmony_ci========= 19062306a36Sopenharmony_ci 19162306a36Sopenharmony_ci:: 19262306a36Sopenharmony_ci 19362306a36Sopenharmony_ci const char parity[256] = { 19462306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 19562306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 19662306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 19762306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 19862306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 19962306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 20062306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 20162306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 20262306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 20362306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 20462306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 20562306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 20662306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 20762306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 20862306a36Sopenharmony_ci 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 20962306a36Sopenharmony_ci 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 21062306a36Sopenharmony_ci }; 21162306a36Sopenharmony_ci 21262306a36Sopenharmony_ci void ecc1(const unsigned char *buf, unsigned char *code) 21362306a36Sopenharmony_ci { 21462306a36Sopenharmony_ci int i; 21562306a36Sopenharmony_ci const unsigned char *bp = buf; 21662306a36Sopenharmony_ci unsigned char cur; 21762306a36Sopenharmony_ci unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; 21862306a36Sopenharmony_ci unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; 21962306a36Sopenharmony_ci unsigned char par; 22062306a36Sopenharmony_ci 22162306a36Sopenharmony_ci par = 0; 22262306a36Sopenharmony_ci rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; 22362306a36Sopenharmony_ci rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; 22462306a36Sopenharmony_ci rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; 22562306a36Sopenharmony_ci rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; 22662306a36Sopenharmony_ci 22762306a36Sopenharmony_ci for (i = 0; i < 256; i++) 22862306a36Sopenharmony_ci { 22962306a36Sopenharmony_ci cur = *bp++; 23062306a36Sopenharmony_ci par ^= cur; 23162306a36Sopenharmony_ci if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; 23262306a36Sopenharmony_ci if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; 23362306a36Sopenharmony_ci if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; 23462306a36Sopenharmony_ci if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; 23562306a36Sopenharmony_ci if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; 23662306a36Sopenharmony_ci if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; 23762306a36Sopenharmony_ci if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; 23862306a36Sopenharmony_ci if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; 23962306a36Sopenharmony_ci } 24062306a36Sopenharmony_ci code[0] = 24162306a36Sopenharmony_ci (parity[rp7] << 7) | 24262306a36Sopenharmony_ci (parity[rp6] << 6) | 24362306a36Sopenharmony_ci (parity[rp5] << 5) | 24462306a36Sopenharmony_ci (parity[rp4] << 4) | 24562306a36Sopenharmony_ci (parity[rp3] << 3) | 24662306a36Sopenharmony_ci (parity[rp2] << 2) | 24762306a36Sopenharmony_ci (parity[rp1] << 1) | 24862306a36Sopenharmony_ci (parity[rp0]); 24962306a36Sopenharmony_ci code[1] = 25062306a36Sopenharmony_ci (parity[rp15] << 7) | 25162306a36Sopenharmony_ci (parity[rp14] << 6) | 25262306a36Sopenharmony_ci (parity[rp13] << 5) | 25362306a36Sopenharmony_ci (parity[rp12] << 4) | 25462306a36Sopenharmony_ci (parity[rp11] << 3) | 25562306a36Sopenharmony_ci (parity[rp10] << 2) | 25662306a36Sopenharmony_ci (parity[rp9] << 1) | 25762306a36Sopenharmony_ci (parity[rp8]); 25862306a36Sopenharmony_ci code[2] = 25962306a36Sopenharmony_ci (parity[par & 0xf0] << 7) | 26062306a36Sopenharmony_ci (parity[par & 0x0f] << 6) | 26162306a36Sopenharmony_ci (parity[par & 0xcc] << 5) | 26262306a36Sopenharmony_ci (parity[par & 0x33] << 4) | 26362306a36Sopenharmony_ci (parity[par & 0xaa] << 3) | 26462306a36Sopenharmony_ci (parity[par & 0x55] << 2); 26562306a36Sopenharmony_ci code[0] = ~code[0]; 26662306a36Sopenharmony_ci code[1] = ~code[1]; 26762306a36Sopenharmony_ci code[2] = ~code[2]; 26862306a36Sopenharmony_ci } 26962306a36Sopenharmony_ci 27062306a36Sopenharmony_ciStill pretty straightforward. The last three invert statements are there to 27162306a36Sopenharmony_cigive a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash 27262306a36Sopenharmony_ciall data is 0xff, so the checksum then matches. 27362306a36Sopenharmony_ci 27462306a36Sopenharmony_ciI also introduced the parity lookup. I expected this to be the fastest 27562306a36Sopenharmony_ciway to calculate the parity, but I will investigate alternatives later 27662306a36Sopenharmony_cion. 27762306a36Sopenharmony_ci 27862306a36Sopenharmony_ci 27962306a36Sopenharmony_ciAnalysis 1 28062306a36Sopenharmony_ci========== 28162306a36Sopenharmony_ci 28262306a36Sopenharmony_ciThe code works, but is not terribly efficient. On my system it took 28362306a36Sopenharmony_cialmost 4 times as much time as the linux driver code. But hey, if it was 28462306a36Sopenharmony_ci*that* easy this would have been done long before. 28562306a36Sopenharmony_ciNo pain. no gain. 28662306a36Sopenharmony_ci 28762306a36Sopenharmony_ciFortunately there is plenty of room for improvement. 28862306a36Sopenharmony_ci 28962306a36Sopenharmony_ciIn step 1 we moved from bit-wise calculation to byte-wise calculation. 29062306a36Sopenharmony_ciHowever in C we can also use the unsigned long data type and virtually 29162306a36Sopenharmony_cievery modern microprocessor supports 32 bit operations, so why not try 29262306a36Sopenharmony_cito write our code in such a way that we process data in 32 bit chunks. 29362306a36Sopenharmony_ci 29462306a36Sopenharmony_ciOf course this means some modification as the row parity is byte by 29562306a36Sopenharmony_cibyte. A quick analysis: 29662306a36Sopenharmony_cifor the column parity we use the par variable. When extending to 32 bits 29762306a36Sopenharmony_ciwe can in the end easily calculate rp0 and rp1 from it. 29862306a36Sopenharmony_ci(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 29962306a36Sopenharmony_cirespectively, from MSB to LSB) 30062306a36Sopenharmony_cialso rp2 and rp3 can be easily retrieved from par as rp3 covers the 30162306a36Sopenharmony_cifirst two MSBs and rp2 covers the last two LSBs. 30262306a36Sopenharmony_ci 30362306a36Sopenharmony_ciNote that of course now the loop is executed only 64 times (256/4). 30462306a36Sopenharmony_ciAnd note that care must taken wrt byte ordering. The way bytes are 30562306a36Sopenharmony_ciordered in a long is machine dependent, and might affect us. 30662306a36Sopenharmony_ciAnyway, if there is an issue: this code is developed on x86 (to be 30762306a36Sopenharmony_ciprecise: a DELL PC with a D920 Intel CPU) 30862306a36Sopenharmony_ci 30962306a36Sopenharmony_ciAnd of course the performance might depend on alignment, but I expect 31062306a36Sopenharmony_cithat the I/O buffers in the nand driver are aligned properly (and 31162306a36Sopenharmony_ciotherwise that should be fixed to get maximum performance). 31262306a36Sopenharmony_ci 31362306a36Sopenharmony_ciLet's give it a try... 31462306a36Sopenharmony_ci 31562306a36Sopenharmony_ci 31662306a36Sopenharmony_ciAttempt 2 31762306a36Sopenharmony_ci========= 31862306a36Sopenharmony_ci 31962306a36Sopenharmony_ci:: 32062306a36Sopenharmony_ci 32162306a36Sopenharmony_ci extern const char parity[256]; 32262306a36Sopenharmony_ci 32362306a36Sopenharmony_ci void ecc2(const unsigned char *buf, unsigned char *code) 32462306a36Sopenharmony_ci { 32562306a36Sopenharmony_ci int i; 32662306a36Sopenharmony_ci const unsigned long *bp = (unsigned long *)buf; 32762306a36Sopenharmony_ci unsigned long cur; 32862306a36Sopenharmony_ci unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; 32962306a36Sopenharmony_ci unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; 33062306a36Sopenharmony_ci unsigned long par; 33162306a36Sopenharmony_ci 33262306a36Sopenharmony_ci par = 0; 33362306a36Sopenharmony_ci rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; 33462306a36Sopenharmony_ci rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; 33562306a36Sopenharmony_ci rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; 33662306a36Sopenharmony_ci rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; 33762306a36Sopenharmony_ci 33862306a36Sopenharmony_ci for (i = 0; i < 64; i++) 33962306a36Sopenharmony_ci { 34062306a36Sopenharmony_ci cur = *bp++; 34162306a36Sopenharmony_ci par ^= cur; 34262306a36Sopenharmony_ci if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; 34362306a36Sopenharmony_ci if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; 34462306a36Sopenharmony_ci if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; 34562306a36Sopenharmony_ci if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; 34662306a36Sopenharmony_ci if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; 34762306a36Sopenharmony_ci if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; 34862306a36Sopenharmony_ci } 34962306a36Sopenharmony_ci /* 35062306a36Sopenharmony_ci we need to adapt the code generation for the fact that rp vars are now 35162306a36Sopenharmony_ci long; also the column parity calculation needs to be changed. 35262306a36Sopenharmony_ci we'll bring rp4 to 15 back to single byte entities by shifting and 35362306a36Sopenharmony_ci xoring 35462306a36Sopenharmony_ci */ 35562306a36Sopenharmony_ci rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; 35662306a36Sopenharmony_ci rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; 35762306a36Sopenharmony_ci rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; 35862306a36Sopenharmony_ci rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; 35962306a36Sopenharmony_ci rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; 36062306a36Sopenharmony_ci rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; 36162306a36Sopenharmony_ci rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; 36262306a36Sopenharmony_ci rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; 36362306a36Sopenharmony_ci rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; 36462306a36Sopenharmony_ci rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; 36562306a36Sopenharmony_ci rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; 36662306a36Sopenharmony_ci rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; 36762306a36Sopenharmony_ci rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; 36862306a36Sopenharmony_ci rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; 36962306a36Sopenharmony_ci par ^= (par >> 16); 37062306a36Sopenharmony_ci rp1 = (par >> 8); rp1 &= 0xff; 37162306a36Sopenharmony_ci rp0 = (par & 0xff); 37262306a36Sopenharmony_ci par ^= (par >> 8); par &= 0xff; 37362306a36Sopenharmony_ci 37462306a36Sopenharmony_ci code[0] = 37562306a36Sopenharmony_ci (parity[rp7] << 7) | 37662306a36Sopenharmony_ci (parity[rp6] << 6) | 37762306a36Sopenharmony_ci (parity[rp5] << 5) | 37862306a36Sopenharmony_ci (parity[rp4] << 4) | 37962306a36Sopenharmony_ci (parity[rp3] << 3) | 38062306a36Sopenharmony_ci (parity[rp2] << 2) | 38162306a36Sopenharmony_ci (parity[rp1] << 1) | 38262306a36Sopenharmony_ci (parity[rp0]); 38362306a36Sopenharmony_ci code[1] = 38462306a36Sopenharmony_ci (parity[rp15] << 7) | 38562306a36Sopenharmony_ci (parity[rp14] << 6) | 38662306a36Sopenharmony_ci (parity[rp13] << 5) | 38762306a36Sopenharmony_ci (parity[rp12] << 4) | 38862306a36Sopenharmony_ci (parity[rp11] << 3) | 38962306a36Sopenharmony_ci (parity[rp10] << 2) | 39062306a36Sopenharmony_ci (parity[rp9] << 1) | 39162306a36Sopenharmony_ci (parity[rp8]); 39262306a36Sopenharmony_ci code[2] = 39362306a36Sopenharmony_ci (parity[par & 0xf0] << 7) | 39462306a36Sopenharmony_ci (parity[par & 0x0f] << 6) | 39562306a36Sopenharmony_ci (parity[par & 0xcc] << 5) | 39662306a36Sopenharmony_ci (parity[par & 0x33] << 4) | 39762306a36Sopenharmony_ci (parity[par & 0xaa] << 3) | 39862306a36Sopenharmony_ci (parity[par & 0x55] << 2); 39962306a36Sopenharmony_ci code[0] = ~code[0]; 40062306a36Sopenharmony_ci code[1] = ~code[1]; 40162306a36Sopenharmony_ci code[2] = ~code[2]; 40262306a36Sopenharmony_ci } 40362306a36Sopenharmony_ci 40462306a36Sopenharmony_ciThe parity array is not shown any more. Note also that for these 40562306a36Sopenharmony_ciexamples I kinda deviated from my regular programming style by allowing 40662306a36Sopenharmony_cimultiple statements on a line, not using { } in then and else blocks 40762306a36Sopenharmony_ciwith only a single statement and by using operators like ^= 40862306a36Sopenharmony_ci 40962306a36Sopenharmony_ci 41062306a36Sopenharmony_ciAnalysis 2 41162306a36Sopenharmony_ci========== 41262306a36Sopenharmony_ci 41362306a36Sopenharmony_ciThe code (of course) works, and hurray: we are a little bit faster than 41462306a36Sopenharmony_cithe linux driver code (about 15%). But wait, don't cheer too quickly. 41562306a36Sopenharmony_ciThere is more to be gained. 41662306a36Sopenharmony_ciIf we look at e.g. rp14 and rp15 we see that we either xor our data with 41762306a36Sopenharmony_cirp14 or with rp15. However we also have par which goes over all data. 41862306a36Sopenharmony_ciThis means there is no need to calculate rp14 as it can be calculated from 41962306a36Sopenharmony_cirp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; 42062306a36Sopenharmony_ci(or if desired we can avoid calculating rp15 and calculate it from 42162306a36Sopenharmony_cirp14). That is why some places refer to inverse parity. 42262306a36Sopenharmony_ciOf course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. 42362306a36Sopenharmony_ciEffectively this means we can eliminate the else clause from the if 42462306a36Sopenharmony_cistatements. Also we can optimise the calculation in the end a little bit 42562306a36Sopenharmony_ciby going from long to byte first. Actually we can even avoid the table 42662306a36Sopenharmony_cilookups 42762306a36Sopenharmony_ci 42862306a36Sopenharmony_ciAttempt 3 42962306a36Sopenharmony_ci========= 43062306a36Sopenharmony_ci 43162306a36Sopenharmony_ciOdd replaced:: 43262306a36Sopenharmony_ci 43362306a36Sopenharmony_ci if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; 43462306a36Sopenharmony_ci if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; 43562306a36Sopenharmony_ci if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; 43662306a36Sopenharmony_ci if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; 43762306a36Sopenharmony_ci if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; 43862306a36Sopenharmony_ci if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; 43962306a36Sopenharmony_ci 44062306a36Sopenharmony_ciwith:: 44162306a36Sopenharmony_ci 44262306a36Sopenharmony_ci if (i & 0x01) rp5 ^= cur; 44362306a36Sopenharmony_ci if (i & 0x02) rp7 ^= cur; 44462306a36Sopenharmony_ci if (i & 0x04) rp9 ^= cur; 44562306a36Sopenharmony_ci if (i & 0x08) rp11 ^= cur; 44662306a36Sopenharmony_ci if (i & 0x10) rp13 ^= cur; 44762306a36Sopenharmony_ci if (i & 0x20) rp15 ^= cur; 44862306a36Sopenharmony_ci 44962306a36Sopenharmony_ciand outside the loop added:: 45062306a36Sopenharmony_ci 45162306a36Sopenharmony_ci rp4 = par ^ rp5; 45262306a36Sopenharmony_ci rp6 = par ^ rp7; 45362306a36Sopenharmony_ci rp8 = par ^ rp9; 45462306a36Sopenharmony_ci rp10 = par ^ rp11; 45562306a36Sopenharmony_ci rp12 = par ^ rp13; 45662306a36Sopenharmony_ci rp14 = par ^ rp15; 45762306a36Sopenharmony_ci 45862306a36Sopenharmony_ciAnd after that the code takes about 30% more time, although the number of 45962306a36Sopenharmony_cistatements is reduced. This is also reflected in the assembly code. 46062306a36Sopenharmony_ci 46162306a36Sopenharmony_ci 46262306a36Sopenharmony_ciAnalysis 3 46362306a36Sopenharmony_ci========== 46462306a36Sopenharmony_ci 46562306a36Sopenharmony_ciVery weird. Guess it has to do with caching or instruction parallellism 46662306a36Sopenharmony_cior so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting 46762306a36Sopenharmony_ciobservation was that this one is only 30% slower (according to time) 46862306a36Sopenharmony_ciexecuting the code as my 3Ghz D920 processor. 46962306a36Sopenharmony_ci 47062306a36Sopenharmony_ciWell, it was expected not to be easy so maybe instead move to a 47162306a36Sopenharmony_cidifferent track: let's move back to the code from attempt2 and do some 47262306a36Sopenharmony_ciloop unrolling. This will eliminate a few if statements. I'll try 47362306a36Sopenharmony_cidifferent amounts of unrolling to see what works best. 47462306a36Sopenharmony_ci 47562306a36Sopenharmony_ci 47662306a36Sopenharmony_ciAttempt 4 47762306a36Sopenharmony_ci========= 47862306a36Sopenharmony_ci 47962306a36Sopenharmony_ciUnrolled the loop 1, 2, 3 and 4 times. 48062306a36Sopenharmony_ciFor 4 the code starts with:: 48162306a36Sopenharmony_ci 48262306a36Sopenharmony_ci for (i = 0; i < 4; i++) 48362306a36Sopenharmony_ci { 48462306a36Sopenharmony_ci cur = *bp++; 48562306a36Sopenharmony_ci par ^= cur; 48662306a36Sopenharmony_ci rp4 ^= cur; 48762306a36Sopenharmony_ci rp6 ^= cur; 48862306a36Sopenharmony_ci rp8 ^= cur; 48962306a36Sopenharmony_ci rp10 ^= cur; 49062306a36Sopenharmony_ci if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; 49162306a36Sopenharmony_ci if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; 49262306a36Sopenharmony_ci cur = *bp++; 49362306a36Sopenharmony_ci par ^= cur; 49462306a36Sopenharmony_ci rp5 ^= cur; 49562306a36Sopenharmony_ci rp6 ^= cur; 49662306a36Sopenharmony_ci ... 49762306a36Sopenharmony_ci 49862306a36Sopenharmony_ci 49962306a36Sopenharmony_ciAnalysis 4 50062306a36Sopenharmony_ci========== 50162306a36Sopenharmony_ci 50262306a36Sopenharmony_ciUnrolling once gains about 15% 50362306a36Sopenharmony_ci 50462306a36Sopenharmony_ciUnrolling twice keeps the gain at about 15% 50562306a36Sopenharmony_ci 50662306a36Sopenharmony_ciUnrolling three times gives a gain of 30% compared to attempt 2. 50762306a36Sopenharmony_ci 50862306a36Sopenharmony_ciUnrolling four times gives a marginal improvement compared to unrolling 50962306a36Sopenharmony_cithree times. 51062306a36Sopenharmony_ci 51162306a36Sopenharmony_ciI decided to proceed with a four time unrolled loop anyway. It was my gut 51262306a36Sopenharmony_cifeeling that in the next steps I would obtain additional gain from it. 51362306a36Sopenharmony_ci 51462306a36Sopenharmony_ciThe next step was triggered by the fact that par contains the xor of all 51562306a36Sopenharmony_cibytes and rp4 and rp5 each contain the xor of half of the bytes. 51662306a36Sopenharmony_ciSo in effect par = rp4 ^ rp5. But as xor is commutative we can also say 51762306a36Sopenharmony_cithat rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can 51862306a36Sopenharmony_cieliminate rp5 (or rp4, but I already foresaw another optimisation). 51962306a36Sopenharmony_ciThe same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. 52062306a36Sopenharmony_ci 52162306a36Sopenharmony_ci 52262306a36Sopenharmony_ciAttempt 5 52362306a36Sopenharmony_ci========= 52462306a36Sopenharmony_ci 52562306a36Sopenharmony_ciEffectively so all odd digit rp assignments in the loop were removed. 52662306a36Sopenharmony_ciThis included the else clause of the if statements. 52762306a36Sopenharmony_ciOf course after the loop we need to correct things by adding code like:: 52862306a36Sopenharmony_ci 52962306a36Sopenharmony_ci rp5 = par ^ rp4; 53062306a36Sopenharmony_ci 53162306a36Sopenharmony_ciAlso the initial assignments (rp5 = 0; etc) could be removed. 53262306a36Sopenharmony_ciAlong the line I also removed the initialisation of rp0/1/2/3. 53362306a36Sopenharmony_ci 53462306a36Sopenharmony_ci 53562306a36Sopenharmony_ciAnalysis 5 53662306a36Sopenharmony_ci========== 53762306a36Sopenharmony_ci 53862306a36Sopenharmony_ciMeasurements showed this was a good move. The run-time roughly halved 53962306a36Sopenharmony_cicompared with attempt 4 with 4 times unrolled, and we only require 1/3rd 54062306a36Sopenharmony_ciof the processor time compared to the current code in the linux kernel. 54162306a36Sopenharmony_ci 54262306a36Sopenharmony_ciHowever, still I thought there was more. I didn't like all the if 54362306a36Sopenharmony_cistatements. Why not keep a running parity and only keep the last if 54462306a36Sopenharmony_cistatement. Time for yet another version! 54562306a36Sopenharmony_ci 54662306a36Sopenharmony_ci 54762306a36Sopenharmony_ciAttempt 6 54862306a36Sopenharmony_ci========= 54962306a36Sopenharmony_ci 55062306a36Sopenharmony_ciTHe code within the for loop was changed to:: 55162306a36Sopenharmony_ci 55262306a36Sopenharmony_ci for (i = 0; i < 4; i++) 55362306a36Sopenharmony_ci { 55462306a36Sopenharmony_ci cur = *bp++; tmppar = cur; rp4 ^= cur; 55562306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; 55662306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 55762306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; 55862306a36Sopenharmony_ci 55962306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; 56062306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; 56162306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 56262306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; 56362306a36Sopenharmony_ci 56462306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; 56562306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; 56662306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; 56762306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp8 ^= cur; 56862306a36Sopenharmony_ci 56962306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; 57062306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; 57162306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 57262306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; 57362306a36Sopenharmony_ci 57462306a36Sopenharmony_ci par ^= tmppar; 57562306a36Sopenharmony_ci if ((i & 0x1) == 0) rp12 ^= tmppar; 57662306a36Sopenharmony_ci if ((i & 0x2) == 0) rp14 ^= tmppar; 57762306a36Sopenharmony_ci } 57862306a36Sopenharmony_ci 57962306a36Sopenharmony_ciAs you can see tmppar is used to accumulate the parity within a for 58062306a36Sopenharmony_ciiteration. In the last 3 statements is added to par and, if needed, 58162306a36Sopenharmony_cito rp12 and rp14. 58262306a36Sopenharmony_ci 58362306a36Sopenharmony_ciWhile making the changes I also found that I could exploit that tmppar 58462306a36Sopenharmony_cicontains the running parity for this iteration. So instead of having: 58562306a36Sopenharmony_cirp4 ^= cur; rp6 ^= cur; 58662306a36Sopenharmony_ciI removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next 58762306a36Sopenharmony_cistatement. A similar change was done for rp8 and rp10 58862306a36Sopenharmony_ci 58962306a36Sopenharmony_ci 59062306a36Sopenharmony_ciAnalysis 6 59162306a36Sopenharmony_ci========== 59262306a36Sopenharmony_ci 59362306a36Sopenharmony_ciMeasuring this code again showed big gain. When executing the original 59462306a36Sopenharmony_cilinux code 1 million times, this took about 1 second on my system. 59562306a36Sopenharmony_ci(using time to measure the performance). After this iteration I was back 59662306a36Sopenharmony_cito 0.075 sec. Actually I had to decide to start measuring over 10 59762306a36Sopenharmony_cimillion iterations in order not to lose too much accuracy. This one 59862306a36Sopenharmony_cidefinitely seemed to be the jackpot! 59962306a36Sopenharmony_ci 60062306a36Sopenharmony_ciThere is a little bit more room for improvement though. There are three 60162306a36Sopenharmony_ciplaces with statements:: 60262306a36Sopenharmony_ci 60362306a36Sopenharmony_ci rp4 ^= cur; rp6 ^= cur; 60462306a36Sopenharmony_ci 60562306a36Sopenharmony_ciIt seems more efficient to also maintain a variable rp4_6 in the while 60662306a36Sopenharmony_ciloop; This eliminates 3 statements per loop. Of course after the loop we 60762306a36Sopenharmony_cineed to correct by adding:: 60862306a36Sopenharmony_ci 60962306a36Sopenharmony_ci rp4 ^= rp4_6; 61062306a36Sopenharmony_ci rp6 ^= rp4_6 61162306a36Sopenharmony_ci 61262306a36Sopenharmony_ciFurthermore there are 4 sequential assignments to rp8. This can be 61362306a36Sopenharmony_ciencoded slightly more efficiently by saving tmppar before those 4 lines 61462306a36Sopenharmony_ciand later do rp8 = rp8 ^ tmppar ^ notrp8; 61562306a36Sopenharmony_ci(where notrp8 is the value of rp8 before those 4 lines). 61662306a36Sopenharmony_ciAgain a use of the commutative property of xor. 61762306a36Sopenharmony_ciTime for a new test! 61862306a36Sopenharmony_ci 61962306a36Sopenharmony_ci 62062306a36Sopenharmony_ciAttempt 7 62162306a36Sopenharmony_ci========= 62262306a36Sopenharmony_ci 62362306a36Sopenharmony_ciThe new code now looks like:: 62462306a36Sopenharmony_ci 62562306a36Sopenharmony_ci for (i = 0; i < 4; i++) 62662306a36Sopenharmony_ci { 62762306a36Sopenharmony_ci cur = *bp++; tmppar = cur; rp4 ^= cur; 62862306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; 62962306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 63062306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; 63162306a36Sopenharmony_ci 63262306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 63362306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; 63462306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 63562306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; 63662306a36Sopenharmony_ci 63762306a36Sopenharmony_ci notrp8 = tmppar; 63862306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 63962306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; 64062306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 64162306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; 64262306a36Sopenharmony_ci rp8 = rp8 ^ tmppar ^ notrp8; 64362306a36Sopenharmony_ci 64462306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; 64562306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp6 ^= cur; 64662306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; rp4 ^= cur; 64762306a36Sopenharmony_ci cur = *bp++; tmppar ^= cur; 64862306a36Sopenharmony_ci 64962306a36Sopenharmony_ci par ^= tmppar; 65062306a36Sopenharmony_ci if ((i & 0x1) == 0) rp12 ^= tmppar; 65162306a36Sopenharmony_ci if ((i & 0x2) == 0) rp14 ^= tmppar; 65262306a36Sopenharmony_ci } 65362306a36Sopenharmony_ci rp4 ^= rp4_6; 65462306a36Sopenharmony_ci rp6 ^= rp4_6; 65562306a36Sopenharmony_ci 65662306a36Sopenharmony_ci 65762306a36Sopenharmony_ciNot a big change, but every penny counts :-) 65862306a36Sopenharmony_ci 65962306a36Sopenharmony_ci 66062306a36Sopenharmony_ciAnalysis 7 66162306a36Sopenharmony_ci========== 66262306a36Sopenharmony_ci 66362306a36Sopenharmony_ciActually this made things worse. Not very much, but I don't want to move 66462306a36Sopenharmony_ciinto the wrong direction. Maybe something to investigate later. Could 66562306a36Sopenharmony_cihave to do with caching again. 66662306a36Sopenharmony_ci 66762306a36Sopenharmony_ciGuess that is what there is to win within the loop. Maybe unrolling one 66862306a36Sopenharmony_cimore time will help. I'll keep the optimisations from 7 for now. 66962306a36Sopenharmony_ci 67062306a36Sopenharmony_ci 67162306a36Sopenharmony_ciAttempt 8 67262306a36Sopenharmony_ci========= 67362306a36Sopenharmony_ci 67462306a36Sopenharmony_ciUnrolled the loop one more time. 67562306a36Sopenharmony_ci 67662306a36Sopenharmony_ci 67762306a36Sopenharmony_ciAnalysis 8 67862306a36Sopenharmony_ci========== 67962306a36Sopenharmony_ci 68062306a36Sopenharmony_ciThis makes things worse. Let's stick with attempt 6 and continue from there. 68162306a36Sopenharmony_ciAlthough it seems that the code within the loop cannot be optimised 68262306a36Sopenharmony_cifurther there is still room to optimize the generation of the ecc codes. 68362306a36Sopenharmony_ciWe can simply calculate the total parity. If this is 0 then rp4 = rp5 68462306a36Sopenharmony_cietc. If the parity is 1, then rp4 = !rp5; 68562306a36Sopenharmony_ci 68662306a36Sopenharmony_ciBut if rp4 = rp5 we do not need rp5 etc. We can just write the even bits 68762306a36Sopenharmony_ciin the result byte and then do something like:: 68862306a36Sopenharmony_ci 68962306a36Sopenharmony_ci code[0] |= (code[0] << 1); 69062306a36Sopenharmony_ci 69162306a36Sopenharmony_ciLets test this. 69262306a36Sopenharmony_ci 69362306a36Sopenharmony_ci 69462306a36Sopenharmony_ciAttempt 9 69562306a36Sopenharmony_ci========= 69662306a36Sopenharmony_ci 69762306a36Sopenharmony_ciChanged the code but again this slightly degrades performance. Tried all 69862306a36Sopenharmony_cikind of other things, like having dedicated parity arrays to avoid the 69962306a36Sopenharmony_cishift after parity[rp7] << 7; No gain. 70062306a36Sopenharmony_ciChange the lookup using the parity array by using shift operators (e.g. 70162306a36Sopenharmony_cireplace parity[rp7] << 7 with:: 70262306a36Sopenharmony_ci 70362306a36Sopenharmony_ci rp7 ^= (rp7 << 4); 70462306a36Sopenharmony_ci rp7 ^= (rp7 << 2); 70562306a36Sopenharmony_ci rp7 ^= (rp7 << 1); 70662306a36Sopenharmony_ci rp7 &= 0x80; 70762306a36Sopenharmony_ci 70862306a36Sopenharmony_ciNo gain. 70962306a36Sopenharmony_ci 71062306a36Sopenharmony_ciThe only marginal change was inverting the parity bits, so we can remove 71162306a36Sopenharmony_cithe last three invert statements. 71262306a36Sopenharmony_ci 71362306a36Sopenharmony_ciAh well, pity this does not deliver more. Then again 10 million 71462306a36Sopenharmony_ciiterations using the linux driver code takes between 13 and 13.5 71562306a36Sopenharmony_ciseconds, whereas my code now takes about 0.73 seconds for those 10 71662306a36Sopenharmony_cimillion iterations. So basically I've improved the performance by a 71762306a36Sopenharmony_cifactor 18 on my system. Not that bad. Of course on different hardware 71862306a36Sopenharmony_ciyou will get different results. No warranties! 71962306a36Sopenharmony_ci 72062306a36Sopenharmony_ciBut of course there is no such thing as a free lunch. The codesize almost 72162306a36Sopenharmony_citripled (from 562 bytes to 1434 bytes). Then again, it is not that much. 72262306a36Sopenharmony_ci 72362306a36Sopenharmony_ci 72462306a36Sopenharmony_ciCorrecting errors 72562306a36Sopenharmony_ci================= 72662306a36Sopenharmony_ci 72762306a36Sopenharmony_ciFor correcting errors I again used the ST application note as a starter, 72862306a36Sopenharmony_cibut I also peeked at the existing code. 72962306a36Sopenharmony_ci 73062306a36Sopenharmony_ciThe algorithm itself is pretty straightforward. Just xor the given and 73162306a36Sopenharmony_cithe calculated ecc. If all bytes are 0 there is no problem. If 11 bits 73262306a36Sopenharmony_ciare 1 we have one correctable bit error. If there is 1 bit 1, we have an 73362306a36Sopenharmony_cierror in the given ecc code. 73462306a36Sopenharmony_ci 73562306a36Sopenharmony_ciIt proved to be fastest to do some table lookups. Performance gain 73662306a36Sopenharmony_ciintroduced by this is about a factor 2 on my system when a repair had to 73762306a36Sopenharmony_cibe done, and 1% or so if no repair had to be done. 73862306a36Sopenharmony_ci 73962306a36Sopenharmony_ciCode size increased from 330 bytes to 686 bytes for this function. 74062306a36Sopenharmony_ci(gcc 4.2, -O3) 74162306a36Sopenharmony_ci 74262306a36Sopenharmony_ci 74362306a36Sopenharmony_ciConclusion 74462306a36Sopenharmony_ci========== 74562306a36Sopenharmony_ci 74662306a36Sopenharmony_ciThe gain when calculating the ecc is tremendous. Om my development hardware 74762306a36Sopenharmony_cia speedup of a factor of 18 for ecc calculation was achieved. On a test on an 74862306a36Sopenharmony_ciembedded system with a MIPS core a factor 7 was obtained. 74962306a36Sopenharmony_ci 75062306a36Sopenharmony_ciOn a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor 75162306a36Sopenharmony_ci5 (big endian mode, gcc 4.1.2, -O3) 75262306a36Sopenharmony_ci 75362306a36Sopenharmony_ciFor correction not much gain could be obtained (as bitflips are rare). Then 75462306a36Sopenharmony_ciagain there are also much less cycles spent there. 75562306a36Sopenharmony_ci 75662306a36Sopenharmony_ciIt seems there is not much more gain possible in this, at least when 75762306a36Sopenharmony_ciprogrammed in C. Of course it might be possible to squeeze something more 75862306a36Sopenharmony_ciout of it with an assembler program, but due to pipeline behaviour etc 75962306a36Sopenharmony_cithis is very tricky (at least for intel hw). 76062306a36Sopenharmony_ci 76162306a36Sopenharmony_ciAuthor: Frans Meulenbroeks 76262306a36Sopenharmony_ci 76362306a36Sopenharmony_ciCopyright (C) 2008 Koninklijke Philips Electronics NV. 764