162306a36Sopenharmony_ci==========================
262306a36Sopenharmony_ciNAND Error-correction Code
362306a36Sopenharmony_ci==========================
462306a36Sopenharmony_ci
562306a36Sopenharmony_ciIntroduction
662306a36Sopenharmony_ci============
762306a36Sopenharmony_ci
862306a36Sopenharmony_ciHaving looked at the linux mtd/nand Hamming software ECC engine driver
962306a36Sopenharmony_ciI felt there was room for optimisation. I bashed the code for a few hours
1062306a36Sopenharmony_ciperforming tricks like table lookup removing superfluous code etc.
1162306a36Sopenharmony_ciAfter that the speed was increased by 35-40%.
1262306a36Sopenharmony_ciStill I was not too happy as I felt there was additional room for improvement.
1362306a36Sopenharmony_ci
1462306a36Sopenharmony_ciBad! I was hooked.
1562306a36Sopenharmony_ciI decided to annotate my steps in this file. Perhaps it is useful to someone
1662306a36Sopenharmony_cior someone learns something from it.
1762306a36Sopenharmony_ci
1862306a36Sopenharmony_ci
1962306a36Sopenharmony_ciThe problem
2062306a36Sopenharmony_ci===========
2162306a36Sopenharmony_ci
2262306a36Sopenharmony_ciNAND flash (at least SLC one) typically has sectors of 256 bytes.
2362306a36Sopenharmony_ciHowever NAND flash is not extremely reliable so some error detection
2462306a36Sopenharmony_ci(and sometimes correction) is needed.
2562306a36Sopenharmony_ci
2662306a36Sopenharmony_ciThis is done by means of a Hamming code. I'll try to explain it in
2762306a36Sopenharmony_cilaymans terms (and apologies to all the pro's in the field in case I do
2862306a36Sopenharmony_cinot use the right terminology, my coding theory class was almost 30
2962306a36Sopenharmony_ciyears ago, and I must admit it was not one of my favourites).
3062306a36Sopenharmony_ci
3162306a36Sopenharmony_ciAs I said before the ecc calculation is performed on sectors of 256
3262306a36Sopenharmony_cibytes. This is done by calculating several parity bits over the rows and
3362306a36Sopenharmony_cicolumns. The parity used is even parity which means that the parity bit = 1
3462306a36Sopenharmony_ciif the data over which the parity is calculated is 1 and the parity bit = 0
3562306a36Sopenharmony_ciif the data over which the parity is calculated is 0. So the total
3662306a36Sopenharmony_cinumber of bits over the data over which the parity is calculated + the
3762306a36Sopenharmony_ciparity bit is even. (see wikipedia if you can't follow this).
3862306a36Sopenharmony_ciParity is often calculated by means of an exclusive or operation,
3962306a36Sopenharmony_cisometimes also referred to as xor. In C the operator for xor is ^
4062306a36Sopenharmony_ci
4162306a36Sopenharmony_ciBack to ecc.
4262306a36Sopenharmony_ciLet's give a small figure:
4362306a36Sopenharmony_ci
4462306a36Sopenharmony_ci=========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
4562306a36Sopenharmony_cibyte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
4662306a36Sopenharmony_cibyte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
4762306a36Sopenharmony_cibyte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
4862306a36Sopenharmony_cibyte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
4962306a36Sopenharmony_cibyte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
5062306a36Sopenharmony_ci...
5162306a36Sopenharmony_cibyte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
5262306a36Sopenharmony_cibyte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
5362306a36Sopenharmony_ci           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
5462306a36Sopenharmony_ci           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
5562306a36Sopenharmony_ci           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
5662306a36Sopenharmony_ci=========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ====
5762306a36Sopenharmony_ci
5862306a36Sopenharmony_ciThis figure represents a sector of 256 bytes.
5962306a36Sopenharmony_cicp is my abbreviation for column parity, rp for row parity.
6062306a36Sopenharmony_ci
6162306a36Sopenharmony_ciLet's start to explain column parity.
6262306a36Sopenharmony_ci
6362306a36Sopenharmony_ci- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
6462306a36Sopenharmony_ci
6562306a36Sopenharmony_ci  so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
6662306a36Sopenharmony_ci
6762306a36Sopenharmony_ciSimilarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
6862306a36Sopenharmony_ci
6962306a36Sopenharmony_ci- cp2 is the parity over bit0, bit1, bit4 and bit5
7062306a36Sopenharmony_ci- cp3 is the parity over bit2, bit3, bit6 and bit7.
7162306a36Sopenharmony_ci- cp4 is the parity over bit0, bit1, bit2 and bit3.
7262306a36Sopenharmony_ci- cp5 is the parity over bit4, bit5, bit6 and bit7.
7362306a36Sopenharmony_ci
7462306a36Sopenharmony_ciNote that each of cp0 .. cp5 is exactly one bit.
7562306a36Sopenharmony_ci
7662306a36Sopenharmony_ciRow parity actually works almost the same.
7762306a36Sopenharmony_ci
7862306a36Sopenharmony_ci- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
7962306a36Sopenharmony_ci- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
8062306a36Sopenharmony_ci- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
8162306a36Sopenharmony_ci  (so handle two bytes, then skip 2 bytes).
8262306a36Sopenharmony_ci- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
8362306a36Sopenharmony_ci- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
8462306a36Sopenharmony_ci
8562306a36Sopenharmony_ci  so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
8662306a36Sopenharmony_ci- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
8762306a36Sopenharmony_ci
8862306a36Sopenharmony_ciThe story now becomes quite boring. I guess you get the idea.
8962306a36Sopenharmony_ci
9062306a36Sopenharmony_ci- rp6 covers 8 bytes then skips 8 etc
9162306a36Sopenharmony_ci- rp7 skips 8 bytes then covers 8 etc
9262306a36Sopenharmony_ci- rp8 covers 16 bytes then skips 16 etc
9362306a36Sopenharmony_ci- rp9 skips 16 bytes then covers 16 etc
9462306a36Sopenharmony_ci- rp10 covers 32 bytes then skips 32 etc
9562306a36Sopenharmony_ci- rp11 skips 32 bytes then covers 32 etc
9662306a36Sopenharmony_ci- rp12 covers 64 bytes then skips 64 etc
9762306a36Sopenharmony_ci- rp13 skips 64 bytes then covers 64 etc
9862306a36Sopenharmony_ci- rp14 covers 128 bytes then skips 128
9962306a36Sopenharmony_ci- rp15 skips 128 bytes then covers 128
10062306a36Sopenharmony_ci
10162306a36Sopenharmony_ciIn the end the parity bits are grouped together in three bytes as
10262306a36Sopenharmony_cifollows:
10362306a36Sopenharmony_ci
10462306a36Sopenharmony_ci=====  ===== ===== ===== ===== ===== ===== ===== =====
10562306a36Sopenharmony_ciECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
10662306a36Sopenharmony_ci=====  ===== ===== ===== ===== ===== ===== ===== =====
10762306a36Sopenharmony_ciECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
10862306a36Sopenharmony_ciECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
10962306a36Sopenharmony_ciECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
11062306a36Sopenharmony_ci=====  ===== ===== ===== ===== ===== ===== ===== =====
11162306a36Sopenharmony_ci
11262306a36Sopenharmony_ciI detected after writing this that ST application note AN1823
11362306a36Sopenharmony_ci(http://www.st.com/stonline/) gives a much
11462306a36Sopenharmony_cinicer picture.(but they use line parity as term where I use row parity)
11562306a36Sopenharmony_ciOh well, I'm graphically challenged, so suffer with me for a moment :-)
11662306a36Sopenharmony_ci
11762306a36Sopenharmony_ciAnd I could not reuse the ST picture anyway for copyright reasons.
11862306a36Sopenharmony_ci
11962306a36Sopenharmony_ci
12062306a36Sopenharmony_ciAttempt 0
12162306a36Sopenharmony_ci=========
12262306a36Sopenharmony_ci
12362306a36Sopenharmony_ciImplementing the parity calculation is pretty simple.
12462306a36Sopenharmony_ciIn C pseudocode::
12562306a36Sopenharmony_ci
12662306a36Sopenharmony_ci  for (i = 0; i < 256; i++)
12762306a36Sopenharmony_ci  {
12862306a36Sopenharmony_ci    if (i & 0x01)
12962306a36Sopenharmony_ci       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
13062306a36Sopenharmony_ci    else
13162306a36Sopenharmony_ci       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0;
13262306a36Sopenharmony_ci    if (i & 0x02)
13362306a36Sopenharmony_ci       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
13462306a36Sopenharmony_ci    else
13562306a36Sopenharmony_ci       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
13662306a36Sopenharmony_ci    if (i & 0x04)
13762306a36Sopenharmony_ci      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
13862306a36Sopenharmony_ci    else
13962306a36Sopenharmony_ci      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
14062306a36Sopenharmony_ci    if (i & 0x08)
14162306a36Sopenharmony_ci      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
14262306a36Sopenharmony_ci    else
14362306a36Sopenharmony_ci      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
14462306a36Sopenharmony_ci    if (i & 0x10)
14562306a36Sopenharmony_ci      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
14662306a36Sopenharmony_ci    else
14762306a36Sopenharmony_ci      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
14862306a36Sopenharmony_ci    if (i & 0x20)
14962306a36Sopenharmony_ci      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
15062306a36Sopenharmony_ci    else
15162306a36Sopenharmony_ci      rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
15262306a36Sopenharmony_ci    if (i & 0x40)
15362306a36Sopenharmony_ci      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
15462306a36Sopenharmony_ci    else
15562306a36Sopenharmony_ci      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
15662306a36Sopenharmony_ci    if (i & 0x80)
15762306a36Sopenharmony_ci      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
15862306a36Sopenharmony_ci    else
15962306a36Sopenharmony_ci      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
16062306a36Sopenharmony_ci    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
16162306a36Sopenharmony_ci    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
16262306a36Sopenharmony_ci    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
16362306a36Sopenharmony_ci    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
16462306a36Sopenharmony_ci    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
16562306a36Sopenharmony_ci    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
16662306a36Sopenharmony_ci  }
16762306a36Sopenharmony_ci
16862306a36Sopenharmony_ci
16962306a36Sopenharmony_ciAnalysis 0
17062306a36Sopenharmony_ci==========
17162306a36Sopenharmony_ci
17262306a36Sopenharmony_ciC does have bitwise operators but not really operators to do the above
17362306a36Sopenharmony_ciefficiently (and most hardware has no such instructions either).
17462306a36Sopenharmony_ciTherefore without implementing this it was clear that the code above was
17562306a36Sopenharmony_cinot going to bring me a Nobel prize :-)
17662306a36Sopenharmony_ci
17762306a36Sopenharmony_ciFortunately the exclusive or operation is commutative, so we can combine
17862306a36Sopenharmony_cithe values in any order. So instead of calculating all the bits
17962306a36Sopenharmony_ciindividually, let us try to rearrange things.
18062306a36Sopenharmony_ciFor the column parity this is easy. We can just xor the bytes and in the
18162306a36Sopenharmony_ciend filter out the relevant bits. This is pretty nice as it will bring
18262306a36Sopenharmony_ciall cp calculation out of the for loop.
18362306a36Sopenharmony_ci
18462306a36Sopenharmony_ciSimilarly we can first xor the bytes for the various rows.
18562306a36Sopenharmony_ciThis leads to:
18662306a36Sopenharmony_ci
18762306a36Sopenharmony_ci
18862306a36Sopenharmony_ciAttempt 1
18962306a36Sopenharmony_ci=========
19062306a36Sopenharmony_ci
19162306a36Sopenharmony_ci::
19262306a36Sopenharmony_ci
19362306a36Sopenharmony_ci  const char parity[256] = {
19462306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
19562306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
19662306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
19762306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
19862306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
19962306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
20062306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
20162306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
20262306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
20362306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
20462306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
20562306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
20662306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
20762306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
20862306a36Sopenharmony_ci      1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
20962306a36Sopenharmony_ci      0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
21062306a36Sopenharmony_ci  };
21162306a36Sopenharmony_ci
21262306a36Sopenharmony_ci  void ecc1(const unsigned char *buf, unsigned char *code)
21362306a36Sopenharmony_ci  {
21462306a36Sopenharmony_ci      int i;
21562306a36Sopenharmony_ci      const unsigned char *bp = buf;
21662306a36Sopenharmony_ci      unsigned char cur;
21762306a36Sopenharmony_ci      unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
21862306a36Sopenharmony_ci      unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
21962306a36Sopenharmony_ci      unsigned char par;
22062306a36Sopenharmony_ci
22162306a36Sopenharmony_ci      par = 0;
22262306a36Sopenharmony_ci      rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
22362306a36Sopenharmony_ci      rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
22462306a36Sopenharmony_ci      rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
22562306a36Sopenharmony_ci      rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
22662306a36Sopenharmony_ci
22762306a36Sopenharmony_ci      for (i = 0; i < 256; i++)
22862306a36Sopenharmony_ci      {
22962306a36Sopenharmony_ci          cur = *bp++;
23062306a36Sopenharmony_ci          par ^= cur;
23162306a36Sopenharmony_ci          if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
23262306a36Sopenharmony_ci          if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
23362306a36Sopenharmony_ci          if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
23462306a36Sopenharmony_ci          if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
23562306a36Sopenharmony_ci          if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
23662306a36Sopenharmony_ci          if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
23762306a36Sopenharmony_ci          if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
23862306a36Sopenharmony_ci          if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
23962306a36Sopenharmony_ci      }
24062306a36Sopenharmony_ci      code[0] =
24162306a36Sopenharmony_ci          (parity[rp7] << 7) |
24262306a36Sopenharmony_ci          (parity[rp6] << 6) |
24362306a36Sopenharmony_ci          (parity[rp5] << 5) |
24462306a36Sopenharmony_ci          (parity[rp4] << 4) |
24562306a36Sopenharmony_ci          (parity[rp3] << 3) |
24662306a36Sopenharmony_ci          (parity[rp2] << 2) |
24762306a36Sopenharmony_ci          (parity[rp1] << 1) |
24862306a36Sopenharmony_ci          (parity[rp0]);
24962306a36Sopenharmony_ci      code[1] =
25062306a36Sopenharmony_ci          (parity[rp15] << 7) |
25162306a36Sopenharmony_ci          (parity[rp14] << 6) |
25262306a36Sopenharmony_ci          (parity[rp13] << 5) |
25362306a36Sopenharmony_ci          (parity[rp12] << 4) |
25462306a36Sopenharmony_ci          (parity[rp11] << 3) |
25562306a36Sopenharmony_ci          (parity[rp10] << 2) |
25662306a36Sopenharmony_ci          (parity[rp9]  << 1) |
25762306a36Sopenharmony_ci          (parity[rp8]);
25862306a36Sopenharmony_ci      code[2] =
25962306a36Sopenharmony_ci          (parity[par & 0xf0] << 7) |
26062306a36Sopenharmony_ci          (parity[par & 0x0f] << 6) |
26162306a36Sopenharmony_ci          (parity[par & 0xcc] << 5) |
26262306a36Sopenharmony_ci          (parity[par & 0x33] << 4) |
26362306a36Sopenharmony_ci          (parity[par & 0xaa] << 3) |
26462306a36Sopenharmony_ci          (parity[par & 0x55] << 2);
26562306a36Sopenharmony_ci      code[0] = ~code[0];
26662306a36Sopenharmony_ci      code[1] = ~code[1];
26762306a36Sopenharmony_ci      code[2] = ~code[2];
26862306a36Sopenharmony_ci  }
26962306a36Sopenharmony_ci
27062306a36Sopenharmony_ciStill pretty straightforward. The last three invert statements are there to
27162306a36Sopenharmony_cigive a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
27262306a36Sopenharmony_ciall data is 0xff, so the checksum then matches.
27362306a36Sopenharmony_ci
27462306a36Sopenharmony_ciI also introduced the parity lookup. I expected this to be the fastest
27562306a36Sopenharmony_ciway to calculate the parity, but I will investigate alternatives later
27662306a36Sopenharmony_cion.
27762306a36Sopenharmony_ci
27862306a36Sopenharmony_ci
27962306a36Sopenharmony_ciAnalysis 1
28062306a36Sopenharmony_ci==========
28162306a36Sopenharmony_ci
28262306a36Sopenharmony_ciThe code works, but is not terribly efficient. On my system it took
28362306a36Sopenharmony_cialmost 4 times as much time as the linux driver code. But hey, if it was
28462306a36Sopenharmony_ci*that* easy this would have been done long before.
28562306a36Sopenharmony_ciNo pain. no gain.
28662306a36Sopenharmony_ci
28762306a36Sopenharmony_ciFortunately there is plenty of room for improvement.
28862306a36Sopenharmony_ci
28962306a36Sopenharmony_ciIn step 1 we moved from bit-wise calculation to byte-wise calculation.
29062306a36Sopenharmony_ciHowever in C we can also use the unsigned long data type and virtually
29162306a36Sopenharmony_cievery modern microprocessor supports 32 bit operations, so why not try
29262306a36Sopenharmony_cito write our code in such a way that we process data in 32 bit chunks.
29362306a36Sopenharmony_ci
29462306a36Sopenharmony_ciOf course this means some modification as the row parity is byte by
29562306a36Sopenharmony_cibyte. A quick analysis:
29662306a36Sopenharmony_cifor the column parity we use the par variable. When extending to 32 bits
29762306a36Sopenharmony_ciwe can in the end easily calculate rp0 and rp1 from it.
29862306a36Sopenharmony_ci(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
29962306a36Sopenharmony_cirespectively, from MSB to LSB)
30062306a36Sopenharmony_cialso rp2 and rp3 can be easily retrieved from par as rp3 covers the
30162306a36Sopenharmony_cifirst two MSBs and rp2 covers the last two LSBs.
30262306a36Sopenharmony_ci
30362306a36Sopenharmony_ciNote that of course now the loop is executed only 64 times (256/4).
30462306a36Sopenharmony_ciAnd note that care must taken wrt byte ordering. The way bytes are
30562306a36Sopenharmony_ciordered in a long is machine dependent, and might affect us.
30662306a36Sopenharmony_ciAnyway, if there is an issue: this code is developed on x86 (to be
30762306a36Sopenharmony_ciprecise: a DELL PC with a D920 Intel CPU)
30862306a36Sopenharmony_ci
30962306a36Sopenharmony_ciAnd of course the performance might depend on alignment, but I expect
31062306a36Sopenharmony_cithat the I/O buffers in the nand driver are aligned properly (and
31162306a36Sopenharmony_ciotherwise that should be fixed to get maximum performance).
31262306a36Sopenharmony_ci
31362306a36Sopenharmony_ciLet's give it a try...
31462306a36Sopenharmony_ci
31562306a36Sopenharmony_ci
31662306a36Sopenharmony_ciAttempt 2
31762306a36Sopenharmony_ci=========
31862306a36Sopenharmony_ci
31962306a36Sopenharmony_ci::
32062306a36Sopenharmony_ci
32162306a36Sopenharmony_ci  extern const char parity[256];
32262306a36Sopenharmony_ci
32362306a36Sopenharmony_ci  void ecc2(const unsigned char *buf, unsigned char *code)
32462306a36Sopenharmony_ci  {
32562306a36Sopenharmony_ci      int i;
32662306a36Sopenharmony_ci      const unsigned long *bp = (unsigned long *)buf;
32762306a36Sopenharmony_ci      unsigned long cur;
32862306a36Sopenharmony_ci      unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
32962306a36Sopenharmony_ci      unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
33062306a36Sopenharmony_ci      unsigned long par;
33162306a36Sopenharmony_ci
33262306a36Sopenharmony_ci      par = 0;
33362306a36Sopenharmony_ci      rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
33462306a36Sopenharmony_ci      rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
33562306a36Sopenharmony_ci      rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
33662306a36Sopenharmony_ci      rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
33762306a36Sopenharmony_ci
33862306a36Sopenharmony_ci      for (i = 0; i < 64; i++)
33962306a36Sopenharmony_ci      {
34062306a36Sopenharmony_ci          cur = *bp++;
34162306a36Sopenharmony_ci          par ^= cur;
34262306a36Sopenharmony_ci          if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
34362306a36Sopenharmony_ci          if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
34462306a36Sopenharmony_ci          if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
34562306a36Sopenharmony_ci          if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
34662306a36Sopenharmony_ci          if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
34762306a36Sopenharmony_ci          if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
34862306a36Sopenharmony_ci      }
34962306a36Sopenharmony_ci      /*
35062306a36Sopenharmony_ci         we need to adapt the code generation for the fact that rp vars are now
35162306a36Sopenharmony_ci         long; also the column parity calculation needs to be changed.
35262306a36Sopenharmony_ci         we'll bring rp4 to 15 back to single byte entities by shifting and
35362306a36Sopenharmony_ci         xoring
35462306a36Sopenharmony_ci      */
35562306a36Sopenharmony_ci      rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
35662306a36Sopenharmony_ci      rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
35762306a36Sopenharmony_ci      rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
35862306a36Sopenharmony_ci      rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
35962306a36Sopenharmony_ci      rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
36062306a36Sopenharmony_ci      rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
36162306a36Sopenharmony_ci      rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
36262306a36Sopenharmony_ci      rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
36362306a36Sopenharmony_ci      rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
36462306a36Sopenharmony_ci      rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
36562306a36Sopenharmony_ci      rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
36662306a36Sopenharmony_ci      rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
36762306a36Sopenharmony_ci      rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
36862306a36Sopenharmony_ci      rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
36962306a36Sopenharmony_ci      par ^= (par >> 16);
37062306a36Sopenharmony_ci      rp1 = (par >> 8); rp1 &= 0xff;
37162306a36Sopenharmony_ci      rp0 = (par & 0xff);
37262306a36Sopenharmony_ci      par ^= (par >> 8); par &= 0xff;
37362306a36Sopenharmony_ci
37462306a36Sopenharmony_ci      code[0] =
37562306a36Sopenharmony_ci          (parity[rp7] << 7) |
37662306a36Sopenharmony_ci          (parity[rp6] << 6) |
37762306a36Sopenharmony_ci          (parity[rp5] << 5) |
37862306a36Sopenharmony_ci          (parity[rp4] << 4) |
37962306a36Sopenharmony_ci          (parity[rp3] << 3) |
38062306a36Sopenharmony_ci          (parity[rp2] << 2) |
38162306a36Sopenharmony_ci          (parity[rp1] << 1) |
38262306a36Sopenharmony_ci          (parity[rp0]);
38362306a36Sopenharmony_ci      code[1] =
38462306a36Sopenharmony_ci          (parity[rp15] << 7) |
38562306a36Sopenharmony_ci          (parity[rp14] << 6) |
38662306a36Sopenharmony_ci          (parity[rp13] << 5) |
38762306a36Sopenharmony_ci          (parity[rp12] << 4) |
38862306a36Sopenharmony_ci          (parity[rp11] << 3) |
38962306a36Sopenharmony_ci          (parity[rp10] << 2) |
39062306a36Sopenharmony_ci          (parity[rp9]  << 1) |
39162306a36Sopenharmony_ci          (parity[rp8]);
39262306a36Sopenharmony_ci      code[2] =
39362306a36Sopenharmony_ci          (parity[par & 0xf0] << 7) |
39462306a36Sopenharmony_ci          (parity[par & 0x0f] << 6) |
39562306a36Sopenharmony_ci          (parity[par & 0xcc] << 5) |
39662306a36Sopenharmony_ci          (parity[par & 0x33] << 4) |
39762306a36Sopenharmony_ci          (parity[par & 0xaa] << 3) |
39862306a36Sopenharmony_ci          (parity[par & 0x55] << 2);
39962306a36Sopenharmony_ci      code[0] = ~code[0];
40062306a36Sopenharmony_ci      code[1] = ~code[1];
40162306a36Sopenharmony_ci      code[2] = ~code[2];
40262306a36Sopenharmony_ci  }
40362306a36Sopenharmony_ci
40462306a36Sopenharmony_ciThe parity array is not shown any more. Note also that for these
40562306a36Sopenharmony_ciexamples I kinda deviated from my regular programming style by allowing
40662306a36Sopenharmony_cimultiple statements on a line, not using { } in then and else blocks
40762306a36Sopenharmony_ciwith only a single statement and by using operators like ^=
40862306a36Sopenharmony_ci
40962306a36Sopenharmony_ci
41062306a36Sopenharmony_ciAnalysis 2
41162306a36Sopenharmony_ci==========
41262306a36Sopenharmony_ci
41362306a36Sopenharmony_ciThe code (of course) works, and hurray: we are a little bit faster than
41462306a36Sopenharmony_cithe linux driver code (about 15%). But wait, don't cheer too quickly.
41562306a36Sopenharmony_ciThere is more to be gained.
41662306a36Sopenharmony_ciIf we look at e.g. rp14 and rp15 we see that we either xor our data with
41762306a36Sopenharmony_cirp14 or with rp15. However we also have par which goes over all data.
41862306a36Sopenharmony_ciThis means there is no need to calculate rp14 as it can be calculated from
41962306a36Sopenharmony_cirp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15;
42062306a36Sopenharmony_ci(or if desired we can avoid calculating rp15 and calculate it from
42162306a36Sopenharmony_cirp14).  That is why some places refer to inverse parity.
42262306a36Sopenharmony_ciOf course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
42362306a36Sopenharmony_ciEffectively this means we can eliminate the else clause from the if
42462306a36Sopenharmony_cistatements. Also we can optimise the calculation in the end a little bit
42562306a36Sopenharmony_ciby going from long to byte first. Actually we can even avoid the table
42662306a36Sopenharmony_cilookups
42762306a36Sopenharmony_ci
42862306a36Sopenharmony_ciAttempt 3
42962306a36Sopenharmony_ci=========
43062306a36Sopenharmony_ci
43162306a36Sopenharmony_ciOdd replaced::
43262306a36Sopenharmony_ci
43362306a36Sopenharmony_ci          if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
43462306a36Sopenharmony_ci          if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
43562306a36Sopenharmony_ci          if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
43662306a36Sopenharmony_ci          if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
43762306a36Sopenharmony_ci          if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
43862306a36Sopenharmony_ci          if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
43962306a36Sopenharmony_ci
44062306a36Sopenharmony_ciwith::
44162306a36Sopenharmony_ci
44262306a36Sopenharmony_ci          if (i & 0x01) rp5 ^= cur;
44362306a36Sopenharmony_ci          if (i & 0x02) rp7 ^= cur;
44462306a36Sopenharmony_ci          if (i & 0x04) rp9 ^= cur;
44562306a36Sopenharmony_ci          if (i & 0x08) rp11 ^= cur;
44662306a36Sopenharmony_ci          if (i & 0x10) rp13 ^= cur;
44762306a36Sopenharmony_ci          if (i & 0x20) rp15 ^= cur;
44862306a36Sopenharmony_ci
44962306a36Sopenharmony_ciand outside the loop added::
45062306a36Sopenharmony_ci
45162306a36Sopenharmony_ci          rp4  = par ^ rp5;
45262306a36Sopenharmony_ci          rp6  = par ^ rp7;
45362306a36Sopenharmony_ci          rp8  = par ^ rp9;
45462306a36Sopenharmony_ci          rp10  = par ^ rp11;
45562306a36Sopenharmony_ci          rp12  = par ^ rp13;
45662306a36Sopenharmony_ci          rp14  = par ^ rp15;
45762306a36Sopenharmony_ci
45862306a36Sopenharmony_ciAnd after that the code takes about 30% more time, although the number of
45962306a36Sopenharmony_cistatements is reduced. This is also reflected in the assembly code.
46062306a36Sopenharmony_ci
46162306a36Sopenharmony_ci
46262306a36Sopenharmony_ciAnalysis 3
46362306a36Sopenharmony_ci==========
46462306a36Sopenharmony_ci
46562306a36Sopenharmony_ciVery weird. Guess it has to do with caching or instruction parallellism
46662306a36Sopenharmony_cior so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
46762306a36Sopenharmony_ciobservation was that this one is only 30% slower (according to time)
46862306a36Sopenharmony_ciexecuting the code as my 3Ghz D920 processor.
46962306a36Sopenharmony_ci
47062306a36Sopenharmony_ciWell, it was expected not to be easy so maybe instead move to a
47162306a36Sopenharmony_cidifferent track: let's move back to the code from attempt2 and do some
47262306a36Sopenharmony_ciloop unrolling. This will eliminate a few if statements. I'll try
47362306a36Sopenharmony_cidifferent amounts of unrolling to see what works best.
47462306a36Sopenharmony_ci
47562306a36Sopenharmony_ci
47662306a36Sopenharmony_ciAttempt 4
47762306a36Sopenharmony_ci=========
47862306a36Sopenharmony_ci
47962306a36Sopenharmony_ciUnrolled the loop 1, 2, 3 and 4 times.
48062306a36Sopenharmony_ciFor 4 the code starts with::
48162306a36Sopenharmony_ci
48262306a36Sopenharmony_ci    for (i = 0; i < 4; i++)
48362306a36Sopenharmony_ci    {
48462306a36Sopenharmony_ci        cur = *bp++;
48562306a36Sopenharmony_ci        par ^= cur;
48662306a36Sopenharmony_ci        rp4 ^= cur;
48762306a36Sopenharmony_ci        rp6 ^= cur;
48862306a36Sopenharmony_ci        rp8 ^= cur;
48962306a36Sopenharmony_ci        rp10 ^= cur;
49062306a36Sopenharmony_ci        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
49162306a36Sopenharmony_ci        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
49262306a36Sopenharmony_ci        cur = *bp++;
49362306a36Sopenharmony_ci        par ^= cur;
49462306a36Sopenharmony_ci        rp5 ^= cur;
49562306a36Sopenharmony_ci        rp6 ^= cur;
49662306a36Sopenharmony_ci        ...
49762306a36Sopenharmony_ci
49862306a36Sopenharmony_ci
49962306a36Sopenharmony_ciAnalysis 4
50062306a36Sopenharmony_ci==========
50162306a36Sopenharmony_ci
50262306a36Sopenharmony_ciUnrolling once gains about 15%
50362306a36Sopenharmony_ci
50462306a36Sopenharmony_ciUnrolling twice keeps the gain at about 15%
50562306a36Sopenharmony_ci
50662306a36Sopenharmony_ciUnrolling three times gives a gain of 30% compared to attempt 2.
50762306a36Sopenharmony_ci
50862306a36Sopenharmony_ciUnrolling four times gives a marginal improvement compared to unrolling
50962306a36Sopenharmony_cithree times.
51062306a36Sopenharmony_ci
51162306a36Sopenharmony_ciI decided to proceed with a four time unrolled loop anyway. It was my gut
51262306a36Sopenharmony_cifeeling that in the next steps I would obtain additional gain from it.
51362306a36Sopenharmony_ci
51462306a36Sopenharmony_ciThe next step was triggered by the fact that par contains the xor of all
51562306a36Sopenharmony_cibytes and rp4 and rp5 each contain the xor of half of the bytes.
51662306a36Sopenharmony_ciSo in effect par = rp4 ^ rp5. But as xor is commutative we can also say
51762306a36Sopenharmony_cithat rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
51862306a36Sopenharmony_cieliminate rp5 (or rp4, but I already foresaw another optimisation).
51962306a36Sopenharmony_ciThe same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
52062306a36Sopenharmony_ci
52162306a36Sopenharmony_ci
52262306a36Sopenharmony_ciAttempt 5
52362306a36Sopenharmony_ci=========
52462306a36Sopenharmony_ci
52562306a36Sopenharmony_ciEffectively so all odd digit rp assignments in the loop were removed.
52662306a36Sopenharmony_ciThis included the else clause of the if statements.
52762306a36Sopenharmony_ciOf course after the loop we need to correct things by adding code like::
52862306a36Sopenharmony_ci
52962306a36Sopenharmony_ci    rp5 = par ^ rp4;
53062306a36Sopenharmony_ci
53162306a36Sopenharmony_ciAlso the initial assignments (rp5 = 0; etc) could be removed.
53262306a36Sopenharmony_ciAlong the line I also removed the initialisation of rp0/1/2/3.
53362306a36Sopenharmony_ci
53462306a36Sopenharmony_ci
53562306a36Sopenharmony_ciAnalysis 5
53662306a36Sopenharmony_ci==========
53762306a36Sopenharmony_ci
53862306a36Sopenharmony_ciMeasurements showed this was a good move. The run-time roughly halved
53962306a36Sopenharmony_cicompared with attempt 4 with 4 times unrolled, and we only require 1/3rd
54062306a36Sopenharmony_ciof the processor time compared to the current code in the linux kernel.
54162306a36Sopenharmony_ci
54262306a36Sopenharmony_ciHowever, still I thought there was more. I didn't like all the if
54362306a36Sopenharmony_cistatements. Why not keep a running parity and only keep the last if
54462306a36Sopenharmony_cistatement. Time for yet another version!
54562306a36Sopenharmony_ci
54662306a36Sopenharmony_ci
54762306a36Sopenharmony_ciAttempt 6
54862306a36Sopenharmony_ci=========
54962306a36Sopenharmony_ci
55062306a36Sopenharmony_ciTHe code within the for loop was changed to::
55162306a36Sopenharmony_ci
55262306a36Sopenharmony_ci    for (i = 0; i < 4; i++)
55362306a36Sopenharmony_ci    {
55462306a36Sopenharmony_ci        cur = *bp++; tmppar  = cur; rp4 ^= cur;
55562306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
55662306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
55762306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
55862306a36Sopenharmony_ci
55962306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
56062306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
56162306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
56262306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
56362306a36Sopenharmony_ci
56462306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
56562306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
56662306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
56762306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
56862306a36Sopenharmony_ci
56962306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
57062306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
57162306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
57262306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur;
57362306a36Sopenharmony_ci
57462306a36Sopenharmony_ci        par ^= tmppar;
57562306a36Sopenharmony_ci        if ((i & 0x1) == 0) rp12 ^= tmppar;
57662306a36Sopenharmony_ci        if ((i & 0x2) == 0) rp14 ^= tmppar;
57762306a36Sopenharmony_ci    }
57862306a36Sopenharmony_ci
57962306a36Sopenharmony_ciAs you can see tmppar is used to accumulate the parity within a for
58062306a36Sopenharmony_ciiteration. In the last 3 statements is added to par and, if needed,
58162306a36Sopenharmony_cito rp12 and rp14.
58262306a36Sopenharmony_ci
58362306a36Sopenharmony_ciWhile making the changes I also found that I could exploit that tmppar
58462306a36Sopenharmony_cicontains the running parity for this iteration. So instead of having:
58562306a36Sopenharmony_cirp4 ^= cur; rp6 ^= cur;
58662306a36Sopenharmony_ciI removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next
58762306a36Sopenharmony_cistatement. A similar change was done for rp8 and rp10
58862306a36Sopenharmony_ci
58962306a36Sopenharmony_ci
59062306a36Sopenharmony_ciAnalysis 6
59162306a36Sopenharmony_ci==========
59262306a36Sopenharmony_ci
59362306a36Sopenharmony_ciMeasuring this code again showed big gain. When executing the original
59462306a36Sopenharmony_cilinux code 1 million times, this took about 1 second on my system.
59562306a36Sopenharmony_ci(using time to measure the performance). After this iteration I was back
59662306a36Sopenharmony_cito 0.075 sec. Actually I had to decide to start measuring over 10
59762306a36Sopenharmony_cimillion iterations in order not to lose too much accuracy. This one
59862306a36Sopenharmony_cidefinitely seemed to be the jackpot!
59962306a36Sopenharmony_ci
60062306a36Sopenharmony_ciThere is a little bit more room for improvement though. There are three
60162306a36Sopenharmony_ciplaces with statements::
60262306a36Sopenharmony_ci
60362306a36Sopenharmony_ci	rp4 ^= cur; rp6 ^= cur;
60462306a36Sopenharmony_ci
60562306a36Sopenharmony_ciIt seems more efficient to also maintain a variable rp4_6 in the while
60662306a36Sopenharmony_ciloop; This eliminates 3 statements per loop. Of course after the loop we
60762306a36Sopenharmony_cineed to correct by adding::
60862306a36Sopenharmony_ci
60962306a36Sopenharmony_ci	rp4 ^= rp4_6;
61062306a36Sopenharmony_ci	rp6 ^= rp4_6
61162306a36Sopenharmony_ci
61262306a36Sopenharmony_ciFurthermore there are 4 sequential assignments to rp8. This can be
61362306a36Sopenharmony_ciencoded slightly more efficiently by saving tmppar before those 4 lines
61462306a36Sopenharmony_ciand later do rp8 = rp8 ^ tmppar ^ notrp8;
61562306a36Sopenharmony_ci(where notrp8 is the value of rp8 before those 4 lines).
61662306a36Sopenharmony_ciAgain a use of the commutative property of xor.
61762306a36Sopenharmony_ciTime for a new test!
61862306a36Sopenharmony_ci
61962306a36Sopenharmony_ci
62062306a36Sopenharmony_ciAttempt 7
62162306a36Sopenharmony_ci=========
62262306a36Sopenharmony_ci
62362306a36Sopenharmony_ciThe new code now looks like::
62462306a36Sopenharmony_ci
62562306a36Sopenharmony_ci    for (i = 0; i < 4; i++)
62662306a36Sopenharmony_ci    {
62762306a36Sopenharmony_ci        cur = *bp++; tmppar  = cur; rp4 ^= cur;
62862306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
62962306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
63062306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
63162306a36Sopenharmony_ci
63262306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
63362306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
63462306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
63562306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
63662306a36Sopenharmony_ci
63762306a36Sopenharmony_ci        notrp8 = tmppar;
63862306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
63962306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
64062306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
64162306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur;
64262306a36Sopenharmony_ci        rp8 = rp8 ^ tmppar ^ notrp8;
64362306a36Sopenharmony_ci
64462306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
64562306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
64662306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
64762306a36Sopenharmony_ci        cur = *bp++; tmppar ^= cur;
64862306a36Sopenharmony_ci
64962306a36Sopenharmony_ci        par ^= tmppar;
65062306a36Sopenharmony_ci        if ((i & 0x1) == 0) rp12 ^= tmppar;
65162306a36Sopenharmony_ci        if ((i & 0x2) == 0) rp14 ^= tmppar;
65262306a36Sopenharmony_ci    }
65362306a36Sopenharmony_ci    rp4 ^= rp4_6;
65462306a36Sopenharmony_ci    rp6 ^= rp4_6;
65562306a36Sopenharmony_ci
65662306a36Sopenharmony_ci
65762306a36Sopenharmony_ciNot a big change, but every penny counts :-)
65862306a36Sopenharmony_ci
65962306a36Sopenharmony_ci
66062306a36Sopenharmony_ciAnalysis 7
66162306a36Sopenharmony_ci==========
66262306a36Sopenharmony_ci
66362306a36Sopenharmony_ciActually this made things worse. Not very much, but I don't want to move
66462306a36Sopenharmony_ciinto the wrong direction. Maybe something to investigate later. Could
66562306a36Sopenharmony_cihave to do with caching again.
66662306a36Sopenharmony_ci
66762306a36Sopenharmony_ciGuess that is what there is to win within the loop. Maybe unrolling one
66862306a36Sopenharmony_cimore time will help. I'll keep the optimisations from 7 for now.
66962306a36Sopenharmony_ci
67062306a36Sopenharmony_ci
67162306a36Sopenharmony_ciAttempt 8
67262306a36Sopenharmony_ci=========
67362306a36Sopenharmony_ci
67462306a36Sopenharmony_ciUnrolled the loop one more time.
67562306a36Sopenharmony_ci
67662306a36Sopenharmony_ci
67762306a36Sopenharmony_ciAnalysis 8
67862306a36Sopenharmony_ci==========
67962306a36Sopenharmony_ci
68062306a36Sopenharmony_ciThis makes things worse. Let's stick with attempt 6 and continue from there.
68162306a36Sopenharmony_ciAlthough it seems that the code within the loop cannot be optimised
68262306a36Sopenharmony_cifurther there is still room to optimize the generation of the ecc codes.
68362306a36Sopenharmony_ciWe can simply calculate the total parity. If this is 0 then rp4 = rp5
68462306a36Sopenharmony_cietc. If the parity is 1, then rp4 = !rp5;
68562306a36Sopenharmony_ci
68662306a36Sopenharmony_ciBut if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
68762306a36Sopenharmony_ciin the result byte and then do something like::
68862306a36Sopenharmony_ci
68962306a36Sopenharmony_ci    code[0] |= (code[0] << 1);
69062306a36Sopenharmony_ci
69162306a36Sopenharmony_ciLets test this.
69262306a36Sopenharmony_ci
69362306a36Sopenharmony_ci
69462306a36Sopenharmony_ciAttempt 9
69562306a36Sopenharmony_ci=========
69662306a36Sopenharmony_ci
69762306a36Sopenharmony_ciChanged the code but again this slightly degrades performance. Tried all
69862306a36Sopenharmony_cikind of other things, like having dedicated parity arrays to avoid the
69962306a36Sopenharmony_cishift after parity[rp7] << 7; No gain.
70062306a36Sopenharmony_ciChange the lookup using the parity array by using shift operators (e.g.
70162306a36Sopenharmony_cireplace parity[rp7] << 7 with::
70262306a36Sopenharmony_ci
70362306a36Sopenharmony_ci	rp7 ^= (rp7 << 4);
70462306a36Sopenharmony_ci	rp7 ^= (rp7 << 2);
70562306a36Sopenharmony_ci	rp7 ^= (rp7 << 1);
70662306a36Sopenharmony_ci	rp7 &= 0x80;
70762306a36Sopenharmony_ci
70862306a36Sopenharmony_ciNo gain.
70962306a36Sopenharmony_ci
71062306a36Sopenharmony_ciThe only marginal change was inverting the parity bits, so we can remove
71162306a36Sopenharmony_cithe last three invert statements.
71262306a36Sopenharmony_ci
71362306a36Sopenharmony_ciAh well, pity this does not deliver more. Then again 10 million
71462306a36Sopenharmony_ciiterations using the linux driver code takes between 13 and 13.5
71562306a36Sopenharmony_ciseconds, whereas my code now takes about 0.73 seconds for those 10
71662306a36Sopenharmony_cimillion iterations. So basically I've improved the performance by a
71762306a36Sopenharmony_cifactor 18 on my system. Not that bad. Of course on different hardware
71862306a36Sopenharmony_ciyou will get different results. No warranties!
71962306a36Sopenharmony_ci
72062306a36Sopenharmony_ciBut of course there is no such thing as a free lunch. The codesize almost
72162306a36Sopenharmony_citripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
72262306a36Sopenharmony_ci
72362306a36Sopenharmony_ci
72462306a36Sopenharmony_ciCorrecting errors
72562306a36Sopenharmony_ci=================
72662306a36Sopenharmony_ci
72762306a36Sopenharmony_ciFor correcting errors I again used the ST application note as a starter,
72862306a36Sopenharmony_cibut I also peeked at the existing code.
72962306a36Sopenharmony_ci
73062306a36Sopenharmony_ciThe algorithm itself is pretty straightforward. Just xor the given and
73162306a36Sopenharmony_cithe calculated ecc. If all bytes are 0 there is no problem. If 11 bits
73262306a36Sopenharmony_ciare 1 we have one correctable bit error. If there is 1 bit 1, we have an
73362306a36Sopenharmony_cierror in the given ecc code.
73462306a36Sopenharmony_ci
73562306a36Sopenharmony_ciIt proved to be fastest to do some table lookups. Performance gain
73662306a36Sopenharmony_ciintroduced by this is about a factor 2 on my system when a repair had to
73762306a36Sopenharmony_cibe done, and 1% or so if no repair had to be done.
73862306a36Sopenharmony_ci
73962306a36Sopenharmony_ciCode size increased from 330 bytes to 686 bytes for this function.
74062306a36Sopenharmony_ci(gcc 4.2, -O3)
74162306a36Sopenharmony_ci
74262306a36Sopenharmony_ci
74362306a36Sopenharmony_ciConclusion
74462306a36Sopenharmony_ci==========
74562306a36Sopenharmony_ci
74662306a36Sopenharmony_ciThe gain when calculating the ecc is tremendous. Om my development hardware
74762306a36Sopenharmony_cia speedup of a factor of 18 for ecc calculation was achieved. On a test on an
74862306a36Sopenharmony_ciembedded system with a MIPS core a factor 7 was obtained.
74962306a36Sopenharmony_ci
75062306a36Sopenharmony_ciOn a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
75162306a36Sopenharmony_ci5 (big endian mode, gcc 4.1.2, -O3)
75262306a36Sopenharmony_ci
75362306a36Sopenharmony_ciFor correction not much gain could be obtained (as bitflips are rare). Then
75462306a36Sopenharmony_ciagain there are also much less cycles spent there.
75562306a36Sopenharmony_ci
75662306a36Sopenharmony_ciIt seems there is not much more gain possible in this, at least when
75762306a36Sopenharmony_ciprogrammed in C. Of course it might be possible to squeeze something more
75862306a36Sopenharmony_ciout of it with an assembler program, but due to pipeline behaviour etc
75962306a36Sopenharmony_cithis is very tricky (at least for intel hw).
76062306a36Sopenharmony_ci
76162306a36Sopenharmony_ciAuthor: Frans Meulenbroeks
76262306a36Sopenharmony_ci
76362306a36Sopenharmony_ciCopyright (C) 2008 Koninklijke Philips Electronics NV.
764