
NAND Error-correction Code
==========================
After that the speed was increased by 35-40%.

I decided to annotate my steps in this file. Perhaps it is useful to someone,
or someone may learn something from it.
This is done by means of a Hamming code. I'll try to explain it in
layman's terms (and apologies to all the pros in the field in case I do
not use the right terminology).
The ecc is calculated by computing several parity bits over the rows and
columns. The parity used is even parity, which means that the parity bit = 1
if the number of 1 bits in the data over which the parity is calculated is
odd, and the parity bit = 0 if that number is even. So the total number of
1 bits (data plus parity bit) is always even. A quick way to calculate the
parity is the exclusive or operation, sometimes also referred to as xor.
In C the operator for xor is ^.
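
For illustration, here is a small sketch (mine, not the driver code) of how
the parity bit of a byte can be computed by folding it with xor::

    /* even parity bit of one byte: 1 if the number of 1 bits is odd */
    unsigned char byte_parity(unsigned char b)
    {
        b ^= (b >> 4);   /* fold the high nibble onto the low one */
        b ^= (b >> 2);
        b ^= (b >> 1);
        return b & 1;    /* bit 0 now holds the parity */
    }
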
Back to ecc. As an example, the row for byte 1 in the sector layout looks
like this::

    byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
Let's start to explain column parity (a code sketch follows this list).

- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
- cp1 is the parity over bit1, bit3, bit5 and bit7.
- cp2 is the parity over bit0, bit1, bit4 and bit5.
- cp3 is the parity over bit2, bit3, bit6 and bit7.
- cp4 is the parity over bit0, bit1, bit2 and bit3.
- cp5 is the parity over bit4, bit5, bit6 and bit7.
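
Each cp is thus the parity of the sector xor-reduced to a single byte and
then masked with a fixed bit pattern. A sketch of that (my illustration;
b is the xor of all 256 data bytes, byte_parity is the fold shown earlier)::

    unsigned char b;    /* xor of all 256 bytes of the sector */
    unsigned char cp0, cp1, cp2, cp3, cp4, cp5;

    cp0 = byte_parity(b & 0x55);   /* bit0, bit2, bit4, bit6 */
    cp1 = byte_parity(b & 0xaa);   /* bit1, bit3, bit5, bit7 */
    cp2 = byte_parity(b & 0x33);   /* bit0, bit1, bit4, bit5 */
    cp3 = byte_parity(b & 0xcc);   /* bit2, bit3, bit6, bit7 */
    cp4 = byte_parity(b & 0x0f);   /* bit0, bit1, bit2, bit3 */
    cp5 = byte_parity(b & 0xf0);   /* bit4, bit5, bit6, bit7 */
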
Row parity works almost the same (a code sketch follows this list).

- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
  (so handle two bytes, then skip two bytes, etc.)
- rp3 covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
  (so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ...
- rp6 covers 8 bytes then skips 8 etc
- rp7 skips 8 bytes then covers 8 etc
- rp8 covers 16 bytes then skips 16 etc
- rp9 skips 16 bytes then covers 16 etc
- rp10 covers 32 bytes then skips 32 etc
- rp11 skips 32 bytes then covers 32 etc
- rp12 covers 64 bytes then skips 64 etc
- rp13 skips 64 bytes then covers 64 etc
- rp14 covers 128 bytes then skips 128
- rp15 skips 128 bytes then covers 128
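
The pattern behind this list: rp(2k) covers the bytes whose index has bit k
cleared, and rp(2k+1) those whose index has bit k set. A brute force sketch
of all sixteen row parities (my illustration; data[] is the 256 byte sector,
each rp is kept as a byte whose parity bit is taken later)::

    unsigned char rp[16] = { 0 };   /* rp0 .. rp15 as byte accumulators */
    int i, k;

    for (i = 0; i < 256; i++) {
        for (k = 0; k < 8; k++) {
            if (i & (1 << k))
                rp[2 * k + 1] ^= data[i];   /* bit k set: odd rp */
            else
                rp[2 * k] ^= data[i];       /* bit k clear: even rp */
        }
    }
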
======  ===== ===== ===== ===== ===== ===== ===== =====
ECC     Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
======  ===== ===== ===== ===== ===== ===== ===== =====
ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
ECC 2   cp5   cp4   cp3   cp2   cp1   cp0   1     1
======  ===== ===== ===== ===== ===== ===== ===== =====
Oh well, I'm graphically challenged, so suffer with me for a moment :-)

C does have bitwise operators but not really operators to do the above
efficiently. Obviously a trivial bit by bit implementation is possible,
but it is not going to bring me a Nobel prize :-)

Instead of handling all bits individually, let us try to rearrange things.
This leads to:

Attempt 1
=========
First we declare a lookup table holding the (even) parity of each byte
value::

    static const char parity[256] = {
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
    };
The three ecc bytes are then assembled from the individual parities
(excerpt; the full code does one table lookup and shift per bit)::

    code[0] =
        (parity[rp7] << 7) |
        ...
        (parity[rp1] << 1) |
        (parity[rp0]);
    code[1] =
        (parity[rp15] << 7) |
        ...
        (parity[rp9] << 1) |
        (parity[rp8]);
    ...
    code[0] = ~code[0];
    code[1] = ~code[1];
    code[2] = ~code[2];
Still pretty straightforward. The last three invert statements are there to
give a checksum of 0xff 0xff 0xff for an empty flash, as an empty flash has
all bytes at 0xff.

I also introduced the parity lookup. I expected this to be the fastest
way to calculate the parity, but I will investigate alternatives later on.
Analysis 1
==========
In step 1 we moved from bit-wise calculation to byte-wise calculation.
The next step is to write our code in such a way that we process data in
32 bit chunks.

For the column parity we use the par variable. When extending to 32 bits
we can in the end easily calculate rp0 and rp1 from it (because par now
consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 respectively,
from MSB to LSB).
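
A sketch of that derivation (my illustration, assuming the little endian
byte order just described, so even byte lanes sit in bits 0-7 and 16-23)::

    unsigned long par;          /* xor of all 32 bit words of the sector */
    unsigned char rp0, rp1;

    rp0 = (par & 0xff) ^ ((par >> 16) & 0xff);        /* even byte lanes */
    rp1 = ((par >> 8) & 0xff) ^ ((par >> 24) & 0xff); /* odd byte lanes */
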
Anyway, if there is an issue: this code is developed on x86; on machines
with different alignment or byte order behaviour that should be fixed to
get maximum performance.
We need to adapt the code generation to the fact that the rp variables are
now longs; the column parity calculation also needs to be changed. We'll
bring rp4 to rp15 back to single byte entities by shifting and xoring.
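
For each of those variables the fold can look like this (sketch)::

    rp4 ^= (rp4 >> 16);   /* fold the upper 16 bits onto the lower 16 */
    rp4 ^= (rp4 >> 8);    /* fold the remaining upper byte */
    rp4 &= 0xff;          /* rp4 is a single byte entity again */
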
The code generation itself stays the same as in attempt 1 (excerpt)::

    code[0] =
        ...
        (parity[rp1] << 1) |
        ...
    code[1] =
        ...
        (parity[rp9] << 1) |
        ...
    code[1] = ~code[1];
    ...
There is more to be gained.
Every byte contributes to either rp14 or rp15, while par covers all data.
This means there is no need to calculate rp14, as it can be calculated from
par and rp15 (rp14 = par ^ rp15, and similarly rp15 = par ^ rp14). That is
why some places refer to inverse parity.

Some work can also be saved by going from long to byte first. Actually we
can even avoid the table lookups ...
Very weird. Guess it has to do with caching or instruction parallelism
or so. Another observation was that this one is only 30% slower
(according to time) ...

Well, it was expected not to be easy, so maybe instead move to a
different track: let's move back to the code from attempt 2 and try
different amounts of unrolling to see what works best.
Unrolled the loop 1, 2, 3 and 4 times.

Unrolling three times gives a gain of 30% compared to attempt 2.
Unrolling four times gives a marginal improvement compared to unrolling
three times.

I decided to proceed with a four times unrolled loop anyway. It was my gut
feeling that ...
Since par = rp4 ^ rp5, it also holds that rp5 = par ^ rp4. So no need to
keep both rp4 and rp5 around. We can drop the odd rp's from the loop and
derive them afterwards.

Of course after the loop we need to correct things by adding code like::
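
    /* derive each odd rp from par and its even partner; reconstructed
       from the inverse parity relation described earlier */
    rp5 = par ^ rp4;
    rp7 = par ^ rp6;
    rp9 = par ^ rp8;
    rp11 = par ^ rp10;
    rp13 = par ^ rp12;
    rp15 = par ^ rp14;
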
Along the line I also removed the initialisation of rp0/1/2/3.
Measurements showed this was a good move. The run-time roughly halved
compared with the four times unrolled attempt 4, and we only require a
third of the processor time compared to the current code in the linux
kernel.
The code within the for loop was changed to the following (shown here as
an abridged sketch with an inner loop; the real code is fully unrolled)::
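
    /* sketch: each iteration consumes 16 longs (64 bytes), so for a
       256 byte sector i runs from 0 to 3; cur, tmppar, par and the rp
       variables are 32 bit unsigned longs as before */
    while (i < 4) {
        tmppar = 0;
        for (j = 0; j < 16; j++) {
            cur = *bp++;
            tmppar ^= cur;
            if ((j & 0x1) == 0)
                rp4 ^= cur;
            if ((j & 0x2) == 0)
                rp6 ^= cur;
            if ((j & 0x4) == 0)
                rp8 ^= cur;
            if ((j & 0x8) == 0)
                rp10 ^= cur;
        }
        par ^= tmppar;
        if ((i & 0x1) == 0)
            rp12 ^= tmppar;
        if ((i & 0x2) == 0)
            rp14 ^= tmppar;
        i++;
    }
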
As you can see, tmppar is used to accumulate the parity within a for
iteration. In the last 3 statements it is added to par and, if needed,
to rp12 and rp14.
Executing the original linux code 1 million times took about 1 second on
my system (using time to measure the performance). After this iteration I
was back to 0.075 sec. Actually I had to decide to start measuring over 10
million iterations in order not to lose too much accuracy. This one
definitely seemed to be the jackpot!
It seems more efficient to also maintain a variable rp4_6 in the while
loop, accumulating what goes into both rp4 and rp6; at the end of the loop
we then need to correct by adding::
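
    /* fold the combined accumulator back into rp4 and rp6
       (reconstructed from the description above) */
    rp4 ^= rp4_6;
    rp6 ^= rp4_6;
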
Furthermore there are 4 sequential assignments to rp8. This can be
reduced in a similar way ...
Not a big change, but every penny counts :-)
Actually this made things worse. Not very much, but I don't want to move
in the wrong direction. Maybe something to investigate later. Could
have to do with caching again.

Guess that is what there is to win within the loop. Maybe unrolling one
more time ...
Further there is still room to optimise the generation of the ecc codes.
Because of the inverse parity relation we know that rp4 = rp5 if the total
parity is 0, etc. If the parity is 1, then rp4 = !rp5, etc.
This allows generating two ecc bits at once: first place the even rp bits
in the even positions of the code byte, then copy them into the odd
positions with a shift and or, e.g.::

    code[0] |= (code[0] << 1);
I also tried all kind of other things, like having dedicated parity arrays
to avoid the shifting and masking, and like computing the parity with
shifts instead of a table lookup, e.g.::

    rp7 ^= (rp7 << 1);
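    rp7 ^= (rp7 << 2);
    rp7 ^= (rp7 << 4);
    /* completion of the idea (my sketch): after these three shifts bit 7
       of rp7 holds the parity of the original byte, already in the right
       position for the ecc byte, so no table lookup or shift is needed */
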
The code size roughly tripled (from 562 bytes to 1434 bytes). Then again,
it is not that much.
If the xor of the read and calculated ecc is all zeroes, there is no
error. If 11 of the 22 parity bits are 1, we have one correctable bit
error. If there is 1 bit 1, we have an error in the ecc data itself.

It proved to be fastest to do some table lookups. The performance gain
introduced by this is about a factor 2 on my system when a repair had to
be done, and 1% or so if no repair had to be done.
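
For the curious, a sketch of what such a lookup based repair can look like
(my sketch, not the actual kernel function; the bitcount and addressbits
tables are assumed helpers, and the real code checks the syndrome pattern
more strictly)::

    /* assumed helper tables (not from the original text):
       bitcount[b]   : number of 1 bits in byte b
       addressbits[b]: address bits encoded in syndrome pattern b */
    extern const char bitcount[256];
    extern const char addressbits[256];

    int correct_data(unsigned char *data, unsigned char *read_ecc,
                     unsigned char *calc_ecc)
    {
        int b0 = read_ecc[0] ^ calc_ecc[0];
        int b1 = read_ecc[1] ^ calc_ecc[1];
        int b2 = read_ecc[2] ^ calc_ecc[2];

        if ((b0 | b1 | b2) == 0)
            return 0;                /* no error */

        if (bitcount[b0] + bitcount[b1] + bitcount[b2] == 11) {
            /* single bit error: the rp syndrome encodes the byte
               address, the cp syndrome the bit address */
            int byte_addr = (addressbits[b1] << 4) | addressbits[b0];
            int bit_addr = addressbits[b2 >> 2];

            data[byte_addr] ^= (1 << bit_addr);
            return 1;                /* corrected */
        }

        if (bitcount[b0] + bitcount[b1] + bitcount[b2] == 1)
            return 2;                /* error in the ecc data itself */

        return -1;                   /* uncorrectable */
    }
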
Code size increased from 330 bytes to 686 bytes for this function
(gcc 4.2, -O3).
... attempt 5 (big endian mode, gcc 4.1.2, -O3) ...
I believe this is about as fast as it can get when programmed in C. Of
course it might be possible to squeeze something more out of it with an
assembler program, but due to pipeline behaviour etc. this is not
necessarily rewarding.