First up, let me say I don't like writing in assembler.  It is not portable,
is dependent on the particular CPU architecture release, and is generally a
pig to debug and get right.  Having said that, the x86 architecture is
probably the most important for speed due to the number of boxes, and since
it appears to be the worst architecture to get good C compilers for.  So due
to this, I have lowered myself to do assembler for the inner DES routines in
libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions
des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using the 'ks' key
schedules.  The only differences between the 4 functions are that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES), and that des_encrypt3() and
des_decrypt3() perform triple DES.  The triple DES routines are in here
because it does make a big difference to have them located near the
des_encrypt2 function at link time.

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as anybody else's.
The 4 main formats I know of are
Microsoft       Windows 95/Windows NT
Elf             Includes Linux and FreeBSD(?).
a.out           The older Linux.
Solaris         Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part.  I acquired my copy of Intel's
"Optimizations For Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
each of the 4 locations in the word, moving it to a new word and doing the
lookup.  The most obvious way to do this is
        xor     eax,    eax                             # clear word
        mov     al,     cl                              # get low byte
        xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in word
        mov     al,     ch                              # get next byte
        xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in word
        shr     ecx,    16
which seems ok.  For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch.  When a full register is used after a partial write, eg.
        mov     al,     cl
        xor     edi,    DWORD PTR 0x100+des_SP[eax]
the cost is
        386     - 1 cycle stall
        486     - 1 cycle stall
        586     - 0 cycle stall
        686     - at least 7 cycle stall (page 22 of the above mentioned
                  document).

So the technique that produces the best results on a pentium will, according
to the documentation, produce hideous results on a pentium pro.
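
For readers who do not read x86 assembler, the byte-at-a-time lookup can be
sketched in C.  This is only an illustration under stated assumptions, not
the code in des_enc.c: the names sp, fill_placeholder_table, lookup_bytes,
lookup_mask and check_equivalent are hypothetical, and the table holds
made-up values rather than the real DES tables.  The one detail taken from
the assembler above is the layout: each box is 64 DWORD entries, so box 1
starts at byte offset 0x100 and box 3 at 0x300.  lookup_bytes mirrors the
byte-register style (mov al, cl / mov al, ch), while lookup_mask gets the
same index from whole-register copy, shift and mask operations, which is
one way to avoid a partial register write.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder table: 8 boxes of 64 32-bit entries (0x100 bytes per box).
 * The values are made up for the sketch, NOT the real DES SP tables. */
static uint32_t sp[8][64];

static void fill_placeholder_table(void)
{
    for (int box = 0; box < 8; box++)
        for (int i = 0; i < 64; i++)
            sp[box][i] = (uint32_t)(box * 64 + i) * 2654435761u;
}

/* Byte-register style: pull out the low byte, then the next byte. */
static uint32_t lookup_bytes(uint32_t word, uint32_t acc)
{
    uint8_t lo = (uint8_t)word;         /* like: mov al, cl */
    uint8_t hi = (uint8_t)(word >> 8);  /* like: mov al, ch */

    acc ^= sp[1][lo >> 2];              /* byte offset 0x100 + (lo & 0xfc) */
    acc ^= sp[3][hi >> 2];              /* byte offset 0x300 + (hi & 0xfc) */
    return acc;
}

/* Whole-register style: copy the word, mask with 0xfc, shift for the next. */
static uint32_t lookup_mask(uint32_t word, uint32_t acc)
{
    uint32_t idx = word & 0xfc;         /* mask low byte to a DWORD offset */
    word >>= 8;                         /* line up next byte */
    acc ^= sp[1][idx >> 2];

    idx = word & 0xfc;
    acc ^= sp[3][idx >> 2];
    return acc;
}

/* Both styles must select the same table entries for any input word. */
static void check_equivalent(uint32_t word, uint32_t acc)
{
    assert(lookup_bytes(word, acc) == lookup_mask(word, acc));
}
```

Calling fill_placeholder_table() once and then check_equivalent() on a few
words confirms the two styles agree.  In C the difference is invisible; it
only matters in the instruction stream the compiler or assembler emits.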

To get around this, des686.pl will generate code that is not as fast on
a pentium but should be very good on a pentium pro.
        mov     eax,    ecx                             # copy word
        shr     ecx,    8                               # line up next byte
        and     eax,    0fch                            # mask byte
        xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in array lookup
        mov     eax,    ecx                             # get word
        shr     ecx,    8                               # line up next byte
        and     eax,    0fch                            # mask byte
        xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.

There is a third option.  Instead of using
        mov     al,     ch
which is bad on the pentium pro, one may be able to use
        movzx   eax,    ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using, but on the
pentium pro it appears it may be worthwhile.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c; it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed, this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium they have been replaced by evil instruction ordering
tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a pentium 100.