83 lines
2.7 KiB
Plaintext
83 lines
2.7 KiB
Plaintext
If you have a Pentium 4 or clone processor that supports the SSE2
|
|
extensions, then using these instructions can be faster.
|
|
|
|
The file sse2.mcs is provided as a plug-in alternative for ms86.mcs, and
|
|
gccsse2.mcs is provided as an alternative for gcc386.mcs
|
|
|
|
Using the COMBA or KCM methods and these provided macros, PCs will execute
|
|
big number code up to 60% faster. Ideal for a Pentium 4 based Crypto server.
|
|
See kcmcomba.txt
|
|
|
|
It is the programmers responsibility to ensure that their hardware and their
|
|
compiler supports SSE2 extensions.
|
|
|
|
Tested with latest Microsoft (use sse2.mcs) and GCC compilers (V3.3 or greater
|
|
- use gccsse2.mcs)
|
|
|
|
The key instruction is PMULUDQ which multiplies two pairs of 32-bit numbers in
|
|
a single instruction. Unfortunately trying to exploit this capability is very
|
|
difficult. But even just using it for a single multiplication is faster than
|
|
the standard x386 MUL instruction. However SSE2 instructions do not support a
|
|
carry flag :(. But the PADDQ instruction adds 64-bit numbers.
|
|
|
|
Consider the following trick:-
|
|
|
|
The 64-bit result of a PMULUDQ is written to a 128-bit SSE2 register thus
|
|
|
|
< 32 bits >
|
|
|
|
+--------+---------+----------+-----------+
|
|
| | | | |
|
|
|00000000|000000000| Hi | Lo |
|
|
| | | | |
|
|
+--------+---------+----------+-----------+
|
|
|
|
<---------------- 128 bits --------------->
|
|
|
|
|
|
Now shuffle this (using PSHUFD) so it becomes
|
|
|
|
|
|
+--------+---------+----------+-----------+
|
|
| | | | |
|
|
|00000000| Hi |0000000000| Lo |
|
|
| | | | |
|
|
+--------+---------+----------+-----------+
|
|
|
|
|
|
Now accumulate (by simple addition) partial products like these
|
|
(see makemcs.txt) in an SSE2 register, using the PADDQ instruction
|
|
|
|
+--------+---------+----------+-----------+
|
|
| | | | |
|
|
|00000CHi| SumHi |0000000CLo| SumLo |
|
|
| | | | |
|
|
+--------+---------+----------+-----------+
|
|
|
|
|
|
where CHi and CLo are accumulated carries from each half
|
|
|
|
At the bottom of each column of partial products, the sum for the column is
|
|
SumLo, and the Carry for the next column is the sum of
|
|
|
|
|
|
+--------+---------+----------+-----------+
|
|
| | | | |
|
|
| 0 | 0 |0000000CHi| SumHi |
|
|
| | | | |
|
|
+--------+---------+----------+-----------+
|
|
|
|
and
|
|
|
|
|
|
+--------+---------+----------+-----------+
|
|
| | | | |
|
|
| 0 | 0 | 0 |00000000Clo|
|
|
| | | | |
|
|
+--------+---------+----------+-----------+
|
|
|
|
|
|
This can easily be achieved using the available shift instructions and PADDQ.
|
|
|
|
|