397 lines
12 KiB
Plaintext
397 lines
12 KiB
Plaintext
This file offers some guidance for developers who need to create assembly
|
|
language Macros for an unsupported processor to support the Comba or KCM
|
|
methods for modular multiplication. Note that the "standard" C build of MIRACL
|
|
may be fast enough, or else the provided C macros may be sufficient. Use c.mcs
|
|
or c1.mcs, but if mr_qltype is defined in mirdef.h, use c3.mcs
|
|
|
|
An .mcs file contains C or assembly language macros to implement some simple
|
|
operations on big number digits. On most modern Load/Store RISC processors
|
|
the optimal assembly language required is fortunately quite generic - and
|
|
that's why this approach is possible. The Instruction set need only support
|
|
unsigned multiply, add and subtract. It should also support standard indexed
|
|
addressing modes, where the effective memory address is calculated by adding
|
|
an offset to a register. The compiler should ideally support in-line assembly.
|
|
|
|
Correct memory address offsets and label names are automatically inserted by
|
|
the mex utility. It uses the standard C function "fprintf" to expand the
|
|
macros, and hence inserts appropriate numbers where-ever %d appears in the
|
|
macro. This implies that if the symbol % should be part of assembly language
|
|
syntax, it must be replaced by %%.
|
|
|
|
These macros when extracted by the mex.c program from the specified .mcs
|
|
file are used to implement some fast algorithms.
|
|
|
|
MULTIPLY
|
|
|
|
a2 a1 a0
|
|
b2 b1 b0
|
|
--------------
|
|
a2.b0 a1.b0 a0.b0
|
|
a2.b1 a1.b1 a0.b1
|
|
a2.b2 a1.b2 a0.b2
|
|
------------------------------
|
|
c4 c3 c2 c1 c0
|
|
|
|
|
|
Here (a1.b0) is a partial product. As we add up each column, the result is
|
|
accumulated in a triple-precision register x|y|z. When the column is totalled
|
|
z is written to memory and 0|x|y becomes the "carry" to the next column. Note
|
|
that x will always be "small", less than the number of partial products in
|
|
the longest column.
|
|
|
|
The algorithm for fast multi-precision multiplication multiplies a*b=c and
|
|
requires the following macros.
|
|
|
|
MUL_START
|
|
|
|
Push any registers if necessary, Initialise pointers to a, b and c,
|
|
and zero the triple-precision register required to accumulate the
|
|
partial products.
|
|
|
|
STEP
|
|
|
|
Calculate a partial product and add it to the triple-register.
|
|
The array offsets from the pointers to a and b are inserted
|
|
automatically by the mex utility.
|
|
|
|
MFIN
|
|
|
|
Store total for this column z, and calculate the carry for the next.
|
|
|
|
MUL_END
|
|
|
|
Store the left over carry in the left-most column, and pop registers
|
|
if necessary.
|
|
|
|
|
|
MULTUP
|
|
|
|
Here we just need the first (lower) half of c=a*b; So we use the same
|
|
MUL_START, STEP, and MFIN macros as above. However no carry is needed from
|
|
the last column, so a simpler LAST macro is employed here.
|
|
|
|
SQUARE
|
|
|
|
Multiprecision squaring can be done nearly twice as fast, as partial products
|
|
appear twice in most columns, but only need to be calculated once. Observe
|
|
that if a=b in the above example then a2.b0 = a0.b2 in the third column.
|
|
|
|
This algorithm finds c=a*a
|
|
|
|
SQR_START
|
|
|
|
Push registers, form pointers to a and c, zeroise triple register
|
|
|
|
DSTEP
|
|
|
|
Calculate a partial product and add it twice to column total in
|
|
the triple register.
|
|
|
|
SELF
|
|
|
|
Calculate a "diagonal element", e.g. (a1.a1) and add to total.
|
|
|
|
SFIN
|
|
|
|
Store total for this column z, and calculate the carry for the next
|
|
|
|
SQR_END
|
|
|
|
Store the left over carry in the left-most column, and pop registers
|
|
if necessary.
|
|
|
|
REDC
|
|
|
|
Montgomery's modular reduction algorithm calculates a%=b; It also accesses
|
|
the pre-computed variable "ndash".
|
|
|
|
REDC_START
|
|
|
|
Push registers, get pointers to a and b, and put ndash in a register.
|
|
Initialise the triple register x|y|z to 0|0|a[0]
|
|
|
|
RFINU
|
|
|
|
Multiply ndash*z and store lower half of result in a[i] for i-th
|
|
column. Multiply this number now by b[0]. Add the result to the
|
|
triple register x|y|z. Set the triple register to reflect the
|
|
"carry" to the next column 0|x|y. Add a[i+1] to the triple register.
|
|
|
|
RFIND
|
|
|
|
Store z into a[i]. Set triple register to reflect the "carry"
|
|
to the next column. Add a[i+1] to the triple register.
|
|
|
|
REDC_END
|
|
|
|
Store the left-over carry in the left-most two columns.
|
|
|
|
This algorithm also uses the STEP macro as described above.
|
|
|
|
ADDITION
|
|
|
|
This algorithm does a simple element-by-element adition c=a+b,
|
|
propagating the carries.
|
|
|
|
ADD_START
|
|
|
|
Gets pointers to a, b and c. Adds the first two elements
|
|
c[0]=a[0]+b[0]
|
|
|
|
ADD
|
|
|
|
Adds-with-carry c[i]=a[i]+b[i]
|
|
|
|
ADD_END
|
|
|
|
If the carry flag is still set, set the variable "carry"=1,
|
|
otherwise =0
|
|
|
|
INCREMENT
|
|
|
|
Very similar to the above, but this time calculates a+=b
|
|
|
|
INC_START
|
|
|
|
Gets pointers to a and b. Adds the first elements a[0]+=b[0]
|
|
|
|
INC
|
|
|
|
Adds-with-carry a[i]+=b[i]
|
|
|
|
INC_END
|
|
|
|
If the carry flag is still set, set the variable "carry=1",
|
|
otherwise 0
|
|
|
|
SUBTRACTION
|
|
|
|
Find c=a-b, propagating borrows
|
|
|
|
SUB_START
|
|
|
|
Gets pointers to a, b and c. Subtracts the first two elements
|
|
c[0]=a[0]-b[0]
|
|
|
|
SUB
|
|
|
|
Subtracts-with-borrow c[i]=a[i]-b[i]
|
|
|
|
SUB_END
|
|
|
|
If there is a "borrow" outstanding, set the variable "carry"=1,
|
|
otherwise =0. **NOTE** This MAY be indicated by carry flag=1 OR
|
|
by carry flag=0 - it depends on architecture, so be careful.
|
|
|
|
DECREMENT
|
|
|
|
Find a-=b, propagating borrows
|
|
|
|
DEC_START
|
|
|
|
Gets pointers to a and b. Subtracts the first two elements
|
|
a[0]-=b[0]
|
|
|
|
DEC
|
|
|
|
Subtracts-with-borrow a[i]-=b[i]
|
|
|
|
DEC_END
|
|
|
|
If there is a "borrow" outstanding, set the variable "carry"=1,
|
|
otherwise =0. **NOTE** This MAY be indicated by carry flag=1 OR
|
|
by carry flag=0 - it depends on architecture, so be careful.
|
|
|
|
SUMMATION
|
|
|
|
Adds c=a+b in a "for" loop. Each time around the loop a fixed block of
|
|
digits are added. A total of n such blocks are to be added.
|
|
|
|
KADD_START
|
|
|
|
Initialise pointers to a, b and c. Move n into a register. Set
|
|
the carry flag to zero. Provide a label for looping back to. Note
|
|
that label numbers are automatically inserted by the mex utility.
|
|
|
|
KASL
|
|
|
|
Decrement the n register. If its zero jump to label at the end of
|
|
this macro. If its not, advance the pointer registers to point to
|
|
the next block, and branch back to label in KADD_START.
|
|
**NOTE** It is vitally important that this macro does NOT affect
|
|
the carry flag
|
|
|
|
KADD_END
|
|
|
|
If the carry flag is still set, set the variable "carry=1",
|
|
otherwise 0
|
|
|
|
This algorithm also uses the ADD macro. See above
|
|
|
|
|
|
INCREMENTATION
|
|
|
|
Adds a+=b in a "for" loop. Each time around the loop a fixed block of
|
|
digits are added. A total of n such blocks are to be added.
|
|
|
|
KINC_START
|
|
|
|
Initialise pointers to a and b. Move n into a register. Set
|
|
the carry flag to zero. Provide a label for looping back to. Note
|
|
that label numbers are automatically inserted by the mex utility.
|
|
|
|
KIDL
|
|
|
|
Decrement the n register. If its zero jump to label at the end of
|
|
this macro. If its not, advance the pointer registers to point to
|
|
the next block, and branch back to label in KINC_START.
|
|
**NOTE** It is vitally important that this macro does NOT affect
|
|
the carry flag
|
|
|
|
KINC_END
|
|
|
|
If the carry flag is still set, set the variable "carry=1",
|
|
otherwise 0
|
|
|
|
This algorithm also uses the INC macro. See above
|
|
|
|
|
|
DECREMENTATION
|
|
|
|
Subtracts a-=b in a "for" loop. Each time around the loop a fixed block of
|
|
digits are subtracted. A total of n such blocks are to be subtracted.
|
|
|
|
KDEC_START
|
|
|
|
Initialise pointers to a and b. Move n into a register. Set
|
|
the carry flag to that state which indicates no borrow. Provide a
|
|
label for looping back to. Note that label numbers are automatically
|
|
inserted by the mex utility.
|
|
|
|
KDEC_END
|
|
|
|
Set the variable "carry" to 1 if the carry flag is in that state
|
|
which reflects an outstanding "borrow", otherwise 0
|
|
|
|
This algorithm also uses the KIDL and DEC macros. See above.
|
|
|
|
|
|
NEW! - April 2002
|
|
|
|
Interleaved steps can be used in multiplication to allow for improved
|
|
instruction scheduling. This could be a lot faster if the multiply unit takes
|
|
more than 1 clock cycle. The idea is to expose more ILP (Instruction Level
|
|
Parallelism) for a modern pipelined (and possibly super-scaler) load-store
|
|
processor to chew on.
|
|
|
|
In the calculation of the sum of partial products in a column, replace
|
|
|
|
STEPM - Multiply
|
|
STEPA - Add
|
|
STEPM - Multiply
|
|
STEPA - Add
|
|
STEPM - Multiply
|
|
STEPA - Add
|
|
STEPM - Multiply
|
|
STEPA - Add
|
|
|
|
with the interleaved
|
|
|
|
STEP1M - Multiply 1
|
|
|
|
STEP2M - Multiply 2
|
|
STEP1A - Add 1
|
|
STEP1M - Multiply 1
|
|
STEP2A - Add 2
|
|
STEP2M - Multiply 2
|
|
STEP1A - Add 1
|
|
|
|
STEP2A - Add 2
|
|
|
|
The same applies to DSTEP for squaring.
|
|
|
|
In this way the multiply instruction gets more time to complete to its
|
|
destination registers.
|
|
|
|
If the MEX program sees that STEP1M macro is present as well as STEP,
|
|
it will permit scheduling with the -s flag as above. Of course this requires
|
|
that there to be enough registers - STEP1x and STEP2x should ideally use
|
|
different registers.
|
|
|
|
If this is going to help it is particularly important that the destination
|
|
registers of the multiply steps STEP1M and STEP2M be distinct.
|
|
|
|
Many processors use hardware dynamic scheduling (like the Pentium II).
|
|
For such a processor scheduling the code like this will have little
|
|
effect. The ARM assembler re-orders the code automatically for
|
|
optimum scheduling, so again scheduling the code like this will have little
|
|
impact.
|
|
|
|
Some example *.mcs files in this format are included - see c.mcs and
|
|
arm.mcs for example.
|
|
|
|
This idea could be extended in a fairly obvious way, for example
|
|
|
|
STEP1M - Multiply 1
|
|
STEP2M - Multiply 2
|
|
STEP3M - Multiply 3
|
|
|
|
STEP1A - Add 1
|
|
STEP1M - Multiply 1
|
|
STEP2A - Add 2
|
|
STEP2M - Multiply 2
|
|
STEP3A - Add 3
|
|
STEP3M - Multiply 3
|
|
STEP1A - Add 1
|
|
STEP1M - Multiply 1
|
|
|
|
STEP2A - Add 2
|
|
STEP3A - Add 3
|
|
STEP1A - Add 1
|
|
|
|
This is currently NOT supported, it would require a relatively simple
|
|
modification to mex.c
|
|
|
|
This might be justified in the case of a very deeply pipelined multiplier.
|
|
The idea would be that Multiply 1, 2, and 3 might all be active
|
|
simultaneously.
|
|
|
|
|
|
|
|
New features 12/10/2007
|
|
|
|
NEW: PMUL, PMUL_START and PMUL_END macros to support fast reduction for
|
|
pseudo mersenne prime moduli.
|
|
|
|
|
|
PMULT
|
|
|
|
When reducing modulo p=2^m-d, where m is a multiple of the word length and
|
|
d is single precision, we have the useful identity 2^m=d mod p. The product
|
|
of two field elements can be considered as 2^m.U+L, where U is the upper
|
|
half of the product, and L is the lower half. Therefore the reduced value
|
|
is 2^m.U+L mod p = dU+L. The dU component can in turn be considered as
|
|
2^m.x+Y, where x will be single precision. So the final result is d.x+Y+L.
|
|
Finally a couple of subtractions of p might be required to get a result
|
|
less than p. The function PMULT calculates d.x and Y
|
|
|
|
PMUL_START
|
|
|
|
Initialises pointers to the top half of the product a[], and arrays
|
|
to hold d.x (b[]) and Y (c[]). Moves d to a register.
|
|
|
|
PMUL
|
|
|
|
Sets c[i]=a[i]*d and propagates carries. Sets b[i]=0
|
|
|
|
PMUL_END
|
|
|
|
Sets b[0] (and b[1]) to d.x
|
|
|
|
NEW: Macros to support Hybrid method - see http://eprint.iacr.org/2007/299
|
|
Basically a 2x2 or 4x4 block of partial products are calculated together
|
|
which reduces memory accesses. Uses more registers.
|
|
|
|
|
|
|