This file offers some guidance for developers who need to create assembly
language macros for an unsupported processor to support the Comba or KCM
methods for modular multiplication.

Note that the "standard" C build of MIRACL may be fast enough, or else the
provided C macros may be sufficient. Use c.mcs or c1.mcs, but if mr_qltype
is defined in mirdef.h, use c3.mcs.

An .mcs file contains C or assembly language macros to implement some
simple operations on big number digits.

On most modern Load/Store RISC processors the optimal assembly language
required is fortunately quite generic, and that's why this approach is
possible. The instruction set need only support unsigned multiply, add and
subtract. It should also support standard indexed addressing modes, where
the effective memory address is calculated by adding an offset to a
register. The compiler should ideally support in-line assembly.

Correct memory address offsets and label names are automatically inserted
by the mex utility. It uses the standard C function "fprintf" to expand the
macros, and hence inserts appropriate numbers wherever %d appears in the
macro. This implies that if the symbol % is part of the assembly language
syntax, it must be written as %%.

These macros, when extracted by the mex.c program from the specified .mcs
file, are used to implement some fast algorithms.

MULTIPLY

                  a2     a1     a0
                  b2     b1     b0
                  -----------------
                  a2.b0  a1.b0  a0.b0
           a2.b1  a1.b1  a0.b1
    a2.b2  a1.b2  a0.b2
    -------------------------------
    c4     c3     c2     c1     c0

Here (a1.b0) is a partial product. As we add up each column, the result is
accumulated in a triple-precision register x|y|z. When the column is
totalled, z is written to memory and 0|x|y becomes the "carry" to the next
column. Note that x will always be "small", less than the number of partial
products in the longest column.

The algorithm for fast multi-precision multiplication multiplies a*b=c and
requires the following macros.

MUL_START
Push any registers if necessary, initialise pointers to a, b and c, and
zero the triple-precision register required to accumulate the partial
products.

STEP
Calculate a partial product and add it to the triple register. The array
offsets from the pointers to a and b are inserted automatically by the mex
utility.

MFIN
Store total for this column z, and calculate the carry for the next.

MUL_END
Store the left-over carry in the left-most column, and pop registers if
necessary.

MULTUP

Here we just need the first (lower) half of c=a*b, so we use the same
MUL_START, STEP and MFIN macros as above. However no carry is needed from
the last column, so a simpler LAST macro is employed here.

SQUARE

Multiprecision squaring can be done nearly twice as fast, as partial
products appear twice in most columns, but only need to be calculated once.
Observe that if a=b in the above example then a2.b0 = a0.b2 in the third
column. This algorithm finds c=a*a.

SQR_START
Push registers, form pointers to a and c, zeroise triple register.

DSTEP
Calculate a partial product and add it twice to the column total in the
triple register.

SELF
Calculate a "diagonal element", e.g. (a1.a1), and add to total.

SFIN
Store total for this column z, and calculate the carry for the next.

SQR_END
Store the left-over carry in the left-most column, and pop registers if
necessary.

REDC

Montgomery's modular reduction algorithm calculates a%=b. It also accesses
the pre-computed variable "ndash".

REDC_START
Push registers, get pointers to a and b, and put ndash in a register.
Initialise the triple register x|y|z to 0|0|a[0].

RFINU
Multiply ndash*z and store lower half of result in a[i] for i-th column.
Multiply this number now by b[0]. Add the result to the triple register
x|y|z. Set the triple register to reflect the "carry" to the next column
0|x|y. Add a[i+1] to the triple register.

RFIND
Store z into a[i]. Set triple register to reflect the "carry" to the next
column. Add a[i+1] to the triple register.

REDC_END
Store the left-over carry in the left-most two columns.

This algorithm also uses the STEP macro as described above.

ADDITION

This algorithm does a simple element-by-element addition c=a+b,
propagating the carries.

ADD_START
Gets pointers to a, b and c. Adds the first two elements c[0]=a[0]+b[0].

ADD
Adds-with-carry c[i]=a[i]+b[i].

ADD_END
If the carry flag is still set, set the variable "carry"=1, otherwise =0.

INCREMENT

Very similar to the above, but this time calculates a+=b.

INC_START
Gets pointers to a and b. Adds the first elements a[0]+=b[0].

INC
Adds-with-carry a[i]+=b[i].

INC_END
If the carry flag is still set, set the variable "carry"=1, otherwise =0.

SUBTRACTION

Find c=a-b, propagating borrows.

SUB_START
Gets pointers to a, b and c. Subtracts the first two elements
c[0]=a[0]-b[0].

SUB
Subtracts-with-borrow c[i]=a[i]-b[i].

SUB_END
If there is a "borrow" outstanding, set the variable "carry"=1, otherwise
=0.
**NOTE** This MAY be indicated by carry flag=1 OR by carry flag=0. It
depends on the architecture, so be careful.

DECREMENT

Find a-=b, propagating borrows.

DEC_START
Gets pointers to a and b. Subtracts the first two elements a[0]-=b[0].

DEC
Subtracts-with-borrow a[i]-=b[i].

DEC_END
If there is a "borrow" outstanding, set the variable "carry"=1, otherwise
=0.
**NOTE** This MAY be indicated by carry flag=1 OR by carry flag=0. It
depends on the architecture, so be careful.

SUMMATION

Adds c=a+b in a "for" loop. Each time around the loop a fixed block of
digits is added. A total of n such blocks are to be added.

KADD_START
Initialise pointers to a, b and c. Move n into a register. Set the carry
flag to zero. Provide a label for looping back to. Note that label numbers
are automatically inserted by the mex utility.

KASL
Decrement the n register. If it is zero, jump to the label at the end of
this macro. If it is not, advance the pointer registers to point to the
next block, and branch back to the label in KADD_START.
**NOTE** It is vitally important that this macro does NOT affect the carry
flag.

KADD_END
If the carry flag is still set, set the variable "carry"=1, otherwise =0.

This algorithm also uses the ADD macro. See above.

INCREMENTATION

Adds a+=b in a "for" loop. Each time around the loop a fixed block of
digits is added. A total of n such blocks are to be added.

KINC_START
Initialise pointers to a and b. Move n into a register. Set the carry flag
to zero. Provide a label for looping back to. Note that label numbers are
automatically inserted by the mex utility.

KIDL
Decrement the n register. If it is zero, jump to the label at the end of
this macro. If it is not, advance the pointer registers to point to the
next block, and branch back to the label in KINC_START.
**NOTE** It is vitally important that this macro does NOT affect the carry
flag.

KINC_END
If the carry flag is still set, set the variable "carry"=1, otherwise =0.

This algorithm also uses the INC macro. See above.

DECREMENTATION

Subtracts a-=b in a "for" loop. Each time around the loop a fixed block of
digits is subtracted. A total of n such blocks are to be subtracted.

KDEC_START
Initialise pointers to a and b. Move n into a register. Set the carry flag
to that state which indicates no borrow. Provide a label for looping back
to. Note that label numbers are automatically inserted by the mex utility.

KDEC_END
Set the variable "carry" to 1 if the carry flag is in that state which
reflects an outstanding "borrow", otherwise 0.

This algorithm also uses the KIDL and DEC macros. See above.

NEW! - April 2002

Interleaved steps can be used in multiplication to allow for improved
instruction scheduling. This could be a lot faster if the multiply unit
takes more than 1 clock cycle. The idea is to expose more ILP (Instruction
Level Parallelism) for a modern pipelined (and possibly superscalar)
load-store processor to chew on.

In the calculation of the sum of partial products in a column, replace

STEPM  - Multiply
STEPA  - Add
STEPM  - Multiply
STEPA  - Add
STEPM  - Multiply
STEPA  - Add
STEPM  - Multiply
STEPA  - Add

with the interleaved

STEP1M - Multiply 1
STEP2M - Multiply 2
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP2M - Multiply 2
STEP1A - Add 1
STEP2A - Add 2

The same applies to DSTEP for squaring.

In this way the multiply instruction gets more time to complete to its
destination registers. If the MEX program sees that the STEP1M macro is
present as well as STEP, it will permit scheduling with the -s flag as
above.

Of course this requires that there be enough registers - STEP1x and STEP2x
should ideally use different registers. If this is going to help, it is
particularly important that the destination registers of the multiply steps
STEP1M and STEP2M be distinct.

Many processors use hardware dynamic scheduling (like the Pentium II). For
such a processor scheduling the code like this will have little effect. The
ARM assembler re-orders the code automatically for optimum scheduling, so
again scheduling the code like this will have little impact.

Some example *.mcs files in this format are included - see c.mcs and
arm.mcs for example.

This idea could be extended in a fairly obvious way, for example

STEP1M - Multiply 1
STEP2M - Multiply 2
STEP3M - Multiply 3
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP2M - Multiply 2
STEP3A - Add 3
STEP3M - Multiply 3
STEP1A - Add 1
STEP1M - Multiply 1
STEP2A - Add 2
STEP3A - Add 3
STEP1A - Add 1

This is currently NOT supported; it would require a relatively simple
modification to mex.c. This might be justified in the case of a very deeply
pipelined multiplier. The idea would be that Multiply 1, 2, and 3 might all
be active simultaneously.

New features 12/10/2007

NEW: PMUL, PMUL_START and PMUL_END macros to support fast reduction for
pseudo-Mersenne prime moduli.

PMULT

When reducing modulo p=2^m-d, where m is a multiple of the word length and
d is single precision, we have the useful identity 2^m=d mod p. The product
of two field elements can be considered as 2^m.U+L, where U is the upper
half of the product, and L is the lower half. Therefore the reduced value
is 2^m.U+L mod p = d.U+L. The d.U component can in turn be considered as
2^m.x+Y, where x will be single precision. So the final result is d.x+Y+L.
Finally a couple of subtractions of p might be required to get a result
less than p. The function PMULT calculates d.x and Y.

PMUL_START
Initialises pointers to the top half of the product a[], and arrays to hold
d.x (b[]) and Y (c[]). Moves d to a register.

PMUL
Sets c[i]=a[i]*d and propagates carries. Sets b[i]=0.

PMUL_END
Sets b[0] (and b[1]) to d.x.

NEW: Macros to support the Hybrid method - see
http://eprint.iacr.org/2007/299

Basically a 2x2 or 4x4 block of partial products is calculated together,
which reduces memory accesses. This uses more registers.
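
The pseudo-Mersenne reduction described under PMULT can be sketched in
portable C. This is only an illustration of the arithmetic (d.U = 2^m.x+Y,
result = d.x+Y+L), not the code that mex generates and not part of MIRACL.
The names pm_mul_digit, pm_add_small, pm_reduce and PM_MAX_DIGITS are
hypothetical, and the sketch assumes 32-bit digits, at least two digits per
operand, and a 64-bit integer type for intermediate results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PM_MAX_DIGITS 8  /* hypothetical upper bound on digits per operand */

/* y[] = low n digits of a[]*d; returns the overflow digit x,
   so that a*d = 2^(32n).x + Y. This is the job described for PMULT. */
uint32_t pm_mul_digit(const uint32_t *a, uint32_t d, uint32_t *y, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * d + carry;  /* fits in 64 bits */
        y[i] = (uint32_t)t;
        carry = t >> 32;
    }
    return (uint32_t)carry;
}

/* r[] += v; returns the carry out of the top digit (weight 2^(32n)). */
uint64_t pm_add_small(uint32_t *r, uint64_t v, int n)
{
    for (int i = 0; i < n; i++) {
        uint64_t t = (uint64_t)r[i] + (uint32_t)v;
        r[i] = (uint32_t)t;
        v = (v >> 32) + (t >> 32);
    }
    return v;
}

/* r[] = prod[] mod p, where prod[] has 2n digits and p = 2^(32n) - d. */
void pm_reduce(const uint32_t *prod, uint32_t d, uint32_t *r, int n)
{
    uint32_t y[PM_MAX_DIGITS], t[PM_MAX_DIGITS];
    uint64_t carry = 0, top;

    assert(n >= 2 && n <= PM_MAX_DIGITS);  /* assumption of this sketch */

    uint32_t x = pm_mul_digit(prod + n, d, y, n);  /* d.U = 2^m.x + Y */

    for (int i = 0; i < n; i++) {                  /* r = Y + L */
        carry += (uint64_t)y[i] + prod[i];
        r[i] = (uint32_t)carry;
        carry >>= 32;
    }
    top = carry + pm_add_small(r, (uint64_t)d * x, n);  /* r += d.x */

    while (top != 0)               /* each overflow of 2^m folds back as d */
        top = pm_add_small(r, top * d, n);

    /* Final correction: r >= p exactly when r + d carries out of 2^m,
       and in that case r + d - 2^m equals r - p. */
    memcpy(t, r, (size_t)n * sizeof r[0]);
    if (pm_add_small(t, d, n) != 0)
        memcpy(r, t, (size_t)n * sizeof r[0]);
}
```

For example, with n=2 and d=59 (so p=2^64-59), passing the four-digit
square of p-1 to pm_reduce should leave r[] equal to 1. The assembly macros
split this work differently (PMUL_START, PMUL and PMUL_END produce d.x and Y
only), so the additions and the final correction shown here would be done
elsewhere.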