In embedded applications on low-powered processors, performance is a big issue. Using either the KCM or Comba methods as described here can increase speeds 4-fold.

To use the super-fast KCM (for 2048-bit RSA, 1024-bit DH and DSS) and Comba (for 1024-bit RSA and GF(p) elliptic curves) methods you will need to create the file mrkcm.c or the file mrcomba.c for inclusion in the MIRACL library. This is done by inserting 'macros' from a ?.mcs file into the template file mrkcm.tpl or mrcomba.tpl, a step that the MEX utility performs automatically.

A c.mcs file is supplied, which contains C macros. Also supplied is c1.mcs, which uses an alternate approach. See also cs.mcs (read the comments at the top). If a quad-length type is available (mr_qltype defined in mirdef.h), use c2.mcs.

However the best performance is usually achieved by using assembly language macros. This requires your compiler to support in-line assembly. For example the file ms86.mcs inserts Pentium assembly language macros for use with Microsoft or Borland compilers. The file gcc386.mcs does the same for the gcc compiler. If your PC supports SSE2 extensions, for example if it is a Pentium 4, then instead use either sse2.mcs or gccsse2.mcs (see sse2.txt). The file arm.mcs does the same for the popular 32-bit ARM processor. Other .mcs files for other processors/compilers may be available. See makemcs.txt for instructions for creating your own.

New! The files c.mcs and arm.mcs now allow "interleaved" multiplication steps to facilitate improved processor scheduling - see makemcs.txt.

The macro expansion is carried out by the supplied program mex.c. You must compile and run this program. If you use the config.c utility it will advise you on the parameters to use. Note that although config.c should be compiled and run on the target processor, mex.c can be compiled and run on any workstation.

For example

    c:>mex 6 ms86 mrcomba

creates a file mrcomba.c from mrcomba.tpl and ms86.mcs.
The Comba method will then be optimised for a modulus of 6*32 = 192 bits on a Pentium computer. Typically this might be used for an implementation of elliptic curves over GF(p) for p a 192-bit prime.

Note that the code generated in mrcomba.c or mrkcm.c may benefit to a small extent from some manual post-optimisation. Re-ordering instructions may help for certain processors.

Similarly

    c:>mex 16 ms86 mrkcm

creates a file mrkcm.c from mrkcm.tpl and ms86.mcs. The KCM method will then be optimised for moduli of sizes 512 (= 16*32), 1024, 2048 bits etc. Typically this might be used for a fast implementation of RSA, DSS or Diffie-Hellman.

For the Comba method only, it is possible to implement special modular reduction methods for a modulus p of a particular form. Two types of special modulus are supported: Generalised Mersenne Primes and Pseudo-Mersenne Primes. To make use of this feature MR_SPECIAL must be defined in mirdef.h.

Generalised Mersenne Primes are also known as Solinas primes. These are of a form like, for example, 2^224-2^96+1. Note that the exponents are multiples of a 32-bit word length. Many of the NIST recommended primes are of this form. In the file mrcomba.tpl code can be found to implement fast reduction with respect to many different GM primes, and for many different word lengths. If the particular one you want is not there, it is not hard to implement it yourself by manually editing the file mrcomba.tpl.

Pseudo-Mersenne Primes are also known as Crandall primes. These are of a form like, for example, 2^160-57, where 160 is a multiple of the word length, and the constant 57 is small enough to fit into one computer word. Moduli of this form are automatically supported if you define MR_PSEUDO_MERSENNE in mirdef.h.

As always it is best to use config.c, which guides you through all of this.

You will find it valuable to run through this whole process on a standard PC using perhaps the Microsoft C/C++ compiler, just to get familiar with the config.c and mex.c utilities.
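
To illustrate why a modulus of the pseudo-Mersenne form 2^k-c reduces so cheaply, here is a minimal single-word sketch in portable C. It is not taken from mrcomba.tpl: the modulus 2^32-5, the macro PM_C and the function name reduce_pm are all assumptions chosen so that the whole example fits in 64-bit arithmetic. The idea is only that 2^32 is congruent to c modulo 2^32-c, so the high word can be folded into the low word with one small multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative modulus p = 2^32 - PM_C (hypothetical, not a MIRACL setting). */
#define PM_C 5u

/* Reduce a 64-bit value modulo p = 2^32 - PM_C without any division.
   Writing x = hi*2^32 + lo and using 2^32 = PM_C (mod p) gives
   x = hi*PM_C + lo (mod p). */
static uint32_t reduce_pm(uint64_t x)
{
    const uint64_t p = ((uint64_t)1 << 32) - PM_C;

    /* First fold: the result is below 6*2^32, so it still fits in 64 bits. */
    x = (x >> 32) * PM_C + (x & 0xFFFFFFFFu);

    /* Second fold: the high part is now at most 5, so x <= 2^32 + 24. */
    x = (x >> 32) * PM_C + (x & 0xFFFFFFFFu);

    /* At most one final subtraction is needed. */
    if (x >= p)
        x -= p;

    assert(x < p);  /* invariant: fully reduced */
    return (uint32_t)x;
}
```

For a multi-word modulus such as 2^160-57 the same fold applies to the high words of a double-length product, each being multiplied by the one-word constant 57, which is why the constant must fit into a single computer word.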
If you are embarking on an embedded project using a processor for which a .mcs file does not exist, you will have to write your own, or be content with the C macros.

Note that this approach is likely to be optimal only on processors that support an unsigned multiply instruction. This is probably the case with the majority of embedded processors (e.g. ARM, 68000 variants etc). It is also important that the compiler support inline assembly, via something like asm(" "); or __asm { } constructs in C. However other approaches are possible, for example using C "intrinsics" works well for the Itanium - see itanium.mcs.

To write your own .mcs file, use c.mcs or arm.mcs as models. For more background read the article ftp.computing.dcu.ie/pub/crypto/timings.doc

The macro-expansion mechanism has been designed to make it as easy as possible for the developer to write optimal code for best performance. See makemcs.txt
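
The remark about the unsigned multiply instruction can be made concrete with a sketch of Comba (column-wise) multiplication in portable C. This is not the content of c.mcs or of a generated mrcomba.c; the function name comba_mul, the looped (rather than unrolled) structure and the use of uint64_t for the double-length product are assumptions made for illustration. Each inner step is one unsigned 32x32 multiply followed by an add-with-carry into a three-word accumulator, which is the kind of step an assembly language macro would implement directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* z[0..2n-1] = x[0..n-1] * y[0..n-1], words stored least significant first.
   z must not overlap x or y, and n must be at least 1. */
static void comba_mul(uint32_t *z, const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t c0 = 0, c1 = 0, c2 = 0;  /* three-word column accumulator */

    assert(n >= 1);

    for (size_t k = 0; k < 2 * n - 1; k++) {
        /* Column k collects every partial product x[i]*y[j] with i + j == k. */
        size_t lo = (k < n) ? 0 : k - n + 1;
        size_t hi = (k < n) ? k : n - 1;

        for (size_t i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)x[i] * y[k - i];  /* unsigned 32x32 -> 64 */
            uint64_t s = (uint64_t)c0 + (uint32_t)p; /* add low word */
            c0 = (uint32_t)s;
            s = (uint64_t)c1 + (uint32_t)(p >> 32) + (s >> 32); /* high word + carry */
            c1 = (uint32_t)s;
            c2 += (uint32_t)(s >> 32);               /* carry into the top word */
        }

        z[k] = c0;   /* the column is complete: emit one result word */
        c0 = c1;     /* shift the accumulator down by one word */
        c1 = c2;
        c2 = 0;
    }
    z[2 * n - 1] = c0;  /* the final carry is the top word of the product */
}
```

Calling comba_mul with n = 6 would multiply two 192-bit numbers, matching the 6*32 case used in the mex example. On a processor without a suitable unsigned multiply, the product in the inner loop has to be synthesised from smaller steps, which is where the advantage of the method is likely to be lost.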