1*This is the original README for the p256-m repository. Please note that as 2only a subset of p256-m's files are present in Mbed TLS, this README may refer 3to files that are not present/relevant here.* 4 5p256-m is a minimalistic implementation of ECDH and ECDSA on NIST P-256, 6especially suited to constrained 32-bit environments. It's written in standard 7C, with optional bits of assembly for Arm Cortex-M and Cortex-A CPUs. 8 9Its design is guided by the following goals in this order: 10 111. correctness & security; 122. low code size & RAM usage; 133. runtime performance. 14 15Most cryptographic implementations care more about speed than footprint, and 16some might even risk weakening security for more speed. p256-m was written 17because I wanted to see what happened when reversing the usual emphasis. 18 19The result is a full implementation of ECDH and ECDSA in **less than 3KiB of 20code**, using **less than 768 bytes of RAM**, with comparable performance 21to existing implementations (see below) - in less than 700 LOC. 22 23_Contents of this Readme:_ 24 25- [Correctness](#correctness) 26- [Security](#security) 27- [Code size](#code-size) 28- [RAM usage](#ram-usage) 29- [Runtime performance](#runtime-performance) 30- [Comparison with other implementations](#comparison-with-other-implementations) 31- [Design overview](#design-overview) 32- [Notes about other curves](#notes-about-other-curves) 33- [Notes about other platforms](#notes-about-other-platforms) 34 35## Correctness 36 37**API design:** 38 39- The API is minimal: only 4 public functions. 40- Each public function fully validates its inputs and returns specific errors. 41- The API uses arrays of octets for all input and output. 42 43**Testing:** 44 45- p256-m is validated against multiple test vectors from various RFCs and 46 NIST. 47- In addition, crafted inputs are used for negative testing and to reach 48 corner cases. 49- Two test suites are provided: one for closed-box testing (using only the 50 public API), one for open-box testing (for unit-testing internal functions, 51and reaching more error cases by exploiting knowledge of how the RNG is used). 52- The resulting branch coverage is maximal: closed-box testing reaches all 53 branches except four; three of them are reached by open-box testing using a 54rigged RNG; the last branch could only be reached by computing a discrete log 55on P-256... See `coverage.sh`. 56- Testing also uses dynamic analysis: valgrind, ASan, MemSan, UBSan. 57 58**Code quality:** 59 60- The code is standard C99; it builds without warnings with `clang 61 -Weverything` and `gcc -Wall -Wextra -pedantic`. 62- The code is small and well documented, including internal APIs: with the 63 header file, it's less than 700 lines of code, and more lines of comments 64than of code. 65- However it _has not been reviewed_ independently so far, as this is a 66 personal project. 67 68**Short Weierstrass pitfalls:** 69 70Its has been [pointed out](https://safecurves.cr.yp.to/) that the NIST curves, 71and indeed all Short Weierstrass curves, have a number of pitfalls including 72risk for the implementation to: 73 74- "produce incorrect results for some rare curve points" - this is avoided by 75 carefully checking the validity domain of formulas used throughout the code; 76- "leak secret data when the input isn't a curve point" - this is avoided by 77 validating that points lie on the curve every time a point is deserialized. 78 79## Security 80 81In addition to the above correctness claims, p256-m has the following 82properties: 83 84- it has no branch depending (even indirectly) on secret data; 85- it has no memory access depending (even indirectly) on secret data. 86 87These properties are checked using valgrind and MemSan with the ideas 88behind [ctgrind](https://github.com/agl/ctgrind), see `consttime.sh`. 89 90In addition to avoiding branches and memory accesses depending on secret data, 91p256-m also avoid instructions (or library functions) whose execution time 92depends on the value of operands on cores of interest. Namely, it never uses 93integer division, and for multiplication by default it only uses 16x16->32 bit 94unsigned multiplication. On cores which have a constant-time 32x32->64 bit 95unsigned multiplication instruction, the symbol `MUL64_IS_CONSTANT_TIME` can 96be defined by the user at compile-time to take advantage of it in order to 97improve performance and code size. (On Cortex-M and Cortex-A cores wtih GCC or 98Clang this is not necessary, since inline assembly is used instead.) 99 100As a result, p256-m should be secure against the following classes of attackers: 101 1021. attackers who can only manipulate the input and observe the output; 1032. attackers who can also measure the total computation time of the operation; 1043. attackers who can also observe and manipulate micro-architectural features 105 such as the cache or branch predictor with arbitrary precision. 106 107However, p256-m makes no attempt to protect against: 108 1094. passive physical attackers who can record traces of physical emissions 110 (power, EM, sound) of the CPU while it manipulates secrets; 1115. active physical attackers who can also inject faults in the computation. 112 113(Note: p256-m should actually be secure against SPA, by virtue of being fully 114constant-flow, but is not expected to resist any other physical attack.) 115 116**Warning:** p256-m requires an externally-provided RNG function. If that 117function is not cryptographically secure, then neither is p256-m's key 118generation or ECDSA signature generation. 119 120_Note:_ p256-m also follows best practices such as securely erasing secret 121data on the stack before returning. 122 123## Code size 124 125Compiled with 126[ARM-GCC 9](https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads), 127with `-mthumb -Os`, here are samples of code sizes reached on selected cores: 128 129- Cortex-M0: 2988 bytes 130- Cortex-M4: 2900 bytes 131- Cortex-A7: 2924 bytes 132 133Clang was also tried but tends to generate larger code (by about 10%). For 134details, see `sizes.sh`. 135 136**What's included:** 137 138- Full input validation and (de)serialisation of input/outputs to/from bytes. 139- Cleaning up secret values from the stack before returning from a function. 140- The code has no dependency on libc functions or the toolchain's runtime 141 library (such as helpers for long multiply); this can be checked for the 142Arm-GCC toolchain with the `deps.sh` script. 143 144**What's excluded:** 145 146- A secure RNG function needs to be provided externally, see 147 `p256_generate_random()` in `p256-m.h`. 148 149## RAM usage 150 151p256-m doesn't use any dynamic memory (on the heap), only the stack. Here's 152how much stack is used by each of its 4 public functions on selected cores: 153 154| Function | Cortex-M0 | Cortex-M4 | Cortex-A7 | 155| ------------------------- | --------: | --------: | --------: | 156| `p256_gen_keypair` | 608 | 564 | 564 | 157| `p256_ecdh_shared_secret` | 640 | 596 | 596 | 158| `p256_ecdsa_sign` | 664 | 604 | 604 | 159| `p256_ecdsa_verify` | 752 | 700 | 700 | 160 161For details, see `stack.sh`, `wcs.py` and `libc.msu` (the above figures assume 162that the externally-provided RNG function uses at most 384 bytes of stack). 163 164## Runtime performance 165 166Here are the timings of each public function in milliseconds measured on 167platforms based on a selection of cores: 168 169- Cortex-M0 at 48 MHz: STM32F091 board running Mbed OS 6 170- Cortex-M4 at 100 MHz: STM32F411 board running Mbed OS 6 171- Cortex-A7 at 900 MHz: Raspberry Pi 2B running Raspbian Buster 172 173| Function | Cortex-M0 | Cortex-M4 | Cortex-A7 | 174| ------------------------- | --------: | --------: | --------: | 175| `p256_gen_keypair` | 921 | 145 | 11 | 176| `p256_ecdh_shared_secret` | 922 | 144 | 11 | 177| `p256_ecdsa_sign` | 990 | 155 | 12 | 178| `p256_ecdsa_verify` | 1976 | 309 | 24 | 179| Sum of the above | 4809 | 753 | 59 | 180 181The sum of these operations corresponds to a TLS handshake using ECDHE-ECDSA 182with mutual authentication based on raw public keys or directly-trusted 183certificates (otherwise, add one 'verify' for each link in the peer's 184certificate chain). 185 186_Note_: the above figures where obtained by compiling with GCC, which is able 187to use inline assembly. Without that inline assembly (22 lines for Cortex-M0, 1881 line for Cortex-M4), the code would be roughly 2 times slower on those 189platforms. (The effect is much less important on the Cortex-A7 core.) 190 191For details, see `bench.sh`, `benchmark.c` and `on-target-benchmark/`. 192 193## Comparison with other implementations 194 195The most relevant/convenient implementation for comparisons is 196[TinyCrypt](https://github.com/intel/tinycrypt), as it's also a standalone 197implementation of ECDH and ECDSA on P-256 only, that also targets constrained 198devices. Other implementations tend to implement many curves and build on a 199shared bignum/MPI module (possibly also supporting RSA), which makes fair 200comparisons less convenient. 201 202The scripts used for TinyCrypt measurements are available in [this 203branch](https://github.com/mpg/tinycrypt/tree/measurements), based on version 2040.2.8. 205 206**Code size** 207 208| Core | p256-m | TinyCrypt | 209| --------- | -----: | --------: | 210| Cortex-M0 | 2988 | 6134 | 211| Cortex-M4 | 2900 | 5934 | 212| Cortex-A7 | 2924 | 5934 | 213 214**RAM usage** 215 216TinyCrypto also uses no heap, only the stack. Here's the RAM used by each 217operation on a Cortex-M0 core: 218 219| operation | p256-m | TinyCrypt | 220| ------------------ | -----: | --------: | 221| key generation | 608 | 824 | 222| ECDH shared secret | 640 | 728 | 223| ECDSA sign | 664 | 880 | 224| ECDSA verify | 752 | 824 | 225 226On a Cortex-M4 or Cortex-A7 core (identical numbers): 227 228| operation | p256-m | TinyCrypt | 229| ------------------ | -----: | --------: | 230| key generation | 564 | 796 | 231| ECDH shared secret | 596 | 700 | 232| ECDSA sign | 604 | 844 | 233| ECDSA verify | 700 | 808 | 234 235**Runtime performance** 236 237Here are the timings of each operation in milliseconds measured on 238platforms based on a selection of cores: 239 240_Cortex-M0_ at 48 MHz: STM32F091 board running Mbed OS 6 241 242| Operation | p256-m | TinyCrypt | 243| ------------------ | -----: | --------: | 244| Key generation | 921 | 979 | 245| ECDH shared secret | 922 | 975 | 246| ECDSA sign | 990 | 1009 | 247| ECDSA verify | 1976 | 1130 | 248| Sum of those 4 | 4809 | 4093 | 249 250_Cortex-M4_ at 100 MHz: STM32F411 board running Mbed OS 6 251 252| Operation | p256-m | TinyCrypt | 253| ------------------ | -----: | --------: | 254| Key generation | 145 | 178 | 255| ECDH shared secret | 144 | 177 | 256| ECDSA sign | 155 | 188 | 257| ECDSA verify | 309 | 210 | 258| Sum of those 4 | 753 | 753 | 259 260_Cortex-A7_ at 900 MHz: Raspberry Pi 2B running Raspbian Buster 261 262| Operation | p256-m | TinyCrypt | 263| ------------------ | -----: | --------: | 264| Key generation | 11 | 13 | 265| ECDH shared secret | 11 | 13 | 266| ECDSA sign | 12 | 14 | 267| ECDSA verify | 24 | 15 | 268| Sum of those 4 | 59 | 55 | 269 270_64-bit Intel_ (i7-6500U at 2.50GHz) laptop running Ubuntu 20.04 271 272Note: results in microseconds (previous benchmarks in milliseconds) 273 274| Operation | p256-m | TinyCrypt | 275| ------------------ | -----: | --------: | 276| Key generation | 1060 | 1627 | 277| ECDH shared secret | 1060 | 1611 | 278| ECDSA sign | 1136 | 1712 | 279| ECDSA verify | 2279 | 1888 | 280| Sum of those 4 | 5535 | 6838 | 281 282**Other differences** 283 284- While p256-m fully validates all inputs, Tinycrypt's ECDH shared secret 285 function doesn't include validation of the peer's public key, which should be 286done separately by the user for static ECDH (there are attacks [when users 287forget](https://link.springer.com/chapter/10.1007/978-3-319-24174-6_21)). 288- The two implementations have slightly different security characteristics: 289 p256-m is fully constant-time from the ground up so should be more robust 290than TinyCrypt against powerful local attackers (such as an untrusted OS 291attacking a secure enclave); on the other hand TinyCrypt includes coordinate 292randomisation which protects against some passive physical attacks (such as 293DPA, see Table 3, column C9 of [this 294paper](https://www.esat.kuleuven.be/cosic/publications/article-2293.pdf#page=12)), 295which p256-m completely ignores. 296- TinyCrypt's code looks like it could easily be expanded to support other 297 curves, while p256-m has much more hard-coded to minimize code size (see 298"Notes about other curves" below). 299- TinyCrypt uses a specialised routine for reduction modulo the curve prime, 300 exploiting its structure as a Solinas prime, which should be faster than the 301generic Montgomery reduction used by p256-m, but other factors appear to 302compensate for that. 303- TinyCrypt uses Co-Z Jacobian formulas for point operation, which should be 304 faster (though a bit larger) than the mixed affine-Jacobian formulas 305used by p256-m, but again other factors appear to compensate for that. 306- p256-m uses bits of inline assembly for 64-bit multiplication on the 307 platforms used for benchmarking, while TinyCrypt uses only C (and the 308compiler's runtime library). 309- TinyCrypt uses a specialised routine based on Shamir's trick for 310 ECDSA verification, which gives much better performance than the generic 311code that p256-m uses in order to minimize code size. 312 313## Design overview 314 315The implementation is contained in a single file to keep most functions static 316and allow for more optimisations. It is organized in multiple layers: 317 318- Fixed-width multi-precision arithmetic 319- Fixed-width modular arithmetic 320- Operations on curve points 321- Operations with scalars 322- The public API 323 324**Multi-precision arithmetic.** 325 326Large integers are represented as arrays of `uint32_t` limbs. When carries may 327occur, casts to `uint64_t` are used to nudge the compiler towards using the 328CPU's carry flag. When overflow may occur, functions return a carry flag. 329 330This layer contains optional assembly for Cortex-M and Cortex-A cores, for the 331internal `u32_muladd64()` function, as well as two pure C versions of this 332function, depending on whether `MUL64_IS_CONSTANT_TIME`. 333 334This layer's API consists of: 335 336- addition, subtraction; 337- multiply-and-add, shift by one limb (for Montgomery multiplication); 338- conditional assignment, assignment of a small value; 339- comparison of two values for equality, comparison to 0 for equality; 340- (de)serialization as big-endian arrays of bytes. 341 342**Modular arithmetic.** 343 344All modular operations are done in the Montgomery domain, that is x is 345represented by `x * 2^256 mod m`; integers need to be converted to that domain 346before computations, and back from it afterwards. Montgomery constants 347associated to the curve's p and n are pre-computed and stored in static 348structures. 349 350Modular inversion is computed using Fermat's little theorem to get 351constant-time behaviour with respect to the value being inverted. 352 353This layer's API consists of: 354 355- the curve's constants p and n (and associated Montgomery constants); 356- modular addition, subtraction, multiplication, and inversion; 357- assignment of a small value; 358- conversion to/from Montgomery domain; 359- (de)serialization to/from bytes with integrated range checking and 360 Montgomery domain conversion. 361 362**Operations on curve points.** 363 364Curve points are represented using either affine or Jacobian coordinates; 365affine coordinates are extended to represent 0 as (0,0). Individual 366coordinates are always in the Montgomery domain. 367 368Not all formulas associated with affine or Jacobian coordinates are complete; 369great care is taken to document and satisfy each function's pre-conditions. 370 371This layer's API consists of: 372 373- curve constants: b from the equation, the base point's coordinates; 374- point validity check (on the curve and not 0); 375- Jacobian to affine coordinate conversion; 376- point doubling in Jacobian coordinates (complete formulas); 377- point addition in mixed affine-Jacobian coordinates (P not in {0, Q, -Q}); 378- point addition-or-doubling in affine coordinates (leaky version, only used 379 for ECDSA verify where all data is public); 380- (de)serialization to/from bytes with integrated validity checking 381 382**Scalar operations.** 383 384The crucial function here is scalar multiplication. It uses a signed binary 385ladder, which is a variant of the good old double-and-add algorithm where an 386addition/subtraction is performed at each step. Again, care is taken to make 387sure the pre-conditions for the addition formulas are always satisfied. The 388signed binary ladder only works if the scalar is odd; this is ensured by 389negating both the scalar (mod n) and the input point if necessary. 390 391This layer's API consists of: 392 393- scalar multiplication 394- de-serialization from bytes with integrated range checking 395- generation of a scalar and its associated public key 396 397**Public API.** 398 399This layer builds on the others, but unlike them, all inputs and outputs are 400byte arrays. Key generation and ECDH shared secret computation are thin 401wrappers around internal functions, just taking care of format conversions and 402errors. The ECDSA functions have more non-trivial logic. 403 404This layer's API consists of: 405 406- key-pair generation 407- ECDH shared secret computation 408- ECDSA signature creation 409- ECDSA signature verification 410 411**Testing.** 412 413A self-contained, straightforward, pure-Python implementation was first 414produced as a warm-up and to help check intermediate values. Test vectors from 415various sources are embedded and used to validate the implementation. 416 417This implementation, `p256.py`, is used by a second Python script, 418`gen-test-data.py`, to generate additional data for both positive and negative 419testing, available from a C header file, that is then used by the closed-box 420and open-box test programs. 421 422p256-m can be compiled with extra instrumentation to mark secret data and 423allow either valgrind or MemSan to check that no branch or memory access 424depends on it (even indirectly). Macros are defined for this purpose near the 425top of the file. 426 427**Tested platforms.** 428 429There are 4 versions of the internal function `u32_muladd64`: two assembly 430versions, for Cortex-M/A cores with or without the DSP extension, and two 431pure-C versions, depending on whether `MUL64_IS_CONSTANT_TIME`. 432 433Tests are run on the following platforms: 434 435- `make` on x64 tests the pure-C version without `MUL64_IS_CONSTANT_TIME` 436 (with Clang). 437- `./consttime.sh` on x64 tests both pure-C versions (with Clang). 438- `make` on Arm v7-A (Raspberry Pi 2) tests the Arm-DSP assembly version (with 439 Clang). 440- `on-target-*box` on boards based on Cortex-M0 and M4 cores test both 441 assembly versions (with GCC). 442 443In addition: 444 445- `sizes.sh` builds the code for three Arm cores with GCC and Clang. 446- `deps.sh` checks for external dependencies with GCC. 447 448## Notes about other curves 449 450It should be clear that minimal code size can only be reached by specializing 451the implementation to the curve at hand. Here's a list of things in the 452implementation that are specific to the NIST P-256 curve, and how the 453implementation could be changed to expand to other curves, layer by layer (see 454"Design Overview" above). 455 456**Fixed-width multi-precision arithmetic:** 457 458- The number of limbs is hard-coded to 8. For other 256-bit curves, nothing to 459 change. For a curve of another size, hard-code to another value. For multiple 460curves of various sizes, add a parameter to each function specifying the 461number of limbs; when declaring arrays, always use the maximum number of 462limbs. 463 464**Fixed-width modular arithmetic:** 465 466- The values of the curve's constant p and n, and their associated Montgomery 467 constants, are hard-coded. For another curve, just hard-code the new constants. 468For multiple other curves, define all the constants, and from this layer's API 469only keep the functions that already accept a `mod` parameter (that is, remove 470convenience functions `m256_xxx_p()`). 471- The number of limbs is again hard-coded to 8. See above, but it order to 472 support multiple sizes there is no need to add a new parameter to functions 473in this layer: the existing `mod` parameter can include the number of limbs as 474well. 475 476**Operations on curve points:** 477 478- The values of the curve's constants b (constant term from the equation) and 479 gx, gy (coordinates of the base point) are hard-coded. For another curve, 480 hard-code the other values. For multiple curves, define each curve's value and 481add a "curve id" parameter to all functions in this layer. 482- The value of the curve's constant a is implicitly hard-coded to `-3` by using 483 a standard optimisation to save one multiplication in the first step of 484`point_double()`. For curves that don't have a == -3, replace that with the 485normal computation. 486- The fact that b != 0 in the curve equation is used indirectly, to ensure 487 that (0, 0) is not a point on the curve and re-use that value to represent 488the point 0. As far as I know, all Short Weierstrass curves standardized so 489far have b != 0. 490- The shape of the curve is assumed to be Short Weierstrass. For other curve 491 shapes (Montgomery, (twisted) Edwards), this layer would probably look very 492different (both implementation and API). 493 494**Scalar operations:** 495 496- If multiple curves are to be supported, all function in this layer need to 497 gain a new "curve id" parameter. 498- This layer assumes that the bit size of the curve's order n is the same as 499 that of the modulus p. This is true of most curves standardized so far, the 500only exception being secp224k1. If that curve were to be supported, the 501representation of `n` and scalars would need adapting to allow for an extra 502limb. 503- The bit size of the curve's order is hard-coded in `scalar_mult()`. For 504 multiple curves, this should be deduced from the "curve id" parameter. 505- The `scalar_mult()` function exploits the fact that the second least 506 significant bit of the curve's order n is set in order to avoid a special 507case. For curve orders that don't meet this criterion, we can just handle that 508special case (multiplication by +-2) separately (always compute that and 509conditionally assign it to the result). 510- The shape of the curve is again assumed to be Short Weierstrass. For other curve 511 shapes (Montgomery, (twisted) Edwards), this layer would probably have a 512very different implementation. 513 514**Public API:** 515 516- For multiple curves, all functions in this layer would need to gain a "curve 517 id" parameter and handle variable-sized input/output. 518- The shape of the curve is again assumed to be Short Weierstrass. For other curve 519 shapes (Montgomery, (twisted) Edwards), the ECDH API would probably look 520quite similar (with differences in the size of public keys), but the ECDSA API 521wouldn't apply and an EdDSA API would look pretty different. 522 523## Notes about other platforms 524 525While p256-m is standard C99, it is written with constrained 32-bit platforms 526in mind and makes a few assumptions about the platform: 527 528- The types `uint8_t`, `uint16_t`, `uint32_t` and `uint64_t` exist. 529- 32-bit unsigned addition and subtraction with carry are constant time. 530- 16x16->32-bit unsigned multiplication is available and constant time. 531 532Also, on platforms on which 64-bit addition and subtraction with carry, or 533even 64x64->128-bit multiplication, are available, p256-m makes no use of 534them, though they could significantly improve performance. 535 536This could be improved by replacing uses of arrays of `uint32_t` with a 537defined type throughout the internal APIs, and then on 64-bit platforms define 538that type to be an array of `uint64_t` instead, and making the obvious 539adaptations in the multi-precision arithmetic layer. 540 541Finally, the optional assembly code (which boosts performance by a factor 2 on 542tested Cortex-M CPUs, while slightly reducing code size and stack usage) is 543currently only available with compilers that support GCC's extended asm 544syntax (which includes GCC and Clang). 545