-rw-r--r-- 16529 librandombytes-20230126/doc/security.md raw
The standard security goal for a random-number generator (RNG) is that no feasible computation can distinguish the RNG output from true randomness, i.e., from a uniform random string of the same length. This allows applications to treat the RNG output as true randomness. Beyond merely asking for indistinguishability, some applications ask for "forward secrecy". This means that the RNG output is indistinguishable from true randomness for an attacker who sees the subsequent state of the entire device, including the internal state of the RNG. A typical explanation of the importance of forward secrecy is as follows: "A rogue country's spy agency has intercepted ciphertexts that a whistleblower has sent to a journalist. Agents then grab the whistleblower's computer while the computer is unlocked, take it away for analysis, and see that the whistleblower deleted the messages after sending them. Can the agency use the computer's current RNG state to reconstruct old keys and decrypt the ciphertexts?" Sometimes there are also requests for "backward secrecy". A strict interpretation of backward secrecy says that an attacker who sees the state of the entire device cannot distinguish the _next_ RNG output from uniform. A weaker variant is for secrecy to _eventually_ be restored after a compromise. Either way, backward secrecy is advertised as providing "self-healing", "post-compromise security", etc. Beware that this assumes a questionable concept of compromise as a one-time event rather than a naturally persistent state; meanwhile the complications of trying to achieve "post-compromise security" have a long history of distracting attention from the more fundamental task of preventing compromises in the first place. Whether or not backward secrecy is desired, there are many ways for RNGs to fail to provide forward secrecy, or to fail to even reach the standard security goal. For example: * After source code for the RC4 stream cipher was posted anonymously in 1994, RC4 was rapidly shown to flunk the standard definition of cipher security. Before and after this, RC4 was pushed by NSA (which had set up [special procedures](https://cr.yp.to/export/dtn/V3N4_10_92.pdf) to allow RC4 export licenses) and by the RSA company. RC4 remained in wide use for two decades for TLS encryption and for RNGs. * Around 2005, NSA paid the RSA company [$10 million](https://www.reuters.com/article/us-usa-security-rsa-idUSBRE9BJ1C220131220) to use Dual EC as the default RNG in RSA's BSAFE library. Dual EC was an RNG designed to be [predictable by NSA](https://projectbullrun.org/dual-ec/index.html). * Newer RNGs have often turned out to be feasible to distinguish from uniform. Common contributing factors include inadequate attention to security (which is hard to measure) and much more attention to performance (which is much easier to measure). For example, a [2020 attack](https://tosc.iacr.org/index.php/ToSC/article/view/8700) uses just 2^57^ CPU cycles to recover the secret "seed" and secret "increment" of the "PCG64" RNG. * Many RNG constructions advertised as "provably secure" constructions based on well-studied cryptographic primitives turn out to provide much lower security levels than desired. For example, AES-256-CTR, the usual "provably secure" way to use AES-256 as a stream cipher and RNG, provably flunks a simple collision test after 2^68^ bytes of output. This is not a sharp cutoff: one needs a much smaller limit on the output length to avoid having this attack succeed with probability measurably above 0. * Even after the removal of Dual EC, NIST's "SP 800-90" RNG standard remains unnecessarily complicated and unnecessarily difficult to analyze. A partial analysis in 2018 showed that SP 800-90A [failed to provide](https://eprint.iacr.org/2018/349) some of its claimed forward-secrecy properties. Also, because of performance problems in how NIST handles short outputs, applications are encouraged to add their own larger output buffers, typically compromising forward secrecy. Furthermore, RNG security can be compromised by implementation failures. For example: * The AES design encourages implementors to use "T-table" implementations on many platforms, exposing secret keys to [cache-timing attacks](https://cr.yp.to/papers.html#cachetiming). * If a virtual machine generates RNG output (perhaps used for a nonce sent to an attacker) and is then restored from a snapshot, many types of RNGs will generate the same RNG output again (perhaps then used for a secret key). This recycling compromises the security of various applications that would otherwise be safe. * Similarly, many types of RNGs will produce identical results in parent and child processes after `fork()`, again compromising security of various applications that would otherwise be safe. * In 2006, the Debian OS distribution incorrectly removed essential lines of code from the OpenSSL RNG, producing easily breakable keys in various cryptographic applications. Other Debian-based distributions, such as Ubuntu, copied the change. The Debian error was publicly discovered and corrected in 2008. A [review](https://fahrplan.events.ccc.de/congress/2008/Fahrplan/events/2995.en.html) showed that the error was triggered by warnings from a code-analysis tool that did not understand what was going on in the RNG code. * In 2012, two [independent](https://eprint.iacr.org/2012/064) [papers](https://factorable.net/) showed that a noticeable percentage of RSA public keys available on the Internet were factorable because two keys had factors in common. The second paper observed vulnerable keys from Linux, FreeBSD, and Windows, and gave a convincing explanation of how randomness would repeat on embedded Linux systems: * The Linux kernel accumulated entropy very slowly on small devices. * Keys were generated by the device on first boot, starting very early in the boot process. * `/dev/urandom`, the source of randomness, returned data without waiting for 256 bits of entropy to be collected. Another available mechanism, `/dev/random`, waited for entropy but was justifably avoided by many applications. This mechanism had the serious misfeature of also waiting for further entropy for every subsequent call; slow entropy collection thus turned `/dev/random` into a [problematic](https://github.com/openssl/openssl/issues/9078) bottleneck. * In 2018, a Linux kernel bug was [discovered](https://lkml.org/lkml/2018/4/12/711) that would result in `/dev/random` not waiting long enough for entropy to be collected. RNG failures are often used as motivation for the development of further RNGs, but if this process is not carefully managed then it increases the load on reviewers and encourages further security problems. librandombytes does not provide a new RNG; it is a wrapper around existing RNGs. It does not wrap every available RNG; it limits the number of options to simplify review. It takes the maximally centralized option, the OS kernel's RNG, by default; it provides one backup option, the OpenSSL RNG, just in case this is critical for system performance. Usage of the OS kernel's RNG has an imperfect track record, as illustrated by the papers from 2012 cited above. This is concerning. However, there are reasons to believe that risks have been reduced: * There is increasing adoption of OpenBSD's approach of provisioning all devices with initial randomness at installation time, for example by having a computer generate initial randomness for a device that it is flashing or for a new virtual-machine image. * There is increasing usage of mechanisms to quickly collect entropy, even on small devices. This serves as a backup if initial randomness fails, and can rescue forward secrecy when flash does not adequately erase randomness that was stored for the next boot. Also, people trying to achieve backward secrecy require these mechanisms. * There is increasing usage of mechanisms for virtual machines to collect entropy from the host machine, such as `virtio-rng`. * It is increasingly common for CPUs to include RNG hardware, although this hardware is inherently difficult to audit and [easy to sabotage](https://www.iacr.org/archive/ches2013/80860203/80860203.pdf). A [BIOS bug](https://lore.kernel.org/all/776cb5c2d33e7fd0d2893904724c0e52b394f24a.1565817448.git.thomas.lendacky@amd.com/T/#u) removed all entropy from the RNG on AMD CPUs after suspend/resume; the bug wasn't addressed for five years. * There is increasing availability of `getrandom` and/or `getentropy`. These functions wait for initial entropy to be collected (without the `/dev/random` misfeature of also waiting for further entropy for every subsequent call). This avoids security problems in the disaster scenario of initial randomness failing and entropy collection being slow. Some history: * 2014: `getentropy` was introduced in OpenBSD 5.6, and `getrandom` was introduced (as a variant of `getentropy`) in Linux kernel 3.17. * 2015: `getentropy` and `getrandom` were added to [Solaris 11.3](https://blogs.oracle.com/solaris/post/solaris-new-system-calls-getentropy2-and-getrandom2). * 2017: `getentropy` and `getrandom` were added to [glibc 2.25](https://sourceware.org/legacy-ml/libc-alpha/2017-02/msg00079.html), with `getentropy` provided as a wrapper around `getrandom` on Linux systems. * 2018: `getentropy` and `getrandom` were added to [FreeBSD 12.0](https://www.freebsd.org/releases/12.0R/relnotes/). FreeBSD already changed `/dev/urandom` [in 2000](https://papers.freebsd.org/2017/vbsdcon/gurney-A_Deep_Dive_into_FreeBSDs_Kernel_RNG.files/gurney-A_Deep_Dive_into_FreeBSDs_Kernel_RNG.pdf) to wait for initial entropy to be collected, already bringing the same benefit without requiring applications to adopt a new interface. For OpenBSD, randomness was always available from installation time, so `/dev/urandom` never had a reason to wait. (But `getrandom` and `getentropy` have separate advantages of not relying on opening a file; this simplifies `chroot` setups and avoids various failure cases.) * There is increasing convergence upon the ChaCha20 stream cipher as the foundation of random-number generation. ChaCha20 is a 2008 variant of the 2005 Salsa20 stream cipher, and is one of just two ciphers in TLS 1.3. ChaCha20 is stronger against known attacks than the other TLS 1.3 cipher, AES-256-CTR; recall that AES-256-CTR flunks a feasible collision test. ChaCha20 also does not have AES's problem of encouraging variable-time software. Regarding possible attack improvements, the challenge of breaking scaled-down versions of Salsa20 and ChaCha20 has attracted 20 attacks from [43 cryptanalysts](https://cr.yp.to/snuffle.html#security). With the published techniques, ChaCha6 is feasible to break by a large-scale attack, and ChaCha7 is subject to an attack faster than brute force. (For comparison, even if the aforementioned collision failure is ignored, 6 out of the 14 rounds of AES-256 are breakable by a feasible attack, and 8 out of the 14 rounds of AES-256 are subject to an attack faster than brute force.) The long-term security of ChaCha8 is unclear, but the recommended ChaCha20 has very low risk. * There is increasing convergence upon a simple, efficient, easily analyzed mechanism to provide forward secrecy given a strong stream cipher: [fast-key-erasure RNGs](https://blog.cr.yp.to/20170723-random.html). Examples of these RNGs include * 2013: Nick Mathewson's [libottery](https://github.com/nmathewson/libottery), which says it is "based on a construction described in XXX, as summarized by DJB"; * 2013: OpenBSD's [`arc4random` update](https://github.com/openbsd/src/commit/90c1fad70a3483c2c72c3c90acf438a5f235c776), with credit to libottery, replacing previous use of RC4; * 2018: FreeBSD's [`arc4random` update](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=182610), with credit to libottery and OpenBSD; and * 2022: Linux's [RNG updates](https://www.zx2c4.com/projects/linux-rng-5.17-5.18/inside-linux-kernel-rng-presentation-sept-13-2022.pdf). A 2005 paper by [Barak and Halevi](https://eprint.iacr.org/2005/029) analyzes the following special case of fast-key-erasure RNGs: an m-bit key is used to generate 2m bits of stream-cipher output, split into m bits of RNG output and m bits to overwrite the key. A general fast-key-erasure RNG obtains the same security properties at much higher speed by generating more output from each key: e.g., 768 bytes of output from a 32-byte key. `librandombytes-kernel` uses `getrandom` if it finds it, otherwise `getentropy` if it finds it, otherwise `/dev/urandom`. In the case of `/dev/urandom`, `librandombytes-kernel` waits at program startup for `/dev/random` to become readable; however, it skips this if the file `/var/run/urandom-ready` exists (or if `/dev/random` does not exist). The idea is that system administrators of Linux systems too old to have `getrandom` can run dd count=1 bs=64 if=/dev/random of=/dev/urandom status=none \ && findmnt -t tmpfs -T /var/run >/dev/null \ && touch /var/run/urandom-ready & from boot scripts so that `librandombytes-kernel` doesn't wait after initial per-boot entropy collection. Note that systems where `/var/run` is a persistent directory (rather than `tmpfs`), cleared by boot scripts, should not create `/var/run/urandom-ready`, since `librandombytes-kernel` might be running earlier in the boot process. There are various other ways that one can imagine `/dev/urandom` being read too early on old Linux systems (and not so old, as in the 2018 Linux bug mentioned above); `librandombytes-kernel` tries to reduce risks but does not make guarantees. Provisioning systems with initial randomness is recommended in all cases. There are also many other security reasons to recommend retiring old Linux systems if they cannot be upgraded. `librandombytes-openssl`, which simply calls OpenSSL's `RAND_bytes`, raises more security concerns: * OpenSSL's `RAND_bytes` maintains an RNG state in user space, evidently because of a belief that this produces an important speedup. With its current defaults, OpenSSL can take an hour before it reseeds the user RNG state from the OS. This means that whatever entropy-injection mechanisms are in the OS RNG, for example to protect virtual machines, can leave OpenSSL unprotected for an hour. * OpenSSL's default RNG, starting in [2017](https://www.openssl.org/blog/blog/2017/08/12/random/), is a NIST RNG, specifically CTR-DRBG built on top of AES. A [2019 attack](https://eprint.iacr.org/2019/996) showed that OpenSSL in FIPS mode was using T-table implementations of AES and was leaking RNG secrets to a timing attack. Even with a constant-time AES implementation, the security level of this RNG is not clear from the existing literature. * When OpenSSL first seeds its RNG from `/dev/urandom` on pre-`getrandom` Linux systems, it waits (starting with OpenSSL 1.1.1d in 2019) for `/dev/random` to become readable, and then creates shared memory segment 114. It skips the `/dev/random` wait if shared memory segment 114 exists already. This does not have the same security properties as checking for existence of a file that can be created only by root: if _any_ account on the same system is running code from an attacker then the attacker can create shared memory segment 114, disabling the `/dev/random` test for all OpenSSL instances on the same system. These OpenSSL issues are not obviously showstoppers. For systems where using the OS RNG is a performance problem, OpenSSL's RNG seems to be a reasonable backup plan. Of course, it would be possible to patch OpenSSL to use a fast-key-erasure RNG on top of ChaCha20, to use a safer `/dev/random` test (which could also be handled as a wrapper around OpenSSL), and to reseed more frequently.