-rw-r--r-- 16517 librandombytes-20230905/doc/security.md raw
The standard security goal for a random-number generator (RNG) is that
no feasible computation can distinguish the RNG output from true
randomness, i.e., from a uniform random string of the same length. This
allows applications to treat the RNG output as true randomness.
Beyond merely asking for indistinguishability, some applications ask for
"forward secrecy". This means that the RNG output is indistinguishable
from true randomness for an attacker who sees the subsequent state of
the entire device, including the internal state of the RNG.
A typical explanation of the importance of forward secrecy is as
follows: "A rogue country's spy agency has intercepted ciphertexts that
a whistleblower has sent to a journalist. Agents then grab the
whistleblower's computer while the computer is unlocked, take it away
for analysis, and see that the whistleblower deleted the messages after
sending them. Can the agency use the computer's current RNG state to
reconstruct old keys and decrypt the ciphertexts?"
Sometimes there are also requests for "backward secrecy". A strict
interpretation of backward secrecy says that an attacker who sees the
state of the entire device cannot distinguish the _next_ RNG output from
uniform. A weaker variant is for secrecy to _eventually_ be restored
after a compromise. Either way, backward secrecy is advertised as
providing "self-healing", "post-compromise security", etc. Beware that
this assumes a questionable concept of compromise as a one-time event
rather than a naturally persistent state; meanwhile the complications of
trying to achieve "post-compromise security" have a long history of
distracting attention from the more fundamental task of preventing
compromises in the first place.
Whether or not backward secrecy is desired, there are many ways for RNGs
to fail to provide forward secrecy, or to fail to even reach the
standard security goal. For example:
* After source code for the RC4 stream cipher was posted anonymously in
1994, RC4 was rapidly shown to flunk the standard definition of cipher
security. Before and after this, RC4 was pushed by NSA (which had set
up [special procedures](https://cr.yp.to/export/dtn/V3N4_10_92.pdf)
to allow RC4 export licenses) and by the RSA company. RC4 remained in
wide use for two decades for TLS encryption and for RNGs.
* Around 2005, NSA paid the RSA company
[$10 million](https://www.reuters.com/article/us-usa-security-rsa-idUSBRE9BJ1C220131220)
to use Dual EC as the default RNG in RSA's BSAFE library. Dual EC
was an RNG designed to be
[predictable by NSA](https://projectbullrun.org/dual-ec/index.html).
* Newer RNGs have often turned out to be feasible to distinguish from
uniform. Common contributing factors include inadequate attention to
security (which is hard to measure) and much more attention to
performance (which is much easier to measure). For example, a
[2020 attack](https://tosc.iacr.org/index.php/ToSC/article/view/8700)
uses just 2^57^ CPU cycles to recover the secret "seed" and secret
"increment" of the "PCG64" RNG.
* Many RNG constructions advertised as "provably secure" constructions
based on well-studied cryptographic primitives turn out to provide
much lower security levels than desired. For example, AES-256-CTR, the
usual "provably secure" way to use AES-256 as a stream cipher and RNG,
provably flunks a simple collision test after 2^68^ bytes of output.
This is not a sharp cutoff: one needs a much smaller limit on the
output length to avoid having this attack succeed with probability
measurably above 0.
* Even after the removal of Dual EC, NIST's "SP 800-90" RNG standard
remains unnecessarily complicated and unnecessarily difficult to
analyze. A partial analysis in 2018 showed that SP 800-90A
[failed to provide](https://eprint.iacr.org/2018/349)
some of its claimed forward-secrecy properties. Also, because of
performance problems in how NIST handles short outputs, applications
are encouraged to add their own larger output buffers, typically
compromising forward secrecy.
Furthermore, RNG security can be compromised by implementation failures.
For example:
* The AES design encourages implementors to use "T-table"
implementations on many platforms, exposing secret keys to
[cache-timing attacks](https://cr.yp.to/papers.html#cachetiming).
* If a virtual machine generates RNG output (perhaps used for a nonce
sent to an attacker) and is then restored from a snapshot, many types
of RNGs will generate the same RNG output again (perhaps then used for
a secret key). This recycling compromises the security of various
applications that would otherwise be safe.
* Similarly, many types of RNGs will produce identical results in parent
and child processes after `fork()`, again compromising security of
various applications that would otherwise be safe.
* In 2006, the Debian OS distribution incorrectly removed essential
lines of code from the OpenSSL RNG, producing easily breakable keys in
various cryptographic applications. Other Debian-based distributions,
such as Ubuntu, copied the change.
The Debian error was publicly discovered and corrected in 2008. A
[review](https://fahrplan.events.ccc.de/congress/2008/Fahrplan/events/2995.en.html)
showed that the error was triggered by warnings from a code-analysis
tool that did not understand what was going on in the RNG code.
* In 2012, two
[independent](https://eprint.iacr.org/2012/064)
[papers](https://factorable.net/)
showed that a noticeable percentage of RSA public keys available on
the Internet were factorable because two keys had factors in common.
The second paper observed vulnerable keys from Linux, FreeBSD, and
Windows, and gave a convincing explanation of how randomness would
repeat on embedded Linux systems:
* The Linux kernel accumulated entropy very slowly on small devices.
* Keys were generated by the device on first boot, starting very
early in the boot process.
* `/dev/urandom`, the source of randomness, returned data without
waiting for 256 bits of entropy to be collected.
Another available mechanism, `/dev/random`, waited for entropy but
was justifably avoided by many applications. This mechanism had the
serious misfeature of also waiting for further entropy for every
subsequent call; slow entropy collection thus turned `/dev/random`
into a [problematic](https://github.com/openssl/openssl/issues/9078)
bottleneck.
* In 2018, a Linux kernel bug was
[discovered](https://lkml.org/lkml/2018/4/12/711)
that would result in `/dev/random` not waiting long enough for
entropy to be collected.
RNG failures are often used as motivation for the development of further
RNGs, but if this process is not carefully managed then it increases the
load on reviewers and encourages further security problems.
librandombytes does not provide a new RNG; it is a wrapper around
existing RNGs. It does not wrap every available RNG; it limits the
number of options to simplify review. It takes the maximally centralized
option, the OS kernel's RNG, by default; it provides one backup option,
the OpenSSL RNG, just in case this is critical for system performance.
Usage of the OS kernel's RNG has an imperfect track record, as
illustrated by the papers from 2012 cited above. This is concerning.
However, there are reasons to believe that risks have been reduced:
* There is increasing adoption of OpenBSD's approach of provisioning
all devices with initial randomness at installation time, for
example by having a computer generate initial randomness for a
device that it is flashing or for a new virtual-machine image.
* There is increasing usage of mechanisms to quickly collect entropy,
even on small devices. This serves as a backup if initial randomness
fails, and can rescue forward secrecy when flash does not adequately
erase randomness that was stored for the next boot. Also, people
trying to achieve backward secrecy require these mechanisms.
* There is increasing usage of mechanisms for virtual machines to
collect entropy from the host machine, such as `virtio-rng`.
* It is increasingly common for CPUs to include RNG hardware, although
this hardware is inherently difficult to audit and
[easy to sabotage](https://www.iacr.org/archive/ches2013/80860203/80860203.pdf).
A [BIOS bug](https://lore.kernel.org/all/776cb5c2d33e7fd0d2893904724c0e52b394f24a.1565817448.git.thomas.lendacky@amd.com/T/#u)
removed all entropy from the RNG on AMD CPUs after suspend/resume;
the bug wasn't addressed for five years.
* There is increasing availability of `getrandom` and/or `getentropy`.
These functions wait for initial entropy to be collected (without the
`/dev/random` misfeature of also waiting for further entropy for every
subsequent call). This avoids security problems in the disaster
scenario of initial randomness failing and entropy collection being
slow. Some history:
* 2014: `getentropy` was introduced in OpenBSD 5.6, and `getrandom`
was introduced (as a variant of `getentropy`) in Linux kernel 3.17.
* 2015: `getentropy` and `getrandom` were added to
[Solaris 11.3](https://blogs.oracle.com/solaris/post/solaris-new-system-calls-getentropy2-and-getrandom2).
* 2017: `getentropy` and `getrandom` were added to
[glibc 2.25](https://sourceware.org/legacy-ml/libc-alpha/2017-02/msg00079.html),
with `getentropy` provided as a wrapper around `getrandom` on
Linux systems.
* 2018: `getentropy` and `getrandom` were added to
[FreeBSD 12.0](https://www.freebsd.org/releases/12.0R/relnotes/).
FreeBSD already changed `/dev/urandom`
[in 2000](https://papers.freebsd.org/2017/vbsdcon/gurney-A_Deep_Dive_into_FreeBSDs_Kernel_RNG.files/gurney-A_Deep_Dive_into_FreeBSDs_Kernel_RNG.pdf)
to wait for initial entropy to be collected, already bringing the
same benefit without requiring applications to adopt a new
interface. For OpenBSD, randomness was always available from
installation time, so `/dev/urandom` never had a reason to wait.
(But `getrandom` and `getentropy` have separate advantages of not
relying on opening a file; this simplifies `chroot` setups and
avoids various failure cases.)
* There is increasing convergence upon the ChaCha20 stream cipher as
the foundation of random-number generation.
ChaCha20 is a 2008 variant of the 2005 Salsa20 stream cipher, and is
one of just two ciphers in TLS 1.3. ChaCha20 is stronger against
known attacks than the other TLS 1.3 cipher, AES-256-CTR; recall
that AES-256-CTR flunks a feasible collision test. ChaCha20 also
does not have AES's problem of encouraging variable-time software.
Regarding possible attack improvements, the challenge of breaking
scaled-down versions of Salsa20 and ChaCha20 has attracted 20 attacks
from [43 cryptanalysts](https://cr.yp.to/snuffle.html#security).
With the published techniques, ChaCha6 is feasible to break by a
large-scale attack, and ChaCha7 is subject to an attack faster than
brute force. (For comparison, even if the aforementioned collision
failure is ignored, 6 out of the 14 rounds of AES-256 are breakable
by a feasible attack, and 8 out of the 14 rounds of AES-256 are
subject to an attack faster than brute force.) The long-term
security of ChaCha8 is unclear, but the recommended ChaCha20 has
very low risk.
* There is increasing convergence upon a simple, efficient, easily
analyzed mechanism to provide forward secrecy given a strong stream
cipher: [fast-key-erasure RNGs](https://blog.cr.yp.to/20170723-random.html).
Examples of these RNGs include
* 2013: Nick Mathewson's [libottery](https://github.com/nmathewson/libottery),
which says it is "based on a construction described in XXX, as
summarized by DJB";
* 2013: OpenBSD's
[`arc4random` update](https://github.com/openbsd/src/commit/90c1fad70a3483c2c72c3c90acf438a5f235c776),
with credit to libottery, replacing previous use of RC4;
* 2018: FreeBSD's
[`arc4random` update](https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=182610),
with credit to libottery and OpenBSD; and
* 2022: Linux's
[RNG updates](https://www.zx2c4.com/projects/linux-rng-5.17-5.18/inside-linux-kernel-rng-presentation-sept-13-2022.pdf).
A 2005 paper by
[Barak and Halevi](https://eprint.iacr.org/2005/029)
analyzes the following special case of fast-key-erasure RNGs: an
m-bit key is used to generate 2m bits of stream-cipher output, split
into m bits of RNG output and m bits to overwrite the key. A general
fast-key-erasure RNG obtains the same security properties at much
higher speed by generating more output from each key: e.g., 768
bytes of output from a 32-byte key.
`librandombytes-kernel` uses `getrandom` if it finds it, otherwise
`getentropy` if it finds it, otherwise `/dev/urandom`. In the case of
`/dev/urandom`, `librandombytes-kernel` waits at program startup for
`/dev/random` to become readable; however, it skips this if the file
`/var/run/urandom-ready` exists (or if `/dev/random` does not exist).
The idea is that system administrators of Linux systems too old to have
`getrandom` can run
dd count=1 bs=64 if=/dev/random of=/dev/urandom status=none \
&& findmnt -t tmpfs -T /var/run >/dev/null \
&& touch /var/run/urandom-ready &
from boot scripts so that `librandombytes-kernel` doesn't wait after
initial per-boot entropy collection. Note that systems where `/var/run`
is a persistent directory (rather than `tmpfs`), cleared by boot
scripts, should not create `/var/run/urandom-ready`, since
`librandombytes-kernel` might be running earlier in the boot process.
There are various other ways that one can imagine `/dev/urandom` being
read too early on old Linux systems (and not so old, as in the 2018
Linux bug mentioned above); `librandombytes-kernel` tries to reduce
risks but does not make guarantees. Provisioning systems with initial
randomness is recommended in all cases. There are also many other
security reasons to recommend retiring old Linux systems if they cannot
be upgraded.
`librandombytes-openssl`, which simply calls OpenSSL's `RAND_bytes`,
raises more security concerns:
* OpenSSL's `RAND_bytes` maintains an RNG state in user space,
evidently because of a belief that this produces an important
speedup. With its current defaults, OpenSSL can take an hour before
it reseeds the user RNG state from the OS. This means that whatever
entropy-injection mechanisms are in the OS RNG, for example to
protect virtual machines, can leave OpenSSL unprotected for an hour.
* OpenSSL's default RNG,
starting in [2017](https://www.openssl.org/blog/blog/2017/08/12/random/),
is a NIST RNG, specifically CTR-DRBG built on top of AES. A
[2019 attack](https://eprint.iacr.org/2019/996)
showed that OpenSSL in FIPS mode was using T-table implementations
of AES and was leaking RNG secrets to a timing attack. Even with a
constant-time AES implementation, the security level of this RNG is
not clear from the existing literature.
* When OpenSSL first seeds its RNG from `/dev/urandom` on
pre-`getrandom` Linux systems, it waits (starting with OpenSSL
1.1.1d in 2019) for `/dev/random` to become readable, and then
creates shared memory segment 114. It skips the `/dev/random` wait
if shared memory segment 114 exists already. This does not have the
same security properties as checking for existence of a file that
can be created only by root: if _any_ account on the same system is
running code from an attacker then the attacker can create shared
memory segment 114, disabling the `/dev/random` test for all OpenSSL
instances on the same system.
These OpenSSL issues are not obviously showstoppers. For systems where
using the OS RNG is a performance problem, OpenSSL's RNG seems to be a
reasonable backup plan. Of course, it would be possible to patch OpenSSL
to use a fast-key-erasure RNG on top of ChaCha20, to use a safer
`/dev/random` test (which could also be handled as a wrapper around
OpenSSL), and to reseed more frequently.