Hash Functions: Cryptographic Fingerprints

tbowan
(en français)
November 28th 2019

Spoiler: Checking the integrity of data is a complex task. If, in addition, you are in the presence of intelligent attackers, it becomes downright cryptographic. Continuation of the previous article, this time we will see the criteria of fitness-for-use for these functions.

Having easily covered hash functions through checksums, we can now move on to the serious and, frankly, cryptographic stuff.

Previously, we wanted to catch accidental errors; electromagnetic interference, input error and other cosmic rays. This time, we will go further to detect intentional errors; attempted content modification by pests. So the question is:

How to guarantee that a piece of data is consistent with the original?

The difference is significant because this time, we are no longer playing simply against the second law of thermodynamics but against hackers, real ones, who use their intelligence to ruin the lives of cryptographers and love stories.

Jasper Nance @flickr, CC-By-NC-SA — Jasper Nance @flickr, `CC-By-NC-SA`

So, before your astonished eyes, let’s lift the veil on these functions and demystify those cryptographic hashes…

Constraints

After raising the stakes, we will therefore have to raise the constraints so that candidate functions can claim the title of cryptographic hash functions. It is therefore no longer a question of maximizing the number of errors discovered but of prohibiting them. Formally, this results in the following three constraints on the computation of the fingerprint: irreversible, tamper-proof and resistant to collisions.

Irreversible

Since hash functions are used almost everywhere, they are expected to be relatively easy to compute. Once you have data, computing its checksum or its cryptographic fingerprint should not be more complex than reading it.

What we don’t want, on the other hand, is to be able to build data easily when we have our fingerprint.

Cryptographers, who are mathematicians, therefore picky about vocabulary and notations, speak here of pre-image resistance. For what ?

Because we are talking about functions, $f(x)\rightarrow y$ , where $y$ is the image of $x$ by the function $f$ . Conversely, $x$ is called the pre-image of $y$ .

This constraint takes on its full meaning to store or transmit fingerprints when you want to keep the corresponding data secret. Here are some examples:

The passwords of your users (cf. R22 of the ANSSI guide for securing websites).
Bank card numbers (cf. condition 2.3 of the PA-DSS standard).

In these two cases, we want to reduce the consequences of a theft of the database by storing the fingerprints. When a visitor logs in, you hash their password and check that the result matches the stored one. Ditto for bank card numbers when a customer wants to check if his card has been used (you do the search on the fingerprint).

To be complete with the storage of passwords, you must also use a salt (cf. R23 of the ANSSI guide) and slightly slower functions to avoid certain forms of particular attacks.

Tamper-proof

Previously, we assumed that the attacker only has the fingerprint and had no information on the initial data he was trying to recover.

This time, we refuse that he can modify a data without being detected; the fingerprint must therefore change.

Cryptographers speak here of second pre-image resistance. Knowing a datum $x_1$ , and its image $f(x_1)$ , we can find a second datum $x_2$ such that $f(x_1)=f(x_2)$ . Know what ? $x_2$ is called a second pre-image.

The use case is then different, the hash function is then used to guarantee integrity, the data can be public. On the other hand, for it to work, the fingerprint must be transmitted in a secure way… The most usual case being the fingerprint of the ISO images of software installation dvds.

But if the attacker intercepts the data AND its fingerprint, he can modify both.

It is for this reason that the hash function is often used in conjunction with other cryptographic authentication algorithms to guarantee the identity of the person who computed the fingerprint (this is called message authenticity). These methods are most commonly found:

In encrypted network connections (i.e. by SSL/TLS and therefore in HTTPS),
In the digital certificates,
In email signature (i.e. via GnuPG).

Collision Resistance

The preceding constraints, while sufficient most of the time, are usually complemented by the following third constraint:

It is impossible to find two different data with the same fingerprint.

Impossible… To avoid any misunderstanding of the meaning of this term, it should be understood here as “too long”. Because we can always test combinations randomly until we find one, but if the time to get there is longer than the lifespan of the message (or of the Sun, or even of the Universe), we consider it okay.

This property is mainly used to theoretically prove the security of certain protocols. In practice, it is mainly found in proofs of work consisting of finding data with a similar fingerprint but also in the security of storage systems using the fingerprint as an address (we will come back to this).

Algorithms

Unlike checksums which are relatively easy to understand in detail, cryptographic hash functions are downright less so. Even if these functions are short and contain only basic operations, they chain them together and couple them to constants that look like they randomly come from a supercharged first grade class just before the distribution of Santa’s presents…

So rather than go through the guts of each of these features (giving you some headache), I’ll take you through the three most common ones: MD5, SHA-1, and SHA-2.

MD5

This is the latest representative of the Message Digest proposed by Ronald Rivest (co-creator of RSA) in 1991. It allows you to calculate a 128-bit hash of any given and was standardized in 1992 in RFC 1321.

It is no longer safe. And for a long time… The first signs of weaknesses were published in 1993 (two initializations giving the same results) then 1996 (creating collisions on the internal function but not globally ).

Things then accelerated since a distributed attack was carried out in 2004 (one hour for a collision), the following year in a few minutes on a notebook, as well as the theft of cryptographic certificates.

Using this function to detect malicious modifications therefore clearly no longer works.

Technically, it can still be used as a checksum (and detect accidental errors), but since safer alternatives exist, I prefer to skip MD5 altogether.

SHA1

Guessing that things were going wrong for MD5, the NSA had taken the lead in publishing a new hash function soberly named Secure Hash Algorithm in the standard FIPS 180-1 (edited by NIST) in 1995 and producing 160-bit fingerprints.

1995, last french nuclear tests, Thomas Conté @ flickr

It is no longer safe either. Even if we have long believed the opposite. The first signs appeared in 2004 with the effective computation of collisions on SHA-0 (the predecessor of SHA-1) and theoretical attacks were proposed the following year.

To avoid being overtaken, the NIST launched a competition in 2012 to find new hash functions (Keccak will be elected and renamed SHA- 3). In 2013, Microsoft no longer considered SHA-1 unsafe and the ANSSI issued a similar opinion in 2014.

In 2017, a proof of concept was conducted by Google and the CWI institute in Amsterdam who were able to create two different PDF files but with the same footprint (PDF). The attack was enhanced in April 2019 (PDF) and early January 2020 (PDF).

As before, it is therefore no longer wise to use SHA-1 for integrity checks in an IT security context.

We still use it when it’s not an issue. For example, git and mercurial, two version control systems, use it to identify internal data. In this context, SHA-1 is not used to verify integrity but to identify a document (or an object in general). An attacker can of course create two files, the second spoofing the first, but it is consensually admitted that it is easier to insert malicious data directly without bothering to generate collisions.

SHA2

As it is always good to be ahead, the NSA had, as early as 2002, standardized two new hash functions in its next version of the standard FIPS 180-2: SHA-256 and SHA-512 which, as their names suggest, generate 256-bit and 512-bit hashes.

2002, passage à l’Euro, martaposemuckel @pixabay

It is considered safe. Until proven otherwise. We know, since 2003, that attacks on SHA-0 and SHA-1 have no impact. Since no attack has yet been published, we consider that everything is fine.

It is therefore the function used most of the time when it comes to verifying the integrity of data (in cryptographic certificates and a whole bunch of encrypted flows such as TLS, SAML, JWT, …).

Agencies’ point of view

If you have to choose a hash function, or check that the current one in your environment is good or not, you can follow ANSSI’s recommendations.

For use before 2020, the minimum size of the fingerprints generated by a hash function is 200 bits.

For use beyond 2020, the minimum hash size generated by a hash function is 256 bits.

The best known attack to find collisions should require on the order of $2^{h/2}$ hash computations, where $h$ denotes the size in bits of the hashes.

Section 2.3 of the RGS

And if you don’t want to dig into the details of the functions and the academic literature to check the non-existence of attacks in less than $2^{h/2}$ computations steps, you can turn to the more interventionist NIST in the matter.

SHA-1 can no longer be used to generate fingerprints. It can, if necessary, be used for integrity checks but for backward compatibility reasons.

SHA2 is the new standard, SHA-3 is an alternative.

NIST Report

And now

Hash functions are a bit like checksums, but irreversible, tamper-proof and collision-resistant.

MD5 has been obsolete for so long that half of the world’s population has since been born!. So no, we don’t use MD5 anymore. For its part SHA-1 is also obsolete but we still use it when security do not depend on it.

For everything else, use SHA-256 and SHA512.