# Hash Functions: Cryptographic Fingerprints

**Spoiler:** *Checking the integrity of data is a
complex task. If, in addition, you are in the presence of
intelligent attackers, it becomes downright cryptographic.
Continuation of the previous article, this time we will see the criteria
of fitness-for-use for these functions.*

Cette page est également disponible en français.

Having *easily* covered hash functions through checksums, we can now
move on to the *serious* and, frankly, *cryptographic*
stuff.

Previously, we wanted to catch *accidental errors*;
electromagnetic interference, input error and other cosmic rays. This
time, we will go further to detect *intentional errors*;
attempted content modification by pests. So the question is:

How to

guaranteethat a piece of data isconsistentwith the original?

The difference is significant because this time, we are no longer
playing *simply* against the
second law of thermodynamics but against *hackers*, real
ones, who use their intelligence to ruin the lives of cryptographers and
love stories.

So, before your astonished eyes, let’s lift the veil on these
*functions* and demystify those *cryptographic
hashes*…

# Constraints

After raising the stakes, we will therefore have to raise the
constraints so that candidate functions can claim the title of
*cryptographic hash functions*. It is therefore no longer a
question of *maximizing* the number of errors discovered but of
prohibiting them. Formally, this results in the following three
constraints on the computation of the fingerprint: irreversible,
tamper-proof and resistant to collisions.

## Irreversible

Since hash functions are used almost everywhere, they are expected to
be relatively easy to compute. Once you have data, computing its
*checksum* or its *cryptographic fingerprint* should not
be more complex than
reading it.

**What we don’t want**, on the other hand, is to be able
to build data easily when we have our fingerprint.

Cryptographers, who are mathematicians, therefore picky about vocabulary and notations, speak here of

pre-image resistance. For what ?Because we are talking about functions, $f(x)\rightarrow y$, where $y$ is the image of $x$ by the function $f$. Conversely, $x$ is called the pre-image of $y$.

This constraint takes on its full meaning to store or transmit fingerprints when you want to keep the corresponding data secret. Here are some examples:

- The passwords of your users (
*cf.*R22 of the ANSSI guide for securing websites). - Bank card numbers (
*cf.*condition 2.3 of the PA-DSS standard).

In these two cases, we want to reduce the consequences of a theft of the database by storing the fingerprints. When a visitor logs in, you hash their password and check that the result matches the stored one. Ditto for bank card numbers when a customer wants to check if his card has been used (you do the search on the fingerprint).

To be complete with the storage of passwords, you must also use

a salt(cf.R23 of the ANSSI guide) and slightly slower functions to avoid certain forms of particular attacks.

## Tamper-proof

Previously, we assumed that the attacker only has the fingerprint and had no information on the initial data he was trying to recover.

**This time, we refuse** that he can modify a data
without being detected; the fingerprint must therefore change.

Cryptographers speak here of

second pre-image resistance. Knowing a datum $x_1$, and its image $f(x_1)$, we can find a second datum $x_2$ such that $f(x_1)=f(x_2)$. Know what ? $x_2$ is called a second pre-image.

The use case is then different, the hash function is then used to
*guarantee integrity*, the data can be public. On the other hand,
for it to work, the fingerprint must be transmitted in a secure way… The
most usual case being the fingerprint of the `ISO`

images of
software installation dvds.

But if the attacker intercepts the data AND its fingerprint, he can modify both.

It is for this reason that the hash function is often used in conjunction with other cryptographic authentication algorithms to guarantee the identity of the person who computed the fingerprint (this is called message authenticity). These methods are most commonly found:

- In encrypted network connections (
*i.e.*by SSL/TLS and therefore in HTTPS), - In the digital certificates,
- In email signature (
*i.e.*via GnuPG).

## Collision Resistance

The preceding constraints, while sufficient *most of the
time*, are usually complemented by the following third
constraint:

**It is impossible** to find two different data with the
same fingerprint.

Impossible… To avoid any misunderstanding of the meaning of this term, it should be understood here as “too long”. Because we can always test combinations randomly until we find one, but if the time to get there is longer than the lifespan of the message (or of the Sun, or even of the Universe), we consider it okay.

This property is mainly used to theoretically prove the security of
certain protocols. In practice, it is mainly found in *proofs of
work* consisting of finding data with a similar fingerprint but also
in the security of storage systems using the fingerprint as an address
(we will come back to this).

# Algorithms

Unlike checksums which are *relatively easy to understand* in
detail, cryptographic hash functions are downright less so. Even if
these functions are short and contain only basic operations, they chain
them together and couple them to constants that look like they randomly
come from a supercharged first grade class just before the distribution
of Santa’s presents…

So rather than go through the guts of each of these features (giving you some headache), I’ll take you through the three most common ones: MD5, SHA-1, and SHA-2.

## MD5

This is the latest representative of the * Message
Digest* proposed by Ronald Rivest (co-creator of
RSA) in 1991. It allows you to calculate a 128-bit hash of any given and
was standardized in 1992 in RFC 1321.

**It is no longer safe.** And for a long time… The first
signs of weaknesses were published in 1993 (two initializations giving
the same results) then 1996 (creating collisions on the internal
function but not globally ).

Things then accelerated since a distributed attack was carried out in 2004 (one hour for a collision), the following year in a few minutes on a notebook, as well as the theft of cryptographic certificates.

Using this function to detect malicious modifications therefore clearly no longer works.

Technically, it can still be used as a *checksum* (and detect
accidental errors), but since safer alternatives exist, I prefer to skip
MD5 altogether.

## SHA1

Guessing that things were going wrong for MD5, the NSA had taken the
lead in publishing a new hash function soberly named
* Secure Hash
Algorithm* in the standard FIPS
180-1 (edited by NIST) in 1995 and producing 160-bit
fingerprints.

**It is no longer safe either.** Even if we have long
believed the opposite. The first signs appeared in 2004 with the
effective computation of collisions on SHA-0 (the predecessor of SHA-1)
and theoretical attacks were proposed the following year.

To avoid being overtaken, the NIST launched a competition in 2012 to find new hash functions (Keccak will be elected and renamed SHA- 3). In 2013, Microsoft no longer considered SHA-1 unsafe and the ANSSI issued a similar opinion in 2014.

In 2017, a proof of concept was conducted by Google and the CWI institute in Amsterdam who were able to create two different PDF files but with the same footprint (PDF). The attack was enhanced in April 2019 (PDF) and early January 2020 (PDF).

As before, it is therefore no longer wise to use SHA-1 for integrity checks in an IT security context.

We still use it when it’s not an issue. For example, git and
mercurial, two version control systems, use it to identify internal
data. In this context, SHA-1 is not used to *verify integrity*
but to identify a document (or an object in general). An attacker can of
course create two files, the second spoofing the first, but it is consensually
admitted that it is easier to insert malicious data directly without
bothering to generate collisions.

## SHA2

As it is always good to be ahead, the NSA had, as early as 2002, standardized two new hash functions in its next version of the standard FIPS 180-2: SHA-256 and SHA-512 which, as their names suggest, generate 256-bit and 512-bit hashes.

**It is considered safe.** Until proven otherwise. We
know, since 2003, that attacks on SHA-0 and SHA-1 have no impact. Since
no attack has yet been published, we consider that everything is
fine.

It is therefore the function used most of the time when it comes to verifying the integrity of data (in cryptographic certificates and a whole bunch of encrypted flows such as TLS, SAML, JWT, …).

## Agencies’ point of view

If you have to choose a hash function, or check that the current one in your environment is good or not, you can follow ANSSI’s recommendations.

- For use before 2020, the minimum size of the fingerprints generated by a hash function is 200 bits.
- For use beyond 2020, the minimum hash size generated by a hash function is 256 bits.
- The best known attack to find collisions should require on the order of $2^{h/2}$ hash computations, where $h$ denotes the size in bits of the hashes.

And if you don’t want to dig into the details of the functions and the academic literature to check the non-existence of attacks in less than $2^{h/2}$ computations steps, you can turn to the more interventionist NIST in the matter.

SHA-1can no longer be used to generate fingerprints. It can, if necessary, be used for integrity checks but for backward compatibility reasons.

SHA2is the new standard,SHA-3is an alternative.

# And now

Hash functions are a bit like checksums, but irreversible, tamper-proof and collision-resistant.

MD5 has been obsolete for so long that half of the world’s population has since been born!. So no, we don’t use MD5 anymore. For its part SHA-1 is also obsolete but we still use it when security do not depend on it.

For everything else, use SHA-256 and SHA512.