# Hash Functions: Checksums

**Spoiler:** *Used to check data integrity, hash
functions are everywhere. Today, we present the simplest of them:
the checksums. The principle is simple: since any data is a
sequence of 0s and 1s, it is also a number that can be projected into a
smaller set. Ideally, any change in the data will send it elsewhere and
we will see the difference.*

In a previous article on public key
infrastructures, I briefly mentioned *hash functions*, but
did not dwell on them. I have since realized that they are rather
unknown. And it's a shame because, even if we don't realize it, we use
them all the time.

Before discussing their cryptographic version, which was mentioned
there and which we will cover in the next
article, I would like to take a little time to introduce you to
their ancestors: the *checksums*. Much simpler, they already have
the fundamental characteristics of hash functions.

It all starts with a simple and fairly common IT (security) need:

How to check that data has been transmitted without being altered?

Whether we are transmitting a packet over a network, storing important files, or copying them elsewhere, how do we make sure that there was no problem and that the data received is indeed consistent with the original?

# Global parity bits

Let's start with the most rudimentary method: count the number of
bits that are worth
$1$;
if this count is even, we add a
$0$,
if it is odd, we add a
$1$.
Mathematically, this amounts to taking the *exclusive or* of all the bits
of the message. It is written *XOR* in text (for
*eXclusive OR*),
$\oplus$
in math, and `^` when programming.

Data | $\oplus$ |
---|---|
00 | 0 |
01 | 1 |
10 | 1 |
11 | 0 |
011010011100110101 | 0 |

In terms of projection, it reduces everything to only two possible
values: 0 or 1. A bit radical for a *hash* but if a single bit
changes in the data, the parity will change too and we can detect this
error 🎉.

On the other hand, and this is where it is somewhat limited, if two bits change, the error will no longer be detected.
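As a quick sketch (Python here, with an illustrative `parity_bit` helper name), this global parity is just a *xor* folded over all the bits:

```python
from functools import reduce

def parity_bit(bits: str) -> int:
    """Xor of all the bits: 0 if the count of 1s is even, 1 if odd."""
    return reduce(lambda a, b: a ^ b, (int(b) for b in bits))

print(parity_bit("01"))                  # 1
print(parity_bit("011010011100110101"))  # 0
```

Note that flipping any two bits leaves the parity unchanged, which is exactly the limitation described above.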

# Parity bits on words

To go further, we can cut the message into parts of equal length
(called *words*) and add a parity bit to each one. We can
then detect several errors (as long as they appear in different
words).

**In Real Life, in ASCII.** In the past, each character
was encoded with 7 bits. That allowed 128 characters, which was considered
*largely sufficient* at the time. The 8th bit,
since data was already handled in 8-bit words, could be used for
parity and thus detect transmission errors.

Letter | 7-bit ASCII | Bits at 1 | 8-bit ASCII |
---|---|---|---|
B | `1000010` | 2 | `01000010` |
o | `1101111` | 6 | `01101111` |
n | `1101110` | 5 | `11101110` |
j | `1101010` | 4 | `01101010` |
o | `1101111` | 6 | `01101111` |
u | `1110101` | 5 | `11110101` |
r | `1110010` | 4 | `01110010` |

The message `Bonjour`, although only using 49 bits for its
*encoding*, will actually use 56 for its storage and
transmission. You can think of those extra 7 bits as a form of hashing,
but since they are spread throughout the message, it is not practical to
extract and use them *globally*.
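The table above can be reproduced with a short sketch (Python; the `with_parity` helper name is mine):

```python
def with_parity(ch: str) -> str:
    """8-bit ASCII: the 7-bit code prefixed with its (even) parity bit."""
    bits = format(ord(ch), "07b")
    parity = bits.count("1") % 2   # 1 when the count of 1s is odd
    return str(parity) + bits

for ch in "Bonjour":
    print(ch, with_parity(ch))  # B 01000010, o 01101111, n 11101110, ...
```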

The schism of 1981 was triggered when heretics decided to use this bit to produce blasphemous 256-character extended ASCII. Each new sect then produced its share of new characters, not always compatible with each other: IBM with EBCDIC, Europeans with ISO-8859-1, some Westerners with windows-1252, … After years of encoding disputes, and the use of conversion tools (e.g. `iconv`), 1991 marked the end of the war with an ecumenical proposal for the reunification of peoples: Unicode, which would take decades to prevail. Since then, things have settled down and we can finally exchange emoticons without fearing that they will be badly decoded. For their interpretation, on the other hand, nothing has been decided.

# Parity Words

In some systems, rather than computing a parity bit for each word
individually, parity is computed *globally*, by taking a bitwise
*xor* of all the words. The result is called the *parity
word*.

With an ASCII encoding, one could compute an xor of all the letters
and thus produce a 7-bit parity *word* as well. You can think of
it as doing an *xor* on each column. This is then a new ASCII
character that can be added to the message.

Letter | 7-bit ASCII |
---|---|
B | `1000010` |
o | `1101111` |
n | `1101110` |
j | `1101010` |
o | `1101111` |
u | `1110101` |
r | `1110010` |
A | `1000001` |

That way, `Bonjour`, using 7 characters, can
*usefully* be supplemented with an 8th control
character: `A`. This character can be considered a hash of
the original message. Well, in real life, we don't use this technique on
characters but on much longer words.
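A quick check (Python) of this column-by-column *xor* over the 7-bit codes of `Bonjour`:

```python
from functools import reduce

# Parity word: the xor of the 7-bit ASCII codes of all the characters.
codes = [ord(ch) for ch in "Bonjour"]
parity_word = reduce(lambda a, b: a ^ b, codes)
print(format(parity_word, "07b"), chr(parity_word))  # 1000001 A
```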

**In Real Life, in RAID.** This method is used when
connecting hard disks in parallel to distribute the data. In *RAID
3* (but also 5, 6, …), one of the disks contains the bitwise
*xor* of the other disks. This *control disk* has the
advantage of making it possible to completely rebuild a failed disk.

It works because the *xor* on bits forms a commutative group, a
somewhat special one in which every element is its own inverse. In our case,
let's say you have your
$n$
disks (noted
$a_i$)
and your control disk (noted
$c$)
which is the *xor* of the others; formally it looks like this:
$\begin{split}
a_1 \oplus a_2 \oplus ... \oplus a_n & = c \\
\end{split}$ Since each data is its own inverse, we
can also consider that the xor with the
$n+1$
disks gives zero:
$\begin{split}
(a_1 \oplus a_2 \oplus ... \oplus a_n) \oplus c & = 0 \\
a_1 \oplus a_2 \oplus ... \oplus a_n \oplus c & = 0 \\
\end{split}$ As the operation is associative and
commutative, we can compute the *xor* in any order, and
can therefore number the disks as we wish. If a disk fails (any disk,
but I'll name it
$a_1$
for simplicity), you can reorder the list and then reverse the previous
equations:
$\begin{split}
a_2 \oplus ... \oplus a_n \oplus c \oplus a_1 & = 0 \\
a_2 \oplus ... \oplus a_n \oplus c & = a_1 \\
\end{split}$ Et voilà 😃, you can rebuild your disk.
In fact, once the *xor* is done, mathematically, the disks become
equivalent: each one controls all the others.
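As a toy model (Python; random 16-byte "disks", the helper name is mine), the reconstruction boils down to xoring the survivors with the control disk:

```python
from functools import reduce
import os

def xor_bytes(*blocks: bytes) -> bytes:
    """Bitwise xor of same-length byte blocks, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three data "disks" and their xor control disk.
disks = [os.urandom(16) for _ in range(3)]
control = xor_bytes(*disks)

# Disk 0 "fails": xoring the survivors with the control disk rebuilds it.
rebuilt = xor_bytes(disks[1], disks[2], control)
assert rebuilt == disks[0]
```

Note that xoring *all* the disks together, control included, gives zero, which is exactly the equation above.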

# Sums

To continue, rather than a *xor* of the words of the message,
we can use a sum of those words. We then approach the
notion of *checksum*.

Since we work with *words of fixed size*, the result of the
sum can be *longer* than the words (because of the carries). It
is then necessary to choose how to manage these *overflows*:
throwing them away silently is often the simplest, but as we will see,
some people prefer to keep them and do something with them.
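A minimal sketch (Python, with a hypothetical `checksum8` helper) of such a sum on 8-bit words, with the overflow silently thrown away:

```python
def checksum8(data: bytes) -> int:
    """Sum of the 8-bit words, overflow discarded (result kept on 8 bits)."""
    return sum(data) % 256

print(checksum8(b"Bonjour"))  # 223 (the full sum, 735, overflows twice)
```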

**In Real Life, IPv4.** When your computers communicate
over the *old-fashioned* network, they forge *IP packets*
containing, in their *header*, the IP address of your machine,
that of the destination, and a whole lot of other cool data for doing
lots of fun things.

Among these headers: the *checksum*. To compute it, we cut the
header into 16-bit words (i.e. two bytes each), which are then added. Here,
whatever overflows is treated as an additional word to be added in turn
(and so on as long as there is a carry, but it ends quickly). This is
called a *1's complement addition*. The result is then inverted (the
$1$s
become
$0$s
and vice versa): this is the *1's complement*, hence some
occasional confusion. The result is finally inserted in the right place
in the message.

The point of inverting this *checksum* is that the routers
receiving the message can simply compute a checksum of the
complete header and check that the result is zero.
Mathematically, the principle is similar to the *xor* of the
hard drives above.

**For example:** Here is a network capture during a ping
request (technically, an *ICMP Echo Request*) from my machine
(*192.168.1.149*) to *google.fr*
(*216.58.213.131*). The part concerning the IP protocol is
selected and contains, among other headers, the checksum highlighted in
blue (here, `0x6917`).

If we want to make sure that the header is correct, we must therefore add all the 16-bit words (i.e. 4 hexadecimal characters each) of the IPv4 header, without taking into account the other headers (here Ethernet and ICMP) nor the data (the content of the ICMP request), all while handling the complement subtleties described above.

The following table shows you this calculation:

Header (value) | Hex |
---|---|
Version (4), header size (5), type (0) | 4500 |
Total size (60) | 003C |
ID | 61AE |
Flags and offset (0) | 0000 |
TTL (128) and protocol (1 for ICMP) | 8001 |
Checksum | 6917 |
Source (first half, 192.168) | C0A8 |
Source (second half, 1.149) | 0195 |
Destination (first half, 216.58) | D83A |
Destination (second half, 213.131) | D583 |
Total (with carries) | 3FFFC |
Carry alone | 3 |
1's complement addition | FFFF |
1's complement | 0000 |

And as things are well done, this *1's complement of the
1's-complemented total* is indeed zero, and the packet is valid. No
surprise, since this request did receive a response (or rather an
*ICMP echo reply*).

Note that ICMP also inserts a *checksum* computed in exactly
the same way on its own headers.
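To make the computation concrete, here is a sketch (Python) of this 1's complement checksum applied to the header words from the table above. Over a valid header, checksum field included, it yields zero; computed by the sender with the field zeroed, it yields the value seen in the capture:

```python
def ipv4_checksum(words):
    """1's complement addition of 16-bit words, then 1's complement of the total."""
    total = sum(words)
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

header = [0x4500, 0x003C, 0x61AE, 0x0000, 0x8001,
          0x6917, 0xC0A8, 0x0195, 0xD83A, 0xD583]
print(hex(ipv4_checksum(header)))  # 0x0: the packet is intact

# Sender side: the checksum field is set to zero before computing.
no_ck = header[:5] + [0x0000] + header[6:]
print(hex(ipv4_checksum(no_ck)))   # 0x6917
```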

In IPv6, things have changed and there is no more checksum. The engineers noted that integrity checks were already performed by the lower layers, called *link* layers (Ethernet or WiFi, for example), but also by the upper layers, called *transport* layers (TCP and UDP), which both have a *checksum* (computed exactly as for IP and ICMP) but whose scope also overflows onto part of the IP headers. To be clean, TCP and UDP should have been fixed so that they no longer checksum the IP headers (being two different layers, they should not depend on each other), but changing a protocol is very expensive, and since a solution had to be found for the shortage of IPv4 addresses (we are in the 90s), only the IP layer was changed, producing IPv6.

# Decimal divisions

The problem with the previous methods is that if words are swapped into another order, the change is not detected. That is the principle of commutative operations: it helps us a lot for hard disks, but it backfires in other cases.

Historically, these new methods were developed for digital data that was often manipulated by hand. Where the previous ones aim to detect an electronic error (a flipped bit), here we try to detect human errors: transpositions, mistyped digits, forgotten digits…

To detect these *displacement* errors, the idea is to
replace the notion of *sum* by that of *division*, keeping
the *remainder* to be exact (mathematically, we speak of
*modulo*), because it is the most *sensitive to initial
conditions*: a small change in the data will normally produce a
different remainder.

The trick is therefore to find the divisors that maximize the detected errors: changes of individual digits anywhere (and sometimes groups of digits) and inversions between two digits (side by side or at a certain distance).

**In Real Life, the national identification number.**
This number, which makes it possible to identify a person and attach
rights and duties to them, is often constructed from civil status data.

**In France,** we use one digit for gender (1 for men, 2 for women), four for the date of birth (2 for the year, 2 for the month), five for the place of birth (2 for the department, three for the city) and finally 3 for the serial number.

**In Belgium,** 6 digits are used for the date of birth (2 for the year, 2 for the month, 2 for the day), then three for the serial number (with an odd counter for boys and an even one for girls).

For these two countries, a *validation key*
($C$)
on two digits then follows. This key is the *97's complement* of the
remainder of the division by 97 of the number
($N$)
formed previously:
$\begin{split}
C & = 97-(N \mod 97)
\end{split}$

To be completely exact, in Belgium, if you were born after the year 2000, a 2 is added at the beginning of the number, but only for the computation: this 2 does not officially appear in your number.

**For example.** If we take the example social security
number from the
Wikipedia page: *Nathalie Durand*, 157th baby born
in May 1969 in a non-existent town in Maine-et-Loire.

Her registration number is therefore *2 69 05 49 588 157*. If
we compute the integer division by 97, we get the following result:

$\begin{split} 2,690,549,588,157 & = 27,737,624,620 \times 97 + 17 \end{split}$

That is a remainder of 17. The key being its complement to 97, we obtain a key of 80:

$\begin{split} 97-17&=80 \end{split}$
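The computation fits in a one-line sketch (Python; the `key97` name is mine):

```python
def key97(n: int) -> int:
    """Validation key: the 97's complement of n modulo 97."""
    return 97 - (n % 97)

print(key97(2_690_549_588_157))  # 80, Nathalie Durand's key
```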

There is also a modulo 97 in the ISO 7064 standard, used among other things for international bank account numbers (IBAN). After some manipulation (including inserting a null key), we compute the *98's complement* of the remainder of the division by 97: $\begin{split} C&=98-(N\mod 97)\\ \end{split}$

The advantage of this method is, as with IPv4, that verification is done by computing a key over the given data (which therefore already contains a valid key). The advantage of using 98 is that a computed key is always at least 1, and thus differs from the empty key inserted during the computation.

# Binary divisions

With binary data, we could keep the same principle, but the length
of the numbers and especially the management of carries become difficult
to handle. To *simplify* all that, we replace the sum by a
*xor*, which has the good taste of having no carries. Whether
programmed or wired, it is much more practical.

Mathematically, we change the structure used for the computations: instead of the usual ring $(Z/nZ, +, \times)$, we use $(Z/nZ, \oplus, \times)$. The xor replaces the sum.

On the other hand, even if the principle of the computation remains the same, since one of the two basic operations has changed, the results are different. See for yourself in the following table, where the first column is in decimal (with the asterisk $*$ standing for either operator) and the following columns are in binary, one with the sum, the other with the xor.

Decimal (with $*$) | $(Z/nZ, +, \times)$ | $(Z/nZ, \oplus, \times)$ |
---|---|---|
$3 \times 3$ | $11 \times 11$ | $11 \times 11$ |
$3 \times (1 * 2)$ | $11 \times (1 + 10)$ | $11 \times (1 \oplus 10)$ |
$(3 \times 1) * (3 \times 2)$ | $(1 \times 11) + (10 \times 11)$ | $(1 \times 11) \oplus (10 \times 11)$ |
$3 * 6$ | $11 + 110$ | $11 \oplus 110$ |
$9$ or $5$ | $1001$ | $101$ |

Likewise, division keeps the same meaning: compute the inverse of the multiplication. But its computation (and therefore its result) changes. You can test this by writing out your divisions as in primary school.

And as with the *validation keys* of identification numbers,
the whole art of the computer scientist is to choose a divisor that will
maximize the errors detected. Among other things, a divisor using
$n$
bits will always give a remainder that fits on
$n-1$
bits: if you want a 32-bit remainder, you need a 33-bit divisor.

**In Real Life, CRCs.** These *cyclic redundancy
codes* were invented in the 60s and follow this principle:
*divide the data*, keep the remainder, and check that it
matches the one transmitted.

When we talk about CRCs, we associate binary data with polynomials, each bit of the message being the coefficient of the corresponding polynomial term. Here are some examples:

Data | Polynomial |
---|---|
$1$ | $1$ |
$10$ | $x$ |
$101$ | $x^2 + 1$ |
$10110110$ | $x^7 + x^5 + x^4 + x^2 + x$ |

Rather than using the ring $(Z/nZ, \oplus, \times)$, we then use that of polynomials with Boolean coefficients $(\mathbb{B}[x], \oplus, \times)$. Computations and data are then expressed in terms of polynomials. The chosen divisor can be thought of as a generator of all valid messages, and that sort of thing.

But as it works the same way, we can stay with numbers without introducing $x$.

As there is a plethora of possible divisors (many of which are
actually used), the convention is to name each one
*CRC*, followed by the size of the remainder (*e.g.* 32
when it is 32 bits), followed by the *name* of the divisor when it
is not the *default* one.

By convention, since every divisor starts with a `1` (it has one bit more than the size of the remainder), that leading bit is neither written down nor mentioned. In fact, the computation can even be done without this 1…

For example, **CRC-3-GSM** is used in mobile networks.
Its divisor is named *3* (in base 10; in binary that gives
`011`, and with the implicit leading bit we actually divide by
`1011`).
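A sketch (Python, illustrative names) of this binary division: a long division where each subtraction is a *xor*, shown with the CRC-3-GSM divisor `1011`:

```python
def mod2_remainder(dividend: int, divisor: int) -> int:
    """Long division where the subtraction is a xor (no carries)."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend

def crc(message: int, divisor: int) -> int:
    """Append as many 0 bits as the remainder size, keep the remainder."""
    return mod2_remainder(message << (divisor.bit_length() - 1), divisor)

r = crc(0b1101, 0b1011)
print(format(r, "03b"))  # 001

# The message followed by its CRC divides evenly: the check yields 0.
assert mod2_remainder(0b1101 << 3 | r, 0b1011) == 0
```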

Another, much more *famous*, example is
**CRC-32**, which uses the default divisor
`0x04C11DB7`. It is used, among other things, in Ethernet
frames (your RJ45 cables), SATA, *gzip* archives and *PNG*
images.

This CRC-32 was also used as the *integrity check* of the
*WEP* protocol, which was supposed to protect access to WiFi
networks. Among its weaknesses: a CRC is not suited to an adversarial
context, since an attacker can forge matching CRCs for his
modified packets.

# And now?

*Hash functions* are in fact very numerous, as are their
uses. For example, the preceding *ping google* actually
involves at least three of these functions to check the integrity of
the transmitted and received packet:

- a **CRC-32** at the end of the Ethernet frame,
- a **1's complement of a 1's complement sum** in the IP header,
- another **1's complement of a 1's complement sum** in the ICMP header.

Similarly, a health insurance *reimbursement* will use one
control key for your IBAN and another for your identification number.
Since these payments rely on *computer and telecommunications
technologies*, I'll let you imagine the number of *checksums*
computed between the moment you visit your doctor and the moment you see
the reimbursement on the web interface of your online bank…

And these are just checksums; to go further, you can take a look at cryptographic fingerprints…