What is Entropy?

Spoiler: when you take an interest in computer security, there always comes a time when you encounter it, exotic and mysterious. Treat it with respect to make it your ally, or risk having it turn against you.

After risking the Apocalypse by explaining the difference between cipher and encryption, and with no fear of the consequences, today I am tackling one of the sexiest and most fundamental buzzwords in computer security: “entropy”…

“Nobody knows what entropy really is.”

John von Neumann, quoted in Information Theory and Evolution

Entropy was initially defined in thermodynamics to quantify the disorder of a system, and the second law states that it can only increase. It thus evokes both chaos theory and the fundamental secrets of the Universe. The kind of term used by the real pros, those who see the matrix beyond the fabric of reality.


In computer science, we also use entropy to quantify disorder, except that we speak instead of “information uncertainty”. It is therefore used whenever we want to assess passwords, cryptographic keys or session identifiers, or when we try to identify the type of some content.

It is a fundamental and fascinating concept (if you like this kind of thing) but, paradoxically, it is rarely taught. You will therefore hear it used more or less appropriately by impostors who have not really understood what it is, but who want to shine in society by impressing neophytes.

Note: I had already written an article on the subject in 2020 to explain what it is about, but while rereading it (in order to translate it into English) I realized that it was still too theoretical and deserved a rewrite to make it more accessible. Here it is…

What is information?

To quantify something, you need to know what it is. You might think the answer is trivial, but it deserves clarification to avoid the common mistake of confusing “information” with “data”.

To put it simply (because this subject deserves a full-fledged article), the core difference is at the heart of The Hitchhiker’s Guide to the Galaxy. But since I feel you are in a hurry, here is a summary of this parable:

Deep Thought, the second greatest computer of all time, took almost seven and a half million years of intensive calculations to finally provide the ultimate answer: “42”.

Faced with the general incomprehension and frustration of the public, Deep Thought clarified that the problem was that they had never really understood the question and that a second computer could provide it.

After many adventures, Arthur Dent finally discovered it: “Six times nine? Forty-two.”, which thus resolves the big question about life, the Universe and everything else.

Thus, when Deep Thought provides the answer “42”, it is obviously incomprehensible because without context, we do not know what it is (e.g. a French county, a temperature, a measurement…). This data does not inform us in the least.

It is only by finding the question that Arthur Dent gives meaning to 42, which is here the solution of an equation ($x = 6 \times 9$). It is in this relation to a question that a piece of data becomes a piece of information.

Intuitive amount of information

We will go into the details later but, for the moment, you can already feel that not all answers are equivalent: they do not all inform you equally. Some contain more information than others.

How are they distinguished? The key factor is the number of alternatives. The more possible answers a question has, the more each of them will inform you.

For example, suppose you wonder where a person was born. I can give you the name of the French county (90 alternatives), but that will not get you very far. You would know more with the name of the town (36,528 alternatives) and even more with the postal address (25 million alternatives in the National Address Base).

In the same way, chaining questions makes it possible to accumulate the information of the successive answers. If, after telling you the place, I tell you the date, then whether it was a vaginal delivery, and so on, you would have more and more information about this event.

What scale of measurement?

Since the quantity of information lies in the number of possibilities, could we use this number of alternatives directly as the quantity of information?

For a single question, it is not too bad. Between the county and the year of birth, we feel that these answers are roughly equivalent, and indeed the numbers of possibilities are close (90 counties and 100 years).

First problem, to warm up: answers that refine one another. If, after telling you the county, I tell you the town of birth and then the address, we go from 91 counties to 36,529 towns and then 25,000,000 addresses. Did you feel like you got 36,438 more pieces of information when I specified the town? Then 24,963,471 more with the address?

Second problem (which will make it possible to define a relevant scale): different questions that combine. If, after telling you the county, I tell you the year of birth, we go from 90 possible counties to 9,000 possible pairs of county and year. Do you feel like you have multiplied your informational capital by 100?

The trick is that when you ask several independent questions, each new answer multiplies the number of possibilities, whereas intuitively it feels rather like the information adds up.

Luckily, there is a math function that turns products into sums: the logarithm.

$\log(a \times b) = \log(a) + \log(b)$

By using the logarithm of the number of possibilities (rather than this number directly), the amount of information of each answer now adds up with the amount of information of the previous answers.

For example, here are the amounts of information obtained by taking the logarithm of the number of alternatives:

Answer            Alternatives   Information (base-10 log)
County            90             1.95
Town              36,529         4.56
Address           25,000,000     7.40
Year              100            2.00
County and year   9,000          3.95
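
If you want to check these values yourself, here is a minimal Python sketch that reproduces the table above (the counts are simply the ones given in the text):

```python
import math

# Number of alternatives for each answer (the counts used in the table above)
alternatives = {
    "County": 90,
    "Town": 36_529,
    "Address": 25_000_000,
    "Year": 100,
    "County and year": 9_000,
}

for answer, count in alternatives.items():
    # Amount of information = logarithm (base 10 here) of the number of alternatives
    print(f"{answer:>16}: {math.log10(count):.2f}")
```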

To show you that these amounts of information are more intuitive, here is what they give for the previous two problems:
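
First problem (answers that refine one another): going from the county (1.95) to the town (4.56) only adds 2.61, and going from the town to the address (7.40) adds another 2.84, which feels much more reasonable than “36,438 more pieces of information”.

Second problem (questions that combine): the county (1.95) plus the year (2.00) gives 3.95, which is exactly the amount of information of the pair “county and year”, since log(90 × 100) = log(90) + log(100).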

Unit and coding

The next step is to find a unit of measurement. It is a matter both of giving meaning to the measure and of providing standards for comparison.

For example, weight tells you how heavy something is, and if I say 1 kg you can compare it to a carton of milk. If I tell you that an object is 2 m away, that tells you how far it is and you can visualize the length of a table.

Intuitively, the amount of information corresponds to the effort it takes to remember it, the place it takes in our memory and, since we invented writing, the place it takes to write it down so as to avoid having to remember it.

Hence the following, more formal definition:

The amount of information is the number of characters needed to encode it.

And the nice thing is that the logarithm does exactly that: it counts the number of characters needed to encode a given number of alternatives.

To convince yourself of this, count the numbers that you can write with digits. If you use 2 digits you can write 100 numbers (00 to 99). More generally, if you use $n$ digits, you can write $10^n$ numbers. The logarithm, by definition, is the inverse calculation: knowing the result of the previous calculation (the number of coded elements, here $10^n$), it finds the exponent (the number of digits used, here $n$).

In the previous examples, base 10 gave us measures in digits, well suited to the way we usually write these things down.

We can also use other bases for the logarithm, corresponding to other alphabets. For example, in base 26 we calculate the quantity of information in letters and in base 2, the quantity in bits.
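
As a quick illustration (a minimal Python sketch, reusing the number of towns from before), the only thing that changes between these units is the base of the logarithm:

```python
import math

towns = 36_529  # number of alternatives for the town of birth

print(f"digits : {math.log(towns, 10):.2f}")  # base 10 -> decimal digits
print(f"letters: {math.log(towns, 26):.2f}")  # base 26 -> letters of the alphabet
print(f"bits   : {math.log(towns, 2):.2f}")   # base 2  -> bits
```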

Biased information

It is now time to consider another intuition that I had overlooked: that for the same question, within the same group of alternatives, certain answers seem to carry more information than others.

For example, let’s take the town of birth. Or, to be exact, the municipality of residence of the mother for children born from 2014 to 2021, because that is the statistic published by INSEE.

If I answer Paris, it does not surprise you much (219,911 births counted), and you would have had more information if I had said Trouville (51 births), or even more with Gouy-les-Groseillers (only 1 birth).

The idea is that, out of the 6,145,484 possible births (answering the question “who is this person born between 2014 and 2021?”), the mother’s place of residence only partly answers the question: it leaves a margin of uncertainty, a number of alternatives still to be determined, and therefore information that you still lack to know everything.

Mathematically, we follow this intuition: we take the total available information (the logarithm of the number of births) and subtract from it the unknown information that remains (the logarithm of the number of births in the municipality).

The following table gives you the values (in bits) for the towns in the example:

Municipality           Number of births   Unknown information   Information obtained
Paris                  219,911            17.7 bits             4.8 bits
Trouville              51                 5.7 bits              16.9 bits
Gouy-les-Groseillers   1                  0.0 bits              22.6 bits

And here again there is a link with the coding of information. Compared to the previous 5-digit code (i.e. about 16 bits), the whole can be compressed by using these new measures to shorten the coding of the most frequent municipalities (5 bits for Paris, which saves 11 bits each time) and lengthen that of the rarest ones (23 bits for Gouy-les-Groseillers, but it will only be used once).

An important detail: formally, we speak of the information of an event (e.g. “being born between 2014 and 2021 to a mother living in Trouville”) and, rather than computing it as the difference between the total information and the residual information, we express it via the probability of the event (denoted $p_x$):

$I(x) = -\log p_x$

But logarithm magic makes this exactly the same as our previous calculation, which used the total number of alternatives (denoted $n$) and the number of alternatives remaining once the answer is known (denoted $n_x$):

$I(x) = -\log p_x = -\log (n_x / n) = \log n - \log n_x$

You can therefore use either point of view; they are equivalent.
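
Here is a minimal Python sketch that checks this equivalence numerically, using the birth counts quoted above:

```python
import math

total_births = 6_145_484  # births from 2014 to 2021 (figure quoted above)
births = {"Paris": 219_911, "Trouville": 51, "Gouy-les-Groseillers": 1}

for town, n_x in births.items():
    p_x = n_x / total_births
    from_probability = -math.log2(p_x)                      # I(x) = -log2(p_x)
    from_counts = math.log2(total_births) - math.log2(n_x)  # log2(n) - log2(n_x)
    print(f"{town:>20}: {from_probability:.1f} bits (= {from_counts:.1f} bits)")
```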

Entropy, the average of the answers

We now have all the tools necessary to approach the notion of entropy.

It is quite simple: rather than asking how much information there is in one answer, we ask how much information we can expect from asking the question. For that, we compute the weighted average of the information of the answers (mathematicians say “expected value”):

$H = \sum_x p_x I(x) = - \sum_x p_x \log p_x$

For example, if we take the place of birth (or rather the place of residence of the mother of children born between 2014 and 2021), we can expect 12 bits of information if we ask for the name of the municipality and 6.15 bits if we ask for the county. Sometimes we are lucky and get more (e.g. Gouy-les-Groseillers), sometimes we get less (e.g. Paris), but in the long run we converge towards this average.
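
The computation is a direct translation of the formula. Here is a minimal Python sketch (with a toy distribution; computing the 12 bits above would require the full INSEE table of births per municipality, not reproduced here):

```python
import math

def entropy(counts):
    """Expected information of the answer, in bits: the weighted average
    of -log2(p_x) over all possible answers."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

# Toy example: four equally likely answers carry 2 bits on average,
# while a heavily biased question carries much less.
print(entropy([25, 25, 25, 25]))  # 2.0
print(entropy([97, 1, 1, 1]))     # about 0.24
```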

If you had to guess something and could choose between several questions, the one with the highest entropy would, on average, give you more information than the others. Some say it is of better quality.

And now?

Entropy was originally defined to measure the amount of information emitted by a transmitter and to determine the information rate of that source. In particular, entropy provides a lower bound on the average size of the transmitted messages (with fewer characters we would lose something).

For example, even if we can encode the names of the towns with 5 digits, or 15.15 bits, if we know that it is the mother’s place of residence we can go down to an average of 12.15 bits per town. That is a gain of about 20% on average. But we cannot do with fewer bits without losing information.

To come back to computer security, entropy also makes it possible to measure the quality of a source of randomness, because it is maximal when the messages are equally probable and decreases as soon as some are more frequent than others.

Be careful, however: low entropy is the sign of a bad generator, but the converse is not true; rotten generators can have good entropy without being of any quality.
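
To illustrate, here is a minimal sketch: a trivial, completely predictable “generator” (a plain counter) whose output nevertheless has an empirical byte entropy at the maximum of 8 bits per byte. A high measurement on its own therefore proves nothing about quality.

```python
import math
from collections import Counter

def empirical_entropy(data: bytes) -> float:
    """Entropy of the byte frequencies, in bits per byte (8 is the maximum)."""
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A rotten "generator": a plain counter. Anyone can predict the next byte,
# yet the byte frequencies are perfectly uniform.
predictable = bytes(i % 256 for i in range(65_536))

print(f"{empirical_entropy(predictable):.2f} bits/byte")  # 8.00, looks perfect
```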

A small clarification about these generators and their randomness pool. From the user’s point of view, the pool is a source, so we can talk about the entropy present in the pool (the uncertainty about what it will provide next). But from the system’s point of view, the bits are already there, ready to go out, and we should therefore speak of the information present in the pool. Usage has favored the user’s point of view.

We can then speak of the entropy of a password generator, a key generator or a session identifier generator, which tells us how hard it will be to guess the values these generators will provide.
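
For example (a minimal sketch, assuming a generator that draws each character uniformly and independently from its alphabet), this entropy depends only on the alphabet size and the length:

```python
import math

def generator_entropy(alphabet_size: int, length: int) -> float:
    """Entropy, in bits, of a generator drawing `length` characters
    uniformly and independently from an alphabet of `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)

print(generator_entropy(10, 4))   # 4-digit PIN: about 13.3 bits
print(generator_entropy(62, 12))  # 12 alphanumeric characters: about 71.5 bits
```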

On the other hand, talking about the entropy of a password is an abuse of language that forgets that there is a generator in the story.

Where cipher versus encryption allows the elites to distinguish themselves from the plebs, entropy allows professionals to detect the impostors.