Anonymize IP addresses for statistics

I don't know about you, but I love doing statistics. Since IP addresses are personal data, doing this properly means anonymizing them first.

Even without wishing to monitor your users individually, it is always useful to be able to compute some statistics on their activity, whether on a website to get an idea of what works and where visits come from, or on a network to find the most used resources, the files accessed, etc.

But to respect your users (their personal data and, incidentally, the GDPR), some precautions must be taken by anonymizing this data.

Erase what is recognizable. FotografieLink @ pixabay

Today, following up on the retrieval of web server access logs, I am offering you a very simple method to anonymize the IP addresses in your logs.

If you don't want to worry about the why and how, two sed rules are enough to keep only the first two numbers of IPv4 addresses and the first three groups of IPv6 addresses.

Constraints

You could rely on common sense and anonymize your data alone in your corner, but since there are legal and regulatory constraints, it is worth taking a look at them first. It would be a shame not to comply.

First and foremost, it should be understood that the GDPR, like the laws that preceded it, does not use anonymization as a way out of its scope but as a constraint that applies to certain processing operations.

Data acquisition. You must therefore have a legal basis to collect the data. Either you have requested free and informed consent from your users, or you have a legal obligation to hold this data. That’s up to you and that’s off the topic of the day.

In our case, OVH has an obligation to build these access logs and we have a legitimate interest in accessing the previous day's logs to diagnose and correct problems.

Compatibility with initial processing. If users have not consented to this particular processing, you must evaluate its compatibility with the processing to which they did consent.

In our case, the European Union has already provided for it with a specific paragraph which says that the statistics are not incompatible with the initial purposes (Paragraph 1.b of article 5).

Anonymization. We therefore have the right to use our data for statistics, but while respecting certain constraints. The article is a little longer, but for what concerns us today, the following sentence will guide us:

Where those purposes can be fulfilled by further processing which does not permit or no longer permits the identification of data subjects, those purposes shall be fulfilled in that manner.

Paragraph 1 of article 89

We will therefore have to avoid any identification of the persons concerned. As you had anticipated, anonymization therefore meets this need.

TheDigitalArtist @ pixabay

Anonymization criteria. To ensure that users are no longer identifiable, data cannot be anonymized just any old way. The G29 (the Article 29 Working Party, which brings together the European CNILs) has already worked on the subject and provided a guideline containing the following three criteria:

  1. Individualization: is it still possible to isolate an individual?
  2. Correlation: is it possible to link separate sets of data concerning the same individual?
  3. Inference: can we deduce information about an individual?

If your data passes all three criteria, it is anonymous and you are fine. If even one fails, it will be necessary to do a detailed analysis of the risks of re-identification.

My solution

With all these constraints in mind, we can move forward: we have the right to use our logs for statistics, but we must anonymize all personal data. In our case, the IP addresses (since these are the only personal data present).

IP version 4

Since IP version 4 addresses are widely used, there are already official guides.

The IP address used to geolocate the Internet user should not be more precise than the scale of the city. Concretely the last two bytes [16 bits out of 32] of the IP address must be deleted.

The CNIL, solution for audience measurement cookies

If you have followed my last articles on the geolocation of IP addresses (the theory and the practice), you will have noticed that, with the shortage of IP addresses, Regional Internet Registries began to allocate smaller and smaller networks. The tiniest have 1024 addresses, hence a 22-bit prefix, 6 bits more than the 16 allowed by the CNIL...
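The arithmetic behind that 22-bit figure can be checked with a few lines of shell. This is only a sketch: the variable names are mine, and the only input is the block size of 1024 addresses quoted above.

```shell
# Size of the smallest allocation mentioned above.
block_size=1024

# Count the host bits, i.e. log2(block_size), by repeated halving.
host_bits=0
n=$block_size
while [ "$n" -gt 1 ]; do
    n=$((n / 2))
    host_bits=$((host_bits + 1))
done

# An IPv4 address has 32 bits; what is not host bits is prefix.
prefix_len=$((32 - host_bits))
echo "host bits: $host_bits"
echo "prefix length: /$prefix_len"
echo "bits beyond the 16 kept by the CNIL rule: $((prefix_len - 16))"
```

With 10 host bits, the prefix is 32 - 10 = 22 bits long, so truncating to 16 bits merges 2^6 = 64 such networks into one.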

If you want more precise geolocation, you may need to do it before anonymizing the IP addresses, since anonymizing first costs geographic accuracy.

Coming back to our case, here is a sed command to anonymize IPv4 addresses in web access logs.

sed -r -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /"

Basically, I anchor at the beginning of the line with ^ and capture the first two numbers of the IP address with (([0-9]+\.){2}). I then drop everything that follows up to the first space with [^ ]+ (the space delimits the fields in the access logs). The replacement restores the first two numbers with \1, then I add 0.1 to replace the deleted bytes (sed back-references are a single digit, so \10 reads as \1 followed by 0).
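To see the rule at work, here is a minimal sketch that applies it to a single log line. The line itself is made up for illustration (203.0.113.42 belongs to a range reserved for documentation), and the variable names are mine.

```shell
# Hypothetical access-log line; 203.0.113.42 is a documentation address.
line='203.0.113.42 - - [10/Oct/2023:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326'

# Same rule as above: keep the first two numbers, replace the rest with 0.1.
anonymized=$(printf '%s\n' "$line" | sed -r -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /")

printf '%s\n' "$anonymized"
```

The address at the start of the line becomes 203.0.0.1 and the rest of the line is left untouched.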

IP version 6

Everyone has dragged their feet so much on migrating to IPv6 that we never think about it. So much so that the CNIL has not given any concrete criteria on the matter.

So I turned to RFC 2374, which explains how IP address blocks should be split. RIRs allocate 48-bit prefixes to ASNs and let ASNs use the following 16 bits as they wish, to identify their access points.

Note that, I do not know for what reason, certain allocations in the delegation files have 64-bit long prefixes, 16 more than the recommended 48.

We can therefore restrict ourselves to the first 48 bits (i.e. 6 bytes, or three groups of 4 hexadecimal characters). We will lose a little precision in borderline cases, but it remains compatible with city-level precision. If accuracy is important to you, geolocate before anonymizing.

So here is the sed rule for these addresses. The options are the same as before.

sed -r -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /"

I start at the beginning of the line with ^, then I capture at most three hexadecimal groups with (([^:]{1,4}:){1,3}) and drop the rest up to the next space with [^ ]+. In the replacement, I keep the prefix with \1 and, to get a valid address, I append :1, which yields the first address of the block.
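Again, a small sketch on one invented log line shows the effect. The address comes from 2001:db8::/32, the prefix reserved for documentation, and the variable names are mine.

```shell
# Hypothetical access-log line; 2001:db8::/32 is reserved for documentation.
line='2001:db8:85a3:8d3:1319:8a2e:370:7348 - - [10/Oct/2023:13:55:36 +0200] "GET / HTTP/1.1" 200 512'

# Same rule as above: keep the first three groups, replace the rest with :1.
anonymized=$(printf '%s\n' "$line" | sed -r -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /")

printf '%s\n' "$anonymized"
```

The captured prefix 2001:db8:85a3: followed by :1 gives 2001:db8:85a3::1, which is a valid compressed IPv6 address.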

A script

Finally, the ideal is still to encapsulate this method of anonymization in a script. We can then forget how it works and use it to anonymize files (it will be opaque).

Here, I did not try to make it very complicated, each file passed as an argument is passed through the sed mill which will anonymize the IP version 4 and 6 addresses.

Note that since we replace with a syntactically valid address, we can run the script on the files as many times as we want. From the second run on, nothing changes, which is reassuring: if we no longer know whether a file has already been anonymized, we can always run the script again; at worst, it will do nothing.

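#!/bin/sh

# Anonymize, in place, the client address at the start of each line
# of every log file passed as an argument ("$@" is quoted so that
# file names containing spaces survive).
for i in "$@"; do
    echo "-- Anonymize - $i"
    sed -i -r \
        -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /" \
        -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /" \
        "$i"
done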
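The claim that a second pass changes nothing can be checked with a short sketch. It assumes GNU sed (for -i and -r); the helper name anonymize_file and the two sample log lines are invented for the test.

```shell
# Hypothetical helper applying the two rules in place to one file.
anonymize_file() {
    sed -i -r \
        -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /" \
        -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /" \
        "$1"
}

# Temporary log with one IPv4 and one IPv6 line (documentation addresses).
log=$(mktemp)
printf '%s\n' \
    '203.0.113.42 - - "GET / HTTP/1.1" 200' \
    '2001:db8:85a3:8d3:1319:8a2e:370:7348 - - "GET / HTTP/1.1" 200' > "$log"

anonymize_file "$log"
first_pass=$(cat "$log")

anonymize_file "$log"
second_pass=$(cat "$log")

# The second pass should leave the file exactly as the first one did.
if [ "$first_pass" = "$second_pass" ]; then
    echo "second pass changed nothing"
fi

rm -f "$log"
```

This works because the replacement addresses (ending in 0.1 and ::1) are themselves matched by the rules and rewritten to the same value.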

Now, all we need is to add a call to this script after retrieving the log files.

And after?

It remains to check that this method meets the anonymization constraints of the G29.

Individualization. As each retained prefix encompasses \(2^{16}\) possible addresses, it is not possible to isolate the activity of a user whose IP address is known from anonymized logs.

Correlation. It remains possible to isolate visits, in the sense of requests that are very close in time and possibly linked through the Referer header (for example, an access to a page, then the retrieval of CSS and images, and sometimes a bounce to another page of the site and so on). Beyond this notion of visit, it is not possible to follow the activity of a user because there is nothing to determine that the following visits come from the same user and not from one of his \(2^{16} - 1\) neighbors.

Inference. Knowing that a user came on a given day, it is not possible to isolate his visits from those of his neighbors, unless he was the only one that day.

As the three criteria are not fully met, the risk of re-identification must be analyzed:

If you know a user's IP address and already know that they came within a time window short enough for them to be the only one from their neighborhood, you can tell what they saw.

That's a lot of hard-to-realize assumptions: either you have access to their computer system, or you have access to their DNS queries. In either case, and if this access is legal, then this is a criminal investigation and you already have access to the non-anonymized logs.