# Anonymize IP addresses for statistics

I don't know about you, but I love doing statistics. Since IP addresses are personal data, in order to do this well, they must first be anonymized.

Even without wanting to monitor users individually, it is always useful to gather some statistics on their activity: on a website, to get an idea of what works or where visits come from; on a network, to find the most used resources, the files accessed, and so on.

But to respect users (their personal data and, incidentally, the GDPR), some precautions must be taken, starting with anonymizing this data.

Today, following up on the retrieval of web server access logs, I offer you a very simple method to anonymize the IP addresses in your logs.

# Constraints

You could rely on common sense and anonymize your data alone in your corner, but since there are legal and regulatory constraints, it's worth taking a look at them first. It would be a shame not to comply.

Data acquisition. You must first have a legal basis to collect the data. Either you have requested free and informed consent from your users, or you have a legal obligation to hold this data. That's up to you and off topic for today.

Compatibility with initial processing. If users have not consented to this particular processing, you must evaluate its compatibility with the processing to which they did consent.

Anonymization. We therefore have the right to use our data for statistics, but while respecting certain constraints. The article is a little longer, but for what concerns us today, the following sentence will guide us:

We will therefore have to avoid any identification of the persons concerned. As you had anticipated, anonymization therefore meets this need.

Anonymization criteria. To ensure that users are no longer identifiable, we cannot anonymize data just any old way. The G29 (the Article 29 Working Party, which brought together the European counterparts of the CNIL) has already worked on the subject and provided a guideline containing the following three criteria:

1. Individualization: is it still possible to isolate an individual?
2. Correlation: is it possible to link separate sets of data concerning the same individual?
3. Inference: can we deduce information about an individual?

If the answer to all three questions is no, the data is anonymous and you are fine. If even one criterion is not met, a detailed analysis of the risks of re-identification is necessary.

# My solution

With all these constraints in mind, we can move forward: we have the right to use our logs for statistics, but we must anonymize all personal data, which in our case means the IP addresses (the only personal data present).

## IP version 4

Since IP version 4 addresses are widely used, there are already official guides.

> The IP address used to geolocate the Internet user should not be more precise than the scale of the city. Concretely the last two bytes [16 bits out of 32] of the IP address must be deleted.
>
> The CNIL, solution for audience measurement cookies

If you have followed my last articles on the geolocation of IP addresses (the theory and the practice), you will realize that, with the shortage of IP addresses, Regional Internet Registries began to allocate smaller and smaller networks. The tiniest hold 1024 addresses, which means a 22-bit prefix: 6 bits more than the 16 allowed by the CNIL...
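
To make that arithmetic explicit, here is a tiny sketch in shell. The function name `addresses_in_prefix` is made up for this example; it only computes how many IP version 4 addresses fit behind a prefix of a given length.

```shell
# Hypothetical helper (not part of any tool): number of IPv4 addresses
# covered by a prefix of length $1, i.e. 2^(32 - length).
addresses_in_prefix() {
    echo $(( 1 << (32 - $1) ))
}

addresses_in_prefix 16   # prefix left after deleting the last two bytes
addresses_in_prefix 22   # smallest allocation mentioned above
```

The first call prints 65536 and the second prints 1024, so a 16-bit prefix is 64 times coarser than the smallest allocated network.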

Coming back to our case, here is a sed command to anonymize IP version 4 addresses in web access logs.

    sed -r -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /"

- The `-r` option lets me use extended regular expressions (with braces),
- The `-e` option introduces a substitution rule.

Basically, I anchor the search at the beginning of the line with `^` and capture the first two numbers of the IP address with `(([0-9]+\.){2})`. I then drop everything that follows up to the first space with `[^ ]+` (the space delimits the fields in the access logs). The replacement takes the first two numbers with `\1`, then I add `0.1` to replace the deleted bytes.
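
To see the rule at work, here is a minimal sketch that wraps it in a function and feeds it one log line. The function name `anonymize_ipv4` and the sample line are assumptions made for this example (the address comes from a range reserved for documentation), and `-r` assumes GNU sed.

```shell
# Hypothetical wrapper around the IPv4 rule; the name is only for illustration.
anonymize_ipv4() {
    # Keep the first two numbers, replace the rest of the first field with 0.1.
    sed -r -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /"
}

# Made-up access log line, not real traffic.
printf '%s\n' '203.0.113.42 - - [10/Oct/2023:13:55:36 +0200] "GET / HTTP/1.1" 200 512' \
    | anonymize_ipv4
```

The line comes out starting with `203.0.0.1`, the rest being untouched since the rule is anchored at the beginning of the line.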

## IP version 6

Everyone has dragged their feet so much on migrating to IPv6 that nobody ever thinks about it. So much so that the CNIL did not give any concrete criteria on the issue.

So I turned to RFC 2374, which explains how IP address blocks ~~are~~ should be split. RIRs allocate 48-bit prefixes to ASNs and let them use the following 16 bits as they wish, to identify their access points.

We can therefore restrict ourselves to the first 48 bits (i.e. 6 bytes, which make three groups of 4 hexadecimal characters). We will lose a bit of precision in borderline cases, but it will be compatible with the precision of a city. If accuracy is important to you, geolocate before anonymizing.

So here is the sed rule for these addresses. The options are the same as before.

    sed -r -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /"

I start at the beginning of the line with `^`, then I capture at most three hexadecimal groups with `(([^:]{1,4}:){1,3})` and drop the rest up to the next space with `[^ ]+`. When replacing, I keep the prefix with `\1` and, to get a valid address, I give it the first one of the block with `:1`.
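
As with version 4, a short sketch helps to check what the rule does, including on an address that is already compressed. The function name `anonymize_ipv6` and the sample lines are assumptions for this example (the addresses belong to the documentation range `2001:db8::/32`), and `-r` again assumes GNU sed.

```shell
# Hypothetical wrapper around the IPv6 rule; the name is only for illustration.
anonymize_ipv6() {
    # Keep at most the first three groups, replace the rest of the field with :1.
    sed -r -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /"
}

# Made-up access log lines: one full address, one compressed address.
printf '%s\n' \
    '2001:db8:1234:5678:9abc:def0:1234:5678 - - "GET / HTTP/1.1" 200 512' \
    '2001:db8::5678:1 - - "GET / HTTP/1.1" 200 512' \
    | anonymize_ipv6
```

The first line comes out starting with `2001:db8:1234::1` and the second with `2001:db8::1`: in the compressed case only two groups can be captured, which still stays within the first 48 bits.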

## A script

Finally, the ideal is still to encapsulate this anonymization method in a script. We can then forget how it works and simply use it to anonymize files (it becomes a black box).

Here, I did not try to make it complicated: each file passed as an argument goes through the sed mill, which anonymizes both IP version 4 and version 6 addresses.

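    #!/bin/sh
    # Anonymize, in place, the leading IP address of each line
    # in every log file passed as an argument.

    for i in "$@"; do
        echo "-- Anonymize -$i"
        # First rule: IPv4 (keep two bytes). Second rule: IPv6 (keep 48 bits).
        sed -i -r \
            -e "s/^(([0-9]+\.){2})[^ ]+ /\10.1 /" \
            -e "s/^(([^:]{1,4}:){1,3})[^ ]+ /\1:1 /" \
            "$i"
    done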

Now, all that remains is to add a call to this script after retrieving your log files.
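
If you want to double-check the result before feeding the logs to your statistics, here is a possible sketch of a sanity check. The function name `count_unanonymized` is an assumption for this example, and it presumes that the first field of every line is an IP address (a hostname or an empty line would be counted as suspect).

```shell
# Hypothetical check: count the lines whose first field does not look like
# an anonymized address (x.y.0.1 for IPv4, ...::1 for IPv6).
# Reads the files given as arguments, or standard input if there are none.
count_unanonymized() {
    awk '$1 !~ /^[0-9]+\.[0-9]+\.0\.1$/ && $1 !~ /::1$/ { n++ } END { print n + 0 }' "$@"
}

# Made-up sample: two anonymized lines and one forgotten address.
printf '%s\n' \
    '203.0.0.1 - - "GET / HTTP/1.1" 200 512' \
    '2001:db8:1234::1 - - "GET / HTTP/1.1" 200 512' \
    '203.0.113.42 - - "GET / HTTP/1.1" 200 512' \
    | count_unanonymized
```

On this sample the function prints 1, for the third line. A result of 0 on your own files is what you are looking for.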

# And after?

It remains to check that this method meets the anonymization constraints of the G29.

Individualization. As each kept prefix covers $$2^{16}$$ possible addresses (and far more for IP version 6), it is not possible to isolate, from the anonymized logs, the activity of a user whose IP address is known.
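
To put numbers on this, here is the count of addresses sharing one prefix, assuming 16 bits are kept out of 32 for IP version 4 and 48 out of 128 for IP version 6:

$$
2^{32-16} = 2^{16} = 65\,536 \qquad\text{and}\qquad 2^{128-48} = 2^{80} \approx 1.2 \times 10^{24}
$$

These figures count possible addresses, not active users, so the real crowd behind a given prefix is smaller.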

Correlation. It remains possible to isolate a visit, understood as a series of requests very close in time and possibly linked by the Referer header (for example, access to a page, then retrieval of its CSS and images, and sometimes a hop to another page of the site, and so on). Beyond this notion of visit, it is not possible to follow a user's activity, because nothing indicates that subsequent visits come from the same user rather than from one of their $$2^{16} - 1$$ neighbors.

Inference. Knowing that a user came on a given day, it is not possible to isolate their visits from those of their neighbors, unless they were the only one that day.

As the three criteria are not fully met, the risk of re-identification must be analyzed:

If you know a user's IP address and already know that they visited within a time window short enough for them to be the only one from their neighborhood, you can tell what they saw.

That's a lot of hard-to-satisfy assumptions: either you have access to their computer system, or you have access to their DNS queries. In either case, and if this access is legal, then this is a criminal investigation and you already have access to the non-anonymized logs.