Geolocation, know the country of your users

tbowan
February 5th 2020

Spoiler: Whether you're a developer or a user, you've probably encountered the problem of geolocation. Before getting into millimeter-level precision, today we will see how to obtain the country using public databases.

When developing an app (web or mobile), we quickly need to know which country our users come from.

Sometimes, it's to make navigation more ergonomic. Even if the web browser sends you its user's preferred language, it does not send you their location. In particular, it does not send the time zone, which can disrupt the display of times.

Without geolocation, we must delegate the task of localizing timestamps to JavaScript.

Other times, it's to respect legal constraints, such as the video streaming of sporting events, for which television channels buy rights in defined territories.

This is why, in France, soccer fans must cheat and use a VPN in order to watch the best team in the world (on RTBF).

Finally, it is a source of information for offering visitors tailored suggestions, closer to their home (or to their local concerns), whether it is news, search results or even specific products and services. Targeted advertising is then not far away.

Unless your visitors send you their location, you must guess their (geo)location from their IP address. For that, you can obtain databases that provide the mapping, whose price depends on the precision (geographic and temporal).

True, web browsers have an API for getting the geolocation. Fortunately, if an application wants your data, the browser will ask for your permission (which you should decline) before providing it.

Some companies provide geolocation databases, but if you only want the country, it would be a shame to pay because, as we will show you today, this information is public and freely accessible.

Distribution of addresses

For a whole bunch of technical reasons, machines use addresses to communicate on the Internet in the form of a sequence of bits, 0 and 1, of fixed length. IPv4 uses 32 bits that are written with 4 numbers (corresponding to bytes). IPv6 uses 128 bits that are written in hexadecimal.
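To make this concrete, here is a minimal Python sketch using the standard ipaddress module. The helper name to_bits is ours, and 2001:db8::1 is just a documentation address picked for illustration.

```python
import ipaddress

# The IPv4 address used in the record example later in this article.
v4 = ipaddress.ip_address("2.0.0.0")
# A documentation address, chosen here only for illustration.
v6 = ipaddress.ip_address("2001:db8::1")


def to_bits(address):
    """Return the address as a fixed-length string of 0s and 1s."""
    return format(int(address), "0{}b".format(address.max_prefixlen))


print(len(to_bits(v4)), to_bits(v4))  # 32 bits for IPv4
print(len(to_bits(v6)))               # 128 bits for IPv6
print(v6.exploded)                    # full hexadecimal notation
```

Whatever the notation, an address is just a number: int(v4) gives the 32-bit value that the dotted notation hides.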

To facilitate routing packets across the network, these addresses are split in two. The prefix designates the network and is the same for all the machines that are part of it. The suffix designates the machine within its network. A network can of course be divided into subnets (the part of the address assigned to the network grows, to the detriment of the part identifying the machine) but that is another story.
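The split can be sketched with two bit operations. The function name split_address is hypothetical, and 192.0.2.200 is a documentation address used only as an example.

```python
import ipaddress


def split_address(address, prefix_length):
    """Split an IPv4 address into (network prefix, machine suffix) integers."""
    value = int(ipaddress.IPv4Address(address))
    host_bits = 32 - prefix_length
    prefix = value >> host_bits              # leading bits: the network
    suffix = value & ((1 << host_bits) - 1)  # trailing bits: the machine
    return prefix, suffix


# Same address, seen as part of a /24 network, then of a /28 subnet.
print(split_address("192.0.2.200", 24))
print(split_address("192.0.2.200", 28))
```

With a /28, four more bits move from the suffix to the prefix, which is the subnetting mentioned above.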

To avoid this becoming a mess, an organization has taken charge of globally supervising the allocation of IP addresses: the IANA. For diplomatic reasons, because it's 1990 and nobody wants the global network to be managed solely by the United States of America, the IANA has delegated its prerogatives to Regional Internet Registries (RIRs), including the RIPE NCC, created in 1992 for Europe.

The joke is that the regional registries, feeling that it would be very practical to come together to organize themselves, founded the NRO, a new organization to centralize their activities...

When a network operator (we speak of an Autonomous System) needs machines to be accessible, it makes a request to the registry on which its country depends, which then provides it with a free network prefix (if there are any left).

Get the data

We are lucky: the Internet was created by idealistic academics. Data about how it works, while not easy to find, is freely available to the public.

The same applies to the prefix allocations made by the registries for the benefit of the operators. These are called "RIR delegation files"; they are distributed by the registries and updated daily. Each of the five registries (AFRINIC, APNIC, ARIN, LACNIC and RIPE NCC) publishes its own.

If you are a fan of statistics or have a slightly more touristic approach to the history of address allocation, I suggest the excellent statistics page by Patrick Maigron. Among other things, you will find the history of the use of prefixes and geographical maps for the (unequal) distribution between countries.

The contents of these databases are provided in CSV format (where the separator is the | character). The format is common to all RIRs and documented by each of them (e.g. by RIPE NCC; the APNIC version, the first Google result, contains errors 😉).

Header and summaries

The first line of the file is the header. It is used to determine the format and content of the file.

2|ripencc|1580684399|210814|19830705|20200202|+0100

The fields have the following meaning:

Version: 2 for some time now,
Registry name: ripencc here, for the European registry,
Serial number: it must be incremented at each update; here it is a timestamp (1580684399), but it can also be a date (e.g. this is AFRINIC's choice),
Number of records: number of lines, not including the header and summaries,
Start date: start date of the period, 19830705 here (for July 5th 1983); some RIRs put 0s,
End date: end date of the period, 20200202 here (for February 2nd 2020),
UTC offset: that of the RIR that generated the file, +0100 here.
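As an illustration, the header can be read with a simple split. This is only a sketch: the Header type and the parse_header function are names we made up, and the version is kept as text since nothing here requires it to be a number.

```python
from collections import namedtuple

# Field names are ours; they follow the order described above.
Header = namedtuple(
    "Header",
    "version registry serial records start_date end_date utc_offset",
)


def parse_header(line):
    """Parse the first line of a delegation file into a Header."""
    fields = line.strip().split("|")
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got {}".format(len(fields)))
    version, registry, serial, records, start, end, offset = fields
    # Only the record count is converted; the other fields stay as text.
    return Header(version, registry, serial, int(records), start, end, offset)


header = parse_header("2|ripencc|1580684399|210814|19830705|20200202|+0100")
print(header.registry, header.records)  # ripencc 210814
```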

This is followed by three summary lines, which provide the number of records for each type (IPv4 networks, IPv6 networks and operator, i.e. autonomous system, number ranges). Example with the RIPE NCC IPv4 networks:

ripencc|*|ipv4|*|81452|summary

The fields have the following meaning:

Registry name: as above,
*: the ‘star’ character,
Registration type: one of ipv4, ipv6 and asn,
*: the ‘star’ character,
Number of records: number of lines for this type (and definitely not the number of addresses),
summary: the word ‘summary’ which, together with the stars, differentiates the summary lines from the others.
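Here is one possible way to recognize these lines, relying on the stars and the final word as described above. The function name parse_summary is ours.

```python
def parse_summary(line):
    """Return (type, count) for a summary line, or None for any other line."""
    fields = line.strip().split("|")
    is_summary = (
        len(fields) == 6
        and fields[1] == "*"
        and fields[3] == "*"
        and fields[5] == "summary"
    )
    if not is_summary:
        return None
    return fields[2], int(fields[4])


print(parse_summary("ripencc|*|ipv4|*|81452|summary"))  # ('ipv4', 81452)
```

Any other line (header or record) yields None, which makes it easy to skip the summaries while reading the file.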

Records

All the other lines concern these famous records. For example, the first IPv4 network allocated in Europe:

ripencc|FR|ipv4|2.0.0.0|1048576|20100712|allocated|ec66869d-4669-423f-8a24-0cb99c3e1491

The fields are as follows:

Registry name: you're getting used to it,
Country code: two characters (following ISO 3166), here FR,
Record type: in case it wasn't clear, here it's ipv4,
Range start: first network address or first autonomous system number, here it's network 2.0.0.0,
Size: for IPv4 networks and operator numbers, this is the number of resources (here, 1048576 addresses in this network); for IPv6 networks, this is the prefix length,
Registration date: date on which this resource was allocated or assigned (here July 12th 2010),
Status: allocated (as is the case here) or assigned,
Opaque ID: unique identifier of the owner of the resource (ec66869d-4669-423f-8a24-0cb99c3e1491 here for Orange; this may change from one version of the file to another).
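A record can be turned into a network with the standard ipaddress module. This is a sketch under a few assumptions: Record, parse_record and record_networks are hypothetical names, and since an IPv4 size is an address count that is not guaranteed to be a power of two, we let summarize_address_range return one or several networks.

```python
import ipaddress
from collections import namedtuple

# Field names are ours; they follow the order described above.
Record = namedtuple(
    "Record", "registry country type start value date status opaque_id"
)


def parse_record(line):
    """Parse a record line; the opaque ID is treated as optional."""
    fields = line.strip().split("|")
    registry, country, rtype, start, value, date, status = fields[:7]
    opaque_id = fields[7] if len(fields) > 7 else ""
    return Record(registry, country, rtype, start, int(value), date, status, opaque_id)


def record_networks(record):
    """Return the list of networks covered by an ipv4 or ipv6 record."""
    if record.type == "ipv4":
        # The value is a number of addresses, counted from the range start.
        first = ipaddress.IPv4Address(record.start)
        last = first + (record.value - 1)
        return list(ipaddress.summarize_address_range(first, last))
    if record.type == "ipv6":
        # The value is directly the prefix length.
        return [ipaddress.IPv6Network("{}/{}".format(record.start, record.value))]
    raise ValueError("asn records do not describe a network")


record = parse_record(
    "ripencc|FR|ipv4|2.0.0.0|1048576|20100712|allocated|"
    "ec66869d-4669-423f-8a24-0cb99c3e1491"
)
print(record.country, record_networks(record))
```

For the example record, 1048576 addresses is 2 to the power 20, so the network keeps 12 bits for its prefix: 2.0.0.0/12.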

And that's all... To find out the country of an IP address, all you need to do is find the record corresponding to its network.
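As a rough sketch of that search for IPv4, we can sort the ranges and use a binary search. The names build_index and country_of are ours, the code assumes that records do not overlap, and the second sample record (country ZZ on a documentation network) is made up for the example.

```python
import bisect
import ipaddress


def build_index(lines):
    """Build a sorted list of (first, last, country) from IPv4 record lines."""
    ranges = []
    for line in lines:
        fields = line.strip().split("|")
        # Skip the header, the summaries, other record types,
        # and records without a country code.
        if len(fields) < 7 or fields[2] != "ipv4" or not fields[1]:
            continue
        first = int(ipaddress.IPv4Address(fields[3]))
        ranges.append((first, first + int(fields[4]) - 1, fields[1]))
    ranges.sort()
    return ranges


def country_of(ranges, address):
    """Return the country code of an IPv4 address, or None if unknown."""
    value = int(ipaddress.IPv4Address(address))
    # Position of the last range whose first address is <= value.
    position = bisect.bisect_right(ranges, (value, float("inf"), "")) - 1
    if position >= 0:
        first, last, country = ranges[position]
        if first <= value <= last:
            return country
    return None


sample = [
    "2|ripencc|1580684399|210814|19830705|20200202|+0100",
    "ripencc|*|ipv4|*|81452|summary",
    "ripencc|FR|ipv4|2.0.0.0|1048576|20100712|allocated|ec66869d-4669-423f-8a24-0cb99c3e1491",
    # Made-up record, for illustration only.
    "ripencc|ZZ|ipv4|192.0.2.0|256|20200101|assigned|example",
]
index = build_index(sample)
print(country_of(index, "2.1.2.3"))  # FR
print(country_of(index, "8.8.8.8"))  # None
```

On a real file, the index would be built once from all the lines, then reused for each lookup.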

The RIRs have officially stated that they are not suppliers of geolocation data and that we should instead use the databases provided by companies whose business it is. But as we're only interested in the country here (and not the cities or coordinates), the RIRs' data will be more than enough for us.