Create a Geoip database with PHP and SQL

Spoiler: Having seen the theory, let’s tackle the practice: creating and using our own IP address geolocation database from public data on the Internet. I’ll use a sqlite database to store the address ranges of each network and its association, and a PHP script to encapsulate all the logic into a class. I’ll take the opportunity to do some performance comparisons between different variants.

When you’re developing a web application and you want for an audience that, while not necessarily global, nonetheless transcends the borders of your own country, you must have to deal with geolocation, even a little bit. Knowing the country from which your visitors come allows you to better localize the interface, adapt the offer or respect legal constraints.

As we’ve seen earlier, if you don’t need anything more precise than the country, you can get your data directly from the Register Internet Rregionals via their delegation files, which contain everything you need. It’s freely accessible and updated daily.

TheDigitalArtist @ pixabay

After the theory, we’re going to get our hands dirty and build our own geolocation database usable in our applications.

The database

To keep things simple and relatively universal, I’m going to use PHP to perform the operations and an sqlite database to organize and store the data. On a Debian or derivative, you can install everything you need with the package manager as follows:

sudo apt-get install php-cli php-sqlite3

If you’re in a different situation or have particular constraints, you can always use other systems, languages and storage structures - you’ll just have to translate as you go along…

Now that the environment is ready, we’re finally going to be able to code. And since I prefer things clean, I’m going to encapsulate the database in a class and complete it little by little. It’s also interesting to be able to follow the RAII pattern by initializing everything in the constructor.

class Database {
    
    private $pdo ;

    public function __construct($filename) {
        $this->pdo = new PDO("sqlite:$filename") ;
        
        // Because I prefer exceptions :
        $this->pdo->setAttribute(
            \PDO::ATTR_ERRMODE,
            \PDO::ERRMODE_EXCEPTION
        ) ;

        // defined later
        $this->initIPv4() ;
        $this->initIPv6() ;
    }
    
    // ...
}

IP version 4 networks

I start with IP version 4 addresses, as they’re simpler and will help us lay the foundations for what’s to come.

SQL database

Instead of storing the network with the prefix length, we’ll store it as an open interval [begin,end[[begin,end[ with the first and last network addresses. This will save us the problem of calculating prefixes to compare networks and addresses.

Even though they’re written as 4 decimal numbers, IP version 4 addresses are basically 32-bit integers (unsigned, to be precise). In SQL, we can store them in an int unsigned, which will be usefull, not least for future comparisons.

Tourism… small distinctive characteristic of sqlite, all integers are in fact handled as INTEGER (64-bit integers) then truncated when written to the appropriate precision.

create table if not exists IPv4 (
    start   int unsigned,
    end     int unsigned,
    country varchar(2)
);

Last optimization, as we’re going to use this table for searches, we’ll add indexes to the start and end columns. The advantage is that we’ll go from a complexity in O(n)O(n) to a complexity in O(logn)O(\log n) for searches. You’ll see later that this isn’t just theoretical…

create index if not exists index_ipv4_start on IPv4(start);
create index if not exists index_ipv4_end   on IPv4(end)  ;

With this, we have everything we need to initialize the IPv4 table using the following method:

private function initIPv4() {
    $this->pdo->query("
            create table if not exists IPv4 (
                start   int unsigned,
                end     int unsigned,
                country varchar(2)
            )") ;
    $this->pdo->query("
            create index if not exists index_ipv4_start on IPv4(start)
            ") ;
    $this->pdo->query("
            create index if not exists index_ipv4_end on IPv4(end)
            ") ;
}

Add a network

In order to add a network, we’ll convert the IP address to an integer using the ip2long() function, designed for this purpose. As the RIRs provide the number of addresses for IPv4 (and not a prefix), we’ll add it to this value to obtain the last of the network.

Defense in depth: Even if I know that this function will be called with IP addresses, I always prepare my queries and then bind their parameters to avoid SQL injections in any case.

public function addIPv4($addr, $count, $country) {
    $start = ip2long($addr) ;
    $st = $this->pdo->prepare(
        "insert into IPv4 (start, end, country)"
        . " values (:start, :end, :country)"
    ) ;
    $st->execute([
        "start"   => $start,
        "end"     => $start + $count,
        "country" => $country
    ]) ;
}

Search for an address

In order to find the country of an address, and as we use intervals, we can simply search the line whose bounds enclose the value.

Tourism… To handle the case where no network exists, I use the coalescing NULL operator ?? (which returns its second argument if the first is NULL or doesn’t exist), a new feature in PHP7. The advantage over the simplified ternary operator ?: is that there’s no warning if the first argument doesn’t exist.

public function getIPv4($ip) {
    $value = ip2long($ip) ;
    $st = $this->pdo->prepare("
            select * from IPv4
            where
                start  <= :value and
                :value <  end
            ") ;
    $st->execute(["value" => $value]) ;
    $row = $st->fetch() ;
    return $row["country"] ?? "ZZ" ;
}

We can also take advantage of the fact that the networks are disjoint to avoid unnecessary comparisons. In this case, we’ll look for intervals whose beginning is smaller than our address, then keep the largest of them.

As some networks are reserved and do not appear in the database (i.e. 192.168.0.0/16), there are gaps. For a reserved address, the query will return a range that doesn’t cover it. You must therefore check with PHP that the network provided is correct.

public function getIPv4($ip) {
    $value = ip2long($ip) ;
    $st = $this->pdo->prepare("
            select * from IPv4
            where start <= :value
            order by start desc
            limit 1
            ") ;
    $st->execute(["value" => $value]) ;
    $row = $st->fetch() ;
    if ($row === false || $row["end"] < $value) {
        return "ZZ" ;
    }
    return $row["country"] ;
}

Comparisons

The point of making variants is to compare them. So I generated 2,500 random addresses for which I requested a basic search using one or other of the algorithms. And for the clever ones who think they could benefit from both versions by keeping the test on boundariesand looking for the larger network start, I’ve also used this variant (which I’ve called hybrid but haven’t added here).

And to be really exhaustive, I did these tests with and without indexes on the interval bounds. Just to show you how life-changing an index can be.

For the measurements, I reused my PHP profiler which allows me to measure the execution of a particular function while keeping it very simple.

Here are the raw results, durations are expressed in seconds (with milliseconds after the decimal point). Tests are run on Ubuntu 18.04.4 using two Intel® Xeon® X5670 processors. Note that the script uses only one core at a time:

Algorithm Without index With indexes
Two boundaries 38,078 0,838
One boundary 117,394 0,121
Hybrid 75,380 28,115

Unsurprisingly, the theory is in line with reality: indexes really do speed up queries (time is divided by 1000). Second self-evident truth: the simpler the query, the better it performs (the hybrid version doesn’t benefit as much from indexes).

IP version 6 networks

For IP addresses in version 6, we’ll have to adapt the previous structures and algorithms. Nothing too big, but it deserves its own section.

SQL database

As before, we’ll store the intervals in the form [begin,end[[begin, end[. This will be much more practical than having to compare IPv6 prefixes and manage its own particularities (i.e. RFC 4291). On the other hand, as these addresses are 128 bits long, they no longer fit directly into an integer, and you’ll have to choose a suitable model.

A hexadecimal string. We can translate the 128 bits into a 32-character hexadecimal string and store it in a varchar(32). To do this, I’ll use the function inet_pton() to obtain these 128 bits and then hex2bin() to obtain the 32 characters.

public static function ip6toString($ip) {
    $bin = inet_pton($ip) ;
    $hex = bin2hex($bin) ;
    return $hex ;
}

Intervals are still possible, and indexes can be used, but as we’re dealing with character strings, I find their handling less practical than with integers.

Two 64-bit integers To speed up comparisons, rather than 32 characters (integers between 0 and 255 in fact), we can group them into two unsigned 64-bit integers. This time, we’ll use the function unpack() to transform the 128 bits into an array of 2 64-bit integers.

Tourism… The unpack() function returns an associative array. When you request several elements of the same type, the numbering starts at 1 (not 0).

public static function ip6toInts($ip) {
    $bin  = inet_pton($ip) ;
    $ints = unpack("J2", $bin) ;
    return [$ints[1], $ints[2]] ;
}

The comparison will be faster, but we’ll have to code it ourselves, first comparing the most significant integers and then, if they’re equal, the least significant ones. In PHP, we’d have something like this:

public static function isLessThan($lhs, $rhs) {
    return $lhs[0] < $rhs[0] || ($lsh[0] == $rhs[0] && $lsh[1] < $rhs[1]) ;
}

It’s not very practical in PHP, so can you imagine if you had to write this kind of thing in every SQL query? It’ll work, but it’s not really practical.

A 64-bit integer. The thing is, Regional Internet Registries don’t allocate prefixes larger than 48 bits (RFC 2450), so there’s no point in keeping the last 64 bits of the addresses, since they won’t have any influence on the calculations.

According to this standard, it is also useless to keep the 16 bits beyond the 48th but we will keep them for two reasons: 1. there is no format for storing integers on 3 bytes (48 bits), 2. the RIRs have nevertheless cut networks whose prefix is between 59 and 64 bits…

public static function ip6toBigInt($ip) {
    $bin = inet_pton($ip) ;
    $ints = unpack("J2", $bin) ;
    return $ints[1] ;
}

For storage, we’ll use a single bigint unsigned for the bounds and, as before, add an index to each of them.

private function initIPv6() {
    $this->pdo->query("
            create table if not exists IPv6 (
                start   bigint unsigned,
                end     bigint unsigned,
                country varchar(2)
            )") ;
    $this->pdo->query("
            create index if not exists index_ipv6_start on IPv6(start)
            ") ;
    $this->pdo->query("
            create index if not exists index_ipv6_end on IPv6(end)
            ") ;
}

Add a network

This time, the specific characteristic comes from the delegation files which, for IP version 6 addresses, don’t provide the number of addresses in the network but its prefix. The prefix must therefore be used to calculate the number to be added to obtain the end of the interval.

public function addIPv6($addr, $prefix_length, $country) {
    $start = self::ip6toBigInt($addr) ;
    $count = 1 << (64 - $prefix_length) ;
    $st = $this->pdo->prepare(
        "insert into IPv6 (start, end, country)"
        . " values (:start, :end, :country)"
    ) ;
    $st->execute([
        "start"   => $start,
        "end"     => $start + $count,
        "country" => $country
    ]) ;
}

Search for an address

With our previous conventions, the search algorithm won’t change. We simply use our ip6toBigInt() function for conversion before searching.

public function getIPv6($ip) {
    $value = static::ip6toBigInt($ip) ;
    $st = $this->pdo->prepare("
            select * from IPv6
            where start <= :value
            order by start desc
            limit 1
            ") ;
    $st->execute(["value" => $value]) ;
    $row = $st->fetch() ;
    if ($row === false || $row["end"] < $value) {
        return "ZZ" ;
    }
    return $row["country"] ;
}

Comparisons

This time, we will compare modeling variants : the 64-bit integer I’ve chosen to show you versus the 32-character hexadecimal string. As the two-integer version doesn’t add anything useful, I haven’t to add it.

The tests are the same as before: find the country for 2500 randomly generated addresses. It’s always the same machine and times are in seconds:

Variante Without indexes With Indexes
32 character string 36,288824 0,122571
64-bit integer 33,025770 0,130657

Once again, we can see the benefits of adding indexes: time is divided by 300. On the other hand, the variants are roughly equivalent, which was expected, since time depends on the index traversal, which in our case is equivalent for these two data.

Tourism… Even if there are 4 times more IPv4 networks as IPv6 networks, search times are also equivalent. This is normal and, once again, due to the indexes and, above all, their logarithmic search complexity in O(logn)O(\log n) :

Multiplying the size of the set will therefore add constant time, not multiply it. log(20000)=log(50000×4)=log(50000)+log(4)=4,6990+0,6021=5,3031 \begin{split} \log(20000) & = \log(50000 \times 4) \\ & = \log(50000) + \log(4) \\ & = 4,6990 + 0,6021 \\ & = 5,3031 \\ \end{split}

Dans notre cas, multiplier la taille de l’ensemble par 4 ajoute 14% de temps. Un temps d’exécution de 0,121 secondes pour la recherche des adresses IPv4 devrait nous donner théoriquement 0,137 secondes pour IPv6, très proche de nos mesures.

Update

Now that our class can respond to requests and is ready to receive data, we can tackle the files made available to the Regional Internet Registries.

Since we want out to keep things clean, I’m going to continue by splitting the update into several functions, each with a single responsibility.

Purge

As we don’t update the database very often (once a day), we can afford to make a simpler code that empties the database before reintegrating everything. The alternative would be more complex: for each network, we’d have to deal with cases where networks have been grouped or divided, and therefore that isn’t really useful.

private function purge() {
    $this->pdo->query("delete from ipv4") ;
    $this->pdo->query("delete from ipv6") ;
}

One file

We can now add a supplied delegation file (via its URL). First we need to retrieve its contents, via file_get_contents() which can retrieve files fia FTP. We can then filter the comment lines (starting with #), separate them into columns (separated by |) and process the ipv4 and ipv6 records.

Tourism… By default, sqlite will save all database modifications in the file each time, which can generate a lot of I/O and slow down the program. To avoid this, all operations are grouped together in a transactions. Modifications will be written together during commit.

private function addDelegationFile($url) {
    $this->pdo->beginTransaction() ;

    $content = file_get_contents($url) ;
    foreach (explode("\n", $content) as $line) {
        // Comments in the file
        if (strpos($line, "#") === 0) {
            continue ;
        }

        $row = explode("|", $line) ;
        // lines to ignore: empty or header or summaries
        if (count($row) == 1 || is_numeric($row[0]) || $row[1] == "*") {
            continue ;
        }
        
        // unused records
        if ($row[1] == "" || $row[5] == "" || $row[6] == "available") {
            continue ;
        }
        
        // useful records
        switch($row[2]) {
            case "ipv4" :
                $this->addIPv4($row[3], $row[4], $row[1]) ;
                break ;
            case "ipv6" :
                $this->addIPv6($row[3], $row[4], $row[1]) ;
                break ;
        }

    }
    $this->pdo->commit() ;
}

Global

All that remains is to organize the update logic: empty the table, then add the files one by one.

public function update() {
    $this->purge() ;

    $urls = [
        "ftp://ftp.afrinic.net/pub/stats/afrinic/delegated-afrinic-extended-latest",
        "ftp://ftp.apnic.net/pub/stats/apnic/delegated-apnic-extended-latest",
        "ftp://ftp.arin.net/pub/stats/arin/delegated-arin-extended-latest",
        "ftp://ftp.lacnic.net/pub/stats/lacnic/delegated-lacnic-extended-latest",
        "ftp://ftp.ripe.net/pub/stats/ripencc/delegated-ripencc-extended-latest",
    ] ;

    foreach ($urls as $url) {
        $this->addDelegationFile($url) ;
    }
}

Comparisons

For this last part, I made a comparison of update times with and without transactions. I’d originally planned to do lots of updates to make it average a bit, but when I saw the time it took to do a single update without a transaction, I gave up.

To avoid FTP delays, the files were repatriated before the test and then read locally each time.

Here are the raw results. The tests are done on the same machine as before, note that it has an SSD disk which greatly reduces I/O latency, the results would be even worse on an old-fashioned mechanical disk

Variantes Temps
Without transactions 3134,352950
With transactions 10,732112

That’s what I call optimization 😉 going from 52 minutes to less than 11 seconds, it’s worth it 😄.

Finally

For the class to be usable, we need a helper to redirect country searches to the correct method depending on the version of the address…

public function getIP($ip) {
    if (strpos($ip, ".") !== false) {
        return $this->getIPv4($ip) ;
    } else if (strpos($ip, ":") !== false) {
        return $this->getIPv6($ip) ;
    }
    
    throw new Exception("Not a valid IP Address") ;
}

With all this, it’s now very easy to update the database and do a search:

$db = new Database("db.sqlite") ;
$db->update() ;

echo $db->getIP("192.168.1.1")              . "\n" ; // ZZ
// www.arsouyes.org :
echo $db->getIP("188.165.53.185")           . "\n" ; // FR
echo $db->getIP("2001:41d0:301::21")        . "\n" ; // FR
// www.google.com :
echo $db->getIP("216.58.201.228")           . "\n" ; // US
echo $db->getIP("2a00:1450:4007:816::2004") . "\n" ; // IE

And now ?

In this article, I showed you how to navigate into delegation files and organize data into an SQLite database that you could use in other application.

For those who are curious, here are some other articles that may interest you.

Geolocation, know the country of your users

February 5th 2020 Whether you’re developer or user, you’ve probably encountered the problem of geolocation. Before getting into millimeter-level precision, today we will see how to obtain the country using public databases.

PHP Profiler

June 25th 2017 - tbowan. When you need to measure the resources taken by some parts of your scripts, rather than taking out the big guns with xDebug, I offer you a handy little class.