search tree.

### record\_size

This is an unsigned 16-bit integer. It indicates the number of bits in a
record in the search tree. Note that each node consists of *two* records.

### ip\_version

This is an unsigned 16-bit integer which is always 4 or 6. It indicates
whether the database contains IPv4 or IPv6 address data.

### database\_type

This is a string that indicates the structure of each data record associated
with an IP address. The actual definition of these structures is left up to
the database creator.

Names starting with "GeoIP" are reserved for use by MaxMind (and "GeoIP" is a
trademark anyway).

### languages

An array of strings, each of which is a locale code. A given record may
contain data items that have been localized to some or all of these
locales. Records should not contain localized data for locales not included in
this array.

This is an optional key, as this may not be relevant for all types of data.

### binary\_format\_major\_version

This is an unsigned 16-bit integer indicating the major version number for the
database's binary format.

### binary\_format\_minor\_version

This is an unsigned 16-bit integer indicating the minor version number for the
database's binary format.

### build\_epoch

This is an unsigned 64-bit integer that contains the database build timestamp
as a Unix epoch value.

### description

This key will always point to a map. The keys of that map will be language
codes, and the values will be a description in that language as a UTF-8
string.

The codes may include additional information such as script or country
identifiers, like "zh-TW" or "mn-Cyrl-MN". The additional identifiers will be
separated by a dash character ("-").

This key is optional. However, creators of databases are strongly
encouraged to include a description in at least one language.

### Calculating the Search Tree Section Size

The formula for calculating the search tree section size *in bytes* is as
follows:

    ( ( $record_size * 2 ) / 8 ) * $number_of_nodes

The search tree section is followed by the 16-byte data section separator,
after which the data section begins.
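
As a minimal sketch, the size formula can be written in Python as follows. The
function name `search_tree_size` is an illustrative choice, not part of the
format; the arguments correspond to `$record_size` and `$number_of_nodes`.

```python
def search_tree_size(record_size: int, number_of_nodes: int) -> int:
    """Return the size of the search tree section in bytes."""
    # Each node holds two records of record_size bits; there are 8 bits
    # in a byte. For record sizes of 24, 28, and 32 bits this yields 6,
    # 7, and 8 bytes per node, so the integer division is exact.
    bytes_per_node = (record_size * 2) // 8
    return bytes_per_node * number_of_nodes
```

For example, a tree with 24-bit records and 1,000 nodes occupies
`search_tree_size(24, 1000)` bytes, which is 6,000.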

## Binary Search Tree Section

The database file starts with a binary search tree. The number of nodes in the
tree is dependent on how many unique netblocks are needed for the particular
database. For example, the city database needs many more small netblocks than
the country database.

The topmost node is always located at the beginning of the search tree
section's address space. The top node is node 0.

Each node consists of two records, each of which is a pointer to an address in
the file.

A pointer can point to one of three things. First, it may point to another
node in the search tree address space. These pointers are followed as part of
the IP address search algorithm, described below.

Second, the pointer may have a value equal to `$number_of_nodes`. If this is
the case, it means that the IP address we are searching for is not in the
database.

Finally, it may point to an address in the data section. This is the data
relevant to the given netblock.

### Node Layout

Each node in the search tree consists of two records, each of which is a
pointer. The record size varies by database, but within a single database all
records are the same size. A record may be anywhere from 24 to 128 bits
long, depending on the number of nodes in the tree. These pointers are
stored in big-endian format (most significant byte first).

Here are some examples of how the records are laid out in a node for 24, 28,
and 32 bit records. Larger record sizes follow this same pattern.

#### 24 bits (small database), one node is 6 bytes

    | <------------- node --------------->|
    | 23 .. 0          |          23 .. 0 |

#### 28 bits (medium database), one node is 7 bytes

    | <------------- node --------------->|
    | 23 .. 0 | 27..24 | 27..24 | 23 .. 0 |

Note that 4 bits of each pointer are combined into the middle byte. For both
records, these 4 bits are prepended to the other 24 and end up in the most
significant position.

#### 32 bits (large database), one node is 8 bytes

    | <------------- node --------------->|
    | 31 .. 0          |          31 .. 0 |
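
The three layouts above can be decoded with a short Python sketch. The
function name `read_node` and its arguments are illustrative assumptions, and
`tree` is assumed to hold the raw bytes of the search tree section. For 28-bit
records, the sketch reads the diagram left to right and assumes the high
nibble of the middle byte belongs to the left record and the low nibble to the
right record.

```python
from typing import Tuple


def read_node(tree: bytes, node_number: int, record_size: int) -> Tuple[int, int]:
    """Return the (left, right) record values of one node."""
    node_size = (record_size * 2) // 8          # 6, 7, or 8 bytes
    start = node_number * node_size
    node = tree[start:start + node_size]

    if record_size == 24:
        left = int.from_bytes(node[0:3], "big")
        right = int.from_bytes(node[3:6], "big")
    elif record_size == 28:
        middle = node[3]
        # Bits 27..24 of each record live in the shared middle byte and
        # are prepended to the record's other 24 bits.
        left = ((middle >> 4) << 24) | int.from_bytes(node[0:3], "big")
        right = ((middle & 0x0F) << 24) | int.from_bytes(node[4:7], "big")
    elif record_size == 32:
        left = int.from_bytes(node[0:4], "big")
        right = int.from_bytes(node[4:8], "big")
    else:
        raise ValueError("this sketch only handles 24, 28, and 32 bit records")
    return left, right
```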

### Search Lookup Algorithm

The first step is to convert the IP address to its big-endian binary
representation. For an IPv4 address, this becomes 32 bits. For IPv6 you get
128 bits.

The leftmost bit corresponds to the first node in the search tree. For each
bit, a value of 0 means we choose the left record in a node, and a value of 1
means we choose the right record.

The record value is always interpreted as an unsigned integer. The maximum
size of the integer is dependent on the number of bits in a record (24, 28, or
32).

If the record value is less than the *number of nodes* in the search tree
(the node count stored in the database metadata, not the size of the tree in
bytes), then the value is a node number. In this case, we find that node in
the search tree and repeat the lookup algorithm from there.

If the record value is equal to the number of nodes, that means that we do not
have any data for the IP address, and the search ends here.

If the record value is *greater* than the number of nodes in the search tree,
then it is an actual pointer value pointing into the data section. The value
of the pointer is relative to the start of the data section, *not* the
start of the file.
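
The walk described above can be sketched in Python. To keep the example
self-contained, it assumes the tree has already been decoded into a list of
`(left, right)` record pairs, and that the address family matches the tree.
The name `lookup_record` is hypothetical.

```python
import ipaddress


def lookup_record(nodes, address):
    """Walk the tree for an address.

    Returns None when the database has no data for the address, or the
    record value that points into the data section.
    """
    node_count = len(nodes)
    ip = ipaddress.ip_address(address)
    bit_count = 32 if ip.version == 4 else 128
    value = int(ip)

    node = 0  # the top node is node 0
    for i in range(bit_count):
        # Take bits from the leftmost (most significant) one down.
        bit = (value >> (bit_count - 1 - i)) & 1
        record = nodes[node][bit]  # 0 picks the left record, 1 the right
        if record < node_count:
            node = record          # another node: keep walking
        elif record == node_count:
            return None            # no data for this address
        else:
            return record          # pointer into the data section
    raise ValueError("ran out of address bits before reaching a data record")
```

With the two-node tree `[(1, 2), (2, 18)]`, looking up `64.0.0.0` follows the
left record of node 0 to node 1, then takes the right record of node 1 and
returns 18, while `128.0.0.0` hits a record equal to the node count and
returns `None`.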

In order to determine where in the data section we should start looking, we use
the following formula:

    $data_section_offset = ( $record_value - $node_count ) - 16

The 16 is the size of the data section separator. We subtract it because we
want to permit pointing to the first byte of the data section. Recall that
the record value cannot equal the node count as that means there is no
data. Instead, we choose to start values that go to the data section at
`$node_count + 16`. (This has the side effect that record values
`$node_count + 1` through `$node_count + 15` inclusive are not valid).

This is best demonstrated by an example:

Let's assume we have a 24-bit tree with 1,000 nodes. Each node contains 48
bits, or 6 bytes. The size of the tree is 6,000 bytes.

When a record in the tree contains a number that is less than 1,000, this
is a *node number*, and we look up that node. If a record contains a value
greater than or equal to 1,016, we know that it is a data section value. We
subtract the node count (1,000) and then subtract 16 for the data section
separator, giving us the number 0, the first byte of the data section.

If a record contained the value 6,000, this formula would give us an offset of
4,984 into the data section.

In order to determine where in the file this offset really points to, we also
need to know where the data section starts. This can be calculated by
determining the size of the search tree in bytes and then adding an additional
16 bytes for the data section separator:

    $offset_in_file = $data_section_offset
                      + $search_tree_size_in_bytes
                      + 16

Since we subtract and then add 16, the final formula to determine the
offset in the file can be simplified to:

    $offset_in_file = ( $record_value - $node_count )
                      + $search_tree_size_in_bytes

### IPv4 addresses in an IPv6 tree

When storing IPv4 addresses in an IPv6 tree, they are stored as-is, so they
occupy the first 32 bits of the address space (from 0 to 2**32 - 1).

Creators of databases should decide on a strategy for handling the various
mappings between IPv4 and IPv6.

The strategy that MaxMind uses for its GeoIP databases is to include a pointer
from the `::ffff:0:0/96` subnet to the root node of the IPv4 address space in
the tree. This accounts for the
[IPv4-mapped IPv6 address](http://en.wikipedia.org/wiki/IPv6#IPv4-mapped_IPv6_addresses).

MaxMind also includes a pointer from the `2002::/16` subnet to the root node
of the IPv4 address space in the tree. This accounts for the
[6to4 mapping](http://en.wikipedia.org/wiki/6to4) subnet.

Database creators are encouraged to document whether they are doing something
similar for their databases.

The Teredo subnet cannot be accounted for in the tree. Instead, code that
searches the tree can offer to decode the IPv4 portion of a Teredo address and
look that up.
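
One way lookup code could decode a Teredo address is sketched below, using the
`teredo` property of `IPv6Address` from Python's standard `ipaddress` module.
The helper name `teredo_client_ipv4` is hypothetical, and whether a reader
should look up the client address (rather than the server address) is an
assumption of this sketch.

```python
import ipaddress


def teredo_client_ipv4(address):
    """Return the client IPv4 address embedded in a Teredo address.

    Returns None if the input is not an IPv6 Teredo address.
    """
    ip = ipaddress.ip_address(address)
    if ip.version != 6:
        return None
    # IPv6Address.teredo is a (server, client) pair of IPv4Address
    # objects, or None when the address is not in the Teredo subnet.
    teredo = ip.teredo
    if teredo is None:
        return None
    _server, client = teredo
    return client
```

The returned IPv4 address can then be looked up in the tree like any other
IPv4 address.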

## Data Section Separator

There are 16 bytes of NULLs in between the search tree and the data
section. This separator exists in order to make it possible for a verification
tool to distinguish between the two sections.

This separator is not considered part of the data section itself. In other
words, the data section starts at `$size_of_search_tree + 16` bytes in the
file.
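
A verification tool might check the separator along these lines. This is a
minimal Python sketch; `has_valid_separator` is a hypothetical name, and
`database` is assumed to hold the entire file contents.

```python
def has_valid_separator(database: bytes, size_of_search_tree: int) -> bool:
    """Check for 16 NULL bytes directly after the search tree."""
    separator = database[size_of_search_tree:size_of_search_tree + 16]
    # A truncated file yields a short slice, which also fails the check.
    return separator == b"\x00" * 16
```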

## Output Data Section

Each output data field has an associated type, and that type is encoded as a
number that begins the data field. Some types are variable length. In those
cases, the type indicator is also followed by a length. The data payload
always comes at the end of the field.

All binary data is stored in big-endian format.

Note that the *interpretation* of a given data type's meaning is decided by
higher-level APIs, not by the binary format itself.

### pointer - 1

A pointer to another part of the data section's address space. The pointer
will point to the beginning of a field. It is illegal for a pointer to point
to another pointer.

Pointer values start from the beginning of the data section, *not* the
beginning of the file.