Synopsis of Tools for Importing OpenStreetMap Data into PostgreSQL

Sisyphers


The de facto standard for rendering OpenStreetMap data is PostgreSQL/PostGIS-based: The planet file or an extract is imported into the database. The same approach is useful for data exploration, connecting OSM data to your own applications or doing data analysis. To get OSM data into Postgres, a suitable tool is necessary: Usually either osm2pgsql or imposm is used for this purpose.

While osm2pgsql is more mature and is used, for example, on the core OSM infrastructure, imposm is still not considered 1.0 despite being available for several years and proven in many installations. Then there is osmium, the Swiss army knife of OpenStreetMap, which a while ago gained the ability to emit data in a PostgreSQL-compatible format.

In this article we are going to compare the tools, show their strengths and weaknesses, and look at resource consumption and their feasibility for different use cases.

Feature Set and Philosophy

The available tools have different approaches: osm2pgsql is a general-purpose importer with a focus on map rendering and geocoding. imposm is mostly used for map rendering, with a focus on generating tables that are pre-optimized for maps. osmium treats Postgres as an export format: it does not even connect to the database, but just offers the option to pipe the data in Postgres’ COPY format.

imposm allows customization through a mapping file, in which the user can describe which OSM elements will be considered, what tables will be created and how data shall be normalized (e.g. data types, sanitization).
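
A minimal sketch of such a mapping file (YAML; the table and column names here are illustrative, not taken from any shipped example):

tables:
  shops:
    type: point
    mapping:
      shop: [__any__]
    columns:
      - name: osm_id
        type: id
      - name: geometry
        type: geometry
      - name: name
        key: name
        type: string
      - name: shop
        key: shop
        type: string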

osm2pgsql has two knobs that can be tweaked: firstly, a style file that defines which columns will be created, and secondly, a Lua scripting API that allows for more complex transformations.
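
The style file is a plain table of OSM object type, tag key, column type and flags. An excerpt in the spirit of the default style (the exact shipped lines may differ):

# OsmType   Tag        DataType   Flags
node,way    amenity    text       polygon
node,way    building   text       polygon
node,way    name       text       linear
node,way    shop       text       polygon
way         highway    text       linear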

osmium usually just writes one table, with all tags in a jsonb column. The considered tags can be customized through white and black lists (include_tags and exclude_tags).
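
A sketch of such an export pipeline (file names, config contents and the target table layout are illustrative; check the osmium export documentation of your version for the exact options):

# osmium.json – restrict the export to a few keys:
# { "include_tags": ["shop", "amenity", "name"] }
osmium export germany-latest.osm.pbf -c osmium.json -f pg -o germany.pg

# create a matching table and load the COPY stream
psql -d osm -c "CREATE TABLE osmdata (geom geometry(GEOMETRY, 4326), tags jsonb);"
psql -d osm -c "\copy osmdata FROM 'germany.pg'"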

Many workflows use OpenStreetMap’s replication mechanism, which allows staying up to date with the latest upstream data without having to download the full planet. OpenStreetMap provides minutely, hourly or daily diffs which can be ingested by compatible software. osm2pgsql and imposm support this natively: They can update an existing PostgreSQL database if run in diff mode (imposm) or slim mode (osm2pgsql). osmium can update on-disk OSM PBF files, but doesn’t provide a mechanism to replicate changes into a database.
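
A rough sketch of the respective invocations (file and config names are illustrative, and flags may differ between versions):

# osm2pgsql: import in slim mode, later apply a change file
osm2pgsql --slim -d osm germany-latest.osm.pbf
osm2pgsql --slim --append -d osm changes.osc.gz

# imposm: import with diff support, then keep applying replication diffs
imposm import -config config.json -read germany-latest.osm.pbf -write -diff
imposm run -config config.json

# osmium: update a local PBF file, no database involved
osmium apply-changes germany-latest.osm.pbf changes.osc.gz -o germany-updated.osm.pbf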

Making the Right Choice

The choice of the importer software may depend on your external constraints: If you want to use a map style that includes osm2pgsql-compatible queries, it makes sense to stick with it, as the resulting database schemas are not equal and would need either additional views or transformations.

On the other hand, if you prefer the “one object type per table” nature of imposm, queries from e.g. osm2pgsql can be ported with some manual work. Also, imposm makes it extremely easy to normalize common OSM values, e.g. “yes” and “no” strings can be converted into native PostgreSQL booleans and implausible ones can be discarded at import time, thus reducing the need to sanitize values while querying.
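
To illustrate the difference, compare querying raw tag values with querying a column that was normalized at import time (table names are illustrative; the first query assumes a single table with a jsonb or hstore tags column, the second a hypothetical roads table with a boolean oneway column produced by such a mapping):

-- raw tags: every query has to deal with OSM's value variants
SELECT count(*) FROM osmdata
	WHERE tags->>'oneway' IN ('yes', 'true', '1');

-- normalized at import time (e.g. imposm's bool column type)
SELECT count(*) FROM roads WHERE oneway;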

For analytical tasks, osmium may be more useful as it won’t perform any tag transformations by default.

Software Strategy

Release, development, support and maintenance strategies are very different for these three tools. Especially for long-term system operation this can be an important aspect to consider.

osm2pgsql dates back to 2006 and has the largest number of contributors. Maintenance is performed by a community of people. Pull requests are processed in a timely manner, providing useful feedback. There are several releases per year. Packages are available for a wide range of operating systems/distributions: Debian, Ubuntu, FreeBSD, macOS (via Homebrew) and others.

imposm started in 2012 and is under single-person corporate development and maintainership. Issues do not always receive feedback, pull requests are treated defensively and non-trivial changes are frowned upon. Releases are irregular, and packaging is thus non-existent: Neither Debian, Ubuntu, FreeBSD nor macOS provides packages.2 There is little development of new features.

osmium is a command-line tool based on the C++ library libosmium, which started in 2013 and is mostly developed by a single developer. Issues and feature requests are discussed and lead to development if deemed useful. Pull requests receive feedback and are merged based on merit. Releases are frequent, and release notes along with consistent version numbering help communicate changes. Packages exist in many distributions: Debian, Ubuntu, Fedora and macOS (Homebrew) ship both the library and the command-line tool, while FreeBSD only provides the library.

Resources for Import

Disk Space

Due to the large amounts of data it is generally advisable to work only with flash-based storage: SATA SSDs or, better, NVMe drives. These have superior sequential write speeds and can retrieve random data much faster than spinning disks.

imposm converts the data into a more efficient on-disk format before flushing it into the database: By default it creates a cache directory which takes around twice the size of the original file. With the standard mapping the cluster size will be about 6x the size of the input file. Your mileage may vary, though, as WAL configuration, TOAST settings and the file system influence cluster size.

osmium holds an in-memory cache by default (more details in the next section). A full import needs around 9x the original file size. No indexes will be created by default.

osm2pgsql needs around 7x the disk space of the original file using the default style. In --slim mode the intermediate tables are kept on disk (they are needed for later updates); the --drop option removes them once the initial import is finished.

RAM

Due to the space-optimized design of the OSM PBF file format, assembling a line or polygon requires traversing at least one level of indirection. More specifically, all nodes are usually located at the very beginning of the file, followed by all ways. Thus, to assemble a line, the nodes at the start of the file have to be referenced. When selectively retrieving an object, jumping back to referenced elements is viable, but for unfiltered imports a full in-memory cache is a much faster option. Due to the extremely large number of nodes, it is advisable to have lots of free RAM, optimally equal to the size of the input file.

osm2pgsql lets users specify the cache size and imposm utilizes an on-disk cache. osmium ships with several strategies and defaults to flex_mem, which tries to select the most efficient type. On memory-constrained systems, it can be run with dense_file_array, which uses a disk-based cache.

In the benchmark imposm utilized up to 3 GB of RAM (roughly equal to the extract size), osmium (with flex_mem) peaked at 5 GB and osm2pgsql has been configured to use up to 5 GB.

In addition to the importer’s memory usage, PostgreSQL needs buffer space as well. In the benchmark one quarter of the total system memory was reserved for its purposes. Due to the flexible memory management of PostgreSQL, memory can be over-provisioned for the time of import, as it will be reclaimed after the importers finish.

CPU

Nowadays, all three importers use multithreaded architectures which allow them to saturate several cores. Also, with the introduction of parallel workers in PostgreSQL, the database can work on the import operations concurrently.
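
The relevant settings are standard PostgreSQL knobs, set in postgresql.conf or per session (the values here are illustrative and should be tuned to the available cores):

-- parallel index builds after the bulk load (PostgreSQL 11 and later)
SET max_parallel_maintenance_workers = 4;
-- parallel scans for queries
SET max_parallel_workers_per_gather = 4;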

(Figure: CPU saturation during import)

The import duration will not scale linearly with core count, though. While it is advisable to have four cores instead of two, and six will be slightly faster than four, there are diminishing returns: With imposm, import duration improved by 25 % when going from 4 to 16 CPU cores; osmium improved by less than 10 %.

Import Speed

Because everyone loves meaningless synthetic benchmarks, here you go:

  • osmium imported the test file in 15 minutes.
  • imposm took 44 minutes to read, import and generate indexes.
  • osm2pgsql completed the import, including indexes, in 62 minutes.

Querying

Query performance is mostly impacted by database size and indexes: A small database may be usable without indexes for simpler queries; a large database will slow down if scanned sequentially.

PostgreSQL offers a lot of (nearly) zero-cost abstractions (e.g. views) that help make data structures manageable, but good planning can reduce database size and thus improve performance.
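
For instance, a plain view over a single-table import (like the osmdata table used below) costs nearly nothing to create and stores no data of its own; queries against it are rewritten against the base table, so it helps organization rather than speed:

CREATE VIEW shops AS
	SELECT geom, tags
	FROM osmdata
	WHERE tags->>'shop' IS NOT NULL;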

Example: Table Structure

Let’s assume you have an imported database and now want to perform an analysis on shops. With an osmium-style single table import, you’d have a table with two columns: geometry and tags. When you perform a query, the database would need to scan all tags for the key shop.

SELECT tags
FROM osmdata
WHERE
	ST_DWithin(
		ST_SetSRID(ST_MakePoint(6.968, 50.943), 4326),
		geom,
		0.1
	)
	AND tags->'shop' IS NOT NULL;

-- Query planner:
->  Parallel Seq Scan on osmdata
	(cost=0.00..1509477.60 rows=3148508 width=87)
	(actual time=3958.424..7044.846 rows=52 loops=3)
[...]
Rows Removed by Filter: 7578245

A speedup can be achieved by an index on the geometry column, but in larger or densely mapped areas PostgreSQL would still need to scan many tuples.

CREATE INDEX ON osmdata USING gist (geom);
-- took 438 s
-- table data: 6047 MB
-- index size: 1546 MB

SELECT tags
FROM osmdata
WHERE
	ST_DWithin(
		ST_SetSRID(ST_MakePoint(6.968, 50.943), 4326),
		geom,
		0.1
	)
	AND tags->'shop' IS NOT NULL;

-- Now the spatial index will be used first and the
-- result set will be filtered down afterwards.
-- Query planner:
Index Cond: (geom && '[...]'::geometry)
Filter: (((tags -> 'shop'::text) IS NOT NULL) AND
	('[...]'::geometry && st_expand(geom, '{...}'::double precision)) AND
	_st_dwithin('[...]'::geometry, geom, '{...}'::double precision))

Rows Removed by Filter: 699511

To achieve a speedup, all elements with the shop key could be indexed3, but an index needs time to build, penalizes subsequent updates and inserts, and takes disk space.

CREATE INDEX ON osmdata (( tags ->> 'shop'));
-- took 39 s
-- index size: 487 MB

Alternatively, all shops could be written into a separate table during import, which could remove the need for an index. But especially for exploratory development such requirements often cannot be planned upfront. If reimporting is not viable due to resource restrictions, a materialized view4 can be a good alternative, as it stores its results on disk.5

CREATE MATERIALIZED VIEW osmdata_shops AS
	SELECT * FROM osmdata
		WHERE tags->>'shop' IS NOT NULL;
-- took 18 s
-- materialized view size: 33 MB

SELECT tags
FROM osmdata_shops
WHERE
	ST_DWithin(
		ST_SetSRID(ST_MakePoint(6.968, 50.943), 4326),
		geom,
		0.1
	);
-- Execution Time: 150 ms
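
Keep in mind that a materialized view holds a snapshot: after the base table changes (e.g. through diffs), it has to be refreshed explicitly.

REFRESH MATERIALIZED VIEW osmdata_shops;
-- with a unique index on the view, CONCURRENTLY avoids blocking readers:
-- REFRESH MATERIALIZED VIEW CONCURRENTLY osmdata_shops;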

Update or Full Re-import?

osm2pgsql and imposm allow for continuously updatable databases, but this comes with some limitations: A database cannot be updated if it hasn’t been set up as updatable from the start. The update mechanisms also need additional on-disk storage. Additionally, most users will experience bloat: the database size will only ever increase. Some techniques exist to reduce the impact of this, but an eventual reimport will be necessary if disk space needs to be reclaimed.
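
One such technique is vacuuming (the table name below is just an example): a plain VACUUM marks dead rows as reusable but rarely shrinks the files on disk, while VACUUM FULL rewrites the table and returns space to the operating system, at the cost of an exclusive lock and temporary extra disk usage.

VACUUM planet_osm_polygon;
VACUUM FULL planet_osm_polygon;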

Conclusion

If you want to process OpenStreetMap’s geographical data in PostgreSQL you are presented with some good options:

osm2pgsql is the solid choice, with wide compatibility and a long project history. It’s resource efficient and offers some customization options. It also handles diffs.

imposm might not be as well supported, but it offers quick options for preprocessing and can be disk-efficient if needed.

osmium might be the newest, but since it’s a thin layer on top of the legendarily efficient libosmium, it’s the quickest of the three. It has the fewest customization options, but it’s perfect for everyone who wants to do their work mostly in SQL.

Annex

  • Benchmarks have been performed with a Germany extract (3.0 GB), on a virtualized server with four dedicated cores, 16 GB RAM and NVMe storage, formatted with ext4. The OS is Ubuntu 19.04.
    • PostgreSQL configuration values:
      • shared_buffers = 4GB
      • temp_buffers = 32MB
      • work_mem = 1GB
      • maintenance_work_mem = 64MB
      • autovacuum = off
    • Every run has been performed with a freshly recreated cluster and a PostgreSQL restart.
  • The 16 core benchmarks were performed in the same setup, except for RAM, which increased to 64 GB.
  • The shop query example has been done on the NRW, Germany extract (635 MB).

Consulting is available from the author.


  1. Banner picture by Fallaner, licensed under CC BY-SA 4.0 ↩︎

  2. Some of those do have packages for imposm 2, the Python-based predecessor of imposm (3), which is fundamentally incompatible and mostly unsuitable for the tasks discussed in this article. ↩︎

  3. See https://www.postgresql.org/docs/12/datatype-json.html#JSON-INDEXING ↩︎

  4. See https://www.postgresql.org/docs/current/sql-creatematerializedview.html ↩︎

  5. Materialized views and a continuously updated database can be tricky to handle as those are only updated on demand. ↩︎