<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Thomas Skowron's Blog</title><link>https://thomas.skowron.eu/blog/</link><description>Recent content on Thomas Skowron's Blog</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Sat, 18 Feb 2017 11:19:40 +0000</lastBuildDate><atom:link href="https://thomas.skowron.eu/blog/" rel="self" type="application/rss+xml"/><item><title>Load Balancing Without Giving Away the Keys</title><link>https://thomas.skowron.eu/blog/loadbalancing-without-giving-away-the-keys/</link><pubDate>Mon, 31 Aug 2020 15:49:00 +0000</pubDate><guid>https://thomas.skowron.eu/blog/loadbalancing-without-giving-away-the-keys/</guid><description>&lt;p>Distributing the load of a web application or especially an API endpoint using a load balancer (LB) is highly useful: You can get better performance, make software rollouts smoother and withstand node failure.&lt;/p>
&lt;p>If an application node becomes unavailable due to failure or planned maintenance, the LB notices that it no longer responds and stops sending traffic to it.&lt;/p>
&lt;video width="100%" height="200" autoplay loop playsinline muted>
&lt;source src="lb-pure.mp4">
&lt;source src="lb-pure-av1.mkv">
&lt;/video>
&lt;p>The LB itself can go down, too, but since LBs can be made stateless, failover and redundancy can be achieved more easily, e.g. through CARP&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> or anycast. Many cloud providers offer managed load balancers.&lt;/p>
&lt;p>Since HTTPS has become the standard even for non-private data transfers, the management of TLS certificates and keys needs to be considered as well. Operators often opt to terminate TLS at the load balancer and send traffic to the nodes via unencrypted HTTP, compromising confidentiality.&lt;/p>
&lt;p>TLS can also be handled by the nodes themselves, but this bears two challenges: all the nodes need the same key and certificate, and if you&amp;rsquo;re acquiring the certificates from Let&amp;rsquo;s Encrypt or another ACME-compatible CA, they need to serve the appropriate authorization challenge.&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>&lt;/p>
&lt;p>For serving HTTPS directly from a Go application, there is the &lt;a href="https://github.com/caddyserver/certmagic">certmagic&lt;/a> library, which also powers the automatic TLS management in &lt;a href="https://caddyserver.com">Caddy&lt;/a>. By default, it manages key generation, cert acquisition and wraps the HTTP server with TLS. The storage for the keys is rather flexible: normally they&amp;rsquo;re stored on disk, but you can add a different adapter or write your own. There is a third-party S3 adapter, which allows you to store your certs/keys/challenges on Amazon S3. If you configure your nodes to use the same S3 bucket, it doesn&amp;rsquo;t matter which node is asked to serve the challenge: all of them can answer, because they pull the challenge via the API.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/loadbalancing-without-giving-away-the-keys/keys-on-s3.png" alt="Keys and certs on S3">
&lt;/p>
&lt;p>Personally, I found it unacceptable to store all the keys in plain text on someone else&amp;rsquo;s hard drive. Perfect forward secrecy prevents decryption of HTTPS traffic recorded before a breach, but with a third-party service it&amp;rsquo;s impossible to determine whether such a breach took place.&lt;/p>
&lt;p>This is the reason why I built a &lt;a href="https://github.com/thomersch/certmagic-generic-s3">custom certificate storage for certmagic&lt;/a>, which stores the keys, certs and challenges on any S3-compatible store (e.g. Backblaze, DigitalOcean Spaces or your own Minio instance), but encrypts them using &lt;a href="https://nacl.cr.yp.to/secretbox.html">NaCl&amp;rsquo;s secret box&lt;/a>&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup> before sending them off.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/loadbalancing-without-giving-away-the-keys/keys-on-s3-encr.png" alt="Keys and certs on S3">
&lt;/p>
&lt;p>You don&amp;rsquo;t have to trust any of the provider&amp;rsquo;s claims about “at rest” encryption and you don&amp;rsquo;t have to implement any highly available storage yourself. You just have to spend a few pennies per month on any S3-compatible storage, without lock-in.&lt;/p>
&lt;p>This allows you to do HTTPS without breaking encryption mid-way, and you can offload the TLS work to the nodes instead of the LB having to do all of it.&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>&lt;a href="https://www.freebsd.org/doc/handbook/carp.html">FreeBSD Manual on CARP&lt;/a> &lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>To be pedantic, only if you&amp;rsquo;re using either TLS-ALPN or HTTP authorization. You can also use DNS validation, if your DNS software or zone provider supports this. &lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>More specifically, the &lt;a href="https://pkg.go.dev/golang.org/x/crypto@v0.0.0-20200820211705-5c72a883971a/nacl/secretbox?tab=doc">secretbox implementation in Go&lt;/a> &lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>Printing with SwiftUI</title><link>https://thomas.skowron.eu/blog/printing-with-swiftui/</link><pubDate>Wed, 08 Jul 2020 16:12:00 +0000</pubDate><guid>https://thomas.skowron.eu/blog/printing-with-swiftui/</guid><description>&lt;p>This post combines one of the world&amp;rsquo;s oldest information technologies with one of the newer ones: printing on paper and SwiftUI. SwiftUI is Apple&amp;rsquo;s rather new framework for declaratively designing user interfaces, which runs on all their platforms. It has a few rough edges, but it is very powerful, and because it allows mixing and matching with classic &lt;code>NSViews&lt;/code> on macOS, you can even use them for print (or PDF) output.&lt;/p>
&lt;p>macOS has always had rather great printing support and for a first-class Mac app this is a feature one shouldn&amp;rsquo;t miss.&lt;/p>
&lt;p>Integrating the print workflow into a new app is not too hard and starts with connecting the “Print&amp;hellip;” menu item&amp;rsquo;s action to an outlet. The menu configuration is not part of SwiftUI, but lives by default in a dedicated storyboard, &lt;code>Main.storyboard&lt;/code>.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/printing-with-swiftui/storyboard.png" alt="Storyboard">
&lt;/p>
&lt;p>With an option-click drag (aka &lt;em>right-click drag&lt;/em>) you can connect the print menu entry to an outlet in your &lt;code>AppDelegate&lt;/code> class.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/printing-with-swiftui/outlets.png" alt="Outlets">
&lt;/p>
&lt;p>Your delegate will look something like this:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-swift" data-lang="swift">&lt;span style="font-weight:bold">@IBAction&lt;/span> &lt;span style="font-weight:bold">func&lt;/span> &lt;span style="color:#900;font-weight:bold">applicationPrint&lt;/span>(&lt;span style="font-weight:bold">_&lt;/span> sender: NSMenuItem) {
...
}
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Inside this method you&amp;rsquo;ll need to set up the view you want to print, configure the print layout info and then hand over to the user to decide on the output format and device.&lt;/p>
&lt;p>First of all, inside your delegate, set up an &lt;code>NSPrintInfo&lt;/code> instance and configure it:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-swift" data-lang="swift">&lt;span style="font-weight:bold">let&lt;/span> &lt;span style="color:#008080">printInfo&lt;/span> = NSPrintInfo()
printInfo.scalingFactor = &lt;span style="color:#099">0.7&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>I used a smaller scaling factor, because the font sizes suitable for screens were too large for printing in my particular case.&lt;/p>
&lt;p>Unless you want to print the whole window with e.g. all input elements and text fields (in which case you can simply pass &lt;code>window!.contentView&lt;/code> to the print routine), you need to initialize your printable view and populate it with some data. Let&amp;rsquo;s assume you have a SwiftUI view called &lt;code>EntryListView&lt;/code> which takes a list of strings as data:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-swift" data-lang="swift">&lt;span style="font-weight:bold">let&lt;/span> &lt;span style="color:#008080">entryListView&lt;/span> = EntryListView(data: [&lt;span style="color:#b84">&amp;#34;This is a string&amp;#34;&lt;/span>, &lt;span style="color:#b84">&amp;#34;And another one&amp;#34;&lt;/span>, &lt;span style="color:#b84">&amp;#34;One more&amp;#34;&lt;/span>])
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Since this is a standalone SwiftUI view, it needs to be contained in an &lt;code>NSHostingView&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-swift" data-lang="swift">&lt;span style="font-weight:bold">let&lt;/span> &lt;span style="color:#008080">printContainerView&lt;/span> = NSHostingView(rootView: entryListView)
printContainerView.frame.size = CGSize(width: &lt;span style="color:#099">800&lt;/span>, height: &lt;span style="color:#099">600&lt;/span>)
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This view is now wrapped and can be passed to a print operation alongside the print info configuration:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-swift" data-lang="swift">&lt;span style="font-weight:bold">let&lt;/span> &lt;span style="color:#008080">printOperation&lt;/span> = NSPrintOperation(view: printContainerView, printInfo: printInfo)
printOperation.printInfo.isVerticallyCentered = &lt;span style="font-weight:bold">false&lt;/span>
printOperation.printInfo.isHorizontallyCentered = &lt;span style="font-weight:bold">false&lt;/span>
printOperation.runModal(&lt;span style="font-weight:bold">for&lt;/span>: window, delegate: &lt;span style="font-weight:bold">self&lt;/span>, didRun: &lt;span style="font-weight:bold">nil&lt;/span>, contextInfo: &lt;span style="font-weight:bold">nil&lt;/span>)
&lt;/code>&lt;/pre>&lt;/div>&lt;p>If the user selects “Print&amp;hellip;” from the menu bar or hits “⌘P”, a print dialog will pop up with your view ready to be printed:&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/printing-with-swiftui/printdialog.png" alt="Print Dialog">
&lt;/p></description></item><item><title>Migrating Foreign Keys in PostgreSQL</title><link>https://thomas.skowron.eu/blog/migrating-foreign-keys-in-postgresql/</link><pubDate>Thu, 21 May 2020 08:30:48 +0000</pubDate><guid>https://thomas.skowron.eu/blog/migrating-foreign-keys-in-postgresql/</guid><description>&lt;p>Applications often need to work with external IDs, e.g. UUIDs of third-party services or &lt;a href="https://en.wikipedia.org/wiki/Stock_keeping_unit">SKUs&lt;/a>. In such cases, an external ID should be stored just once in a mapping table and from there on only be referenced by an internal (e.g. serial) foreign key. Without a dedicated mapping table, you will carry such external identifiers through all tables. This may seem reasonable at the beginning, but can soon cause trouble, e.g. when you want to add a second type of external service or when you realize that those seemingly unique identifiers are not as unique as assumed.&lt;/p>
&lt;p>Even though I should know better, this kind of mistake happens to me from time to time anyway. Or it happens to others and I have to save the day.&lt;/p>
&lt;p>Because PostgreSQL is not the wild west, referencing a field in a different table is enforced through a constraint, which is fantastic for data consistency, but annoying if you misdesigned your data structure and have to change your foreign keys. But don&amp;rsquo;t worry, it&amp;rsquo;s still possible to change this inside a single transaction with those four-ish steps:&lt;/p>
&lt;ol start="0">
&lt;li>Create the new key ID column, make it unique (e.g. &lt;code>serial&lt;/code>)&lt;/li>
&lt;li>Identify all tables which reference the (old) ID:&lt;br>
1.1. Note the table name&lt;br>
1.2. Find the referencing column&lt;br>
1.3. Determine the foreign key constraint name (see &lt;code>pg_constraint&lt;/code> table)&lt;/li>
&lt;li>For each referencing table:&lt;br>
2.1. Drop the existing constraint (from 1.3.)&lt;br>
2.2. Update the referencing key values to the new ones&lt;br>
2.3. Add new constraint for the new base table ID&lt;/li>
&lt;li>Drop primary key constraint from base table&lt;/li>
&lt;li>Create new primary key constraint on base table&lt;/li>
&lt;/ol>
&lt;h2 id="example">Example&lt;/h2>
&lt;p>Let&amp;rsquo;s assume your application tracks users, which have external identifiers. At the moment you have a user table with this external ID as the primary key:&lt;/p>
&lt;h3 id="user">&lt;code>user&lt;/code>&lt;/h3>
&lt;table>
&lt;tr>
&lt;th>external_user_id (primary key)&lt;/th>
&lt;th>user_name&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>12312&lt;/td>
&lt;td>Alice&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>91823&lt;/td>
&lt;td>Bob&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Thus, all tables that reference any user have those long, external identifiers:&lt;/p>
&lt;h3 id="user_permissions">&lt;code>user_permissions&lt;/code>&lt;/h3>
&lt;table>
&lt;tr>
&lt;th>user_id&lt;/th>
&lt;th>can_read&lt;/th>
&lt;th>can_write&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>12312&lt;/td>
&lt;td>true&lt;/td>
&lt;td>true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>91823&lt;/td>
&lt;td>false&lt;/td>
&lt;td>true&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Now, if you want to gain flexibility and add an internal ID, you can start off by adding a &lt;code>serial&lt;/code> column to users:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">ALTER&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> &lt;span style="font-weight:bold">user&lt;/span> &lt;span style="font-weight:bold">ADD&lt;/span> &lt;span style="font-weight:bold">column&lt;/span> id &lt;span style="color:#999">serial&lt;/span> &lt;span style="font-weight:bold">UNIQUE&lt;/span>;
&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>user&lt;/code> would now look like this:&lt;/p>
&lt;table>
&lt;tr>
&lt;th>external_user_id (primary key)&lt;/th>
&lt;th>user_name&lt;/th>
&lt;th>id&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>12312&lt;/td>
&lt;td>Alice&lt;/td>
&lt;td>1&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>91823&lt;/td>
&lt;td>Bob&lt;/td>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Now, for every referencing table, you need to strip the constraint, update the values to the new serial ID and then add the new constraint:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">ALTER&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> user_permissions
&lt;span style="font-weight:bold">DROP&lt;/span> &lt;span style="font-weight:bold">CONSTRAINT&lt;/span> user_permissions_user_id_...id;
&lt;span style="font-weight:bold">UPDATE&lt;/span> user_permissions
&lt;span style="font-weight:bold">SET&lt;/span> user_id &lt;span style="font-weight:bold">=&lt;/span> (
&lt;span style="font-weight:bold">SELECT&lt;/span> id
&lt;span style="font-weight:bold">FROM&lt;/span> &lt;span style="font-weight:bold">user&lt;/span>
&lt;span style="font-weight:bold">WHERE&lt;/span> external_user_id &lt;span style="font-weight:bold">=&lt;/span> user_permissions.user_id
);
&lt;span style="font-weight:bold">ALTER&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> user_permissions
&lt;span style="font-weight:bold">ADD&lt;/span> &lt;span style="font-weight:bold">CONSTRAINT&lt;/span> user_permissions_user_id_...id
&lt;span style="font-weight:bold">FOREIGN&lt;/span> &lt;span style="font-weight:bold">KEY&lt;/span> (user_id)
&lt;span style="font-weight:bold">REFERENCES&lt;/span> &lt;span style="font-weight:bold">user&lt;/span> (id);
&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>user_permissions&lt;/code> now looks like this:&lt;/p>
&lt;table>
&lt;tr>
&lt;th>user_id&lt;/th>
&lt;th>can_read&lt;/th>
&lt;th>can_write&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>true&lt;/td>
&lt;td>true&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>false&lt;/td>
&lt;td>true&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Finally, you can now change &lt;code>user.external_user_id&lt;/code> without constraints; you just have to swap out the primary key in &lt;code>user&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">ALTER&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> &lt;span style="font-weight:bold">user&lt;/span> &lt;span style="font-weight:bold">DROP&lt;/span> &lt;span style="font-weight:bold">CONSTRAINT&lt;/span> user_pkey;
&lt;span style="font-weight:bold">ALTER&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> &lt;span style="font-weight:bold">user&lt;/span> &lt;span style="font-weight:bold">ADD&lt;/span> &lt;span style="font-weight:bold">PRIMARY&lt;/span> &lt;span style="font-weight:bold">KEY&lt;/span> (id);
&lt;/code>&lt;/pre>&lt;/div>&lt;p>At this point, you can even move the external identifier into a different table if you wish, separating the concerns even better.&lt;/p>
&lt;h2 id="bonus-content-for-django-users">Bonus Content for Django Users&lt;/h2>
&lt;p>Swapping out primary keys is too difficult for the automatic Django migration assistant, so you&amp;rsquo;ll need to get your hands dirty. Obviously, you can do all those steps as raw SQL migrations, but then &lt;code>makemigrations&lt;/code> will still nag you about changes in your model, so you have to use &lt;code>SeparateDatabaseAndState&lt;/code>: apply the changes with SQL and then tell Django what the effect is. For an example, see my code in &lt;a href="https://github.com/thomersch/openstreetmap-calendar/commit/7890094c28dc1990eee87ffe65e5b2b926a46add">osmcal&lt;/a>.&lt;/p>
&lt;h3 id="extra-bonus">Extra Bonus&lt;/h3>
&lt;p>If you&amp;rsquo;re working with raw SQL migrations, you might run into trouble and all you&amp;rsquo;ll get is some context-free SQL error message. For better debugging, you can use the &lt;code>sqlmigrate&lt;/code> subcommand, which displays the SQL queries instead of executing them directly. That way you can step through the statements and debug them one by one.&lt;/p>
&lt;pre>&lt;code>./manage.py sqlmigrate &amp;lt;app name&amp;gt; &amp;lt;migration name&amp;gt;
&lt;/code>&lt;/pre></description></item><item><title>Making of the OpenStreetMap Calendar</title><link>https://thomas.skowron.eu/blog/making-of-the-osm-calendar/</link><pubDate>Fri, 15 May 2020 12:28:17 +0000</pubDate><guid>https://thomas.skowron.eu/blog/making-of-the-osm-calendar/</guid><description>&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/making-of-the-osm-calendar/osmcal.png" alt="Screenshot of osmcal">
&lt;/p>
&lt;p>Wikis have changed the landscape of information technology: no matter what organisation you&amp;rsquo;re in and what topic you&amp;rsquo;re working on, chances are that there is a wiki in which you can collaborate with others. Mostly without asking permission, you can put in notes, make lists and refine the material of others. Same with the OpenStreetMap wiki: some pages are the standard source of information, like &lt;em>“How to map a…”&lt;/em>, others are relevant to just a single person, like pages where mappers list all the streets they&amp;rsquo;ve already mapped. And then there are things that started with a good idea and got out of hand…&lt;/p>
&lt;p>While collaborating on texts works well in most wiki software, when it comes to tables, templates and structure, things get confusing rather quickly. Sadly, one of those examples is the wiki calendar, which involves fiddling around with custom templates, writing markup and someone rotating out past events by hand. When we talked about the OSM wiki at events, people complained repeatedly about the calendar. Some communities don&amp;rsquo;t even announce their events in it, because they have shifted to Meetup or Facebook for better usability. There have been discussions about how to solve it, maybe with some sort of plugin or extension, but nothing has happened.&lt;/p>
&lt;h2 id="inception">Inception&lt;/h2>
&lt;p>So, in early summer of 2019 &lt;a href="https://youtu.be/MRuS3dxKK9U?t=89">I was mad as hell and couldn&amp;rsquo;t take it anymore&lt;/a>, so I sat down and started writing a &lt;a href="https://osmcal.org">standalone, special-purpose web-based calendar for the OSM community. With maps!&lt;/a>&lt;/p>
&lt;p>Initially I wanted to use the infrastructure of the German OpenStreetMap Local Chapter, FOSSGIS, but they were busy generating rules about what I would have to provide before getting a precious openstreetmap.de subdomain, so I went on my own and the domain &lt;a href="https://osmcal.org">osmcal.org&lt;/a> was born.&lt;/p>
&lt;p>The software is based on the incredibly powerful Django framework, which helps with HTML templates, form handling, databases, authentication and much more. It&amp;rsquo;s all &lt;a href="https://github.com/thomersch/openstreetmap-calendar">open source&lt;/a>, of course, as it should be. The issue tracker is on GitHub and has already attracted a small community of people with bug reports, feature requests and general questions.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>While osmcal started with a minimal feature set, development has continued and with the input from community members some features have shipped since then:&lt;/p>
&lt;ul>
&lt;li>In October 2019, the possibility to “join” events was added, so organizers get a list of participants and, as a participant, you can see who else is going to come&lt;/li>
&lt;li>In December 2019, iCalendar subscriptions have been made available&lt;/li>
&lt;li>Since February 2020, event organizers can create surveys for participation, so they can ask people for e.g. their t-shirt sizes, their dietary preferences or skill level.&lt;/li>
&lt;/ul>
&lt;p>Special thanks to all reporters, especially &lt;a href="https://github.com/jbelien">@jbelien&lt;/a>, &lt;a href="https://github.com/adrienandrem">@adrienandrem&lt;/a> and &lt;a href="https://github.com/qeef">@qeef&lt;/a> for their collaboration and testing!&lt;/p>
&lt;h2 id="future">Future&lt;/h2>
&lt;p>Of course, at the moment, in-person meetings are out of the question, but this doesn’t mean that there are no events at all. Some are postponed, but many are happening online and those need to be scheduled and communicated as well. Thus, now it’s about time to integrate time zone support, which I initially deemed unnecessary, as every event would have a location, but that assumption has been disproven by the rise of online events.&lt;/p>
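&lt;p>The gist of time zone support (sketched in Python with hypothetical values, since osmcal is Django-based): store one aware timestamp and convert it per viewer, instead of storing a naive local time.&lt;/p>

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# A hypothetical online event, announced in the organizer's local time.
starts = datetime(2020, 7, 1, 19, 0, tzinfo=ZoneInfo("Europe/Berlin"))

# Store and compare in UTC...
starts_utc = starts.astimezone(ZoneInfo("UTC"))

# ...but render in whatever zone the viewer is in.
in_tokyo = starts.astimezone(ZoneInfo("Asia/Tokyo"))

print(starts_utc.isoformat())  # 2020-07-01T17:00:00+00:00 (CEST is UTC+2)
print(in_tokyo.isoformat())    # 2020-07-02T02:00:00+09:00
```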
&lt;p>Also, I want more communities to integrate osmcal into their websites and wiki pages, so their lives get easier by not having to create and maintain their events on multiple platforms.&lt;/p>
&lt;p>I was asked to consider applying for an &lt;a href="https://wiki.openstreetmap.org/wiki/Microgrants/Microgrants_2020">OpenStreetMap Microgrant&lt;/a>, so I submitted the aforementioned ideas with a request for funding. If you’d like to see those features and want to support further development, consider expressing your support on my &lt;a href="https://wiki.openstreetmap.org/wiki/Microgrants/Microgrants_2020/Proposal/OpenStreetMap_Calendar">grant application&lt;/a>.&lt;/p>
&lt;h2 id="making-more-connections">Making more connections&lt;/h2>
&lt;p>One new functionality that will be coming soon is community tagging, so organizers can create non-OSM-related events and have a dedicated feed with all their events. But more on that when it’s done. Stay tuned.&lt;/p></description></item><item><title>Faster Map Making With osmium</title><link>https://thomas.skowron.eu/blog/fast-mapmaking-with-osmium/</link><pubDate>Sun, 22 Mar 2020 17:29:08 +0000</pubDate><guid>https://thomas.skowron.eu/blog/fast-mapmaking-with-osmium/</guid><description>&lt;p>It&amp;rsquo;s Sunday and I am fiddling around to make some maps, maybe to put them up on my own wall, and that&amp;rsquo;s the moment where I usually struggle to get data out of OpenStreetMap into QGIS: Overpass is slow and won&amp;rsquo;t give you much data, Shapefiles are OK but only work for limited regions, and filtering and exporting OSM PBFs into GeoJSON is annoying.&lt;/p>
&lt;p>Of course it would be better to have the data in a local PostgreSQL, but in order to get it there, you need to think about which data you want to load. Since you&amp;rsquo;re still iterating and fumbling with the data, this is not yet clear, so you&amp;rsquo;d end up in a cycle of loading, purging and reloading. Annoying.&lt;/p>
&lt;p>Then I remembered that you can use osmium to get &lt;em>all&lt;/em> the data into PostgreSQL without having to filter. This is not fast enough for a tile server, but easily fast enough for some experimentation.&lt;/p>
&lt;p>Just get the osmium tool and fire it up:&lt;/p>
&lt;pre>&lt;code>osmium export -f pg -o taiwan.pg taiwan-latest.osm.pbf
&lt;/code>&lt;/pre>
&lt;p>Create a Postgres database and a table for the data:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">CREATE&lt;/span> &lt;span style="font-weight:bold">DATABASE&lt;/span> mapmaking;
&lt;span style="color:#a61717;background-color:#e3d2d2">\&lt;/span>&lt;span style="font-weight:bold">c&lt;/span> mapmaking
&lt;span style="font-weight:bold">CREATE&lt;/span> EXTENSION postgis;
&lt;span style="font-weight:bold">CREATE&lt;/span> &lt;span style="font-weight:bold">TABLE&lt;/span> taiwan (
geom GEOMETRY,
tags JSONB
);&lt;/code>&lt;/pre>&lt;/div>
&lt;p>And now suck the data into the table:&lt;/p>
&lt;pre>&lt;code class="language-psql" data-lang="psql">\copy taiwan FROM &amp;#39;taiwan.pg&amp;#39;&lt;/code>&lt;/pre>
&lt;p>You now have all the geometries in one column, while the other holds all the tags. Fire up QGIS and use that table with some rudimentary querying action:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">SELECT&lt;/span>
geom, tags&lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;ele&amp;#39;&lt;/span> &lt;span style="font-weight:bold">AS&lt;/span> ele,
tags&lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;name&amp;#39;&lt;/span>,
tags&lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;name:en&amp;#39;&lt;/span> &lt;span style="font-weight:bold">AS&lt;/span> name_en
&lt;span style="font-weight:bold">FROM&lt;/span>
taiwan
&lt;span style="font-weight:bold">WHERE&lt;/span> tags&lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;natural&amp;#39;&lt;/span>&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#b84">&amp;#39;peak&amp;#39;&lt;/span>;&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Enjoy!&lt;/p>
&lt;p>By the way, my work-in-progress greyscale map looks like this:&lt;/p>
&lt;div class="img-screen-max-height">
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/fast-mapmaking-with-osmium/taiwan.png" alt="Taiwan">
&lt;/p>
&lt;/div></description></item><item><title>Towards Remotified Conferences</title><link>https://thomas.skowron.eu/blog/towards-remotified-conferences/</link><pubDate>Sun, 15 Mar 2020 15:16:29 +0000</pubDate><guid>https://thomas.skowron.eu/blog/towards-remotified-conferences/</guid><description>&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/towards-remotified-conferences/fossgis2020-0904.jpg" alt="FOSSGIS Conference Participants 2020">
&lt;/p>
&lt;p>Our annual &lt;a href="https://fossgis-konferenz.de/2020/">FOSSGIS conference&lt;/a> just ended prematurely on Friday. Due to Corona we had to improvise and tweak a few things. Notably a lot of people couldn&amp;rsquo;t make it, either because they wanted to stay on the safe side for personal reasons or because their employers cancelled all business trips. This resulted in a lot of cancellations just before the conference, including a few speakers.&lt;/p>
&lt;p>For those who still wanted to join, we wanted to offer as much content as possible. We already live stream almost all the talks, except for those by speakers who don&amp;rsquo;t wish to be televised, so remote consumption of the conference programme is nothing new for us. But because some speakers were not able to travel, we had to adapt and enabled remote speaking for the first time.&lt;/p>
&lt;p>We are of the opinion that conferences are not just about talks, but also about the connections we make; still, considering the situation, we could enable more connections by allowing remote speaking than by cancelling the talks altogether. Unfortunately, we didn&amp;rsquo;t have much time to test thoroughly before the conference: the final decision to hold it at all was made less than 48 hours before the opening.&lt;/p>
&lt;h2 id="tech">Tech&lt;/h2>
&lt;p>Our conferences are usually held at universities, this year at the &lt;a href="https://www.uni-freiburg.de/?set_language=en">University of Freiburg&lt;/a>, which gave us the possibility to use the video conferencing system of the &lt;a href="https://www.dfn.de/en/">German National Research and Education Network, DFN&lt;/a>, a web-based application that luckily doesn&amp;rsquo;t need any additional plugins.&lt;/p>
&lt;p>In some cases we only had a few hours between learning that a speaker wouldn&amp;rsquo;t come to the conference and their scheduled talk. This meant we could only inform them about the technical necessities and ask them to join one of the test runs. Unfortunately not all of the participants were able to join our tests, so in some cases we had a rocky start to the talk: the viewers had to wait for several minutes until the setup was settled. This is not perfect, but the participants' expectations were lowered due to the circumstances.&lt;/p>
&lt;p>Also, because there was not enough time, we couldn&amp;rsquo;t require a specific hardware setup. Thus some speakers used their notebook&amp;rsquo;s internal microphone, which, even when not drowned out by fan noise, is in most cases too quiet; others used Bluetooth headsets which had transmission troubles, resulting in very distorted and unpleasant audio.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/towards-remotified-conferences/fossgis2020-0841.jpg" alt="Talk with empty stage">
&lt;/p>
&lt;p>The video part worked much better: Most webcams are good enough for the picture-in-picture mode we used: Big slides on full screen and small camera window in the corner. Small video interruptions or low resolution are also not as annoying as audio hiccups.&lt;/p>
&lt;h2 id="qa">Q&amp;amp;A&lt;/h2>
&lt;p>As I mentioned before, I believe that the exchange between participants and speakers is very important. Of course you can&amp;rsquo;t just talk to the speaker in the hallway after the talk, but at least for publicly suitable questions we gave everyone the opportunity to grab the microphone and ask the speaker directly in the Q&amp;amp;A session. Thanks to the &lt;a href="https://en.wikipedia.org/wiki/Mix-minus">N-1 audio&lt;/a> setup the &lt;a href="https://c3voc.de">VOC&lt;/a> provided for us, the whole speaker-participant conversation worked quite well without any echo.&lt;/p>
&lt;h2 id="speaker-experience">Speaker Experience&lt;/h2>
&lt;p>While the participants were mostly positive about the possibility to watch the talks and ask questions, some speakers noted that the presenting experience was not optimal: They did not have a video feed from the hall, but rather just a view into the control room. I suggested turning the camera towards the crowd, but the VOC crew noted that for privacy reasons they don&amp;rsquo;t want to do that, even though we had asked for the participants' permission.&lt;/p>
&lt;p>We muted the moderator&amp;rsquo;s microphone during the talk, so the audience wouldn&amp;rsquo;t get any amplified noises, but this also meant that the speakers didn&amp;rsquo;t get any audible feedback from the hall during the talk.&lt;/p>
&lt;h2 id="developing-the-concept">Developing the Concept&lt;/h2>
&lt;p>We didn&amp;rsquo;t really plan for all of this, but reacted to the situation, so not everything was perfect. If we want or need remote speakers in the future, we need to make sure that a few things are in place:&lt;/p>
&lt;h3 id="must-haves">Must Haves&lt;/h3>
&lt;p>&lt;strong>Headset:&lt;/strong> Speakers need a wired headset or dedicated microphone, no excuses. Yes, it is easier to just talk into the notebook, but this does not work with a hall full of people. If your speaker doesn&amp;rsquo;t have suitable equipment, you&amp;rsquo;ll need to provide some. I&amp;rsquo;ve been told that people had good experiences with the Sennheiser PC 8 (&lt;a href="https://amzn.to/2WfyaV6">Amazon&lt;/a>), which is around 30 Euro/35 USD and features driver-less USB audio.&lt;/p>
&lt;p>&lt;strong>Internet Connectivity:&lt;/strong> Remote speaking is not for everyone from everywhere: Above all, you absolutely need a stable connection. Speaking from a home DSL or cable connection is fine, but only if you don&amp;rsquo;t have packet loss and have at least a few megabits of bandwidth. If speakers don&amp;rsquo;t have this at home or in the office, they definitely need to look for a better location. The location also needs to be tested beforehand, so there are no unpleasant surprises such as restrictive firewalls. Also: No wi-fi. Especially in metropolitan areas, where dozens of wireless networks compete, there will be unpleasant jitter. Speakers need to plug in a cable or, if sufficiently stable, use LTE, which has latency guarantees (but don&amp;rsquo;t route the LTE connection through wi-fi).&lt;/p>
&lt;p>&lt;strong>Light:&lt;/strong> Often more important than the camera itself is the lighting. Optimally, natural light comes from the same direction as the camera or from the side. If that is not available, a properly illuminated ceiling works as well.&lt;/p>
&lt;p>&lt;strong>Test Run:&lt;/strong> Every talk needs to be tested. Optimally, a connection from the venue should be established and tested for multiple minutes. Based on this experience, I strongly suggest testing a few days in advance, so speakers can get help or better equipment, if necessary.&lt;/p>
&lt;h3 id="nice-to-have">Nice to Have&lt;/h3>
&lt;p>&lt;strong>Chat Rooms:&lt;/strong> At the FOSSGIS conference we are already streaming and recording all talks (most are published on the same day), so people who can&amp;rsquo;t attend can watch from home. But it would be nice to facilitate the contact between speakers and viewers: It is much easier to talk to someone spontaneously in the hallway than to write an email.&lt;/p>
&lt;p>&lt;strong>Conferencing Software:&lt;/strong> The DFN system worked quite alright, but it is a closed system in which only research institutes can initiate rooms. In my experience Apple&amp;rsquo;s FaceTime is the most reliable video calling service, but it is only available on Apple devices and thus not a good option for conferences. I would much prefer an open source solution that works on all operating systems and enables speakers and participants to create their own rooms. I have been using &lt;a href="https://jitsi.org">Jitsi&lt;/a> for some 1-to-1 video calls with good results; it could probably also be used for conference setups.&lt;/p>
&lt;p>&lt;strong>Live (Audio/Video) Feedback&lt;/strong>: YouTube and Twitch have live chat for streamers. Especially speakers who don&amp;rsquo;t just read slides, but have a more conversational style of speaking need to get some live feedback, so audience video and audio could help as long as it doesn&amp;rsquo;t get too noisy.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>We are not yet able to fully replace in-person conferences, especially considering that you can&amp;rsquo;t drink a beverage with the conference participants after a long day of talks, but with increasingly available and affordable technology, we can at least save some of the tedious travel.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/towards-remotified-conferences/fossgis2020-0877.jpg" alt="Conference Hallway Track">
&lt;/p>
&lt;p>A big strength of conferences is that people use a defined time frame to educate themselves and to form new networks, so I doubt that fully remote conferences are going to take over soon, but extraordinary circumstances require extraordinary actions.&lt;/p></description></item><item><title>Why I Don't Believe in Boards Consisting of Volunteers</title><link>https://thomas.skowron.eu/blog/why-i-dont-believe-in-volunteer-boards/</link><pubDate>Sun, 08 Mar 2020 15:32:14 +0000</pubDate><guid>https://thomas.skowron.eu/blog/why-i-dont-believe-in-volunteer-boards/</guid><description>&lt;p>Currently, several people I know hold unpaid board positions in non-profits. Those organisations are traditionally run with slim budgets which suffice for the basic needs of the organisation. Often money is collected and spent on conferences, travel or project grants. But the budget is rarely spent on elected board members, which makes it very hard to fairly evaluate their performance.&lt;/p>
&lt;p>A common opinion is that those organisations do not want to encourage &amp;ldquo;professional politicians&amp;rdquo; that will try to get reelected because of the money.&lt;/p>
&lt;p>But effectively no-one may criticise the work of those volunteering board members: Someone did a bad job? Please don&amp;rsquo;t comment, because you don&amp;rsquo;t know how much time this member was able to spend. There weren&amp;rsquo;t any new initiatives? Well, you don&amp;rsquo;t know how busy they were!&lt;/p>
&lt;p>And this makes me a bit uncomfortable: We&amp;rsquo;re part of NGOs or non-profits to support a cause, we hold elections for the best candidates, vote for someone and then some people get special titles. And if we think they did a good job, we&amp;rsquo;re going to re-elect them and reward them with the same responsibility and some more work.&lt;/p>
&lt;p>This way of working excludes basically anyone with a full-time job. If you&amp;rsquo;re already working forty-hour weeks and maybe have a commute of an hour per day? Well, it&amp;rsquo;s going to be hard to keep up. It can even create more strain within boards: If individual board members are able to work on board-related matters during work time, volunteers will find it harder to keep up. And if you survive, you may suffer burnout at the end of your term. This skews the representation as well: There is a disproportionate number of self-employed Europeans on boards.&lt;/p>
&lt;p>In order to open up, I believe that organisations need to create plans to compensate a minimum amount of working hours: It won&amp;rsquo;t be feasible to pay people full-time salaries, but they should get the opportunity to have at least one weekday paid at the median national salary.&lt;/p>
&lt;p>Volunteers can still do projects, but a board position which has a constant stream of tasks? This is not a project, that&amp;rsquo;s a job!&lt;/p>
&lt;p>Also, we should not drive productive volunteers away from their projects by converting them into policy makers; instead, we need to create a creative, resourceful environment for those who want to make plans reality. Otherwise our organisations will stagnate.&lt;/p>
&lt;p>I think that organisations should strive to have enough budget to compensate their administrators. I also believe that those organisations will be able to achieve more, thus reaching more supporters, which in turn will make it easier to pay the boards. The emergence of professional politicians can be limited by other means, e.g. the introduction of term limits.&lt;/p></description></item><item><title>Could We Create Binary Diffs for OpenStreetMap Planet Files?</title><link>https://thomas.skowron.eu/blog/delta-osm/</link><pubDate>Thu, 06 Feb 2020 11:29:08 +0000</pubDate><guid>https://thomas.skowron.eu/blog/delta-osm/</guid><description>&lt;p>Due to the recent spike in OpenStreetMap planet dump traffic, some people have wondered why so many users are redownloading the whole planet every week. The unfortunate truth is that applying replication diffs takes a long time&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> and is not 100% reliable.&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> I was wondering whether creating, distributing and applying binary deltas could help here.&lt;/p>
&lt;p>The OSM PBF format consists of a sequence of blocks with data inside. Usually each block contains one kind of object: nodes, ways or relations. Blocks can be up to 2&lt;sup>32&lt;/sup> bytes in size&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>, but are usually capped at a much, much smaller size. The planet file currently consists of more than 14000 blocks, so every block is on average around 4 MB in size.&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>&lt;/p>
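&lt;p>As a quick sanity check on these numbers (assuming a roughly 50 GiB planet file, the figure used in the compression section of this post):&lt;/p>

```python
# Back-of-the-envelope: average block size of the planet file.
# Assumes a ~50 GiB planet split into ~14000 blocks; both values
# are approximations, not exact measurements.
planet_mib = 50 * 1024      # 50 GiB expressed in MiB
blocks = 14_000
avg_block_mib = planet_mib / blocks
print(f"average block size: {avg_block_mib:.1f} MiB")  # ~3.7 MiB, i.e. "around 4 MB"
```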
&lt;p>If we assume that most changes are in fact additions to OSM, most changes would happen at the end of every section of the respective kinds of objects. This would mean that most blocks at the beginning would stay the same, as they rarely change and thus their delta would be zero.&lt;/p>
&lt;p>The reality is quite different, unfortunately: Deleted objects move the position of all following objects inside the file, so changes cascade to every subsequent block:&lt;/p>
&lt;p>&lt;img src="block-shift@2x.png" alt="Block Shift">&lt;/p>
&lt;p>It becomes very clear that with any change the block will look quite different. Now, if we compare the planet from 2020-01-20 with 2020-01-27, there are zero blocks that stayed exactly the same. Of course this highly depends on the file packaging: e.g. when tested with an extract of Bremen&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>, comparing 2020-02-01 and 2020-02-02, around 30 % of the blocks stayed the same.&lt;/p>
&lt;p>The impact of shifting/cascading gets amplified by the marshalling used: In addition to the heavy use of delta-encoded varints inside the file format, every block is compressed using zlib, so small changes can have an even more significant impact on the final block.&lt;/p>
&lt;p>&lt;img src="delta-shift@2x.png" alt="More realistic example">&lt;/p>
&lt;p>A final option is to compare the changes of every block in an uncompressed state: Let&amp;rsquo;s walk through two files and generate a blockwise binary diff.&lt;sup id="fnref:6">&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref">6&lt;/a>&lt;/sup>&lt;/p>
&lt;p>But even after decompression, the differences are too significant: The decompressed blocks are around 8.5 MB&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>, while the (binary, compressed) deltas between weekly versions average 2 MB. A blockwise weekly diff would thus be around half the size of the full planet, plus any new data at the end. Also, generating and applying binary diffs is far from fast.&lt;/p>
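&lt;p>The &amp;ldquo;half the size of the full planet&amp;rdquo; estimate follows directly from these numbers (a sketch; the ~50 GB planet size is my assumption):&lt;/p>

```python
# Rough sizing of a blockwise weekly diff: ~14000 blocks with an
# average compressed delta of ~2 MB each, against an assumed ~50 GB planet.
blocks = 14_000
avg_delta_mb = 2
diff_gb = blocks * avg_delta_mb / 1000   # 28 GB of deltas
planet_gb = 50                           # assumed planet size
print(f"blockwise diff: ~{diff_gb:.0f} GB, "
      f"~{100 * diff_gb / planet_gb:.0f}% of a full planet")
```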
&lt;p>The diffability could be improved by keeping blocks as similar as possible, e.g. by removing deleted objects without allowing subsequent ones to move forward. But this would come at a cost of higher complexity during OSM PBF assembly. Furthermore deleted objects would not reduce file size.&lt;/p>
&lt;p>&lt;strong>Conclusion:&lt;/strong> With the current file format, there is no simple alternative approach and sticking to replication diffs is the only viable option at the moment.&lt;/p>
&lt;h3 id="_bonus-question_-could-we-reduce-file-sizes-by-using-a-different-compression-algorithm">&lt;em>Bonus Question:&lt;/em> &amp;ldquo;Could we reduce file sizes by using a different compression algorithm?&amp;rdquo;&lt;/h3>
&lt;p>OSM PBF nowadays uses zlib compression which has a good compression ratio for data blocks.&lt;sup id="fnref:7">&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref">7&lt;/a>&lt;/sup> But how about the newer Zstandard compression, which promises smaller files than zlib at a much higher performance?&lt;/p>
&lt;p>&lt;strong>Bremen&lt;/strong>&lt;/p>
&lt;table style="text-align:right">
&lt;tr>
&lt;th>&lt;/th>
&lt;th style="text-align:right">Compression Ratio&lt;/th>
&lt;th style="text-align:right">Decoding Duration (cumulated)&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>zlib&lt;/td>
&lt;td>2.28x&lt;/td>
&lt;td>464 ms&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>zstd&lt;/td>
&lt;td>2.40x&lt;/td>
&lt;td>169 ms&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>&lt;strong>Planet&lt;/strong>&lt;sup id="fnref:8">&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref">8&lt;/a>&lt;/sup>&lt;/p>
&lt;table style="text-align:right">
&lt;tr>
&lt;th>&lt;/th>
&lt;th style="text-align:right">Compression Ratio&lt;/th>
&lt;th style="text-align:right">Decoding Duration (cumulated)&lt;/th>
&lt;/tr>
&lt;tr>
&lt;td>zlib&lt;/td>
&lt;td>2.36x&lt;/td>
&lt;td>1306 s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>zstd&lt;/td>
&lt;td>2.46x&lt;/td>
&lt;td>383 s&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>This means that a 50 GiB planet file could be reduced by around 2 GiB while significantly improving read performance by using Zstandard. Such a change would not be a simple task: Zstandard compatibility would have to be implemented in a significant number of applications and libraries.&lt;/p>
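&lt;p>The 2 GiB figure follows from the measured ratios (a sketch; the uncompressed size here is derived from the ratio, not measured):&lt;/p>

```python
# Deriving the savings from the benchmarked compression ratios:
# the same uncompressed block data, stored at 2.36x (zlib) vs. 2.46x (zstd).
planet_gib = 50
zlib_ratio, zstd_ratio = 2.36, 2.46
uncompressed_gib = planet_gib * zlib_ratio   # ~118 GiB of raw block data
zstd_gib = uncompressed_gib / zstd_ratio
print(f"zstd planet: {zstd_gib:.1f} GiB, saving {planet_gib - zstd_gib:.1f} GiB")
```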
&lt;p>Whether such a change is worth the speed-up is questionable, since most applications need to assemble geometries from OSM data, which is the biggest performance bottleneck in most use cases.&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>Around an hour on a quad core system with good internet connectivity and spinning hard disks. &lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>Diffs break sometimes, if for example very large changesets are uploaded and the processes that generate or apply the replication diffs exhaust their resources or run into timeouts. &lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>See fileformat.proto, &lt;code>message BlobHeader&lt;/code>. Please note that this describes the zlib-compressed size. The average compression ratio is around 2x. &lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>On an OSM.org planet file, generated using planet-dump-ng &lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>Extract from Geofabrik, generated using osmium &lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:6" role="doc-endnote">
&lt;p>By the way, don&amp;rsquo;t try to run &lt;code>bsdiff&lt;/code> on a planet file, the expected memory usage is around 900 GB. &lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:7" role="doc-endnote">
&lt;p>Fluctuates across the file (relations have a higher ratio than nodes), but averages around 2x. &lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:8" role="doc-endnote">
&lt;p>Tested on a relatively busy quad core Intel Core i7-2600 with spinning hard disks. The compression level for Zstandard was set to 20 (zstd supports up to 22); the zlib compression level of OSM planets is set to 9 (maximum compression). Code from &lt;a href="https://github.com/thomersch/gosmparse">gosmparse&lt;/a> has been used to perform the benchmark. No parallel decoding has been performed. &lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>Synopsis of Tools for Importing OpenStreetMap Data into PostgreSQL</title><link>https://thomas.skowron.eu/blog/osm-to-postgres/</link><pubDate>Mon, 11 Nov 2019 19:37:14 +0000</pubDate><guid>https://thomas.skowron.eu/blog/osm-to-postgres/</guid><description>&lt;p>&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/osm-to-postgres/sisyphers.jpg" alt="Sisyphers">
&lt;/p>
&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;p>The de-facto standard for rendering OpenStreetMap data is PostgreSQL/PostGIS-based: The planet file or an extract is imported into the database. The same approach is useful for data exploration, connecting OSM data to your own applications or doing data analysis. To get OSM data into Postgres, a suitable tool is necessary: Usually either &lt;a href="https://github.com/openstreetmap/osm2pgsql">osm2pgsql&lt;/a> or &lt;a href="https://github.com/omniscale/imposm3">imposm&lt;/a> is used for this purpose.&lt;/p>
&lt;p>While osm2pgsql is more mature and is used e.g. on the core OSM infrastructure, imposm is still not considered 1.0 despite being available for several years and proven in many installations. Then there is the swiss army knife of OpenStreetMap, &lt;a href="https://osmcode.org/osmium-tool/">osmium&lt;/a>, which a while ago received the functionality to emit data in a PostgreSQL-compatible format.&lt;/p>
&lt;p>In this article we are going to compare the tools, show their strengths and weaknesses, and have a look at resource consumption and the feasibility for different use cases.&lt;/p>
&lt;h2 id="feature-set-and-philosophy">Feature Set and Philosophy&lt;/h2>
&lt;p>The available tools have different approaches: osm2pgsql is a general purpose importer, with a focus on map rendering and geocoding. imposm is mostly used for map rendering, with the focus on generating tables that are pre-optimized for maps. osmium treats Postgres as an export format, without even connecting directly to the database, but just giving the option to pipe the data in Postgres' &lt;code>COPY&lt;/code> format.&lt;/p>
&lt;p>imposm allows customization through a mapping file, in which the user can describe which OSM elements will be considered, what tables will be created and how data shall be normalized (e.g. data types, sanitization).&lt;/p>
&lt;p>osm2pgsql has two knobs that can be tweaked: Firstly, a &lt;a href="https://github.com/openstreetmap/osm2pgsql/blob/master/default.style">style file&lt;/a> that contains the columns that will be created and secondly, a &lt;a href="https://github.com/openstreetmap/osm2pgsql/blob/master/docs/lua.md">Lua scripting API&lt;/a> that allows for more complex transformations.&lt;/p>
&lt;p>osmium usually just writes one table, with all tags in a &lt;code>jsonb&lt;/code> column. The considered tags can be customized through white and black lists (&lt;code>exclude_tags&lt;/code> and &lt;code>include_tags&lt;/code>).&lt;/p>
&lt;p>Many workflows use OpenStreetMap&amp;rsquo;s replication mechanism, which allows users to stay up to date with the latest upstream data without having to download the full planet. OpenStreetMap provides minutely, hourly or daily diffs which can be ingested by compatible software. osm2pgsql and imposm support this natively: They can update an existing PostgreSQL database if run in diff mode (imposm) or slim mode (osm2pgsql). osmium can update on-disk OSM PBF files, but doesn&amp;rsquo;t provide a mechanism to replicate changes into a database.&lt;/p>
&lt;h2 id="making-the-right-choice">Making the Right Choice&lt;/h2>
&lt;p>The choice of the importer software may depend on your external constraints: If you want to use a map style that includes osm2pgsql-compatible queries, it makes sense to stick with it, as the resulting database schemas are not equal and would need either additional views or transformations.&lt;/p>
&lt;p>On the other hand, if you prefer the &amp;ldquo;one object type per table&amp;rdquo; nature of imposm, queries from e.g. osm2pgsql can be ported with some manual work. Also, imposm makes it extremely easy to normalize common OSM values, e.g. &amp;ldquo;yes&amp;rdquo; and &amp;ldquo;no&amp;rdquo; strings can be converted into native PostgreSQL booleans and implausible ones can be discarded at import time, thus reducing the need to sanitize values while querying.&lt;/p>
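&lt;p>As an illustration, a minimal imposm mapping sketch (table and column names are made up for this example; consult the imposm documentation for the exact schema):&lt;/p>

```yaml
tables:
  # hypothetical table collecting building polygons
  buildings:
    type: polygon
    mapping:
      building: [__any__]
    columns:
      - name: osm_id
        type: id
      - name: geometry
        type: geometry
      # a bool column normalizes values like "yes"/"no" into
      # native PostgreSQL booleans at import time
      - name: building
        key: building
        type: bool
```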
&lt;p>For analytical tasks, osmium may be more useful as it won&amp;rsquo;t perform any tag transformations by default.&lt;/p>
&lt;h2 id="software-strategy">Software Strategy&lt;/h2>
&lt;p>Release, development, support and maintenance strategies are very different for those three tools. Especially for long-term system operation this might be a very important aspect to be considered.&lt;/p>
&lt;p>osm2pgsql dates back to 2006 and has the largest number of contributors. Maintenance is performed by a community of people. Pull requests are processed in a timely manner, providing useful feedback. There are several releases per year. Packages are available for a wide range of operating systems/distributions: Debian, Ubuntu, FreeBSD, macOS (via homebrew) and others.&lt;/p>
&lt;p>imposm started in 2012 and is under single-person corporate development and maintainership. Issues do not always receive feedback, pull requests are treated defensively and non-trivial changes are frowned upon. Releases are irregular, thus packaging is non-existent: Neither Debian, Ubuntu, FreeBSD nor macOS provide packages.&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> There is little development of new features.&lt;/p>
&lt;p>osmium is a command-line tool based on the C++ library libosmium, which started in 2013, mostly developed by a single developer. Issues and feature requests are discussed and lead to development, if deemed useful. Pull requests receive feedback and are merged based on merit. Releases are frequent and release notes along with a consistent version numbering help communicate changes. Packages do exist in many distributions: Debian, Ubuntu, Fedora and macOS (homebrew) ship both library and the command-line interface, while FreeBSD only provides the library.&lt;/p>
&lt;h2 id="resources-for-import">Resources for Import&lt;/h2>
&lt;h3 id="disk-space">Disk Space&lt;/h3>
&lt;p>Due to the large amounts of data it is generally advisable to work only with flash-based disks: SATA SSDs, or better: NVMe storage. Those have superior sequential write speeds and can retrieve random data much faster than spinning disks.&lt;/p>
&lt;p>imposm will convert the data into a more efficient on-disk format before flushing it into the database: By default it will create a cache directory which will take around twice the size of the original file. With the standard mapping the cluster size will be about 6x larger than the input file. Your mileage may vary, though, as WAL configuration, TOAST settings and the file system influence cluster size.&lt;/p>
&lt;p>osmium holds an in-memory cache by default (more details in the next section). A full import needs around 9x of the original file size. No indexes will be created by default.&lt;/p>
&lt;p>osm2pgsql needs around 7x the disk space of the original file using the standard mapping. The intermediate tables created by the &lt;code>--slim&lt;/code> option can be removed automatically after the initial import by additionally passing &lt;code>--drop&lt;/code>.&lt;/p>
&lt;h3 id="ram">RAM&lt;/h3>
&lt;p>Due to the space-optimized design of the OSM PBF file format, assembling lines or polygons requires traversing at least one level of indirection. More specifically, all nodes are usually located at the very beginning of the file, followed by all ways. Thus, to assemble a line, the nodes at the start of the file need to be referenced. When selectively retrieving an object, jumping back to referenced elements is viable, but for unfiltered imports a full in-memory cache is a much faster option. Due to the extremely large number of nodes, it is advisable to have lots of free RAM, optimally equal to the size of the input file.&lt;/p>
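&lt;p>The indirection can be pictured as a lookup table from node ids to coordinates (a toy sketch, not any importer&amp;rsquo;s actual data structure):&lt;/p>

```python
# Toy version of the node-location cache used during way assembly:
# node coordinates are collected first, then each way's node references
# are resolved by lookup. Ids and coordinates are illustrative.
node_cache = {
    101: (6.958, 50.941),   # node id -> (lon, lat)
    102: (6.961, 50.943),
    103: (6.965, 50.944),
}
way_refs = [101, 102, 103]  # a way is stored as a list of node ids

# resolving the references yields the way's geometry
linestring = [node_cache[ref] for ref in way_refs]
print(linestring)
```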
&lt;p>osm2pgsql lets users specify the cache size and imposm utilizes an on-disk cache. osmium ships with &lt;a href="https://docs.osmcode.org/osmium/latest/osmium-index-types.html">several strategies&lt;/a> and defaults to &lt;code>flex_mem&lt;/code>, which will try to select the most efficient type. On memory-constrained systems, it can be run with &lt;code>dense_file_array&lt;/code>, which uses a disk-based cache.&lt;/p>
&lt;p>In the benchmark imposm utilized up to 3 GB of RAM (roughly equal to the extract size), osmium (with flex_mem) peaked at 5 GB and osm2pgsql has been configured to use up to 5 GB.&lt;/p>
&lt;p>In addition to the importer&amp;rsquo;s memory usage, PostgreSQL needs buffer space as well. In the benchmark one quarter of the total system memory was reserved for its purposes. Due to the flexible memory management of PostgreSQL, memory can be over-provisioned for the time of import, as it will be reclaimed after the importers finish.&lt;/p>
&lt;h3 id="cpu">CPU&lt;/h3>
&lt;p>Nowadays, all three importers use multithreaded architectures which can saturate several cores. Also, with the introduction of parallel workers in PostgreSQL, the database can work concurrently on the import operations.&lt;/p>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/osm-to-postgres/htop.png" alt="CPU saturation">
&lt;/p>
&lt;p>The import duration will not scale linearly with core count, though. While it is advisable to have four cores instead of two and six will be slightly faster than four, there will be diminishing returns: With imposm import duration improved by 25 % when switching from 4 to 16 CPU cores. osmium meanwhile only improved by less than 10 %.&lt;/p>
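&lt;p>Amdahl&amp;rsquo;s law offers one way to read these numbers (my own estimate under an assumed model, not part of the benchmark):&lt;/p>

```python
# If 16 cores run the imposm import 25 % faster than 4 cores, i.e.
# T(4)/T(16) = 1.25 (one possible reading of "improved by 25 %"), and
# Amdahl's law gives T(n) = (1 - p) + p/n for parallel fraction p,
# we can solve for p numerically over a grid.
def runtime(p, n):
    return (1 - p) + p / n

best_p = min(
    (abs(runtime(p / 1000, 4) / runtime(p / 1000, 16) - 1.25), p / 1000)
    for p in range(1001)
)[1]
print(f"parallel fraction: ~{best_p:.2f}")   # roughly 0.59
```

Under this reading only about 60 % of the work parallelizes, so even many more cores would cap out at roughly a 2.5x speedup over a single core, which is consistent with the diminishing returns observed.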
&lt;h2 id="import-speed">Import Speed&lt;/h2>
&lt;p>Because everyone loves meaningless synthetic benchmarks, there you go:&lt;/p>
&lt;ul>
&lt;li>osmium imported the test file in 15 minutes.&lt;/li>
&lt;li>imposm took 44 minutes to read, import and generate indexes.&lt;/li>
&lt;li>osm2pgsql completed the import, including indexes in 62 minutes.&lt;/li>
&lt;/ul>
&lt;h2 id="querying">Querying&lt;/h2>
&lt;p>Query performance is mostly impacted by database size and indexes: A small database may be usable without indexes for simpler queries, while a large database will slow down if scanned sequentially.&lt;/p>
&lt;p>PostgreSQL brings a lot of (nearly) zero-cost abstractions (e.g. views) that help make data structures manageable, but good planning can improve database size and thus performance.&lt;/p>
&lt;h3 id="example-table-structure">Example: Table Structure&lt;/h3>
&lt;p>Let&amp;rsquo;s assume you have an imported database and now want to perform an analysis on shops. With an osmium-style single table import, you&amp;rsquo;d have a table with two columns: geometry and tags. When you perform a query, the database would need to scan all tags for the key &lt;code>shop&lt;/code>.&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">SELECT&lt;/span> tags
&lt;span style="font-weight:bold">FROM&lt;/span> osmdata
&lt;span style="font-weight:bold">WHERE&lt;/span>
ST_DWithin(
ST_SetSRID(ST_MakePoint(&lt;span style="color:#099">6&lt;/span>.&lt;span style="color:#099">968&lt;/span>, &lt;span style="color:#099">50&lt;/span>.&lt;span style="color:#099">943&lt;/span>), &lt;span style="color:#099">4326&lt;/span>),
geom,
&lt;span style="color:#099">0&lt;/span>.&lt;span style="color:#099">1&lt;/span>
)
&lt;span style="font-weight:bold">AND&lt;/span> tags&lt;span style="font-weight:bold">-&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;shop&amp;#39;&lt;/span> &lt;span style="font-weight:bold">IS&lt;/span> &lt;span style="font-weight:bold">NOT&lt;/span> &lt;span style="font-weight:bold">NULL&lt;/span>;
&lt;span style="color:#998;font-style:italic">-- Query planner:
&lt;/span>&lt;span style="color:#998;font-style:italic">&lt;/span>&lt;span style="font-weight:bold">-&amp;gt;&lt;/span> Parallel Seq Scan &lt;span style="font-weight:bold">on&lt;/span> osmdata
(cost&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">0&lt;/span>.&lt;span style="color:#099">00&lt;/span>..&lt;span style="color:#099">1509477&lt;/span>.&lt;span style="color:#099">60&lt;/span> &lt;span style="font-weight:bold">rows&lt;/span>&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">3148508&lt;/span> width&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">87&lt;/span>)
(actual time&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">3958&lt;/span>.&lt;span style="color:#099">424&lt;/span>..&lt;span style="color:#099">7044&lt;/span>.&lt;span style="color:#099">846&lt;/span> &lt;span style="font-weight:bold">rows&lt;/span>&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">52&lt;/span> loops&lt;span style="font-weight:bold">=&lt;/span>&lt;span style="color:#099">3&lt;/span>)
[...]
&lt;span style="font-weight:bold">Rows&lt;/span> Removed &lt;span style="font-weight:bold">by&lt;/span> Filter: &lt;span style="color:#099">7578245&lt;/span>&lt;/code>&lt;/pre>&lt;/div>
&lt;p>A speedup can be achieved by an index on the geometry column, but in larger or densely mapped areas PostgreSQL would still need to scan many tuples.&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">CREATE&lt;/span> &lt;span style="font-weight:bold">INDEX&lt;/span> &lt;span style="font-weight:bold">ON&lt;/span> osmdata &lt;span style="font-weight:bold">USING&lt;/span> gist (geom);
&lt;span style="color:#998;font-style:italic">-- took 438 s
&lt;/span>&lt;span style="color:#998;font-style:italic">-- table data: 6047 MB
&lt;/span>&lt;span style="color:#998;font-style:italic">-- index size: 1546 MB
&lt;/span>&lt;span style="color:#998;font-style:italic">&lt;/span>
&lt;span style="font-weight:bold">SELECT&lt;/span> tags
&lt;span style="font-weight:bold">FROM&lt;/span> osmdata
&lt;span style="font-weight:bold">WHERE&lt;/span>
ST_DWithin(
ST_SetSRID(ST_MakePoint(&lt;span style="color:#099">6&lt;/span>.&lt;span style="color:#099">968&lt;/span>, &lt;span style="color:#099">50&lt;/span>.&lt;span style="color:#099">943&lt;/span>), &lt;span style="color:#099">4326&lt;/span>),
geom,
&lt;span style="color:#099">0&lt;/span>.&lt;span style="color:#099">1&lt;/span>
)
&lt;span style="font-weight:bold">AND&lt;/span> tags&lt;span style="font-weight:bold">-&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;shop&amp;#39;&lt;/span> &lt;span style="font-weight:bold">IS&lt;/span> &lt;span style="font-weight:bold">NOT&lt;/span> &lt;span style="font-weight:bold">NULL&lt;/span>;
&lt;span style="color:#998;font-style:italic">-- Now the spatial index will be used first and the
&lt;/span>&lt;span style="color:#998;font-style:italic">-- result set will be filtered down afterwards.
&lt;/span>&lt;span style="color:#998;font-style:italic">-- Query planner:
&lt;/span>&lt;span style="color:#998;font-style:italic">&lt;/span>&lt;span style="font-weight:bold">Index&lt;/span> Cond: (geom &lt;span style="font-weight:bold">&amp;amp;&amp;amp;&lt;/span> &lt;span style="color:#b84">&amp;#39;[...]&amp;#39;&lt;/span>::geometry)
Filter: (((tags &lt;span style="font-weight:bold">-&amp;gt;&lt;/span> &lt;span style="color:#b84">&amp;#39;shop&amp;#39;&lt;/span>::&lt;span style="color:#999">text&lt;/span>) &lt;span style="font-weight:bold">IS&lt;/span> &lt;span style="font-weight:bold">NOT&lt;/span> &lt;span style="font-weight:bold">NULL&lt;/span>) &lt;span style="font-weight:bold">AND&lt;/span>
(&lt;span style="color:#b84">&amp;#39;[...]&amp;#39;&lt;/span>::geometry &lt;span style="font-weight:bold">&amp;amp;&amp;amp;&lt;/span> st_expand(geom, &lt;span style="color:#b84">&amp;#39;{...}&amp;#39;&lt;/span>::double &lt;span style="font-weight:bold">precision&lt;/span>)) &lt;span style="font-weight:bold">AND&lt;/span>
_st_dwithin(&lt;span style="color:#b84">&amp;#39;[...]&amp;#39;&lt;/span>::geometry, geom, &lt;span style="color:#b84">&amp;#39;{...}&amp;#39;&lt;/span>::double &lt;span style="font-weight:bold">precision&lt;/span>))
&lt;span style="font-weight:bold">Rows&lt;/span> Removed &lt;span style="font-weight:bold">by&lt;/span> Filter: &lt;span style="color:#099">699511&lt;/span>&lt;/code>&lt;/pre>&lt;/div>
&lt;p>To achieve a speedup, all elements with the &lt;code>shop&lt;/code> key could be indexed&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>, but an index takes time to build, penalizes subsequent updates and inserts, and consumes disk space.&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">CREATE&lt;/span> &lt;span style="font-weight:bold">INDEX&lt;/span> &lt;span style="font-weight:bold">ON&lt;/span> osmdata (( tags &lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span> &lt;span style="color:#b84">&amp;#39;shop&amp;#39;&lt;/span>));
&lt;span style="color:#998;font-style:italic">-- took 39 s
&lt;/span>&lt;span style="color:#998;font-style:italic">-- index size: 487 MB&lt;/span>&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Alternatively, all shops could be written into a separate table during import, which could remove the need for an index. But especially for exploratory development such requirements often cannot be planned upfront. If reimporting is not viable due to resource restrictions, a materialized view&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup> can be a good alternative, as it is persisted to disk.&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-sql" data-lang="sql">&lt;span style="font-weight:bold">CREATE&lt;/span> MATERIALIZED &lt;span style="font-weight:bold">VIEW&lt;/span> osmdata_shops &lt;span style="font-weight:bold">AS&lt;/span>
&lt;span style="font-weight:bold">SELECT&lt;/span> &lt;span style="font-weight:bold">*&lt;/span> &lt;span style="font-weight:bold">FROM&lt;/span> osmdata
&lt;span style="font-weight:bold">WHERE&lt;/span> tags&lt;span style="font-weight:bold">-&amp;gt;&amp;gt;&lt;/span>&lt;span style="color:#b84">&amp;#39;shop&amp;#39;&lt;/span> &lt;span style="font-weight:bold">IS&lt;/span> &lt;span style="font-weight:bold">NOT&lt;/span> &lt;span style="font-weight:bold">NULL&lt;/span>;
&lt;span style="color:#998;font-style:italic">-- took 18 s
&lt;/span>&lt;span style="color:#998;font-style:italic">-- materialized view size: 33 MB
&lt;/span>&lt;span style="color:#998;font-style:italic">&lt;/span>
&lt;span style="font-weight:bold">SELECT&lt;/span> tags
&lt;span style="font-weight:bold">FROM&lt;/span> osmdata_shops
&lt;span style="font-weight:bold">WHERE&lt;/span>
ST_DWithin(
ST_SetSRID(ST_MakePoint(&lt;span style="color:#099">6&lt;/span>.&lt;span style="color:#099">968&lt;/span>, &lt;span style="color:#099">50&lt;/span>.&lt;span style="color:#099">943&lt;/span>), &lt;span style="color:#099">4326&lt;/span>),
geom,
&lt;span style="color:#099">0&lt;/span>.&lt;span style="color:#099">1&lt;/span>
);
&lt;span style="color:#998;font-style:italic">-- Execution Time: 150 ms&lt;/span>&lt;/code>&lt;/pre>&lt;/div>
&lt;h2 id="update-or-full-re-import">Update or Full Re-import?&lt;/h2>
&lt;p>osm2pgsql and imposm allow for continuously updateable databases, but this has some limitations: A database cannot be updated if it hasn&amp;rsquo;t been declared update-able from the start, and updateable imports need additional on-disk storage. Most users will also experience bloat: The database size will only ever increase. Some techniques exist to reduce the impact of this, but an eventual reimport will be necessary if disk space needs to be reclaimed.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>If you want to process OpenStreetMap&amp;rsquo;s geographical data in PostgreSQL you are presented with some good options:&lt;/p>
&lt;p>osm2pgsql is the solid choice, with wide compatibility and a long project history. It&amp;rsquo;s resource efficient and offers some customization options. It also handles diffs.&lt;/p>
&lt;p>imposm might not be as well supported, but offers quick preprocessing options and can be disk-efficient if needed.&lt;/p>
&lt;p>osmium might be the newest, but since it&amp;rsquo;s a thin layer on top of the legendarily efficient libosmium, it&amp;rsquo;s the quickest here. It has the fewest customization options, but is perfect for everyone who wants to do their work mostly in SQL.&lt;/p>
&lt;h2 id="annex">Annex&lt;/h2>
&lt;ul>
&lt;li>Benchmarks have been performed with a Germany extract (3.0 GB), on a virtualized server with four dedicated cores, 16 GB RAM and NVMe storage, formatted with ext4. The OS is Ubuntu 19.04.
&lt;ul>
&lt;li>PostgreSQL configuration values:
&lt;ul>
&lt;li>&lt;code>shared_buffers = 4GB&lt;/code>&lt;/li>
&lt;li>&lt;code>temp_buffers = 32MB&lt;/code>&lt;/li>
&lt;li>&lt;code>work_mem = 1GB&lt;/code>&lt;/li>
&lt;li>&lt;code>maintenance_work_mem = 64MB&lt;/code>&lt;/li>
&lt;li>&lt;code>autovacuum = off&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Every run has been performed with a freshly recreated cluster and a PostgreSQL restart.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The 16-core benchmarks were performed in the same setup, except for RAM, which was increased to 64 GB.&lt;/li>
&lt;li>The shop query example has been done on the &lt;em>NRW, Germany&lt;/em> extract (635 MB).&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://thomas.skowron.eu/services/">Consulting is available by the author.&lt;/a>&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>Banner picture by Fallaner, licensed under &lt;a href="https://creativecommons.org/licenses/by-sa/4.0/deed.en">CC BY-SA 4.0&lt;/a> &lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>Some of those do have packages for imposm 2, the Python-based predecessor of imposm 3; it is fundamentally incompatible with version 3 and mostly unsuitable for the tasks discussed in this article. &lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>See &lt;a href="https://www.postgresql.org/docs/12/datatype-json.html#JSON-INDEXING">https://www.postgresql.org/docs/12/datatype-json.html#JSON-INDEXING&lt;/a> &lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>See &lt;a href="https://www.postgresql.org/docs/current/sql-creatematerializedview.html">https://www.postgresql.org/docs/current/sql-creatematerializedview.html&lt;/a> &lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>Materialized views and a continuously updated database can be tricky to combine, as materialized views are only refreshed on demand. &lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>Points of Interest as a Service</title><link>https://thomas.skowron.eu/blog/openstreetmap-data-as-a-service/</link><pubDate>Mon, 15 Apr 2019 16:05:34 +0000</pubDate><guid>https://thomas.skowron.eu/blog/openstreetmap-data-as-a-service/</guid><description>&lt;p>When building a location-based service, access to public geodata like &lt;strong>points of interest&lt;/strong> (POI) can be cumbersome and expensive. There are several proprietary databases specializing in POI data, but with OpenStreetMap&amp;rsquo;s ever-growing data collection, its collaborative power and its &lt;strong>mature ecosystem&lt;/strong> of professionals and volunteers, OSM is now a leading data source in many areas.&lt;/p>
&lt;p>OpenStreetMap now tracks over &lt;strong>a million restaurants and bars&lt;/strong>, 310,000 hotels, nearly 150,000 viewpoints, 140,000 hospitals and &lt;strong>over three million shops&lt;/strong> worldwide. They are not only precisely geolocated, but often also carry additional data like opening hours, telephone numbers or wheelchair accessibility.&lt;/p>
&lt;p>The data and a lot of the technology in the OSM ecosystem is open, but for those who want to build upon it and use its power without worrying about setup and hosting, the possibilities are often limited: raw OSM data needs several processing steps, and scaling up to full-planet access can be prohibitively expensive.&lt;/p>
&lt;p>For this purpose, today we are launching OORA, a cloud-native API service that allows you to use OpenStreetMap data without having to worry about processing and hosting, so you can focus on your application.&lt;/p>
&lt;h2 id="technical-details">Technical Details&lt;/h2>
&lt;p>To get all nearby convenience shops, just get&lt;/p>
&lt;pre>&lt;code>https://api.skowron.eu/oora/v1/shop/convenience?
center_lat=50.95137&amp;amp;
center_lon=6.95697
&lt;/code>&lt;/pre>
&lt;p>which will return something like this:&lt;/p>
&lt;div class="highlight">&lt;pre style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">[
{
&lt;span style="color:#000080">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Feature&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;properties&amp;#34;&lt;/span>: {
&lt;span style="color:#000080">&amp;#34;name&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Maxi Markt&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;shop&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;convenience&amp;#34;&lt;/span>
},
&lt;span style="color:#000080">&amp;#34;geometry&amp;#34;&lt;/span>: {
&lt;span style="color:#000080">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Point&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;coordinates&amp;#34;&lt;/span>: [&lt;span style="color:#099">6.95652930&lt;/span>, &lt;span style="color:#099">50.94593490&lt;/span>]
}
},
{
&lt;span style="color:#000080">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Feature&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;properties&amp;#34;&lt;/span>: {
&lt;span style="color:#000080">&amp;#34;addr:city&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Köln&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;addr:country&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;DE&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;addr:housenumber&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;12-14&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;addr:postcode&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;50668&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;addr:street&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Riehler Straße&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;brand&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;REWE&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;contact:phone&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;0221 9726040&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;contact:website&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;https://togo.rewe.de/&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;name&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;REWE to go&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;opening_hours&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;24/7&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;shop&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;convenience&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;wheelchair&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;yes&amp;#34;&lt;/span>
},
&lt;span style="color:#000080">&amp;#34;geometry&amp;#34;&lt;/span>: {
&lt;span style="color:#000080">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#b84">&amp;#34;Point&amp;#34;&lt;/span>,
&lt;span style="color:#000080">&amp;#34;coordinates&amp;#34;&lt;/span>: [&lt;span style="color:#099">6.96184250&lt;/span>, &lt;span style="color:#099">50.95249260&lt;/span>]
}
}
]&lt;/code>&lt;/pre>&lt;/div>
&lt;p>&lt;em>The data is licensed under the ODbL, Copyright by OpenStreetMap contributors. See &lt;code>X-Copyright&lt;/code> HTTP header.&lt;/em>&lt;/p>
&lt;h2 id="getting-started">Getting Started&lt;/h2>
&lt;p>You can sign-up for the technology preview by entering your email address here:&lt;/p>
&lt;form onsubmit="register(this)">&lt;input name="suemail" type="email"> &lt;button>Sign Up&lt;/button>&lt;/form>
&lt;div id="signup-success" style="display:none; background-color: #b0df87; padding: 1em; margin-top: 0.5em;">You have successfully signed up. You should receive an email shortly.&lt;/div>
&lt;p>You will automatically receive an API key, which will allow you to use the service for free. The free evaluation will last until at least May 31, 2019. You will be automatically notified before the production-grade product launch.&lt;/p>
&lt;p>Please provide the received API key inside the &lt;code>Authorization&lt;/code> HTTP header.&lt;/p>
&lt;p>If you want to perform more than 100,000 requests, please contact us. Please also let us know if you want to use the API in production.&lt;/p>
&lt;p>There is no warranty for the free service. If you need professional support, a custom deployment or consulting, please contact us first: &lt;code>oora@skowron.eu&lt;/code>&lt;/p>
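&lt;p>Putting the pieces together, a request can be assembled like this. This is an illustrative sketch using only the Python standard library, not an official client; the helper name is made up and the API key is a placeholder:&lt;/p>

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.skowron.eu/oora/v1"

def nearby_request(kind, value, lat, lon, api_key):
    """Build a request for nearby POIs, e.g. kind="shop", value="convenience"."""
    query = urllib.parse.urlencode({"center_lat": lat, "center_lon": lon})
    url = "{}/{}/{}?{}".format(API_BASE, kind, value, query)
    # The received API key goes into the Authorization header, as described above.
    return urllib.request.Request(url, headers={"Authorization": api_key})

req = nearby_request("shop", "convenience", 50.95137, 6.95697, "YOUR-API-KEY")
# urllib.request.urlopen(req) would perform the call and return GeoJSON features.
```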
&lt;script>
function register(form) {
window.event.preventDefault();
var email = form.elements["suemail"].value;
fetch('https://api.skowron.eu/auth/register', {method: 'POST', body: email}).then(function() {
document.getElementById("signup-success").style.display = "block";
});
}
&lt;/script></description></item><item><title>Give Spaten a Spin: Introducing a Data Repository</title><link>https://thomas.skowron.eu/blog/spaten-data-repository/</link><pubDate>Thu, 21 Feb 2019 11:03:23 +0000</pubDate><guid>https://thomas.skowron.eu/blog/spaten-data-repository/</guid><description>&lt;p>In 2017, the &lt;a href="https://github.com/thomersch/grandine">Grandine project&lt;/a> was introduced, which helps working with geodata in a streamable fashion. At its core is the new file format &amp;ldquo;Spaten&amp;rdquo; for passing around geodata. A lot of people reached out to me, wanting to explore the possibilities.&lt;/p>
&lt;p>In order to facilitate work with the provided tooling, the &lt;a href="https://thomas.skowron.eu/spaten/download/">Spaten file repository&lt;/a> is released today: It is a collection of geodata files which are free to use and can be processed with any Spaten-compatible tool, like the &lt;a href="https://godoc.org/github.com/thomersch/grandine/lib/spaten">Go library&lt;/a>, the &lt;a href="https://pypi.org/project/spaten/">Python implementation&lt;/a> or the &lt;a href="https://github.com/thomersch/grandine/tree/master/cmd">Grandine Tools&lt;/a>.&lt;/p>
&lt;p>For now, the datasets are all based on &lt;a href="https://www.naturalearthdata.com">Natural Earth&lt;/a> and therefore released into the public domain. The data sets contain physical and cultural vector data at three scales: 1:10m, 1:50m and 1:110m.&lt;/p>
&lt;p>The download overview is reachable under &lt;a href="https://thomas.skowron.eu/spaten/download/">https://thomas.skowron.eu/spaten/download/&lt;/a>&lt;/p>
&lt;p>The Spaten file format is very efficient and therefore optimal for geodata applications at any scale: Decoding is on average 5x faster than GeoJSON and file sizes are up to 50 % smaller.&lt;/p>
&lt;p>To generate a vector tile set from a spaten file, just type:&lt;/p>
&lt;pre>&lt;code>grandine-tiler -in ne_10m_airports.spaten -zoom 10
&lt;/code>&lt;/pre></description></item><item><title>Envisioning a New Approach to Geodata Processing</title><link>https://thomas.skowron.eu/blog/geodata-processing-vision/</link><pubDate>Tue, 16 Oct 2018 11:05:00 +0200</pubDate><guid>https://thomas.skowron.eu/blog/geodata-processing-vision/</guid><description>&lt;p>&lt;img src="pipes.jpg" alt="Pipes">&lt;/p>
&lt;p>&lt;em>This is a recap of my work and the subsequently gained learnings of the last two years. Parts of it have been discussed in my talks on FOSSGIS and SOTM conferences, previous blog posts or podcasts.&lt;/em>&lt;/p>
&lt;h2 id="today">Today&lt;/h2>
&lt;p>Processing geo data can be approached in two ways: Either the algorithm is already clear, so the process can be automated and the data processed directly, or a human element is needed for processing in some kind of exploratory phase.&lt;/p>
&lt;p>There are processes which are somewhere in between, because they are generally application-driven but need human interaction in some step, e.g. generalization, in which a human preprocesses geometries but can later pass them into an automated pipeline which will use them for e.g. low-resolution purposes.&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>&lt;/p>
&lt;p>Manual processing is especially present in the traditional geospatial work: Surveying, spatial analysis and print maps. This is most often done using desktop GIS applications, like QGIS. For persistence either shapefiles or a local PostgreSQL/PostGIS database are used frequently.&lt;/p>
&lt;p>Automated processes very often have PostGIS at their core, too. Projects like OpenMapTiles even use a Postgres database to import all data first, only to retrieve it all again afterwards.&lt;/p>
&lt;p>&lt;img src="automated%20manual.png" alt="Automatic vs. Manual">&lt;/p>
&lt;p>For manual processes computational performance is seemingly not very important. Operations controlled by humans are often the limiting factor. Nonetheless in larger data sets it can become tedious when loading pauses become noticeable: Speaking from past personal experience, waiting for &lt;a href="https://tilemill-project.github.io/tilemill/">Tilemill&lt;/a> to rerender the current view after a minor change can be very stressful and slow down any iterative development.&lt;/p>
&lt;p>Automated processes often have very different characteristics, because they generally involve much larger data sets and mostly require a sequential processing of all present data. Typically those are workloads which are not suitable for a full relational database system.&lt;/p>
&lt;p>Graphical map making is, as mentioned before, a mixture: Often a large data set, but with a manual element. Especially in early stages, when a complete style needs to be designed, a lot of iterations may be needed, so low latency is necessary. Later on, throughput is much more important, which is probably the reason for the growing popularity of vector tiles: They offer both benefits, a low barrier for style changes while being very easy to scale.&lt;/p>
&lt;p>Spatial awareness is less important than it might seem for a lot of applications: Layers or other categorizations of features need to be considered, too, which often leads to a full read and write cycle in which every feature is retrieved, read and written back. For such workflows, the ACID semantics present in database suites are not needed.&lt;/p>
&lt;p>Also, big data sets are common: They are large enough to overflow available memory, but generally still small enough to be persisted on a single disk or machine.&lt;/p>
&lt;p>Sharing data sets can often be a large pain point: If sharing a central PostGIS is not feasible (e.g. because of too high latency or firewall and policy restrictions), passing on data can hinder collaboration: Shapefiles are severely limited, SpatiaLite is too slow for large-ish data sets and shipping off a complete PostgreSQL cluster can be quite difficult, especially for non-programming users.&lt;/p>
&lt;p>&lt;img src="postgres%20behemoth.png" alt="Postgres Behemoth">&lt;/p>
&lt;h3 id="on-the-other-side-of-the-fence">On the other side of the fence&lt;/h3>
&lt;p>Aside from classical geospatial work (like in government agencies) and neo-geographers (like OpenStreetMap contributors), a third class of users with geospatial needs has emerged: Companies which develop products like location-based services or social networks, and which are not in the tradition of using GIS tooling. They use software that is suited to their existing toolkit, programming environment and scale. They generally do not participate in standardization, but rather stay in their own ecosystem.&lt;/p>
&lt;p>While geospatial data sets are generally small compared to user data, some organizations prefer to shard them across nodes and take additional steps to make the data more available.&lt;/p>
&lt;h3 id="trends">Trends&lt;/h3>
&lt;p>Using tiles is not a new concept&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>: They were introduced when web-based maps were first brought to the public. There is no need to download the whole data set just to have a look at a geographical context of a few street blocks.&lt;/p>
&lt;p>With the inception of vector tiles, the same mechanism of reducing data delivery to small, spatially indexed files has been adapted in a wide variety of applications: The obvious map rendering in browsers&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>, but also routing on tiles&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup> and even geocoding&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>. Generally, engineers in VC-funded companies seem to enjoy building distributed toolchains in order to generate vector tiles.&lt;/p>
&lt;p>Also, more and more applications have an initial import stage, in which they transform all available data using domain-specific application logic to be able to operate.&lt;sup id="fnref:6">&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref">6&lt;/a>&lt;/sup>&lt;/p>
&lt;h3 id="in-process-recap">In-Process Recap&lt;/h3>
&lt;ul>
&lt;li>Automated geo data processing suffers from performance bottlenecks: Especially during development this can be tedious. Databases have a lot of nice characteristics, many of which are not needed for geospatial work.&lt;/li>
&lt;li>Distributed processes have a huge performance impact (read: dozens of instances working for days on a planet-sized data set)&lt;sup id="fnref:7">&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref">7&lt;/a>&lt;/sup>&lt;/li>
&lt;li>Not everyone is able to process worldwide/planet OpenStreetMap data on their own machine (OSM full planet needs &amp;gt;40 GB RAM).
&lt;ul>
&lt;li>The power of OSM comes from its grassroots approach, not from large committees!&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="how-to-make-everyone-happy">How to Make Everyone Happy&lt;/h2>
&lt;p>It becomes clear that there is a need for a resource-efficient approach which solves common problems. Computers have become very fast for some classes of problems while other things are still rather slow. Generally speaking, sequential processing of similar data has become very fast: with heavily pipelined CPU cache architectures, the spread of GPU-based processing or very efficient JIT systems. If embraced, we could benefit from huge performance gains by eliminating expensive index traversals and memory seeking.&lt;/p>
&lt;p>It needs to be possible for users to use data sets in their desktop GIS software while writing scripts or programs that process them very efficiently – optimally in the programming language of their choice, while being able to interact with other software.&lt;/p>
&lt;p>With the help of good libraries in diverse programming environments it is possible to construct an approach of &lt;strong>stream in, stream out&lt;/strong>: Every written application takes a stream of spatial features, processes each one by one (or optionally multi-threaded using workers) and returns the output as a stream. Thus an application needs only a fraction of the memory that a full data set would require. From there on, only the speed of the CPU, the number of cores and the complexity of the applied algorithm would limit the performance.&lt;/p>
&lt;p>&lt;img src="suggestion.png" alt="Suggestion">&lt;/p>
&lt;h2 id="use-cases">Use Cases&lt;/h2>
&lt;p>Let&amp;rsquo;s have a look at different use cases (in a simplified way) and judge how they could fit into such a mode of operation:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Multilingual maps&lt;/strong> (transliteration/transcription)
&lt;ul>
&lt;li>Work steps:
&lt;ul>
&lt;li>Traverse every feature&lt;/li>
&lt;li>Look at location&lt;/li>
&lt;li>Look at name tags&lt;/li>
&lt;li>Return every feature with appropriately transformed name&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Tiled maps&lt;/strong> (e.g. vector tiles)
&lt;ul>
&lt;li>Work steps:
&lt;ul>
&lt;li>Traverse every feature&lt;/li>
&lt;li>Look at location&lt;/li>
&lt;li>Determine affected tile&lt;/li>
&lt;li>Append geometries to appropriate tiles&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data analysis&lt;/strong>
&lt;ul>
&lt;li>Work steps are very dependent on the use case, but data sets are mostly not that large and can be scanned more efficiently than indexed and retrieved on every operation. Volatile data in particular has poor index utilization.&lt;/li>
&lt;li>When multiple people are involved, a shared DB can make more sense&lt;/li>
&lt;li>Also: PostGIS offers a lot of intuitive tools for processing&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Geocoding&lt;/strong>
&lt;ul>
&lt;li>Geocoding software like Nominatim has special importers, which have to process every feature on initialization, but mostly do not rely on a spatial database.&lt;/li>
&lt;li>Scanning every feature and applying an algorithm is already exercised.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
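&lt;p>For the tiled-maps case above, the &amp;ldquo;determine affected tile&amp;rdquo; step is plain per-feature arithmetic, which is why it streams so well. A sketch using the common web map tiling scheme (standard slippy-map math, not tied to any particular tool):&lt;/p>

```python
import math

def tile_for(lon, lat, zoom):
    # Web Mercator tile addressing: x derives from longitude, y from latitude.
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# Each point feature is appended to exactly one tile per zoom level.
print(tile_for(6.95697, 50.95137, 14))
```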
&lt;p>For multilingual maps, tiled maps or geocoding the streamed approach is pretty natural and could be applied in an efficient manner. For data analysis, it depends on the task and the people involved. Especially in exploratory phases, a database that allows concurrent access might be useful.&lt;/p>
&lt;h3 id="limitations-and-obstacles">Limitations and Obstacles&lt;/h3>
&lt;p>Of course, adoption requires not only strong benefits but also few obstacles, and not everyone is incentivized or motivated to switch to a new paradigm, which is a hindrance on its own. Apart from that, non-programmers still want to use their GUI software without much change. Users might notice when loading times improve or the friction of exchanging files is reduced, but missing features or changes to the daily workflow will be much more painful. After all, almost no-one complains about a coffee break while the computer is busy.&lt;sup id="fnref:8">&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref">8&lt;/a>&lt;/sup>&lt;/p>
&lt;p>PostGIS power users will probably never switch: They are used to a very specific declarative mindset, which is so wildly different that friction might always outweigh the speed or convenience benefits. Also, enterprise users will probably not consider switching before everything is standardized.&lt;/p>
&lt;p>When specifically considering an implementation of this approach and looking at the most probable reason why most software uses PostGIS as a geospatial toolkit, it seems that the majority of spatial libraries are of poor quality and correct handling of complex geometries is tricky, even with the support of specialized packages.&lt;/p>
&lt;h3 id="opportunities">Opportunities&lt;/h3>
&lt;p>This novel approach should not only cover existing use cases, but also bring new opportunities to the table: Hobbyists should be enabled to process worldwide data sets on their local computers (or on small, cheap cloud instances, if you prefer). Furthermore, applications could start being decoupled from a specific database implementation; e.g. CartoCSS projects contain inline SQL queries&lt;sup id="fnref:9">&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref">9&lt;/a>&lt;/sup>, which not only makes them non-portable to other databases, but also creates significant effort when porting them to a different table schema.&lt;/p>
&lt;h2 id="the-suggestion">The Suggestion&lt;/h2>
&lt;p>To recap, we should treat geo data as a stream of features that can be passed from application to application in an efficient manner. The same format to pass features should also be used for persistence on disk or sharing using the cloud. To pass data between applications, pipes can be used, reducing disk footprint and enabling more agile processing.&lt;/p>
&lt;p>This train of thought is encapsulated in the &lt;a href="https://thomas.skowron.eu/spaten">Spaten file format&lt;/a>, which I suggest as a basis for pipeline processing. This format is already used in the &lt;a href="https://github.com/thomersch/grandine">Grandine tools&lt;/a>.&lt;/p>
&lt;h3 id="state-of-the-vision">State of the Vision&lt;/h3>
&lt;p>A foundation has been laid: The format for serializing features has been defined and a first library implementation in Go is available for integration. This is used in the Grandine toolset in order to pass streams in and out, where applicable:&lt;/p>
&lt;ul>
&lt;li>&lt;code>grandine-tiler&lt;/code> can build vector tile sets from streams&lt;/li>
&lt;li>&lt;code>grandine-spatialize&lt;/code> converts OpenStreetMap data into a spatial format, serializing it as a Spaten stream&lt;/li>
&lt;li>&lt;code>grandine-converter&lt;/code> reads GeoJSON and streams out Spaten data&lt;/li>
&lt;/ul>
&lt;p>In discussions at conferences, meetups and hack days there has been general interest in the technology, and some people are willing to take a deeper look or integrate Spaten into their software in the future.&lt;/p>
&lt;h2 id="how-can-you-help">How can you help?&lt;/h2>
&lt;p>Of course this is just the first step of the journey. I think a lot of applications and people can benefit from a faster, more efficient toolchain. I strongly believe we can make open applications more cooperative and make tools written in different environments and languages more useful to everyone.&lt;/p>
&lt;p>But for that I need your help: &lt;strong>Tell me&lt;/strong> what you are up to. Write an &lt;strong>email&lt;/strong> or talk to me at an event. You can also &lt;strong>help test the existing libraries&lt;/strong> and &lt;strong>command line tools&lt;/strong>. If you find bugs, write an email or &lt;strong>file an issue&lt;/strong>. &lt;strong>Fuzzing&lt;/strong> or &lt;strong>expanding benchmarks&lt;/strong> and &lt;strong>tests&lt;/strong> would also be great. &lt;strong>Go experts&lt;/strong> can help me write a better &lt;strong>polygon clipping function&lt;/strong>.&lt;/p>
&lt;p>I am also looking for people willing to help build serialization libraries for more programming languages (especially JavaScript/Node.js and C). If you are interested, let&amp;rsquo;s work together!&lt;/p>
&lt;p>Also, if you want your company to be part of this, you can hire me as a freelancer.&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>Special thanks to the Prototype Fund, the Open Knowledge Foundation Germany and the German Ministry of Education and Research, who supported the project and funded six months of development.&lt;/p>
&lt;p>Also, all of this would not have been possible without Jochen Topf, who convinced me not to jump down the Postgres rabbit hole again and contributed a lot of the initial thinking.&lt;/p>
&lt;p>In the end I wouldn&amp;rsquo;t have come this far if not for the great conversations I had with people from the OpenStreetMap universe: If you have ever written an email, talked to me or just listened to my insane ideas: Thank you, you have been a great motivator!&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>&lt;a href="http://openstreetmapdata.com/processing/generalization">OpenStreetMapData&lt;/a> provides generalized coastlines for use in low zoom levels. &lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>OSGeo already &lt;a href="https://wiki.osgeo.org/wiki/Tile_Map_Service_Specification">published TMS in 2006&lt;/a> &lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>&lt;a href="https://www.mapbox.com/mapbox-gl-js/example/simple-map/">Mapbox GL JS&lt;/a> &lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>&lt;a href="https://github.com/valhalla">valhalla&lt;/a> &lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>&lt;a href="https://github.com/mapbox/carmen">carmen&lt;/a> &lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:6" role="doc-endnote">
&lt;p>OSRM needs to &lt;a href="https://github.com/Project-OSRM/osrm-backend/wiki/Running-OSRM">extract the road network&lt;/a> from source files before being able to run &lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:7" role="doc-endnote">
&lt;p>Approaches like OpenMapTiles need days to process a full OpenStreetMap planet. &lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:8" role="doc-endnote">
&lt;p>&lt;a href="https://www.xkcd.com/303">Relevant xkcd&lt;/a> &lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:9" role="doc-endnote">
&lt;p>Example from &lt;a href="https://github.com/gravitystorm/openstreetmap-carto/blob/master/project.mml#L88">OpenStreetMap CartoCSS code repository&lt;/a> &lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>Orphaned Gunicorn Processes with Supervisor</title><link>https://thomas.skowron.eu/blog/supervisor-orphaned-processes/</link><pubDate>Mon, 02 Jul 2018 23:06:00 +0200</pubDate><guid>https://thomas.skowron.eu/blog/supervisor-orphaned-processes/</guid><description>&lt;p>Python web applications generally expose a WSGI interface for client communication, which is rather unusual if you are used to more novel approaches like those of Go applications, where APIs are generally served via HTTP directly. Most deployments have an HTTP gateway in front, either for load balancing, high availability or TLS termination, so Python applications, e.g. Django projects, need an HTTP server that interfaces with WSGI.&lt;/p>
&lt;p>WSGI servers also deal with worker spawning: because of Python&amp;rsquo;s Global Interpreter Lock, a single Python process can execute only one thread at any given moment, so multiple worker processes are needed for parallelism.&lt;/p>
&lt;p>A popular WSGI server is &lt;a href="http://gunicorn.org">Gunicorn&lt;/a>, which is itself written in Python, so integrating it into an application is just one install command away. Launching a web application takes just a &lt;code>/app/venv/bin/gunicorn myapp.wsgi&lt;/code>. Higher parallelism can be configured with the &lt;code>--workers&lt;/code> parameter.&lt;/p>
&lt;p>Now, in order to ensure that a web application is started on boot and restarted if it fails, you need an init system. A well-regarded approach is to use &lt;a href="http://supervisord.org">supervisord&lt;/a>, which makes it easy to write unit files that define how an application is run:&lt;/p>
&lt;pre>&lt;code>[program:myapp]
environment = DJANGO_SETTINGS_MODULE=&amp;quot;myapp.settings&amp;quot;
user = myapp
command = /app/venv/bin/gunicorn myapp.wsgi
&lt;/code>&lt;/pre>
&lt;p>A handy feature of Gunicorn is graceful restarts: you only need to send a &lt;code>HUP&lt;/code> signal to the master process, which will finish handling all current requests while rolling over to the new version of the application. However, if you change supervisord&amp;rsquo;s config, you may need to actually restart the units.&lt;/p>
&lt;p>This can lead to unexpected behavior: the Gunicorn master process seemingly gets shut down, but the spawned child processes are still there. When supervisord tries to launch the new master, it gets stuck because the bound port is still occupied. A look at htop reveals the problem:&lt;/p>
&lt;pre>&lt;code>kernel
`- /sbin/init
   `- /app/venv/bin/gunicorn myapp.wsgi --workers 2
   `- /app/venv/bin/gunicorn myapp.wsgi --workers 2
   `- ...
   `- /usr/local/bin/supervisord
&lt;/code>&lt;/pre>
&lt;p>Normally, the Gunicorn application server processes should be children of supervisord. Instead they have been orphaned and are now children of PID 1, the init system. They can&amp;rsquo;t handle any traffic, because the master is already gone, so they are effectively stuck.&lt;/p>
&lt;p>A temporary solution is to &lt;code>kill -9&lt;/code> the orphaned processes, which allows supervisord to respawn the application on its next retry. This unblocks the application for the moment, but with the next supervisor restart the problem will appear again.&lt;/p>
&lt;p>The real cause of this problem is a specific supervisor behavior: it sends signals only to the controlled process and does not cascade them to its children. This can be resolved by setting &lt;code>stopasgroup=true&lt;/code> in the unit file, which prevents orphaning.&lt;/p>
&lt;p>Note that restarting or reloading supervisor is not needed in most cases: when you are rolling out a new application version, it suffices to use &lt;code>supervisorctl signal hup myapp&lt;/code>, which prevents a service disruption.&lt;/p>
&lt;p>The recommended supervisor unit configuration for a Python application run with Gunicorn is:&lt;/p>
&lt;pre>&lt;code>[program:myapp]
environment = DJANGO_SETTINGS_MODULE=&amp;quot;myapp.settings&amp;quot;
user = myapp
command = /app/venv/bin/gunicorn myapp.wsgi
redirect_stderr = true
stopasgroup = true
&lt;/code>&lt;/pre></description></item><item><title>Open Data: Best Practices</title><link>https://thomas.skowron.eu/blog/open-data-best-practices/</link><pubDate>Fri, 02 Feb 2018 18:48:00 +0100</pubDate><guid>https://thomas.skowron.eu/blog/open-data-best-practices/</guid><description>&lt;p>In case you are able to read German and are interested in pursuing an Open Data strategy, you might want to look at my &lt;a href="https://thomas.skowron.eu/opendata/best-practices/">Guide to Open Data&lt;/a>.&lt;/p></description></item><item><title>Grandine: Vector Tiles, Summary July 2017</title><link>https://thomas.skowron.eu/blog/grandine-summary-july-2017/</link><pubDate>Tue, 01 Aug 2017 00:52:00 +0200</pubDate><guid>https://thomas.skowron.eu/blog/grandine-summary-july-2017/</guid><description>&lt;p>&lt;em>Continuation of the documentation of my Prototype Fund related work. Previous summaries: &lt;a href="https://thomas.skowron.eu/blog/grandine-summary-march-2017/">March&lt;/a>, &lt;a href="https://thomas.skowron.eu/blog/grandine-summary-april-2017/">April&lt;/a>, &lt;a href="https://thomas.skowron.eu/blog/grandine-summary-may-2017/">May&lt;/a>, &lt;a href="https://thomas.skowron.eu/blog/grandine-summary-june-2017/">June&lt;/a>&lt;/em>&lt;/p>
&lt;p>This month I concentrated on picking up some loose ends and refining both command line tools, &lt;code>spatialize&lt;/code> and &lt;code>tiler&lt;/code>.&lt;/p>
&lt;h2 id="the-other-hard-thing-in-programming">The other hard thing in programming&lt;/h2>
&lt;p>I&amp;rsquo;ve been reminded that there are already tools called &amp;ldquo;tiler&amp;rdquo;. I don&amp;rsquo;t intend &amp;ldquo;tiler&amp;rdquo; to be the canonical name for this tool, but as it isn&amp;rsquo;t production-ready yet I didn&amp;rsquo;t want to invest too much time into a good name. Nevertheless I consider &amp;ldquo;tiler&amp;rdquo; and &amp;ldquo;spatialize&amp;rdquo; to be subcommands of Grandine, so it might actually be called &lt;code>grandine-tiler&lt;/code> in the future. But this is not set in stone.&lt;/p>
&lt;h2 id="spatialize">Spatialize&lt;/h2>
&lt;p>Spatialize is the tool for converting OSM data into a geospatial data format with proper geometries (it collects ways and relations and assembles polygons, if specified) and performs feature and layer name mapping. For starters, I implemented parts of the &lt;a href="https://openmaptiles.org/schema/">OpenMapTiles Vector Tile Schema&lt;/a>.&lt;/p>
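&lt;p>The tag-to-layer mapping can be pictured as a simple lookup; the sketch below is a heavily simplified, hypothetical stand-in for the real OpenMapTiles schema mapping:&lt;/p>

```go
package main

import "fmt"

// layerFor maps OSM tags onto a vector tile layer name, in the spirit of
// (but much simpler than) the OpenMapTiles schema. The second return value
// reports whether any layer matched at all.
func layerFor(tags map[string]string) (string, bool) {
	switch {
	case tags["building"] != "":
		return "building", true
	case tags["highway"] != "":
		return "transportation", true
	case tags["waterway"] != "" || tags["natural"] == "water":
		return "water", true
	}
	return "", false
}

func main() {
	l, ok := layerFor(map[string]string{"highway": "residential"})
	fmt.Println(l, ok)
}
```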
&lt;h2 id="tiler">Tiler&lt;/h2>
&lt;p>Tiler has been rather slow when it needed to cover large areas, because it was seeking through potentially large lists. With the newly introduced changes it builds an R-tree for fast geometry lookup if many geometries are present and/or many tiles need to be generated. Furthermore, the work is now spread across a number of workers, so it is able to utilize all CPU cores.&lt;/p>
&lt;h2 id="libspatial-changes">&lt;code>lib/spatial&lt;/code> changes&lt;/h2>
&lt;p>The internal spatial library has been refined, but also received a new feature: It is now able to simplify lines using the Douglas-Peucker algorithm.&lt;/p>
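&lt;p>For reference, Douglas-Peucker keeps the point farthest from the chord between the endpoints if its distance exceeds a tolerance, recurses on both halves, and otherwise collapses the run to just the endpoints. A self-contained sketch (not the &lt;code>lib/spatial&lt;/code> implementation):&lt;/p>

```go
package main

import (
	"fmt"
	"math"
)

// Point is a 2D coordinate.
type Point struct{ X, Y float64 }

// perpDist returns the perpendicular distance of p from the line through a and b.
func perpDist(p, a, b Point) float64 {
	dx, dy := b.X-a.X, b.Y-a.Y
	l := math.Hypot(dx, dy)
	if l == 0 {
		return math.Hypot(p.X-a.X, p.Y-a.Y)
	}
	return math.Abs(dy*p.X-dx*p.Y+b.X*a.Y-b.Y*a.X) / l
}

// Simplify reduces a polyline using Douglas-Peucker with tolerance epsilon.
func Simplify(pts []Point, epsilon float64) []Point {
	if len(pts) < 3 {
		return pts
	}
	// Find the point farthest from the chord between the endpoints.
	maxDist, idx := 0.0, 0
	for i := 1; i < len(pts)-1; i++ {
		if d := perpDist(pts[i], pts[0], pts[len(pts)-1]); d > maxDist {
			maxDist, idx = d, i
		}
	}
	if maxDist <= epsilon {
		// All intermediate points are close enough to the chord.
		return []Point{pts[0], pts[len(pts)-1]}
	}
	left := Simplify(pts[:idx+1], epsilon)
	right := Simplify(pts[idx:], epsilon)
	// Merge into a fresh slice, dropping the duplicated split point.
	res := make([]Point, 0, len(left)+len(right)-1)
	res = append(res, left[:len(left)-1]...)
	res = append(res, right...)
	return res
}

func main() {
	line := []Point{{0, 0}, {1, 0.1}, {2, -0.1}, {3, 5}, {4, 6}, {5, 7}, {6, 8.1}, {7, 9}, {8, 9}, {9, 9}}
	fmt.Println(Simplify(line, 1.0))
}
```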
&lt;h2 id="obstacles">Obstacles&lt;/h2>
&lt;p class="centered">
&lt;img src="https://thomas.skowron.eu/blog/grandine-summary-july-2017/broken-clipping.png" alt="Broken Clipping">
&lt;/p>
&lt;p>A thing that gets increasingly frustrating is the polygon clipping algorithm, which still does not work correctly in some cases. If you have any insight into this, or maybe have already implemented Weiler-Atherton, I would love to hear from you.&lt;/p>
&lt;h2 id="time-ticking">Time ticking&lt;/h2>
&lt;p>There is one month of funding left for the project, so I am trying to push out a few more cool things, because on August 31st it&amp;rsquo;s demo time!&lt;/p></description></item></channel></rss>