I’ve started a new project working with watmildon. While we were working together on applying the USGS Sq___ name changes to OSM, we noticed that features in OSM were often out of sync with official name changes that happened years ago.
That got us thinking about walking through the USGS GNIS data set to find places where names had changed and OSM could be updated. After all, there are many features in OSM that have gnis:feature_id (and similar) tags that can be directly matched back to the GNIS data set.
After kicking the idea around for a while, we recently started writing some code. I’ve been working on a matching engine in C# that matches records from GNIS to OSM by Feature ID. The code also looks for likely matches where the feature name, primary tags, and geometry are close to the information from GNIS. So far the results are pretty good, but we’re still working on improving the matching.
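To make the two passes concrete, here is a rough sketch in Python. It is not the actual C# engine: the `Feature` record, the `kind` field (which assumes OSM primary tags have already been mapped to a GNIS-style feature class), the 1 km distance limit, and the 0.85 name-similarity cutoff are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Feature:
    """Hypothetical, simplified record used for both GNIS rows and OSM elements."""
    name: str
    kind: str                      # assumed: a shared feature class such as "Stream"
    lat: float
    lon: float
    gnis_id: Optional[str] = None  # the gnis:feature_id value on the OSM side, if any


def distance_km(a, b):
    """Great-circle (haversine) distance between two features, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def name_similarity(a, b):
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(gnis, osm_features, max_km=1.0, min_name_score=0.85):
    """Return (osm_feature, "id"), (osm_feature, "likely"), or (None, "unmatched")."""
    # Pass 1: direct match on the GNIS Feature ID.
    for osm in osm_features:
        if osm.gnis_id is not None and osm.gnis_id == gnis.gnis_id:
            return osm, "id"

    # Pass 2: likely match on feature class, distance, and name.
    # Assumption: an OSM feature that already carries some other GNIS ID
    # is a different feature, so only untagged features are considered.
    candidates = [
        osm for osm in osm_features
        if osm.gnis_id is None
        and osm.kind == gnis.kind
        and distance_km(gnis, osm) <= max_km
        and name_similarity(gnis.name, osm.name) >= min_name_score
    ]
    if candidates:
        # Prefer the closest candidate when several pass the thresholds.
        return min(candidates, key=lambda o: distance_km(gnis, o)), "likely"
    return None, "unmatched"
```

In this sketch, an `"id"` match where the two names differ is the interesting case for name changes, while a `"likely"` match is a candidate for adding the missing gnis:feature_id tag after manual review.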
Meanwhile, watmildon did some large-scale statistical analysis on a local PBF file to look at the scale and scope of the problem. The results were very interesting!
Of the 2.3 million features in GNIS, there are only 1 million corresponding features with GNIS IDs in OSM. Some portion of these are surely existing features that just don’t have the gnis:feature_id (or similar) tags. But given our manual review of results from the matching code, there are a lot of GNIS features that are not present in OSM at all.
That’s not too much of a surprise. Some of the most common types of missing features are Streams, Valleys, Lakes, Springs, and Ridges – all things that are not widely mapped in the US.
GNIS recently archived the feature classes for civil names and man-made features. About half of the 1.3 million GNIS records that don’t have corresponding features in OSM are for those archived features. You might reasonably wonder whether it’s worth tagging the archived features in OSM. But that leaves about 600,000 current GNIS features that aren’t fully tagged in OSM. And a large portion of those are likely not mapped at all.
At this point, we’re still improving our tools and collecting and analyzing the data. There do seem to be some opportunities for automated tag cleanup, and if that makes sense we’ll follow community practices for anything like that.
But fixing the untagged/missing features is going to require manual review, and there are too many features for us to do that alone. We’ll have to keep working to find ways to enlist the rest of the community to help!