It’s no secret that getting data right is a constant battle that needs to be consistently refined, reworked, and analyzed. This is due to the fact that even the most comprehensive datasets begin to lose their accuracy as they age. This is because the world’s POIs change, and so too does the accuracy of our data if left unrefreshed. Unfortunately, most people working in the geospatial industry are very familiar with this issue. It’s the reason why flat file customers receive updated data according to set cadences, and our API customers receive calls from a dataset that is updated in real-time. In the case of POI and human movement, old data is bad data.
Data isn’t a fine red wine, and the idea that “old” really does mean “bad” doesn’t only apply to data, but also to the machine learning (ML) models that predominantly source the datasets. Your data should be looked at as a chain, where the old proverb of “only being as strong as the weakest link” rings true. If there is a fundamental flaw in the process from which you receive data, then any data from that point on will reflect those defects.
As a result, we at Foursquare have made a point to consistently and frequently analyze our data, their models, and the methods that go into aggregating over 120M+ POI worldwide. Each month, countless hours of work go into reviewing and evaluating the process across our various sources of data, such as ground truth datasets, authoritative third-party sources, and our own first-party data. Because of this due diligence, we occasionally find ways to improve our dataset, whether it be POI updates that we communicate on a monthly basis or in this month’s example, finding an ML model that needs to be reworked.
During an evaluation process, the data team discovered an area for improvement within our Geosummarizer model. As much as we hate to fall short of perfection, we’re always striving to learn and improve, so that we provide customers the most accurate data available.
For those who are unfamiliar, Geosummarizer is the model that selects a final Lat/ Long for a POI based on an analysis of geocodes from various inputs within its cluster. Examples of the inputs are our very own user-generated content, third party crawls, and many others. This matters because Geocodes represent our final, integrated view of what we think is the best representation of a real-world place.
For use cases such as mapping or delivery, an accurate geocode is absolutely essential as it’s a component in the algorithm that generates a “dropoff” point. A “dropoff” geocode is defined as the point in the center of a street orthogonally in front of the POI’s rooftop. These geocodes are targeted at use cases like rideshare, mapping, delivery or the tech that helps a driver find where to drop off a passenger. Accurate geocodes can make all the difference in achieving a successful outcome for the use case above.
Ultimately, the improved Geosummarizer model increased the accuracy of these geocodes by reducing the median distance between them in our dataset by 20%.
Moving forward, we would like to share more insights into the innovative improvements taking place behind the scenes here at Foursquare. While a minor release or update may not be the most flashy news, in many cases, these improvements make all the difference in the quality of our data, directly benefiting our customers. For readers interested in learning more about the latest data improvements and Geosummarizer, we invite you to watch a recent webinar where Foursquare’s leading data team will cover this process in-depth.