The geospatial industry is undergoing a wave of rapid innovation, driven by new geospatial data formats designed from the ground up to leverage the capabilities of the cloud.
We expect these cloud-native advances to benefit a wide range of geospatial users. To provide a world-class location intelligence platform, Foursquare actively participates in this ongoing innovation.
This post gives you a peek at our notes, sharing our thoughts on how the geospatial community can utilize new technologies and how products like Foursquare’s can help users adopt the tools and workflows of this new ecosystem.
Cloud-Native Geospatial Formats
Or… what would the geospatial world look like if we built everything from the ground up on the cloud?
The geospatial industry has always been characterized by the massive size of geospatial datasets.
Let’s go way back to 1972, when NASA’s first Landsat satellite flew with a set of 76-pound cameras with 1 km of film… roughly amounting to 3.75 gigs of storage. Unfortunately, two-thirds of the data collected between 1972 and 1999 was rendered inaccessible due to the film format, downlinking to ground stations, and other hardware issues.
We have come a long, long way. Modern data storage, the internet, the proliferation of open source tooling, and the development of spatial databases allow us to study millions of layers that make up our planet.
Back to 2024… Why do we need new geo formats? What trends are we seeing in the last five years that we didn’t see 10 years ago?
Short answer: The availability of cheap, reliable, and scalable cloud storage is enabling new, more efficient and flexible ways of working with geospatial data.
Traditional databases and service providers are on the verge of becoming unnecessary overhead for accessing massive geospatial datasets. New cloud-native data formats have emerged that let us efficiently access data directly from cloud storage.
Below is a selection of cloud-native geospatial formats we believe are becoming cornerstones of the industry.
Format | Details | Use Cases |
---|---|---|
GeoArrow | columnar binary, in-memory optimized | • High-performance analytics • Loaded data can be accessed directly by multiple CPU processes and GPU • Efficient querying/processing • Partial reads via row groups (aka RecordBatches) |
GeoParquet | columnar binary, storage optimized | • Optimized for compact storage • Ideal for minimizing storage footprint • Internal encoding and compression • Partial reads via row groups |
FlatGeobuf | row-oriented binary | • A binary GeoJSON replacement • Streaming loads • Supports append operations • Optional spatial index enables loading only rows covering a specified area |
PMTiles | entire tileset as a single file | • Simplifies distribution and handling of massive tilesets • Internal tile compression, compact representation of map tiles |
COG | raster with built-in zoom level hierarchy | • Enables clients to load, analyze and visualize large-scale raster data |
GeoZarr | multidimensional georeferenced grid | • Scientific/environmental modeling • Storage and analysis on several dimensions |
COPC | hierarchically “tiled” point clouds | • Lidar and 3D point cloud data management • Efficient storage and retrieval of point cloud data |
STAC | SpatioTemporal Asset Catalog | • Cataloging satellite, remote sensing, and other spatiotemporal assets |
A common theme in the above formats is that they are designed to store data in single, very large (“larger than memory”) files that offer some form of internal structure (whether row groups, row ranges, internal tiles, etc.) that allows clients to load only the “chunks” of the file containing the data they need.
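To make the "load only the chunks you need" idea concrete, here is a minimal sketch in Python. It uses a made-up toy container (chunks, then a JSON index, then an 8-byte index length), loosely modeled on the way Parquet keeps its metadata in a footer. The names `write_chunked`, `RangeReader`, and `read_chunk` are hypothetical, and `RangeReader` simply stands in for HTTP range requests against cloud storage; none of this is the actual layout of any format in the table.

```python
import io
import json
import struct


def write_chunked(chunks: dict) -> bytes:
    """Toy container: chunk bytes, then a JSON index, then an 8-byte index length."""
    body = io.BytesIO()
    index = {}
    for name, payload in chunks.items():
        index[name] = (body.tell(), len(payload))  # (offset, length) of each chunk
        body.write(payload)
    index_bytes = json.dumps(index).encode("utf-8")
    body.write(index_bytes)
    body.write(struct.pack("<Q", len(index_bytes)))  # fixed-size trailer
    return body.getvalue()


class RangeReader:
    """Stand-in for HTTP range requests; counts the bytes actually fetched."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self.size = len(blob)
        self.bytes_fetched = 0

    def read_range(self, offset: int, length: int) -> bytes:
        self.bytes_fetched += length
        return self._blob[offset:offset + length]


def read_chunk(reader: RangeReader, name: str) -> bytes:
    # 1. Fetch the 8-byte trailer to learn how long the index is.
    (index_len,) = struct.unpack("<Q", reader.read_range(reader.size - 8, 8))
    # 2. Fetch only the index, which sits just before the trailer.
    index = json.loads(reader.read_range(reader.size - 8 - index_len, index_len))
    # 3. Fetch only the requested chunk.
    offset, length = index[name]
    return reader.read_range(offset, length)


# Three "regions" in one file; the client only wants one of them.
blob = write_chunked({"europe": b"E" * 1000, "asia": b"A" * 2000, "africa": b"F" * 1500})
reader = RangeReader(blob)
data = read_chunk(reader, "asia")
print(f"fetched {reader.bytes_fetched} of {reader.size} bytes")
```

The client makes three small reads (trailer, index, chunk) and never touches the other regions, which is the property that makes a single very large file practical to serve from plain object storage.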
These formats are suited for loading data over the internet, since only relevant tiles/data chunks need to be loaded by the user.
Many of the aforementioned data formats support queries, partial reads, and other methods of streamlined data retrieval. You can zoom to the opposite side of the planet, request a different satellite layer, a different dataset entirely, or run a calculation across datasets – these formats are designed to only pull the data required.
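As a rough illustration of how a client decides which data is "required", the sketch below computes the map tiles that cover a viewport. It assumes the common Web Mercator XYZ tiling scheme used by many web maps and tilesets; formats such as COG organize their internal tiles differently, so treat this as the general idea rather than any one format's addressing. The function names and the example bounding box are our own.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Return the (x, y) tile containing a point in the Web Mercator XYZ scheme."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the edge of the map stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west: float, south: float, east: float, north: float, zoom: int) -> list:
    """List the (zoom, x, y) tiles that cover a bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # y grows southward
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


# A viewport roughly over Manhattan at zoom 12 (coordinates are illustrative).
tiles = tiles_for_bbox(-74.05, 40.68, -73.90, 40.88, zoom=12)
total_tiles = 4 ** 12
print(f"{len(tiles)} tiles needed out of {total_tiles} at this zoom level")
```

Panning to the other side of the planet only changes which handful of tiles is requested; the size of the full tileset never enters the picture.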
Open Table Formats
While cloud-native geospatial formats such as GeoParquet, GeoArrow, and FlatGeobuf allow us to scan big tabular datasets directly from cloud storage, they cannot on their own match the capabilities of a traditional database.
There is no logical table concept or standard SQL query engine, and since the data is effectively read-only, there is no support for adding, updating, or deleting table rows. A data format can only provide so much assistance when it comes to applying advanced operations.
To fill this gap, table formats such as Apache Iceberg, Apache Hudi, and Databricks’ Delta Lake build on cloud-native data formats such as Parquet and add extra capabilities:
- Table-like schemas: raw files can be viewed as tables with rows and columns
- Inserts, updates, deletes, and merge operations let us update datasets as we would a relational database table
- Transactional safety, allowing multiple users to safely read/write data from tables
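The capabilities in this list can be sketched with a toy model: immutable data files plus a log of snapshots that says which files make up the table at each version. This is loosely inspired by the snapshot and transaction-log ideas behind open table formats, but it is not how Iceberg, Hudi, or Delta Lake are actually implemented. `ToyTable`, `Snapshot`, and the in-memory "files" are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    data_files: tuple  # names of the immutable files that form the table


class ToyTable:
    def __init__(self):
        self._files = {}               # stands in for immutable files on object storage
        self._log = [Snapshot(0, ())]  # append-only list of table versions
        self._file_counter = 0

    def current(self) -> Snapshot:
        return self._log[-1]

    def _write_file(self, rows) -> str:
        self._file_counter += 1
        name = f"data-{self._file_counter:05d}.parquet"
        self._files[name] = list(rows)
        return name

    def _commit(self, base: Snapshot, files: tuple) -> Snapshot:
        # Optimistic concurrency: refuse to commit on top of a stale snapshot.
        if base.version != self.current().version:
            raise RuntimeError("conflict: table changed since this snapshot was read")
        snapshot = Snapshot(base.version + 1, tuple(files))
        self._log.append(snapshot)
        return snapshot

    def insert(self, base: Snapshot, rows) -> Snapshot:
        return self._commit(base, base.data_files + (self._write_file(rows),))

    def delete(self, base: Snapshot, predicate) -> Snapshot:
        # Copy-on-write: rewrite only the files that contain matching rows.
        new_files = []
        for name in base.data_files:
            rows = self._files[name]
            kept = [row for row in rows if not predicate(row)]
            if len(kept) == len(rows):
                new_files.append(name)  # untouched file is reused as-is
            elif kept:
                new_files.append(self._write_file(kept))
        return self._commit(base, tuple(new_files))

    def scan(self, snapshot=None) -> list:
        if snapshot is None:
            snapshot = self.current()
        return [row for name in snapshot.data_files for row in self._files[name]]


table = ToyTable()
v0 = table.current()
v1 = table.insert(v0, [{"id": 1, "name": "cafe"}, {"id": 2, "name": "park"}])
v2 = table.delete(v1, lambda row: row["id"] == 1)
print(table.scan())    # rows in the latest version
print(table.scan(v1))  # rows as of the earlier snapshot
```

Because data files are never modified in place, readers holding an older snapshot keep seeing a consistent table, and a writer working from a stale snapshot is rejected instead of silently overwriting someone else's change.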
Not only are open table formats less complex than traditional databases, they enable us to work with our data stacks more flexibly.
Since open table formats leverage standardized open cloud-native geospatial formats, users can now build simple data stacks using only open tooling. In many cases, these formats have the potential to replace costly data warehouses, avoiding vendor lock-in, and letting users easily switch out or augment tooling with other tools that also work with cloud-native formats.
Massive Open Datasets
The evolution of cloud-native data formats supports a rapid increase in the availability of high-quality global geospatial datasets. Anyone with an internet connection has a direct line to a substantially richer reservoir of geospatial data than ever before. Let’s go over a few notable new datasets:
Protomaps
Protomaps is a “free and open source map of the world,” and it is offered in the PMTiles cloud-native format (weighing in at just over 100 GB).
Overture Maps Foundation
The Overture Maps Foundation is funded by some of the world’s biggest tech companies with the goal of creating interoperable global map data. So far, it provides:
- Administrative boundaries
- Transportation networks
- Building footprints
- Places of interest
These datasets are still growing, but already consume over 200 GB of disk space when downloaded directly. Cloud-native data formats are well-suited for sharing and serving this data, and indeed the Overture Maps Foundation is publishing its datasets in cloud-native Parquet.
Sentinel and Landsat Satellites: Raster (STAC / COG)
The combined data collected over the years by Copernicus’s Sentinel and NASA’s Landsat satellites is already available from several sources, including AWS.
These “petabyte sized” raster datasets truly exceed the limit of what is practical for users to copy and process locally.
Fortunately, these datasets are provided on public cloud storage as massive image collections in Cloud-Optimized GeoTIFF format indexed by STAC metadata. This cloud-native setup enables tool makers (including Foursquare) to provide users with access to tremendous amounts of high-resolution historical data and offer advanced search and analytics capabilities.
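To show what "indexed by STAC metadata" buys you, here is a small sketch that filters STAC-style items by area and time before any imagery is touched. The items follow the general shape of a STAC Item (`bbox`, `properties.datetime`, `assets`), but the IDs, the `visual` asset key, and the `example.com` URLs are invented for illustration. In practice a STAC API would usually run this kind of search server-side.

```python
from datetime import datetime, timezone

# Hypothetical catalog entries; bbox is [west, south, east, north].
ITEMS = [
    {
        "id": "scene-paris-2023-06",
        "bbox": [2.0, 48.5, 2.8, 49.1],
        "properties": {"datetime": "2023-06-15T10:30:00Z"},
        "assets": {"visual": {"href": "https://example.com/paris-2023-06.tif"}},
    },
    {
        "id": "scene-paris-2019-01",
        "bbox": [2.0, 48.5, 2.8, 49.1],
        "properties": {"datetime": "2019-01-10T10:30:00Z"},
        "assets": {"visual": {"href": "https://example.com/paris-2019-01.tif"}},
    },
    {
        "id": "scene-tokyo-2023-06",
        "bbox": [139.4, 35.4, 140.0, 35.9],
        "properties": {"datetime": "2023-06-20T01:30:00Z"},
        "assets": {"visual": {"href": "https://example.com/tokyo-2023-06.tif"}},
    },
]


def bbox_intersects(a, b) -> bool:
    """True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(items, bbox, start: datetime, end: datetime) -> list:
    """Return the items that overlap the bbox and fall inside the time window."""
    hits = []
    for item in items:
        stamp = item["properties"]["datetime"].replace("Z", "+00:00")
        when = datetime.fromisoformat(stamp)
        if bbox_intersects(item["bbox"], bbox) and start <= when <= end:
            hits.append(item)
    return hits


hits = search(
    ITEMS,
    bbox=[2.2, 48.8, 2.4, 48.9],  # a small area inside Paris
    start=datetime(2023, 1, 1, tzinfo=timezone.utc),
    end=datetime(2023, 12, 31, tzinfo=timezone.utc),
)
hrefs = [item["assets"]["visual"]["href"] for item in hits]
print(hrefs)
```

Only the matching assets then need to be opened, and because they are Cloud-Optimized GeoTIFFs, a client can go on to read just the part of each image it needs.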
New Open Tools: DuckDB
At Foursquare we are typically database agnostic, but one database has waddled into our purview: DuckDB. It’s created quite the splash – we are even noticing rapid adoption among our customers.
DuckDB fits perfectly into the cloud-native open tooling ecosystem that is emerging around the new cloud-native geospatial formats.
- It has strong support for working directly with cloud-native formats such as Apache Parquet and Apache Arrow, and geospatial extensions add support for their geospatial variants.
- There are also new extensions that enable DuckDB to work against “open tables” in e.g. Apache Iceberg.
- Performance is impressive, with users reporting that DuckDB can read Parquet directly from an Amazon S3 Express One Zone bucket at 1.2 GB per second.
Standards and Open Source
Standardization: The new cloud-native geospatial formats are typically developed and standardized through cross-industry collaborations. Foursquare is a technical member of the Open Geospatial Consortium, and most recently we have contributed to the specification of the GeoParquet format to add geospatial extensions to Parquet.
Open Source Formats: Foursquare develops and maintains open source implementations of over 30 geospatial formats in the loaders.gl GitHub repository. We recently contributed implementations for multiple cloud-native geospatial formats including GeoArrow, GeoParquet, PMTiles and FlatGeobuf.
Naturally, Foursquare uses this open source code in our own tools; we launched support for FlatGeobuf and PMTiles formats as well as Delta Lake integration in our Foursquare Studio product.
And on the pure open source applications side, Foursquare led the work on the recent kepler.gl 3.0 release, which includes support for GeoArrow.
Get Started
- Sign up for Foursquare Studio to start using cloud-native geospatial formats
- Join our Studio community Slack channel to ask questions and let us know what you think about these technologies
- Explore our open-source implementation of cloud-native formats in the loaders.gl repository
- Join the open source visualization community at https://www.openvisualization.org/