An Inside Look at Cloud-Native Geospatial Formats

The geospatial industry is undergoing a wave of rapid innovation, driven by new geospatial data formats that have been designed from the ground up to leverage the capabilities of the cloud.

We expect these cloud-based advancements to benefit a wide range of geospatial users. To provide a world-class location intelligence platform, Foursquare actively participates in this ongoing innovation.

This post gives you a peek at our notes, sharing our thoughts on how the geospatial community can utilize new technologies and how products like Foursquare’s can help users adopt the tools and workflows of this new ecosystem.


Cloud-Native Geospatial Formats

Or… what would the geospatial world look like if we built everything from the ground up on the cloud?

The geospatial industry has always been characterized by the massive size of geospatial datasets. 

Let’s go way back to 1972, when NASA’s first Landsat satellite flew with a set of 76-pound cameras with 1 km of film… roughly amounting to 3.75 gigs of storage. Unfortunately, two-thirds of the data collected between 1972 and 1999 was rendered inaccessible due to the film format, downlinking to ground stations, and other hardware issues.

Landsat 1 launched in 1972. Photo by NASA.

We have come a long, long way. Modern data storage, the internet, the proliferation of open source tooling, and the development of spatial databases allow us to study millions of layers that make up our planet.

Back to 2024… Why do we need new geo formats? What trends have we seen in the last five years that we didn’t see 10 years ago?

Short answer: The availability of cheap, reliable, and scalable cloud storage is enabling new, more efficient and flexible ways of working with geospatial data. 

Traditional databases and service providers are on the verge of becoming unnecessary overhead for accessing massive geospatial datasets. New cloud-native data formats have emerged that let us efficiently access data directly from cloud storage.

Below is a selection of cloud-native geospatial formats we believe are becoming cornerstones of the industry. 

Format (details), followed by use cases:

GeoArrow (columnar binary, in-memory optimized)
  • High-performance analytics
  • Loaded data can be accessed directly by multiple CPU processes and GPU
  • Efficient querying/processing
  • Partial reads via row groups (aka RecordBatches)

GeoParquet (columnar binary, storage optimized)
  • Optimized for compact storage
  • Ideal for minimizing storage footprint
  • Internal encoding and compressions
  • Partial reads via row groups

FlatGeobuf (row-oriented binary)
  • A binary GeoJSON replacement
  • Streaming loads
  • Supports append operations
  • Optional spatial index enables loading only rows covering a specified area

PMTiles (entire tileset as a single file)
  • Simplifies distribution and handling of massive tilesets
  • Internal tile compression, compact representation of map tiles

COG (raster with built-in zoom level hierarchy)
  • Enables clients to load, analyze and visualize large-scale raster data

GeoZarr (multidimensional georeferenced grid)
  • Scientific/environmental modeling
  • Storage and analysis on several dimensions

COPC (hierarchically “tiled” point clouds)
  • Lidar and 3D point cloud data management
  • Efficient storage and retrieval of point cloud data

STAC (SpatioTemporal Asset Catalog)
  • Cataloging satellite, remote sensing, and other spatiotemporal assets

A common theme in the above formats is that they are designed to store data in single, very large (“larger than memory”) files that offer some form of internal structure (whether row groups, row ranges, internal tiles, etc.) that allows clients to load only the “chunks” of the file containing the data they need.
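To make the "load only the chunks you need" idea concrete, here is a minimal Python sketch of a toy single-file layout with an index stored at the end of the file. The layout, the `RangeReader` class, and the chunk names are all made up for illustration; none of the formats above use this exact structure, and the reader only stands in for HTTP range requests against cloud storage.

```python
import io
import json
import struct


def write_chunked(chunks: dict) -> bytes:
    """Pack named chunks into one blob: [chunk bytes...][JSON index][8-byte index length]."""
    body = io.BytesIO()
    index = {}
    for name, data in chunks.items():
        index[name] = (body.tell(), len(data))  # (offset, length) of this chunk
        body.write(data)
    index_bytes = json.dumps(index).encode("utf-8")
    body.write(index_bytes)
    body.write(struct.pack("<Q", len(index_bytes)))  # fixed-size footer
    return body.getvalue()


class RangeReader:
    """Hypothetical stand-in for HTTP range requests; counts bytes actually fetched."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self.bytes_fetched = 0

    def size(self) -> int:
        return len(self._blob)

    def read_range(self, offset: int, length: int) -> bytes:
        self.bytes_fetched += length
        return self._blob[offset:offset + length]


def read_chunk(reader: RangeReader, name: str) -> bytes:
    """Fetch the footer, then the index, then only the requested chunk."""
    size = reader.size()
    (index_len,) = struct.unpack("<Q", reader.read_range(size - 8, 8))
    index = json.loads(reader.read_range(size - 8 - index_len, index_len))
    offset, length = index[name]
    return reader.read_range(offset, length)


# Three 10 KB "tiles" in one file; we only want the middle one.
blob = write_chunked({
    "tile_a": b"A" * 10_000,
    "tile_b": b"B" * 10_000,
    "tile_c": b"C" * 10_000,
})
reader = RangeReader(blob)
tile = read_chunk(reader, "tile_b")
print(f"fetched {reader.bytes_fetched} of {len(blob)} bytes")
```

The client makes three small reads (footer, index, chunk) instead of downloading the whole file, which is the same basic trade the real formats make with their row groups, tile directories, and spatial indexes.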

FSQ Studio: Loading comparison for 400,000 OSM building polygons. The cloud-native FlatGeobuf format loads significantly faster than GeoJSON, while PMTiles only loads what’s necessary for the visualization.

These formats are suited for loading data over the internet, since only relevant tiles/data chunks need to be loaded by the user.

Many of the aforementioned data formats support queries, partial reads, and other methods of streamlined data retrieval. You can zoom to the opposite side of the planet, request a different satellite layer, a different dataset entirely, or run a calculation across datasets – these formats are designed to only pull the data required. 
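For tiled formats, "only pull the data required" usually starts with working out which tiles cover the current view. The sketch below assumes the common Web Mercator XYZ tiling scheme; the function names are ours, and the code ignores views that cross the antimeridian.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator projection


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Return the (x, y) tile containing a lon/lat point at the given zoom level."""
    n = 2 ** zoom  # number of tiles along each axis
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon=180 or lat=-MAX_LAT still maps to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west: float, south: float, east: float, north: float, zoom: int) -> list:
    """List the (zoom, x, y) tiles covering a bounding box (no antimeridian handling)."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


# A view over part of New York City at zoom 12 touches a handful of tiles,
# out of 4**12 tiles that exist at that zoom level.
nyc_tiles = tiles_for_bbox(-74.05, 40.68, -73.90, 40.82, 12)
print(len(nyc_tiles), "tiles needed out of", 4 ** 12)
```

A client would then look up just those tiles in the file's internal directory and fetch their byte ranges.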


Open Table Formats

While cloud-native geospatial formats such as GeoParquet, GeoArrow, and FlatGeobuf allow us to scan big tabular datasets directly from cloud storage, they cannot on their own match the capabilities of a traditional database.

There is no logical table concept or standard SQL query engine, and since the data is largely read-only, there is no support for adding, updating, or deleting table rows. A data format can only provide so much assistance when it comes to applying advanced operations.

To fill this gap, table formats such as Apache Iceberg, Apache Hudi, and Databricks’ Delta Lake build on cloud-native data formats such as Parquet and add extra capabilities:

  • Table-like schemas: raw files can be viewed as tables with rows and columns
  • Inserts, updates, deletes, and merge operations let us update datasets as we would a relational database table
  • Transactional safety, allowing multiple users to safely read/write data from tables
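As a rough illustration of how a table layer can sit on top of immutable files, here is a toy Python sketch. It is not how Iceberg, Hudi, or Delta Lake lay out their metadata; `ToyTable` and everything in it is hypothetical. The idea it shows is that data files are never modified, each commit produces a new snapshot (a list of files), and readers always see one consistent snapshot.

```python
import itertools
import json


class ToyTable:
    """Toy snapshot-based table over immutable 'files' (an in-memory dict)."""

    def __init__(self):
        self.files = {}        # file name -> immutable JSON payload ("object storage")
        self.snapshots = [[]]  # snapshot id -> list of file names; snapshot 0 is empty
        self._ids = itertools.count()

    @property
    def current(self) -> int:
        return len(self.snapshots) - 1

    def _write_file(self, rows) -> str:
        name = f"data-{next(self._ids)}.json"
        self.files[name] = json.dumps(rows)
        return name

    def _commit(self, base: int, file_names) -> int:
        # A real store would need an atomic compare-and-swap here; this check
        # only illustrates rejecting a commit built on a stale snapshot.
        if base != self.current:
            raise RuntimeError("concurrent commit detected; retry on the latest snapshot")
        self.snapshots.append(list(file_names))
        return self.current

    def append(self, rows) -> int:
        base = self.current
        return self._commit(base, self.snapshots[base] + [self._write_file(rows)])

    def delete_where(self, predicate) -> int:
        base = self.current
        kept_files = []
        for name in self.snapshots[base]:
            rows = json.loads(self.files[name])
            kept = [row for row in rows if not predicate(row)]
            if len(kept) == len(rows):
                kept_files.append(name)  # untouched file is reused as-is
            elif kept:
                kept_files.append(self._write_file(kept))  # copy-on-write rewrite
        return self._commit(base, kept_files)

    def scan(self, snapshot=None) -> list:
        snap = self.current if snapshot is None else snapshot
        return [row for name in self.snapshots[snap] for row in json.loads(self.files[name])]


table = ToyTable()
s1 = table.append([{"id": 1, "name": "cafe"}, {"id": 2, "name": "park"}])
s2 = table.append([{"id": 3, "name": "museum"}])
s3 = table.delete_where(lambda row: row["id"] == 2)

print(table.scan())    # latest snapshot, after the delete
print(table.scan(s2))  # older snapshot is still readable
```

Because old snapshots remain intact, a reader that started on `s2` is unaffected by the delete committed in `s3`.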

Not only are open table formats less complex than traditional databases, they enable us to work with our data stacks more flexibly.

Since open table formats build on standardized, open cloud-native formats, users can now build simple data stacks using only open tooling. In many cases, these formats have the potential to replace costly data warehouses, avoid vendor lock-in, and let users easily swap out or augment their tooling with other tools that also work with cloud-native formats.


Massive Open Datasets

The evolution of cloud-native data formats supports a rapid increase in the availability of high-quality global geospatial datasets. Anyone with an internet connection has a direct line to a substantially richer reservoir of geospatial data than ever before. Let’s go over a few notable new datasets:

Protomaps

Protomaps is a “free and open source map of the world,” and it is offered in the PMTiles cloud-native format (weighing in at just over 100 GB).

Browsing 100 GB of Protomaps PMTiles, with only the required data chunks loading to the browser.

Overture Maps Foundation

The Overture Maps Foundation is funded by some of the world’s biggest tech companies with the goal of creating interoperable global map data. So far, it provides:

  • Administrative boundaries
  • Transportation networks
  • Building footprints
  • Places of interest

These datasets are still growing, but already consume over 200 GB of disk space when downloaded directly. Cloud-native data formats are well-suited for easily sharing and serving this data, and indeed the Overture Maps Foundation is publishing its datasets in cloud-native Parquet.

Sentinel and Landsat Satellites: Raster (STAC / COG) 

The combined data collected over the years by Copernicus’s Sentinel and NASA’s Landsat satellites are already available from several sources, including AWS.

These “petabyte sized” raster datasets truly exceed the limit of what is practical for users to copy and process locally.

Fortunately, these datasets are provided on public cloud storage as massive image collections in Cloud-Optimized GeoTIFF format indexed by STAC metadata. This cloud-native setup enables tool makers (including Foursquare) to provide users with access to tremendous amounts of high-resolution historical data and offer advanced search and analytics capabilities.
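To show roughly what "indexed by STAC metadata" buys you, here is a small Python sketch that filters STAC-style items by area and time before touching any imagery. The `bbox`, `properties.datetime`, and `assets` fields follow the STAC Item layout; the item ids, the `visual` asset key, and the `s3://example-bucket/...` URLs are invented for illustration, and a real workflow would query a STAC API rather than a local list.

```python
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as '2023-06-01T10:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bbox_intersects(a, b) -> bool:
    """True if two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(items, bbox, start: datetime, end: datetime) -> list:
    """Return items whose footprint overlaps bbox and whose datetime is in [start, end]."""
    hits = []
    for item in items:
        when = parse_time(item["properties"]["datetime"])
        if bbox_intersects(item["bbox"], bbox) and start <= when <= end:
            hits.append(item)
    return hits


# Hypothetical catalog entries; each points at a COG on cloud storage.
catalog = [
    {"id": "scene-a", "bbox": [-74.5, 40.0, -73.0, 41.5],
     "properties": {"datetime": "2023-06-01T15:30:00Z"},
     "assets": {"visual": {"href": "s3://example-bucket/scene-a.tif"}}},
    {"id": "scene-b", "bbox": [-74.5, 40.0, -73.0, 41.5],
     "properties": {"datetime": "2019-06-01T15:30:00Z"},
     "assets": {"visual": {"href": "s3://example-bucket/scene-b.tif"}}},
    {"id": "scene-c", "bbox": [2.0, 48.5, 3.0, 49.2],
     "properties": {"datetime": "2023-06-02T10:45:00Z"},
     "assets": {"visual": {"href": "s3://example-bucket/scene-c.tif"}}},
]

hits = search(
    catalog,
    bbox=[-74.05, 40.68, -73.90, 40.82],
    start=datetime(2023, 1, 1, tzinfo=timezone.utc),
    end=datetime(2023, 12, 31, tzinfo=timezone.utc),
)
cog_urls = [item["assets"]["visual"]["href"] for item in hits]
print(cog_urls)
```

Only the matching COG URLs are handed to the raster reader, which can then fetch just the tiles and overview levels it needs from each file.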

FSQ Studio users can access free raster datasets through its Raster Tile features.

New Open Tools: DuckDB

At Foursquare we are typically database agnostic, but one database has waddled into our purview: DuckDB. It’s created quite the splash – we are even noticing rapid adoption among our customers.

DuckDB fits perfectly into the cloud-native open tooling ecosystem that is emerging around the new cloud-native geospatial formats. 

  • It has strong support for working directly with cloud-native formats such as Apache Parquet and Apache Arrow, and geospatial extensions are available for working with their geospatial variants.
  • There are also new extensions that enable DuckDB to work against “open tables” in formats such as Apache Iceberg.
  • Performance is impressive, with users reporting that DuckDB can read Parquet directly from an Amazon S3 Express One Zone bucket at 1.2 GB per second.

DuckDB’s speed and flexibility are enabling new workflows for data scientists.
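As a hedged sketch of what such a workflow can look like, the Python snippet below builds a bounding-box query over a Parquet file on cloud storage and runs it with the `duckdb` package (a third-party install). The bucket URL and the `lon`/`lat` column names are placeholders we made up; real datasets will have their own schemas, and private buckets need credentials configured.

```python
def build_bbox_query(parquet_url: str, west: float, south: float,
                     east: float, north: float, limit: int = 1000) -> str:
    """Build a SQL query selecting rows inside a bounding box.

    Assumes hypothetical numeric columns named `lon` and `lat`.
    """
    safe_url = parquet_url.replace("'", "''")  # escape single quotes for SQL
    return (
        f"SELECT * FROM read_parquet('{safe_url}') "
        f"WHERE lon BETWEEN {float(west)} AND {float(east)} "
        f"AND lat BETWEEN {float(south)} AND {float(north)} "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str) -> list:
    """Execute the query with DuckDB, reading the Parquet file over the network."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL httpfs")  # extension for s3:// and https:// paths
    con.execute("LOAD httpfs")
    return con.execute(sql).fetchall()


sql = build_bbox_query(
    "s3://example-bucket/places.parquet",  # placeholder URL
    west=-74.05, south=40.68, east=-73.90, north=40.82,
)
print(sql)
# rows = run_query(sql)  # uncomment once the URL points at a real file
```

Because the filter is expressed in SQL, DuckDB can use the Parquet file's internal structure to avoid reading data it does not need, rather than downloading the whole file first.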

Standards and Open Source

Standardization: The new cloud-native geospatial formats are typically developed and standardized through cross-industry collaborations. Foursquare is a technical member of the Open Geospatial Consortium, and most recently we have contributed to the specification of the GeoParquet format to add geospatial extensions to Parquet.

Open Source Formats: Foursquare develops and maintains open source implementations of over 30 geospatial formats in the loaders.gl GitHub repository. We recently contributed implementations for multiple cloud-native geospatial formats including GeoArrow, GeoParquet, PMTiles and FlatGeobuf.

Naturally, Foursquare uses this open source code in our own tools; we launched support for the FlatGeobuf and PMTiles formats, as well as Delta Lake integration, in our Foursquare Studio product.

And on the pure open source applications side, Foursquare led the work on the recent kepler.gl 3.0 release, which includes support for GeoArrow.

