Big Data
Studio can load fairly large datasets directly into the browser, but there are limits.
When a dataset exceeds those limits, you need to choose an alternative approach.
No single solution for handling big geospatial data fits every use case, but a range of standard techniques covers most cases:
Technique | Comments |
---|---|
Load data into browser | Dataset size is limited, see below for details. |
Preprocess data | Reduce the dataset size before loading by filtering out rows, removing columns, etc.; often done in Python notebooks. |
Database queries | Store large datasets in databases, and then load partial data via queries. |
Tiled data | Convert data into tiles, such as Hex Tiles. |
Data Size Limits
While we suggest 250 MB as rough guidance for the size of a dataset that can be loaded directly into the browser, the exact limit depends on:
- the capabilities of the user's computer. For instance, a mobile device will typically be able to load significantly less data than a laptop.
- the structure of the data: e.g. a GeoJSON file can have just 1000 rows, but each row can be a polygon with 1000 points and this representation generates some overhead.
- how many big datasets are already in the map.
Note: over time, we expect this limit to increase significantly, both as a result of computers becoming more powerful and as we continue to optimize the internal processing of data in the Studio platform "to the bone". However, there will always be a limit to the amount of data that can be loaded into the browser, so the alternative techniques below will still apply.
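As a rough illustration of how the size guidance maps onto the techniques in the table above, the sketch below picks a starting technique from a file size. The function name `suggest_technique` and both thresholds are assumptions for illustration: 250 MB is the rough guidance given above, and the one-gigabyte cutoff for preprocessing is only a ballpark. The real limits depend on the device and the structure of the data.

```python
# Illustrative thresholds only; the actual limit varies by device and data shape.
MB = 1024 * 1024
DIRECT_LOAD_LIMIT = 250 * MB    # rough guidance for loading directly into the browser
PREPROCESS_LIMIT = 1024 * MB    # assumed ballpark for "a little bit too big"


def suggest_technique(size_bytes: int) -> str:
    """Return a starting technique for a dataset of the given size (hypothetical helper)."""
    if size_bytes <= DIRECT_LOAD_LIMIT:
        return "load directly"
    if size_bytes <= PREPROCESS_LIMIT:
        return "preprocess"
    return "database query or tiles"


if __name__ == "__main__":
    for size in (100 * MB, 800 * MB, 50 * 1024 * MB):
        print(f"{size / MB:>8.0f} MB -> {suggest_technique(size)}")
```

For a file on disk, `os.path.getsize(path)` gives the byte count to pass in. Keep in mind that the in-browser footprint can be larger than the file size.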
Data Preprocessing
In many cases, datasets are just a little too big for Studio (say, around a gigabyte in size). In such cases, some quick preprocessing can often reduce a dataset to a size that Studio can load directly. Sometimes it is as simple as removing a few unused columns before loading. In other cases, standard techniques such as filtering or grouping operations can get the data into a more efficient form.
If you are a data scientist who is prepared to write some code, Studio offers deep integration with the most common environments for data preparation, in particular Python notebooks, with full support for data processing libraries such as pandas and geopandas.
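The three reduction steps mentioned above (removing columns, filtering rows, grouping) can be sketched in pandas as follows. The column names, the sample data, and the `preprocess` helper are hypothetical; a real dataset would be read from a file rather than built inline.

```python
import pandas as pd


def preprocess(trips: pd.DataFrame, min_fare: float) -> pd.DataFrame:
    """Shrink a trips table: drop unused columns, filter rows, then group by city."""
    # 1. Remove columns the map does not need.
    slim = trips.drop(columns=["driver_notes"])
    # 2. Keep only the rows of interest.
    filtered = slim[slim["fare"] >= min_fare]
    # 3. Aggregate to one row per city.
    return filtered.groupby("city", as_index=False).agg(
        lat=("lat", "mean"),
        lng=("lng", "mean"),
        trips=("fare", "count"),
        total_fare=("fare", "sum"),
    )


# Hypothetical sample standing in for a much larger file.
trips = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
    "lat": [59.91, 59.92, 60.39, 60.40, 60.38],
    "lng": [10.75, 10.76, 5.32, 5.33, 5.31],
    "fare": [12.0, 8.0, 5.0, 7.0, 9.0],
    "driver_notes": ["a", "b", "c", "d", "e"],
})

summary = preprocess(trips, min_fare=7.0)
print(summary)
```

The reduced table can then be written out with `summary.to_csv(...)` and loaded in place of the original.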
Database Queries
One approach to working with large tabular datasets is to upload them into a database. If your database of choice is supported by a Data Connector, partial data can then be loaded into Studio by performing a query.
Tip: The SQL LIMIT clause is your friend. It provides a simple but effective way of capping the amount of data returned from a query, ensuring that the result can be processed by Studio.
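To show the effect of capping a result set, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. The `points` table, its columns, and the `fetch_capped` helper are hypothetical stand-ins; with a Data Connector you would write a similar query against your own database.

```python
import sqlite3


def fetch_capped(conn: sqlite3.Connection, max_rows: int) -> list:
    """Return at most max_rows rows from the (hypothetical) points table."""
    cur = conn.execute(
        "SELECT id, lat, lng FROM points ORDER BY id LIMIT ?",
        (max_rows,),
    )
    return cur.fetchall()


# Build a throwaway in-memory table with 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, lat REAL, lng REAL)")
conn.executemany(
    "INSERT INTO points (lat, lng) VALUES (?, ?)",
    [(i * 0.001, i * 0.002) for i in range(10_000)],
)

rows = fetch_capped(conn, 500)
print(len(rows))  # no more than 500, however large the table is
```

Pairing the cap with an `ORDER BY` makes the result deterministic. Without it, the database is free to return any qualifying rows.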
Tiled Data Formats
A standard approach to dealing with large geospatial datasets is to convert them into a tiled representation. The tiled representation is a large tree of small files, where a small subset of tiles can be loaded to cover the user's current viewpoint in a high level of detail. There are a growing number of tile formats available and the choice can be influenced by the structure of data.
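To make the idea of loading only the tiles that cover the current view concrete, the sketch below lists the tiles that intersect a bounding box. It assumes the widely used XYZ ("slippy map") Web Mercator scheme, which is common for vector and raster tiles. It is not a description of how Studio selects tiles internally, and it assumes the viewport does not cross the antimeridian. Both function names are illustrative.

```python
import math


def lnglat_to_tile(lng: float, lat: float, zoom: int) -> tuple:
    """Convert a longitude/latitude to XYZ tile indices at the given zoom."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = int((lng + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_viewport(west: float, south: float, east: float, north: float, zoom: int) -> list:
    """List (zoom, x, y) for every tile that intersects the bounding box."""
    x_min, y_min = lnglat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lnglat_to_tile(east, south, zoom)  # bottom-right corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]


if __name__ == "__main__":
    # A city-sized viewport needs only a handful of tiles, even at a detailed zoom.
    tiles = tiles_for_viewport(-122.52, 37.70, -122.35, 37.83, zoom=12)
    print(len(tiles), "tiles")
```

Because the tile count depends on the viewport and zoom rather than on the total dataset size, the amount of data loaded at any moment stays bounded.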
Hex Tiles
Hex Tiles are Studio's solution to the problem of working with big geospatial data. They are a great choice when working analytically with massive datasets in the gigabyte or even terabyte range.
The Studio platform comes with integrated tools for converting your datasets to Hex Tiles.
Vector Tiles
For visualizing data in the form of standard geospatial "features" (polygons, lines and points), such as very large Shapefiles or GeoJSON files, converting the data to vector tiles is often a good choice.
The Studio platform has full support for loading vector tile datasets, but currently does not offer services for generating vector tiles from user data.
For those willing to do a little bit of work, there are excellent open source tools such as Mapbox tippecanoe. There are also various commercial map tiling services.
Raster Tiles
Raster tiles typically contain imagery but can also contain arbitrary analytical data.
Studio has full support for visualizing raster tiles and cloud-optimized GeoTIFFs, and the petabyte-sized Sentinel and Landsat archives are available in the Studio Data Catalog.