Cluster and Outlier Analysis
The Cluster and Outlier Analysis module lets you calculate statistically significant hot spots, cold spots, and spatial outliers, then quickly visualize the results.
Using Cluster and Outlier Analysis, answer questions such as:
- Which areas of California are disproportionately affected by pollution?
- Are there any significant spatial patterns in the listing price of Berlin's Airbnbs?
- Which restaurants in New York have significantly more/fewer visits than nearby competitors?
To conduct Cluster and Outlier Analysis, Studio applies Anselin's Local Indicators of Spatial Association (LISA), specifically the local Moran statistic, to identify geographical clusters of similar values and to find geographical outliers.
This method has been widely used in spatial applications including environmental and natural resource analysis, real estate analysis, criminology studies, public health research, political geography and demographics studies, and much more.
Find examples and more information in the Cluster and Outlier Analysis use case article.
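As a rough illustration of the local Moran statistic mentioned above, the sketch below computes a standardized value, a spatial lag, and a local Moran's I for each observation. It assumes row-standardized weights (the lag is the mean standardized value of the neighbors) and a population standard deviation; the function name `local_moran` and the sample data are hypothetical, and this is not Studio's actual implementation.

```python
from statistics import fmean, pstdev


def local_moran(values, neighbors):
    """Return (z, lag, lisa) lists, assuming row-standardized neighbor weights."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    # Standardize each value so that 0 is the overall mean.
    z = [(v - mean) / std for v in values]
    lag = []
    for nbrs in neighbors:
        # Spatial lag: mean standardized value of the neighbors.
        # Observations without neighbors get a lag of 0.0 in this sketch.
        lag.append(fmean([z[j] for j in nbrs]) if nbrs else 0.0)
    # Local Moran's I: own standardized value times the spatial lag.
    lisa = [zi * li for zi, li in zip(z, lag)]
    return z, lag, lisa


# Hypothetical data: three high values next to each other, three low values next to each other.
values = [10.0, 12.0, 11.0, 1.0, 2.0, 3.0]
neighbors = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
z, lag, lisa = local_moran(values, neighbors)
```

A positive local Moran's I indicates that an observation resembles its neighbors (a potential cluster), while a negative value indicates that it differs from them (a potential outlier).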
Perform Cluster and Outlier Analysis
Follow these steps to perform a cluster and outlier analysis in Studio.
Cluster and Outlier Analysis requires your map to contain a
1. Open the Cluster and Outlier Analysis module
Navigate to the Analysis tab in Studio, then click Cluster and Outlier Analysis.
2. Select an Input Layer from your map
The input layer must be a geojson layer with polygon geometries. This is the layer on which Studio will conduct the analysis of local Moran statistics.
3. Select an Attribute Field from the dataset
Select a field to use as values for the analysis of local Moran statistics. The attribute field must be from a dataset associated with your input layer.
For local Moran statistics, we recommend you select an attribute field containing quantitative variables.
4. Configure the Spatial Weights Creation
Studio provides two types of spatial weights:
Use # of Nearest Neighbors Weighting
Input the number of nearest neighbors to ensure all spatial objects have the same number of neighbors. Defaults to
Use Distance Threshold Weighting
Input a distance and select its unit (KM or Miles) to create a distance threshold that determines neighbors.
By default, Studio will suggest a distance that ensures each spatial object has at least one neighbor.
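The two weighting schemes can be sketched as follows. This is a minimal illustration that assumes planar coordinates and straight-line (Euclidean) distance; the helper names `knn_neighbors`, `distance_neighbors`, and `min_threshold` are hypothetical, and Studio's own distance calculation may differ. The last helper shows one way to arrive at a distance that leaves every object with at least one neighbor: take the largest nearest-neighbor distance.

```python
from math import dist


def knn_neighbors(points, k):
    """For each point, return the indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        result.append(others[:k])
    return result


def distance_neighbors(points, threshold):
    """For each point, return the indices of other points within the threshold."""
    return [
        [j for j, q in enumerate(points) if j != i and dist(p, q) <= threshold]
        for i, p in enumerate(points)
    ]


def min_threshold(points):
    """Smallest threshold that gives every point at least one neighbor."""
    return max(
        min(dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    )


# Hypothetical coordinates: three points close together and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
by_count = knn_neighbors(points, k=2)
by_distance = distance_neighbors(points, threshold=1.0)
suggested = min_threshold(points)
```

With nearest-neighbor weighting every object ends up with the same number of neighbors, whereas a distance threshold can leave remote objects isolated unless the threshold is at least as large as `suggested`.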
5. Configure the Local Moran Parameters
In local Moran statistics, permutation-based inference generates a pseudo p-value used to evaluate the significance of each cluster.
Studio allows you to modify the following local Moran parameters:
Control Distribution With Permutations
Permutations are used to estimate how likely the observed spatial pattern would be if the values were arranged at random. This is accomplished by comparing the local Moran's I of your original data to the values computed from many randomly permuted datasets.
Input the number of permutations to compute the pseudo p-value. Defaults to
Hide Less Significant Clusters With Thresholds
Input a number to serve as a p-value threshold, so that only significant clusters are displayed on the map. Defaults to
Note: In permutation-based inference, the smallest possible pseudo p-value is 1/(permutations + 1). For example, given 999 permutations, the smallest pseudo p-value is 1/1000 = 0.001.
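To make the permutation idea concrete, here is a minimal sketch of a conditional permutation test for a single observation. It assumes standardized input values, row-standardized weights, at least one neighbor for the observation, and a two-sided folding of the count; the function name `pseudo_p_value`, the seed, and the sample data are hypothetical, and Studio's exact procedure may differ.

```python
import random


def pseudo_p_value(i, z, neighbors, permutations=999, seed=0):
    """Pseudo p-value for observation i (which must have at least one neighbor)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    k = len(neighbors[i])
    # Observed local Moran's I: own value times the mean value of the neighbors.
    observed = z[i] * sum(z[j] for j in neighbors[i]) / k
    # Values that can be shuffled into the neighbor slots (everything except z[i]).
    others = [z[j] for j in range(len(z)) if j != i]
    extreme = 0
    for _ in range(permutations):
        sample = rng.sample(others, k)  # random stand-ins for the neighbors
        simulated = z[i] * sum(sample) / k
        if simulated >= observed:
            extreme += 1
    # Fold the count so that both tails are treated as extreme.
    extreme = min(extreme, permutations - extreme)
    return (extreme + 1) / (permutations + 1)


# Hypothetical standardized values; observation 0 sits among other high values.
z = [1.2, 0.9, 1.1, -1.0, -1.3, -0.9, 0.1, -0.1]
neighbors = [[1, 2]]
p = pseudo_p_value(0, z, neighbors)
smallest = 1 / (999 + 1)
```

Because one is added to both the numerator and the denominator, the result can never fall below `smallest`, which is why more permutations are needed to resolve smaller p-value thresholds.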
6. Generate the results of the analysis
Click Run to generate the results of your cluster and outlier analysis.
The results of the analysis are shown in a preview table. If you are not satisfied with the results, tweak the parameters and click Rerun.
The results are stored in a data table containing the following columns:
| Column | Description |
|---|---|
| Attribute Field | The value of the selected |
| latitude (optional) | The latitude value, only when |
| longitude (optional) | The longitude value, only when |
| lisa | The local Moran's I value |
| spatial_lag | The average (standardized) value of the neighbors |
| cluster | The type of spatial association: 0 for not significant, 1 for High-High, 2 for Low-Low, 3 for High-Low, 4 for Low-High, 5 for isolated (no neighbors) |
| pvalue | The pseudo p-value, the significance value computed from the random permutations |
| neighbors | The array of row indices of the neighbors |
When you are satisfied with the results of your analysis, click Confirm to proceed to the visualization.
Upon completing the cluster and outlier analysis, a new layer and dataset are generated.
Point Layer Results
If the input layer was a point layer, a connectivity graph will appear to visualize the neighboring/connectivity relationship among spatial objects. Mouse over a point to highlight neighboring points (defined by the spatial weights configuration).
Cluster types are visualized by color-coding the geometries. A chart is also generated, serving as a legend for the cluster types.
The local Moran statistic takes the data values and the associated geographical locations as input, then returns statistically significant clusters in four types:
| Cluster type | Description |
|---|---|
| High-High | Hot spot clusters with high values surrounded by other high values. |
| Low-Low | Cold spot clusters with low values surrounded by other low values. |
| High-Low | Spatial outliers with high values surrounded by low values. |
| Low-High | Spatial outliers with low values surrounded by high values. |
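The four cluster types, together with the not-significant and isolated cases, can be derived from three quantities: the standardized value, the spatial lag, and the pseudo p-value. The sketch below uses the numeric codes from the `cluster` column of the results table; the function name `classify` and the 0.05 threshold are assumptions for illustration, not Studio's documented defaults.

```python
def classify(z, lag, p_value, n_neighbors, threshold=0.05):
    """Return the cluster code for one observation.

    0 = not significant, 1 = High-High, 2 = Low-Low,
    3 = High-Low, 4 = Low-High, 5 = isolated (no neighbors).
    """
    if n_neighbors == 0:
        return 5  # no neighbors, so no spatial association can be measured
    if p_value > threshold:
        return 0  # pattern could plausibly arise at random
    high_value = z > 0  # above the overall mean
    high_lag = lag > 0  # neighbors are above the overall mean on average
    if high_value and high_lag:
        return 1  # hot spot
    if not high_value and not high_lag:
        return 2  # cold spot
    if high_value:
        return 3  # high value among low neighbors
    return 4  # low value among high neighbors


# Hypothetical observations: (z, lag, p_value, n_neighbors)
codes = [
    classify(1.4, 1.1, 0.01, 4),
    classify(-1.2, -0.8, 0.02, 4),
    classify(1.5, -0.9, 0.03, 4),
    classify(-1.0, 1.2, 0.04, 4),
    classify(0.3, 0.2, 0.40, 4),
    classify(2.0, 0.0, 0.01, 0),
]
```

Lowering the threshold moves more observations into code 0, which is the effect of hiding less significant clusters on the map.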
This visualization can be customized via the Layer configuration.
CalEnviroScreen is a screening methodology that can be used to help identify California communities that are disproportionately burdened by multiple sources of pollution. Use the slider to view statewide data on the left, and a cluster/outlier analysis on the right.
Data source: https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-40
Find examples and other information in the Cluster and Outlier Analysis use case article.
The Cluster and Outlier Analysis module is under continued development. Visit our community Slack channel, or contact us directly via email with any inquiries regarding this module.