Emerging transformer-based vision models for geospatial data, also called geospatial foundation models (GeoFMs), offer a new and powerful technology for mapping the earth’s surface at a continental scale, providing stakeholders with the tooling to detect and monitor surface-level ecosystem conditions such as forest degradation, natural disaster impact, crop yield, and many others.
GeoFMs represent an emerging research field and are a type of pre-trained vision transformer (ViT) specifically adapted to geospatial data sources. GeoFMs offer immediate value without training. The models excel as embedding models for geospatial similarity search and ecosystem change detection. With minimal labeled data, GeoFMs can be fine-tuned for custom tasks such as land surface classification, semantic segmentation, or pixel-level regression. Many leading models are available under very permissive licenses, making them accessible to a wide audience. Examples include SatVision-Base, Prithvi-100M, SatMAE, and Clay (used in this solution).
In this post, we explore how Clay Foundation’s Clay foundation model, available on Hugging Face, can be deployed for large-scale inference and fine-tuning on Amazon SageMaker. For illustrative purposes, we focus on a deforestation use case from the Amazon rainforest, one of the most biodiverse ecosystems in the world. Given the strong evidence that the Amazon forest system could soon be reaching a tipping point, it presents an important domain of study and a high-impact application area for GeoFMs, for example, through early detection of forest degradation. However, the solution presented here generalizes to a wide range of geospatial use cases. It also comes with ready-to-deploy code samples to help you get started quickly with deploying GeoFMs in your own applications on AWS.
Let’s dive in!
Solution overview
At the core of our solution is a GeoFM. Architecturally, GeoFMs build on the ViT architecture first introduced in the seminal 2020 research paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. To account for the specific properties of geospatial data (multiple channels ranging from ultraviolet to infrared, varying electromagnetic spectrum coverage, and the spatio-temporal nature of the data), GeoFMs incorporate several architectural innovations such as variable input size (to capture multiple channels) or the addition of positional embeddings that capture spatio-temporal aspects such as seasonality and location on earth. The pre-training of these models is conducted on unlabeled geospatial data sampled from across the globe using masked autoencoders (MAE) as self-supervised learners. Sampling from global-scale data helps ensure that diverse ecosystems and surface types are represented appropriately in the training set. The result is general-purpose models that can be used for three core use cases:
- Geospatial similarity search: Quickly map diverse surface types with semantic geospatial search using the embeddings to find similar items (such as deforested areas).
- Embedding-based change detection: Analyze a time series of geospatial embeddings to identify surface disruptions over time for a specific region.
- Custom geospatial machine learning: Fine-tune a specialized regression, classification, or segmentation model for geospatial machine learning (ML) tasks. While this requires a certain amount of labeled data, overall data requirements are typically much lower compared to training a dedicated model from the ground up.
The general solution flow is shown in the following diagram. Note that this flow diagram is highly abstracted and omits certain architectural details for reasons of clarity. For a full architecture diagram demonstrating how the flow can be implemented on AWS, see the accompanying GitHub repository. This repository also contains detailed deployment instructions to get you started quickly with applying GeoFMs to your own use cases.
- Retrieve and process satellite imagery for GeoFM inference or training: The first step is to get the raw geospatial data into a format that’s consumable by the GeoFM. This involves breaking down the large raw satellite imagery into equally sized 256×256 pixel chips (the size that the model expects) and normalizing pixel values, among other data preparation steps required by the GeoFM that you choose. This routine can be conducted at scale using an Amazon SageMaker AI processing job.
- Retrieve model weights and deploy the GeoFM: Next, retrieve the open weights of the GeoFM from a model registry of your choice (Hugging Face in this example) and deploy the model for inference. The best deployment option ultimately depends on how the model is consumed. If you need to generate embeddings asynchronously, use a SageMaker AI processing or transform step. For real-time inference, consider deploying to a SageMaker AI real-time endpoint, which can be configured to auto-scale with demand, allowing for large-scale inference. In this example, we use a SageMaker AI processing job with a custom Docker image for generating embeddings in batch.
- Generate geospatial embeddings: The GeoFM is an encoder-only model, meaning that it outputs an embedding vector. During inference, you perform a forward pass of the pre-processed satellite image chip through the GeoFM. This produces the corresponding embedding vector, which can be thought of as a compressed representation of the information contained in the image. This process is analogous to using text embedding models for retrieval-augmented generation (RAG) use cases.
The generated geospatial embeddings can be used largely as-is for two key use cases: geospatial similarity search and ecosystem change detection.
- Run similarity search on the embeddings to identify semantically similar images: The GeoFM embeddings reside in the same vector space. This allows us to identify similar items by finding vectors that are very close to a given query point. A common high-performance approach for this is approximate nearest neighbor (ANN) search. For scalability and search performance, we index the embedding vectors in a vector database.
- Analyze time series of embeddings for break points that indicate change: Instead of looking for similarity between embedding vectors, you can also look for distance. Doing this for a specific region and across time lets you pinpoint specific times where change occurs. This allows you to use embeddings for surface change detection over time, a very common use case in geospatial analytics.
Optionally, you can also fine-tune a model on top of the GeoFM.
- Train a custom head and run inference: To fine-tune a model, you add a custom (and typically lightweight) head on top of the GeoFM and fine-tune it on a (often small) labeled dataset. The GeoFM weights remain frozen and are not retrained. The custom head takes the GeoFM-generated embedding vectors as input and produces classification masks, pixel-level regression results, or simply a class per image, depending on the use case.
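To make the idea of a lightweight head on frozen embeddings more concrete, here is a minimal sketch, not the solution's actual training code. It trains a single softmax layer with plain gradient descent on synthetic stand-in embeddings; the function names `train_linear_head` and `predict`, the 16-dimensional toy vectors, and the hyperparameters are illustrative assumptions. The embeddings are treated as fixed inputs, which mirrors keeping the GeoFM weights frozen.

```python
import numpy as np

def train_linear_head(embeddings, labels, n_classes, lr=0.5, epochs=200):
    """Train a softmax classification head on fixed (frozen) embeddings."""
    n, dim = embeddings.shape
    weights = np.zeros((dim, n_classes))
    bias = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = embeddings @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * embeddings.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(embeddings, weights, bias):
    """Return the most likely class index for each embedding."""
    return np.argmax(embeddings @ weights + bias, axis=1)

# Synthetic stand-in for frozen GeoFM embeddings: two classes that differ
# along the first dimension (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
labels = np.arange(200) % 2
centers = np.zeros((2, 16))
centers[0, 0], centers[1, 0] = -1.0, 1.0
embeddings = centers[labels] + 0.3 * rng.standard_normal((200, 16))

weights, bias = train_linear_head(embeddings, labels, n_classes=2)
accuracy = float((predict(embeddings, weights, bias) == labels).mean())
print(f"training accuracy: {accuracy:.2f}")
```

Only the head's parameters (`weights`, `bias`) are updated here. A real setup would likely use a deep learning framework and a held-out validation set.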
We explore the key steps of this workflow in the next sections. For additional details on the implementation, including how to build a high-quality user interface with Solara, see the accompanying GitHub repository.
Geospatial data processing and embedding generation
Our comprehensive, four-stage data processing pipeline transforms raw satellite imagery into analysis-ready vector embeddings that power advanced geospatial analytics. This orchestrated workflow uses Amazon SageMaker AI Pipelines to create a robust, reproducible, and scalable processing architecture. The end-to-end solution can process Earth observation data for a selected region of interest, with built-in flexibility to adapt to different use cases. In this example, we use Sentinel-2 imagery from the Amazon Registry of Open Data for monitoring deforestation in the Brazilian rainforest. However, our pipeline architecture is designed to work seamlessly with other satellite image providers and resolutions (such as NAIP with 1 m/pixel resolution, or Maxar and Planet Labs with resolutions finer than 1 m/pixel).
Pipeline architecture overview
The SageMaker pipeline consists of four processing steps, shown in the preceding figure. Each step builds on the outputs of the previous steps, with intermediate results stored in Amazon Simple Storage Service (Amazon S3).
- Pre-process satellite tiles: Divides the satellite imagery into chips. We chose a chip size of 256×256 pixels as expected by Clay v1. For Sentinel-2 images at 10-meter resolution, this corresponds to an area of 2.56 × 2.56 km (about 6.55 km²).
- Generate embeddings: Creates 768-dimensional vector representations for the chips using the Clay v1 model.
- Process embeddings: Performs dimensionality reduction and computes similarity metrics (for downstream analyses).
- Consolidate and index: Consolidates outputs and loads embedding vectors into a vector store.
Step 1: Satellite data acquisition and chipping
The pipeline begins by accessing Sentinel-2 multispectral satellite imagery through the AWS Open Data program from S3 buckets. This imagery provides 10-meter resolution across multiple spectral bands including RGB (visible light) and NIR (near-infrared), which are essential for environmental monitoring.
This step filters out chips that have excessive cloud cover and divides large satellite scenes into manageable 256×256 pixel chips, which enables efficient parallel processing and creates uniform inputs for the foundation model. This step also runs on a SageMaker AI Processing job with a custom Docker image optimized for geospatial operations.
For each chip, this step generates:
- NetCDF datacubes (.netcdf) containing the full multispectral information
- RGB thumbnails (.png) for visualization
- Rich metadata (.parquet) with geolocation, timestamps, and other metadata
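The chipping and cloud filtering described in this step can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual code: the function name `chip_scene`, the boolean cloud mask input, and the 20% cloud threshold are hypothetical, and the sketch works on an in-memory array rather than on files in Amazon S3.

```python
import numpy as np

CHIP_SIZE = 256  # pixels per chip edge, as stated in the text

def chip_scene(scene, cloud_mask, max_cloud_fraction=0.2):
    """Split a (bands, H, W) scene into CHIP_SIZE x CHIP_SIZE chips.

    cloud_mask is a boolean (H, W) array, True where a pixel is cloudy.
    Chips whose cloudy share exceeds max_cloud_fraction are dropped.
    Returns a list of (row_offset, col_offset, chip) tuples.
    """
    _, height, width = scene.shape
    chips = []
    # Step over the scene in full chips; partial chips at the edges are skipped.
    for row in range(0, height - CHIP_SIZE + 1, CHIP_SIZE):
        for col in range(0, width - CHIP_SIZE + 1, CHIP_SIZE):
            window = (slice(row, row + CHIP_SIZE), slice(col, col + CHIP_SIZE))
            if cloud_mask[window].mean() > max_cloud_fraction:
                continue
            chips.append((row, col, scene[(slice(None),) + window]))
    return chips

# Synthetic 4-band scene of 512 x 768 pixels, giving a 2 x 3 grid of candidate chips.
rng = np.random.default_rng(0)
scene = rng.random((4, 512, 768), dtype=np.float32)
cloud_mask = np.zeros((512, 768), dtype=bool)
cloud_mask[:256, :256] = True  # the top-left chip is fully cloudy

chips = chip_scene(scene, cloud_mask)
print(len(chips))  # 5 chips remain after the cloudy one is dropped
```

Keeping the pixel offsets alongside each chip makes it possible to derive the chip's geolocation later from the scene's georeferencing.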
Step 2: Embedding generation using a Clay foundation model
The second step transforms the preprocessed image chips into vector embeddings using the Clay v1 foundation model. This is the most computationally intensive part of the pipeline, using multiple GPU instances (ml.g5.xlarge) to efficiently process the satellite imagery.
For every chip, this step:
- Accesses the NetCDF datacube from Amazon S3
- Normalizes the spectral bands according to the Clay v1 model’s input requirements
- Generates both patch-level and class token (CLS) embeddings
- Stores the embeddings as NumPy arrays (.npy) alongside the original data on S3 as an intermediate store
While Clay can use all Sentinel-2 spectral bands, our implementation uses RGB and NIR as input bands to generate a 768-dimensional embedding, which gives excellent results in our examples. Customers can easily adapt the input bands based on their specific use cases. These embeddings encapsulate high-level features such as vegetation patterns, urban structures, water bodies, and land use characteristics, without requiring explicit feature engineering.
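The normalization and the split into class token and patch-level embeddings can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the per-band statistics are placeholder values rather than Clay's published ones, the patch count of 1024 is arbitrary, and the encoder output is replaced by a zero array. The sketch also assumes the class token comes first in the token sequence, as in the standard ViT layout.

```python
import numpy as np

# Placeholder per-band statistics for (red, green, blue, NIR); hypothetical values.
BAND_MEANS = np.array([1000.0, 1100.0, 1200.0, 2500.0])
BAND_STDS = np.array([600.0, 600.0, 650.0, 900.0])

def normalize_chip(chip):
    """Standardize a (bands, H, W) chip band by band."""
    return (chip - BAND_MEANS[:, None, None]) / BAND_STDS[:, None, None]

def split_tokens(tokens):
    """Split a (1 + num_patches, dim) token array into CLS and patch embeddings.

    Assumes the CLS token comes first, as in the standard ViT layout.
    """
    return tokens[0], tokens[1:]

# A chip whose bands sit exactly at the band means, except NIR, which is
# one standard deviation above its mean.
chip = np.broadcast_to(BAND_MEANS[:, None, None], (4, 256, 256)).copy()
chip[3] += BAND_STDS[3]
normalized = normalize_chip(chip)

tokens = np.zeros((1 + 1024, 768))  # stand-in for an encoder output
cls_embedding, patch_embeddings = split_tokens(tokens)
print(normalized[:, 0, 0], cls_embedding.shape, patch_embeddings.shape)
```

The class token gives one 768-dimensional vector per chip, which is the form used for chip-level search, while the patch embeddings keep spatial detail within the chip.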
Step 3: Embedding processing and analysis
The third step analyzes the embeddings to extract meaningful insights, particularly for time-series analysis. Running on high-memory instances, this step:
- Performs dimensionality reduction on the embeddings using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) (to be used later for change detection)
- Computes cosine similarity between embeddings over time (an alternative for change detection)
- Identifies significant changes in the embeddings that might indicate surface changes
- Saves processed embeddings in Parquet format for efficient querying
The output consists of processed embedding files that contain both the original high-dimensional vectors and their reduced representations, along with computed similarity metrics.
For change detection applications, this step establishes a baseline for each geographic location and calculates deviations from this baseline over time. These deviations, captured as vector distances, provide a powerful indicator of surface changes like deforestation, urban development, or natural disasters.
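As a rough illustration of the processing in this step, the sketch below projects a chip's embedding time series onto its first principal component and measures each observation's cosine distance to a baseline. It is a minimal sketch under assumptions: the function names are hypothetical, the baseline is taken as the mean of the first observations, PCA is computed directly with an SVD, and the 8-dimensional synthetic vectors stand in for real embeddings.

```python
import numpy as np

def first_principal_component(embeddings):
    """Project (n_obs, dim) embeddings onto their first principal component."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def cosine_distance_to_baseline(embeddings, n_baseline):
    """Cosine distance of each observation to the mean of the first n_baseline ones."""
    baseline = embeddings[:n_baseline].mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(baseline)
    return 1.0 - (embeddings @ baseline) / norms

# Synthetic series for one location: 12 observations near one state,
# followed by 8 observations near a very different state.
rng = np.random.default_rng(0)
before = np.eye(8)[0] + 0.01 * rng.standard_normal((12, 8))
after = np.eye(8)[1] + 0.01 * rng.standard_normal((8, 8))
series = np.vstack([before, after])

pc1 = first_principal_component(series)
distances = cosine_distance_to_baseline(series, n_baseline=12)
print(distances.round(2))
```

In this toy series the distances stay near zero during the baseline period and jump once the surface state changes. Note that the sign of a principal component is arbitrary, so only differences along it are meaningful.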
Step 4: Consolidation and vector database integration
The final pipeline step consolidates the processed embeddings into a unified dataset and loads them into vector databases optimized for similarity search. The outputs include consolidated embedding files, GeoJSON grid files for visualization, and configuration files for frontend applications.
The solution supports two vector database options. Both options provide efficient ANN search capabilities, enabling sub-second query performance. The choice between them depends on the scale of deployment, integration requirements, and operational preferences.
With this robust data processing and embedding generation foundation in place, let’s explore the real-world applications enabled by the pipeline, beginning with geospatial similarity search.
Geospatial similarity search
Organizations working with Earth observation data have traditionally struggled with efficiently identifying specific landscape patterns across large geographic areas. Traditional Earth observation analysis requires specialized models trained on labeled datasets for each target feature. This approach forces organizations into a lengthy process of data collection, annotation, and model training before obtaining results.
In contrast, the GeoFM-powered similarity search converts satellite imagery into 768-dimensional vector embeddings that capture the semantic essence of landscape features, eliminating the need for manual feature engineering and computation of specialized indices like NDVI or NDWI.
This capability uses the Clay foundation model’s pre-training on diverse global landscapes to understand complex relationships between features without explicit programming. The result is an intuitive image-to-image search capability where users can select a reference area, such as early-stage deforestation or wildfire damage, and find similar patterns across vast territories in seconds rather than weeks.
Similarity search implementation
Our implementation provides a streamlined workflow for finding similar geographic areas using the embeddings generated by the data processing pipeline. The search process involves:
- Reference area selection: Users select a reference chip representing a search term (for example, a deforested patch, urban development, or agricultural field)
- Search parameters: Users specify the number of results and a similarity threshold
- Vector search execution: The system retrieves similar chips using cosine similarity between embeddings
- Result visualization: Matching chips are highlighted on the map
Let’s dive deeper into a real-world application, taking our running example of detecting deforestation in the Mato Grosso region of the Brazilian Amazon. Traditional monitoring approaches often detect forest loss too late, after significant damage has already occurred. The Clay-powered similarity search capability offers a new approach by enabling early detection of emerging deforestation patterns before they expand into large-scale clearing operations.
Using a single reference chip showing the initial signs of forest degradation (such as selective logging, small clearings, or new access roads), analysts can instantly identify similar patterns across vast areas of the Amazon rainforest. As demonstrated in the following example images, the system effectively recognizes the subtle signatures of early-stage deforestation based on a single reference image. This capability enables environmental protection agencies and conservation organizations to deploy resources precisely, improving anti-deforestation efforts by addressing threats early to prevent major forest loss. While a single reference chip image led to good results in our examples, alternative approaches exist, such as an average vector method, which uses embeddings from multiple reference images to improve the similarity search results.
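The search steps described in this section, including the average vector variant, can be sketched with exact cosine similarity. This is a minimal sketch under assumptions rather than the solution's implementation: it scans all embeddings in memory instead of querying a vector database with ANN search, the function names are hypothetical, and random vectors stand in for real chip embeddings.

```python
import numpy as np

def top_k_similar(query, embeddings, k=5, min_similarity=0.0):
    """Return (index, cosine similarity) pairs for the k most similar rows."""
    unit_rows = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarities = unit_rows @ unit_query
    order = np.argsort(-similarities)[:k]
    # Keep only results that also meet the similarity threshold.
    return [(int(i), float(similarities[i])) for i in order
            if similarities[i] >= min_similarity]

def mean_reference(embeddings, reference_indices):
    """Average several reference embeddings into one query vector."""
    return embeddings[reference_indices].mean(axis=0)

# Random stand-ins for 100 chip embeddings of dimension 768.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100, 768))

# Single reference chip (index 7) as the query.
results = top_k_similar(embeddings[7], embeddings, k=5)

# Average vector variant: combine two reference chips into one query.
multi_query = mean_reference(embeddings, [7, 8])
multi_results = top_k_similar(multi_query, embeddings, k=2)
print(results[0][0], sorted(i for i, _ in multi_results))
```

The reference chip itself always comes back as the top hit with similarity 1, so an application would typically filter it out before displaying results.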
Ecosystem change detection
Unlike vector-based similarity search, change detection focuses on measuring the distance between embedding vectors over time, the core assumption being that the more distant embedding vectors are from each other, the more dissimilar the underlying satellite imagery is. If applied to a single region over time, this lets you pinpoint so-called change points: periods where significant and long-lasting change in surface conditions occurred.
Our solution implements a timeline view of Sentinel-2 satellite observations from 2018 to present. Each observation point corresponds to a unique satellite image, allowing for detailed temporal analysis. While embedding vectors are high-dimensional, we use the previously computed PCA (and optionally t-SNE) to reduce dimensionality to a single dimension for visualization purposes.
Let’s review a compelling example from our analysis of deforestation in the Amazon. The following image is a time-series plot of geospatial embeddings (first principal component) for a single 256×256 pixel chip. Cloudy images and major outliers have been removed.
Points clustered closely on the y-axis indicate similar surface conditions; sudden and persistent discontinuities in the embedding values signal significant change. Here’s what the analysis shows:
- Stable forest conditions from 2018 through 2020
- A significant discontinuity in embedding values during 2021. Closer review of the underlying satellite imagery shows clear evidence of forest clearing and conversion to agricultural fields
- Further transformation visible in 2024 imagery
Naturally, we need a way to automate the process of change detection so that it can be applied at scale. Given that we don’t typically have extensive change point training datasets, we need an unsupervised approach that works without labeled data. The intuition behind unsupervised change detection is the following: identify what normal looks like, then highlight large enough deviations from normal and flag them as change points; after a change point has occurred, characterize the new normal and repeat the process.
Our solution uses a function that performs harmonic regression analysis on the embedding time-series data, specifically designed to model yearly seasonality patterns. The function fits a harmonic regression with a specified seasonal period (default 12 months, for annual patterns) to the embedding data of a baseline period (the year 2018 in this example). It then generates predictions and calculates error metrics (absolute and percentage deviations). Large deviations from the normal seasonal pattern indicate change and can be automatically flagged using thresholding.
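A harmonic regression of this kind can be sketched in a few lines. The code below is an illustrative reconstruction under assumptions, not the function from the solution: the names `harmonic_design` and `detect_change`, the single sine/cosine pair, the monthly time axis, and the threshold of 0.5 are all hypothetical, and the input is a synthetic one-dimensional series standing in for, say, the first principal component of a chip's embeddings.

```python
import numpy as np

def harmonic_design(t_months, period=12.0):
    """Design matrix with an intercept and one sine/cosine pair of the given period."""
    angle = 2.0 * np.pi * np.asarray(t_months, dtype=float) / period
    return np.column_stack([np.ones_like(angle), np.sin(angle), np.cos(angle)])

def detect_change(t_months, values, baseline_mask, threshold):
    """Fit on the baseline period, then flag observations that deviate too much."""
    X = harmonic_design(t_months)
    # Least-squares fit using only the baseline observations.
    coef, *_ = np.linalg.lstsq(X[baseline_mask], values[baseline_mask], rcond=None)
    predicted = X @ coef
    abs_dev = np.abs(values - predicted)
    # Guard against division by zero where the prediction is close to zero.
    pct_dev = 100.0 * abs_dev / np.maximum(np.abs(predicted), 1e-12)
    return predicted, abs_dev, pct_dev, abs_dev > threshold

# Synthetic series: four years of monthly values with an annual cycle.
t = np.arange(48)
values = 1.0 + 0.3 * np.sin(2 * np.pi * t / 12)
values[30:] += 2.0  # synthetic surface change from month 30 onward

predicted, abs_dev, pct_dev, flagged = detect_change(
    t, values, baseline_mask=t < 12, threshold=0.5
)
print(int(np.argmax(flagged)))  # index of the first flagged observation
```

Fitting only on the baseline period means the seasonal pattern learned there is extrapolated forward, so a persistent shift shows up as a sustained run of flagged observations rather than a single outlier.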
When we apply this to the chips across an area of observation and define a threshold on the maximum deviation from the fitted harmonic regression, we can automatically map change intensity, allowing analysts to quickly zoom in on problematic areas.
While this method performs well in our analyses, it is also quite rigid in that it requires careful tuning of error thresholds and the definition of a baseline period. There are more sophisticated approaches available, ranging from general-purpose time-series analyses that automate the baseline definition and change point detection using recursive methods (for example, Gaussian processes) to specialized algorithms for geospatial change detection (for example, LandTrendr and Continuous Change Detection and Classification (CCDC)).
In sum, our approach to change detection demonstrates the power of geospatial embedding vectors in tracking environmental changes over time, providing valuable insights for land use monitoring, environmental protection, and urban planning applications.
GeoFM fine-tuning for your custom use case
Fine-tuning is a specific implementation of transfer learning, in which a pre-trained foundation model is adapted to specific tasks through targeted additional training on specialized labeled datasets. For GeoFMs, these specific tasks can target agriculture, disaster monitoring, or urban analysis. The model retains its broad spatial understanding while developing expertise for particular regions, ecosystems, or analytical tasks. This approach significantly reduces computational and data requirements compared to building specialized models from scratch, without sacrificing accuracy. Fine-tuning typically involves preserving the pre-trained Clay encoder, which has already learned rich representations of spectral patterns, spatial relationships, and temporal dynamics from vast satellite imagery, while attaching and training a specialized task-specific head.
For pixel-wise prediction tasks, such as land use segmentation, the specialized head is typically a decoder architecture, whereas for class-level outputs (classification tasks) the head can be as basic as a multilayer perceptron network. For segmentation, training focuses exclusively on the new decoder, which takes the feature representations from the model’s frozen encoder and progressively transforms them back to full-resolution images where each pixel is classified according to its land use type.
The segmentation framework combines the powerful pre-trained Clay encoder with an efficient convolutional decoder, taking Clay’s rich understanding of satellite imagery and converting it into detailed land use maps. The lightweight decoder uses convolutional layers and pixel shuffle upsampling to turn the feature representations from Clay’s frozen encoder into a full-resolution map in which each pixel is classified according to its land use type. By freezing the encoder (which contains 24 transformer layers and 16 attention heads) and only training the compact decoder, the model achieves an excellent balance between computational efficiency and segmentation accuracy.
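Pixel shuffle upsampling trades channels for spatial resolution: a feature map with C·r² channels of size H×W is rearranged into C channels of size (H·r)×(W·r). The NumPy sketch below shows only this rearrangement, following the channel ordering used by PyTorch's `PixelShuffle`; it is not the decoder used in the solution, and the function name `pixel_shuffle` and the toy input are illustrative.

```python
import numpy as np

def pixel_shuffle(features, upscale):
    """Rearrange (C * r^2, H, W) features into (C, H * r, W * r)."""
    channels_in, height, width = features.shape
    r = upscale
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by upscale squared")
    channels_out = channels_in // (r * r)
    # Split the channel axis into (C, r, r), then interleave the two r axes
    # with the spatial axes.
    x = features.reshape(channels_out, r, r, height, width)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (C, H, r, W, r)
    return x.reshape(channels_out, height * r, width * r)

# Four 2 x 2 feature channels become one 4 x 4 output channel.
features = np.arange(16).reshape(4, 2, 2)
upscaled = pixel_shuffle(features, 2)
print(upscaled.shape)
print(upscaled[0])
```

Each 2×2 block of the output gathers the values that the four input channels hold at the same spatial position, which is why a preceding convolution can learn what to place at each sub-pixel location.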
We applied this segmentation architecture to a labeled land use land cover (LULC) dataset from Impact Observatory, hosted on the Amazon Registry of Open Data. For illustrative purposes, we again focused on our running example from Brazil’s Mato Grosso region. We trained the decoder head for 10 epochs, which took 17 minutes in total, and tracked intersection over union (IOU) and F1 score as segmentation accuracy metrics. After just one training epoch, the model already achieved 85.7% validation IOU. With the full 10 epochs completed, performance increased to 92.4% IOU and a 95.6% F1 score. In the following image, we show ground truth satellite imagery (upper) and the model’s predictions (lower). The visual comparison highlights how accurately this approach can classify different land use categories.
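For readers less familiar with these metrics, the sketch below computes IOU and F1 per class from integer label masks and averages them across classes. The averaging scheme is an assumption (the post does not state how the reported scores were aggregated), and the function name `segmentation_scores` and the tiny 2×2 masks are purely illustrative.

```python
import numpy as np

def segmentation_scores(y_true, y_pred, n_classes):
    """Mean per-class IOU and F1 for integer label masks of equal shape."""
    ious, f1s = [], []
    for cls in range(n_classes):
        true_c, pred_c = (y_true == cls), (y_pred == cls)
        tp = np.sum(true_c & pred_c)    # pixels correctly assigned to cls
        fp = np.sum(~true_c & pred_c)   # pixels wrongly assigned to cls
        fn = np.sum(true_c & ~pred_c)   # pixels of cls that were missed
        if tp + fp + fn == 0:
            continue  # class absent from both masks
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(ious)), float(np.mean(f1s))

# Toy masks with two classes; one pixel of class 0 is predicted as class 1.
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
mean_iou, mean_f1 = segmentation_scores(y_true, y_pred, n_classes=2)
print(round(mean_iou, 3), round(mean_f1, 3))
```

For a single class, F1 is always at least as high as IOU because it counts true positives twice, which is consistent with the F1 score in the results above exceeding the IOU.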
Conclusion
Novel GeoFMs provide an encouraging new approach to geospatial analytics. Through their extensive pre-training, these models have incorporated a deep implicit understanding of geospatial data and can be used out of the box for high-impact use cases such as similarity search or change detection. They can also serve as the basis for specialized models using a fine-tuning process that is significantly less data-hungry (fewer labeled data needed) and has lower compute requirements.
In this post, we have shown how you can deploy a state-of-the-art GeoFM (Clay) on AWS and have explored one specific use case, monitoring deforestation in the Amazon rainforest, in greater detail. The same approach is applicable to a large variety of industry use cases. For example, insurance companies can use a similar approach to ours to assess damage after natural disasters including hurricanes, floods, or fires and keep track of their insured assets. Agricultural organizations can use GeoFMs for crop type identification, crop yield predictions, or other use cases. We also envision high-impact use cases in industries like urban planning, emergency and disaster response, supply chain and global trade, sustainability and environmental modeling, and many others. To get started applying GeoFMs to your own earth observation use case, check out the accompanying GitHub repository, which has the prerequisites and a step-by-step walkthrough to run it on your own area of interest.
About the Authors
Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
Dr. Iza Moise is a Senior Machine Learning (ML) Prototyping Architect at AWS, with expertise in both traditional ML and advanced techniques like foundation models and vision transformers. She focuses on applied ML across diverse scientific fields, publishing and reviewing at Amazon’s internal ML conferences. Her strength lies in translating theoretical advances into practical solutions that deliver measurable impact through thoughtful implementation.