    Machine Learning & Research

    Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows

    By Oliver Chambers | April 4, 2026 | 13 Mins Read


    This post is co-written with Altay Sansal and Alejandro Valenciano from TGS.

    TGS, a geoscience data provider for the energy sector, supports companies' exploration and production workflows with advanced seismic foundation models (SFMs). These models analyze complex 3D seismic data to identify geological structures essential for energy exploration. To help enhance their next-generation models as part of their AWS infrastructure modernization, TGS partnered with the AWS Generative AI Innovation Center (GenAIIC) to optimize their SFM training infrastructure.

    This post describes how TGS achieved near-linear scaling for distributed training and expanded context windows for their Vision Transformer-based SFM using Amazon SageMaker HyperPod. This joint solution cut training time from 6 months to just 5 days while enabling analysis of seismic volumes larger than previously possible.

    Addressing seismic foundation model training challenges

    TGS's SFM uses a Vision Transformer (ViT) architecture with Masked Autoencoder (MAE) training, designed by the TGS team to analyze 3D seismic data. Scaling such models presents several challenges:

    • Data scale and complexity – TGS works with large volumes of proprietary 3D seismic data stored in domain-specific formats. The sheer volume and structure of this data required efficient streaming strategies to maintain high throughput and help prevent GPU idle time during training.
    • Training efficiency – Training large FMs on 3D volumetric data is computationally intensive. Accelerating training cycles would allow TGS to incorporate new data more frequently and iterate on model improvements faster, delivering more value to their clients.
    • Expanded analytical capabilities – The geological context a model can analyze depends on how much 3D volume it can process at once. Expanding this capability would allow the models to capture both local details and broader geological patterns simultaneously.

    Understanding these challenges highlights the need for a comprehensive approach to distributed training and infrastructure optimization. The AWS GenAIIC partnered with TGS to develop a solution addressing these challenges.

    Solution overview

    The collaboration between TGS and the AWS GenAIIC focused on three key areas: establishing an efficient data pipeline, optimizing distributed training across multiple nodes, and expanding the model's context window to analyze larger geological volumes. The following diagram illustrates the solution architecture.

    The solution uses SageMaker HyperPod to provide a resilient, scalable training infrastructure with automatic health monitoring and checkpoint management. The SageMaker HyperPod cluster is configured with AWS Identity and Access Management (IAM) execution roles scoped to the minimum permissions required for training operations, deployed within a virtual private cloud (VPC) with network isolation and security groups restricting communication to authorized training nodes. Terabytes of training data stream directly from Amazon Simple Storage Service (Amazon S3), removing the need for intermediate storage layers while sustaining high throughput. AWS CloudTrail logs API calls to Amazon S3 and SageMaker services, and Amazon S3 access logging is enabled on training data buckets to provide a detailed audit trail of data access requests. The distributed training framework uses advanced parallelization strategies to scale efficiently across multiple nodes, and context parallelism techniques enable the model to process significantly larger 3D volumes than previously possible.

    The final cluster configuration consisted of 16 Amazon Elastic Compute Cloud (Amazon EC2) P5 instances for the worker nodes, provisioned through SageMaker AI flexible training plans, each containing:

    • 8 NVIDIA H200 GPUs with 141 GB of HBM3e memory per GPU
    • 192 vCPUs
    • 2,048 GB of system RAM
    • 3,200 Gbps EFAv3 networking for ultra-low-latency communication

    Optimizing the training data pipeline

    TGS's training dataset consists of 3D seismic volumes stored in the TGS-developed MDIO format, an open source format built on Zarr arrays and designed for large-scale scientific data in the cloud. Such volumes can contain billions of data points representing underground geological structures.

    Choosing the right storage approach

    The team evaluated two approaches for delivering data to training GPUs:

    • Amazon FSx for Lustre – Copy data from Amazon S3 to a high-speed distributed file system that the nodes read from. This approach provides sub-millisecond latency but requires pre-loading and provisioned storage capacity.
    • Streaming directly from Amazon S3 – Stream data directly from Amazon S3 using MDIO's native capabilities with multi-threaded libraries, opening multiple concurrent connections per node.

    Selecting streaming directly from Amazon S3

    The key architectural difference lies in how throughput scales with the cluster. With direct streaming from Amazon S3, each training node creates independent Amazon S3 connections, so aggregate throughput can scale linearly. With Amazon FSx for Lustre, the nodes share a single file system whose throughput is tied to provisioned storage capacity. Pairing Amazon FSx with Amazon S3 using only a small Amazon FSx storage volume limits the entire cluster to that volume's throughput, creating a bottleneck as the cluster grows.

    Comprehensive testing and cost analysis revealed direct streaming from Amazon S3 as the optimal choice for this configuration:

    • Performance – Achieved 4–5 GBps sustained throughput per node using multiple data loader processes with prefetching over HTTPS endpoints (TLS 1.2), sufficient to fully utilize the GPUs.
    • Cost efficiency – Streaming from Amazon S3 removed the need for Amazon FSx provisioning, reducing storage infrastructure costs by over 90% while helping deliver 64–80 GBps cluster-wide throughput. The Amazon S3 pay-per-use model was more economical than provisioning high-throughput Amazon FSx capacity.
    • Better scaling – Streaming directly from Amazon S3 scales naturally; each node brings its own connection bandwidth, avoiding the need for complex capacity planning.
    • Operational simplicity – No intermediate storage to provision, manage, or synchronize.

    The team optimized Amazon S3 connection pooling and implemented parallel data loading to sustain high throughput across the 16 nodes.
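    The pattern behind this pipeline is a bounded prefetch queue: several worker threads keep a fixed number of chunk reads in flight while the training loop consumes results in order. The following is a minimal sketch of that pattern, not TGS's actual code; `fetch_chunk` is a hypothetical stand-in for a ranged S3 read of one MDIO/Zarr chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import deque

def fetch_chunk(chunk_id: int) -> bytes:
    # Placeholder for a ranged S3 GET of one seismic chunk.
    return bytes([chunk_id % 256]) * 1024

def stream_chunks(chunk_ids, workers=8, prefetch_depth=16):
    """Yield chunks in order while worker threads fetch ahead."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        it = iter(chunk_ids)
        # Prime the pipeline with up to prefetch_depth in-flight requests.
        for cid in it:
            pending.append(pool.submit(fetch_chunk, cid))
            if len(pending) >= prefetch_depth:
                break
        # Steady state: consume one completed read, launch one new read.
        for cid in it:
            yield pending.popleft().result()
            pending.append(pool.submit(fetch_chunk, cid))
        # Drain the remaining in-flight requests.
        while pending:
            yield pending.popleft().result()

chunks = list(stream_chunks(range(100)))
```

    Keeping `prefetch_depth` ahead of the consumer is what hides S3 request latency; in practice the depth and worker count are tuned until GPU utilization stops improving.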

    Selecting the distributed training framework

    When training large models across multiple GPUs, the model's parameters, gradients, and optimizer states must be distributed across devices. The team evaluated different distributed training approaches to find the optimal balance between memory efficiency and training throughput:

    • ZeRO-2 (Zero Redundancy Optimizer, stage 2) – This approach partitions gradients and optimizer states across GPUs while keeping a full copy of model parameters on each GPU. This reduces memory usage while maintaining fast communication, because each GPU can directly access the parameters during the forward pass without waiting for data from other GPUs.
    • ZeRO-3 – This approach goes further by also partitioning model parameters across GPUs. Although this maximizes memory efficiency (enabling larger models), it requires more frequent communication between GPUs to gather parameters during computation, which can reduce throughput.
    • FSDP2 (Fully Sharded Data Parallel v2) – PyTorch's native approach similarly shards parameters, gradients, and optimizer states. It offers tight integration with PyTorch but involves communication trade-offs similar to ZeRO-3.

    Comprehensive testing revealed DeepSpeed ZeRO-2 as the optimal framework for this configuration, delivering strong performance while efficiently managing memory:

    • ZeRO-2 – 1,974 samples per second (implemented)
    • FSDP2 – 1,833 samples per second
    • ZeRO-3 – 869 samples per second
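    Stage-2 sharding is selected through DeepSpeed's configuration dictionary. The following is a minimal illustrative sketch; the keys are standard DeepSpeed options, but the values are generic defaults, not TGS's tuned settings.

```python
# Minimal DeepSpeed config sketch for ZeRO stage 2: gradients and optimizer
# states are sharded across GPUs, while parameters stay replicated.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # shard gradients + optimizer states
        "overlap_comm": True,          # overlap reduction with the backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}
# In a training script, this dict is passed to deepspeed.initialize(...)
# along with the model and optimizer.
```

    Changing `"stage"` to 3 would additionally shard the parameters, trading the throughput advantage measured above for lower per-GPU memory.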

    This framework choice provided the foundation for achieving near-linear scaling across multiple nodes. The combination of these optimizations helped deliver the dramatic training acceleration:

    • Efficient distributed training – DeepSpeed ZeRO-2 enabled near-linear scaling across 128 GPUs (16 nodes × 8 GPUs)
    • High-throughput data pipeline – Direct streaming from Amazon S3 sustained 64–80 GBps aggregate throughput across the cluster

    Together, these improvements helped reduce training time from 6 months to 5 days, enabling TGS to iterate on model improvements weekly rather than semi-annually.

    Expanding analytical capabilities

    One of the most significant achievements was expanding the model's field of view: how much 3D geological volume it can analyze simultaneously. A larger context window allows the model to capture both fine details (small fractures) and broad patterns (basin-wide fault systems) in a single pass, helping surface insights for TGS's clients that were previously undetectable within the constraints of smaller analysis windows. The implementation by the TGS and AWS teams involved adapting the following advanced techniques to enable ViTs to process significantly larger 3D seismic volumes:

    • Ring attention implementation – Each GPU processes a portion of the input sequence while circulating key-value pairs to neighboring GPUs, gradually accumulating attention results across the distributed system. PyTorch provides an API that makes this straightforward:
    import torch
    from torch.distributed.tensor.parallel import context_parallel
    
    # Wrap the attention computation with context parallelism
    with context_parallel(
        buffers=[query, key, value],  # tensors to shard
        buffer_seq_dims=[1, 1, 1]     # dimension to shard along (sequence dimension)
    ):
        # Standard scaled dot-product attention - automatically becomes ring attention
        attention_output = torch.nn.functional.scaled_dot_product_attention(
            query, key, value, attn_mask=None
        )

    • Dynamic mask ratio adjustment – The MAE training approach required ensuring that unmasked patches plus classification tokens are evenly divisible across devices, necessitating adaptive masking strategies.
    • Decoder sequence management – The decoder reconstructs the full image by processing both the unmasked patches from the encoder and the masked patches. This creates a different sequence length that also needs to be divisible by the number of GPUs.
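    The divisibility constraint above can be satisfied by nudging the number of kept (unmasked) patches so that the encoder sequence length splits evenly across GPUs. The following is an illustrative sketch of that adjustment, not TGS's implementation; the function name and defaults are assumptions.

```python
# Sketch: adjust the MAE keep count so (kept patches + class tokens) is
# divisible by the context-parallel degree.
def adjust_keep_count(num_patches: int, mask_ratio: float,
                      cp_degree: int, num_cls_tokens: int = 1) -> int:
    """Return a kept-patch count whose encoder sequence length
    (kept patches + class tokens) is divisible by cp_degree."""
    keep = int(num_patches * (1.0 - mask_ratio))
    seq_len = keep + num_cls_tokens
    # Round the sequence length down to the nearest multiple of cp_degree.
    seq_len -= seq_len % cp_degree
    return seq_len - num_cls_tokens

# Example: 4,096 patches, 75% masking, 8-way context parallelism.
kept = adjust_keep_count(4096, 0.75, 8)
```

    The effective mask ratio drifts slightly from the nominal value (here 1023 kept patches instead of 1024), which is why the adjustment must be dynamic rather than baked into a fixed ratio.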

    The preceding implementation enabled processing of significantly larger 3D seismic volumes, as illustrated in the following table.

    Metric              Previous (baseline)        With context parallelism
    Maximum input size  640 × 640 × 1,024 voxels   1,536 × 1,536 × 2,048 voxels
    Context length      102,400 tokens             1,170,000 tokens
    Volume increase     1×                         4.5×
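    The token counts in the table follow from dividing the volume into cubic 3D patches. A 16 × 16 × 16-voxel patch reproduces the table's figures exactly; note the patch size is inferred from those figures, not stated in the post.

```python
# Token count for a 3D ViT: one token per non-overlapping cubic patch.
def vit3d_tokens(dims, patch=16):
    tokens = 1
    for d in dims:
        tokens *= d // patch
    return tokens

baseline = vit3d_tokens((640, 640, 1024))    # 40 * 40 * 64  = 102,400
expanded = vit3d_tokens((1536, 1536, 2048))  # 96 * 96 * 128 = 1,179,648
```

    The expanded configuration therefore yields roughly 1.17 million tokens, an 11-fold increase in attention sequence length, which is what makes ring attention necessary.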

    The following figure provides an example of 2D model context size.

    Seismic cross-section diagram titled "2D Model Context Size Example" showing three color-coded context window sizes — 256×256 (cyan), 512×512 (magenta), and 640×1024 (yellow) — overlaid at three locations across a grayscale subsurface geological profile, with crossline traces on the x-axis and depth samples on the y-axis.

    This expansion allows TGS's models to capture geological features across broader spatial contexts, helping enhance the analytical capabilities they can offer clients.

    Results and impact

    The collaboration between TGS and the AWS GenAIIC delivered substantial improvements across multiple dimensions:

    • Significant training acceleration – The optimized distributed training architecture reduced training time from 6 months to 5 days, an approximate 36-fold speedup, enabling TGS to iterate faster and incorporate new geological data into their models more frequently.
    • Near-linear scaling – The solution demonstrated strong scaling efficiency from single-node to 16-node configurations, achieving roughly 90–95% parallel efficiency with minimal performance degradation as cluster size increased.
    • Expanded analytical capabilities – The context parallelism implementation enables training on larger 3D volumes, allowing models to capture geological features across broader spatial contexts.
    • Production-ready, cost-efficient infrastructure – The SageMaker HyperPod-based solution with streaming from Amazon S3 helps provide a cost-effective foundation that scales efficiently as training requirements grow, while delivering the resilience, flexibility, and operational efficiency needed for production AI workflows.
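    Parallel efficiency compares measured cluster throughput against the ideal of perfectly linear scaling from a single node. A quick sketch of the calculation, using the post's 16-node ZeRO-2 throughput of 1,974 samples per second and a hypothetical single-node baseline of 130 samples per second (the single-node figure is an assumed value for illustration, not a measurement from the post):

```python
# Parallel efficiency: measured throughput divided by ideal linear scaling.
def parallel_efficiency(throughput_n, throughput_1, n_nodes):
    return throughput_n / (n_nodes * throughput_1)

# 16 nodes at 1,974 samples/s vs. an assumed 130 samples/s on one node.
eff = parallel_efficiency(1974, 130, 16)  # about 0.95
```

    An efficiency in the 0.90–0.95 range means 16 nodes deliver the work of roughly 14.4–15.2 ideal nodes, consistent with the near-linear scaling reported above.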

    These improvements establish a strong foundation for TGS's AI-powered analytics system, delivering faster model iteration cycles and broader geological context per analysis to clients while helping protect TGS's valuable data assets.

    Lessons learned and best practices

    Several key lessons emerged from this collaboration that can benefit other organizations working with large-scale 3D data and distributed training:

    • Systematic scaling approach – Establishing a single-node baseline before progressively expanding to larger clusters enabled systematic optimization at each stage while managing costs effectively.
    • Data pipeline optimization is critical – For data-intensive workloads, thoughtful data pipeline design can deliver strong performance. Direct streaming from object storage with appropriate parallelization and prefetching provided the needed throughput without complex intermediate storage layers.
    • Batch size tuning is nuanced – Increasing batch size doesn't always improve throughput. The team found that excessively large batch sizes can create bottlenecks in preparing and transferring data to GPUs. Through systematic testing at different scales, the team identified the point where throughput plateaued, indicating that the data loading pipeline, rather than GPU computation, had become the limiting factor. This balance maximized training efficiency without over-provisioning resources.
    • Framework selection depends on your specific requirements – Different distributed training frameworks involve trade-offs between memory efficiency and communication overhead. The optimal choice depends on model size, hardware characteristics, and scaling requirements.
    • Incremental validation – Testing configurations at smaller scales before expanding to full production clusters helped identify optimal settings while controlling costs during the development phase.

    Conclusion

    By partnering with the AWS GenAIIC, TGS has established an optimized, scalable infrastructure for training SFMs on AWS. The solution helps accelerate training cycles while expanding the models' analytical capabilities, helping TGS deliver enhanced subsurface analytics to clients in the energy sector. The technical innovations developed during this collaboration, particularly the adaptation of context parallelism to ViT architectures for 3D volumetric data, demonstrate the potential of applying advanced AI techniques to specialized scientific domains. As TGS continues to expand its subsurface AI system and broader AI capabilities, this foundation can support future enhancements such as multi-modal integration and temporal analysis.

    To learn more about scaling your own FM training workloads, explore SageMaker HyperPod for resilient distributed training infrastructure, or review the distributed training best practices in the SageMaker documentation. For organizations interested in similar collaborations, the AWS Generative AI Innovation Center partners with customers to help accelerate their AI initiatives.

    Acknowledgments

    Special thanks to Andy Lapastora, Bingchen Liu, Prashanth Ramaswamy, Rohit Thekkanal, Jared Kramer, Arun Ramanathan, and Roy Allela for their contributions.


    About the authors

    Haotian An

    Haotian An is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he focuses on customizing foundation models and distributed training at scale. He works closely with customers to adapt generative AI to their specific use cases, helping them unlock new capabilities and drive measurable business outcomes.

    Manoj Alwani

    Manoj Alwani is a Senior Applied Scientist at the Generative AI Innovation Center at AWS, where he helps organizations unlock the potential of cutting-edge AI technology. With deep expertise across the entire generative AI research stack, Manoj works closely with customers from diverse industries to accelerate their GenAI adoption and drive meaningful business outcomes. He brings over 13 years of hands-on experience in developing and deploying machine learning solutions at scale.

    Debby Wehner

    Debby Wehner is a Machine Learning Engineer at the AWS Generative AI Innovation Center, specializing in large language model customization and optimization. Previously, as a full-stack software engineer at Amazon, she built AI-powered shopping applications reaching over 100 million monthly users. She holds a PhD in Computational Geophysics from the University of Cambridge, as well as a BSc and MSc from Freie Universität Berlin.

    Altay Sansal

    Altay Sansal is a Senior Data Science Lead at TGS in Houston, Texas, specializing in AI/ML applications for geophysics and seismic data, including foundation models, large-scale training, and open-source tools like the MDIO format. He holds an M.S. in Geophysics from the University of Houston and has authored key publications such as "Scaling Seismic Foundation Models" and "MDIO: Open-source format for multidimensional energy data," while actively contributing to geoscience ML through GitHub and industry events.

    Alejandro Valenciano

    Alejandro Valenciano is the Director of Data Science at TGS, where he leads advanced analytics and data science initiatives that unlock insights from subsurface and energy-related data, driving innovation across seismic, well, and machine learning workflows. He has developed and applied machine learning models for tasks such as basin-scale log prediction, advanced seismic processing, and foundation models. He frequently contributes to industry conferences and technical publications. His work spans data management, ML/AI applications in geoscience, and the integration of scalable data platforms to support exploration and energy solutions.
