
Optimizing Salesforce's model endpoints with Amazon SageMaker AI inference components

By Oliver Chambers | August 18, 2025


This post is a joint collaboration between Salesforce and AWS and is being cross-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.

The Salesforce AI Platform Model Serving team is dedicated to developing and managing services that power large language models (LLMs) and other AI workloads within Salesforce. Their main focus is model onboarding, providing customers with a robust infrastructure to host a variety of ML models. Their mission is to streamline model deployment, improve inference performance, and optimize cost efficiency, ensuring seamless integration into Agentforce and other applications requiring inference. They are committed to improving model inference performance and overall efficiency by integrating state-of-the-art solutions, collaborating with leading technology providers, including open source communities and cloud services such as Amazon Web Services (AWS), and building these into a unified AI platform. This helps ensure Salesforce customers receive the most advanced AI technology available while optimizing the cost-performance of the serving infrastructure.

In this post, we share how the Salesforce AI Platform team optimized GPU utilization, improved resource efficiency, and achieved cost savings using Amazon SageMaker AI, specifically inference components.

The challenge with hosting models for inference: Optimizing compute and cost-to-serve while maintaining performance

Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. The Salesforce AI Platform team is responsible for deploying their proprietary LLMs such as CodeGen and XGen on SageMaker AI and optimizing them for inference. Salesforce has multiple models distributed across single model endpoints (SMEs), supporting a diverse range of model sizes from a few gigabytes (GB) to 30 GB, each with unique performance requirements and infrastructure demands.

The team faced two distinct optimization challenges. Their larger models (20–30 GB) with lower traffic patterns were running on high-performance GPUs, resulting in underutilized multi-GPU instances and inefficient resource allocation. Meanwhile, their medium-sized models (roughly 15 GB) handling high-traffic workloads demanded low-latency, high-throughput processing capabilities. These models often incurred higher costs due to over-provisioning on similar multi-GPU setups. Here's a sample illustration of Salesforce's large and medium SageMaker endpoints and where resources are under-utilized:

Running on Amazon EC2 P4d instances today, with plans to use the latest generation P5en instances equipped with NVIDIA H200 Tensor Core GPUs, the team sought an efficient resource optimization strategy that would maximize GPU utilization across their SageMaker AI endpoints while enabling scalable AI operations and extracting maximum value from their high-performance instances, all without compromising performance or over-provisioning hardware.

This challenge reflects a critical balance that enterprises must strike when scaling their AI operations: maximizing the performance of sophisticated AI workloads while optimizing infrastructure costs and resource efficiency. Salesforce needed a solution that would not only resolve their immediate deployment challenges but also create a flexible foundation capable of supporting their evolving AI initiatives.

To address these challenges, the Salesforce AI Platform team used SageMaker AI inference components, which enable deployment of multiple foundation models (FMs) on a single SageMaker AI endpoint with granular control over the number of accelerators and the amount of memory allocated per model. This helps improve resource utilization, reduces model deployment costs, and lets you scale endpoints together with your use cases.

Solution: Optimizing model deployment with Amazon SageMaker AI inference components

With Amazon SageMaker AI inference components, you can deploy multiple FMs on the same SageMaker AI endpoint and control how many accelerators and how much memory is reserved for each FM. This helps improve resource utilization, reduces model deployment costs, and lets you scale endpoints together with your use cases. For each FM, you can define separate scaling policies to adapt to model usage patterns while further optimizing infrastructure costs. Here's an illustration of Salesforce's large and medium SageMaker endpoints after utilization has been improved with inference components:

[Figure: Salesforce's large and medium SageMaker endpoints after optimization with inference components]

An inference component abstracts ML models and enables assigning CPUs, GPUs, and scaling policies per model. Inference components offer the following benefits:

• SageMaker AI optimally places and packs models onto ML instances to maximize utilization, leading to cost savings.
• Each model scales independently based on custom configurations, providing optimal resource allocation to meet specific application requirements.
• SageMaker AI scales to add and remove instances dynamically to maintain availability while keeping idle compute to a minimum.
• Organizations can scale down to zero copies of a model to free up resources for other models, or specify that important models stay loaded and ready to serve traffic for critical workloads (a minimal scaling sketch follows this list).
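As a minimal sketch of that last point, assuming a hypothetical inference component named demo-model-ic, the Application Auto Scaling API can register the component's copy count with a minimum of zero; scaling back out from zero is then driven by a step scaling policy, triggered in practice by a CloudWatch alarm on the endpoint's NoCapacityInvocationFailures metric (not shown here):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical inference component name; substitute your own.
resource_id = "inference-component/demo-model-ic"

# Register the copy count as a scalable target with MinCapacity=0 so
# SageMaker AI can release all GPU resources when the model is idle.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,  # allow scaling down to zero copies
    MaxCapacity=4,  # illustrative upper bound
)

# Scaling out from zero uses a step scaling policy; a CloudWatch alarm on
# NoCapacityInvocationFailures would invoke it when traffic arrives for a
# model with no loaded copies.
autoscaling.put_scaling_policy(
    PolicyName="demo-scale-out-from-zero",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
        ],
    },
)
```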

Configuring and managing inference component endpoints

You create the SageMaker AI endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. The model is configured in a new construct, an inference component. Here, you specify the number of accelerators and amount of memory you want to allocate to each copy of a model, together with the model artifacts, container image, and number of model copies to deploy.
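As a concrete illustration, the following minimal boto3 sketch walks through that flow. The endpoint, role, and component names are hypothetical, the resource sizes are illustrative, and it assumes a SageMaker Model named my-model has already been created:

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint configuration: instance type and initial instance count.
sm.create_endpoint_config(
    EndpointConfigName="demo-ic-endpoint-config",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="demo-ic-endpoint",
    EndpointConfigName="demo-ic-endpoint-config",
)

# Inference component: per-copy accelerator and memory allocation for an
# existing SageMaker Model, plus the number of copies to deploy.
sm.create_inference_component(
    InferenceComponentName="demo-model-ic",
    EndpointName="demo-ic-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-model",  # pre-created SageMaker Model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 8,
            "MinMemoryRequiredInMb": 30720,  # ~30 GB per copy
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
```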

As inference requests increase or decrease, the number of copies of your inference components can also scale up or down based on your auto scaling policies. SageMaker AI handles the placement to optimize the packing of your models for availability and cost.

In addition, if you enable managed instance auto scaling, SageMaker AI scales compute instances according to the number of inference components that need to be loaded at a given time to serve traffic. SageMaker AI scales up the instances and packs your instances and inference components to optimize for cost while preserving model performance.
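Reusing the scalable target registered in the earlier scale-to-zero sketch, a target-tracking policy keeps the copy count proportional to traffic, while managed instance auto scaling is a property of the endpoint configuration itself; names and target values below are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Target tracking on the copy count: SageMaker AI adds copies when average
# invocations per copy exceed the target and removes them as traffic falls.
autoscaling.put_scaling_policy(
    PolicyName="demo-invocations-per-copy",
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/demo-model-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
        },
        "TargetValue": 10.0,  # illustrative invocations-per-copy target
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)

# Managed instance auto scaling is enabled per variant in the endpoint
# configuration, so the instance fleet itself grows and shrinks with the
# number of inference component copies that must be loaded, for example:
#
#   "ManagedInstanceScaling": {
#       "Status": "ENABLED",
#       "MinInstanceCount": 1,
#       "MaxInstanceCount": 4,
#   }
```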

Refer to Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker for more details on how to use inference components.

How Salesforce used Amazon SageMaker AI inference components

Salesforce has several different proprietary models, such as CodeGen, originally spread across multiple SMEs. CodeGen is Salesforce's in-house open source LLM for code understanding and code generation. Developers can use the CodeGen model to translate natural language, such as English, into programming languages, such as Python. Salesforce developed an ensemble of CodeGen models (Inline for automatic code completion, BlockGen for code block generation, and FlowGPT for process flow generation) specifically tuned for the Apex programming language. The models are used in ApexGuru, a solution within the Salesforce platform that helps Salesforce developers tackle critical anti-patterns and hotspots in their Apex code.

Inference components enable multiple models to share GPU resources efficiently on the same endpoint. This consolidation not only delivers a reduction in infrastructure costs through intelligent resource sharing and dynamic scaling, it also reduces operational overhead with fewer endpoints to manage. For their CodeGen ensemble models, the solution enabled model-specific resource allocation and independent scaling based on traffic patterns, providing optimal performance while maximizing infrastructure utilization.

To broaden hosting options on SageMaker AI without affecting stability, performance, or usability, Salesforce introduced inference component endpoints alongside their existing SMEs.

This hybrid approach uses the strengths of each: SMEs provide dedicated hosting for each model and predictable performance for critical workloads with consistent traffic patterns, while inference components optimize resource utilization for variable workloads through dynamic scaling and efficient GPU sharing.

The Salesforce AI Platform team created a SageMaker AI endpoint with the desired instance type and initial instance count to handle their baseline inference requirements. Model packages are then attached dynamically, spinning up individual containers as needed. They configured each model, for example the BlockGen and TextEval models, as individual inference components, specifying precise resource allocations: accelerator count, memory requirements, model artifacts, container image, and number of model copies to deploy. With this approach, Salesforce could efficiently host multiple model variants on the same endpoint while maintaining granular control over resource allocation and scaling behaviors.
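At invocation time, a request is routed to one specific model on the shared endpoint by naming its inference component. A minimal sketch, with hypothetical endpoint and component names and an illustrative payload:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Route the request to one specific model on the shared endpoint by
# naming its inference component.
response = runtime.invoke_endpoint(
    EndpointName="demo-ic-endpoint",       # hypothetical endpoint
    InferenceComponentName="blockgen-ic",  # hypothetical component name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Generate an Apex code block that ..."}),
)
print(response["Body"].read().decode("utf-8"))
```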

By using these auto scaling capabilities, endpoints can host multiple copies of models and automatically adjust GPU resources as traffic fluctuates. This allows each model to dynamically scale up or down within an endpoint based on configured GPU limits. By hosting multiple models on the same endpoint and automatically adjusting capacity in response to traffic fluctuations, Salesforce was able to significantly reduce the costs associated with traffic spikes. This means Salesforce AI models can handle varying workloads efficiently without compromising performance. The graphic below shows Salesforce's endpoints before and after the models were deployed with inference components:

[Figure: Salesforce endpoints before and after inference component implementation]

This solution has delivered several key benefits:

• Optimized resource allocation – Multiple models now efficiently share GPU resources, eliminating unnecessary provisioning while maintaining optimal performance.
• Cost savings – Through intelligent GPU resource management and dynamic scaling, Salesforce achieved a significant reduction in infrastructure costs while eliminating idle compute resources.
• Enhanced performance for smaller models – Smaller models now use high-performance GPUs to meet their latency and throughput needs without incurring excessive costs.

By refining GPU allocation at the model level through inference components, Salesforce improved resource efficiency and achieved a substantial reduction in operational cost while maintaining the high performance standards their customers expect across a wide range of AI workloads. The cost savings are substantial and open up new opportunities for using high-end, expensive GPUs in a cost-effective manner.

    Conclusion

Through their implementation of Amazon SageMaker AI inference components, Salesforce has transformed their AI infrastructure management, achieving up to an eight-fold reduction in deployment and infrastructure costs while maintaining high performance standards. The team realized that intelligent model packing and dynamic resource allocation were key to solving their GPU utilization challenges across their diverse model portfolio. This implementation has transformed performance economics, allowing smaller models to use high-performance GPUs, delivering high throughput and low latency without the traditional cost overhead.

Today, their AI platform efficiently serves both large proprietary models such as CodeGen and smaller workloads on the same infrastructure, with optimized resource allocation ensuring high-performance delivery. With this approach, Salesforce can maximize the utilization of compute instances, scale to hundreds of models, and optimize costs while providing predictable performance. This solution has not only solved their immediate challenges of optimizing GPU utilization and cost management but has also positioned them for future growth. By establishing a more efficient and scalable infrastructure foundation, Salesforce can now confidently expand their AI offerings and explore more advanced use cases with expensive, high-performance GPUs such as P4d, P5, and P5en, knowing they can maximize the value of every computing resource. This transformation represents a significant step forward in their mission to deliver enterprise-grade AI solutions while maintaining operational efficiency and cost-effectiveness.

Looking ahead, Salesforce is poised to use the new Amazon SageMaker AI rolling updates capability for inference component endpoints, a feature designed to streamline updates for models of varying sizes while minimizing operational overhead. This advancement will enable them to update their models batch by batch, rather than using the traditional blue/green deployment method, providing greater flexibility and control over model updates while using minimal additional instances, rather than requiring double the instances as in the past. By implementing these rolling updates alongside their existing dynamic scaling infrastructure and incorporating real-time safety checks, Salesforce is building a more resilient and adaptable AI platform. This strategic approach not only provides cost-effective and reliable deployments for their GPU-intensive workloads but also sets the stage for seamless integration of future AI innovations and model improvements.

Check out How Salesforce achieves high-performance model deployment with Amazon SageMaker AI to learn more. For more information on how to get started with SageMaker AI, refer to Guide to getting set up with Amazon SageMaker AI. To learn more about inference components, refer to Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency.


About the Authors

Rishu Aggarwal is a Director of Engineering at Salesforce based in Bangalore, India. Rishu leads the Salesforce AI Platform Model Serving engineering team in solving the complex problems of inference optimization and deployment of LLMs at scale within the Salesforce ecosystem. Rishu is a staunch tech evangelist for AI and has deep interests in artificial intelligence, generative AI, neural networks, and big data.

Rielah De Jesus is a Principal Solutions Architect at AWS who has successfully helped numerous enterprise customers in the DC, Maryland, and Virginia area move to the cloud. In her current role, she acts as a customer advocate and technical advisor focused on helping organizations like Salesforce achieve success on the AWS platform. She is also a staunch supporter of women in IT and is very passionate about finding ways to creatively use technology and data to solve everyday challenges.

Pavithra Hariharasudhan is a Senior Technical Account Manager and Enterprise Support Lead at AWS, supporting major AWS strategic customers with their global cloud operations. She assists organizations in resolving operational challenges and maintaining efficient AWS environments, empowering them to achieve operational excellence while accelerating business outcomes.

Ruchita Jadav is a Senior Member of Technical Staff at Salesforce, with over 10 years of experience in software and machine learning engineering. Her expertise lies in building scalable platform solutions across the retail and CRM domains. At Salesforce, she leads initiatives focused on model hosting, inference optimization, and LLMOps, enabling efficient and scalable deployment of AI and large language models. She holds a Bachelor of Technology in Electronics & Communication from Gujarat Technological University (GTU).

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
