Generative AI is rapidly reshaping the music industry, empowering creators of any skill level to produce studio-quality tracks with foundation models (FMs) that personalize compositions in real time. As demand for unique, instantly generated content grows and creators seek smarter, faster tools, Splash Music collaborated with AWS to develop and scale music generation FMs, making professional music creation accessible to millions.
In this post, we show how Splash Music is setting a new standard for AI-powered music creation by using its advanced HummingLM model with AWS Trainium on Amazon SageMaker HyperPod. As a startup in the 2024 AWS Generative AI Accelerator, Splash Music collaborated closely with AWS Startups and the AWS Generative AI Innovation Center (GenAIIC) to fast-track innovation and accelerate its music generation FM development lifecycle.
Challenge: Scaling music generation
Splash Music has empowered a new generation of creators to make music, and has already driven over 600 million streams worldwide. By giving users tools that adapt to their evolving tastes and styles, the service makes music production accessible, fun, and aligned with how fans actually want to create. However, building the technology to unlock this creative freedom, especially the models that power it, meant overcoming several key challenges:
- Model complexity and scale – Splash Music developed HummingLM, a cutting-edge, multi-billion-parameter model tailored for generative music, to deliver on its mission of making music creation truly accessible. HummingLM is engineered to capture the subtlety of human humming, converting creative ideas into music tracks. Meeting these high standards of fidelity meant Splash had to scale up compute and storage significantly so the model could deliver studio-quality music.
- Rapid pace of change – The pace of industry and technological change, driven by rapid AI advancement, means Splash Music must continually adapt, train, fine-tune, and deploy new models to meet user expectations for fresh, relevant features.
- Infrastructure scaling – Managing and scaling large clusters across the generative AI model development lifecycle brought unpredictable costs, frequent interruptions, and time-consuming manual administration. Prior to AWS, Splash Music relied on externally managed GPU clusters, which involved unpredictable latency, extra troubleshooting, and management complexity that hindered their ability to experiment and scale as quickly as needed.
The service needed a scalable, automated, and cost-effective infrastructure.
Overview of HummingLM: Splash Music's foundation model
HummingLM is Splash Music's proprietary, multi-modal generative model, developed in close collaboration with the GenAIIC. It represents an advance in how AI can interpret and generate music. The model's architecture is built around a transformer-based large language model (LLM) coupled with a specialized music encoder upsampler:
- HummingLM uses Descript Audio Codec (DAC) audio encoding to obtain compressed audio representations that capture both frequency and timbre characteristics
- The system transforms hummed melodies into professional instrumental performances without explicit timbre representation learning
The innovation lies in how HummingLM fuses these token streams. Using a transformer-based backbone, the model learns to combine the melodic intent from humming with the stylistic and structural cues from the instrument sound (for example, to make the humming sound like a guitar, piano, flute, or a different synthesized sound). Users can hum a tune, add an instrument control signal, and receive a fully arranged, high-fidelity track in return. HummingLM's architecture is designed for both efficiency and expressiveness. By using discrete token representations, the model achieves faster convergence and reduced computational overhead compared to traditional waveform-based approaches. This makes it possible to train on diverse, large-scale datasets and adapt quickly to new genres or user preferences.
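To make the fusion idea concrete, the following is a purely illustrative sketch of how a hummed melody's DAC tokens and an instrument control signal might be combined into a single input sequence for a decoder-only transformer. The token layout and special tokens are assumptions for illustration, not Splash Music's actual implementation.

```python
import torch

# Hypothetical special tokens marking the different streams in the fused sequence
BOS, SEP = 0, 1

def build_input_sequence(hum_tokens: torch.Tensor, control_tokens: torch.Tensor) -> torch.Tensor:
    """Fuse hummed-melody tokens with instrument control tokens into one sequence.

    hum_tokens:     (hum_len,) discrete DAC codes extracted from the user's humming
    control_tokens: (ctrl_len,) discrete codes describing the target instrument timbre
    """
    bos = torch.tensor([BOS])
    sep = torch.tensor([SEP])
    # The transformer backbone attends over both streams jointly, so the melodic
    # intent (humming) can be rendered with the stylistic cues (instrument)
    return torch.cat([bos, control_tokens, sep, hum_tokens])
```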
The following diagram illustrates how HummingLM is trained and the inference process used to generate high-quality music:
Solution overview: Accelerating model development with AWS Trainium on Amazon SageMaker HyperPod
Splash Music collaborated with the GenAIIC to advance its HummingLM foundation model, using the combined capabilities of Amazon SageMaker HyperPod and AWS Trainium chips for model training.
Splash Music's architecture follows SageMaker HyperPod best practices, using Amazon Elastic Kubernetes Service (Amazon EKS) as the orchestrator, Amazon FSx for Lustre to store over 2 PB of data, and AWS Trainium EC2 instances for acceleration. The following diagram illustrates the solution architecture.
In the following sections, we walk through each step of the model development lifecycle, from dataset preparation to compilation for optimized inference.
Dataset preparation
Efficient preparation and processing of large-scale audio datasets is critical for developing controllable music generation models:
- Feature extraction pipeline – Splash Music built a feature extraction pipeline for efficient, scalable processing of large volumes of audio data, producing high-quality features for model training. It begins by retrieving audio in batches from a centralized database, minimizing I/O overhead and supporting large-scale operations.
- Audio processing – Each audio file is resampled from 44,100 Hz to 22,050 Hz to standardize inputs and reduce computational load (a minimal resampling sketch follows this list). A mono reference signal is also created by averaging the stereo channels of a reference audio file, serving as a consistent benchmark for analysis. In parallel, a Basic Pitch extractor generates a synthetic, MIDI-like version of the audio, providing a symbolic representation of pitch and rhythm that enriches the extracted features.
- Descript Audio Codec (DAC) extractor – The pipeline processes three audio streams: the stereo channels from the original audio, the mono reference, and the synthetic MIDI signal. This multi-stream approach captures diverse aspects of the audio signal, producing a robust set of features. Extracted data is organized into two main sets: audio-feature, which contains features from the original stereo channels, and sine-audio-feature, which contains features from the MIDI and mono reference audio. This structure streamlines downstream model training.
- Parallel processing – To maximize performance, the pipeline uses parallel processing for concurrent feature extraction and data uploading. This significantly boosts efficiency, ensuring the system handles large datasets with speed and consistency.
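The following is a minimal sketch of the resampling and mono-reference step described above, using torchaudio. The file path handling and function names are illustrative rather than Splash Music's actual pipeline code.

```python
import torch
import torchaudio

TARGET_RATE = 22_050  # pipeline standardizes 44,100 Hz inputs down to 22,050 Hz

def preprocess(path: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Resample a stereo file to 22,050 Hz and build the mono reference signal."""
    waveform, sample_rate = torchaudio.load(path)  # waveform: (channels, frames)
    if sample_rate != TARGET_RATE:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_RATE)
    # Mono reference: average of the stereo channels, used as a consistent benchmark
    mono_reference = waveform.mean(dim=0, keepdim=True)
    return waveform, mono_reference
```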
In addition, the solution uses an advanced stem separation system that isolates songs into six distinct audio stems: drums, bass, vocals, lead, chordal, and other instruments:
- Stem preparation – Splash Music creates high-quality training data by preparing separate stems for each musical element. Lead and chordal stems are generated using a synthesizer tool and a diverse dataset of music tracks spanning multiple genres and styles, giving the model a strong foundation for learning precise part separation.
By streamlining data handling from the outset, the subsequent model training stages have access to clean, well-structured features.
Model architecture and optimization
HummingLM employs a dual-component architecture (a rough sketch of the coarse-token LLM follows this list):
- LLM for coarse token generation – A 385M-parameter transformer-based language model (24 layers, 1024 embedding dimension, 16 attention heads) that generates the foundational musical structure
- Upsampling component – A specialized component that expands the coarse representation into full, high-fidelity audio
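As a rough illustration of the scale of the coarse-token LLM, the sketch below builds a decoder-only transformer with the stated 24 layers, 1024 embedding dimension, and 16 attention heads. The vocabulary size and feed-forward width are assumptions, so the printed parameter count only approximates the 385M figure.

```python
import torch.nn as nn

VOCAB_SIZE = 10_240  # assumed: DAC codebook entries plus control/special tokens
D_MODEL, N_LAYERS, N_HEADS, D_FF = 1024, 24, 16, 4096  # D_FF is assumed

class CoarseTokenLM(nn.Module):
    """Decoder-only transformer that predicts coarse DAC tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=D_FF, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens, causal_mask=None):
        hidden = self.backbone(self.embed(tokens), mask=causal_mask)
        return self.lm_head(hidden)

model = CoarseTokenLM()
print(f"parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
```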
This division of labor is key to HummingLM's effectiveness: the LLM captures high-level musical intent, and the upsampling component handles acoustic details. Together with the GenAIIC, Splash collaborated on research to optimize the HummingLM model for best performance:
- Flexible control signal design – The model accepts control signals of varying durations (1-5 seconds), a significant improvement over fixed-window approaches
- Zero-shot capability – Unlike systems that require explicit timbre embedding learning, HummingLM can generalize to unseen instrument presets without additional training
- Non-autoregressive generation – The upsampling component uses parallel token prediction for significantly faster inference compared to traditional autoregressive approaches
Our evaluation demonstrated HummingLM's superior first-codebook prediction, a critical factor in residual quantization systems, where the first codebook contains most of the acoustic information. The model consistently outperformed baseline approaches such as VALL-E across multiple quality metrics. The evaluation revealed several key findings:
- HummingLM demonstrates significant improvements over baseline approaches in signal fidelity (57.93% better SI-SDR)
- The model maintains robust performance across diverse musical scenarios, with particular strength in the Aeolian mode
- Zero-shot performance on unseen instrument presets is comparable to performance on seen presets, confirming strong generalization capabilities
- Data augmentation strategies provide substantial benefits (27.70% improvement in SI-SDR)
Overall, HummingLM achieves state-of-the-art controllable music generation by significantly improving signal fidelity, generalizing well to unseen instruments, and delivering strong performance across diverse musical styles, boosted further by effective data augmentation strategies.
Efficient distributed training through parallelism, memory, and AWS Neuron optimization
Splash Music compiled and optimized its model for AWS Neuron, accelerating its model development lifecycle and deployment on AWS Trainium chips. The team considered scalability, parallelization, and memory efficiency, and designed a system that supports models scaling from 2B to over 10B parameters. This includes the following (see the training sketch after this list):
- Enabling distributed training with sequence parallelism (SP), tensor parallelism (TP), and data parallelism (DP), scaling up to 64 trn1.32xlarge instances
- Implementing ZeRO-1 memory optimization with selective checkpoint recomputation
- Integrating the Neuron Kernel Interface (NKI) to deploy flash attention, accelerating dense attention layers and streamlining causal mask management
- Decomposing the model into core subcomponents (token processors, transformer layers, MLPs) and optimizing each for Neuron execution
- Implementing mixed-precision training (bfloat16 and float32)
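The sketch below shows how these pieces typically fit together with PyTorch/XLA and the NeuronX Distributed library. Exact API names and defaults vary across Neuron SDK versions, and the model, dataloader, and hyperparameters here are placeholders rather than Splash Music's actual training code.

```python
import os
# Mixed precision on Neuron is commonly enabled via an XLA env var set in the launch script
os.environ.setdefault("XLA_DOWNCAST_BF16", "1")

import torch
import torch_xla.core.xla_model as xm
from neuronx_distributed.parallel_layers import parallel_state
from neuronx_distributed.optimizer import NeuronZero1Optimizer

def train(model_fn, dataloader, tensor_parallel_size=8):
    # Partition the world into tensor-parallel groups; data parallelism covers the rest
    parallel_state.initialize_model_parallel(tensor_model_parallel_size=tensor_parallel_size)

    device = xm.xla_device()
    model = model_fn().to(device)

    # ZeRO-1 shards optimizer state across data-parallel workers to reduce per-device memory
    optimizer = NeuronZero1Optimizer(
        model.parameters(),
        torch.optim.AdamW,
        lr=1e-4,
        grad_clipping=True,
        max_norm=1.0,
    )

    for batch in dataloader:
        tokens = batch["tokens"].to(device)
        loss = model(tokens)   # placeholder: model returns its training loss
        loss.backward()
        optimizer.step()       # the ZeRO-1 optimizer handles gradient reduction
        optimizer.zero_grad()
        xm.mark_step()         # trigger XLA graph execution on Trainium
```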
Once optimizations at the Neuron level were complete, optimizing the orchestration layer was just as important. Orchestrated by SageMaker HyperPod, Splash Music developed a robust, Slurm-integrated pipeline that streamlines multi-node training, balances parallelism, and uses activation checkpointing for better memory efficiency. The pipeline processes data through several key stages (a minimal loss sketch follows the list):
- Tokenization – Audio inputs are processed through a Descript Audio Codec (DAC) encoder to generate multiple codebook representations
- Conditional generation – The model learns to predict codebooks given hummed melodies and timbre control signals
- Loss functions – The solution uses a specialized cross-entropy loss function to optimize both token prediction and audio reconstruction quality
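As an illustration of what a codebook-level cross-entropy objective can look like, here is a minimal sketch. The per-codebook weighting and tensor shapes are assumptions, since the post doesn't detail the exact loss formulation.

```python
import torch
import torch.nn.functional as F

def codebook_ce_loss(logits, targets, codebook_weights=None):
    """Cross-entropy over DAC-style codebook tokens.

    logits:  (batch, num_codebooks, seq_len, vocab_size) raw scores per codebook
    targets: (batch, num_codebooks, seq_len) integer token ids
    """
    batch, n_cb, seq_len, vocab = logits.shape
    if codebook_weights is None:
        # Weight the first codebook more heavily: in residual quantization it
        # carries most of the acoustic information
        codebook_weights = torch.tensor([2.0] + [1.0] * (n_cb - 1), device=logits.device)

    losses = []
    for cb in range(n_cb):
        losses.append(F.cross_entropy(
            logits[:, cb].reshape(-1, vocab),  # (batch * seq_len, vocab)
            targets[:, cb].reshape(-1),        # (batch * seq_len,)
        ))
    losses = torch.stack(losses)
    return (losses * codebook_weights).sum() / codebook_weights.sum()
```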
Model inference with AWS Inferentia on Amazon Elastic Container Service (Amazon ECS)
After training, the model is deployed on an Amazon Elastic Container Service (Amazon ECS) cluster backed by AWS Inferentia instances. Audio is uploaded to Amazon Simple Storage Service (Amazon S3), which handles large volumes of user-submitted recordings that often vary in quality. Each upload triggers an AWS Lambda function, which queues the file in Amazon Simple Queue Service (Amazon SQS) for delivery to the ECS cluster where inference runs. On the cluster, HummingLM performs two key steps: stem separation to isolate and clean the vocals, and audio-to-melody conversion to extract musical structure. Finally, a post-processing step recombines the cleaned vocals with backing tracks, producing the fully processed, remixed audio.
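A minimal sketch of the Lambda function in this flow might look like the following: it receives the S3 ObjectCreated event and forwards the object location to SQS for the ECS workers to pick up. The queue URL environment variable and message format are assumptions.

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
# Assumed: the queue URL is provided to the function through an environment variable
QUEUE_URL = os.environ["INFERENCE_QUEUE_URL"]

def handler(event, context):
    """Triggered by S3 ObjectCreated events; enqueues each uploaded audio file for inference."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"queued": len(records)}
```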
Results and impact
Splash Music's research and development teams now rely on a unified infrastructure built on Amazon SageMaker HyperPod and AWS Trainium chips. The solution has yielded the following benefits:
- Automated, resilient, and scalable training – SageMaker HyperPod provisions clusters of AWS Trainium EC2 instances at scale, managing orchestration, resource allocation, and fault recovery automatically. This removes weeks of manual setup and enables reliable, repeatable training runs. SageMaker HyperPod continuously monitors cluster health, automatically rerouting jobs and repairing failed nodes to minimize downtime and maximize resource utilization. With SageMaker HyperPod, Splash Music cut operational downtime to near zero, enabling weekly model refreshes and faster deployment of new features.
- AWS Trainium reduced Splash's training costs by over 54% – Splash Music realized more than twofold gains in training speed and cut training costs by 54% using AWS Trainium-based instances compared with the traditional GPU-based options used with their previous cloud provider. With this leap in efficiency, Splash Music can train larger models, release updates more frequently, and accelerate innovation across their generative music service. The acceleration also delivers faster model iteration, with an 8% improvement in throughput, and increased the maximum batch size from 70 to 512 for more efficient use of compute resources and higher throughput per training run.
Splash achieved significant throughput improvements over conventional architectures, allowing it to process expansive datasets and support the model's complex multimodal nature. The solution provides a robust foundation for future growth as data and models continue to scale.
"AWS Trainium and SageMaker HyperPod took the friction out of our workflow at Splash Music," says Daniel Hatadi, Software Engineer, Splash Music. "We replaced brittle GPU clusters with automated, self-healing distributed training that scales seamlessly. Training times are nearly 50% faster, and training costs have dropped by 54%. By relying on AWS AI chips and SageMaker HyperPod and collaborating with the AWS Generative AI Innovation Center, we were able to focus on model design and music-specific research instead of cluster maintenance. This collaboration has made it easier for us to iterate quickly, run more experiments, train larger models, and keep shipping improvements without needing a bigger team."
Splash Music was also featured in the AWS Summit Sydney 2025 keynote:
Conclusion and next steps
Splash Music is redefining how creators bring their musical ideas to life, making it possible for anyone to generate fresh, personalized tracks that resonate with millions of listeners worldwide. To support this vision at scale, Splash built its HummingLM FM in close collaboration with AWS Startups and the GenAIIC, using services such as SageMaker HyperPod and AWS Trainium. These services provide the infrastructure and performance needed to keep pace, helping Splash create even more intuitive and inspiring experiences for creators.
"With SageMaker HyperPod and Trainium, our researchers experiment as fast as our community creates," says Randeep Bhatia, Chief Technology Officer, Splash Music. "We're not just keeping up with music trends; we're setting them."
Looking ahead, Splash Music plans to expand its training datasets tenfold, explore multimodal audio and video generation, and continue collaborating with the GenAIIC on additional R&D and the next version of its HummingLM FM.
Try creating your own music using Splash Music, and learn more about Amazon SageMaker HyperPod and AWS Trainium.
About the authors
Sheldon Liu is a Senior Applied Scientist and ANZ Tech Lead at the AWS Generative AI Innovation Center. He partners with AWS customers across diverse industries to develop and implement innovative generative AI solutions, accelerating their AI adoption journey while driving significant business outcomes.
Mahsa Paknezhad is a Deep Learning Architect and a key member of the AWS Generative AI Innovation Center. She works closely with enterprise clients to design, implement, and optimize cutting-edge generative AI solutions. With a focus on scalability and production readiness, Mahsa helps organizations across diverse industries harness advanced generative AI models to achieve meaningful business outcomes.
Xiaoning Wang is a Machine Learning Engineer at the AWS Generative AI Innovation Center. He specializes in large language model training and optimization on AWS Trainium and Inferentia, with experience in distributed training, RAG, and low-latency inference. He works with enterprise customers to build scalable generative AI solutions that drive real business impact.
Tianyu Liu is an Applied Scientist at the AWS Generative AI Innovation Center. He partners with enterprise customers to design, implement, and optimize cutting-edge generative AI models, advancing innovation and helping organizations achieve transformative outcomes with scalable, production-ready AI solutions.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers' adoption of generative AI.
Daniel Wirjo is a Solutions Architect at AWS, focused on AI and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.