UK Tech Insider
    Machine Learning & Research

From novice to champion: A student’s journey through the AWS AI League ASEAN finals

By Oliver Chambers, January 20, 2026


The AWS AI League, launched by Amazon Web Services (AWS), expanded its reach to the Association of Southeast Asian Nations (ASEAN) last year, welcoming student participants from Singapore, Indonesia, Malaysia, Thailand, Vietnam, and the Philippines. The goal was to introduce students of all backgrounds and experience levels to the exciting world of generative AI through a gamified, hands-on challenge focused on fine-tuning large language models (LLMs).

In this blog post, you’ll hear directly from the AWS AI League champion, Blix D. Foryasen, as he shares his reflections on the challenges, breakthroughs, and key lessons discovered throughout the competition.

Behind the competition

The AWS AI League competition began with a tutorial session led by the AWS team and the Gen-C Generative AI Learning Community, featuring two powerful, user-friendly services: Amazon SageMaker JumpStart and PartyRock.

• SageMaker JumpStart enabled participants to run the LLM fine-tuning process in a cloud-based environment, offering the flexibility to adjust hyperparameters and optimize performance.
• PartyRock, powered by Amazon Bedrock, provided an intuitive playground and interface for curating the dataset used to fine-tune a Llama 3.2 3B Instruct model. Amazon Bedrock offers a comprehensive selection of high-performing foundation models from leading AI companies, including Anthropic Claude, Meta Llama, Mistral, and more, all accessible through a single API.

With the goal of outperforming a larger reference LLM in a quiz-based evaluation, participants engaged with three core domains of generative AI: foundation models, responsible AI, and prompt engineering. The preliminary round featured an open leaderboard ranking the best-performing fine-tuned models from across the region. Each submitted model was tested against a larger baseline LLM using an automated, quiz-style evaluation of generative AI-related questions. The evaluation, carried out by an undisclosed LLM judge, prioritized both accuracy and comprehensiveness. A model’s win rate improved each time it outperformed the baseline LLM.

The challenge required strategic planning beyond its technical nature. Participants had to maximize their limited training hours on SageMaker JumpStart while carefully managing a restricted number of leaderboard submissions. Initially capped at 5 hours, the training limit was later expanded to 30 hours in response to community feedback. Submission count would also influence tiebreakers for finalist selection.

The top tuner from each country advanced to the Regional Grand Finale, held on May 29, 2025, in Singapore. There, finalists competed head-to-head, each presenting their fine-tuned model’s responses to a new set of questions. Final scores were determined by a weighted judging system:

• 40% by an LLM-as-a-judge
• 40% by expert judges
• 20% by a live audience

A pragmatic approach to fine-tuning

Before diving into the technical details, a quick disclaimer: the approaches shared in the following sections are largely experimental and born from trial and error. They’re not necessarily the most optimal methods for fine-tuning, nor do they represent a definitive guide. Other finalists took different approaches because of different technical backgrounds. What ultimately helped me succeed wasn’t just technical precision, but collaboration, resourcefulness, and a willingness to explore how the competition might unfold based on insights from earlier iterations. I hope this account can serve as a baseline or inspiration for future participants who might be navigating similar constraints. Even if you’re starting from scratch, as I did, there’s real value in being strategic, curious, and community-driven.

One of the biggest hurdles I faced was time, or the lack of it. Because of a late confirmation of my participation, I joined the competition two weeks after it had already begun. That left me with only two weeks to plan, train, and iterate. Given the tight timeline and limited compute hours on SageMaker JumpStart, I knew I had to make every training session count. Rather than attempting exhaustive experiments, I focused my efforts on curating a strong dataset and tweaking select hyperparameters. Along the way, I drew inspiration from academic papers and existing approaches to LLM fine-tuning, adjusting what I could within the constraints.

Crafting synthetic brilliance

As mentioned earlier, one of the key learning sessions at the start of the competition introduced participants to SageMaker JumpStart and PartyRock, tools that make fine-tuning and synthetic data generation both accessible and intuitive. In particular, PartyRock allowed us to clone and customize apps to control how synthetic datasets were generated. We could tweak parameters such as the prompt structure, creativity level (temperature), and token sampling strategy (top-p). PartyRock also gave us access to a wide range of foundation models. From the start, I opted to generate my datasets using Claude 3.5 Sonnet, aiming for broad and balanced coverage across all three core sub-domains of the competition. To minimize bias and enforce fair representation across topics, I curated several dataset versions, each ranging from 1,500 to 12,000 Q&A pairs, carefully maintaining balanced distributions across sub-domains. The following are a few example themes that I focused on:

• Prompt engineering: Zero-shot prompting, chain-of-thought (CoT) prompting, evaluating prompt effectiveness
• Foundation models: Transformer architectures, distinctions between pretraining and fine-tuning
• Responsible AI: Dataset bias, representation fairness, and data protection in AI systems

To maintain data quality, I fine-tuned the dataset generator to emphasize factual accuracy, uniqueness, and applied knowledge. Each generation batch consisted of 10 Q&A pairs, with prompts specifically designed to encourage depth and clarity.

Question prompt:

You are a quiz master in an AI competition preparing a set of challenging quiz bee questions on [Topic to generate]. The goal of these questions is to determine the better LLM between a fine-tuned LLaMA 3.2 3B Instruct and larger LLMs. Generate [Number of data rows to generate] questions on [Topic to generate], covering:
	* Basic Questions (1/3) → Direct Q&A without reasoning. Must require a clear explanation, example, or real-world application. Avoid one-word fact-based questions.
	* Hybrid Questions (1/3) → Requires a short analytical breakdown (e.g., comparisons, trade-offs, weaknesses, implications). Prioritize scenario-based or real-world dilemma questions.
	* Chain-of-thought (CoT) Questions (1/3) → Requires multi-step logical deductions. Focus on evaluating current AI methods, identifying risks, and critiquing trade-offs. Avoid open-ended "Design/Suggest/Create" questions. Instead, use "Compare, Evaluate, Critique, Assess, Analyze, What are the trade-offs of…"

Ensure the questions on [Topic to generate]:
	* Are specific, non-trivial, and informative.
	* Avoid overly simple questions (e.g., mere definitions or fact-based queries).
	* Encourage applied reasoning (i.e., linking theoretical concepts to real-world AI challenges).

Answer prompt:

You are an AI expert specializing in generative AI, foundation models, agentic AI, prompt engineering, and responsible AI. Your task is to generate well-structured, logically reasoned responses to a list of [Questions], ensuring that all responses follow a chain-of-thought (CoT) approach, regardless of complexity, and are formatted in valid JSONL. Here are the answering guidelines:
	* Every response must be comprehensive, factually accurate, and well-reasoned.
	* Every response must use a step-by-step logical breakdown, even for seemingly direct questions.
For all questions, use structured reasoning:
	* For basic questions, use a concise yet structured explanation. Simple Q&As should still follow CoT reasoning, explaining why the answer is correct rather than just stating facts.
	* For hybrid and CoT questions, use chain of thought and analyze the problem logically before providing a concluding statement.
	* If applicable, use real-world examples or research references to reinforce explanations.
	* If applicable, include trade-offs between different AI strategies.
	* Draw logical connections between subtopics to reinforce deep understanding.

Answering prompt examples:

    
	* Basic question (direct Q&A without reasoning) → Use concise yet comprehensive, structured responses that provide a clear, well-explained, and well-structured definition and explanation without unnecessary verbosity.
	* Applications: Highlight key points step-by-step in a few comprehensive sentences.
	* Complex CoT question (multi-step reasoning) → Use CoT naturally, solving each step explicitly, with in-depth reasoning
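Because answers came back as batches of 10 JSONL rows under a single prompt, malformed or truncated rows were a real risk. A minimal validation pass (a hypothetical sketch, not part of the competition tooling; the `instruction`/`response` field names are assumptions) can filter a raw batch before it enters the training set:

```python
import json

def validate_jsonl_batch(raw, keys=("instruction", "response")):
    """Keep rows that parse as JSON and carry non-empty fields; reject the rest."""
    valid, rejected = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)  # e.g. output truncated mid-string
            continue
        if all(isinstance(row.get(k), str) and row[k].strip() for k in keys):
            valid.append(row)
        else:
            rejected.append(line)
    return valid, rejected

batch = "\n".join([
    '{"instruction": "What is zero-shot prompting?", "response": "Step 1: ..."}',
    '{"instruction": "Explain CoT prompting", "response": "Chain-of-th',  # cut off
])
good, bad = validate_jsonl_batch(batch)
print(len(good), len(bad))  # 1 1
```

Rejected lines can simply be regenerated in the next batch, which is cheaper than letting broken rows dilute the fine-tuning data.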

For question generation, I set the temperature to 0.7, favoring creative and novel phrasing without drifting too far from factual grounding. For answer generation, I used a lower temperature of 0.2, targeting precision and correctness. In both cases, I applied top-p = 0.9, allowing the model to sample from a focused yet diverse range of likely tokens, encouraging nuanced outputs. One important strategic assumption I made throughout the competition was that the evaluator LLM would favor more structured, informative, and complete responses over overly creative or brief ones. To align with this, I included reasoning steps in my answers to make them longer and more comprehensive. Research has shown that LLM-based evaluators often score detailed, well-explained answers higher, and I leaned into that insight during dataset generation.
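The top-p = 0.9 setting can be pictured with a toy nucleus-sampling filter. This sketch (illustrative only, with a made-up token distribution) keeps the smallest set of tokens whose cumulative probability reaches p, then renormalizes; temperature would be applied to the logits before this step:

```python
def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering over a {token: probability} distribution."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:  # smallest prefix whose mass reaches p
            break
    return {token: prob / total for token, prob in kept}  # renormalize

probs = {"model": 0.5, "dataset": 0.3, "banana": 0.15, "qwerty": 0.05}
filtered = top_p_filter(probs, p=0.9)
print(sorted(filtered))  # the low-probability tail ("qwerty") is dropped
```

A lower p sharpens the output toward the most likely tokens; 0.9 leaves room for varied but still plausible phrasing, which matches the goal of nuanced yet grounded answers.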

    Refining the submissions

SageMaker JumpStart offers a wide array of hyperparameters to configure, which can feel overwhelming, especially when you’re racing against time and unsure of what to prioritize. Fortunately, the organizers emphasized focusing primarily on epochs and learning rate, so I honed in on these variables. Each training job with a single epoch took roughly 10–15 minutes, making time management crucial. To avoid wasting valuable compute hours, I began with a baseline dataset of 1,500 rows to test combinations of epochs and learning rates. I explored:

    • Epochs: 1 to 4
    • Learning rates: 0.0001, 0.0002, 0.0003, and 0.0004
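The sweep amounts to a small grid search where every cell costs a 10–15 minute training job plus a leaderboard submission. A sketch with a hypothetical submission log (all win rates below are illustrative, except the 53% best combination reported later in the text):

```python
from itertools import product

# Hypothetical log: (epochs, learning_rate) -> leaderboard win rate.
win_rates = {
    (1, 0.0003): 0.48,
    (2, 0.0003): 0.53,
    (3, 0.0003): 0.50,
    (2, 0.0001): 0.45,
    (2, 0.0002): 0.49,
    (2, 0.0004): 0.47,
}

# Enumerate the full 4x4 grid; only a subset was ever affordable to test.
grid = list(product(range(1, 5), [0.0001, 0.0002, 0.0003, 0.0004]))
tested = {combo: win_rates[combo] for combo in grid if combo in win_rates}
best = max(tested, key=tested.get)
print(best, tested[best])  # (2, 0.0003) 0.53
```

Keeping a log like this is exactly the "track everything" habit recommended later: it makes the best combination, and the untested cells, visible at a glance.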

After several iterations, the combination of two epochs and a learning rate of 0.0003 yielded the best result, achieving a 53% win rate on my 13th leaderboard submission. Encouraged by this, I continued using this combination for several subsequent experiments, even as I expanded my dataset. Initially, this strategy seemed to work. With a dataset of roughly 3,500 rows, my model reached a 57% win rate by my 16th submission. However, as I further increased the dataset to 5,500, 6,700, 8,500, and eventually 12,000 rows, my win rate steadily declined to 53%, 51%, 45%, and 42% respectively. At that point, it was clear that merely increasing dataset size wasn’t enough; in fact, it might have been counterproductive without revisiting the hyperparameters. With only 5 training hours remaining and 54 submissions logged, I found myself stuck at 57%, while peers like the top tuner from the Philippines were already reaching a 71% win rate.

Lessons from the field

With limited time left, both for training and leaderboard submissions, I turned to cross-country collaboration for help. One of the most insightful conversations I had was with Michael Ismail Febrian, the top tuner from Indonesia and the highest scorer in the elimination round. He encouraged me to explore LoRA (low-rank adaptation) hyperparameters, specifically:

    • lora_r
    • lora_alpha
    • target_modules

Michael also suggested enriching my dataset by using API-generated responses from more capable teacher models, specifically for answering PartyRock-generated questions. Looking back at my existing fine-tuning pipeline, I realized a critical weakness: the generated answers were often too concise or shallow. Here’s an example of a typical Q&A pair from my earlier dataset:

    Instruction: What factors influence model hallucination?
    
    Response: Let's analyze this step-by-step:
    	* Training factors:
    		* Data quality and coverage
    		* Training methodology
    		* Model architecture
    	* Deployment factors:
    		* Prompt design
    		* Temperature settings
    		* Context window utilization
    	* Mitigation strategies:
    		* Retrieval augmentation
    		* Fact-checking mechanisms
    		* Confidence thresholds
    		
    Example: GPT-3 showing lower hallucination rates (15% vs 21%) when using retrieval-augmented generation compared to the base model.

While this structure is clean and organized, it lacked deeper explanation for each point, something models like ChatGPT and Gemini often do well. I believe this limitation came from token constraints when generating multiple responses in bulk. In my case, I generated 10 responses at a time in JSONL format under a single prompt, which may have led PartyRock to truncate outputs. Not wanting to spend on paid APIs, I discovered OpenRouter.ai, which offers limited access to large models, albeit rate-limited. With a cap of roughly 200 Q&A pairs per day per account, I got creative and set up multiple accounts to support my expanded dataset. My teacher model of choice was DeepSeek R1, a popular option known for its effectiveness in training smaller, specialized models. It was a bit of a gamble, but one that paid off in terms of output quality.
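OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so a request to the teacher model is just a JSON payload. The sketch below only builds the payload (the model slug and system prompt are assumptions for illustration; actually sending it requires an `Authorization: Bearer <key>` header, omitted here):

```python
# Endpoint for OpenRouter's OpenAI-compatible chat completions API.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_answer_request(question, model="deepseek/deepseek-r1"):
    """Assemble a chat request asking the teacher model for a reasoning-rich answer."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": ("Answer with explicit step-by-step reasoning, "
                         "then a concise conclusion.")},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,  # favor precision over creativity
    }

payload = build_answer_request("What factors influence model hallucination?")
print(payload["model"])  # deepseek/deepseek-r1
```

Generating one answer per request, rather than 10 per prompt, sidesteps the truncation problem at the cost of more calls, which is where the per-account daily cap came into play.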

As for LoRA tuning, here’s what I learned:

• lora_r and lora_alpha determine how much new information, and how complex, the model can absorb. A common rule of thumb is setting lora_alpha to 1x or 2x of lora_r.
• target_modules defines which parts of the model are updated, typically the attention layers or the feed-forward network.
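A back-of-the-envelope count shows why lora_r governs capacity: each adapted weight matrix W (shape d_out x d_in) gains two low-rank factors, A (r x d_in) and B (d_out x r), adding r * (d_in + d_out) trainable parameters, so capacity grows linearly with r. The dimensions below are illustrative stand-ins for a small Llama-style model, not exact Llama 3.2 3B shapes:

```python
def lora_param_count(r, shapes, n_layers):
    """Trainable LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    return n_layers * sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Only attention projections adapted here (q_proj, k_proj, v_proj, o_proj),
# each assumed square for simplicity.
attn_shapes = [(3072, 3072)] * 4
low = lora_param_count(r=8, shapes=attn_shapes, n_layers=28)
high = lora_param_count(r=256, shapes=attn_shapes, n_layers=28)
print(high // low)  # 32: raising lora_r from 8 to 256 scales capacity 32x
```

Adding the feed-forward projections to target_modules multiplies the count further, which is why that combination is reserved for cases where the data really demands the extra capacity.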

I also consulted Kim, the top tuner from Vietnam, who flagged my 0.0003 learning rate as potentially too high. He, together with Michael, suggested a different strategy: increase the number of epochs and reduce the learning rate. This would allow the model to better capture complex relationships and subtle patterns, especially as dataset size grows. Our conversations underscored a hard-learned truth: data quality is more important than data quantity. There’s a point of diminishing returns when increasing dataset size without adjusting hyperparameters or validating quality, something I experienced directly. In hindsight, I realized I had underestimated how essential fine-grained hyperparameter tuning is, especially when scaling data. More data demands more precise tuning to match the growing complexity of what the model needs to learn.

Last-minute gambits

Armed with fresh insights from my collaborators and hard-won lessons from earlier iterations, I knew it was time to pivot my entire fine-tuning pipeline. The most significant change was in how I generated my dataset. Instead of using PartyRock to produce both questions and answers, I opted to generate only the questions in PartyRock, then feed those prompts into the DeepSeek-R1 API to generate high-quality responses. Each answer was saved in JSONL format and, crucially, included detailed reasoning. This shift significantly increased the depth and length of each answer, averaging around 900 tokens per response, compared to the much shorter outputs from PartyRock. Given that my earlier dataset of roughly 1,500 high-quality rows had produced promising results, I stuck with that size for my final dataset. Rather than scale up in quantity, I doubled down on quality and complexity. For this final round, I made bold, blind tweaks to my hyperparameters:

    • Dropped the learning rate to 0.00008
    • Increased the LoRA parameters:
      • lora_r = 256
      • lora_alpha = 256
    • Expanded the LoRA target modules to cover both attention and feed-forward layers:

      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
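Collected as one configuration, the final run looked roughly like this. The key names are illustrative; the exact hyperparameter names accepted by a SageMaker JumpStart fine-tuning job depend on the model version:

```python
# Sketch of the final training configuration described above.
final_run = {
    "learning_rate": 8e-05,
    "lora_r": 256,
    "lora_alpha": 256,  # 1x lora_r, per the rule of thumb mentioned earlier
    "lora_target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # feed-forward projections
    ],
    "epochs": 3,  # one run used 3 epochs, the other 4
}
print(len(final_run["lora_target_modules"]))  # 7
```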

These changes were made with one assumption: longer, more complex answers require more capacity to absorb and generalize nuanced patterns. I hoped that these settings would enable the model to fully use the high-quality, reasoning-rich data from DeepSeek-R1. With only 5 hours of training time remaining, I had just enough for two full training runs, each using a different epoch setting (3 and 4). It was a make-or-break moment. If the first run underperformed, I had one last chance to redeem it. Thankfully, my first test run achieved a 65% win rate, a massive improvement, but still behind the current leader from the Philippines and trailing Michael’s impressive 89%. Everything now hinged on my final training job. It had to run smoothly, avoid errors, and outperform everything I had tried before. And it did. That final submission achieved a 77% win rate, pushing me to the top of the leaderboard and securing my slot for the Grand Finale. After weeks of experimentation, sleepless nights, setbacks, and late-game adjustments, the journey from a two-week-late entrant to national champion was complete.

What I wish I had known sooner

I won’t pretend that my success in the elimination round was purely technical; luck played a big part. Still, the journey revealed several insights that could save future participants valuable time, training hours, and submissions. Here are some key takeaways I wish I had known from the start:

• Quality is more important than quantity: More data doesn’t always mean better results. Whether you’re adding rows or increasing context length, you’re also increasing the complexity that the model must learn from. Focus on crafting high-quality, well-structured examples rather than blindly scaling up.
• Fast learner versus slow learner: If you’re avoiding deep dives into LoRA or other advanced tweaks, understanding the trade-off between learning rate and epochs is essential. A higher learning rate with fewer epochs might converge faster, but could miss the subtle patterns captured by a lower learning rate over more epochs. Choose carefully based on your data’s complexity.
• Don’t neglect hyperparameters: One of my biggest missteps was treating hyperparameters as static, regardless of changes in dataset size or complexity. As your data evolves, your model settings should too. Hyperparameters should scale with your data.
• Do your homework: Avoid excessive guesswork by reading relevant research papers, documentation, or blog posts. Late in the competition, I stumbled upon helpful resources that I could have used to make better decisions earlier. A little reading can go a long way.
• Track everything: When experimenting, it’s easy to forget what worked and what didn’t. Maintain a log of your datasets, hyperparameter combinations, and performance results. This helps optimize your runs and aids in debugging.
• Collaboration is a superpower: While it’s a competition, it’s also a chance to learn. Connecting with other participants, whether they’re ahead or behind, gave me invaluable insights. You might not always walk away with a trophy, but you’ll leave with knowledge, relationships, and real progress.

    Grand Finale

The Grand Finale took place on the second day of the National AI Student Challenge, serving as the culmination of weeks of experimentation, strategy, and collaboration. Before the final showdown, all national champions had the opportunity to engage in the AI Student Developer Conference, where we shared insights, exchanged lessons, and built connections with fellow finalists from across the ASEAN region. During our conversations, I was struck by how remarkably similar many of our fine-tuning strategies were. Across the board, participants had used a mix of external APIs, dataset curation strategies, and cloud-based training platforms like SageMaker JumpStart. It became clear that tool selection and creative problem-solving played just as big a role as raw technical knowledge. One particularly eye-opening insight came from a finalist who achieved an 85% win rate despite using a large dataset, something I had initially assumed might hurt performance. Their secret was training over a higher number of epochs while maintaining a lower learning rate of 0.0001. However, this came at the cost of longer training times and fewer leaderboard submissions, which highlights an important trade-off:

With enough training time, a carefully tuned model, even one trained on a large dataset, can outperform faster, leaner models.

This reinforced a powerful lesson: there’s no single correct approach to fine-tuning LLMs. What matters most is how well your strategy aligns with the time, tools, and constraints at hand.

Preparing for battle

In the lead-up to the Grand Finale, I stumbled upon a blog post by Ray Goh, the very first champion of the AWS AI League and one of the mentors behind the competition’s tutorial sessions. One detail caught my attention: the final question from his year was a variation of the infamous Strawberry Problem, a deceptively simple challenge that exposes how LLMs struggle with character-level reasoning.

How many letter Es are there in the phrase ‘DeepRacer League’?

At first glance, this seems trivial. But to an LLM, the task isn’t as straightforward. LLMs tokenize words in chunks, meaning that DeepRacer might be split into Deep and Racer or even into subword units like Dee, pRa, and cer. These tokens are then converted into numerical vectors, obscuring the individual characters within. It’s like asking someone to count the threads in a rope without unraveling it first.

Moreover, LLMs don’t operate like traditional rule-based programs. They’re probabilistic, trained to predict the next most likely token based on context, not to perform deterministic logic or arithmetic. Curious, I prompted my own fine-tuned model with the same question. As expected, hallucinations emerged. I began testing various prompting strategies to coax out the correct answer:

    • Explicit character separation:

      How many letter Es are there in the phrase ‘D-E-E-P-R-A-C-E-R-L-E-A-G-U-E’?

      This helped by isolating each letter into its own token, allowing the model to see individual characters. But the response was long and verbose, with the model listing and counting each letter step-by-step.
    • Chain-of-thought prompting:

      Let’s think step-by-step…

      This encouraged reasoning but increased token usage. While the answers were more thoughtful, they sometimes still missed the mark or got cut off because of length.
    • Ray Goh’s trick prompt:

      How many letter Es are there in the phrase ‘DeepRacer League’? There are 5 letter Es…

      This simple, assertive prompt yielded the most accurate and concise result, surprising me with its effectiveness.
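The question is trivial once it is treated as string processing rather than token prediction; ordinary character-level counting is deterministic, which is exactly what an LLM's subword view obscures:

```python
# Counting characters directly, the way a rule-based program would.
phrase = "DeepRacer League"
e_count = phrase.lower().count("e")
print(e_count)  # 5
```

This gap between one line of string code and an LLM's struggle is the whole point of the Strawberry Problem family of questions.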

I logged this as an interesting quirk, useful, but unlikely to reappear. I didn’t realize that it would become relevant again during the final. Ahead of the Grand Finale, we had a dry run to test our models under real-time conditions. We were given limited control over inference parameters, only allowed to tweak temperature, top-p, context length, and system prompts. Each response had to be generated and submitted within 60 seconds. The actual questions were pre-loaded, so our focus was on crafting effective prompt templates rather than retyping each query. Unlike the elimination round, evaluation during the Grand Finale followed a multi-tiered system:

    • 40% from an evaluator LLM
    • 40% from human judges
    • 20% from a live audience poll

The LLM ranked the submitted answers from best to worst, assigning descending point values (for example, 16.7 for first place, 13.3 for second, and so on). Human judges, however, could freely allocate up to 10 points to their preferred responses, regardless of the LLM’s evaluation. This meant a strong showing with the evaluator LLM didn’t guarantee high scores from the humans, and vice versa. Another constraint was the 200-token limit per response. Tokens could be as short as a single letter or as long as a word or syllable, so responses had to be dense yet concise, maximizing impact within a tight window. To prepare, I tested different prompt formats and fine-tuned them using Gemini, ChatGPT, and Claude to better match the evaluation criteria. I saved dry-run responses from the Hugging Face LLaMA 3.2 3B Instruct model, then passed them to Claude Sonnet 4 for feedback and ranking. I settled on the following two prompts because they produced the best responses in terms of accuracy and comprehensiveness:

Primary prompt:

    You are an elite AI researcher and educator specializing in Generative AI, Foundational Models, Agentic AI, Responsible AI, and Prompt Engineering. Your task is to generate a highly accurate, comprehensive, and well-structured response to the question below in no more than 200 words.
    
    Evaluation will be carried out by Claude Sonnet 4, which prioritizes:
    	* Factual Accuracy – All claims must be correct and verifiable. Avoid speculation.
    	* Comprehensiveness – Cover all essential dimensions, including interrelated concepts or mechanisms.
    	* Clarity & Structure – Use concise, well-organized sections (e.g., brief intro, bullet points, and/or transitions). Markdown formatting (headings/lists) is optional.
    	* Efficiency – Every sentence must deliver unique insight. Avoid filler.
    	* Tone – Maintain an informed, neutral, and objective tone.
    	
    Your response should be dense with value while remaining readable and precise.

Backup prompt:

    You are a competitive AI practitioner with deep expertise in [Insert domain: e.g., Agentic AI or Prompt Engineering], answering a technical question evaluated by Claude Sonnet 4 for accuracy and comprehensiveness. You must answer in exactly 200 words.
    
    Format your answer as follows: 
    	* Direct Answer (1–2 sentences) – Immediately state the core conclusion or definition.
    	* Key Technical Points (3–4 bullet points) – Essential mechanisms, distinctions, or principles.
    	* Practical Application (1–2 sentences) – Specific real-world use cases or design implications.
    	* Critical Insight (1 sentence) – Mention a key challenge, trade-off, or future direction.

Additional requirements:

    • Use precise technical language and terminology.
    • Include specific tools, frameworks, or metrics if relevant.
    • Every sentence must contribute uniquely; no redundancy.
    • Maintain a formal tone and answer density without over-compression.

In terms of hyperparameters, I used:

    • Top-p = 0.9
    • Max tokens = 200
    • Temperature = 0.2, to prioritize accuracy over creativity

My strategy was simple: appeal to the AI judge. I believed that if my answer ranked well with the evaluator LLM, it would also impress the human judges. Oh, how I was humbled.
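Why that assumption failed is easy to see with a toy version of the split scoring. Only the first two point values (16.7, 13.3) come from the rules described earlier; the rest of the descending scale and the judge/audience numbers are assumptions for illustration:

```python
# Assumed descending point scale for the evaluator LLM's ranking.
llm_points_by_rank = [16.7, 13.3, 10.0, 6.7, 3.3, 0.0]

def round_total(llm_rank, judge_points, audience_points):
    """One question's total: LLM points by rank, plus freely allocated
    judge points (up to 10) and audience poll points."""
    return llm_points_by_rank[llm_rank] + judge_points + audience_points

# A response the LLM ranked first can still lose the round if the human
# judges give their points elsewhere:
ranked_first_by_llm = round_total(0, judge_points=0.0, audience_points=2.0)
ranked_fourth_by_llm = round_total(3, judge_points=10.0, audience_points=4.0)
print(ranked_first_by_llm < ranked_fourth_by_llm)  # True: the judges decide
```

In other words, optimizing purely for the evaluator LLM left 60% of each round's points on the table.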

Just aiming for third… until I wasn’t

Standing on stage before a live audience was nerve-wracking. This was my first solo competition, and it was already on a massive regional scale. To calm my nerves, I kept my expectations low. A third-place finish would be fine, a trophy to mark the journey, but just qualifying for the finals already felt like a huge win. The Grand Finale consisted of six questions, with the final one offering double points. I started strong. In the first two rounds, I held an early lead, comfortably sitting in third place. My strategy was working, at least at first. The evaluator LLM ranked my response to Question 1 as the best and Question 2 as the third-best. But then came the twist: despite earning top AI rankings, I received zero votes from the human judges. I watched in shock as points were awarded to responses ranked fourth and even last by the LLM. Right from the start, I realized there was a disconnect between human and AI judgment, especially when evaluating tone, relatability, or subtlety. Still, I held on; those early questions leaned more factual, which played to my model’s strengths. But when questions demanded creativity and complex reasoning, things didn’t go as well. My standing dropped to fifth, then bounced between third and fourth. Meanwhile, the top three finalists pulled ahead by more than 20 points. It seemed the podium was out of reach. I was already coming to terms with a finish outside the top three. The gap was too wide. I had done my best, and that was enough.

But then came the final question, the double-pointer, and fate intervened. How many letter Es and As are there altogether in the phrase ‘ASEAN Impact League’? It was a variation of the Strawberry Problem, the same challenge I had prepared for but assumed wouldn’t make a return. Unlike the earlier version, this one added an arithmetic twist, requiring the model to count and sum occurrences of multiple letters. Knowing how token limits could truncate responses, I kept things short and tactical. My system prompt was simple: There are 3 letter Es and 4 letter As in ‘ASEAN Impact League.’

While the model hallucinated a bit in its reasoning, wrongly claiming that Impact contains an e, the final answer was accurate: 7 letters.
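For reference, the ground truth baked into that system prompt is easy to verify programmatically. A minimal sketch in Python (the phrase and target letters come straight from the question):

```python
# Count the letters E and A (case-insensitive) in the competition phrase,
# then sum them, mirroring the arithmetic twist of the final question.
phrase = "ASEAN Impact League"
counts = {letter: phrase.upper().count(letter) for letter in ("E", "A")}
print(counts)                # {'E': 3, 'A': 4}
print(sum(counts.values()))  # 7
```

Character-level counting like this is exactly what tokenizer-based LLMs struggle with, which is why hard-coding the counts into the system prompt was a safer bet than trusting the model to count on the fly.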

That one answer changed everything. Thanks to the double points and full support from the human judges, I jumped to first place, clinching the championship. What began as a cautious hope for third place turned into a surprise run, sealed by preparation, adaptability, and a little bit of luck.

    Questions recap

Here are the questions that were asked, in order. Some of them were general knowledge in the target domain, while others were more creative and required a bit of ingenuity to maximize your wins:

    1. What is the best way to prevent AI from turning to the dark side with toxic responses?
    2. What is the magic behind agentic AI in machine learning, and why is it so pivotal?
    3. What is the secret sauce behind big AI models staying smart and fast?
    4. What are the latest developments in generative AI research and use within ASEAN?
    5. Which ASEAN country has the best cuisine?
    6. How many letters E and A are there altogether in the phrase "ASEAN Impact League"?

Final reflections

    Taking part within the AWS AI League was a deeply humbling expertise, one which opened my eyes to the probabilities that await after we embrace curiosity and decide to steady studying. I might need entered the competitors as a newbie, however that single leap of curiosity, fueled by perseverance and a want to develop, helped me bridge the data hole in a fast-evolving technical panorama. I don’t declare to be an knowledgeable, not but. However what I’ve come to imagine greater than ever is the facility of neighborhood and collaboration. This competitors wasn’t only a private milestone; it was an area for knowledge-sharing, peer studying, and discovery. In a world the place know-how evolves quickly, these collaborative areas are important for staying grounded and transferring ahead. My hope is that this publish and my journey will encourage college students, builders, and curious minds to take that first step, whether or not it’s becoming a member of a contest, contributing to a neighborhood, or tinkering with new instruments. Don’t wait to be prepared. Begin the place you’re, and develop alongside the way in which. I’m excited to attach with extra passionate people within the international AI neighborhood. If one other LLM League comes round, perhaps I’ll see you there.

    Conclusion

As we conclude this insight into Blix's journey to becoming the AWS AI League ASEAN champion, we hope his story inspires you to explore the exciting possibilities at the intersection of AI and innovation. Discover the AWS services that powered this competition: Amazon Bedrock, Amazon SageMaker JumpStart, and PartyRock, and visit the official AWS AI League page to join the next generation of AI innovators.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the authors

Noor Khan is a Solutions Architect at AWS supporting Singapore's public sector education and research landscape. She works closely with academic and research institutions, leading technical engagements and designing secure, scalable architectures. As part of the core AWS AI League team, she architected and built the backend for the platform, enabling customers to explore real-world AI use cases through gamified learning. Her passions include AI/ML, generative AI, web development, and empowering women in tech!

Vincent Oh is the Principal Solutions Architect at AWS for Data & AI. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions. He created the AI League in the midst of helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor at Singapore Management University (SMU), teaching computer science modules under the School of Computing & Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.

Blix Foryasen is a Computer Science student specializing in Machine Learning at National University – Manila. He is passionate about data science, AI for social good, and civic technology, with a strong focus on solving real-world problems through competitions, research, and community-driven innovation. Blix is also deeply engaged with emerging technological trends, particularly in AI and its evolving applications across industries, especially in finance, healthcare, and education.

