Image by Author | Canva
We use large language models in many of our daily tasks. These models have been trained on billions of online documents and diverse datasets, making them capable of understanding and responding in human-like language. However, not all LLMs are built the same way. While the core idea remains similar, they differ in their underlying architectures, and those differences have a significant impact on their capabilities. For example, as seen across various benchmarks, DeepSeek excels at reasoning tasks, Claude performs well in coding, and ChatGPT stands out in creative writing.
In this article, I'll walk you through 7 popular LLM architectures to give you a clear overview, all in just as many minutes. So, let's get started.
1. BERT
Paper Link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention in language modeling. Unlike earlier models that read text left-to-right or right-to-left, BERT uses a transformer encoder to consider both directions simultaneously. It is trained with two objectives: masked language modeling (predicting randomly masked words) and next-sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies solely on encoder stacks and includes special tokens such as [CLS] to represent the full sentence and [SEP] to separate two sentences. You can fine-tune it for tasks like sentiment analysis, question answering (such as SQuAD), and more. It was the first model of its kind to truly capture the full meaning of sentences.
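To make the masked-language-modeling idea concrete, here is a minimal sketch using the Hugging Face transformers library with the public bert-base-uncased checkpoint; the library and checkpoint choice are assumptions for illustration, not something the BERT paper prescribes.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the standard public BERT checkpoint (assumed available locally or via download).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# [CLS] and [SEP] are added automatically by the tokenizer.
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and take the highest-scoring token for it.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically prints "paris"
```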
2. GPT
Paper Link (GPT-4): https://arxiv.org/pdf/2303.08774
The GPT (Generative Pre-trained Transformer) family was introduced by OpenAI. The series began with GPT-1 in 2018 and had evolved to GPT-4 by 2023, with GPT-4o, released in May 2024, adding multimodal capabilities that handle both text and images. These models are pre-trained on very large text corpora with a standard next-token prediction language modeling objective: at each step, the model predicts the next word in a sequence given all previous words. After this unsupervised pre-training stage, the same model can be fine-tuned on specific tasks or used in a zero-/few-shot manner with minimal additional parameters. The decoder-only design means GPT attends only to previous tokens, unlike BERT's bidirectional encoder. What was notable at introduction was the sheer scale and capability of GPT: as each successive generation (GPT-2, GPT-3) grew larger, the models demonstrated very fluent text generation and few-shot learning abilities, establishing the "pre-train and prompt/fine-tune" paradigm for large language models. However, they are proprietary, with access typically provided through APIs, and their exact architectures, especially for recent versions, are not fully disclosed.
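Since GPT-3 and GPT-4 are API-only, the next-token prediction objective is easiest to illustrate with the openly available GPT-2 checkpoint. Below is a rough sketch assuming the Hugging Face transformers library; it is an illustration of the training objective, not OpenAI's actual code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Large language models are"
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the causal LM loss:
# each position predicts the next token, attending only to earlier tokens.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token loss: {outputs.loss.item():.3f}")

# The same objective drives generation at inference time.
generated = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(generated[0]))
```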
3. LLaMA
LLaMA 4 Blog Link: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Paper Link (LLaMA 3): https://arxiv.org/abs/2407.21783
LLaMA, developed by Meta AI and first released in February 2023, is a family of open-source decoder-only transformer models. It ranges from 7 billion to 70 billion parameters, with the latest version, Llama 4, released in April 2025. Like GPT, LLaMA uses a decoder-only Transformer architecture (each model is an autoregressive Transformer) but with some architectural tweaks. For example, the original LLaMA models used the SwiGLU activation instead of GeLU, rotary positional embeddings (RoPE) instead of fixed ones, and RMSNorm in place of layer normalization. The LLaMA family was released in several sizes, from 7B up to 65B parameters in LLaMA 1 and even larger in LLaMA 3, to make large-scale models more accessible. Notably, despite relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that LLaMA's 13B model outperformed OpenAI's 175B GPT-3 on many benchmarks, and its 65B model was competitive with contemporaries like Google's PaLM and DeepMind's Chinchilla. LLaMA's open (though initially research-restricted) release spawned extensive community use; its key novelty was combining efficient training at scale with more open access to model weights.
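As a rough illustration of two of those tweaks, here is a small PyTorch sketch of RMSNorm and a SwiGLU feed-forward block. The dimensions and module names are made up for the example and do not come from Meta's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features: no mean
        # subtraction and no bias, which makes it cheaper than LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated linear unit instead of a plain GeLU MLP.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)                      # (batch, sequence, hidden)
y = SwiGLUFeedForward(512, 1376)(RMSNorm(512)(x))
print(y.shape)                                   # torch.Size([2, 16, 512])
```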
4. PaLM
PaLM 2 Technical Report: https://arxiv.org/abs/2305.10403
Paper Link (PaLM): https://arxiv.org/pdf/2204.02311
PaLM (Pathways Language Model) is a series of large language models developed by Google Research. The original PaLM (announced in 2022) was a 540-billion-parameter, decoder-only Transformer and is part of Google's Pathways system. It was trained on a high-quality corpus of 780 billion tokens across thousands of TPU v4 chips in Google's infrastructure, using parallelism to achieve high hardware utilization. The model also uses multi-query attention to reduce memory bandwidth requirements during inference. PaLM is known for its few-shot learning capabilities, performing well on new tasks with minimal examples thanks to its huge and diverse training data, which includes webpages, books, Wikipedia, news, GitHub code, and social media conversations. PaLM 2, announced in May 2023, further improved multilingual, reasoning, and coding capabilities, powering applications like Google Bard and Workspace AI features.
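To show what multi-query attention means in practice, here is an illustrative PyTorch sketch in which many query heads share a single key/value head, which is what shrinks the KV cache at inference time. The shapes and class names are assumptions for the example, not PaLM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)                  # one query per head
        self.kv_proj = nn.Linear(dim, 2 * self.head_dim, bias=False)   # a single shared K and V
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).chunk(2, dim=-1)      # (b, t, head_dim) each
        k = k.unsqueeze(1)                           # broadcast over all query heads
        v = v.unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = F.softmax(scores, dim=-1) @ v          # (b, heads, t, head_dim)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 8, 256)
print(MultiQueryAttention(256, 8)(x).shape)          # torch.Size([1, 8, 256])
```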
5. Gemini
Gemini 2.5 Blog: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
Paper Link (Gemini 1.5): https://arxiv.org/abs/2403.05530
Paper Link (Gemini): https://arxiv.org/abs/2312.11805
Gemini is Google's next-generation LLM family (from Google DeepMind and Google Research), introduced in late 2023. Gemini models are natively multimodal, meaning they are designed from the ground up to handle text, images, audio, video, and even code in a single model. Like PaLM and GPT, Gemini is based on the Transformer, but its key features include massive scale, support for extremely long contexts, and (in Gemini 1.5) a Mixture-of-Experts (MoE) architecture for efficiency. For example, Gemini 1.5 ("Pro") uses sparsely activated expert layers (hundreds of expert sub-networks, with only a few active per input) to boost capacity without a proportional increase in compute cost. The Gemini 2.5 series, launched in March 2025, built on this foundation with even deeper "thinking" capabilities. In June 2025, Google released Gemini 2.5 Flash and Pro as stable models and previewed Flash-Lite, its most cost-efficient and fastest version yet, optimized for high-throughput tasks while still supporting the million-token context window and tool integrations like search and code execution. The Gemini family comes in several sizes (Ultra, Pro, Nano) so it can run anywhere from cloud servers down to mobile devices. The combination of multimodal pretraining and MoE-based scaling makes Gemini a flexible, highly capable foundation model.
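The sparse-routing idea behind such an MoE layer can be sketched in a few lines of PyTorch. The expert count and top-k value below are made up for illustration and are not Gemini's real configuration.

```python
import torch
import torch.nn.functional as F

n_experts, top_k, hidden = 64, 2, 512
router = torch.nn.Linear(hidden, n_experts, bias=False)

tokens = torch.randn(4, hidden)                 # 4 tokens in a batch
scores = F.softmax(router(tokens), dim=-1)      # routing probabilities over experts
weights, chosen = scores.topk(top_k, dim=-1)    # keep only the top-k experts per token

# Each token is sent to just `top_k` of the 64 experts; the rest do no work,
# so compute per token stays roughly constant as the expert count grows.
print(chosen)                # e.g. tensor([[17, 42], [ 3, 58], ...])
print(weights.sum(dim=-1))   # combined routing weight of the selected experts
```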
6. Mistral
Paper Link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLMs in 2023. Its flagship model, Mistral 7B (September 2023), is a 7.3-billion-parameter decoder-only Transformer. Architecturally, Mistral 7B is similar to a GPT-style model but includes optimizations for inference: it uses grouped-query attention (GQA) to speed up self-attention and sliding-window attention to handle longer contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta's Llama 2 13B and even gave strong results against 34B models, while being much smaller. Mistral AI released the model under an Apache 2.0 license, making it freely available for use. Its next major release was Mixtral 8×7B, a sparse Mixture-of-Experts (MoE) model featuring eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and LLaMA 2 70B on tasks like mathematics, coding, and multilingual benchmarks. In May 2025, Mistral launched Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. It delivers over 90% of the score of pricier models like Claude 3.7 Sonnet on standard benchmarks while reducing per-token cost dramatically (roughly $0.40 for input versus $3.00 for Sonnet). It supports multimodal tasks (text + images) and expert reasoning, and is available through an API or for on-prem deployment on as few as four GPUs. However, unlike earlier models, Medium 3 is closed-source, prompting community criticism that Mistral is moving away from its open-source ethos. Shortly after, in June 2025, Mistral released Magistral, its first model dedicated to explicit reasoning. The small version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, with the small version scoring 70.7%, demonstrating strong math and logic skills across multiple languages.
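Here is a toy sketch of the sliding-window attention mask: each position may attend only to itself and a fixed number of preceding tokens rather than the full causal history. The tiny window size is chosen just to keep the printed mask readable; Mistral 7B uses a much larger window (4,096 tokens in the paper), and stacked layers let information flow beyond the window indirectly.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: key position j is at or before query
    # position i, and no more than (window - 1) tokens behind it.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.int())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 1, 1, 1, 0, 0],
#         [0, 0, 1, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]])
```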
7. DeepSeek
Paper Link (DeepSeek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of High-Flyer AI, founded in 2023) that develops large LLMs. Its recent models (such as DeepSeek-V3 and DeepSeek-R1) employ a highly sparsely activated Mixture-of-Experts Transformer architecture. In DeepSeek V3/R1, each Transformer layer has hundreds of expert sub-networks, but only a few are activated per token. Instead of running every part of the model at once, the model activates only a handful of its experts (for example, 9 out of 257) depending on what each input needs. This allows DeepSeek to have an enormous total model size (over 670 billion parameters) while using only about 37 billion for each response, making it much faster and cheaper to run than a dense model of comparable size. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and advanced optimizations (including experimental FP8 precision during training) to improve efficiency. This aggressive MoE design lets DeepSeek achieve very high capability (comparable to much larger dense models) at lower compute cost. DeepSeek's models, released under open licenses, attracted attention for rivaling leading models like GPT-4 in multilingual generation and reasoning, all while significantly reducing training and inference resource requirements.
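Below is a toy PyTorch sketch of a sparsely activated MoE feed-forward layer in the spirit described above: many expert MLPs exist, but only a few run per token. The expert count, top-k, and layer sizes are illustrative, not DeepSeek's real configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top-k experts and mix their outputs.
        weights, chosen = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in chosen[:, slot].unique():
                mask = chosen[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = SparseMoE(dim=256, n_experts=16, top_k=2)
x = torch.randn(8, 256)
print(layer(x).shape)   # torch.Size([8, 256])
# Only 2 of the 16 expert MLPs run for any given token, so the active parameter
# count per token is a small fraction of the layer's total -- the same principle
# that lets DeepSeek use roughly 37B of its 670B+ parameters per token.
```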
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.