
    Quantizing LLMs Step-by-Step: Converting FP16 Models to GGUF

    By Oliver Chambers | January 8, 2026 | 7 Mins Read


    In this article, you’ll learn how quantization shrinks large language models and how to convert an FP16 checkpoint into an efficient GGUF file you can share and run locally.

    Topics we’ll cover include:

    • What precision types (FP32, FP16, 8-bit, 4-bit) mean for model size and speed
    • How to use huggingface_hub to fetch a model and authenticate
    • How to convert to GGUF with llama.cpp and upload the result to Hugging Face

    And away we go.


    Introduction

    Large language models like LLaMA, Mistral, and Qwen have billions of parameters that demand a lot of memory and compute power. For example, running LLaMA 7B in full precision can require over 12 GB of VRAM, making it impractical for many users. You can check the details in this Hugging Face discussion. Don’t worry about what “full precision” means yet; we’ll break it down soon. The main idea is this: these models are too big to run on standard hardware without help. Quantization is that help.

    Quantization allows independent researchers and hobbyists to run large models on personal computers by shrinking the size of the model without severely impacting performance. In this guide, we’ll explore how quantization works, what different precision formats mean, and then walk through quantizing a sample FP16 model into GGUF format and uploading it to Hugging Face.

    What Is Quantization?

    At a very basic level, quantization is about making a model smaller without breaking it. Large language models are made up of billions of numerical values called weights. These numbers control how strongly different parts of the network influence one another when producing an output. By default, these weights are stored in high-precision formats such as FP32 or FP16, so every number takes up a lot of memory, and with billions of them, things get out of hand very quickly. Take a single number like 2.31384. In FP32, that one number alone uses 32 bits (4 bytes) of memory. Now imagine storing billions of numbers like that. This is why a 7B model can easily take around 28 GB in FP32 and about 14 GB even in FP16. For most laptops and GPUs, that’s already too much.
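The arithmetic behind those figures is easy to sketch. The helper below is a hypothetical name, not part of any library: it multiplies the parameter count by the bits per weight and reports decimal gigabytes. It counts the weights only and ignores activations, the KV cache, and the small per-block scale factors that real quantized formats add.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


SEVEN_BILLION = 7e9  # parameter count of a "7B" model
for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gb = estimate_weight_size_gb(SEVEN_BILLION, bits)
    print(f"{label:>5}: {size_gb:.1f} GB")
```

For a 7B model this gives 28.0, 14.0, 7.0, and 3.5 GB, which matches the rough numbers quoted in this section once a little overhead is added to the 4-bit case.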

    Quantization fixes this by saying: we don’t really need that much precision. Instead of storing 2.31384 exactly, we store something close to it using fewer bits. Maybe it becomes 2.3, or a nearby integer value under the hood. The number is slightly less accurate, but the model still behaves almost the same in practice. Neural networks can tolerate these small errors because the final output depends on billions of calculations, not a single number. Small differences average out, much like image compression reduces file size without ruining how the image looks. And the payoff is huge. A model that needs 14 GB in FP16 can often run in about 7 GB with 8-bit quantization, or even around 4 GB with 4-bit quantization. This is what makes it possible to run large language models locally instead of relying on expensive servers.
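To make "a nearby integer value under the hood" concrete, here is a minimal sketch of one common scheme, symmetric absmax quantization, in plain Python. The function names and the sample weights are made up for illustration. This is not the exact algorithm llama.cpp runs; types such as Q8_0 apply a scale of this kind per small block of weights rather than to a whole list at once.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [2.31384, -0.5021, 0.0173, -1.998]  # illustrative values only
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("integers:", q)
print("largest round-trip error:", max_error)
```

Each weight is now stored as a small integer plus one shared scale, and the round-trip error per weight is bounded by half the scale. That bound is why the loss in accuracy stays small.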

    After quantizing, we often store the model in a unified file format. One popular format is GGUF, created by Georgi Gerganov (author of llama.cpp). GGUF is a single-file format that includes both the quantized weights and useful metadata. It’s optimized for quick loading and inference on CPUs, low-end GPUs, and other lightweight runtimes, and it supports multiple quantization types (such as Q4_0 and Q8_0). Hopefully, this clarifies both the concept and the motivation behind quantization. Now let’s move on to writing some code.

    Step-by-Step: Quantizing a Model to GGUF

    1. Installing Dependencies and Logging in to Hugging Face

    Before downloading or converting any model, we need to install the required Python packages and authenticate with Hugging Face. We’ll use huggingface_hub, Transformers, and SentencePiece. Logging in ensures we can access public or gated models without errors:

    !pip install -U huggingface_hub transformers sentencepiece -q

    from huggingface_hub import login

    # Prompts for a Hugging Face access token
    login()
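If you run the notebook unattended, an interactive prompt gets in the way. A possible alternative, sketched below, reads the token from the HF_TOKEN environment variable and passes it to login(token=...). The two helper names are hypothetical, and huggingface_hub is imported inside the function so the rest of the block works without it.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_to_hub():
    """Log in with HF_TOKEN when available, else fall back to the prompt."""
    from huggingface_hub import login  # imported lazily; needs the package installed

    token = resolve_hf_token()
    if token:
        login(token=token)
    else:
        login()
```

Calling login_to_hub() then behaves like the plain login() call above, except that it skips the prompt when a token is already set in the environment.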

    2. Downloading a Pre-trained Model

    We’ll pick a small FP16 model from Hugging Face. Here we use TinyLlama 1.1B, which is small enough to run in Colab but still offers a good demonstration. Using Python, we can download it with huggingface_hub:

    from huggingface_hub import snapshot_download

    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

    # Download every file in the repository into ./model_folder
    snapshot_download(
        repo_id=model_id,
        local_dir="model_folder",
        local_dir_use_symlinks=False
    )

    This command saves the model files into the model_folder directory. You can replace model_id with any Hugging Face model ID that you want to quantize. (If needed, you can also use AutoModel.from_pretrained with torch.float16 to load it first, but snapshot_download is simple for grabbing the files.)
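Before converting, it can help to confirm that the download actually landed on disk. The sketch below uses only the standard library; summarize_folder is a hypothetical helper name. It counts every regular file under the folder, including any hidden metadata the downloader may have written.

```python
from pathlib import Path


def summarize_folder(folder):
    """Return (file_count, total_bytes) for all regular files under folder."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes
```

After the download, summarize_folder("model_folder") should report a total in the region of 2.2 GB for a 1.1B-parameter model stored in 16-bit precision, since each weight takes 2 bytes.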

    3. Setting Up the Conversion Tools

    Next, we clone the llama.cpp repository, which contains the conversion scripts. In Colab:

    !git clone https://github.com/ggml-org/llama.cpp
    !pip install -r llama.cpp/requirements.txt -q

    This gives you access to convert_hf_to_gguf.py. The Python requirements ensure you have all the libraries needed to run the script.

    4. Converting the Model to GGUF with Quantization

    Now, run the conversion script, specifying the input folder, output filename, and quantization type. We’ll use q8_0 (8-bit quantization). This will roughly halve the memory footprint of the model compared with FP16:

    !python3 llama.cpp/convert_hf_to_gguf.py /content/model_folder \
        --outfile /content/tinyllama-1.1b-chat.Q8_0.gguf \
        --outtype q8_0

    Here /content/model_folder is where we downloaded the model, /content/tinyllama-1.1b-chat.Q8_0.gguf is the output GGUF file, and the --outtype q8_0 flag means “quantize to 8-bit.” The script loads the FP16 weights, converts them into 8-bit values, and writes a single GGUF file. This file is now much smaller and ready for inference with GGUF-compatible tools.
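Outside a notebook, the same command can be assembled in Python and handed to subprocess. The sketch below only builds and prints the argument list. build_convert_command is a hypothetical helper, and the script path assumes llama.cpp was cloned into the current directory as in step 3.

```python
import shlex


def build_convert_command(model_dir, outfile, outtype="q8_0",
                          script="llama.cpp/convert_hf_to_gguf.py"):
    """Return the argument list for the llama.cpp conversion script."""
    return [
        "python3", script, model_dir,
        "--outfile", outfile,
        "--outtype", outtype,
    ]


cmd = build_convert_command(
    "/content/model_folder",
    "/content/tinyllama-1.1b-chat.Q8_0.gguf",
)
# Print the command in copy-pasteable shell form
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) would then run the conversion and raise an error if the script exits with a non-zero status.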

    Output:

    INFO:gguf.gguf_writer:Writing the following files:
    INFO:gguf.gguf_writer:/content/tinyllama-1.1b-chat.Q8_0.gguf: n_tensors = 201, total_size = 1.2G
    Writing: 100% 1.17G/1.17G [00:26<00:00, 44.5Mbyte/s]
    INFO:hf-to-gguf:Model successfully exported to /content/tinyllama-1.1b-chat.Q8_0.gguf

    You can verify the output:

    !ls -lh /content/tinyllama-1.1b-chat.Q8_0.gguf

    You should see a file of about 1.1 GB, roughly half the size of the original FP16 model:

    -rw-r--r-- 1 root root 1.1G Dec 30 20:23 /content/tinyllama-1.1b-chat.Q8_0.gguf
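Beyond checking the size, you can peek at the file header to confirm it really is GGUF. A GGUF file starts with the 4-byte magic GGUF, followed by a version number, the tensor count, and the metadata entry count. The sketch below assumes a little-endian file using the version 2 or later layout, where both counts are 64-bit. read_gguf_header is a hypothetical helper, not a llama.cpp API.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes little-endian, version 2+)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # uint32 version, uint64 tensor count, uint64 metadata key/value count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "n_tensors": n_tensors, "n_metadata_kv": n_kv}
```

For the file produced above, the n_tensors value should agree with the n_tensors = 201 line in the conversion log.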

    5. Uploading the Quantized Model to Hugging Face

    Finally, you can publish the GGUF model with the huggingface_hub Python library so others can easily download and use it:

    from huggingface_hub import HfApi

    api = HfApi()
    repo_id = "kanwal-mehreen18/tinyllama-1.1b-gguf"

    # Create the repository if it does not exist yet
    api.create_repo(repo_id, exist_ok=True)

    # Upload the GGUF file to the root of the repository
    api.upload_file(
        path_or_fileobj="/content/tinyllama-1.1b-chat.Q8_0.gguf",
        path_in_repo="tinyllama-1.1b-chat.Q8_0.gguf",
        repo_id=repo_id
    )

    This creates a new repository (if it doesn’t exist) and uploads your quantized GGUF file. Anyone can now load it with llama.cpp, llama-cpp-python, or Ollama. You can access the quantized GGUF file that we created here.

    Wrapping Up

    By following the steps above, you can take any supported Hugging Face model, quantize it (8-bit as shown here; 4-bit types such as Q4_0 go through llama.cpp’s separate llama-quantize tool), and save it as GGUF. Then push it to Hugging Face to share or deploy. This makes it easier than ever to compress and use large language models on everyday hardware.
