LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?

By Yasmin Bhatti | March 3, 2026 | 15 Mins Read


In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

    • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
    • How these representations compare on text classification performance and training speed.
    • How they behave differently for unsupervised document clustering.

    Let's get right to it.

LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?
    Image by Author

    Introduction

Machine learning models built with frameworks like scikit-learn can accommodate unstructured data like text, as long as the raw text is converted into a numerical representation that is understandable by algorithms, models, and machines in a broader sense.

This article takes three well-known text representation approaches — TF-IDF, Bag-of-Words, and LLM-generated embeddings — to offer an analytical and example-based comparison between them, in the context of downstream machine learning modeling with scikit-learn.

For a glimpse of text representation approaches, including an introduction to the three used in this article, we recommend you take a look at this article and this one.

The article will first walk you through a Python example where we will use the BBC news dataset — a labeled dataset containing a couple of thousand news articles categorized into five types — to obtain the three target representations for each text, build some text classifiers and compare them, and also build and compare some clustering models. After that, we adopt a more general and analytical perspective to discuss which approach is better — and when to use one or another.

Setup and Getting Text Representations

First, we import all the modules and libraries we will need, set up some configurations, and load the BBC news dataset:


import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from time import time

    # Scikit-learn imports
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.cluster import KMeans
    from sklearn.metrics import (
        accuracy_score, f1_score, classification_report,
        silhouette_score, adjusted_rand_score
    )
    from sklearn.preprocessing import LabelEncoder

    # Our key import for building LLM embeddings: a Sentence Transformer model
    from sentence_transformers import SentenceTransformer

    # Plotting configuration - for later analyzing and comparing results
    sns.set_style("whitegrid")
    plt.rcParams['figure.figsize'] = (14, 6)

    # Loading the BBC News dataset
    print("Loading BBC News dataset...")
    url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
    df = pd.read_csv(url)

    print(f"Dataset loaded: {len(df)} documents")
    print(f"Categories: {df['category'].unique()}")
    print("\nClass distribution:")
    print(df['category'].value_counts())

At the time of writing, the dataset version we are using contains 2225 instances, that is, documents containing news articles.

    Since we will train some supervised machine learning models for classification later on, before obtaining the three representations for our text data, we separate the input texts from their labels and split the whole dataset into training and test subsets:


print("\n" + "="*70)
    print("DATA PREPARATION PRIOR TO GENERATING TEXT REPRESENTATIONS")
    print("="*70)

    texts = df['text'].tolist()
    labels = df['category'].tolist()

    # Encoding labels for classification
    le = LabelEncoder()
    y = le.fit_transform(labels)

    # Splitting data (same split for all representation methods and ML models trained later)
    X_text_train, X_text_test, y_train, y_test = train_test_split(
        texts, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"\nTrain set: {len(X_text_train)} | Test set: {len(X_text_test)}")

Representation 1: Bag-of-Words (BoW)


print("\n[1] Bag-of-Words...")
    start = time()

    # The CountVectorizer class is used to apply BoW
    bow_vectorizer = CountVectorizer(
        max_features=5000,
        min_df=2,
        stop_words='english'
    )

    X_bow_train = bow_vectorizer.fit_transform(X_text_train)
    X_bow_test = bow_vectorizer.transform(X_text_test)

    bow_time = time() - start

    print(f"   Completed in {bow_time:.2f}s")
    print(f"   Shape: {X_bow_train.shape} (documents × vocabulary)")
    print(f"   Sparsity: {(1 - X_bow_train.nnz / (X_bow_train.shape[0] * X_bow_train.shape[1])) * 100:.1f}%")
    print(f"   Memory: {X_bow_train.data.nbytes / 1024:.1f} KB")

Representation 2: TF-IDF


print("\n[2] TF-IDF...")
    start = time()

    # Using the TfidfVectorizer class to apply TF-IDF based on word frequencies
    tfidf_vectorizer = TfidfVectorizer(
        max_features=5000,
        min_df=2,
        stop_words='english'
    )

    X_tfidf_train = tfidf_vectorizer.fit_transform(X_text_train)
    X_tfidf_test = tfidf_vectorizer.transform(X_text_test)

    tfidf_time = time() - start

    print(f"   Completed in {tfidf_time:.2f}s")
    print(f"   Shape: {X_tfidf_train.shape}")
    print(f"   Sparsity: {(1 - X_tfidf_train.nnz / (X_tfidf_train.shape[0] * X_tfidf_train.shape[1])) * 100:.1f}%")
    print(f"   Memory: {X_tfidf_train.data.nbytes / 1024:.1f} KB")

Representation 3: LLM Embeddings


print("\n[3] LLM Embeddings...")
    start = time()

    # Loading a pre-trained sentence transformer model to generate 384-dimensional embeddings
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    X_emb_train = embedding_model.encode(
        X_text_train,
        show_progress_bar=True,
        batch_size=32
    )
    X_emb_test = embedding_model.encode(
        X_text_test,
        show_progress_bar=False,
        batch_size=32
    )

    emb_time = time() - start

    print(f"   Completed in {emb_time:.2f}s")
    print(f"   Shape: {X_emb_train.shape} (documents × embedding_dim)")
    print(f"   Sparsity: 0.0% (dense representation)")
    print(f"   Memory: {X_emb_train.nbytes / 1024:.1f} KB")

Comparison 1: Text Classification

    That was an intensive preparatory stage! Now we are ready for a first comparison example, focused on training several types of machine learning classifiers and evaluating how each type of classifier performs when trained on one text representation or another.

    In a nutshell, the code provided below will:

    1. Consider three classifier types: logistic regression, random forests, and support vector machines (SVM).
    2. Train and evaluate each of the 3×3 = 9 classifiers, using two evaluation metrics: accuracy and F1 score.
    3. List and visualize the results obtained from each model type and text representation approach used.


print("\n" + "="*70)
    print("COMPARISON 1: SUPERVISED CLASSIFICATION")
    print("="*70)

    # Defining the three types of classifiers to train
    classifiers = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'SVM': SVC(kernel='linear', random_state=42)
    }

    # Storing results in a Python collection (list)
    classification_results = []

    # Evaluating each representation with each classifier
    representations = {
        'BoW': (X_bow_train, X_bow_test),
        'TF-IDF': (X_tfidf_train, X_tfidf_test),
        'LLM Embeddings': (X_emb_train, X_emb_test)
    }

    for rep_name, (X_tr, X_te) in representations.items():
        print(f"\nTesting {rep_name}:")
        print("-" * 50)

        for clf_name, clf in classifiers.items():
            # Train
            start = time()
            clf.fit(X_tr, y_train)
            train_time = time() - start

            # Predict
            start = time()
            y_pred = clf.predict(X_te)
            pred_time = time() - start

            # Evaluate
            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')

            print(f"   {clf_name:20s} | Acc: {acc:.3f} | F1: {f1:.3f} | Train: {train_time:.2f}s")

            classification_results.append({
                'Representation': rep_name,
                'Classifier': clf_name,
                'Accuracy': acc,
                'F1-Score': f1,
                'Train Time': train_time,
                'Predict Time': pred_time
            })

    # Converting results to a DataFrame for interpretability and easier comparison
    results_df = pd.DataFrame(classification_results)

    Output:


======================================================================

    COMPARISON 1: SUPERVISED CLASSIFICATION

    ======================================================================

    Testing BoW:
    --------------------------------------------------
       Logistic Regression  | Acc: 0.982 | F1: 0.982 | Train: 0.86s
       Random Forest        | Acc: 0.973 | F1: 0.973 | Train: 2.20s
       SVM                  | Acc: 0.984 | F1: 0.984 | Train: 2.02s

    Testing TF-IDF:
    --------------------------------------------------
       Logistic Regression  | Acc: 0.984 | F1: 0.984 | Train: 0.52s
       Random Forest        | Acc: 0.978 | F1: 0.977 | Train: 1.79s
       SVM                  | Acc: 0.987 | F1: 0.987 | Train: 2.99s

    Testing LLM Embeddings:
    --------------------------------------------------
       Logistic Regression  | Acc: 0.982 | F1: 0.982 | Train: 0.27s
       Random Forest        | Acc: 0.960 | F1: 0.959 | Train: 5.21s
       SVM                  | Acc: 0.980 | F1: 0.980 | Train: 0.15s

Code for visualizing the results:


# Creating visualization plots for direct comparison
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Plot 1: Accuracy comparison
    pivot_acc = results_df.pivot(index='Classifier', columns='Representation', values='Accuracy')
    pivot_acc.plot(kind='bar', ax=axes[0], width=0.8)
    axes[0].set_title('Classification Accuracy by Representation', fontsize=14, fontweight='bold')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_xlabel('Classifier')
    axes[0].legend(title='Representation')
    axes[0].grid(axis='y', alpha=0.3)
    axes[0].set_ylim([0.9, 1.0])

    # Plot 2: Training time comparison
    pivot_time = results_df.pivot(index='Classifier', columns='Representation', values='Train Time')
    pivot_time.plot(kind='bar', ax=axes[1], width=0.8, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
    axes[1].set_title('Training Time by Representation', fontsize=14, fontweight='bold')
    axes[1].set_ylabel('Time (seconds)')
    axes[1].set_xlabel('Classifier')
    axes[1].legend(title='Representation')
    axes[1].grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Identifying the best performers
    print("\nBEST PERFORMERS:")
    print("-" * 50)
    best_acc = results_df.loc[results_df['Accuracy'].idxmax()]
    print(f"Best Accuracy: {best_acc['Representation']} + {best_acc['Classifier']} = {best_acc['Accuracy']:.3f}")

    fastest = results_df.loc[results_df['Train Time'].idxmin()]
    print(f"Fastest Training: {fastest['Representation']} + {fastest['Classifier']} = {fastest['Train Time']:.2f}s")

    Comparing classifiers trained on different text representations

Let's take these results with a pinch of salt, as they are specific to the dataset and model types trained, and by no means generalizable. TF-IDF combined with an SVM classifier led to the best accuracy (0.987), while LLM embeddings with SVM yielded the fastest model to train (0.15s). Meanwhile, the best overall combination in terms of performance-speed balance is logistic regression with TF-IDF, with a nearly perfect accuracy of 0.984 and a very fast training time of 0.52s.

Why did LLM embeddings, supposedly the most advanced of the three text representation approaches, not show the best performance? There are several reasons for this. First, the five existing classes (news categories) in the BBC news dataset are strongly word-discriminative; in other words, they are easily separable by class, so rather simpler representations like TF-IDF are enough to capture these patterns very well. This also implies there is no need for the deep semantic understanding that LLM embeddings achieve; in fact, this can sometimes be counterproductive and lead to overfitting. In addition, because of the near separability between news types, linear and simpler models work great compared to complex ones like random forests.

If we had a harder, real-world dataset than BBC news, with issues like noise, paraphrasing, slang, or even cross-lingual data, LLM embeddings would most likely outperform the other two representations.

Regarding Bag-of-Words, in this scenario it only marginally outperforms in terms of inference speed, so it is mainly recommended for very simple tasks requiring maximum interpretability, or as part of a baseline model before trying other methods.

Comparison 2: Document Clustering

We will consider a second scenario: applying k-means clustering with k=5 and evaluating the cluster quality across the three text representation schemes. Notice in the code below that, since clustering is an unsupervised task not requiring labels or train-test splitting, we re-generate all three representations for the whole dataset.
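Before running it, a quick note on the adjusted Rand index (ARI) we use below to evaluate clusters against the true categories: ARI only cares about which documents end up grouped together, not what the clusters are called. A minimal sketch on made-up labelings:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping, different cluster IDs: ARI ignores cluster naming
renamed     = [2, 2, 0, 0, 1, 1]

# One document assigned to the wrong group
one_mistake = [0, 0, 1, 2, 2, 2]

print(adjusted_rand_score(true_labels, renamed))      # 1.0: perfect agreement
print(adjusted_rand_score(true_labels, one_mistake))  # below 1.0
```

The silhouette score, by contrast, needs no true labels at all: it measures how geometrically compact and well-separated the clusters are in the feature space itself, which is why the two metrics can disagree below.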


print("\n" + "="*70)
    print("COMPARISON 2: DOCUMENT CLUSTERING")
    print("="*70)

    # Using the full dataset for clustering (no train/test split needed)
    all_texts = texts
    all_labels = y

    # Generating representations once more
    print("\nGenerating representations for full dataset...")

    X_bow_full = bow_vectorizer.fit_transform(all_texts)
    X_tfidf_full = tfidf_vectorizer.fit_transform(all_texts)
    X_emb_full = embedding_model.encode(all_texts, show_progress_bar=True, batch_size=32)

    # Clustering with K-Means (k=5, matching ground-truth categories)
    n_clusters = len(np.unique(all_labels))
    clustering_results = []

    representations_full = {
        'BoW': X_bow_full,
        'TF-IDF': X_tfidf_full,
        'LLM Embeddings': X_emb_full
    }

    for rep_name, X_full in representations_full.items():
        print(f"\nClustering with {rep_name}:")

        start = time()
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(X_full)
        cluster_time = time() - start

        # Evaluate
        silhouette = silhouette_score(X_full, cluster_labels)
        ari = adjusted_rand_score(all_labels, cluster_labels)

        print(f"   Silhouette Score: {silhouette:.3f}")
        print(f"   Adjusted Rand Index: {ari:.3f}")
        print(f"   Time: {cluster_time:.2f}s")

        clustering_results.append({
            'Representation': rep_name,
            'Silhouette': silhouette,
            'ARI': ari,
            'Time': cluster_time
        })

    clustering_df = pd.DataFrame(clustering_results)

    Output:

Clustering with BoW:
       Silhouette Score: 0.124
       Adjusted Rand Index: 0.102
       Time: 1.19s

    Clustering with TF-IDF:
       Silhouette Score: 0.016
       Adjusted Rand Index: 0.698
       Time: 0.94s

    Clustering with LLM Embeddings:
       Silhouette Score: 0.066
       Adjusted Rand Index: 0.899
       Time: 0.41s

Code for visualizing the results:


# Creating comparison plots
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Clustering quality metrics
    x = np.arange(len(clustering_df))
    width = 0.35

    axes[0].bar(x - width/2, clustering_df['Silhouette'], width, label='Silhouette', alpha=0.8)
    axes[0].bar(x + width/2, clustering_df['ARI'], width, label='Adjusted Rand Index', alpha=0.8)
    axes[0].set_xlabel('Representation')
    axes[0].set_ylabel('Score')
    axes[0].set_title('Clustering Quality Metrics', fontsize=14, fontweight='bold')
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(clustering_df['Representation'])
    axes[0].legend()
    axes[0].grid(axis='y', alpha=0.3)

    # Plot 2: Clustering time
    axes[1].bar(clustering_df['Representation'], clustering_df['Time'], color=['#1f77b4', '#ff7f0e', '#2ca02c'], alpha=0.8)
    axes[1].set_xlabel('Representation')
    axes[1].set_ylabel('Time (seconds)')
    axes[1].set_title('Clustering Computation Time', fontsize=14, fontweight='bold')
    axes[1].grid(axis='y', alpha=0.3)

    plt.tight_layout()
    plt.show()

    print("\nBEST CLUSTERING PERFORMER:")
    print("-" * 50)
    best_cluster = clustering_df.loc[clustering_df['ARI'].idxmax()]
    print(f"{best_cluster['Representation']}: ARI = {best_cluster['ARI']:.3f}, Silhouette = {best_cluster['Silhouette']:.3f}")

    Clustering results with three text representations

LLM embeddings won this time, with an ARI score of 0.899, showing strong alignment between the clusters found and the real subgroups defined by the true document categories. This is largely because clustering is an unsupervised learning task and, unlike classification, it is a territory where semantic understanding like that provided by embeddings becomes much more important for capturing patterns, even on simpler datasets.
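One way to see why lexical representations hit a ceiling here: two documents that share no words have exactly zero TF-IDF cosine similarity, no matter how close their meaning is, so k-means over TF-IDF cannot place them in the same cluster on semantic grounds. A minimal sketch of the lexical half (with a made-up paraphrase pair):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Paraphrases with no shared vocabulary
docs = ["the automobile accelerates quickly", "that car speeds up fast"]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
print(f"TF-IDF cosine similarity: {sim:.2f}")  # 0.00: no overlapping words
# A sentence-transformer embedding of the same pair would typically score
# well above zero, which is what lets k-means group paraphrases together.
```

This gap matters little for BBC news, where each category has its own strong vocabulary, but it is exactly where dense embeddings earn their keep.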

Summary

Simpler, well-behaved datasets like BBC news are a great example of a problem where advanced, LLM-based representations like embeddings do not always win. Traditional natural language processing approaches for text representation may excel in problems with clear class boundaries, linear separability, and clean, formal text without noisy patterns.

In sum, when addressing real-world machine learning projects, consider always starting with simpler baselines and keyword-based representations like TF-IDF before jumping straight into state-of-the-art or the most advanced methods. The simpler your problem, the lighter the outfit you need to dress it in for that sharp machine learning look!
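That baseline-first advice maps to very little scikit-learn code: a Pipeline chaining TfidfVectorizer with LogisticRegression is usually the right first model to beat. A minimal sketch (the training texts and labels below are invented stand-ins for a real labeled corpus):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A quick keyword-based baseline: fit once, then call it on raw text directly
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, min_df=1, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Tiny made-up training set
texts = ["shares rally on strong profits", "market opens higher on earnings",
         "striker scores twice in derby", "late goal wins the league match"]
labels = ["business", "business", "sport", "sport"]

baseline.fit(texts, labels)
print(baseline.predict(["quarterly profits beat the market forecast"]))
```

Only once this kind of pipeline has been measured on your data is it worth paying the extra cost of embedding models.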
