LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?

In this article, you'll learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we'll cover include:

How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
How these representations compare on text classification performance and training speed.
How they behave differently for unsupervised document clustering.

Let's get right to it.

[Figure: LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn? Image by Author]

Introduction

Machine learning models built with frameworks like scikit-learn can accommodate unstructured data like text, as long as the raw text is first converted into a numerical representation that algorithms can process.

This article takes three well-known text representation approaches (TF-IDF, Bag-of-Words, and LLM-generated embeddings) and provides an analytical, example-based comparison between them in the context of downstream machine learning modeling with scikit-learn.

For a glimpse of text representation approaches, including an introduction to the three used in this article, we recommend you take a look at this article and this one.

The article will first walk you through a Python example in which we use the BBC news dataset, a labeled dataset containing a couple of thousand news articles categorized into five types, to obtain the three target representations for each text, build and compare some text classifiers, and build and compare some clustering models. After that, we adopt a more general and analytical perspective to discuss which approach is better, and when to use one or another.

Setup and Getting Text Representations

First, we import all the modules and libraries we'll need, set up some configurations, and load the BBC news dataset:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# Scikit-learn imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score, f1_score, classification_report,
    silhouette_score, adjusted_rand_score
)
from sklearn.preprocessing import LabelEncoder

# Our key import for building LLM embeddings: a Sentence Transformer model
from sentence_transformers import SentenceTransformer

# Plotting configuration, used later when analyzing and comparing results
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# Loading the BBC News dataset
print("Loading BBC News dataset...")
url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
df = pd.read_csv(url)

print(f"Dataset loaded: {len(df)} documents")
print(f"Categories: {df['category'].unique()}")
print("\nClass distribution:")
print(df['category'].value_counts())

At the time of writing, the dataset version we're using contains 2225 instances, that is, documents containing news articles.

Since we'll train some supervised machine learning models for classification later on, we separate the input texts from their labels and split the whole dataset into training and test subsets before obtaining the three representations for our text data:


print("\n" + "=" * 70)
print("DATA PREPARATION PRIOR TO GENERATING TEXT REPRESENTATIONS")
print("=" * 70)

texts = df['text'].tolist()
labels = df['category'].tolist()

# Encoding labels for classification
le = LabelEncoder()
y = le.fit_transform(labels)

# Splitting data (same split for all representation methods and ML models trained later)
X_text_train, X_text_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTrain set: {len(X_text_train)} | Test set: {len(X_text_test)}")

Representation 1: Bag-of-Words (BoW)


print("\n[1] Bag-of-Words...")
start = time()

# The CountVectorizer class is used to apply BoW
bow_vectorizer = CountVectorizer(
    max_features=5000,
    min_df=2,
    stop_words='english'
)
X_bow_train = bow_vectorizer.fit_transform(X_text_train)
X_bow_test = bow_vectorizer.transform(X_text_test)
bow_time = time() - start

print(f"  Done in {bow_time:.2f}s")
print(f"  Shape: {X_bow_train.shape} (documents × vocabulary)")
print(f"  Sparsity: {(1 - X_bow_train.nnz / (X_bow_train.shape[0] * X_bow_train.shape[1])) * 100:.1f}%")
print(f"  Memory: {X_bow_train.data.nbytes / 1024:.1f} KB")

Representation 2: TF-IDF


print("\n[2] TF-IDF...")
start = time()

# Using the TfidfVectorizer class to apply TF-IDF based on word frequencies
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    stop_words='english'
)
X_tfidf_train = tfidf_vectorizer.fit_transform(X_text_train)
X_tfidf_test = tfidf_vectorizer.transform(X_text_test)
tfidf_time = time() - start

print(f"  Done in {tfidf_time:.2f}s")
print(f"  Shape: {X_tfidf_train.shape}")
print(f"  Sparsity: {(1 - X_tfidf_train.nnz / (X_tfidf_train.shape[0] * X_tfidf_train.shape[1])) * 100:.1f}%")
print(f"  Memory: {X_tfidf_train.data.nbytes / 1024:.1f} KB")
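For readers curious about what TfidfVectorizer computes, the sketch below rebuilds its output by hand on a hypothetical toy corpus. It assumes scikit-learn's default settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and compares the manual result with the library's.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the BBC dataset
toy_docs = [
    "goal goal match",
    "match market",
    "market shares market",
]

# Step 1: raw term counts (the Bag-of-Words matrix)
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

# Step 2: smoothed inverse document frequency, as in scikit-learn's defaults
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts and L2-normalize each document vector
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual, reference))
```

Because each row is scaled to unit length, long and short documents end up on a comparable scale, which is one reason linear classifiers tend to behave well on TF-IDF features.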

Representation 3: LLM Embeddings


print("\n[3] LLM Embeddings...")
start = time()

# Loading a pre-trained sentence transformer model to generate 384-dimensional embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
X_emb_train = embedding_model.encode(
    X_text_train,
    show_progress_bar=True,
    batch_size=32
)
X_emb_test = embedding_model.encode(
    X_text_test,
    show_progress_bar=False,
    batch_size=32
)
emb_time = time() - start

print(f"  Done in {emb_time:.2f}s")
print(f"  Shape: {X_emb_train.shape} (documents × embedding_dim)")
print(f"  Sparsity: 0.0% (dense representation)")
print(f"  Memory: {X_emb_train.nbytes / 1024:.1f} KB")
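One caveat when reading the memory figures printed above: for the sparse matrices, the code measures only the stored values (.data), while a SciPy CSR matrix also keeps column indices and row pointers. The sketch below is one possible way to account for all three arrays. The helper name csr_memory_kb and the matrix shapes are hypothetical and chosen purely for illustration.

```python
import numpy as np
from scipy import sparse


def csr_memory_kb(matrix):
    """Total KB used by a CSR matrix: values, column indices, and row pointers."""
    total_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
    return total_bytes / 1024


# Hypothetical stand-ins for the article's matrices (random data, illustrative shapes)
toy_sparse = sparse.random(1000, 5000, density=0.02, format="csr", random_state=0)
toy_dense = np.zeros((1000, 384), dtype=np.float32)

print(f"Sparse, values only: {toy_sparse.data.nbytes / 1024:.1f} KB")
print(f"Sparse, full CSR:    {csr_memory_kb(toy_sparse):.1f} KB")
print(f"Dense float32:       {toy_dense.nbytes / 1024:.1f} KB")
```

Applied to X_bow_train or X_tfidf_train, a helper like this would give a fairer comparison against the dense embedding matrix.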

Comparison 1: Text Classification

That was an extensive preparatory stage! Now we're ready for a first comparison example, focused on training several types of machine learning classifiers and comparing how each one performs when trained on one text representation or another.

In a nutshell, the code provided below will:

Consider three classifier types: logistic regression, random forests, and support vector machines (SVM).
Train and evaluate each of the 3×3 = 9 classifier-representation combinations, using two evaluation metrics: accuracy and F1 score.
List and visualize the results obtained from each model type and text representation approach used.


print("\n" + "=" * 70)
print("COMPARISON 1: SUPERVISED CLASSIFICATION")
print("=" * 70)

# Defining the three types of classifiers to train
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42)
}

# Storing results in a Python list
classification_results = []

# Evaluating each representation with each classifier
representations = {
    'BoW': (X_bow_train, X_bow_test),
    'TF-IDF': (X_tfidf_train, X_tfidf_test),
    'LLM Embeddings': (X_emb_train, X_emb_test)
}

for rep_name, (X_tr, X_te) in representations.items():
    print(f"\nTesting {rep_name}:")
    print("-" * 50)

    for clf_name, clf in classifiers.items():
        # Train
        start = time()
        clf.fit(X_tr, y_train)
        train_time = time() - start

        # Predict
        start = time()
        y_pred = clf.predict(X_te)
        pred_time = time() - start

        # Evaluate
        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        print(f"  {clf_name:20s} | Acc: {acc:.3f} | F1: {f1:.3f} | Train: {train_time:.2f}s")

        classification_results.append({
            'Representation': rep_name,
            'Classifier': clf_name,
            'Accuracy': acc,
            'F1-Score': f1,
            'Train Time': train_time,
            'Predict Time': pred_time
        })

# Converting results to a DataFrame for interpretability and easier comparison
results_df = pd.DataFrame(classification_results)

Output:


======================================================================
COMPARISON 1: SUPERVISED CLASSIFICATION
======================================================================

Testing BoW:
--------------------------------------------------
  Logistic Regression  | Acc: 0.982 | F1: 0.982 | Train: 0.86s
  Random Forest        | Acc: 0.973 | F1: 0.973 | Train: 2.20s
  SVM                  | Acc: 0.984 | F1: 0.984 | Train: 2.02s

Testing TF-IDF:
--------------------------------------------------
  Logistic Regression  | Acc: 0.984 | F1: 0.984 | Train: 0.52s
  Random Forest        | Acc: 0.978 | F1: 0.977 | Train: 1.79s
  SVM                  | Acc: 0.987 | F1: 0.987 | Train: 2.99s

Testing LLM Embeddings:
--------------------------------------------------
  Logistic Regression  | Acc: 0.982 | F1: 0.982 | Train: 0.27s
  Random Forest        | Acc: 0.960 | F1: 0.959 | Train: 5.21s
  SVM                  | Acc: 0.980 | F1: 0.980 | Train: 0.15s

Code for visualizing results:


# Creating visualization plots for direct comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Accuracy comparison
pivot_acc = results_df.pivot(index='Classifier', columns='Representation', values='Accuracy')
pivot_acc.plot(kind='bar', ax=axes[0], width=0.8)
axes[0].set_title('Classification Accuracy by Representation', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Accuracy')
axes[0].set_xlabel('Classifier')
axes[0].legend(title='Representation')
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim([0.9, 1.0])

# Plot 2: Training time comparison
pivot_time = results_df.pivot(index='Classifier', columns='Representation', values='Train Time')
pivot_time.plot(kind='bar', ax=axes[1], width=0.8, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[1].set_title('Training Time by Representation', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_xlabel('Classifier')
axes[1].legend(title='Representation')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Identifying best performers
print("\nBEST PERFORMERS:")
print("-" * 50)

best_acc = results_df.loc[results_df['Accuracy'].idxmax()]
print(f"Best Accuracy: {best_acc['Representation']} + {best_acc['Classifier']} = {best_acc['Accuracy']:.3f}")

fastest = results_df.loc[results_df['Train Time'].idxmin()]
print(f"Fastest Training: {fastest['Representation']} + {fastest['Classifier']} = {fastest['Train Time']:.2f}s")

Let's take these results with a pinch of salt, as they're specific to the dataset and model types trained, and by no means generalizable. TF-IDF combined with an SVM classifier led to the best accuracy (0.987), while LLM embeddings with SVM yielded the fastest model to train (0.15s). Meanwhile, the best overall combination in terms of performance-speed balance is logistic regression with TF-IDF, with a nearly perfect accuracy of 0.984 and a very fast training time of 0.52s. Note that these training times cover only the classifier fit, not the time spent generating each representation.

Why did LLM embeddings, supposedly the most advanced of the three text representation approaches, not deliver the best performance? There are several reasons. First, the five classes (news categories) in the BBC news dataset are strongly word-discriminative; in other words, they're easily separable by vocabulary, so comparatively simpler representations like TF-IDF are enough to capture these patterns very well. This also means there's little need for the deep semantic understanding that LLM embeddings provide; in fact, it can sometimes be counterproductive and lead to overfitting. In addition, because the news types are nearly separable, linear and simpler models work great compared to complex ones like random forests.

If we had a more challenging, real-world dataset than BBC news, with issues like noise, paraphrasing, slang, or even cross-lingual data, LLM embeddings would probably outperform the other two representations.

Regarding Bag-of-Words, in this scenario it only marginally outperforms in terms of inference speed, so it's mainly recommended for very simple tasks requiring maximum interpretability, or as part of a baseline model before trying other techniques.
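Since a single train/test split yields only one estimate per combination, one way to check how stable such numbers are is cross-validation. The sketch below is an illustration under stated assumptions: it runs on a hypothetical toy corpus (invented sentences with labels 0 and 1), and wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is refit inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six "sport" and six "business" sentences
sport = [
    "the team scored a late goal",
    "the striker missed the match",
    "coach picks team for the cup match",
    "goal keeper injured before the match",
    "team wins cup after late goal",
    "striker scored twice for the team",
]
business = [
    "shares fell as the market closed",
    "bank raised rates again",
    "investors sold shares of the bank",
    "market rallied after rates decision",
    "bank profits lifted the market",
    "investors bought shares as rates fell",
]
toy_docs = sport + business
toy_labels = [0] * len(sport) + [1] * len(business)

# The pipeline refits TF-IDF on each training fold, so no test-fold
# vocabulary or document frequencies leak into training
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, toy_docs, toy_labels, cv=3, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```

The same pattern could be applied to the article's texts and y by swapping in the BBC data and a larger number of folds.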

Comparison 2: Document Clustering

We will now consider a second scenario: applying k-means clustering with k=5 and comparing the cluster quality across the three text representation schemes. Notice in the code below that, since clustering is an unsupervised task not requiring labels or train-test splitting, we re-generate all three representations for the whole dataset.


print("\n" + "=" * 70)
print("COMPARISON 2: DOCUMENT CLUSTERING")
print("=" * 70)

# Using the full dataset for clustering (no train/test split needed)
all_texts = texts
all_labels = y

# Generating representations once more
print("\nGenerating representations for full dataset...")
X_bow_full = bow_vectorizer.fit_transform(all_texts)
X_tfidf_full = tfidf_vectorizer.fit_transform(all_texts)
X_emb_full = embedding_model.encode(all_texts, show_progress_bar=True, batch_size=32)

# Clustering with K-Means (k=5, matching ground-truth categories)
n_clusters = len(np.unique(all_labels))
clustering_results = []

representations_full = {
    'BoW': X_bow_full,
    'TF-IDF': X_tfidf_full,
    'LLM Embeddings': X_emb_full
}

for rep_name, X_full in representations_full.items():
    print(f"\nClustering with {rep_name}:")

    start = time()
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_full)
    cluster_time = time() - start

    # Evaluate
    silhouette = silhouette_score(X_full, cluster_labels)
    ari = adjusted_rand_score(all_labels, cluster_labels)

    print(f"  Silhouette Score: {silhouette:.3f}")
    print(f"  Adjusted Rand Index: {ari:.3f}")
    print(f"  Time: {cluster_time:.2f}s")

    clustering_results.append({
        'Representation': rep_name,
        'Silhouette': silhouette,
        'ARI': ari,
        'Time': cluster_time
    })

clustering_df = pd.DataFrame(clustering_results)

Output:


Clustering with BoW:
  Silhouette Score: 0.124
  Adjusted Rand Index: 0.102
  Time: 1.19s

Clustering with TF-IDF:
  Silhouette Score: 0.016
  Adjusted Rand Index: 0.698
  Time: 0.94s

Clustering with LLM Embeddings:
  Silhouette Score: 0.066
  Adjusted Rand Index: 0.899
  Time: 0.41s

Code for visualizing results:


# Creating comparison plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Clustering quality metrics
x = np.arange(len(clustering_df))
width = 0.35
axes[0].bar(x - width/2, clustering_df['Silhouette'], width, label='Silhouette', alpha=0.8)
axes[0].bar(x + width/2, clustering_df['ARI'], width, label='Adjusted Rand Index', alpha=0.8)
axes[0].set_xlabel('Representation')
axes[0].set_ylabel('Score')
axes[0].set_title('Clustering Quality Metrics', fontsize=14, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(clustering_df['Representation'])
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Plot 2: Clustering time
axes[1].bar(clustering_df['Representation'], clustering_df['Time'], color=['#1f77b4', '#ff7f0e', '#2ca02c'], alpha=0.8)
axes[1].set_xlabel('Representation')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Clustering Computation Time', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBEST CLUSTERING PERFORMER:")
print("-" * 50)
best_cluster = clustering_df.loc[clustering_df['ARI'].idxmax()]
print(f"{best_cluster['Representation']}: ARI = {best_cluster['ARI']:.3f}, Silhouette = {best_cluster['Silhouette']:.3f}")

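LLM embeddings won this time, with an ARI score of 0.899, showing strong alignment between the clusters found and the true document categories. Keep in mind that the silhouette score is computed within each representation's own feature space, so its values are not directly comparable across representations; ARI, which measures agreement with the true labels, is the metric used here to pick the winner. The result is largely because clustering is an unsupervised learning task and, unlike classification, it is a territory where semantic understanding like that provided by embeddings becomes much more important for capturing patterns, even on simpler datasets.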
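For readers less familiar with the Adjusted Rand Index, the short sketch below shows two of its properties on hypothetical label lists (invented for illustration): the score ignores how cluster ids are numbered, which matters because k-means assigns ids arbitrarily, and it drops below 1 as soon as documents are grouped differently from the ground truth.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth categories for nine documents
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as the truth, but with the cluster ids shuffled
relabeled = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# One document placed in the wrong group
one_error = [2, 2, 0, 0, 0, 0, 1, 1, 1]

ari_relabeled = adjusted_rand_score(true_labels, relabeled)
ari_one_error = adjusted_rand_score(true_labels, one_error)

print(f"Relabeled clusters: ARI = {ari_relabeled:.3f}")
print(f"One misplaced doc:  ARI = {ari_one_error:.3f}")
```

This is why the clustering code can pass k-means labels straight to adjusted_rand_score without first mapping cluster ids to category ids.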

Summary

Simpler, well-behaved datasets like BBC news are a great example of a problem where advanced, LLM-based representations like embeddings don't always win. Traditional natural language processing approaches for text representation may excel in problems with clear class boundaries, linear separability, and clean, formal text without noisy patterns.

In sum, when addressing real-world machine learning projects, consider always starting with simpler baselines and keyword-based representations like TF-IDF before jumping directly into state-of-the-art or the most advanced techniques. The smaller your problem, the lighter the outfit you need to dress it with that perfect machine learning look!
