    Machine Learning & Research

    From Text to Tables: Feature Engineering with LLMs for Tabular Data

    By Oliver Chambers | March 10, 2026 | 11 Mins Read


    In this article, you’ll learn how to use a pre-trained large language model to extract structured features from text and combine them with numeric columns to train a supervised classifier.

    Topics we will cover include:

    • Creating a toy dataset with mixed text and numeric fields for classification
    • Using a Groq-hosted LLaMA model to extract JSON features from ticket text with a Pydantic schema
    • Training and evaluating a scikit-learn classifier on the engineered tabular dataset

    Let’s not waste any more time.

    Image by Editor

    Introduction

    While large language models (LLMs) are often used for conversational purposes in use cases that revolve around natural language interactions, they can also assist with tasks like feature engineering on complex datasets. In particular, you can leverage pre-trained LLMs from providers like Groq (for example, models from the Llama family) to undertake data transformation and preprocessing tasks, including turning unstructured data like text into fully structured, tabular data that can be used to fuel predictive machine learning models.

    In this article, I’ll guide you through the full process of applying feature engineering to text, turning it into tabular data suitable for a machine learning model: specifically, a classifier trained on features created from text using an LLM.

    Setup and Imports

    First, we make all the necessary imports for this practical example:

    import pandas as pd
    import json
    from pydantic import BaseModel, Field
    from openai import OpenAI
    from google.colab import userdata
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.preprocessing import StandardScaler

    Note that apart from common libraries for machine learning and data preprocessing like scikit-learn, we import the OpenAI class — not because we will directly use an OpenAI model, but because many LLM APIs (including Groq’s) have adopted the same interface style and specifications as OpenAI. This class therefore helps you interact with a variety of providers and access a wide range of LLMs through a single client, including Llama models via Groq, as we will see shortly.

    Next, we set up a Groq client to enable access to a pre-trained LLM that we can call via API for inference during execution:

    groq_api_key = userdata.get('GROQ_API_KEY')

    client = OpenAI(
        base_url="https://api.groq.com/openai/v1",
        api_key=groq_api_key
    )

    Important note: for the above code to work, you must define an API secret key for Groq. In Google Colab, you can do this through the “Secrets” icon on the left-hand sidebar (this icon looks like a key). There, give your key the name 'GROQ_API_KEY', then register on the Groq website to get an actual key, and paste it into the value field.
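    If you are running the code outside of Colab, a common alternative (my own suggestion, not something the article requires) is to read the key from an environment variable instead of Colab secrets:

```python
import os

# Use a placeholder here only so the snippet runs standalone; in practice,
# export GROQ_API_KEY in your shell before launching Python.
os.environ.setdefault("GROQ_API_KEY", "placeholder-key")

# Read the key back; os.environ.get returns None if the variable is unset,
# which lets you fail fast with a clear error message.
groq_api_key = os.environ.get("GROQ_API_KEY")
print(groq_api_key is not None)  # → True
```

    This keeps the key out of your source code and works the same way on a laptop, a server, or a CI job.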

    Creating a Toy Ticket Dataset

    The next step generates a synthetic, partly random toy dataset for illustrative purposes. If you have your own text dataset, feel free to adapt the code accordingly and use your own.


    import random
    import time

    random.seed(42)
    categories = ["access", "inquiry", "software", "billing", "hardware"]

    templates = {
        "access": [
            "I've been locked out of my account for {days} days and need urgent help!",
            "I can't log in, it keeps saying bad password.",
            "Reset my access credentials immediately.",
            "My 2FA isn't working, please help me get into my account."
        ],
        "inquiry": [
            "When will my new credit card arrive in the mail?",
            "Just checking on the status of my recent order.",
            "What are your business hours on weekends?",
            "Can I upgrade my current plan to the premium tier?"
        ],
        "software": [
            "The app keeps crashing every time I try to view my transaction history.",
            "Software bug: the submit button is greyed out.",
            "Pages are loading incredibly slowly since the last update.",
            "I'm getting a 500 Internal Server Error on the dashboard."
        ],
        "billing": [
            "I need a refund for the extra charges on my bill.",
            "Why was I billed twice this month?",
            "Please update my payment method, the old card expired.",
            "I didn't authorize this $49.99 transaction."
        ],
        "hardware": [
            "My hardware token is broken, I can't log in.",
            "The screen on my physical device is cracked.",
            "The card reader isn't scanning properly anymore.",
            "Battery drains in 10 minutes, I need a replacement unit."
        ]
    }

    data = []
    for _ in range(100):
        cat = random.choice(categories)
        # Inject a random number of days into specific templates to foster variety
        text = random.choice(templates[cat]).format(days=random.randint(1, 14))

        data.append({
            "text": text,
            "account_age_days": random.randint(1, 2000),
            "prior_tickets": random.choices([0, 1, 2, 3, 4, 5], weights=[40, 30, 15, 10, 3, 2])[0],
            "label": cat
        })

    df = pd.DataFrame(data)

    The generated dataset contains customer support tickets, combining text descriptions with structured numeric features like account age and number of prior tickets, as well as a class label spanning several ticket categories. These labels will later be used for training and evaluating a classification model at the end of the process.
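    Before moving on, a quick sanity check of the class balance can be worthwhile (this snippet is my own addition; it recreates just the label draws, so the exact counts differ slightly from the article’s df, where other random values are drawn in between):

```python
import random
from collections import Counter

# Recreate only the label draws from the generation loop to inspect balance.
random.seed(42)
categories = ["access", "inquiry", "software", "billing", "hardware"]
labels = [random.choice(categories) for _ in range(100)]

counts = Counter(labels)
print(counts)  # with uniform random.choice, each category should land near 20
```

    Roughly balanced classes mean plain accuracy is a reasonable first metric later; a heavily skewed Counter would argue for stratified splitting instead.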

    Extracting LLM Features

    Next, we define the tabular features we want to extract from the text. The choice of features is domain-dependent and fully customizable, and you’ll use the LLM later on to extract these fields in a consistent, structured format:

    class TicketFeatures(BaseModel):
        urgency_score: int = Field(description="Urgency of the ticket on a scale of 1 to 5")
        is_frustrated: int = Field(description="1 if the user expresses frustration, 0 otherwise")

    For instance, urgency and frustration often correlate with specific ticket types (e.g. access lockouts and outages tend to be more urgent and emotionally charged than general inquiries), so these signals can help a downstream classifier separate categories more effectively than raw text alone.
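    Since LLM replies can occasionally drift outside the schema even with JSON mode enabled, a lightweight post-validation step can clamp or coerce out-of-range values before they reach the classifier. This is my own addition using only the standard library; the function name is illustrative:

```python
import json

def validate_ticket_features(raw_json: str) -> dict:
    """Parse an LLM JSON reply and clamp fields to the ranges the schema expects."""
    parsed = json.loads(raw_json)
    # Clamp urgency to the 1-5 scale and coerce the frustration flag to 0/1.
    urgency = min(max(int(parsed.get("urgency_score", 3)), 1), 5)
    frustrated = 1 if parsed.get("is_frustrated") else 0
    return {"urgency_score": urgency, "is_frustrated": frustrated}

print(validate_ticket_features('{"urgency_score": 9, "is_frustrated": 1}'))
# → {'urgency_score': 5, 'is_frustrated': 1}  (a score of 9 is clamped to 5)
```

    Catching a single malformed value here is much cheaper than debugging a scaler or classifier that silently ingested it.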

    The next function is a key element of the process, as it encapsulates the LLM integration needed to transform a ticket’s text into a JSON object that matches our schema.


    def extract_features(text: str) -> dict:
        # Sleep for 2.5 seconds for safer use under the 30 RPM free-tier limit
        time.sleep(2.5)

        schema_instructions = json.dumps(TicketFeatures.model_json_schema())
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[
                {
                    "role": "system",
                    "content": f"You are an extraction assistant. Output ONLY valid JSON matching this schema: {schema_instructions}"
                },
                {"role": "user", "content": text}
            ],
            response_format={"type": "json_object"},
            temperature=0.0
        )
        return json.loads(response.choices[0].message.content)

    Why does the function return JSON objects? First, JSON is a reliable way to ask an LLM to produce structured outputs. Second, JSON objects can be easily converted into pandas Series objects, which can then be seamlessly merged with other columns of an existing DataFrame to become new ones. The following instructions do the trick and append the new features, stored in engineered_features, to the rest of the original dataset:

    print("1. Extracting structured features from text using the LLM...")
    engineered_features = df["text"].apply(extract_features)
    features_df = pd.DataFrame(engineered_features.tolist())

    X_raw = pd.concat([df.drop(columns=["text", "label"]), features_df], axis=1)
    y = df["label"]

    print("\n2. Final Engineered Tabular Dataset:")
    print(X_raw)

    Here’s what the resulting tabular data looks like:

        account_age_days  prior_tickets  urgency_score  is_frustrated
    0                564              0              5              1
    1               1517              3              4              0
    2                 62              0              5              1
    3                408              2              4              0
    4                920              1              5              1
    ..               ...            ...            ...            ...
    95                91              2              4              1
    96               884              0              4              1
    97              1737              0              5              1
    98               837              0              5              1
    99               862              1              4              1

    [100 rows x 4 columns]

    A practical note on cost and latency: calling an LLM once per row can become slow and expensive on larger datasets. In production, you’ll usually want to (1) batch requests (process many tickets per call, if your provider and prompt design allow it), (2) cache results keyed by a stable identifier (or a hash of the ticket text) so re-runs don’t re-bill the same examples, and (3) implement retries with backoff to handle transient rate limits and network errors. These three practices typically make the pipeline faster, cheaper, and far more reliable.
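    A minimal sketch of points (2) and (3), with a stand-in `call_llm` function in place of the real API call (the function name, cache shape, and retry parameters are all illustrative assumptions, not part of the article’s pipeline):

```python
import hashlib
import time

_cache: dict = {}

def cached_extract(text: str, call_llm, max_retries: int = 3) -> dict:
    """Cache LLM feature extraction by a hash of the text, retrying with backoff."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]  # re-runs skip the API (and its cost) entirely

    for attempt in range(max_retries):
        try:
            result = call_llm(text)
            _cache[key] = result
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...

# Demonstrate with a stub in place of the real LLM call:
calls = []
def fake_llm(text):
    calls.append(text)
    return {"urgency_score": 4, "is_frustrated": 1}

cached_extract("My 2FA isn't working", fake_llm)
cached_extract("My 2FA isn't working", fake_llm)  # second call served from cache
print(len(calls))  # → 1
```

    In a real pipeline you would persist the cache to disk (e.g. a JSON file keyed by the same hashes) so restarts do not re-bill completed rows.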

    Training and Evaluating the Model

    Finally, here comes the machine learning pipeline, where the updated, fully tabular dataset is scaled, split into training and test subsets, and used to train and evaluate a random forest classifier.

    print("\n3. Scaling and Training the Random Forest...")
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_raw)

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=42)

    # Train a random forest classification model
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = clf.predict(X_test)
    print("\n4. Classification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))

    Here are the classifier results:

    Classification Report:
                  precision    recall  f1-score   support

          access       0.22      0.18      0.20        11
         billing       0.29      0.33      0.31         6
        hardware       0.29      0.25      0.27         8
         inquiry       1.00      1.00      1.00         8
        software       0.44      0.57      0.50         7

        accuracy                           0.45        40
       macro avg       0.45      0.47      0.45        40
    weighted avg       0.44      0.45      0.44        40

    If you used the code for generating a synthetic toy dataset, you may get a rather disappointing classifier result in terms of accuracy, precision, recall, and so on. This is normal: for the sake of efficiency and simplicity, we used a small, partly random set of 100 instances, which is usually too small (and arguably too random) to perform well. The key here is the process of turning raw text into meaningful features through the use of a pre-trained LLM via API, which should work reliably.
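    One way to check whether the LLM-derived columns actually carry signal is to train the same classifier on the numeric columns alone and compare accuracies. This diagnostic is my own suggestion, shown here on synthetic stand-in data (two noise columns versus two label-correlated columns playing the role of urgency and frustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: 5 classes, 2 pure-noise columns, and 2 columns
# correlated with the label (mimicking useful LLM-extracted features).
rng = np.random.default_rng(42)
n = 200
labels = rng.integers(0, 5, size=n)
noise = rng.integers(0, 2000, size=(n, 2)).astype(float)
signal = np.column_stack([labels + rng.integers(0, 2, size=n), labels % 2]).astype(float)

def accuracy_for(X):
    """Train the article's classifier setup on X and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.4, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_numeric = accuracy_for(noise)
acc_with_llm = accuracy_for(np.hstack([noise, signal]))
print(acc_numeric, acc_with_llm)  # label-correlated columns should lift accuracy sharply
```

    Running the same comparison on your real X_raw (with and without urgency_score and is_frustrated) tells you how much of the classifier’s performance the LLM features are responsible for.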

    Summary

    This article takes a gentle tour through the process of turning raw text into fully tabular features for downstream machine learning modeling. The key trick shown along the way is using a pre-trained LLM to perform inference and return structured outputs via effective prompting.
