From Textual content to Tables: Function Engineering with LLMs for Tabular Knowledge

🚀 Able to supercharge your AI workflow? Attempt ElevenLabs for AI voice and speech era!

On this article, you’ll learn to use a pre-trained giant language mannequin to extract structured options from textual content and mix them with numeric columns to coach a supervised classifier.

Subjects we are going to cowl embody:

Making a toy dataset with combined textual content and numeric fields for classification
Utilizing a Groq-hosted LLaMA mannequin to extract JSON options from ticket textual content with a Pydantic schema
Coaching and evaluating a scikit-learn classifier on the engineered tabular dataset

Let’s not waste any extra time.

From Text to Tables: Feature Engineering with LLMs for Tabular Data

From Textual content to Tables: Function Engineering with LLMs for Tabular Knowledge
Picture by Editor

Introduction

Whereas giant language fashions (LLMs) are sometimes used for conversational functions in use instances that revolve round pure language interactions, they’ll additionally help with duties like function engineering on complicated datasets. Particularly, you possibly can leverage pre-trained LLMs from suppliers like Groq (for instance, fashions from the Llama household) to undertake knowledge transformation and preprocessing duties, together with turning unstructured knowledge like textual content into totally structured, tabular knowledge that can be utilized to gas predictive machine studying fashions.

On this article, I’ll information you thru the complete means of making use of function engineering to structured textual content, turning it into tabular knowledge appropriate for a machine studying mannequin — particularly, a classifier skilled on options created from textual content by utilizing an LLM.

Setup and Imports

First, we are going to make all the required imports for this sensible instance:

import pandas as pd import json from pydantic import BaseModel, Subject from openai import OpenAI from google.colab import userdata from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import classification_report from sklearn.preprocessing import StandardScaler

import pandas as pd

import json

from pydantic import BaseModel, Subject

from openai import OpenAI

from google.colab import userdata

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report

from sklearn.preprocessing import StandardScaler

Word that in addition to frequent libraries for machine studying and knowledge preprocessing like scikit-learn, we import the OpenAI class — not as a result of we are going to straight use an OpenAI mannequin, however as a result of many LLM APIs (together with Groq’s) have adopted the identical interface fashion and specs as OpenAI. This class subsequently helps you work together with quite a lot of suppliers and entry a variety of LLMs via a single shopper, together with Llama fashions by way of Groq, as we are going to see shortly.

Subsequent, we arrange a Groq shopper to allow entry to a pre-trained LLM that we are able to name by way of API for inference throughout execution:

groq_api_key = userdata.get(‘GROQ_API_KEY’) shopper = OpenAI( base_url=”https://api.groq.com/openai/v1″, api_key=groq_api_key )

groq_api_key = userdata.get(‘GROQ_API_KEY’)

shopper = OpenAI(

base_url=“https://api.groq.com/openai/v1”,

api_key=groq_api_key

)

Necessary be aware: for the above code to work, it’s essential to outline an API secret key for Groq. In Google Colab, you are able to do this via the “Secrets and techniques” icon on the left-hand facet bar (this icon seems like a key). Right here, give your key the identify 'GROQ_API_KEY', then register on the Groq web site to get an precise key, and paste it into the worth subject.

Making a Toy Ticket Dataset

The subsequent step generates an artificial, partly random toy dataset for illustrative functions. In case you have your personal textual content dataset, be at liberty to adapt the code accordingly and use your personal.

import random import time random.seed(42) classes = [“access”, “inquiry”, “software”, “billing”, “hardware”] templates = { “entry”: [ “I’ve been locked out of my account for {days} days and need urgent help!”, “I can’t log in, it keeps saying bad password.”, “Reset my access credentials immediately.”, “My 2FA isn’t working, please help me get into my account.” ], “inquiry”: [ “When will my new credit card arrive in the mail?”, “Just checking on the status of my recent order.”, “What are your business hours on weekends?”, “Can I upgrade my current plan to the premium tier?” ], “software program”: [ “The app keeps crashing every time I try to view my transaction history.”, “Software bug: the submit button is greyed out.”, “Pages are loading incredibly slowly since the last update.”, “I’m getting a 500 Internal Server Error on the dashboard.” ], “billing”: [ “I need a refund for the extra charges on my bill.”, “Why was I billed twice this month?”, “Please update my payment method, the old card expired.”, “I didn’t authorize this $49.99 transaction.” ], “{hardware}”: [ “My hardware token is broken, I can’t log in.”, “The screen on my physical device is cracked.”, “The card reader isn’t scanning properly anymore.”, “Battery drains in 10 minutes, I need a replacement unit.” ] } knowledge = [] for _ in vary(100): cat = random.alternative(classes) # Injecting a random variety of days into particular templates to foster selection textual content = random.alternative(templates[cat]).format(days=random.randint(1, 14)) knowledge.append({ “textual content”: textual content, “account_age_days”: random.randint(1, 2000), “prior_tickets”: random.selections([0, 1, 2, 3, 4, 5], weights=[40, 30, 15, 10, 3, 2])[0], “label”: cat }) df = pd.DataFrame(knowledge)

import random

import time

random.seed(42)

classes = [“access”, “inquiry”, “software”, “billing”, “hardware”]

templates = {

“entry”: [

“I’ve been locked out of my account for {days} days and need urgent help!”,

“I can’t log in, it keeps saying bad password.”,

“Reset my access credentials immediately.”,

“My 2FA isn’t working, please help me get into my account.”

“inquiry”: [

“When will my new credit card arrive in the mail?”,

“Just checking on the status of my recent order.”,

“What are your business hours on weekends?”,

“Can I upgrade my current plan to the premium tier?”

“software program”: [

“The app keeps crashing every time I try to view my transaction history.”,

“Software bug: the submit button is greyed out.”,

“Pages are loading incredibly slowly since the last update.”,

“I’m getting a 500 Internal Server Error on the dashboard.”

“billing”: [

“I need a refund for the extra charges on my bill.”,

“Why was I billed twice this month?”,

“Please update my payment method, the old card expired.”,

“I didn’t authorize this $49.99 transaction.”

“{hardware}”: [

“My hardware token is broken, I can’t log in.”,

“The screen on my physical device is cracked.”,

“The card reader isn’t scanning properly anymore.”,

“Battery drains in 10 minutes, I need a replacement unit.”

]

}

knowledge = []

for _ in vary(100):

cat = random.alternative(classes)

# Injecting a random variety of days into particular templates to foster selection

textual content = random.alternative(templates[cat]).format(days=random.randint(1, 14))

knowledge.append({

“textual content”: textual content,

“account_age_days”: random.randint(1, 2000),

“prior_tickets”: random.selections([0, 1, 2, 3, 4, 5], weights=[40, 30, 15, 10, 3, 2])[0],

“label”: cat

})

df = pd.DataFrame(knowledge)

The dataset generated accommodates buyer assist tickets, combining textual content descriptions with structured numeric options like account age and variety of prior tickets, in addition to a category label spanning a number of ticket classes. These labels will later be used for coaching and evaluating a classification mannequin on the finish of the method.

Extracting LLM Options

Subsequent, we outline the specified tabular options we wish to extract from the textual content. The selection of options is domain-dependent and totally customizable, however you’ll use the LLM in a while to extract these fields in a constant, structured format:

class TicketFeatures(BaseModel): urgency_score: int = Subject(description=”Urgency of the ticket on a scale of 1 to five”) is_frustrated: int = Subject(description=”1 if the person expresses frustration, 0 in any other case”)

class TicketFeatures(BaseModel):

urgency_score: int = Subject(description=“Urgency of the ticket on a scale of 1 to five”)

is_frustrated: int = Subject(description=“1 if the person expresses frustration, 0 in any other case”)

For instance, urgency and frustration usually correlate with particular ticket varieties (e.g. entry lockouts and outages are typically extra pressing and emotionally charged than normal inquiries), so these alerts may help a downstream classifier separate classes extra successfully than uncooked textual content alone.

The subsequent perform is a key ingredient of the method, because it encapsulates the LLM integration wanted to remodel a ticket’s textual content right into a JSON object that matches our schema.

def extract_features(textual content: str) -> dict: # Sleep for two.5 seconds for safer use underneath the constraints of the 30 RPM free-tier restrict time.sleep(2.5) schema_instructions = json.dumps(TicketFeatures.model_json_schema()) response = shopper.chat.completions.create( mannequin=”llama-3.3-70b-versatile”, messages=[ { “role”: “system”, “content”: f”You are an extraction assistant. Output ONLY valid JSON matching this schema: {schema_instructions}” }, {“role”: “user”, “content”: text} ], response_format={“kind”: “json_object”}, temperature=0.0 ) return json.hundreds(response.selections[0].message.content material)

def extract_features(textual content: str) -> dict:

# Sleep for two.5 seconds for safer use underneath the constraints of the 30 RPM free-tier restrict

time.sleep(2.5)

schema_instructions = json.dumps(TicketFeatures.model_json_schema())

response = shopper.chat.completions.create(

mannequin=“llama-3.3-70b-versatile”,

messages=[

{

“role”: “system”,

“content”: f“You are an extraction assistant. Output ONLY valid JSON matching this schema: {schema_instructions}”

{“role”: “user”, “content”: text}

response_format={“kind”: “json_object”},

temperature=0.0

)

return json.hundreds(response.selections[0].message.content material)

Why does the perform return JSON objects? First, JSON is a dependable strategy to ask an LLM to supply structured outputs. Second, JSON objects may be simply transformed into Pandas Sequence objects, which might then be seamlessly merged with different columns of an current DataFrame to turn out to be new ones. The next directions do the trick and append the brand new options, saved in engineered_features, to the remainder of the unique dataset:

print(“1. Extracting structured options from textual content utilizing LLM…”) engineered_features = df[“text”].apply(extract_features) features_df = pd.DataFrame(engineered_features.tolist()) X_raw = pd.concat([df.drop(columns=[“text”, “label”]), features_df], axis=1) y = df[“label”] print(“n2. Closing Engineered Tabular Dataset:”) print(X_raw)

print(“1. Extracting structured options from textual content utilizing LLM…”)

engineered_features = df[“text”].apply(extract_features)

features_df = pd.DataFrame(engineered_features.tolist())

X_raw = pd.concat([df.drop(columns=[“text”, “label”]), features_df], axis=1)

y = df[“label”]

print(“n2. Closing Engineered Tabular Dataset:”)

print(X_raw)

Here’s what the ensuing tabular knowledge seems like:

account_age_days prior_tickets urgency_score is_frustrated 0 564 0 5 1 1 1517 3 4 0 2 62 0 5 1 3 408 2 4 0 4 920 1 5 1 .. … … … … 95 91 2 4 1 96 884 0 4 1 97 1737 0 5 1 98 837 0 5 1 99 862 1 4 1 [100 rows x 4 columns]

account_age_days prior_tickets urgency_score is_pissed off

0 564 0 5 1

1 1517 3 4 0

2 62 0 5 1

3 408 2 4 0

4 920 1 5 1

.. ... ... ... ...

95 91 2 4 1

96 884 0 4 1

97 1737 0 5 1

98 837 0 5 1

99 862 1 4 1

[100 rows x 4 columns]

Sensible be aware on value and latency: Calling an LLM as soon as per row can turn out to be sluggish and costly on bigger datasets. In manufacturing, you’ll often wish to (1) batch requests (course of many tickets per name, in case your supplier and immediate design permit it), (2) cache outcomes keyed by a steady identifier (or a hash of the ticket textual content) so re-runs don’t re-bill the identical examples, and (3) implement retries with backoff to deal with transient charge limits and community errors. These three practices sometimes make the pipeline quicker, cheaper, and much more dependable.

Coaching and Evaluating the Mannequin

Lastly, right here comes the machine studying pipeline, the place the up to date, totally tabular dataset is scaled, cut up into coaching and take a look at subsets, and used to coach and consider a random forest classifier.

print(“n3. Scaling and Coaching Random Forest…”) scaler = StandardScaler() X_scaled = scaler.fit_transform(X_raw) # Break up the information into coaching and take a look at X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=42) # Practice a random forest classification mannequin clf = RandomForestClassifier(random_state=42) clf.match(X_train, y_train) # Predict and Consider y_pred = clf.predict(X_test) print(“n4. Classification Report:”) print(classification_report(y_test, y_pred, zero_division=0))

print(“n3. Scaling and Coaching Random Forest…”)

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X_raw)

# Break up the information into coaching and take a look at

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=42)

# Practice a random forest classification mannequin

clf = RandomForestClassifier(random_state=42)

clf.match(X_train, y_train)

# Predict and Consider

y_pred = clf.predict(X_test)

print(“n4. Classification Report:”)

print(classification_report(y_test, y_pred, zero_division=0))

Listed here are the classifier outcomes:

Classification Report: precision recall f1-score assist entry 0.22 0.18 0.20 11 billing 0.29 0.33 0.31 6 {hardware} 0.29 0.25 0.27 8 inquiry 1.00 1.00 1.00 8 software program 0.44 0.57 0.50 7 accuracy 0.45 40 macro avg 0.45 0.47 0.45 40 weighted avg 0.44 0.45 0.44 40

Classification Report:

precision recall f1–rating assist

entry 0.22 0.18 0.20 11

billing 0.29 0.33 0.31 6

{hardware} 0.29 0.25 0.27 8

inquiry 1.00 1.00 1.00 8

software program 0.44 0.57 0.50 7

accuracy 0.45 40

macro avg 0.45 0.47 0.45 40

weighted avg 0.44 0.45 0.44 40

When you used the code for producing an artificial toy dataset, it’s possible you’ll get a moderately disappointing classifier consequence by way of accuracy, precision, recall, and so forth. That is regular: for the sake of effectivity and ease, we used a small, partly random set of 100 cases — which is often too small (and arguably too random) to carry out properly. The important thing right here is the method of turning uncooked textual content into significant options via using a pre-trained LLM by way of API, which ought to work reliably.

Abstract

This text takes a mild tour via the method of turning uncooked textual content into totally tabular options for downstream machine studying modeling. The important thing trick proven alongside the best way is utilizing a pre-trained LLM to carry out inference and return structured outputs by way of efficient prompting.

🔥 Need the perfect instruments for AI advertising? Take a look at GetResponse AI-powered automation to spice up your enterprise!

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

From Textual content to Tables: Function Engineering with LLMs for Tabular Knowledge

Introduction

Setup and Imports

Making a Toy Ticket Dataset

Extracting LLM Options

Coaching and Evaluating the Mannequin

Abstract

LEAVE A REPLY

Subscribe

9 kinds of Google Adverts (professionals, cons, and when to make use of every)

Add security checks to your workflows

Setting Up a Google Colab AI-Assisted Coding Surroundings That Truly Works

Prolong SAP Cloud ALM Observability With Automation Observability

Constructing Good Machine Studying in Low-Useful resource Settings

More like this
Related

9 kinds of Google Adverts (professionals, cons, and when to make use of every)

Add security checks to your workflows

Setting Up a Google Colab AI-Assisted Coding Surroundings That Truly Works

Prolong SAP Cloud ALM Observability With Automation Observability

About us

The latest posts

9 kinds of Google Adverts (professionals, cons, and when to make use of every)

Add security checks to your workflows

Setting Up a Google Colab AI-Assisted Coding Surroundings That Truly Works

Newsletter Subscribe

From Textual content to Tables: Function Engineering with LLMs for Tabular Knowledge

Introduction

Setup and Imports

Making a Toy Ticket Dataset

Extracting LLM Options

Coaching and Evaluating the Mannequin

Abstract

LEAVE A REPLY

Subscribe

More like thisRelated

About us

The latest posts

Newsletter Subscribe

More like this
Related