How you can Mix LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline

🚀 Able to supercharge your AI workflow? Attempt ElevenLabs for AI voice and speech technology!

On this article, you’ll learn to fuse dense LLM sentence embeddings, sparse TF-IDF options, and structured metadata right into a single scikit-learn pipeline for textual content classification.

Subjects we’ll cowl embody:

Loading and getting ready a textual content dataset alongside artificial metadata options.
Constructing parallel characteristic pipelines for TF-IDF, LLM embeddings, and numeric metadata.
Fusing all characteristic branches with ColumnTransformer and coaching an end-to-end classifier.

Let’s break it down.

Combine LLM Embeddings TF-IDF Metadata Scikit-learn Pipeline

How you can Mix LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline (click on to enlarge)
Picture by Editor

Introduction

Knowledge fusion, or combining various items of knowledge right into a single pipeline, sounds formidable sufficient. If we discuss not nearly two, however about three complementary characteristic sources, then the problem — and the potential payoff — goes to the subsequent degree. Probably the most thrilling half is that scikit-learn permits us to unify all of them cleanly inside a single, end-to-end workflow. Do you need to see how? This text walks you step-by-step by constructing a whole fusion pipeline from scratch for a downstream textual content classification process, combining dense semantic data from LLM-generated embeddings, sparse lexical options from TF-IDF, and structured metadata alerts. ? Maintain studying.

Step-by-Step Pipeline Constructing Course of

First, we’ll make all the mandatory imports for the pipeline-building course of. If you’re working in a neighborhood atmosphere, you may must pip set up a few of them first:

import numpy as np import pandas as pd from sklearn.datasets import fetch_20newsgroups from sklearn.model_selection import train_test_split from sklearn.pipeline import Pipeline from sklearn.compose import ColumnTransformer from sklearn.feature_extraction.textual content import TfidfVectorizer from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression from sklearn.metrics import classification_report from sklearn.base import BaseEstimator, TransformerMixin from sklearn.decomposition import TruncatedSVD from sentence_transformers import SentenceTransformer

import numpy as np

import pandas as pd

from sklearn.datasets import fetch_20newsgroups

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

from sklearn.feature_extraction.textual content import TfidfVectorizer

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.decomposition import TruncatedSVD

from sentence_transformers import SentenceTransformer

Let’s look intently at this — virtually infinite! — listing of imports. I wager one factor has caught your consideration: fetch_20newsgroups. This can be a freely accessible textual content dataset in scikit-learn that we are going to use all through this text: it accommodates textual content extracted from information articles belonging to all kinds of classes.

To maintain our dataset manageable in apply, we’ll choose the information articles belonging to a subset of classes specified by us. The next code does the trick:

classes = [ “rec.sport.baseball”, “sci.space”, “comp.graphics”, “talk.politics.misc” ] dataset = fetch_20newsgroups( subset=”all”, classes=classes, take away=(“headers”, “footers”, “quotes”) ) X_raw = dataset.knowledge y = dataset.goal print(f”Variety of samples: {len(X_raw)}”)

classes = [

“rec.sport.baseball”,

“sci.space”,

“comp.graphics”,

“talk.politics.misc”

]

dataset = fetch_20newsgroups(

subset=“all”,

classes=classes,

take away=(“headers”, “footers”, “quotes”)

)

X_raw = dataset.knowledge

y = dataset.goal

print(f“Variety of samples: {len(X_raw)}”)

We referred to as this freshly created dataset X_raw to emphasise that this can be a uncooked, far-from-final model of the dataset we’ll progressively assemble for downstream duties like utilizing machine studying fashions for predictive functions. It’s honest to say that the “uncooked” suffix can also be used as a result of right here we now have the uncooked textual content, from which three completely different knowledge elements (or streams) might be generated and later merged.

For the structured metadata related to the information articles obtained, in real-world contexts, this metadata may already be accessible or offered by the dataset proprietor. That’s not the case with this publicly accessible dataset, so we’ll synthetically create some easy metadata options based mostly on the textual content, together with options describing character size, phrase rely, common phrase size, uppercase ratio, and digit ratio.

def generate_metadata(texts): lengths = [len(t) for t in texts] word_counts = [len(t.split()) for t in texts] avg_word_lengths = [] uppercase_ratios = [] digit_ratios = [] for t in texts: phrases = t.break up() if phrases: avg_word_lengths.append(np.imply([len(w) for w in words])) else: avg_word_lengths.append(0) denom = max(len(t), 1) uppercase_ratios.append( sum(1 for c in t if c.isupper()) / denom ) digit_ratios.append( sum(1 for c in t if c.isdigit()) / denom ) return pd.DataFrame({ “textual content”: texts, “char_length”: lengths, “word_count”: word_counts, “avg_word_length”: avg_word_lengths, “uppercase_ratio”: uppercase_ratios, “digit_ratio”: digit_ratios }) # Calling the operate to generate a structured dataset that accommodates: uncooked textual content + metadata df = generate_metadata(X_raw) df[“target”] = y df.head()

def generate_metadata(texts):

lengths = [len(t) for t in texts]

word_counts = [len(t.split()) for t in texts]

avg_word_lengths = []

uppercase_ratios = []

digit_ratios = []

for t in texts:

phrases = t.break up()

if phrases:

avg_word_lengths.append(np.imply([len(w) for w in words]))

else:

avg_word_lengths.append(0)

denom = max(len(t), 1)

uppercase_ratios.append(

sum(1 for c in t if c.isupper()) / denom

)

digit_ratios.append(

sum(1 for c in t if c.isdigit()) / denom

)

return pd.DataFrame({

“textual content”: texts,

“char_length”: lengths,

“word_count”: word_counts,

“avg_word_length”: avg_word_lengths,

“uppercase_ratio”: uppercase_ratios,

“digit_ratio”: digit_ratios

})

# Calling the operate to generate a structured dataset that accommodates: uncooked textual content + metadata

df = generate_metadata(X_raw)

df[“target”] = y

df.head()

Earlier than getting absolutely into the pipeline-building course of, we’ll break up the info into practice and take a look at subsets:

X = df.drop(columns=[“target”]) y = df[“target”] X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y )

X = df.drop(columns=[“target”])

y = df[“target”]

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.2, random_state=42, stratify=y

)

Essential: splitting the info into coaching and take a look at units have to be executed earlier than extracting the LLM embeddings and TF-IDF options. Why? As a result of these two extraction processes turn out to be a part of the pipeline, they usually contain becoming transformations with scikit-learn, that are studying processes — for instance, studying the TF-IDF vocabulary and inverse doc frequency (IDF) statistics. The scikit-learn logic to implement that is as follows: any knowledge transformations have to be fitted (study the transformation logic) solely on the coaching knowledge after which utilized to the take a look at knowledge utilizing the discovered logic. This manner, no data from the take a look at set will affect or bias characteristic development or downstream mannequin coaching.

Now comes a key stage: defining a category that encapsulates a pre-trained sentence transformer (a language mannequin like all-MiniLM-L6-v2 able to producing textual content embeddings from uncooked textual content) to provide our customized LLM embeddings.

class EmbeddingTransformer(BaseEstimator, TransformerMixin): def __init__(self, model_name=”all-MiniLM-L6-v2″): self.model_name = model_name self.mannequin = None def match(self, X, y=None): self.mannequin = SentenceTransformer(self.model_name) return self def remodel(self, X): embeddings = self.mannequin.encode( X.tolist(), show_progress_bar=False ) return np.array(embeddings)

class EmbeddingTransformer(BaseEstimator, TransformerMixin):

def __init__(self, model_name=“all-MiniLM-L6-v2”):

self.model_name = model_name

self.mannequin = None

def match(self, X, y=None):

self.mannequin = SentenceTransformer(self.model_name)

return self

def remodel(self, X):

embeddings = self.mannequin.encode(

X.tolist(),

show_progress_bar=False

)

return np.array(embeddings)

Now we’re constructing the three important knowledge branches (or parallel pipelines) we’re excited by, one after the other. First, the pipeline for TF-IDF characteristic extraction, by which we’ll use scikit-learn’s TfidfVectorizer class to extract these options seamlessly:

tfidf_pipeline = Pipeline([ (“tfidf”, TfidfVectorizer(max_features=5000)), (“svd”, TruncatedSVD(n_components=300, random_state=42)) ])

tfidf_pipeline = Pipeline([

(“tfidf”, TfidfVectorizer(max_features=5000)),

(“svd”, TruncatedSVD(n_components=300, random_state=42))

])

Subsequent comes the LLM embeddings pipeline, aided by the customized class we outlined earlier:

embedding_pipeline = Pipeline([ (“embed”, EmbeddingTransformer()) ])

embedding_pipeline = Pipeline([

(“embed”, EmbeddingTransformer())

])

Final, we outline the department pipeline for the metadata options, by which we intention to standardize these attributes on account of their disparate ranges:

metadata_features = [ “char_length”, “word_count”, “avg_word_length”, “uppercase_ratio”, “digit_ratio” ] metadata_pipeline = Pipeline([ (“scaler”, StandardScaler()) ])

metadata_features = [

“char_length”,

“word_count”,

“avg_word_length”,

“uppercase_ratio”,

“digit_ratio”

]

metadata_pipeline = Pipeline([

(“scaler”, StandardScaler())

])

Now we now have three parallel pipelines, however nothing to attach them — not less than not but. Right here comes the primary, overarching pipeline that may orchestrate the fusion course of amongst all three knowledge branches, through the use of a really helpful and versatile scikit-learn artifact for the fusion of heterogeneous knowledge flows: a ColumnTransformer pipeline.

preprocessor = ColumnTransformer( transformers=[ (“tfidf”, tfidf_pipeline, “text”), (“embedding”, embedding_pipeline, “text”), (“metadata”, metadata_pipeline, metadata_features), ], the rest=”drop” )

preprocessor = ColumnTransformer(

transformers=[

(“tfidf”, tfidf_pipeline, “text”),

(“embedding”, embedding_pipeline, “text”),

(“metadata”, metadata_pipeline, metadata_features),

the rest=“drop”

)

And the icing on the cake: a full, end-to-end pipeline that may mix the fusion pipeline with an instance of a machine learning-driven downstream process. Specifically, right here’s mix your entire knowledge fusion pipeline we now have simply architected with the coaching of a logistic regression classifier to foretell the information class:

full_pipeline = Pipeline([ (“features”, preprocessor), (“clf”, LogisticRegression(max_iter=2000)) ])

full_pipeline = Pipeline([

(“features”, preprocessor),

(“clf”, LogisticRegression(max_iter=2000))

])

The next instruction will do all of the heavy lifting we now have been designing to this point. The LLM embeddings half will significantly take a couple of minutes (particularly if the mannequin must be downloaded), so be affected person. This step will undertake the entire threefold course of of knowledge preprocessing, fusion, and mannequin coaching:

full_pipeline.match(X_train, y_train)

full_pipeline.match(X_train, y_train)

To finalize, we will make predictions on the take a look at set and see how our fusion-driven classifier performs.

y_pred = full_pipeline.predict(X_test) print(classification_report(y_test, y_pred, target_names=dataset.target_names))

y_pred = full_pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=dataset.target_names))

And for a visible wrap-up, right here’s what your entire pipeline appears to be like like:

Text data fusion pipeline with scikit-learn

Wrapping Up

This text guided you thru the method of constructing a complete machine learning-oriented workflow that focuses on the fusion of a number of data sources derived from uncooked textual content knowledge, in order that the whole lot may be put collectively in downstream predictive duties like textual content classification. We’ve got seen how scikit-learn supplies a set of helpful lessons and strategies to make the method simpler and extra intuitive.

🔥 Need the very best instruments for AI advertising and marketing? Try GetResponse AI-powered automation to spice up what you are promoting!

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

How you can Mix LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline

Introduction

Step-by-Step Pipeline Constructing Course of

Wrapping Up

LEAVE A REPLY

Subscribe

Constructing a Multi-Software Gemma 4 Agent with Error Restoration

6 productiveness hacks everybody ought to strive in 2026

The Statistics of Token Choice: Logits, Temperature, and Prime-P Walkthrough

Extending the Agentic Office to Each Assembly Platform

Constructing a Context Pruning Pipeline for Lengthy-Operating Brokers

More like this
Related

Constructing a Multi-Software Gemma 4 Agent with Error Restoration

6 productiveness hacks everybody ought to strive in 2026

The Statistics of Token Choice: Logits, Temperature, and Prime-P Walkthrough

Extending the Agentic Office to Each Assembly Platform

About us

The latest posts

Constructing a Multi-Software Gemma 4 Agent with Error Restoration

6 productiveness hacks everybody ought to strive in 2026

The Statistics of Token Choice: Logits, Temperature, and Prime-P Walkthrough

Newsletter Subscribe

How you can Mix LLM Embeddings + TF-IDF + Metadata in One Scikit-learn Pipeline

Introduction

Step-by-Step Pipeline Constructing Course of

Wrapping Up

LEAVE A REPLY

Subscribe

More like thisRelated

About us

The latest posts

Newsletter Subscribe

More like this
Related