This example demonstrates a machine learning pipeline that processes text data through custom preprocessing, dimensionality reduction, and classification with an ensemble model. It uses scikit-learn for the pipeline itself and spaCy for tokenization and lemmatization.

The pipeline is suitable for text classification tasks that require custom feature engineering and model tuning, such as sentiment analysis, spam detection, or topic classification, and it can serve as a template for more sophisticated pipelines extended for production use. Here is the complete code; we will break it down step by step afterwards:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy
# Load the spaCy model (download it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Custom transformer for text preprocessing
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Lemmatize every token and drop stop words
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]
# Load sample data
# X, y = load_data()  # Replace with actual data loading logic
# A few extra toy documents so the train/test split and the stacking
# classifier's internal cross-validation have enough samples per class
X = [
    "This is a simple test.",
    "Another document for testing.",
    "Text data with entities like London.",
    "Win a free prize by replying to this message.",
    "The meeting is scheduled for Monday morning.",
    "Claim your free reward now, limited offer.",
]
y = [0, 1, 0, 1, 0, 1]
# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),   # Custom text preprocessing
    ('tfidf', TfidfVectorizer()),           # TF-IDF vectorization
    # n_components must be smaller than the TF-IDF vocabulary size;
    # 2 works for this toy corpus, use something like 100 on a real dataset
    ('svd', TruncatedSVD(n_components=2)),  # Dimensionality reduction
    ('classifier', StackingClassifier(      # Ensemble model
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10)),
            ('ada', AdaBoostClassifier(n_estimators=10))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10),
        cv=2  # The default cv=5 needs more samples per class than this toy dataset has
    ))
])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # Keep both classes in each split
)
# Train the model
pipeline.fit(X_train, y_train)
# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
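Before breaking the code down, one remark on design: because every step is named, all hyperparameters are reachable through scikit-learn's `step__param` syntax, so the whole pipeline can be tuned as a single estimator. Here is a minimal sketch with `GridSearchCV`; the grid values are illustrative, and it assumes a reasonably sized dataset, since the toy corpus above is too small for nested cross-validation:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the keys must match the step names used in the Pipeline
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svd__n_components': [50, 100],
    'classifier__final_estimator__n_estimators': [10, 50],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_macro')
search.fit(X_train, y_train)
print(search.best_params_)
```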
Now let's walk through each part, starting with the imports:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy
```
We import the necessary modules from scikit-learn and spaCy. Here is what each one does:

- `BaseEstimator`, `TransformerMixin`: for building a custom transformer.
- `Pipeline`: to create a processing pipeline.
- `TfidfVectorizer`: to transform text into TF-IDF features.
- `TruncatedSVD`: to perform dimensionality reduction on sparse matrices.
- `RandomForestClassifier`, `AdaBoostClassifier`, `StackingClassifier`: to build an ensemble classifier.
- `train_test_split`: to split the dataset into training and testing sets.
- `classification_report`: to generate a performance report.
- `spacy`: for text processing (tokenization and lemmatization).

Next come the spaCy model and the custom transformer:

```python
nlp = spacy.load('en_core_web_sm')

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]
```
The custom transformer inherits from `BaseEstimator` and `TransformerMixin`, which makes it usable as a pipeline step:

- `__init__`: initializes the transformer (nothing to configure here).
- `fit`: does nothing, but must exist for compatibility with scikit-learn's pipeline API.
- `transform`: the core method. For each document in `X`, it runs the text through spaCy (the `nlp` object), keeps the lemma of every non-stop-word token, and joins the lemmas back into a single string.
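The same pattern extends to other spaCy features. As a hedged sketch (not part of the original pipeline), here is a hypothetical variant that also appends named-entity labels as extra tokens, so that an entity like "London" contributes an `ENT_GPE` feature to the TF-IDF vocabulary:

```python
class EntityAwarePreprocessor(BaseEstimator, TransformerMixin):
    """Hypothetical variant: lemmatization plus named-entity labels."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = []
        for doc in nlp.pipe(X):  # nlp.pipe batches documents efficiently
            lemmas = [t.lemma_ for t in doc if not t.is_stop]
            ent_labels = [f"ENT_{ent.label_}" for ent in doc.ents]  # e.g. ENT_GPE for "London"
            out.append(" ".join(lemmas + ent_labels))
        return out
```

Dropping it in is a one-line change: replace `('preprocessor', TextPreprocessor())` with `('preprocessor', EntityAwarePreprocessor())` in the pipeline definition.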
X = ["This is a simple test.", "Another document for testing.", "Text data with entities like London."]
y = [0, 1, 0]
We define a small toy dataset (`X` for the input texts, `y` for the labels). In practice, you would load a real dataset here.

```python
pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2)),  # Use ~100 on a real corpus
    ('classifier', StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10)),
            ('ada', AdaBoostClassifier(n_estimators=10))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10),
        cv=2
    ))
])
```
`Pipeline` defines a sequential series of processing steps:

- `preprocessor`: uses the custom transformer to prepare the text data.
- `tfidf`: converts the processed text into numerical features using TF-IDF.
- `svd`: reduces the dimensionality of the sparse matrix produced by TF-IDF using `TruncatedSVD`.
- `classifier`: uses a `StackingClassifier` as the final step. It trains a random forest and an AdaBoost classifier as base estimators and combines their out-of-fold predictions with another random forest as the final estimator.
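A note on the stacking step: `StackingClassifier` fits its base estimators with internal cross-validation and trains the final estimator on their out-of-fold predictions; the `cv` argument controls that split, which is why the tiny toy dataset needs `cv=2`. As an illustrative variation (an assumption on our part, not something the original code does), `passthrough=True` additionally feeds the SVD features to the final estimator alongside the base predictions:

```python
stacking = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=10)),
        ('ada', AdaBoostClassifier(n_estimators=10))
    ],
    final_estimator=RandomForestClassifier(n_estimators=10),
    cv=2,              # folds for generating out-of-fold base predictions
    passthrough=True,  # also pass the SVD features to the final estimator
)
```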
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
- `train_test_split`: splits the dataset into training and testing sets (`test_size=0.2`, i.e. roughly 80% training and 20% testing; `stratify=y` keeps both classes represented in each split).
- `pipeline.fit`: trains the entire pipeline on the training data (`X_train`, `y_train`).
- `pipeline.predict`: makes predictions on the test data (`X_test`).
- `classification_report`: generates a performance report showing precision, recall, F1-score, and support for each class.

In summary, this code demonstrates a scikit-learn pipeline that integrates custom text preprocessing, TF-IDF vectorization, dimensionality reduction, and an ensemble classifier for text classification.
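Since the pipeline is pitched as a template for production use, it is worth noting that the fitted pipeline is a single object and can be persisted as one artifact. A minimal sketch using `joblib` (the filename is arbitrary; pickled models should only be loaded from trusted sources):

```python
import joblib

# Save the fitted pipeline (preprocessing + TF-IDF + SVD + ensemble) as one artifact
joblib.dump(pipeline, 'text_classifier.joblib')

# Later, e.g. in a serving process: reload and predict on raw text directly
model = joblib.load('text_classifier.joblib')
print(model.predict(["Claim your free prize now!"]))
```

Note that the custom `TextPreprocessor` class must be importable in whatever process loads the saved model.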