Scikit-learn: Custom Model Pipelines with Feature Engineering

This code example demonstrates an advanced machine learning pipeline that processes text data through custom preprocessing, dimensionality reduction, and classification using ensemble methods. Here’s a detailed description:

Purpose

To create a text classification pipeline using scikit-learn, which includes:

  1. Custom text preprocessing for tokenization and lemmatization.
  2. Transforming text data into numerical features using TF-IDF.
  3. Reducing the dimensionality of the resulting feature space using TruncatedSVD.
  4. Applying a stacked ensemble model for classification.

Key Features

  1. Custom Text Preprocessing: A custom transformer leverages spaCy to tokenize, remove stopwords, and lemmatize the text data to create structured input features.
  2. Dimensionality Reduction: Uses TF-IDF for text vectorization followed by TruncatedSVD to reduce the high-dimensional feature space into a manageable size.
  3. Stacked Ensemble Model: A StackingClassifier combines the strengths of multiple classifiers (RandomForest and AdaBoost) to improve predictive performance.

Benefits

  • Reusable Preprocessing: The custom preprocessing logic is encapsulated in a scikit-learn compatible transformer, making it reusable in other pipelines.
  • Pipeline Flexibility: Each step (preprocessing, feature transformation, dimensionality reduction, and classification) can be modified independently, allowing easy customization.
  • Advanced Ensemble Techniques: The StackingClassifier demonstrates how to combine the predictions of multiple classifiers for potentially better predictive performance.
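As a minimal sketch of the flexibility described above, a single step can be retuned or swapped through set_params without rebuilding the pipeline. The three-step pipeline below is a simplified stand-in (it omits the custom preprocessor), and the parameter values are arbitrary choices for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Simplified stand-in pipeline; step names and values are illustrative only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2)),
    ("classifier", RandomForestClassifier(n_estimators=10)),
])

# Nested parameters use the "<step name>__<parameter>" convention.
pipe.set_params(tfidf__ngram_range=(1, 2), svd__n_components=3)

# A whole step can be replaced by assigning a new estimator to its name.
pipe.set_params(classifier=AdaBoostClassifier(n_estimators=10))

print(pipe.named_steps["svd"].n_components)
print(type(pipe.named_steps["classifier"]).__name__)
```

The same double-underscore names can be used as keys in a parameter grid when tuning the pipeline.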

Use Cases

This pipeline is suitable for text classification tasks that require advanced feature engineering and model tuning, such as sentiment analysis, spam detection, or topic classification in documents. It serves as a template for building sophisticated pipelines that can be extended for production use.

The pipeline processes text data in three stages: custom text preprocessing, vectorization with dimensionality reduction, and a machine learning model.

Let’s break it down step by step:

  1. Text Preprocessing:
    • Tokenization, stopword removal, and lemmatization to normalize the raw text.
    • We’ll use spaCy for these steps.
  2. Dimensionality Reduction:
    • Applying TF-IDF to vectorize the text data into numerical features.
    • Further reducing the dimensionality using TruncatedSVD (SVD for sparse data) to make it suitable for our model.
  3. Machine Learning Model:
    • Using a custom ensemble model such as StackingClassifier with several different classifiers.
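The vectorize-then-reduce step can be sketched on its own, without spaCy, to show how the data changes shape. The four documents and n_components=2 below are illustrative assumptions, not values from the pipeline in this article.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (an assumption for this sketch).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# TF-IDF yields a sparse matrix with one column per vocabulary term.
tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs)

# TruncatedSVD accepts sparse input directly and returns a dense array.
# n_components must not exceed the number of TF-IDF features.
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.fit_transform(X_sparse)

print(X_sparse.shape)  # (documents, vocabulary terms)
print(X_dense.shape)   # (documents, n_components)
print(svd.explained_variance_ratio_.sum())
```

The sum of explained_variance_ratio_ gives a rough sense of how much of the TF-IDF variance the chosen number of components retains.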

Code Example

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy

# Load the spaCy model (download it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Custom transformer for text preprocessing
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]

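# Load sample data
# X, y = load_data()  # Replace with actual data loading logic
# Toy corpus for illustration only: 0 = technology, 1 = food
X = [
    "The new laptop has a fast processor.",
    "This phone ships with a bright display.",
    "Developers released a software update today.",
    "The server crashed after the network failed.",
    "Engineers tested the battery of the tablet.",
    "She baked fresh bread this morning.",
    "The soup needs more salt and pepper.",
    "We grilled vegetables for dinner.",
    "He ordered pasta with tomato sauce.",
    "The bakery sells chocolate cake.",
]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),  # Custom text preprocessing
    ('tfidf', TfidfVectorizer()),  # TF-IDF Vectorization
    # Dimensionality Reduction: n_components must not exceed the number of
    # TF-IDF features, so keep it small here (around 100 suits a real corpus)
    ('svd', TruncatedSVD(n_components=2, random_state=42)),
    ('classifier', StackingClassifier(  # Ensemble Model
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10)),
            ('ada', AdaBoostClassifier(n_estimators=10))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10),
        cv=2  # The default 5-fold CV needs more samples per class than this toy corpus has
    ))
])

# Train-test split (stratified so both classes appear in each split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)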

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

Code Explanation

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy

We import the necessary modules from scikit-learn and spaCy. Here are their functions:

  • BaseEstimator, TransformerMixin: For building a custom transformer.
  • Pipeline: To create a processing pipeline.
  • TfidfVectorizer: To transform text into TF-IDF features.
  • TruncatedSVD: To perform dimensionality reduction on sparse matrices.
  • RandomForestClassifier, AdaBoostClassifier, StackingClassifier: To build an ensemble classifier.
  • train_test_split: To split the dataset into training and testing sets.
  • classification_report: To generate a performance report.
  • spacy: For text processing (tokenization and lemmatization).

Load spaCy Model

nlp = spacy.load('en_core_web_sm')
  • We load a pre-trained spaCy model for natural language processing tasks like tokenization, lemmatization, and named entity recognition.

Custom Transformer

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]
  • This class is a custom text transformer that inherits from scikit-learn’s BaseEstimator and TransformerMixin.
  • __init__: Initializes the transformer.
  • fit: Does nothing but allows the class to be compatible with scikit-learn’s pipeline.
  • transform: The core function of the transformer:
    • For each text document in X, it processes the text using spaCy (nlp object).
    • It tokenizes the document, filters out stopwords, and extracts the lemmatized form of each word.
    • The lemmas of each document are joined into a single string, and the method returns a list with one string per document.
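If the transformer needs options, they belong in __init__ as plain attributes so that scikit-learn can inspect, copy, and tune them. The sketch below illustrates that convention with SimpleTextCleaner, a hypothetical stand-in that uses plain string operations instead of spaCy so it runs without a language model; its stopword list is an illustrative assumption.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class SimpleTextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TextPreprocessor that needs no spaCy model."""

    def __init__(self, lowercase=True, stop_words=("a", "is", "the", "this")):
        # Store constructor arguments unchanged and under the same names so
        # that get_params(), set_params(), and clone() work as expected.
        self.lowercase = lowercase
        self.stop_words = stop_words

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        stop = set(self.stop_words)
        cleaned = []
        for doc in X:
            tokens = doc.lower().split() if self.lowercase else doc.split()
            cleaned.append(" ".join(t for t in tokens if t not in stop))
        return cleaned


cleaner = SimpleTextCleaner()
result = cleaner.fit_transform(["This is a simple test"])
copy = clone(cleaner).set_params(lowercase=False)

print(result)
print(copy.get_params()["lowercase"])
```

Because the options are ordinary constructor parameters, they can also be addressed from a pipeline as, for example, preprocessor__lowercase.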

Data Loading

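# Load sample data
# Toy corpus for illustration only: 0 = technology, 1 = food
X = [
    "The new laptop has a fast processor.",
    "This phone ships with a bright display.",
    "Developers released a software update today.",
    "The server crashed after the network failed.",
    "Engineers tested the battery of the tablet.",
    "She baked fresh bread this morning.",
    "The soup needs more salt and pepper.",
    "We grilled vegetables for dinner.",
    "He ordered pasta with tomato sauce.",
    "The bakery sells chocolate cake.",
]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
  • We define a small toy dataset (X for input text, y for labels). It is only large enough to let the pipeline run; in practice, you would load a real dataset here.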

Pipeline Definition

pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2, random_state=42)),
    ('classifier', StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10)),
            ('ada', AdaBoostClassifier(n_estimators=10))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10),
        cv=2
    ))
])
  • Pipeline: Defines a sequential pipeline of processing steps:
    • preprocessor: Uses the custom text processor to prepare text data.
    • tfidf: Converts the processed text into numerical features using TF-IDF.
    • svd: Reduces the dimensionality of the sparse matrix produced by TF-IDF using TruncatedSVD. n_components must not exceed the number of TF-IDF features, so it is kept small for the toy corpus; a value around 100 is more typical for a real corpus.
    • classifier: Uses a StackingClassifier as the final step of the pipeline:
      • Base models: RandomForestClassifier and AdaBoostClassifier.
      • Final model: RandomForestClassifier, trained on the cross-validated predictions of the base models. cv=2 is used only because the toy corpus is too small for the default 5 folds.
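To see what the final model receives, the sketch below fits the same kind of stacked ensemble on synthetic numeric features from make_classification, which stand in for the SVD output (an assumption made so the example runs without text data). The sample count, feature count, and random_state values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
)

# Synthetic binary-classification data standing in for the SVD output.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=10, random_state=0)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    cv=5,  # folds used to build out-of-fold predictions for the final model
)
stack.fit(X, y)

# transform() exposes the base-model outputs that feed the final estimator.
meta_features = stack.transform(X)

print(meta_features.shape)
print(stack.stack_method_)
```

For a binary problem each base model contributes a single probability column, so the final RandomForestClassifier here is trained on two meta-features rather than on the original ten.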

Model Training and Evaluation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
  • train_test_split: Splits the dataset into training and testing sets (80% training, 20% testing). stratify=y keeps the class proportions the same in both sets, which matters on small datasets.
  • pipeline.fit: Trains the entire pipeline on the training data (X_train, y_train).
  • pipeline.predict: Makes predictions on the test data (X_test).
  • classification_report: Generates a performance report showing precision, recall, F1-score, and support for each class.
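The metrics in the report can be checked by hand on a small example. The labels below are made up for illustration and are not output from the pipeline; output_dict=True returns the same numbers as a dictionary instead of formatted text.

```python
from sklearn.metrics import classification_report

# Hand-made labels for illustration only.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# Class 0: 1 of 2 predicted zeros is correct, and 1 of 2 true zeros is found.
print(report["0"]["precision"], report["0"]["recall"])
# Class 1: 2 of 3 predicted ones are correct, and 2 of 3 true ones are found.
print(report["1"]["precision"], report["1"]["recall"])
# Support is the number of true samples of each class.
print(report["0"]["support"], report["1"]["support"])
print(report["accuracy"])
```

Reading the per-class rows this way helps spot a model that scores well overall while missing most of one class.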

In summary, this code demonstrates a scikit-learn pipeline that integrates custom text preprocessing, TF-IDF vectorization, dimensionality reduction, and an ensemble classifier for text classification.