Scikit-learn: Custom Model Pipelines with Feature Engineering

Python and Machine Learning (ML) – Part 2

This code example demonstrates an advanced machine learning pipeline that processes text data through custom preprocessing, dimensionality reduction, and classification using ensemble methods. Here’s a detailed description:

Purpose

To create a text classification pipeline using scikit-learn, which includes:

Custom text preprocessing for tokenization and lemmatization.
Transforming text data into numerical features using TF-IDF.
Reducing the dimensionality of the resulting feature space using TruncatedSVD.
Applying a stacked ensemble model for classification.

Key Features

Custom Text Preprocessing: A custom transformer leverages spaCy to tokenize, remove stopwords, and lemmatize the text data to create structured input features.
Dimensionality Reduction: Uses TF-IDF for text vectorization followed by TruncatedSVD to reduce the high-dimensional feature space into a manageable size.
Stacked Ensemble Model: A StackingClassifier combines the strengths of multiple classifiers (RandomForest and AdaBoost) to improve predictive performance.

Benefits

Reusable Preprocessing: The custom preprocessing logic is encapsulated in a scikit-learn compatible transformer, making it reusable in other pipelines.
Pipeline Flexibility: Each step (preprocessing, feature transformation, dimensionality reduction, and classification) can be modified independently, allowing easy customization.
Advanced Ensemble Techniques: The StackingClassifier demonstrates how to combine the predictions of multiple classifiers for potentially better predictive performance.
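
As a rough sketch of the flexibility described above, a scikit-learn Pipeline lets you change one parameter of one step with the step__parameter naming convention, or replace a whole step by its name. The shortened pipeline below (without the spaCy preprocessor) and the DecisionTreeClassifier replacement are assumptions made for illustration only; they are not part of the pipeline built in this tutorial.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Shortened, illustrative pipeline (no custom preprocessor, nothing is fitted here)
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=100)),
    ("classifier", RandomForestClassifier(n_estimators=10)),
])

# Change individual hyperparameters with the <step>__<parameter> syntax
pipe.set_params(svd__n_components=50, tfidf__ngram_range=(1, 2))

# Replace an entire step by assigning a new estimator to the step name
pipe.set_params(classifier=DecisionTreeClassifier(max_depth=3))

svd_components = pipe.named_steps["svd"].n_components
classifier_name = type(pipe.named_steps["classifier"]).__name__
print(svd_components, classifier_name)
```

The same step__parameter names are what a hyperparameter search such as GridSearchCV expects in its parameter grid, so the steps can be tuned without rewriting the pipeline.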

Use Cases

This pipeline is suitable for text classification tasks that require advanced feature engineering and model tuning, such as sentiment analysis, spam detection, or topic classification in documents. It serves as a template for building sophisticated pipelines that can be extended for production use.

To build a custom model pipeline with feature engineering in scikit-learn, we pass the text data through custom preprocessing, dimensionality reduction, and a machine learning model.

Let’s break it down step by step:

Text Preprocessing:
- Tokenization, stopword removal, and lemmatization to convert raw text into normalized text.
- We’ll use spaCy for these steps.
Dimensionality Reduction:
- Applying TF-IDF to vectorize the text data into numerical features.
- Further reducing the dimensionality using TruncatedSVD (SVD for sparse data) to make it suitable for our model.
Machine Learning Model:
- Using an ensemble model, StackingClassifier, which combines several different classifiers.
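
The dimensionality reduction step above can be sketched in isolation to show how the shape of the data changes. The toy corpus and variable names below are hypothetical and exist only to make the shapes visible; the point is that TF-IDF produces a wide sparse matrix and TruncatedSVD compresses it to a small dense one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus, used only to make the shapes visible
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs)  # sparse matrix, one column per vocabulary term

# n_components must not exceed the number of TF-IDF features
n_components = min(2, X_sparse.shape[1])
svd = TruncatedSVD(n_components=n_components, random_state=0)
X_reduced = svd.fit_transform(X_sparse)  # dense array, one column per component

print(X_sparse.shape, "->", X_reduced.shape)
```

TruncatedSVD is used rather than PCA because it works directly on the sparse TF-IDF matrix without centering it first.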

Code Example

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Custom transformer for text preprocessing
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]

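# Load sample data
# X, y = load_data()  # Replace with actual data loading logic
# Toy dataset for illustration only; the labels are arbitrary
X = [
    "This is a simple test.",
    "Another document for testing.",
    "Text data with entities like London.",
    "A short sentence about Paris.",
    "Testing the pipeline with more text.",
    "Documents can mention cities and people.",
    "Simple examples keep the code readable.",
    "Every document needs a label.",
    "The classifier learns from labeled text.",
    "More data gives more reliable results.",
]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Define the pipeline
pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),  # Custom text preprocessing
    ('tfidf', TfidfVectorizer()),  # TF-IDF Vectorization
    # n_components must not exceed the number of TF-IDF features;
    # 2 fits this toy data, larger values such as 100 suit real corpora
    ('svd', TruncatedSVD(n_components=2, random_state=42)),  # Dimensionality Reduction
    ('classifier', StackingClassifier(  # Ensemble Model
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
            ('ada', AdaBoostClassifier(n_estimators=10, random_state=42))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10, random_state=42),
        cv=2  # the default 5-fold CV needs more samples per class than the toy data has
    ))
])

# Train-test split (stratify keeps both classes in the training and test sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))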

Code Explanation

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import spacy

We import the necessary modules from scikit-learn and spaCy. Here are their functions:

BaseEstimator, TransformerMixin: For building a custom transformer.
Pipeline: To create a processing pipeline.
TfidfVectorizer: To transform text into TF-IDF features.
TruncatedSVD: To perform dimensionality reduction on sparse matrices.
RandomForestClassifier, AdaBoostClassifier, StackingClassifier: To build an ensemble classifier.
train_test_split: To split the dataset into training and testing sets.
classification_report: To generate a performance report.
spacy: For text processing (tokenization and lemmatization).

Load spaCy Model

nlp = spacy.load('en_core_web_sm')

We load a pre-trained spaCy model for natural language processing tasks like tokenization, lemmatization, and named entity recognition.

Custom Transformer

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [" ".join([token.lemma_ for token in nlp(doc) if not token.is_stop]) for doc in X]

This class is a custom text transformer that inherits from scikit-learn’s BaseEstimator and TransformerMixin.
__init__: Initializes the transformer.
fit: Does nothing but allows the class to be compatible with scikit-learn’s pipeline.
transform: The core function of the transformer:
- For each text document in X, it processes the text using spaCy (nlp object).
- It tokenizes the document, filters out stopwords, and extracts the lemmatized form of each word.
- The processed tokens are joined into a string and returned as a list.
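
The transformer contract described above can be tried without a spaCy model. The following is a minimal sketch: LowercaseTransformer and its strip parameter are hypothetical stand-ins for TextPreprocessor, chosen so the example runs with scikit-learn alone. It shows what the two base classes contribute: TransformerMixin supplies fit_transform, and BaseEstimator supplies get_params, which clone relies on.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class LowercaseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TextPreprocessor that needs no spaCy model."""

    def __init__(self, strip=True):
        # Store constructor arguments unchanged so get_params() and clone() work
        self.strip = strip

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data
        return self

    def transform(self, X):
        lowered = [doc.lower() for doc in X]
        return [doc.strip() for doc in lowered] if self.strip else lowered


transformer = LowercaseTransformer(strip=True)

# fit_transform is inherited from TransformerMixin
result = transformer.fit_transform(["  Hello World ", "SciKit-Learn"])

# get_params is inherited from BaseEstimator
params = transformer.get_params()

# clone builds a new, unfitted copy with the same parameters
transformer_copy = clone(transformer)

print(result, params)
```

Pipelines and model selection tools copy estimators with clone, which is why constructor arguments should be stored on self under the same names and without modification.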

Data Loading

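# Load sample data (toy dataset for illustration only; the labels are arbitrary)
X = [
    "This is a simple test.",
    "Another document for testing.",
    "Text data with entities like London.",
    "A short sentence about Paris.",
    "Testing the pipeline with more text.",
    "Documents can mention cities and people.",
    "Simple examples keep the code readable.",
    "Every document needs a label.",
    "The classifier learns from labeled text.",
    "More data gives more reliable results.",
]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

We define a small sample dataset (X for input text, y for labels). It only illustrates the expected format: a list of strings and a list of class labels of the same length, with enough documents per class to allow a stratified split and cross-validation inside the stacking step. In practice, you would load a real dataset here.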

Pipeline Definition

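pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('tfidf', TfidfVectorizer()),
    # n_components must not exceed the number of TF-IDF features;
    # 2 fits the toy data, larger values such as 100 suit real corpora
    ('svd', TruncatedSVD(n_components=2, random_state=42)),
    ('classifier', StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
            ('ada', AdaBoostClassifier(n_estimators=10, random_state=42))
        ],
        final_estimator=RandomForestClassifier(n_estimators=10, random_state=42),
        cv=2  # the default 5-fold CV needs more samples per class than the toy data has
    ))
])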

Pipeline: Defines a sequential pipeline of processing steps:
- preprocessor: Uses the custom text processor to prepare text data.
- tfidf: Converts the processed text into numerical features using TF-IDF.
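- svd: Reduces the dimensionality of the sparse matrix produced by TF-IDF using TruncatedSVD. The value of n_components must not exceed the number of TF-IDF features, so it has to be small for a tiny dataset.
- classifier: Uses a StackingClassifier as the last step of the pipeline: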
  - Base models: RandomForestClassifier and AdaBoostClassifier.
  - Final model: RandomForestClassifier.
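
To see what the stacking step does on its own, here is a small sketch on synthetic numeric data. The data from make_classification is an assumption that stands in for the SVD output; the estimators mirror the ones in the pipeline. The base models produce cross-validated predictions, and the final model is trained on those predictions rather than on the original features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
)

# Synthetic numeric data standing in for the SVD output (illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ("ada", AdaBoostClassifier(n_estimators=10, random_state=0)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    cv=5,  # folds used to build out-of-fold predictions for the final model
)
stack.fit(X, y)

# transform returns the base-model predictions that feed the final model;
# for a binary problem this is one probability column per base model
meta_features = stack.transform(X)
base_model_names = sorted(stack.named_estimators_)

print(meta_features.shape, base_model_names)
```

Because the final model only sees one column per base model here, a simple final estimator is often sufficient; the RandomForestClassifier is kept to match the tutorial.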

Model Training and Evaluation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))

train_test_split: Splits the dataset into training and testing sets (80% training, 20% testing). Passing stratify=y keeps the class proportions the same in both sets.
pipeline.fit: Trains the entire pipeline on the training data (X_train, y_train).
pipeline.predict: Makes predictions on the test data (X_test).
classification_report: Generates a performance report showing precision, recall, F1-score, and support for each class.
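
The split and the report can be explored without training a model. In this sketch the documents, labels, and the always-predict-0 placeholder are hypothetical; they only show how stratify=y distributes the classes and how the report can be read as a dictionary instead of printed text.

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 documents, 5 per class
X = [f"document {i}" for i in range(10)]
y = [0, 1] * 5

# stratify=y keeps the class proportions equal in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder predictions (always class 0), standing in for pipeline.predict(X_test)
y_pred = [0] * len(y_test)

# output_dict=True returns the metrics as a nested dictionary;
# zero_division=0 reports 0 for classes that were never predicted
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

print(len(X_train), len(X_test), sorted(y_test))
print(report["accuracy"], report["1"]["precision"])
```

With only two test samples the metrics carry little meaning; on a dataset this small, cross-validation over the whole pipeline gives a more stable estimate than a single split.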

In summary, this code demonstrates a scikit-learn pipeline that integrates custom text preprocessing, TF-IDF vectorization, dimensionality reduction, and an ensemble classifier for text classification.
