Prepare your Data

Python and Machine Learning (ML) – Part 1 Prepare your Data

Data preparation is a critical step in the machine learning workflow. It involves several processes aimed at making the raw data suitable for building and training machine learning models. Proper data preparation can significantly improve the performance of your models. Here are the key components:

1. Data Cleaning

Data often comes with errors or missing values. The cleaning process aims to address these issues by:

Removing Duplicates: Ensuring that the data does not contain duplicate records, which can skew results.
Handling Missing Data: Imputing missing values with a statistic such as the mean or median, predicting them from other available data, or removing rows or columns that have too many missing values.
Filtering Outliers: Identifying outliers that can adversely affect the model’s performance, then removing, capping, or transforming them.
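
The three cleaning steps above can be sketched with pandas. This is a minimal illustration, not a definitive recipe: the column names and values are made up, median imputation is only one of the options mentioned above, and the 1.5 × IQR rule is one common convention for flagging outliers that this lesson does not prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing age, one extreme income.
df = pd.DataFrame({
    "age": [25, 25, 31, np.nan, 40, 38, 29],
    "income": [40_000, 40_000, 52_000, 48_000, 61_000, 1_000_000, 45_000],
})

# 1. Remove exact duplicate rows.
df = df.drop_duplicates()

# 2. Impute missing ages with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Filter outliers with the 1.5 * IQR rule on income (an assumed convention).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range].reset_index(drop=True)

print(cleaned)
```

Whether an extreme value is an error or a genuine observation depends on the domain, so it is worth inspecting flagged rows before dropping them.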

2. Data Transformation

Transforming data into a format more suitable for modeling is crucial. This includes:

Normalization/Standardization: Scaling numeric data to a common range or distribution. Normalization typically means rescaling values to the [0, 1] range; standardization means shifting and scaling values so that they have a mean of zero and a standard deviation of one.
Encoding Categorical Data: Converting categories or labels into numerical values. Common techniques include one-hot encoding, label encoding, and binary encoding.
Feature Engineering: Creating new features from existing data to improve model predictions. This could involve aggregating data, combining features, or extracting date parts from datetime columns.
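
As a rough sketch of these transformations, the example below scales a numeric column both ways, one-hot encodes a categorical column, and extracts date parts. The DataFrame and its column names (price, color, sold_at) are hypothetical, and scikit-learn and pandas are assumed to be installed.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "color": ["red", "blue", "red", "green"],
    "sold_at": pd.to_datetime(
        ["2024-01-15", "2024-02-20", "2024-03-05", "2024-12-31"]
    ),
})

# Normalization: rescale price to the [0, 1] range.
df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()

# Standardization: shift and scale price to mean 0 and standard deviation 1.
df["price_std"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# One-hot encoding: one indicator column per category.
df = pd.get_dummies(df, columns=["color"], prefix="color")

# Feature engineering: extract date parts from the datetime column.
df["sold_month"] = df["sold_at"].dt.month
df["sold_weekday"] = df["sold_at"].dt.dayofweek  # Monday = 0

print(df)
```

Note that StandardScaler divides by the population standard deviation, so the result can differ slightly from a calculation that uses the sample standard deviation.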

3. Data Reduction

Reducing the volume of data without losing critical information can enhance model performance, especially in terms of speed and memory usage:

Dimensionality Reduction: Techniques such as Principal Component Analysis (PCA) reduce the number of variables by capturing the essential characteristics of the data in fewer variables. t-SNE also reduces dimensionality, but it is used mainly to visualize data in two or three dimensions rather than to produce input features for a model.
Feature Selection: Choosing the most relevant features for use in model construction. This involves removing irrelevant or only partially relevant features that can negatively impact model performance.
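
The difference between the two approaches can be illustrated on scikit-learn's bundled Iris dataset. This is a sketch under assumptions: the choice of two components, two selected features, and the ANOVA F-score as the selection criterion are illustrative and not taken from this lesson.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Dimensionality reduction: project standardized features onto 2 new components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Feature selection: keep the 2 original features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print(f"PCA output shape: {X_pca.shape}, variance explained: {explained:.2f}")
print(f"Indices of selected original features: {kept}")
```

PCA builds new variables as combinations of all the original ones, while feature selection keeps a subset of the original columns unchanged, which is usually easier to interpret.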

4. Data Splitting

This is the process of dividing the data into subsets:

Training Set: The data on which the model will be trained. Most of your data should be in this set.
Validation Set: A portion held out from the training data, used to tune the model's hyperparameters and to detect overfitting.
Test Set: Used only at the end of the machine learning process to assess the performance of the final model on new, unseen data.
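
One way to produce the three subsets is to call scikit-learn's train_test_split twice. The sketch below assumes a 60/20/20 split and random toy data; both are illustrative choices, and the right proportions depend on how much data you have.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 50 + [1] * 50)

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Then split the remaining 80% into training (60% overall)
# and validation (20% overall): 0.25 * 0.80 = 0.20.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing stratify keeps the class proportions the same in every subset, and fixing random_state makes the split reproducible.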

5. Handling Imbalanced Data

In classification problems, if one class has far more instances than the other, the model might develop a bias towards the majority class. Techniques to handle this include:

Resampling Techniques: Under-sampling the majority class or over-sampling the minority class to balance the dataset.
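Synthetic Data Generation: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) create new synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, rather than duplicating existing samples by over-sampling with replacement.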
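
The resampling idea can be sketched with NumPy alone. Note that this example shows plain random over-sampling and under-sampling, not SMOTE; SMOTE is available in the third-party imbalanced-learn package, which is not used here. The 90/10 class ratio and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 90 majority (0) and 10 minority (1) samples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Over-sample: draw minority rows with replacement until the classes match.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_over, y_over = X[over_idx], y[over_idx]

# Under-sample: keep a random subset of majority rows, without replacement.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over), np.bincount(y_under))
```

Resampling is normally applied to the training set only, so that the validation and test sets keep the class distribution the model will face in practice.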

Best Practices

Automate Data Preparation Steps: Use pipelines and frameworks that allow for reproducibility and consistency in how data is prepared across different models and datasets.
Iterative Approach: Data preparation is not a one-time task. It often requires iterative adjustments as you refine the model and discover new insights about the data.
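
As an illustration of automating these steps, the sketch below chains imputation, scaling, encoding, and a classifier into a single scikit-learn Pipeline. The dataset, the column names, and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with numeric and categorical columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "income": rng.normal(50_000, 10_000, size=n),
    "city": rng.choice(["A", "B", "C"], size=n),
})
df.loc[df.index % 17 == 0, "age"] = np.nan  # inject some missing values
y = (df["income"] > 50_000).astype(int)     # made-up target for the demo

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Route each column group to its own preparation step.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)  # preparation statistics come from X_train only
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the imputer and scaler are fitted inside the pipeline, their statistics are learned from the training data only and then reused on the test data, which keeps the preparation consistent and avoids leaking information from the test set.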
