Advanced Sales Data Analysis in Python

In this example, we’ll consider a retail company that wants to perform a comprehensive sales analysis using Python. The objectives include:

  1. Performing sales forecasting using a machine learning model.
  2. Analyzing customer segmentation using clustering.
  3. Creating an interactive dashboard for visualizing sales data and predictions.

We’ll break this down into several detailed steps:

  1. Loading and Preprocessing Data: Clean and prepare sales data.
  2. Sales Forecasting: Use a machine learning model to forecast future sales.
  3. Customer Segmentation: Apply clustering techniques to segment customers.
  4. Interactive Dashboard: Create a dashboard using Plotly Dash for visualization.

Step-by-Step Solution

1. Loading and Preprocessing Data

We’ll start by loading sales data and performing necessary preprocessing steps.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the sales data (adjust the path if the CSV lives elsewhere)
file_path = 'sales_data.csv'
sales_data = pd.read_csv(file_path)

# Display basic information about the data
print(sales_data.info())

# Convert Date column to datetime
sales_data['Date'] = pd.to_datetime(sales_data['Date'])

# Handle missing values, if any
sales_data = sales_data.dropna()

# Feature engineering: Extract year, month, and day from the Date
sales_data['Year'] = sales_data['Date'].dt.year
sales_data['Month'] = sales_data['Date'].dt.month
sales_data['Day'] = sales_data['Date'].dt.day

# Scale numerical features for clustering (kept in separate columns so the
# raw values remain available for forecasting)
scaler = StandardScaler()
sales_data[['Quantity_scaled', 'Price_scaled']] = scaler.fit_transform(sales_data[['Quantity', 'Price']])

# Display the first few rows of the preprocessed data
print(sales_data.head())

Explanation:

  • Loading Data: Import the sales data using pandas.read_csv.
  • Basic Info: Display data info to understand its structure and identify any missing values.
  • Date Conversion: Convert the Date column to datetime format for easier manipulation.
  • Handling Missing Values: Drop any rows with missing values to ensure clean data.
  • Feature Engineering: Extract additional time-based features (Year, Month, Day) from the Date column.
  • Scaling: Standardize the numerical features (Quantity and Price) with StandardScaler into Quantity_scaled and Price_scaled columns, so clustering works on comparable scales while the raw values stay available for forecasting (a quick sanity check of the scaled columns follows below).
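
As a quick sanity check on the scaling step, the standardized columns should come out with mean ≈ 0 and standard deviation ≈ 1 (column names follow the snippet above):

# Quick check: standardized columns should have mean ~0 and std ~1
print(sales_data[['Quantity_scaled', 'Price_scaled']].agg(['mean', 'std']).round(3))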

2. Sales Forecasting

We’ll use the Prophet library, which fits an additive time-series model to the daily sales totals, to forecast future sales.

from prophet import Prophet  # the package was renamed from fbprophet to prophet in version 1.0

# Prepare data for Prophet
sales_forecast_data = sales_data.groupby('Date')['Quantity'].sum().reset_index()
sales_forecast_data.columns = ['ds', 'y']

# Initialize and fit the Prophet model
model = Prophet()
model.fit(sales_forecast_data)

# Create a DataFrame with future dates
future_dates = model.make_future_dataframe(periods=365)

# Make predictions
forecast = model.predict(future_dates)

# Display forecasted data
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

Explanation:

  • Data Preparation: Group the sales data by Date and sum the Quantity to prepare it for Prophet.
  • Prophet Model: Initialize the Prophet model and fit it to the prepared data.
  • Future Dates: Generate a DataFrame with future dates (e.g., 365 days ahead) for prediction.
  • Forecasting: Use the model to predict future sales and display the results.
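
To gauge how trustworthy these forecasts are, Prophet ships with built-in cross-validation utilities. A minimal sketch, assuming the fitted model from above and at least a couple of years of history (the window sizes are illustrative):

from prophet.diagnostics import cross_validation, performance_metrics

# Rolling-origin evaluation: train on 'initial', step forward by 'period',
# and score forecasts up to 'horizon' ahead
df_cv = cross_validation(model, initial='365 days', period='90 days', horizon='90 days')
print(performance_metrics(df_cv)[['horizon', 'mae', 'rmse']].head())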

3. Customer Segmentation

We’ll use K-Means clustering to segment customers based on their purchasing behavior.

from sklearn.cluster import KMeans

# Aggregate the scaled features by CustomerID for clustering
customer_data = sales_data.groupby('CustomerID').agg({
    'Quantity_scaled': 'sum',
    'Price_scaled': 'mean'
}).reset_index()

# Apply K-Means clustering (an explicit n_init avoids version-dependent defaults)
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
customer_data['Cluster'] = kmeans.fit_predict(customer_data[['Quantity_scaled', 'Price_scaled']])

# Display the clustered data
print(customer_data.head())

Explanation:

  • Data Aggregation: Aggregate the sales data by CustomerID, summing the scaled Quantity and averaging the scaled Price for each customer.
  • Clustering: Apply K-Means clustering to segment customers into four clusters based on these two purchasing-behavior features (a quick check on the choice of four clusters follows below).
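
The choice of four clusters is somewhat arbitrary. A quick elbow check over a range of k values can help justify it; this sketch assumes the customer_data frame built above:

# Inertia (within-cluster sum of squares) for several candidate cluster counts;
# look for the point where the drop levels off (the "elbow")
features = customer_data[['Quantity_scaled', 'Price_scaled']]
for k in range(2, 9):
    inertia = KMeans(n_clusters=k, random_state=0, n_init=10).fit(features).inertia_
    print(f'k={k}: inertia={inertia:.1f}')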

4. Interactive Dashboard

Create an interactive dashboard using Plotly Dash to visualize the data and predictions.

import dash
from dash import dcc, html  # dash_core_components and dash_html_components now live inside the dash package
from dash.dependencies import Input, Output
import plotly.graph_objs as go

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    html.H1("Retail Sales Dashboard"),
    dcc.Graph(id='sales-forecast'),
    dcc.Graph(id='customer-segmentation'),
])

# Define callback to update sales forecast graph
@app.callback(
    Output('sales-forecast', 'figure'),
    [Input('sales-forecast', 'id')]
)
def update_sales_forecast(_):
    figure = go.Figure()
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat'], mode='lines', name='Forecast'))
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat_lower'], mode='lines', name='Lower Bound'))
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat_upper'], mode='lines', name='Upper Bound', fill='tonexty'))
    figure.update_layout(title='Sales Forecast', xaxis_title='Date', yaxis_title='Sales')
    return figure

# Define callback to update customer segmentation graph
@app.callback(
    Output('customer-segmentation', 'figure'),
    [Input('customer-segmentation', 'id')]
)
def update_customer_segmentation(_):
    figure = go.Figure()
    for cluster in customer_data['Cluster'].unique():
        cluster_data = customer_data[customer_data['Cluster'] == cluster]
        figure.add_trace(go.Scatter(x=cluster_data['Quantity_scaled'], y=cluster_data['Price_scaled'], mode='markers', name=f'Cluster {cluster}'))
    figure.update_layout(title='Customer Segmentation', xaxis_title='Quantity (scaled)', yaxis_title='Price (scaled)')
    return figure

# Run the app (app.run replaces the deprecated app.run_server in recent Dash releases)
if __name__ == '__main__':
    app.run(debug=True)

Explanation:

  • Dash Initialization: Initialize a Dash app and define the layout with two graphs: one for sales forecasting and one for customer segmentation.
  • Sales Forecast Graph: Define a callback to update the sales forecast graph using the forecast data from Prophet.
  • Customer Segmentation Graph: Define a callback to update the customer segmentation graph using the clustering results.
  • Plotly Visualization: Use Plotly to create interactive visualizations, enhancing user interaction and data exploration.
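
Note that the two callbacks above only fire once, when the page loads, because their Input is the graph’s static id property. If you want a control the user can actually change, a dropdown-driven variant could look like the sketch below (component ids are illustrative, and customer_data is reused from step 3):

from dash import Dash, dcc, html
from dash.dependencies import Input, Output
import plotly.graph_objs as go

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Customer Segmentation Explorer"),
    dcc.Dropdown(
        id='cluster-filter',
        options=[{'label': f'Cluster {c}', 'value': int(c)}
                 for c in sorted(customer_data['Cluster'].unique())],
        value=int(customer_data['Cluster'].min()),
    ),
    dcc.Graph(id='cluster-scatter'),
])

@app.callback(Output('cluster-scatter', 'figure'), Input('cluster-filter', 'value'))
def show_cluster(cluster):
    # Redraw the scatter whenever a different cluster is selected
    subset = customer_data[customer_data['Cluster'] == cluster]
    fig = go.Figure(go.Scatter(x=subset['Quantity_scaled'], y=subset['Price_scaled'], mode='markers'))
    fig.update_layout(title=f'Cluster {cluster}',
                      xaxis_title='Quantity (scaled)', yaxis_title='Price (scaled)')
    return fig

if __name__ == '__main__':
    app.run(debug=True)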

Summary

This advanced scenario covers:

  • Data loading and preprocessing: Ensuring the data is clean and ready for analysis.
  • Sales forecasting using a machine learning model (Prophet): Predicting future sales trends.
  • Customer segmentation using clustering (K-Means): Identifying distinct customer groups based on purchasing behavior.
  • Interactive dashboard creation: Providing a user-friendly interface for visualizing results and making informed business decisions.

Testing the Scenario

To test the entire scenario, make sure the necessary libraries are installed and the appropriate data is available, then follow a structured testing approach. Here’s a step-by-step guide:

Prerequisites

Install Required Libraries: Make sure you have all the required Python libraries installed. You can install them using pip:
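
For example (the prophet package replaces the older fbprophet; pin versions as needed):

pip install pandas numpy scikit-learn prophet dash plotly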

Prepare the Data: Ensure you have a sales data CSV file named sales_data.csv with the following columns:

  • Date: Date of the sale.
  • CustomerID: Identifier for the customer.
  • Quantity: Number of items sold.
  • Price: Price of the item.

Step-by-Step Testing

1. Loading and Preprocessing Data

Create a Python script (e.g., preprocess.py) to test data loading and preprocessing:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the sales data
file_path = 'sales_data.csv'
sales_data = pd.read_csv(file_path)

# Display basic information about the data
print(sales_data.info())

# Convert Date column to datetime
sales_data['Date'] = pd.to_datetime(sales_data['Date'])

# Handle missing values, if any
sales_data = sales_data.dropna()

# Feature engineering: Extract year, month, and day from the Date
sales_data['Year'] = sales_data['Date'].dt.year
sales_data['Month'] = sales_data['Date'].dt.month
sales_data['Day'] = sales_data['Date'].dt.day

# Scale numerical features for clustering (kept in separate columns so the
# raw values remain available for forecasting)
scaler = StandardScaler()
sales_data[['Quantity_scaled', 'Price_scaled']] = scaler.fit_transform(sales_data[['Quantity', 'Price']])

# Display the first few rows of the preprocessed data
print(sales_data.head())

Run this script to ensure the preprocessing works correctly.
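
Beyond eyeballing the printed output, a few assertions appended to preprocess.py make the check repeatable (thresholds and column names follow the script above):

# Simple smoke tests for the preprocessing step
assert pd.api.types.is_datetime64_any_dtype(sales_data['Date']), 'Date not parsed as datetime'
assert sales_data.isna().sum().sum() == 0, 'unexpected missing values'
assert {'Year', 'Month', 'Day', 'Quantity_scaled', 'Price_scaled'}.issubset(sales_data.columns)
print('Preprocessing checks passed')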

2. Sales Forecasting

Create a Python script (e.g., forecast.py) to test sales forecasting:

from prophet import Prophet  # the package was renamed from fbprophet to prophet in version 1.0
import pandas as pd

# Load preprocessed sales data
file_path = 'sales_data.csv'
sales_data = pd.read_csv(file_path)
sales_data['Date'] = pd.to_datetime(sales_data['Date'])

# Prepare data for Prophet
sales_forecast_data = sales_data.groupby('Date')['Quantity'].sum().reset_index()
sales_forecast_data.columns = ['ds', 'y']

# Initialize and fit the Prophet model
model = Prophet()
model.fit(sales_forecast_data)

# Create a DataFrame with future dates
future_dates = model.make_future_dataframe(periods=365)

# Make predictions
forecast = model.predict(future_dates)

# Display forecasted data
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

Run this script to ensure the sales forecasting works correctly.
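
Besides inspecting the tail of the forecast frame, Prophet’s built-in plotting helpers give a quick visual check (matplotlib is installed as a Prophet dependency):

import matplotlib.pyplot as plt

fig = model.plot(forecast)                     # history plus forecast with uncertainty band
components = model.plot_components(forecast)   # trend and seasonality
plt.show()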

3. Customer Segmentation

Create a Python script (e.g., segmentation.py) to test customer segmentation:

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load preprocessed sales data
file_path = 'sales_data.csv'
sales_data = pd.read_csv(file_path)

# Aggregate data by Customer ID for clustering
customer_data = sales_data.groupby('CustomerID').agg({
    'Quantity': 'sum',
    'Price': 'mean'
}).reset_index()

# Scale the data
scaler = StandardScaler()
customer_data[['Quantity', 'Price']] = scaler.fit_transform(customer_data[['Quantity', 'Price']])

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
customer_data['Cluster'] = kmeans.fit_predict(customer_data[['Quantity', 'Price']])

# Display the clustered data
print(customer_data.head())

Run this script to ensure customer segmentation works correctly.
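
To sanity-check the clustering output, the silhouette score (higher is better, range -1 to 1) and the cluster sizes are quick indicators; this assumes the customer_data frame from the script above:

from sklearn.metrics import silhouette_score

score = silhouette_score(customer_data[['Quantity', 'Price']], customer_data['Cluster'])
print(f'Silhouette score: {score:.3f}')
print(customer_data['Cluster'].value_counts())  # how many customers fall in each segment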

4. Interactive Dashboard

Create a Python script (e.g., dashboard.py) to test the interactive dashboard:

import dash
from dash import dcc, html  # dash_core_components and dash_html_components now live inside the dash package
from dash.dependencies import Input, Output
import plotly.graph_objs as go
import pandas as pd
from prophet import Prophet  # the package was renamed from fbprophet to prophet in version 1.0
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load and preprocess data
file_path = 'sales_data.csv'
sales_data = pd.read_csv(file_path)
sales_data['Date'] = pd.to_datetime(sales_data['Date'])
sales_data = sales_data.dropna()
sales_data['Year'] = sales_data['Date'].dt.year
sales_data['Month'] = sales_data['Date'].dt.month
sales_data['Day'] = sales_data['Date'].dt.day

# Forecasting (fit on the raw daily quantities so predictions are in real units)
sales_forecast_data = sales_data.groupby('Date')['Quantity'].sum().reset_index()
sales_forecast_data.columns = ['ds', 'y']
model = Prophet()
model.fit(sales_forecast_data)
future_dates = model.make_future_dataframe(periods=365)
forecast = model.predict(future_dates)

# Clustering: aggregate per customer, then scale the aggregated features
customer_data = sales_data.groupby('CustomerID').agg({
    'Quantity': 'sum',
    'Price': 'mean'
}).reset_index()
scaler = StandardScaler()
customer_data[['Quantity', 'Price']] = scaler.fit_transform(customer_data[['Quantity', 'Price']])
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
customer_data['Cluster'] = kmeans.fit_predict(customer_data[['Quantity', 'Price']])

# Initialize the Dash app
app = dash.Dash(__name__)

# Define the layout of the app
app.layout = html.Div([
    html.H1("Retail Sales Dashboard"),
    dcc.Graph(id='sales-forecast'),
    dcc.Graph(id='customer-segmentation'),
])

# Define callback to update sales forecast graph
@app.callback(
    Output('sales-forecast', 'figure'),
    [Input('sales-forecast', 'id')]
)
def update_sales_forecast(_):
    figure = go.Figure()
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat'], mode='lines', name='Forecast'))
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat_lower'], mode='lines', name='Lower Bound'))
    figure.add_trace(go.Scatter(x=forecast['ds'], y=forecast['yhat_upper'], mode='lines', name='Upper Bound', fill='tonexty'))
    figure.update_layout(title='Sales Forecast', xaxis_title='Date', yaxis_title='Sales')
    return figure

# Define callback to update customer segmentation graph
@app.callback(
    Output('customer-segmentation', 'figure'),
    [Input('customer-segmentation', 'id')]
)
def update_customer_segmentation(_):
    figure = go.Figure()
    for cluster in customer_data['Cluster'].unique():
        cluster_data = customer_data[customer_data['Cluster'] == cluster]
        figure.add_trace(go.Scatter(x=cluster_data['Quantity'], y=cluster_data['Price'], mode='markers', name=f'Cluster {cluster}'))
    figure.update_layout(title='Customer Segmentation', xaxis_title='Total Quantity (scaled)', yaxis_title='Average Price (scaled)')
    return figure

# Run the app (app.run replaces the deprecated app.run_server in recent Dash releases)
if __name__ == '__main__':
    app.run(debug=True)

Run this script to start the dashboard. Open your browser and navigate to the displayed URL to interact with the dashboard.

Testing Each Component

  • Data Loading and Preprocessing: Ensure the preprocess.py script runs without errors and the output looks correct.
  • Sales Forecasting: Ensure the forecast.py script runs and outputs reasonable forecast data.
  • Customer Segmentation: Ensure the segmentation.py script runs and outputs reasonable clusters.
  • Interactive Dashboard: Ensure the dashboard.py script runs, the dashboard loads in the browser, and the visualizations update correctly based on the callbacks.
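
One straightforward way to exercise the components end to end is to run the scripts in order from a terminal; dashboard.py then starts the server and keeps running until you stop it:

python preprocess.py
python forecast.py
python segmentation.py
python dashboard.py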

Comprehensive Testing

  • Use a sample dataset that mimics the structure and variety of the real dataset.
  • Validate the outputs at each step by comparing them with expected results.
  • Check for edge cases, such as handling missing values and outliers.

When you run the dashboard.py script, Dash will start a local server. By default, it runs on http://127.0.0.1:8050.

Here is how you can start the server and access the dashboard:

  1. Open a terminal or command prompt.
  2. Navigate to the directory where your dashboard.py script is located.
  3. Run the script using Python:
    python dashboard.py
  4. After running the script, you should see output similar to this:
    Dash is running on http://127.0.0.1:8050/
  5. Open your web browser and navigate to the displayed URL: http://127.0.0.1:8050.