Case Study of E-commerce Demand Forecasting

Setting the right prices for products has long been an essential problem.

Conventionally, companies have applied their own pricing strategies, which vary considerably from one company to another. These strategies often rely on heuristic solutions based on the company's own experience and knowledge. As is widely known, however, such heuristics are generally not the best solutions.

Recently, price optimization has been gaining popularity among pricing strategies. Price optimization uses machine-learning methods to forecast demand and find the best prices for various problems. For example, it often uses features such as competition, weather, season, special events or holidays, macroeconomic variables, operating costs, and warehouse information to determine the initial price, the best price, the discount price, and the promotional price. Price optimization has the potential to produce large profits in retail and B2B markets. In particular, it can be a powerful solution in e-commerce, where customers are used to frequent price changes: real-time price optimization is critical because demand changes dramatically within a very short time span.

Nonetheless, price optimization is still a challenging issue. Demand can change substantially due to various factors in the world, and it is often difficult to forecast the demand accurately and set the right prices.

In this work, I share a simple case study of demand forecasting, which is an essential part of price optimization. The first section describes the dataset, which is based on the open-source data provided by Olist on Kaggle. A detailed discussion and the full source code are available in the Kaggle notebook.

Acknowledgements

Throughout this work, I refer to the great introductory articles on price optimization by tryolabs.

Data Collection
Exploratory Data Analysis
Demand Forecast Modeling
Evaluation
Conclusions

Data Collection

The Brazilian E-Commerce Public Dataset by Olist provides multiple datasets related to Olist's order history from 2016 to 2018. Based on this dataset, I created a new dataset that contains the sales history.

df.info()

Exploratory Data Analysis

1. First, let's look at the demand history.

# daily plot
pd.to_datetime(df.timestamp).dt.date.hist(bins=593,figsize=(15,3))

The variance of daily demand is very high.
As a whole, the number of sales has increased over time.
In the daily plot, a periodic (weekly) change in demand is observed: demand increases on weekends.
There are many (apparently random) intense shifts. For example, it seems that the sales history has an unusual peak on Black Friday, suggesting that a specific event affects the purchase.
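One way to check the weekly pattern noted above is to aggregate daily sales by weekday. The following is a minimal sketch, assuming only that `df` has a `timestamp` column as in the plotting code; the helper names `daily_demand` and `mean_demand_by_weekday` and the toy data are hypothetical and do not come from the notebook.

```python
import pandas as pd


def daily_demand(df: pd.DataFrame) -> pd.Series:
    """Count sales per calendar day; days without any sale get a count of 0."""
    ts = pd.to_datetime(df["timestamp"])
    counts = ts.dt.floor("D").value_counts().sort_index()
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
    return counts.reindex(full_range, fill_value=0)


def mean_demand_by_weekday(daily: pd.Series) -> pd.Series:
    """Average daily sales for each weekday (0 = Monday, 6 = Sunday)."""
    return daily.groupby(daily.index.dayofweek).mean()


# Toy data for illustration only (not taken from the Olist dataset).
toy = pd.DataFrame({"timestamp": [
    "2018-01-01 10:00", "2018-01-01 12:00",  # Monday
    "2018-01-03 09:00",                      # Wednesday
    "2018-01-08 08:00",                      # Monday
]})
daily = daily_demand(toy)
by_weekday = mean_demand_by_weekday(daily)
print(by_weekday)
```

If the weekend rows of `by_weekday` are clearly higher than the weekday rows on the real data, that supports the weekly periodicity seen in the plot.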

2. Purchase frequency by items.

Demand forecasting is difficult for a product that does not have enough purchase history. Therefore, the distribution of purchase frequency per product is inspected here.

df.product_id.value_counts().hist(bins=30, range=(0, 30),figsize=(10,3))

There are many items that have been purchased only once or a few times.
To build a demand forecasting model, we need to remove items that have only a few sales, because the sales of these items are extremely difficult to predict.
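The filtering step described above could look roughly like this. It is a sketch under assumptions: the threshold `MIN_SALES` and the helper `keep_frequent_items` are hypothetical, and the cutoff actually used in the notebook is not stated in this article.

```python
import pandas as pd

MIN_SALES = 3  # hypothetical threshold; the notebook's actual cutoff is not given here


def keep_frequent_items(df: pd.DataFrame, min_sales: int = MIN_SALES) -> pd.DataFrame:
    """Keep only the rows whose product_id appears at least `min_sales` times."""
    counts = df["product_id"].value_counts()
    frequent_ids = counts[counts >= min_sales].index
    return df[df["product_id"].isin(frequent_ids)]


# Toy data for illustration only.
toy = pd.DataFrame({"product_id": ["a", "a", "a", "b", "c", "c"]})
filtered = keep_frequent_items(toy)
print(filtered["product_id"].value_counts())
```

A higher threshold leaves fewer items but gives each remaining item more history to learn from, so the value is a trade-off worth tuning.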

Demand Forecast Modeling

Here, I summarize what I did in modeling.

For the full source code and detailed discussion, please see the Kaggle notebook.

Preprocess summary.

features
- item id
- seasonal information (year, month, week, day of week, day)
- historical sales trend information (total sales in last 1 day, 3 days, 1 week, 1 month, 1 year)
- Brazilian economic information
  - Brazilian imports
  - Brazilian consumer confidence
- label
  - sales volume
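The seasonal and historical-sales features listed above can be sketched as follows. This assumes a daily table with one row per item and day and no missing days; the column names `date` and `sales`, the helper `add_history_features`, and the window sizes are hypothetical choices for illustration. Each window is shifted by one day so that a row only sees sales from before its own date.

```python
import pandas as pd


def add_history_features(daily: pd.DataFrame, windows=(1, 3, 7)) -> pd.DataFrame:
    """Add calendar features and trailing sales totals per product.

    `daily` needs the columns product_id, date, sales (one row per item and day).
    """
    out = daily.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["product_id", "date"])

    # Seasonal information derived from the date.
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["day_of_week"] = out["date"].dt.dayofweek
    out["day"] = out["date"].dt.day

    # Total sales over the previous w days, excluding the current day (shift(1)).
    grouped = out.groupby("product_id")["sales"]
    for w in windows:
        out[f"sales_last_{w}d"] = grouped.transform(
            lambda s, w=w: s.shift(1).rolling(w, min_periods=1).sum()
        )
    return out


# Toy data for illustration only.
toy = pd.DataFrame({
    "product_id": ["a"] * 5,
    "date": pd.date_range("2018-01-01", periods=5, freq="D"),
    "sales": [1, 2, 3, 4, 5],
})
feats = add_history_features(toy)
print(feats[["date", "sales", "sales_last_1d", "sales_last_3d"]])
```

The first day of each item has no history, so its trailing totals are missing values; tree-based models such as LightGBM can handle these without imputation.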

Model training summary.

Model training
- LightGBM model
- For each target day, a model is trained on the past data and makes a prediction for the target day.
  - e.g., to predict the sales volume on January 1st, 2018, all the data between the start date and December 31st, 2017 are used as the training dataset. The model then predicts the sales volume for January 1st, 2018.
- Repeat model training and prediction for each day to obtain the demand forecasting curve.
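The retrain-per-day procedure above can be sketched as a walk-forward loop. To keep the example free of extra dependencies, it uses a stand-in `MeanModel` that simply predicts the training mean; in the article's setting the factory would instead return a LightGBM regressor (for example `lightgbm.LGBMRegressor`, which exposes the same `fit`/`predict` interface). The function `walk_forward_forecast` and the column names `date` and `sales` are hypothetical.

```python
import numpy as np
import pandas as pd


class MeanModel:
    """Stand-in regressor that predicts the training mean (placeholder for LightGBM)."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def walk_forward_forecast(data, feature_cols, target_days, make_model):
    """For each target day, fit a fresh model on all earlier rows and predict that day."""
    preds = []
    for day in target_days:
        train = data[data["date"] < day]
        test = data[data["date"] == day]
        if train.empty or test.empty:
            continue  # nothing to train on, or nothing to predict
        model = make_model()
        model.fit(train[feature_cols], train["sales"])
        day_pred = test[["date"]].copy()
        day_pred["y_pred"] = model.predict(test[feature_cols])
        preds.append(day_pred)
    return pd.concat(preds)


# Toy data for illustration only.
toy = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=4, freq="D"),
    "sales": [2, 4, 6, 8],
})
toy["day_of_week"] = toy["date"].dt.dayofweek
forecast = walk_forward_forecast(toy, ["day_of_week"], toy["date"].iloc[2:], MeanModel)
print(forecast)
```

Retraining for every target day is expensive, but it guarantees that each prediction uses only data that was available before that day.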

Evaluation

Now, let's evaluate the model.

1. Model decay

Let's see how fast our model decays. The graph below shows how the mean squared error changes when predicting sales n days ahead.

Fig. 4: time-series variation of mean squared error

The mean squared error worsens as the forecast horizon grows, i.e., the model decays over time.
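An error-versus-horizon curve like the one described above can be computed along these lines. This is a sketch that assumes the predictions have already been collected into a table; the column names `horizon`, `y_true`, and `y_pred`, the helper `mse_by_horizon`, and the toy numbers are hypothetical.

```python
import pandas as pd


def mse_by_horizon(results: pd.DataFrame) -> pd.Series:
    """Mean squared error for each forecast horizon (days between training and target)."""
    squared_error = (results["y_true"] - results["y_pred"]) ** 2
    return squared_error.groupby(results["horizon"]).mean()


# Toy data for illustration only (not results from the notebook).
toy = pd.DataFrame({
    "horizon": [1, 1, 2, 2],
    "y_true": [10, 12, 10, 12],
    "y_pred": [9, 13, 7, 16],
})
decay = mse_by_horizon(toy)
print(decay)
```

Plotting `decay` against the horizon gives a direct picture of how quickly a model trained once loses accuracy.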

Next, let's compare the 30-day demand forecasting curves to the actual demand curve.

We compare two different demand forecasting curves to understand the model decay.

Demand forecasting where models are retrained to make predictions each day. ("retrained")
Demand forecasting where a model is trained only once and used for all predictions for the following 30 days. ("once")

Compared to y_once (a model is trained only once and used for all the predictions over the following 30 days), y_retrain (models are retrained every day) is more successful in predicting the fluctuations, especially after the first 10 days. This means the demand trend is very sensitive to the latest information, to the point that the model itself must be updated.

Lastly, let's visualize demand forecasting (for one specific item).

The model successfully predicts some of the large peaks, but it fails to predict others (e.g., the one at the beginning of February 2018) as well as the small fluctuations.

Conclusions

In summary, this work has shown a simple case study of demand forecasting.

Although we have very limited data, the model predicts some trends in the demand history. For example, the result in Fig. 6 shows that the model successfully predicts some of the main peaks in the demand history of a specific product (4 big peaks in January, February, and March 2018). It also captures some trends of the average demand curve.

This time we did not use detailed information about products, but further feature engineering and integration of these data should improve the demand forecasting model. Moreover, additional information such as store details, competition, weather, special events or holidays, other macroeconomic variables, and operating costs should improve the performance of the model.

As future work, examining the price elasticity of demand would be a meaningful direction for further exploration.

Thank you for reading this article. I appreciate any comments, questions, and suggestions!

#datascience #dataanalysis #ecommerce #demandforecasting