Seoul Bike Trip Duration Prediction Using Machine Learning, End to End

   


  Before starting, check this link: Seoul Bike Trip Duration

        Nowadays we have reduced plenty of hard work by using machines. Similarly, we can support our smart work with machines; that is called machine intelligence, and one of the processes for building that intelligence is called Machine Learning. Today I am going to show you how a machine can behave intelligently by using Machine Learning, through a real-world case study. In this case study we are going to solve everything from scratch, end to end.

Life cycle of a Machine Learning problem

Introduction :

We are taking the Seoul Bike Trip data from Kaggle. Let us learn something about the data by asking where, when, and why.

      The Seoul bike trips happen in South Korea, in the city of Seoul, and the routes are mostly around the Han river, which passes through the city. Many people prefer bikes for these trips, and some routes are preferred more than others:

  1. Han river/Banpo Bridge loop: on this trip most bikers start at Eugen station and the route passes along the Han river, finally ending at Gayang Station, a distance of 34 km
  2. Fr 20.09 trip: this trip starts at Hoehyeon in Seoul and ends at the same place, a distance of 33 km
  3. Cycling in Seoul: this trip along the Han river starts at Yeouido Hangang Park and ends at the same place, a distance of 13 km; there are some other trips in Seoul as well
  4. Trip from Seodaemun-ga to Guragu, a distance of 24 km
Trip routes in Seoul city

This data was collected from bike stands at different locations in Seoul city and covers one year. At the bike stations the owner needs to know the trip duration, because bookings of bikes are allowed according to it.

 Problem statement:

    To design a Machine learning model that can predict trip duration for the given inputs

 Performance measures:

 Mean Squared Error (MSE): the average of the squared differences between the actual and predicted durations, MSE = (1/n) Σ (yᵢ − ŷᵢ)². It is the actual error due to our model; for a perfect model the value is zero, and it is used to compare different models and find the best one.
Mean Absolute Percentage Error (MAPE): MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|. A model will definitely make some error in its predictions; knowing how large that error is likely to be is useful.
Error measures

Business constraints: 

  1. Interpretability is important because we need to know why a trip takes that much time.
  2. Latency is a concern.

Dataset overview :

 The Seoul bike dataset has 9.60M rows. We have to design a model that predicts the trip duration using the following columns:

  1. ID: user id of the person who rents the bike
  2. Distance: the distance the person travels on the bike, in metres
  3. PLong, PLat: longitude and latitude of the place where the person picks up the bike
  4. DLong, DLat: longitude and latitude of the place where the person returns the bike
  5. Haversine: the shortest distance on Earth between the two points, calculated from the latitudes and longitudes
  6. Pmonth, Pday, Phour, Pmin: month, day of the month, hour and minute at which the bike is taken for rent
  7. PDweek: day of the week on which the bike is taken for rent
  8. Temp: average temperature per hour, measured in degrees Celsius
  9. Precipitation: average rainfall per hour, measured in mm
  10. Wind: average wind speed per hour, measured in m/s
  11. Solar: heat and light radiation from the sun
  12. Snow, GroundTemp, Dust: other independent parameters, all averaged over a one-hour window
  13. Duration: how long it takes to return the bike. This is the dependent variable, the column we have to predict with an algorithm.

Only a few of these features are present at the start of a trip, and those are the ones we can use as inputs.

Input features: Haversine, Temperature, Precipitation, Wind speed, Humidity, Solar, Snow, Ground Temperature, Dust, Pmonth, Pday, Phour, Pmin, PDweek

Target Variable: Duration

Blog Cycle


Exploratory Data Analysis:
The very first step in solving any case study is a proper analysis of the data. It helps us understand the data better; even though it isn't part of the model design, analysing the data gives us valuable insights and makes it easier to select a model for our problem.
    As a first step we import the libraries needed for basic processing of the data. If we need more libraries later we can import them at that time, but it is good practice to keep all imports in the starting cells once the model is complete.

Load Data:

After downloading we get a .CSV file; it is difficult to perform maths operations on it directly and then design a model 😖. Don't worry, we have the pandas DataFrame to take care of all these things 😎.
Now we load the CSV file using pandas; head() shows the first 5 rows.


Let us learn something about the data with the describe method. It shows the quantile values, mean and standard deviation of each feature; with a large number of features it might get confusing 😉
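A minimal sketch of these loading and inspection steps (the file name is an assumption; the real dataset file may be named differently):

```python
import pandas as pd

# Load the Kaggle CSV (file name assumed for illustration).
df = pd.read_csv('seoul_bike_data.csv')

print(df.head())       # first 5 rows
print(df.describe())   # quantiles, mean, standard deviation per feature
```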

Check for missing data:
Generally data has some missing values, but in our data there is little possibility of that.
Our dataset doesn't have any NaNs 😲; we check with isnull() and count missing rows with sum(), as below.
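A one-line check along those lines:

```python
# Number of missing values per column; every count is zero for this dataset.
print(df.isnull().sum())
```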

Correlation of Features:
        Let us see how each feature is correlated with the other features and with the target variable. Here we use the Pearson correlation and visualise it with the Seaborn library.
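A sketch of that correlation heatmap:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between all numeric features, drawn as a heatmap.
corr = df.corr(method='pearson', numeric_only=True)
plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap='Blues')
plt.show()
```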

Fig.1 Correlation


        Here white means highly positively correlated and the dark blocks mean highly negatively correlated. Some features like Pmonth, Pday, PDweek are highly correlated with Dmonth, DDweek, Dday, so we remove the latter: keeping them would not increase our performance much, but it would increase the complexity of our model, because with more features the model needs to perform more mathematical operations, and it also creates problems when estimating feature importance.
Now we will look at the distributions of some features before analysing each feature in detail. To get the distribution plots in this blog we use seaborn's kdeplot, as below.
Some of the distribution plots that are useful for the analysis are shown below.
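A small sketch of those kdeplot calls, including the Duration-by-Pmonth plot mentioned a little further below (column names follow the dataset overview):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of a few pickup-time features.
for col in ['Pmonth', 'Phour', 'Pday']:
    sns.kdeplot(df[col])
    plt.title(col)
    plt.show()

# Duration distribution, split by pickup month.
sns.kdeplot(data=df, x='Duration', hue='Pmonth')
plt.show()
```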
Fig.2
        Here our data looks like a time series. The distribution plot of Pmonth shows that fewer people go on trips at the start of the year (i.e. in January and February); the number of people interested in trips then increases gradually, reaches its maximum in September and October, and then decreases.
          The distribution plot of Phour shows that more people prefer trips in the morning, the number decreases towards midday and then increases again in the evening, while the distribution plot of Pday shows that all days of the month are fine for trips. The distribution plots of Pmin, PDweek and Haversine don't carry much information, which is why I am not showing them.
    Now let us look at Pmonth in more detail by taking Pmonth as the hue in the distribution plot of Duration.
Fig.3

           The distribution plot shows that in September and October more people complete their trip in a short time.

    Let us go to the next stage of the analysis, where we analyse some features in detail. For every feature we look at it visually with a PDF plot, CDF plot, box plot, time series plot and QQ-plot, and if needed we also use autocorrelation and partial autocorrelation plots; then we use numerical methods like percentile values and the Augmented Dickey-Fuller (ADF) test to check whether the data is stationary.

    Before the analysis we resample the data to 10-minute intervals; this resampled data is used only for the time series plots. After resampling we get some null values, because in some intervals nobody went on a trip, especially around midnight, so we remove those null values. The datetime column is created from Pmonth, Pday, Phour, Pmin and converted to datetime format with pandas.to_datetime.
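A sketch of that resampling step; the year is not stored in the data, so 2018 is only an assumption for building the datetime:

```python
# Assemble a datetime column from the pickup-time parts (year assumed).
df['datetime'] = pd.to_datetime(dict(year=2018,
                                     month=df['Pmonth'],
                                     day=df['Pday'],
                                     hour=df['Phour'],
                                     minute=df['Pmin']))

# Resample to 10-minute bins for the time series plots and drop empty bins
# (intervals, mostly around midnight, in which nobody started a trip).
ts = (df.set_index('datetime')
        .resample('10T')
        .mean(numeric_only=True)
        .dropna())
```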
           It is good practice for programmers to use functions and classes instead of rewriting code every time, so I am going to write a function that draws the PDF plot, CDF plot and box plot and prints percentile values.
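A possible version of that helper (the exact plots and percentiles are my reconstruction of what the original plot_fun does):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_fun(series, name):
    """Draw PDF, CDF and box plot of a feature and print percentile values."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 4))

    sns.kdeplot(series, ax=axes[0])                                  # PDF
    axes[0].set_title(f'{name} - PDF')

    counts, bin_edges = np.histogram(series, bins=50)
    axes[1].plot(bin_edges[1:], np.cumsum(counts) / counts.sum())    # CDF
    axes[1].set_title(f'{name} - CDF')

    sns.boxplot(y=series, ax=axes[2])                                # Box plot
    axes[2].set_title(f'{name} - Box plot')
    plt.show()

    for p in [0, 10, 25, 50, 75, 90, 95, 99, 100]:
        print(f'{p}th percentile of {name}: {np.percentile(series, p):.2f}')

plot_fun(df['Duration'], 'Duration')
```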

Duration

        We will use the plot_fun function for the distribution plot, CDF plot, box plot and percentile values.

Fig.4
        
We can observe that the distribution plot is right-skewed. That means most of the trip duration values are small: people complete their trip as early as possible. The CDF and box plot show that 90% of people complete their trip within 1 hour. That is the visual observation; let us look at the percentile values to make it clearer.

     Here we can see the percentile values clearly; the max value is 119 min, nearly 2 hours. Generally we consider the higher values as outliers, and we try to minimise their effect if data is scarce; if we have plenty of data points we simply discard them.
       But here we have a different scenario: the lower values may also be outliers, because we need data from people who actually went on a trip, and very few people complete their trip within 2 minutes 😮. That is why we don't need those records.

 We already saw that the distribution is right-skewed, but now we will use a QQ-plot to check for a Normal distribution. For that we use the scipy library.
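A short sketch of that check:

```python
import scipy.stats as stats
import matplotlib.pyplot as plt

# QQ-plot of Duration against a theoretical Normal distribution.
stats.probplot(df['Duration'], dist='norm', plot=plt)
plt.title('QQ-plot of Duration')
plt.show()
```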

Fig.5
The QQ-plot clearly shows that Duration is not Normally distributed.

         It is also good to look at the autocorrelation and partial autocorrelation plots, because we have time as a feature and the data looks like a time series. It is not completely a time series, though, because the past values are largely independent of the present values, which we can observe from the data and from the correlation plots.
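A sketch of those plots with statsmodels, on both the raw Duration column and the 10-minute resampled series ts from above:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Before resampling: per-trip Duration values.
plot_acf(df['Duration'], lags=50)
plot_pacf(df['Duration'], lags=50)

# After resampling to 10-minute bins.
plot_acf(ts['Duration'], lags=50)
plot_pacf(ts['Duration'], lags=50)
plt.show()
```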

Fig.6
  Let us see how the autocorrelation and partial autocorrelation plots look after resampling the data to 10-minute intervals.
Fig.7
         Observing the autocorrelation and partial autocorrelation plots before resampling, we don't get much correlation with past values, but after resampling we do see correlation in the autocorrelation plot. There is one point we are missing, though: we have the location of each station, and the stations are spread over different locations.
Fig.8
To handle that, we will cluster the data by location and then check again; we will do this in a later section.


Distance

        Distance is a feature that is highly correlated with Duration, but we don't actually use it in the model, because it is not available at the start of the trip; it is still very useful for analysing the data.
Let's start with the distribution plot using the plot_fun function.

Fig.9
Distance is very similar to Duration, which is why their distributions are also similar. The values are in metres and the max distance is 33 km. Now we use the percentile values for a better analysis of this feature.

It shows that a few people travel a very short distance. Let us take a threshold value, e.g. 100 m, and keep only the people who travel more than that threshold.

Temperature:

        This is one more feature in our dataset; the max value is 39.4 degrees and the min value is -17 degrees.
Let us look at some plots to understand this feature, using plot_fun.

Fig.10
It shows that the 'Temp' feature is not Normally distributed. From the CDF and box plot, 80% of the values lie between 10 degrees and 39 degrees, and we also check the percentile values.

We know that temperature is a time series, so let's analyse it further. First we will see how it varies with time, then check whether it is stationary or non-stationary, and finally see how Temp and Duration move together on a time plot. It makes it more fun 😃. For this purpose we use the seaborn line plot.
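A minimal sketch of those line plots on the resampled frame ts:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Temperature over time.
sns.lineplot(x=ts.index, y=ts['Temp'])
plt.show()

# Temperature over time, coloured by the mean trip duration in each bin.
sns.lineplot(x=ts.index, y=ts['Temp'], hue=ts['Duration'])
plt.show()
```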

Fig.11
      The time series plot of temperature shows that it is low at the start of the year, reaches its maximum between July and September, and then decreases.
    Let us see something more with the time series plot of Temp, using Duration as the hue.
Fig.12
       The above time series plot shows that when the temperature is at its extremes the trip duration is lower (i.e. duration is lower at the start and end of the year and between July and September).
Now we will check whether the data is stationary or non-stationary; before doing so, let us recall when we call data stationary.
We call data stationary when its statistical properties, such as mean and variance, are constant over time.
To check this we can do it visually, by plotting the mean and variance over time, or with statistical methods like the ADF test.
In this blog we use the ADF test to check whether the data is stationary.
Augmented Dickey-Fuller Test: this is a type of statistical test called a unit root test. Unit roots are a cause of non-stationarity.
  • Null Hypothesis (H0): the time series has a unit root (the time series is not stationary).
  • Alternate Hypothesis (H1): the time series has no unit root (the time series is stationary).

If the null hypothesis is rejected, we can conclude that the time series is stationary.

The decision is based on the p-value: the null hypothesis can be rejected if the p-value is below a chosen significance level. The default significance level is 5%.

  • **p-value > significance level (default: 0.05)**: Fail to reject the Null hypothesis (H0), the data has a unit root and is non-stationary.
  • **p-value <= significance level (default: 0.05)**: Reject the Null hypothesis (H0), the data does not have a unit root and is stationary.
Now we will check whether our data is stationary with the ADF test.
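A sketch of the test with statsmodels, here applied to the temperature series:

```python
from statsmodels.tsa.stattools import adfuller

result = adfuller(ts['Temp'].dropna())
print('ADF statistic:', result[0])
print('p-value:      ', result[1])
if result[1] <= 0.05:
    print('Reject H0: the series is stationary')
else:
    print('Fail to reject H0: the series is non-stationary')
```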

After applying the ADF test we get the p-value shown in the figure above, which lies inside the orange region. We can conclude that the data is stationary because the p-value is less than 0.05.

Precipitation:

        Precipitation is another feature in our data. It tells us the rainfall at the time the trip starts, measured in mm. Let's look at visual plots to get a better understanding of the data;
we start with the distribution plots as usual and then move on to the numerical summary.
Fig.13
The max value of precipitation is 35 mm.
      The distribution plot looks like a power-law distribution, which means heavy precipitation occurs very rarely. The box plot shows 35 mm as an outlier, but checking the previous year's rainfall on Google confirms it did happen.
We can also observe that most of the values are zeros, which means rainfall is not a daily event, as we know.
Let us see the time series plot of Precipitation with the Duration column as hue.
Fig.14


In the time series plot of precipitation with duration as hue, when rainfall is low the duration is high, and for some days after heavy rainfall the duration is low; that can be observed when we zoom in using the Plotly library.
We can also clearly observe that the precipitation data is stationary.
There is some correlation between Duration and Precipitation, but since rainfall is not a daily event it doesn't have much impact when we consider all the data.

Wind

Wind is a feature in our data that represents the wind speed at the time the trip starts; the units are m/s. Let us analyse the Wind feature, starting with the distribution plot.
Fig.15
  Here we can observe that the distribution curve for wind is not smooth and changes a lot, but most values fall below 3. From the box plot and percentile values, 90% of the values are less than 3 and the values above 7 may be outliers, which we can also see in the percentile values.
Let's see the time series plot of Wind.
Fig.16
From the above plots we can see that wind does not have much effect on the duration. We also check whether the data is stationary or non-stationary with the ADF test.
Here the p-value is less than 0.05, which we can clearly observe above, so the data is stationary.

Humidity:

The Humidity feature tells how much humidity is present in the air at the time the trip starts; the units are %.
Fig.17
The max value of Humidity is 98%.
The distribution plot is not a smooth curve, and the box plot shows little possibility of outliers, which can be confirmed with the percentile values below.
The data looks good; let's see the time series plot with Duration as the hue parameter.
Fig.18
The time series plot of Humidity shows a low value at the start of the year, a maximum between July and September, and then a decrease. To check whether the data is stationary (mean and standard deviation constant over time) we use the Augmented Dickey-Fuller test; the p-value is less than 0.05, which means the Humidity data is stationary, as we can observe below.

Solar

Solar is another feature in our data; let's visualise it for a better understanding.
Fig.19
Here we can observe that the distribution plot is a smooth curve but, as the QQ-plot shows, it is not Normal. The box plot and percentile values show little possibility of outliers, and about 30% of the values are zeros because the solar value drops to zero at night; this can also be observed from the percentile values.
The max value is 3.52 and many of the values are zeros.
After observing the QQ-plot we can clearly say the data is not Normally distributed; let's analyse it further over time.
Fig.20

In the time series plot, Solar is low at the start of the year, reaches its maximum between July and September and then decreases, which we can observe after zooming in with the Plotly library. The data is stationary, as the ADF test shows with a p-value below 0.05.

Snow:

Snow is a feature that represents the snowfall at the time the trip starts.
Fig.21
Wow, what plots! The box plot and percentile values show that there is a possibility of outliers, and about 90% of the values are zeros, because snowfall is a rare event in Seoul; we get this from the percentile values.
Snow is stationary data, as we can observe with the ADF test.

Ground Temperature

        This is the ground-level temperature at the time the trip starts. The feature is similar to Temp, as we will see later, but first we will look at the CDF, PDF and box plots.
Fig.22

The max value of GroundTemp is 62.2. The distribution plot is not a smooth curve, and the box plot and percentile values show little possibility of outliers; 90% of the values are less than 35, which we can confirm with the percentile values.
Now let's see the time series plot.
Fig.23

        The time series plot of GroundTemp shows that it is low at the start of the year, reaches its maximum between July and September and then decreases. The plot with Duration as hue shows that when the temperature is at its extremes the trip duration is lower (i.e. duration is lower at the start and end of the year and between July and September). By the Augmented Dickey-Fuller test the p-value is less than 0.05, which means the ground temperature data is stationary.
A while ago we saw that there is some relation between GroundTemp and Temp, and we generally know the relation between them. Let's see how they move together on the time plot.
Fig.24

Dust:

    Dust is another feature used to predict the trip duration; it indicates how much dust is present in the atmosphere. Let us see what the dust data looks like.
Fig.25
The max value of Dust is 304. The distribution plot is not a smooth curve, and the box plot and percentile values show that there is a possibility of outliers; 90% of the values are less than 75. Let's check the percentile values.
We can observe a jump between the 90th and 100th percentiles, so let's look at more values from 90 to 100.
 
There might be outliers here, as there is a large gap between the 99th and 100th percentile values.
         The time series plot of Dust shows that it oscillates around 20 continuously without much change. The plot with Duration as hue shows that when dust stays high for a long time, the trip duration is lower during that period. By the Augmented Dickey-Fuller test the p-value is less than 0.05, which means the Dust data is stationary.
 
Finally, this data is good: the possibility of outliers is low and it doesn't contain NaN values.
Now we will use the latitude and longitude values for more exploration.
Let's group the data by PLat and PLong.
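A sketch of that grouping (the ID and coordinate column names follow the dataset overview):

```python
# Trips per starting location and mean duration from that location.
station_stats = (df.groupby(['PLat', 'PLong'])
                   .agg(trips=('ID', 'count'),
                        mean_duration=('Duration', 'mean'))
                   .reset_index())

print(station_stats.shape)    # roughly 1524 distinct starting locations
print(station_stats.head())
```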


      Grouping by PLat and PLong, counting on ID and taking the mean of Duration, the data has 1524 starting locations, but some starting locations have fewer than 10 people; I consider those outliers, because a proper bike stand should rent out more bikes. Some people have a trip duration of less than 2 minutes, which means they did not complete their trip; they might have stopped for some reason. Similarly, some people did not travel more than 200 metres, and they are also considered outliers.

Now we group by PLat, PLong, DLat, DLong to find the number of people who travel from one location to another.
If only a single person travels between two locations, I consider that an outlier as well, because the rider might have stopped due to a problem with the bike or used it for a purpose other than the trip.

Data-Preprocessing

        We observed that our data doesn't have null values, but there is a possibility of outliers, as discussed above. Let's try to remove them, since we have plenty of data points; a sketch of this step follows below.
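A sketch of that clean-up, using the thresholds described below (6 trips per origin-destination pair, about 20 trips per station, 2 minutes and 200 metres for individual trips); pair_df and station_df correspond to the df and df0 mentioned next:

```python
# Origin-destination pairs travelled by at least 6 people.
pair_df = (df.groupby(['PLat', 'PLong', 'DLat', 'DLong'])['ID']
             .count().reset_index(name='n_trips'))
pair_df = pair_df[pair_df['n_trips'] >= 6]

# Stations with more than 20 trips starting from them.
station_df = (df.groupby(['PLat', 'PLong'])['ID']
                .count().reset_index(name='n_trips'))
station_df = station_df[station_df['n_trips'] > 20]

# Keep only trips matching the surviving pairs and stations.
clean = df.merge(pair_df[['PLat', 'PLong', 'DLat', 'DLong']],
                 on=['PLat', 'PLong', 'DLat', 'DLong'])
clean = clean.merge(station_df[['PLat', 'PLong']], on=['PLat', 'PLong'])

# Drop trips too short to be genuine.
clean = clean[(clean['Duration'] > 2) & (clean['Distance'] > 200)]

print(f'Removed {100 * (1 - len(clean) / len(df)):.2f}% of the original rows')
```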


Above we create two data frames, df and df0:
     df: it contains PLat, PLong, DLat, DLong pairs, after removing pairs travelled by fewer than 6 people.
     df0: it contains PLat, PLong, i.e. station locations, keeping a station only if the number of trips from it is greater than 20.
Let us take these locations (latitude and longitude values) from the original data by merging, as in the sketch above.
Then it is necessary to check how much data has been removed from the original data; if it is too much, it becomes difficult for our model to give accurate results. Here we lose only 4.78% of the original data, which is acceptable.

Modelling

In this blog we are going to look at two types of modelling techniques:
  • Time-based modelling: since time is given as a feature, we apply a few time-based models, e.g. Moving Average, Weighted Moving Average, VAR, Prophet.
  • Complex modelling: we also check some complex models, e.g. XGBRegressor, Support Vector Regressor (SVR).
Input features: Pmonth, Pday, Phour, Pmin, PDweek, Temp, Snow, Precip, Wind, Dust, GroundTemp, PLat, PLong
Output: Duration

Time-Based Modeling:

            Before applying the models, let's look at the data. We have data from different regions, which we know from the latitude and longitude values, and because of that there are differences in temperature, humidity, precipitation, snow and the other features. To handle this, we will split the whole data into clusters of regions by applying K-Means clustering on PLat and PLong, and then apply our models. After that we will also check the models without clustering, and based on performance we will finalise a model for our data.

Cluster-Based Modeling:

        Initially we make clusters of regions based on PLat and PLong using K-Means; here I split the data into 30 regions.
Here clu is a function that assigns the data to clusters; it is good to visualise the clusters.
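A sketch of that clustering helper (30 regions, K-Means on the pickup coordinates):

```python
from sklearn.cluster import KMeans

def clu(data, n_clusters=30):
    """Assign each trip to one of n_clusters regions based on pickup location."""
    km = KMeans(n_clusters=n_clusters, random_state=42)
    data = data.copy()
    data['cluster'] = km.fit_predict(data[['PLat', 'PLong']])
    return data, km

clean, kmeans = clu(clean, n_clusters=30)
print(clean['cluster'].value_counts().head())
```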
Fig.26

The regions are as shown in the figure. We now have the data with cluster labels and a time index in each cluster, so we create 10-minute bins inside each cluster, as below.
         Now we have the clustered data with time bins, but all regions are still mixed in the data, so let's separate them by grouping on cluster and time bin.
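A sketch of that binning and grouping step:

```python
# 10-minute time bins inside every cluster; one aggregated row per
# (cluster, time_bin) pair.
clean['time_bin'] = clean['datetime'].dt.floor('10T')
clustered = (clean.groupby(['cluster', 'time_bin'])
                  .mean(numeric_only=True)
                  .reset_index())
```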

It is time to start modelling:
Actually we don't know in advance which model performs well on our data, so let's try different models and then finalise one.
First we start with simple univariate models:
  • Simple Moving Average
  • Weighted Moving Average
  • Exponential Weighted Moving Average
For a simple univariate model we only need the Duration column, indexed by time, from each cluster of the data.

Simple Moving Average:
Moving Average


       In time series data the present values often depend on the past; the moving average uses that past data to predict the future value. The moving average is used when the data is a univariate time series: the prediction at time t is the mean of the previous n values, (R(t-1) + R(t-2) + ... + R(t-n)) / n.
Here n is the number of past values used; it is a hyperparameter that we have to find.
Let's apply the moving average to our data.
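A sketch of the moving average forecast on the Duration series of one cluster (cluster 0 is just an example):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

cluster_duration = (clustered[clustered['cluster'] == 0]
                    .set_index('time_bin')['Duration'].sort_index())

def moving_average(series, n):
    """Predict each value as the mean of the previous n values."""
    pred = series.rolling(window=n).mean().shift(1)   # only past values
    mask = pred.notna()
    return (mean_squared_error(series[mask], pred[mask]),
            mean_absolute_error(series[mask], pred[mask]))

for n in [2, 3, 5, 10, 20]:
    mse, mae = moving_average(cluster_duration, n)
    print(f'n={n:>2}  MSE={mse:.2f}  MAE={mae:.2f}')
```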
The mean squared error and mean absolute error are the values used to measure the performance of the model; let us see the performance for different values of n.
We can observe that the Mean Absolute Error and Mean Squared Error decrease as the window size increases, and beyond a point the change in error becomes very small.

Weighted Moving Average:
Weighted Moving Average


       The moving average model gives equal importance to all the values in the window, but intuitively we know that the future is more likely to be similar to the latest values and less similar to the older ones. The weighted moving average turns this intuition into a mathematical relationship by giving the highest weight to the latest value and decreasing weights to the older ones: prediction at time t = (N·R(t-1) + (N-1)·R(t-2) + ... + 1·R(t-N)) / (N + (N-1) + ... + 1), where R(t-1), R(t-2), ... are the past values and N, N-1, ... are their weights.
Let's see what the code looks like.
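A sketch of that weighted version, reusing cluster_duration from above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def weighted_moving_average(series, n):
    """Latest value gets weight n, the oldest value in the window gets weight 1."""
    weights = np.arange(1, n + 1)        # oldest ... newest
    pred = (series.rolling(window=n)
                  .apply(lambda x: np.dot(x, weights) / weights.sum(), raw=True)
                  .shift(1))
    mask = pred.notna()
    return (mean_squared_error(series[mask], pred[mask]),
            mean_absolute_error(series[mask], pred[mask]))

for n in [2, 3, 5, 10]:
    mse, mae = weighted_moving_average(cluster_duration, n)
    print(f'n={n:>2}  MSE={mse:.2f}  MAE={mae:.2f}')
```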
Here also MSE and MAE are the performance measures on which we select the value of n.
Now let's see what values we get for different n values.
With the weighted moving average we also get results close to those of the moving average.
Exponential weighted moving average: 
Exponential weighted moving average


       With the weighted moving average we satisfied the intuition of giving higher weights to the latest values and decreasing weights to the older ones, but we still do not know the correct weighting scheme: there are infinitely many ways to assign weights in a non-increasing order, and we also have to tune the window size. To simplify this we use the exponential moving average, which is a more principled way of assigning weights while implicitly using an optimal window size.
    In the exponential moving average we use a single hyperparameter alpha (α), a value between 0 and 1; the weights and the effective window size are configured from it (forecast(t) = α·R(t-1) + (1−α)·forecast(t-1)).
To compute the exponential weighted moving average we use the pandas ewm function with hyperparameter N (the span, which corresponds to a value of α).
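A sketch of that, again on cluster_duration:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

def exp_weighted_moving_average(series, n):
    """EWMA forecast using pandas ewm with span = n (larger n -> smaller alpha)."""
    pred = series.ewm(span=n).mean().shift(1)
    mask = pred.notna()
    return (mean_squared_error(series[mask], pred[mask]),
            mean_absolute_error(series[mask], pred[mask]))

for n in [2, 3, 5, 10]:
    mse, mae = exp_weighted_moving_average(cluster_duration, n)
    print(f'N={n:>2}  MSE={mse:.2f}  MAE={mae:.2f}')
```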
Here also MSE and MAE are the performance measures on which we select the value of N.
Now let's see what values we get for different N values.
        With the exponential weighted moving average we again get results close to those of the moving average.
With univariate models we are not getting good results, so let's see how multivariate models perform on our data.

Multivariate Modeling

            Up to now we applied univariate models and they did not perform as well as we expected, so now we apply some more complex models to our data:
  • Vector Auto Regressor
  • Prophet
Before applying the models, let's check how the past values of the other features (Temp, Snow, Humidity, Precipitation, ...) affect Duration; for this purpose we use Granger's Causality test.

Here we consider only one randomly chosen cluster, apply Granger's Causality test and then check the p-values.

Let's see something about Granger's Causality test.
Granger's Causality test: generally in time series data the past values of one feature affect other features; this test tells us which features have the strongest effect.
  • Null Hypothesis (H0): past values of time series X do not cause the other series Y.
  • Alternative Hypothesis (H1): past values of time series X do cause the other series Y. In this test, 0.05 is used as the threshold on the p-value.
Let's look at the outputs of the test; here we consider only the past 10 lags.
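A sketch of the test with statsmodels; the feature names are assumptions and only one cluster is used:

```python
from statsmodels.tsa.stattools import grangercausalitytests

one_cluster = (clustered[clustered['cluster'] == 0]
               .set_index('time_bin').sort_index().dropna())

for feature in ['Temp', 'Wind', 'Humid', 'Solar', 'GroundTemp']:
    # Does the past of `feature` help predict Duration?
    res = grangercausalitytests(one_cluster[['Duration', feature]], maxlag=10)
    pvals = [round(res[lag][0]['ssr_ftest'][1], 4) for lag in range(1, 11)]
    print(f'{feature}: p-values for lags 1-10 -> {pvals}')
```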
From the lag p-values we can see how the features affect Duration: if a feature's lag p-values are less than 0.05, its importance is higher compared to the other features.
          Applying Granger's causality test on the cluster data, the lag features of Temp, Wind, Humid, Solar, GroundTemp, Pmonth, PDweek and Phour come out as useful, while some lags of the other features do not; Pmonth, Pday, PDweek and Phour are a different scenario, because over 10 lags their values may well stay the same.

Let's create our data for train and test
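A sketch of that per-cluster split (the 90/10 time-ordered ratio is an assumption):

```python
list_train_clu, list_test_clu = [], []

for c in sorted(clustered['cluster'].unique()):
    cdata = (clustered[clustered['cluster'] == c]
             .set_index('time_bin').sort_index())
    split = int(len(cdata) * 0.9)          # keep time order: first 90% train
    list_train_clu.append(cdata.iloc[:split])
    list_test_clu.append(cdata.iloc[split:])
```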

Here we separate each cluster, make each cluster into 10-minute time interval bins and split the data into train and test in each cluster; then we store the train datasets in list_train_clu and the test datasets in list_test_clu.
Now we apply our Multivariate Time Series Models

Vector Autoregression (VAR):

          This model is used when the data is a multivariate time series and the time series influence each other. In the VAR model each variable is modelled as a linear combination of its own past values and the past values of the other variables in the system. Since we have multiple time series that influence each other, it is modelled as a system of equations, one equation per variable (time series). The order (number of lags) is the hyperparameter of VAR.



Y(t) = c + A1·Y(t-1) + A2·Y(t-2) + ... + Ap·Y(t-p) + e(t), where Y(t-1), Y(t-2), ... are the lag-1, lag-2, ... values of all the series and A1, ..., Ap are coefficient matrices.
Let's see the code for VAR model
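A sketch of the VAR step on a single cluster (feature names are assumptions; in the blog the same thing is done for every cluster and the errors are averaged):

```python
from statsmodels.tsa.api import VAR
from sklearn.metrics import mean_squared_error, mean_absolute_error

cols = ['Duration', 'Temp', 'Wind', 'Humid', 'Solar', 'GroundTemp']
train, test = list_train_clu[0][cols].dropna(), list_test_clu[0][cols].dropna()

model = VAR(train)
fitted = model.fit(maxlags=10)              # lag order is the hyperparameter

lag = fitted.k_ar
forecast = fitted.forecast(train.values[-lag:], steps=len(test))
pred_duration = forecast[:, 0]              # first column is Duration

print('MSE:', mean_squared_error(test['Duration'], pred_duration))
print('MAE:', mean_absolute_error(test['Duration'], pred_duration))
```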
Here we apply the VAR model to each cluster, compute the MSE and MAE for each cluster and then take the mean of the errors to get the overall results.
Oh, we are getting large error values here; it is not working well, so let us try a more complex model.

 Multivariate time series forecasting using Prophet

            Prophet is a model we can use for both univariate and multivariate time series forecasting. It has a lot of parameters for fine-tuning, but most usefully it tunes its parameters automatically 😲 and also gives confidence intervals for the output.
Let's see what the code looks like.
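A sketch of the Prophet step on one cluster, with the weather features added as extra regressors (the package is imported as prophet here; older installs use fbprophet):

```python
from prophet import Prophet
from sklearn.metrics import mean_squared_error, mean_absolute_error

regressors = ['Temp', 'Wind', 'Humid', 'Solar', 'GroundTemp']
train = list_train_clu[0].reset_index().rename(columns={'time_bin': 'ds',
                                                        'Duration': 'y'})
test = list_test_clu[0].reset_index().rename(columns={'time_bin': 'ds'})

m = Prophet()
for r in regressors:
    m.add_regressor(r)
m.fit(train[['ds', 'y'] + regressors])

forecast = m.predict(test[['ds'] + regressors])
print('MSE:', mean_squared_error(test['Duration'], forecast['yhat']))
print('MAE:', mean_absolute_error(test['Duration'], forecast['yhat']))
```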

Here also we get MSE and MAE values for each cluster and then take their mean.

We can observe the error values for clusters 28, 29, 30 and also the mean errors over all clusters; the error values are lower compared to all the models above.
Let's see how some complex models perform on our data.

Complex Models

            Before applying the models: our data has latitude and longitude values that we have not used yet, so let's try to use those features.
For that purpose, instead of using the original latitude and longitude values, we use the cluster centre values, which we get from KMeans.
We now have the latitude and longitude information in the data, so we group the data by cluster and time_bin.
We split the data into train and test; here data1 is split 90% for train and 10% for test, and because this is time data we don't use stratified splitting.
        Before applying the models: the complex models have a lot of hyperparameters, and to find the best ones we use GridSearchCV or RandomizedSearchCV. The search normally reports a single scorer, but we have two scores, so we create a dictionary of scorers containing MSE and MAE.
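A sketch of that scorer dictionary:

```python
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error

# Both metrics reported by the search; greater_is_better=False because
# lower error is better.
scorers = {
    'MSE': make_scorer(mean_squared_error, greater_is_better=False),
    'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
}
```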
Let's start with the simpler of these models.

Ridge regressor

            A Ridge regressor is basically a regularised version of a linear regressor: to the original cost function of the linear regressor we add a regularisation term that forces the learning algorithm to fit the data while keeping the weights as small as possible.

Let's see what the code looks like.
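A sketch of that search (X_train / y_train stand for the clustered feature matrix and Duration target prepared above; the alpha grid is an assumption):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

ridge_search = RandomizedSearchCV(Ridge(),
                                  param_distributions=params,
                                  n_iter=6,
                                  scoring=scorers,
                                  refit='MAE',
                                  cv=TimeSeriesSplit(n_splits=3))
ridge_search.fit(X_train, y_train)

# Last split has the most training data, so we look at its scores.
print(ridge_search.cv_results_['split2_test_MAE'])
```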
Here we use RandomizedSearchCV to find the hyperparameters, and because we have time data we use TimeSeriesSplit instead of normal cross-validation. Let's see the performance.

Here I use split2_test to check performance because with split0 and split1 the training data is small, while at split2 we have more training data (a consequence of TimeSeriesSplit). We get around 8.51 MAE as a minimum and all the values are close; these are good results, so let's try a more complex model.

Support Vector Regression (SVR)

            Support Vector Regression (SVR) uses the same principle as SVM, applied to regression problems. In SVR we try different kernels and different hyperparameter values.
Here the degree parameter is applicable to the polynomial kernel only; let's see the results.
Here we got poor results compared to all the models above: MAE 36.55.

XGBRegressor

            XGB models are widely used in Machine Learning because of their performance and fast training; we can also use a GPU for training, which is a big advantage of the XGB model.


Let's see code snippets for XGBRegressor 
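A sketch of that search (the parameter grid follows the hyperparameters named below; X_train / y_train as before):

```python
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

params = {'n_estimators': [100, 200, 500],
          'learning_rate': [0.01, 0.05, 0.1],
          'max_depth': [3, 5, 10]}

xgb_search = RandomizedSearchCV(XGBRegressor(tree_method='hist'),
                                param_distributions=params,
                                n_iter=10,
                                scoring=scorers,
                                refit='MAE',
                                cv=TimeSeriesSplit(n_splits=5))
xgb_search.fit(X_train, y_train)

# Scores of the last (largest) split for every sampled parameter setting.
print(xgb_search.cv_results_['split4_test_MAE'])
```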
        Here we use the hyperparameters n_estimators, learning_rate and max_depth with RandomizedSearchCV and 5 splits. Now we check the performance using the last split, i.e. split4.

In cv_results the split4_test values are the useful ones for picking hyperparameters. At the 7th position we have the lowest values, MAE = 7.46 and MSE = 112.8, which occur for max_depth=10, learning_rate=0.01, n_estimators=200; these are the best results so far.
We have the features, but we don't yet know which of them are important, so let's find the important features using the XGB model.
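A short sketch of that feature-importance check on the refitted best model:

```python
best = xgb_search.best_estimator_
for name, score in sorted(zip(X_train.columns, best.feature_importances_),
                          key=lambda t: -t[1]):
    print(f'{name:15s} {score:.3f}')
```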

Phour, PDweek and GroundTemp are the most important features in our data.
Now we put the results of all the models in a table to find a good model for the clustered data.

Up to now we have used cluster-based data for modelling; now we check how the models perform without clusters.

Without Cluster

        We saw how the models perform on the clustered data; let's see what happens if we don't cluster the data, train the models and check the performance.
 For this we take the original data, create the datetime column as discussed above, and keep the features Duration, Distance, Pmonth, Pday, Phour, Pmin, PDweek, Temp, Precip and Wind.
Now we resample the data to 10-minute intervals by taking the mean of each feature, as below.
After resampling we get:
Here we have only 52269 rows. Let's split the data into train and test.
We split the data 90% as train and 10% as test. Now we start with the simple models and then try the complex ones.

Simple Moving Average

Let's step directly into the code snippets.
Compared to the cluster-based data model, the code here is simpler. Now let's see the performance.

        Wow 😲, we get very good results for the simple moving average model: MSE 82.14 and MAE 6.17, as we can observe above.

Weighted Moving Averages

Here too we go straight to the code; it is also a small amount of code but gives accurate results, let's see.
        We simply replace one line of the moving average code; here too we get results close to the moving average:
MSE 83.01 and MAE 6.16, very close values.
I am not going to discuss the exponential weighted moving average because it performs the same as the two above.
Let's directly jump into a complex model 

XGBRegressor

Let's see how the XGB model performs on the non-clustered data; the code is simple.

        Here too we use TimeSeriesSplit and RandomizedSearchCV to find good hyperparameters, after which we get the results. As usual we use the last split's values to pick the hyperparameters.


XGBRegressor performs pretty well on this data, which we can say after observing the values above: MSE 21.02 and MAE 3.19.
Let's train our final model using XGBRegressor on this data
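A sketch of that final training step; the hyperparameters shown are illustrative (taken from the earlier search), and saving the model is an assumption made so that it can be loaded in the deployment section:

```python
import joblib
from xgboost import XGBRegressor

final_model = XGBRegressor(n_estimators=200, learning_rate=0.01, max_depth=10)
final_model.fit(X_train, y_train)

# Persist the trained model for the Flask API below.
joblib.dump(final_model, 'xgb_duration_model.pkl')
```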


Let's also see the model results in a table.

Conclusions 

  • The models performed pretty well on the resampled data without clustering.
  • To improve performance we could also try more complex models like deep learning.

Deployment

This is the final stage; I am going to deploy the model using AWS.
        Before deploying our model we need a web page, which we create using HTML, and we need an API, which we create using Flask; for the code, see the Link. A minimal sketch follows below.
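A minimal sketch of such a Flask API; the template name, form fields and feature order are assumptions for illustration, not the exact code in the linked repository:

```python
import joblib
import numpy as np
from flask import Flask, request, render_template

app = Flask(__name__)
model = joblib.load('xgb_duration_model.pkl')

FEATURES = ['Haversine', 'Temp', 'Precip', 'Wind', 'Humid', 'Solar', 'Snow',
            'GroundTemp', 'Dust', 'Pmonth', 'Pday', 'Phour', 'Pmin', 'PDweek']

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the form values in the order the model expects and predict.
    values = [float(request.form[f]) for f in FEATURES]
    duration = model.predict(np.array(values).reshape(1, -1))[0]
    return render_template('index.html',
                           prediction_text=f'Predicted trip duration: {duration:.1f} min')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```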
To deploy on AWS, we need to create an EC2 instance, as follows.
         After that we connect to the server using ssh, upload all the files, install the necessary libraries and then run our model on the server, as below.


After deploying our model we go to our web page, which looks like this:

If you give proper inputs, you get the results.
Check it at this link: Seoul bike trip


Some sources I learned from:
