Before going to start check this link: Seoul Bike Trip Duration
Nowadays we reduced plenty of Hard work by using Machines, Similarly, we can reduce our Smart work by using Machines that is called Machine intelligence and one of the processes of making intelligence is called Machine Learning. Today, I am going to show you, how a Machine behaves intelligently by using Machine Learning through a real-world case study. In this case study, we are going to solve everything right from scratch from End-End
![Life Cycle of Problem Life Cycle of Problem](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkqXFaEr4rWoy2Pt2r_e_Ar5Szbrf0doFZX_CR8tMBiATfl9AYtX9hf_R_VXfdJJIKHWrwCeELlJIICfx-jaLrZW4SX2bkT9WPqG8UCU8yfM4vlh0eGl6tevAGVoGDLdtOpjtw5Ksqg91m/s16000/Life+Cycle.gif) |
Life Cycle of Machine Learning Problem |
Introduction :
We are taking Seoul Bike Trip data from Kaggle. Let us see something about data with questions of where, when, why? Seoul Bike trip is happening in South Korea in the city of Seoul and
routes are around the Han river, which is passing Seoul city. Most of the people prefer bikes for the trip, Some routes bikers preferred most of the times
- Han river/Banbpo Bridge loop: On this trip, most of the bikers starts at
Eugen station and the route pass along the Han river and finally end at
Gayang Station with a distance of 34km
- Fr 20.09 trip: This trip starts at Hoehyeon in Seoul and end at the
same place, with a distance of 33km
- Cycling in Seoul: This trip along the Han river starts at Yeouldo
Hangang Park and end at the same place with a distance of 13km, and
some other trips also there in Seoul
- Trip from Seodaemun-ga to Guragu of distance 24km
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBrMSmevVd1lD1gO7IAidGcaarU_UJL6wIvOyEO2JuzvvPeqenKp983NujhS_hamEfdSCQldVxnXXusjTlQZpgOLZpbVtnhdGPwZEkyqPLTix-NNcpSPrKs3bNdH8XmFEYDiPtZbG44z1V/s16000/rout.jpg) |
Trip Routs in Seoul city |
This data collected from different bike stands of different locations in Seoul city, It belongs to one-year data, In Bike stations, the owner need to know trip duration according to that they allowed for booking bikes.
Problem statement:
To design a Machine learning model that can predict trip duration for the given inputs
Performance measures:
Mean Square Error: This actual error due to our model for best model the value
is zero, its use used to compare different models to find the best model.
Mean Absolute percentage error: Definitely, a model gives some error in
prediction if we know how much error will happen it is good
Business constraints:
- Interpretability is important because we need to know why it will take that
much duration.
- Latency is concern
Dataset overview :
Seoul bike data of size 9.60M rows we have to design a model that
predict the trip duration by using columns
- ID: User id of the person who takes the bike for rent
- Distance: The distance a person travel on a bike
- PLong, PLat: Latitudes and Longitudes where the person take a bike for
rent
- DLong, DLat: Latitudes and Longitudes where the person return the bike
- Haversine: This is the shortest distance on earth calculated using
Latitudes and Longitudes
- Pmonth, Pday, Phour, Pmin: Month, day of the month, at what time bike taken
for rent
- PDweek: On which day of week bike taken for rent
- Temp: how much average temperature per hour, it measures in
degree celsius
- Precipitation: It indicates average rainfall per hour, it measures cm
- Wind: This is avg value per hour, it measures miles/hour
- Solar: Heat and light radiation from the sun
- Snow, Ground Temp, Dust are other independent parameters all the
values the average value of one-hour duration
- Duration: how much time will take to return a bike, This is the column
that we have to predict by using an Algorithm, It is dependent value
we have only few features only present at the time starting, they are Haversine, Temperature, Precipitation, Wind speed, Humidity, Solar,
Snow, Pmonth, Ground Temperature, Dust, and time features Pday, Phour, Pmin, PDweek
Input features Haversine, Temperature, Precipitation, Wind speed, Humidity, Solar,
Snow, Pmonth, Pday, Phour, Ground Temperature, Dust, Pmin, PDweek
Target Variable: Duration
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFsymwjIx1i-qFQ45PD0IT7ZzmfxwF9tSKWflK9pbc-bjJns-qAFG3a8CLWsBiIJTcANIfXDtqm7JcyRTT9YOV1qHQuVsxCm2VCVysQd_ApFBz-UyxUOydTqdGbEyrHAmzEf9CoumdUuMS/s16000/process.gif) |
Blog Cycle |
Exploratory Data Analysis:-The very First step to solve any case study is the proper analysis of data. It helps us better understanding data, Even though it isn't part of the model design if we analyse data and then will get valuable insights then it is easy to select a model for our problem.
In the First step, we import necessary libraries which need basic processing of data if we need more libraries we can import those at that time but keep all the libraries in starting cells after completion model
After downloading we will get . CSV file it becomes difficult to perform Maths operations then designing a model 😖.Don't worry we have pandas DataFrame to take care of all these things😎
Now we load the CSV file by using pandas and by using head it shows the first 5 rows
Let us see something about the data by Describe method, It shows Quantile values, Mean, Stander Deviations of each feature, if we have more number of features it might be confuse😉
check for missing data:
Generally, data has some missing data present, but in our data, there is no possibility for that condition
In our dataset, we don't have any NaN😲, we will check by using IsNull() and count missing rows by using sum as below
Correlation of Features: Let see how each feature is correlated with other features and also we know how each feature correlated with the Target variable, here we are using person correlation and we visualize by using Seaborn library
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdiiEGb6LtK6XBBI0QC0xzHsK6XgFdDAKjsAGZ_UrYl2e0hdTVimtPEisTEDHEbYvU_Xj5QdDpXCh_8ekfL_GV76_AT1C80GtbjD9RSRK6boWtv2BqjKYmXefl_0y5Ob3LAMLXX2t7mYZE/s16000/3.correlation+plot.JPG) |
Fig.1 Correlation
|
Here white is highly positively correlated and deep blocks highly negatively correlated, Here some features like Pmonth, Pday, PDweek are highly correlated with Dmonth, DDweek, Dday but we need to remove these features it won't much increase our performance it increases the complexity of our because if more features are there then our model needs to perform more number of mathematical operation and also it creates problems in finding feature impotence.
Now we will see the distributions of some features then we detail analysis of each feature, To get distribution plots of features in the blog we are using seaborn kdeplot as below
some of the distribution plot as shown below which are useful to analysis
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBY3pbgfqCnV4Wh8AOZErw0Z7WZuwgPP2tQJTRsrBGdkXrlzFJ-STBDUALRTCGJhQZOvwszpkf7Gav31S7kupVXUCGAgLqJwXmJnUFsvKjKU7rkU5xU3rRDAibNn6ekFfir5w7Vp0Lutst/w470-h123/1.Distributions.jpg) |
Fig.2 |
Here our data is looking like time series, Distribution plot of Pmonth showing that the number of people was going to trip less in starting of years (i.e in months of January
and February ) and the number of people interested in trip increases gradually and reach max in months September and October
and then decrease
The distribution plot of Phour shows that the number of people preferring trip is more in the morning than it decreases b midday
then it increases in the evening time and distribution plot of Pday shows that all the days in a month are ok for trip. Similarly, Pmin, PDweek and Haversine these distribution plots don't have much data that's why I am not showing these plots.
Now let us see more details of Pmonth by taking Pmonth as hue in the distribution plot of Duration
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqT2vbGdEM7J1ciEPsQClIDHp1dFtkWNa-YMvJnop7I1xR7sunD-s0WLB-oexvCwkKknB9Q2-Ue2oGC0VHeZAw8SztiYlh52g81th1_LBL1Gq4_N3N6fs5vmbpK_0Hy9l3FgMvCWl7i0qr/w507-h295/2.Dur_Month.JPG) |
Fig.3 |
The distribution plot shows that in the months of September and October are more people completed their trip in a short time
Let us go next stage of Analysis, we are going to analyse some features in detail, For every feature, we are seeing visually how it with help of PDF plot, CDF plot, Box plot, Time Series plot, QQ-plot if needed we use Auto Correlation plot and Partially autocorrelation plots also and then numerically methods like percentile values and (
ADF) Augmented Dickey-Fuller test to check data stationery.
Before going to the analysis we sample data to 10 min interval data and this data used only for time series plots only, after sampling we get some Null values because in some intervals nobody was gone to trip especially at midnight times, we remove those null values because at the time nobody going for the trip. Here datetime column we created by using Pmonth, Pday, Phour, Pmin and then converted it to datetime format by using pandas.to_datetime
It is good for programmers to use functions, class instead of writing code all the time now I am going to write a function it can plot PDF plot, CDF plot, Box-plot and give percentile values
Duration
we will use plot_fun function for the distribution plot, CDF plot, Box-plot and for percentile values
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiHGHEsyMuRLTOFQYzUciZaaESU9zLxeJ2ynPPlY84NC2-I5Le7ZVww02TJZnK-IWaSGgINhUYpBH3_vPCGBhw_ForvkXRlOpiepfSd1rQJy8pMzvfFi1CX7HO9W9vdpmfriauWD0QeCk7/s16000/5.Distance2.JPG) |
Fig.4 |
We can observe that that distribution plot is "right-skewed".That means most of the trip durations values are small, people completing their as early as possible and from CDF and Box-polt shows that 90% of people completing their trip in 1 hour, here we got some visual observation, Let us see percentile values to make it clear.
Here we can observe percentile values clearly, max value in 119 min means nearly 2 hours. Generally, we consider higher values as outliers and we try to minimize the effect of outliers if data is less, If we have more data point we discard them from our data set
But here, we have a different scenario. It might be possible lower values also as outliers because we need data on who goes for the trip, it is very fewer people who complete their trip within 2 min 😮. That why we don't need those people data
We already saw that the distribution is right-skewed but now we will use QQ-plot to check for Normal distribution. for that, we are going to use scipy library
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSDTzfkZMHOTpGOvReTRBzQukQo5J6Zd2-V519nmnX9ftQJzlpIKiSBhvdpf4bFhWLSdb8pEK_mcgi2_NBeQS527h-hH8RCHYRMaGOWKigrMyfPudYznaf017Hrq5KJUBc2S1EfcfMAVyH/s16000/5.Distance8.JPG) |
Fig.5 |
By QQ-plot we clearly observe that Duration, not in Normal Distribution.
It is good to see Auto Correlation and Partial Correlation Plots because we have time a feature give and look like time-series data. But it not completely time-series because past values are independent of present values that can we observe by data and also by using correlation plots
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirlPIpblkWB1lqEbunaPytqxbz87WODQ1SKGm3wsGf1AZumMEzytKirzd0l6cOfFhb_mKTAngSh5ub_UP1iEtw8kBzKRjccMWWV_1vBV9q5iSpll6QdJP0zebjzeORXSWlPQtk8AheOlYN/s16000/5.Distance5.JPG) |
Fig.6 |
Let us see how the Autocorrelation and Partial Autocorrelation plots will after sampling data to 10 min interval data
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC1XygLQwn5at5xBamXyk-WmRF5NSTI4VYCWGEJUS4myFtseLOatmR5D8pQl78jebUFubai7pexVCYNfZHbsu6EZZ7WtCB1KLDaSjlT7z7W4opXi6lceL9c5JSeZcv1yIG_iJ2beYQ1rSJ/s16000/5.Distance6.JPG) |
Fig.7 |
After Observing the Autocorrelation plot and Partial Autocorrelation before sampling we not getting much correlation with past values but when we sample the data then we got correlation in data that we can observe in the autocorrelation plot, But there is one point we are missing in the data we have the location of each station there are there located in different locations
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwDgEQBCeGyYs2wkq3qDDFuJl-QNWLhbgUWq2KsagDxWTdVSrh54vsGKTa6n-YPAlPiusQeddY0Ibq8vZTGJyQzCFYbjKVynnWTKlzKLnivw83y2yhaeMCppgNQYmmt1L0uSxv-0RtTHul/s16000/locations.JPG) |
Fig.8 |
To avoid that problem we will make data into clusters by using Locations into clusters then we will check this we can do in next
Distance
Distance is a feature which highly correlated with Duration but we actually don't use it at the time of the model, because this feature we don't at the time of starting the trip and but is very useful to the analysis of data.
Let's start with the Distribution plot by using plot_fun function
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHNcb2HYU7GhmdPdqrKp1dyhOWtLLNzSydFUg2Ted_Z_z6YBGjmj2QNnNne9EUSy840UD1b3f8uac4kGi1geKeZzW7LoVWAOvlgsVBtjwXaHkI4NHG6WSpO0JpFu2vs6Es08CpqklTc5Fv/s16000/5.Dist2.JPG) |
Fig.9 |
Distance is very similar to duration that why their distributions also similar, here we can observe the values are in meters and the max distance is 33 km, Now we use percentile values for better analysis of our feature
It concludes that few people travel less distance, Let us take some threshold value ex:100 don't allow then in our data consider the people who travel distance more than the threshold value.
Temperature:
It is one more feature in our dataset here the max value is 39.4 Degrees and min value -17 Degrees,
Let us see some plots to know about this feature by using plot_fun
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn76NpNbGBdCYxR5iMOSKOQ4NqSpxZC4JT7oOj2jB8tijCQ-_09Z1SvZnXsTQEOE4dN3SSN4wySsXcJXDMT015zB0bd8nO-0Ru2WTfMuegIaE_pp8U_aadu1XF6l4NLUCfeef29OcB1pjf/s16000/6.temp2.JPG) |
Fig.10 |
It concludes that the 'Temp' feature is not in Normal Distribution, by using CDF and
box plot 80% of values between 10 degrees to 39 degrees and also we check percentile values
We know that temperature is time series lets us go for more analysis of this data firstly we will see how it varied with time and then check Stationary or Non-Stationary then finally we see how temp and Duration going on time plot, It makes more fun😃 for this purpose we are going to use seaborn line plot.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt-8yCZkIRRLZ7AzfSQSnsFHeJinywRJQ4Ejiwoe8z5289zin-sS1eHZP2BaPdQQyL9S3o0oIOuIs0my_cu-YK-1KdAhyphenhyphenR2q3LxHho7UiJvfTAiJa7T0xHnPFsDAmRAMaZqHNcveTm6aCb/s16000/6.temp5.JPG) |
Fig.11 |
Herein, time series plot
temperature shows that temp is low at the starting of the year and reach the max
value in the months July to September then it decreases.
Let us see something more by using the time series plot of aTemp and with hue as Duration
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjONKj8AON1GZYPt4nWkUUgVgcOnHEjQkolHZxDwSZcPXVgz7PoMXiGNDCCUb_B9Nr4QAg1Yr_O9SmawRvXGfIl9n1S7fXXMn6uTjlji8-CmxAjpFxyTR8Wtr26P1farYSWNUdMVkQwf3BC/s16000/6.temp7.JPG) |
Fig.12 |
From the above time series plot Duration shows the temp at max ranges or min ranges the trip duration is less(i.e
duration is less at starting of the year and ending of the year and between July to
September)
Now we will check the data is Stationary or Non-Stationary before going to check, we will see when we say data is Stationary
After observing the above plots we will understand when we tell data is Stationary and Non-Stationary based on Mean and Variance
To check this we can go visually by plotting the mean and variance curve in a time series plot and another way by using statistical methods like the ADF test
In our blog we are using the ADF test to check data is stationary or Non-stationary
Augmented Dickey-Fuller Test: Test is a type of statistical test called a unit root test. Unit roots are a cause for non-stationarity.
- Null Hypothesis (H0): Time series has a unit root. (Time series is not stationary
- Alternate Hypothesis (H1): Time series has no unit root (Time series is stationary).
If the null hypothesis is rejected, we can conclude that the time series is stationary.
There are two ways to rejects the null hypothesis:
On the one hand, the null hypothesis can be rejected if the p-value is below a set significance level. The defaults significance level is 5%
- **p-value > significance level (default: 0.05)**: Fail to reject the Null hypothesis (H0), the data has a unit root and is non-stationary.
- **p-value <= significance level (default: 0.05)**: Reject the Null hypothesis (H0), the data does not have a unit root and is stationary.
Now we will check our data with ADF test for Stationary
After applying the ADF test we got a p-value as shown above fig which located inside the orange region, Now we can conclude that data is stationary because of p-value is less than 0.05
Precipitation:
Precipitation is another feature in our data, It tells what is the rainfall at the time of trip start and the units used here 'mm', Let's see visual plots to get a better understanding of data
we start with distribution plots as usual then we will go for numerical understanding,
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh98Rp4iZx39DXYU8Bu3K4HTuki7Ish1tFyUsTWHgW3dMyYZrj0A6b1VubKrp7OEb-OG91vYt-GAuko2f8BsIA1xl6n40u4yi0Cy-i618BtO5C6M6D1_mGyKOGHtRHNAzsUVF_0aGVNibr0/s16000/10.preciptation1.JPG) |
Fig.13 |
the max value of precipitation 35 cm
The distribution plot looks like power-law distribution that means precipitation
occurring very rarely, by using the box plot 35cm is showing as an outlier but by
observing the previous year rainfall in the google it was happening
Here we can observe most of the values are zeros that mean, Rainfall occurs is not a daily event we know that
Let us see how the time series plot Perception with hue as Duration column as hue
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjw1AIFfvWGxyxYqUjkR4Fm_7y0YQSBHa0fMTwxwGXphwP4w2W45ZHPfmG_-hZIEIkqrvs5nTSskhj1UxInC0_Yd1QPcDPiWQXA_EQP8bf4Mxys-WUUWE4a22Z_3Syf6Or-v-RqWiovpHpj/s16000/10.preciptation3.JPG) |
Fig.14 |
Here in time series plot of precip and duration as hue if rainfall is low duration is high and
after heavy rainfall for some days, duration is low that can observe when we zoom it by using Plotly library
Here we can clearly observe that precipitation data is stationary data
Here we can observe some correlation between Duration and Precipitation but it, not a daily event it won't show much impact when we consider all data
Wind
Wind is the feature in our data it represents with what speed wind flow happening at the time of trip start and units for wind speed is m/s in our data, Now we are going to analyse the Wind feature, let us start with the distribution plot![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIoq965IV08jGs6KbYM3tw55s8XNNmUc8ya1pV-wMuV7WFaiEfCTBXE5BEKOC1HB6BxEnsGjQSq1z3srlwpFKaVCiB3OOQY5aMUW2xohPITQN-taiZmUNDMWE6NEV2lcxL1eVCu9fC81q2/s16000/7.wind1.JPG) |
Fig.15 |
Here we can observe that the distribution curve for wind is not smooth and in the changes occurs more but
the range of changeless below 3.by using box plot and percentile values 90% of
values are less than 3 and the values above 7 may be outliers and also we can observe in percentile values
let's see how the time series plot Wind
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTb9ISNzlc-P177XuefFx77HBI_KKXL0srPG1FgX4HaLW6qDSsFdLSw5z6oLWvQQQaNgpJP8oCPIsZb_XAXWUPCEZL-4meI9je7L6WhkCKmsQQJtIz19XuV5qqjIPcG0XkFIuiLmOg4tZ-/s16000/7.wind3.JPG) |
Fig.16 |
From the above plots, we can observe there is no much effect of wind on the duration and check for data is stationary or non-stationary by using the ADF test
Here p-value is less than 0.05 that we can clearly observe from the above that mean data is stationary
Humidity:
The humidity feature tells how much humidity in air present at the time of the trip start and units are %
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRMPYKwpEoYNwlQ5YU10h0aqrFSCz91khGDqm0FJ_O0G0ipGOVq7KdK01dSExQ6yjcURRZmXtVkXQaiysOZ9kTG4skXhFrEMW-NzpAbX2DoQFCnPhm0hZ1ZaTPAQWYrbhexu1sl10pRezu/s16000/8.humid1.JPG) |
Fig.17 |
Max value of Humidity is 98%
The distribution plot is not a smooth curve and box plot values show that there is no possibility of
outliers that can be observed by using percentile values as below
It's looking good data, let's see how the time series plot with hue parameter as duration
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikqVwPFLJI0Y2wkl-joSuSGmQCJHsKJKUnr2O5pf74thMqbjmlzKwheMHNxVizscHTk-cKkLvWhljbj6G9JpTxmjL0fMG5EQ6Y1YBrvQ5A4rJ5na5cG877C3sV2KiQueTooxa-LuQRIIHA/s16000/8.humid3.JPG) |
Fig.18 |
By time series plot Humidity the value is low at the starting of the year
and reach the max value in the months July to September then it decreases and
to check the data is stationary (mean and std is constant values at any time) we
used augmented dickey-fuller test here p-value is less than 0.05 that mean
Humidity is stationary data that we can observe below
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnlBj01ScGog61eZgi2Y9S4mDCFa9LpvWksQHQGPDFAPASwC-ytGZp4vNyxUraLj8QJMtEB0FlzVlNJ9n1uFwHvY2P4qJGRJgRy3yxCb6AaOatfQ3Q5yIGOm-fP9ODOl2DxswvSNZP7PvL/s16000/8.humid4.JPG)
Solar
Solar feature this anther feature in our data, let visualize the data for better understanding
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEill1UihJTEuJPEAfhrLaJkOYn7KXLNXS6x3YAQzTo4jvdzgl_BOGdyVhj17LVs22vS-bFBn68Ti0jWooaRo186ioOWs1FbC0jB0zWfxbqkxj7UM1Dk6OUFGDtEeCpsAOkBv3a1EuGsOpQH/s16000/9.solar1.JPG) |
Fig.19 |
Here we can observe, The distribution plot is a smooth curve but it is not normal by the QQ plot, box
plot and percentile values show that there is no possibility of outliers and here
30% of values zeros because at night solar value becomes zero, It can also observe by using percentile values.
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1vMmkdKEUY5GLO4KDI1yyK5hZEy9WNlV-NHJQRtR-tVQKlEqC2kYy-Lc9npCM4PsnvXcz1axUzRvVaz5J-C-sjADr5J4vnXxuC9QA4TuYuFKAdjscNNx07fLZkb0MZJWw1w_xw8NvnZmQ/s0/9.solar2.JPG)
Here we can observe max value is 3.52 and many of the values are zeros
After Observing QQ-plot we can clearly say that it is not normal distribution data, let analysis more by using time
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmcin_Ytf_dJz5xr9EQXuPK6_hyphenhyphenJlws05L8LZbvjpQfusVoHkAu9tk7mGIwhYcTW3y42-ZHDmNc4kYtpxJoGKojPunc2wp9q9l9GQ-poaqNaX710bp34bHsblKuWu4-s2FkuuN9rWcDvGs/s16000/9.solar4.JPG) |
Fig.20 |
In time series
plot Solar the value is low at the starting of the year and reach the max value in
the months July to September then it decreases, by using Plotly library we can observe after zoom, The data is stationary that we can observe by using ADF test, here p-value less 0.05
Snow:
Snow is a feature that represents how snowfall happens at the time of trip start,
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOlg4H29nbFRRYqWyfM8wlPj8CgGMQxPUdF18MYRI8ugoA2e1rfYIvzlAKn09nsCqLM8_gh7eDc6AgcKZBJuX5bopSAnzwFEu0CVX6bat7IvpDghhI4yzl-EKXqA-6_Isp96lWzs-_NHHm/s16000/11.snow1.JPG) |
Fig.21 |
Wow! what plots, By observing box
plot and percentile values show that there is the possibility of outliers and here 90%
of values zeros, because snow happening is a rare event in Seoul, this we will get by percentile values
Snow is stationary data that we can observe by ADF test
Ground Temperature
The ground-level temperature at the time of trip start, this feature is similar to temp that we can see later but now, we will look at how CDF, PDF and Box-plots visually
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb6Uq2vGama4NNvkLXV0SLTijM0fp6gA7X5BjraAnbaDQ9DkpbBag461Pfz5o8cFEBvFvvJIHW5frZkCDDCibsM4gzwX7vEB92_dHwlJsxAqSV09IHuGKUdkcB1VCsmIbXnKxRazZEMfgj/s16000/12.Gtemp1.JPG) |
Fig.22 |
Max value of GroundTemp 62.2, The distribution plot is not smooth curve and box plot and percentile values
shows that there is no possibility of outliers and here 90% of values less than
35 that we can confirm with percentile values
Now we will see how the Time series plot
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVUKBY7wo92BK1CV5AF7xXdlWq1zSUDi26LhRc0M2xUEzbN7opNXcIv-007JF5w92Ks6C7NY-9pFBmhdhMeiZymj2EOSAMAHYe6sdZIf267bkxHhiEpghQX8PRs8sc9yImI5oGS8xtdv2b/s16000/12.Gtemp3.JPG) |
Fig.23 |
Time series plot GroundTemp shows that Groundtemp is low at the starting of the year and reach the max value in the months July to September then it decreases,
plot with hue is Duration shows the temp at max ranges or min ranges the trip
duration is less(i.e duration is less at starting of the year and ending of the year
and between July to September) and .by augmented dickey-fuller test here p-value is
less than 0.05 that mean ground temperature is stationary data
Before a while, we saw there is some relation between Ground Temp and Temp, Generally, we know the relation between them, Let see how they are going on the time plot
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXzAkUr5WFcrBeW-_S-vp7M2yaV577tFKLCs26g0wQz6TwVpz-zhOlr3P_McanvoCxK3ApsXpN8F_l9rIQeSoEcsZDFilDpW9pEj2TtlwsjtiLP7EXGEd5MuZUjH3ArSxk2Et9kaglvCoB/s16000/12.Gtemp5.JPG) |
Fig.24 |
Dust:
Dust is another feature used to predict the trip duration, it indicates how much dust present atmosphere, let us see how the dust data
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKeOUWCEmgTr0l39J0MVdiUjM9lXOF4JqS8GaFyTI46yjhcWCvJaVzY15kDPoN5r0ckKroqMr0mKVaJpA4lN1BV3jP8ERt0q_6tYCBF8phs3ocW9ZTuH0tThxHlTxg3cv6rgk4qwP2r-77/s16000/13.Dust1.JPG) |
Fig.25 |
Max value of Dust 304, The distribution plot is not smooth curve and box plot and percentile values
shows that there is a possibility of outliers and here 90% of values less than
75 and check percentile values
Here we can observe a difference between 90 and 100 percentile, let see more values from 90 to 100
In this, there might be possible for outliers there more gap between 99 and 100 percentile values
Time series plot Dust shows that it rotating around 20 continuously there no lot
of change occurs, plot with hue is Duration shows the when the dust is there for a long
time then that time trip duration is less. by augmented dickey-fuller test here the p-value is less than 0.05 that mean Dust is stationary data
Finally, this data is good because the outlier possibility is very less and it doesn't contain NaN values
Now we will Latitude and Longitude values for more exploration
Let's group data by using PLati, PLong
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH7vgzT0GjKXIRJkfTCGuh3cuIjG0hYr_ldON6lhGYvp8BLXTlF3b0iTvleps_N3Zr9kxz5D24ehX-zwzNA0kKNDbuHp_Z7i2gtq3BwP1GGQNlFlK0HoVZ_aUD-UXZ5M1k05IhAGvijG79/s16000/14.ex1.JPG)
By Grouping on PLati, PLong and count on id and mean on duration, the data has
1524 starting locations but in some starting locations has less than 10 people i
will feel those are outliers because in a bike stand we need to take more bikes for
rent, Some people trip duration is less than 2 minutes that mean that are not
complete their trip they might be stopped because any reasons and similarly some
people did not travel more than 200 meters and they also consider as outliers
Now we will group by PLati, PLong, DLati, DLong to find the number of people travel from one location to another location.
Between two places(locations) a single person also travels I feel that it was an
outlier because he might be stopped because of any problem with the bike or he
used the bike for different purposes other than the trip.
Data-Preprocessing
We observe that our data doesn't have Null values, but there is a possibility of outliers in data as discussed above, Let's try to remove them because we have a lot of data points.
Above we are creating two data frames df, df0
df: It contains PLati, PLong, DLati, DLong values after removing number of people travelled less than 6
df0:It contains PLati, PLong ie. station locations here we are allowed if the number of trips from that station is greater than 20
Let us take these locations(Latitude and Longitude) values from the original data by merging as below
Then it is necessary to check how much data is removed from original data, if it is more then it becomes difficult for our model to give more accurate results, here we lose only 4.78% of data from original data it is ok to get
Modelling
In the blog, we are going to see two types of modelling techniques
- Time-Based Modeling: here we have time is given as a feature hence we apply few time-based models Ex: Moving Average, Weighted Moving Average, VAR, Prophet
- Complex Modeling: We also check some Complex model Ex: XGB Regressor, Support Vector Regressor(SVR)
Input features: Pmonth, Pday, Phour, Pmin, PDweek,, Temp, snow, precip, wind, dust,
ground temp, PLati, PLong
Output: Duration
Time-Based Modeling:
Before going applying models lets look at the data, Here in data we have data of different regions that we know by Latitude and Longitude value, because of that there is some difference in temperature, Humidity, Precipitation, Snow and also on other features, to avoid this problem we will make whole data into different clusters of regions by using K-Mean clustering on Plati and PLong then we apply our models, After that, we will check without clustering based on the performance of model we finalize a model for our data
Cluster-Based Modeling:
Initially, we will make different clusters of regions based on PLati and PLong using K-mean cluster, Here I am making it into 30 regions
Here we clu is a function to make data into clusters, It good to visualize clusters
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVgp_adl2mminJ7CXwgXDKGJJ8TI77JhKnRuvXscoXtE9KHIgpV8Fl4quw954ceB_sphsltmbqYU3XLsHN2IiN_h4s_l1lFOmWkI0Xef33hVB0-UI39fN0hAmQqjccF3M6yMvV2QyNv11D/s16000/Data_pre8.JPG) |
Fig.26 |
Regions as shown in the figure, we have data with clusters and each cluster has time has an index, Now we are creating 10 min bins in each cluster as below
Now we have cluster data with time bines but all regions are mixed in data, Let's separate them by using group by function on cluster and time bine
It is time to start modelling:
Actually, we don't know which model performs well on our data, Let's try with different models then finalize one model
Firstly we will start with simple Univariate Models:
- Simple Moving Average
- Weighted Moving Average
- Exponential Weighted Moving Average
For a simple univariate model, we need a Duration column based on the time index from each cluster of data
Simple Moving Average:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj97YQPV1yY9glPVsNChUpfJSoLROphl0Z4nHshmv4HTC3QQs_MJkjPc0R5F_t82tf9VdqQsjEvSh0spX6va6Z0d4wZZ211-UlBaHr1kRnGyls13nHNAhmPRkFJag_1mMbPhX2rSN1zO0KX/s16000/movingavg.gif) |
Moving Average |
In this time-series data, most of the present data depend on past data by using that past data we will predict future value by using
Moving average, Moving average is used when the data is univariate time-series data
n is the number of past values, n is a hyperparameter that we will find
Let apply moving Average on our data
Here we can observe mean square error and mean absolute error are the values used to find the performance of the model, let us see the performance of the model on different 'n' values
Here we can observe Mean Absolute Error and Mean Square Error decreasing and as window size increasing then change in error is very less
Weighted Moving Average:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBUKzgJ3BqSbYAvKUzJlfbYzDxk_h5yrKCB-ZHoM_tBgS46HRuwFWPzTwhxzvjTjrh9L0PvySTxEN7mzIBrOnllZ0ln-J-GnrwM27jB_rUr1pMx0SQOnwGHnhPAeKeEycMFADhQWeBatez/s16000/weihted1.gif) |
Weighted Moving Average |
In Moving Average Model used gave equal importance to all the values in the window used, but we know intuitively that the
future is more likely to be similar to the latest values and less similar to the older values. Weighted Averages converts this
analogy into a mathematical relationship giving the highest weight while computing the averages to the latest previous value and
decreasing weights to the subsequent older ones
Rt-1, Rt-2 ... are past values and N, N-1 ... are weighting for past values
Let see how the code look like
Here also MSE and MAE are performance measures based on that we select 'n' value
Now we will how values we gatting for different n values
In weighted moving average also we are getting closed results with moving average results
Exponential weighted moving average:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3oXy9TjmQQF45rwlXsR4C1iOMQdHXd1FpYuAq4uUpITb0-mGE3wj0dkyjIJVP6ZIEE9BKJuDU5XR18V8J1qoePbKB_Zjum8GevLjMSwpCb2vSkOJj9SnqI6XzW4ofDEy2ea3oPi9wZiSz/s16000/expontial1.gif) |
Exponential weighted moving average |
Weighted averaged we have satisfied the analogy of giving higher weights to the latest value and decreasing weights to the
subsequent ones but we still do not know which is the correct weighting scheme as there are infinitely many possibilities in which
we can assign weights in a non-increasing order and tune the hyperparameter window size. To simplify this process we use
Exponential Moving Averages which is a more logical way towards assigning weights and at the same time also using an optimal
window-size.
In exponential moving averages, we use a single hyperparameter alpha (α) which is a value between 0 & 1 and based on the
value of the hyperparameter alpha the weights and the window sizes are configured.
To find Exponential Weighted Moving Averages we used pandas. ewm function with hyperparameter N.
Here also MSE and MAE are performance measures based on that we select 'n' valueNow we will how values we gatting for different 'n' values.
In exponential weighted moving average also we are getting closed results with moving average results.
In univariate models, we are not getting good results, Let's see how multivariate models will perform on our data
Multivariate Modeling
Up to now, we applied univariate model and they are not performed well as we expect, Now we apply some complex model to our data
- Vector Auto Regressor
- Prophet
Before applying the model let's check how the past values of other features(Temp, Snow, Humidity, Precipitation ...) are affected Duration, for this purpose we will
Granger's Causality TestHere we consider only a random cluster and we applying Granger's Causality test then we check coefficient values
Let's see something about Granger's Causality test
Granger's Causality test: Generally in time-series data, past values of one feature will affect other
features. By using this test which features are affecting more we will know
- Null Hypothesis(H0): Past values of time series(X) do not cause other series(Y)
- Alternative Hypothesis(H1): Past values of time series(X) cause other series(Y)
In this test 0.05 is used as a threshold value
Let's look at outputs of the test here we consider only the past 10 lages
Here we can observe that lag values how the features affecting Duration if feature lag values less than 0.05, the importance is more compared to other features.
By Applying Granger’s causality tests on cluster data the lag features of
Temp, Wind, Humid, Solar, GroundTemp, Pmonth, PDweek, Phour are has non zero coefficient values but other features have some lag
values have zero coefficients, but here Pmonth, Pday, PDweek, Phour has different scenario there may possible for 10 lags the values might be same
Let's create our data for train and test
Here we separated each cluster then we make each cluster into 10 min time interval bins and also split the data train and test in each cluster. then we store train datasets in list_train_clu and similarly test data set in list_test_clu
Now we apply our Multivariate Time Series Models
Vector Autoregression (VAR):
The model used when data is Multivariate Time-series data and Time-series are influencing other time series. In the VAR model, each variable is modelled as a linear combination of past values of itself and the past values of other variables
in the system. Since we have multiple time series that influence each other, it is modelled as a system of equations with one
equation per variable (time series). Order is a hyperparameter that is used in VAR
In this Yt−1
lag values of 1
st
feature and Yt-2 lag values of 2
st
feature and similarly other features
Let's see the code for VAR model
Here we applied VAR model to each cluster then we got MSE and MAE error for each cluster then take
mean of error then results
Ho here we are getting large error values, It is not working well lets us see a more complex model
Multivariate time series forecasting using Prophet
The prophet is a model we can use for both Univariate and Multivariate time-series forecasting and it has a lot of parameters for fine-tuning the model and also most useful is it tunes the parameters automatically 😲 and also give confidence intervals of output
Let's see how the code will
Here also we will get MSE and MAE values for each cluster then we take the mean of them
Here we can observe error values for 28,29,30 and also mean errors of all cluster, the error values less compared to all the above models above
let's see how some complex models perform on our data
Complex Models
Before applying models, In our data, we have latitude and Longitude values but we are not used them, let's try to use those features
For that purpose instead of using original latitude and longitude values, we going to use cluster centre values, we will get these venter values from 'KMean'
We have also Latitude and Longitude values in the data, Now we group the data by using the cluster, time_bin
Split the data train and test, here data1 split 90% for train and 10% for test data, because of we have time data we don't apply stratify
Before going to apply models, for the complex models have a lot of hyperparameters, to find the best hyperparameters we are going to use Grid SearchCV or Randomized SearchCV, but in GridSearchCV we can use only one scorer but we have two scores for that we create a dictionary of scorers it contains methods of MSE and MAE
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjER123G8UXihJ9Fd6Bc33wSTyvRSdRC7DV7NGY8rmZmPvTc_hbdKvrdJzXV7Noz5cguQE-C16TGQB0fyEAwgCnbho2NCq0aAfFGRzl3NRPpYNri5kAmtijz58_SePG1Wa1L6HwtUfIfWuI/s16000/m_m12.JPG)
Let start with simple complex models
Ridge regressor
A Ridge regressor is basically a regularized version of a Linear Regressor. i.e to the original cost function of linear regressor we
add a regularized term that forces the learning algorithm to fit the data and helps to keep the weights lower as possible.
let's how the codel
Here we are used RandomizedSearchCV to find hyperparameters and we have time data for instead of using normal Cross-Validation here we are used TimeSeriesSplit, Lets see how the performance
Here I using split2_test to check performance because, when using split0 and split1 the training data is less, but at split2 we have more train compared to others because we are using TimeSeriesSplit, we got around 8.51 MAE as a minimum and all values are close, It is good results let's try more complex model
Support Vector Regression (SVR)
Support Vector Regression (SVR) uses the same principle as SVM, for regression problems also, in SVR we will try with different kernels and different hyperparameter values
here degree is applicable for polynomial kernel only, let see how the results
Here we got poor results compared to all above model MAE 36.55
XGBRegressor
XGB models are widely used model in Machine Learning because their performance and training is fast, here we can use GPU also to train, this big advantage to the XGB model
Let's see code snippets for XGBRegressor
Here we are used hyperparameters n_estimators,learning_rate,max_depth with RandamizedSearch CV with 5 splits, Now we will check the performance by using the last split ie. spli4
we know cv_results in this split4_test value are useful values to find hyperparameters, In this at 7th position we have fewer values MAE=7.46 and MSE=112.8 this values occurred for max_depth=10, learning_rate=0.01, n_estimaters=200 they better results compared to now
we have features but we don't know which feature is of importance, Let's find which feature important by using the XGB model
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTFntwuCyAeOf-PT9lqLm7jB41uLph-WYWPOu8G4lCXk25le15Cxj8hc9ggD9xhuDHGpuzEq0XBGC9lHudDp6pJrfRQpG6liKaIzFBt1Io51QckOupCokS_XELFsOHb5mp4P3M2LpybuAd/s320/m_m23.JPG)
Phour, PDweek, Ground temp are the most important feature in our data
Now we see the results of all models in a table to find a good model for cluster data
Up to Now for Modeling we are used Clusted based data, Now we check the models how they are performing without clusters
Without Cluster
We saw how the models perform on cluster data, Let see if won't cluster the data then train the model then check the performance
For this, we take original data then we create datetime column as disused above then we have the features Duration, Distance, Pmonth, Pday, Phour, Pmin, PDweek, Temp, Precip, Wind, Temp, Precip, Wind.
Now we will resample the data to 10 mints interval data by taking the mean of other feature as below
By the re_sample data we got
Here we got only 52269 rows, Let's split data train and test
we split the data 90% as train and 10% as test data, Now we start with simple models then we try complex models
Simple Moving Average
Let's directly step into code snippets
Compared to the cluster-based data model here codel is simpler, Now we will see the performance
Wow, 😲 we got very good results for a simple moving average model here we got MSE 82.14 and MAE 6.17 that we can observe above
Weighted Moving Averages
Here also we directly see code, It is also small code but we will get more accurate results, let's see
Here we will get simply replace online of code from moving average, here also we getting a close result with moving average
It closed values MSE 83.01 and MAE 6.16
I am not going to discuss the Exponential Weighted Moving average because it is also performing same as the above two
Let's directly jump into a complex model
XGBRegressor
Let's see how the XGB model perform on without cluster data, codel is simple
Here also we were used TimeSeires Split and RandmizedSearch for finding good hyperparameters after this we will get results, Let's check the results, as usual, we use the last split values to find hyperparameters
XGBRegressor performing pretty good on this data that we can say after observing the above values MSE 21.02 and MAE 3.19
Let's train our final model using XGBRegressor on this data
Let's see cluster-based model results in Table
Conclusions
- Models performed pretty well on resample data without clustering
- To improve performance we can also more complex model like deep learning
Deployment
It is the final stage, I am going to deploy using AWS
Before deploying our model we need a web page that we can create by using HTML and we need API for that we create an API by using Flask, For the code
Link To deploy in AWS, we need to create an EC2 instance as follow
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-_FFPGLmC7JhFTmiy5LbZ4b4OA_W8IMfFlX8bP2r-EoACUpdvthcvYOjLHLMrd5NkFUAAwny6sFrZWNIvbVMGBHuseevYfblVsWE9nU_BZ1CS7FCnBm50rneJwvnHeqRwXNI-ozisVsDo/s16000/AWS1.jpg)
After that we need to connect to server by using ssh cmd, then upload all file, then we install necessary libraries, then we run our model in server as a blow
After Deploying our mode we will go to our web page then it looks
If you gave proper inputs then we get results
Some Sources that I Learn :
Comments
Post a Comment