0:00
Hello world! Have you ever wondered what regression analysis is, or wanted to create your own linear regression model
0:06
in Python? Well, if you have, you've come to the right place, and I'm super excited about this video. Why? Because it's my
0:12
first mini-course video. With that being said, I probably need to apologize in advance, because I'm not a
0:18
video editor, so it might be a little rough around the edges. With that being said, I'm still
0:23
attempting, and if I accomplish this goal please let me know, to create the very best linear regression Python
0:30
mini-course out there on the interweb. So if I accomplish that, let me know. With
0:35
that being said, what are we going to cover today? Well, it's a lot, and that's okay, because I'm going to
0:41
hold your hand through every step of the way. Just to give you an understanding of how I've broken this course up: the first part
0:48
is understanding the core concepts and building the mental model so you can really understand what's going on,
0:55
and the second part is the programming aspect, where we actually complete a mini machine learning project
1:01
in which we attempt to predict the prices of stocks based on GDP. So
1:06
let's dig into the details of what each section covers. In the first section, really understanding things, we're
1:12
going to discuss what regression analysis is and what linear regression is,
1:18
when you want to use linear regression and when it makes sense, and then how the machine determines what
1:24
the linear regression model is; that is, how we determine the best fit line, or the best fit equation. We're going to
1:29
briefly touch on polynomial regression, because not every relationship is linear; there are other
1:35
types of relationships, and it's just important for you to understand that, but we're not going to dig into too much
1:41
detail on that. Then, now that we understand linear regression models, we're going to figure out whether
1:46
the model we just built performs well, whether it's a good model.
1:51
Then we're going to learn how to do simple linear regression in Python; simple linear regression is just a one-
1:56
input-variable regression. Then multiple linear regression in Python, which just uses multiple variables, and then we're
2:03
going to learn how to visualize these linear regression models, just because it makes things more
2:09
intuitive when you can see how all the data fits together. Then we're going to cover a few things unique to
2:14
multiple linear regression, including multicollinearity and the variance inflation factor, and after that we're going
2:21
to do our mini project, where we create a machine learning model using cross-validation and try to predict
2:29
the prices of stocks. We'll see if we're successful there. So again, like I said, tons of stuff, and like I also said, I'm
2:35
going to walk you through every step of the way. With that being said, let's get started.
2:41
What is regression analysis, anyway? Well, regression analysis is a common statistical method, used in a variety of
2:47
places but typically finance on this channel, to determine the relationship between variables. The process helps us
2:55
understand which factors are important, which factors are irrelevant, and how they affect each
3:00
other when trying to predict something. So let's cover the key terms. There's the dependent variable, the
3:08
response variable: it's the target of what we're trying to predict or understand. And then there are the
3:13
independent variable, or the independent variables. Notice I keep saying independent: they
3:19
shouldn't affect each other. These are the independent input factors that we think, we don't necessarily know, but that we
3:25
think will influence the dependent variable. So for instance, if we're trying to
3:31
predict the value of homes, or the price of homes, the price prediction would be the
3:37
dependent variable, and the independent variable or variables would be, let's say, the
3:44
square footage, the number of rooms, whether there's a garage or a finished basement, all of these different things.
3:50
So that's the key. To recap, we have the thing that we're trying to predict, which is the dependent
3:57
variable, and the independent variables, the factors that we think are going to influence the prediction.
4:03
So why is it called regression? Well, the term regression was used by Francis
4:09
Galton in his 1886 paper "Regression Towards Mediocrity in Hereditary
4:14
Stature"; in other words, regression towards the mean. If your parents are really tall, it's likely that you're still going
4:20
to be tall, but maybe not quite so tall. So the terms and words might sound
4:26
complex, but the process of linear regression is in fact quite simple. So now that we understand a
4:32
little bit of the background, what it is, and why it's called regression, let's cover the
4:38
various types of regression analysis, starting with the simplest: linear regression.
4:44
What is linear regression? Well, now that we understand what regression analysis is, linear regression is pretty simple to
4:50
understand. Linear regression is a statistical method for modeling linear
4:56
relationships between a dependent (response) variable and one or more independent (explanatory)
5:01
variables. That's like a tongue twister. Simple linear regression predicts using
5:07
one input variable, and multiple linear regression predicts using numerous explanatory variables.
5:15
Right? It's that simple. So let's break this down. Linear regression
5:20
assumes that there's a linear relationship between the predicted variable and the independent variables.
5:26
That's super important to understand. Going back to the house price
5:32
example above, what if we predict house prices from the size of the house, just one
5:38
variable? That would be a simple linear regression. Now, obviously that wouldn't yield a very good prediction, but you get
5:46
the idea. What we'd want to do is then add more variables: the house size, whether
5:53
it has a garage, the number of floors, its location, all sorts of other variables,
5:58
maybe how many bathrooms, acreage, etc. We would put that in our model,
6:04
and then we'd analyze it and figure out how each one of those independent variables affects our target
6:11
variable, which is our house price. And this brings me to a critical
6:17
point: when do you want to use linear regression? Well, it's in the name. You want to use linear
6:23
regression when you assume there's a linear relationship. You don't necessarily know there's
6:30
a linear relationship at the outset, but you assume there is. So let's clarify, and
6:35
we'll do another example: the relationship between the S&P 500 index price and GDP. This is also called the
6:43
Warren Buffett indicator. In essence, as companies sell more goods, their stock prices should go up, and we can see if
6:50
this relationship is genuine. Now, I'm going to use data from the Federal Reserve; it's free, and you're able to
6:56
download it too. I'll also include links for all of this code on my GitHub,
7:02
but if you're not interested in the programming, don't worry, I'll show you the charts here. As you can
7:07
see behind me, as time passes it appears that both GDP and the S&P 500 are
7:13
increasing, so it does seem like there's a linear relationship. But there's actually a better way to
7:18
visualize this, and that's by using a scatter plot. Now, on this scatter plot the y-axis is
7:25
the S&P 500 index price, and on the x-axis we see GDP, and it does appear
7:30
that we could probably draw a line right through that: over time, as GDP increases,
7:38
so does the S&P 500 index price. So why don't we actually see what this looks like with a line? Now, with this line it's
7:45
super clear to see, but how do we know if this is the right line? What is, let's say, the best fit line? Let's
7:52
cover that now. OLS, or ordinary least squares, is a linear least squares method used to
7:58
estimate unknown model parameters. You might say, "Leo, what in the world did you just say?" Well, I'm going to make it
8:04
simple for you. Do you remember the equation y = mx + b, and how that's the equation for a straight
8:10
line? Well, that's all we're trying to do: we're trying to draw a line, or estimate a linear
8:16
formula, that best fits our data. So let's think about this for a minute. In our
8:21
example, y is the S&P 500 price, the value that we're trying to predict, and GDP is x. Then m and b, which are the
8:30
slope and the intercept, are the model parameters that we're trying to fit. And when we say fit, what we're really trying
8:36
to do is estimate the values that make the best fit line by minimizing the error. There's
8:42
really no magic here. Take a look at this graph: the error is the difference between the actual points and the
8:49
estimated points. What OLS does is take the differences between these points, square them, and then add them up; this
8:56
is known as the squared error. We try to find the line that minimizes these squared residuals, and you can see the
9:02
table of the OLS output here, just for clarity purposes. But there's a problem with squared error:
9:09
more points will lead to a higher squared error, even if the line fits better. The
9:15
solution to this is the MSE, or mean squared error, which just divides the squared
9:21
error by the number of observations, so that we don't simply get a larger error
9:27
for more observations. But this doesn't really tell us how much each observation is off by; remember, we squared the
9:34
differences. We can take the square root of the above to get a meaningful number, which gives us an idea of how much our
9:40
prediction strays from the actual values. This is called the root mean squared error, or
9:45
RMSE. The reason this formula is such a great tool is that it accomplishes two things:
9:51
one, we need to make the differences positive, because we could have one prediction that's positively incorrect
9:57
and another that's negatively incorrect, and they would unfortunately balance each other out, so we need to
10:03
essentially take the absolute value; and two, we should punish increasingly poor predictions, which
10:09
squaring also does. So what is the formula for our best fit line? Can you figure it out? Take a look at this statsmodels
10:16
output from Python and see if you can.
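To make those error metrics concrete, here's a minimal numpy sketch; the arrays are made-up example values, not the actual data from the video.

import numpy as np

actual = np.array([10.0, 12.0, 15.0])      # hypothetical observed values
predicted = np.array([11.0, 11.5, 14.0])   # hypothetical model predictions

squared_error = np.sum((actual - predicted) ** 2)  # sum of squared residuals
mse = squared_error / len(actual)                  # mean squared error
rmse = np.sqrt(mse)                                # root mean squared error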
10:24
If you're feeling overwhelmed by this table, don't be. Anytime you feel overwhelmed, take a step back and try to
10:29
think about what you're looking to accomplish. In this case, we're just trying to come up with the best fit line, and we know that the equation for the
10:36
line is y = mx + b. So in this case y, or y-hat, because we're trying to
10:41
predict a value, is the S&P 500 index price; m, which is our slope, is the coefficient
10:48
for GDP; and then we add some bias, b. So the full equation would be y =
10:55
0.3688 × GDP - 4680.4,
11:01
and that's our best fit line. We can rearrange this equation to make it a little easier to read, and scalable
11:06
later on; all we need to do is switch the order of the bias, or intercept, and the slope term. So our new formula would be y-hat =
11:14
b0, or b-naught, plus b1 times x1, and again, you're going to see this form when we start to get into multiple
11:20
linear regression. So remember, the y-hat simply means it's a predicted or expected value; the x's,
11:26
because we can have a bunch of them, are the distinct independent variables; and the b's are the regression coefficients.
11:32
Each regression coefficient represents the change in y relative to a change in that respective
11:39
independent variable. So let's go back to our prior equation: for every one unit of GDP, we expect
11:45
the S&P 500 index to go up by roughly 0.3688 units.
11:51
Now, if GDP is zero, and if GDP is ever zero we've got
11:56
much larger problems, the S&P index would be negative 4,680.
12:03
Now, obviously this is something to consider, because that can't happen, and that's really important to keep in mind:
12:09
remember, we're modeling here, and no model is perfect. But anyway,
12:16
while this mini course is about understanding linear regression, perhaps your regression problem shouldn't be
12:22
modeled linearly, so let's discuss polynomial regression. Take a look at the following graph: do
12:28
you notice how the regression line doesn't fit the data well? We can model curved relationships using polynomial
12:34
regression, and like I said, we're only going to briefly touch on this, for about 30 seconds. I just want to give you an
12:39
understanding that not all relationships are linear; some are exponential. But if you are interested in learning more
12:46
about polynomial regression, let me know in the comments below and I'll create some content around that. In regards
12:51
to our case, we can fit a third-degree polynomial to get a curve that fits our data, as we see here.
12:58
Now, with a general understanding of the polynomial regression problem, and understanding that not all relationships
13:04
are linear, let's determine whether our prior simple linear regression model performs
13:09
well. Now, there are a lot of ways to assess
13:15
regression performance, and if you're interested in the finer minutiae of how to do that, feel free to go to my site at analyzingalpha.com,
13:22
but for now we're going to keep things simple: we're just going to discuss R-squared. R-squared is the coefficient
13:28
of determination: the proportion of the variation in the observed data explained by the predictive variables,
13:33
and that means the higher the R-squared, the better. So let's revisit our prior example: the R-squared was 0.941.
13:41
This means that 94% of the observed variation can be explained with our formula, and 6% cannot.
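As a quick aside, here's the idea behind R-squared as a minimal numpy sketch; the arrays are made-up values, just to show the computation.

import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])     # hypothetical observed values
predicted = np.array([1.1, 1.9, 3.2, 3.8])  # hypothetical model predictions

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot  # proportion of variation explained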
13:49
Now, an R-squared of 94% is pretty high, so the predictive variables predict the dependent variable from the historical data quite
13:55
well; in our relationship, we see the line fits the data very nicely. This does not necessarily mean our model has a ton
14:02
of predictive power; we'll cover how to assess that using cross-validation to get a better understanding, and we'll
14:08
talk about why we're overfitting in this instance. There are many other metrics in this output
14:14
summary from statsmodels; like I said, if you're interested in those, feel free to check out my blog. But for now, let's just
14:20
see how I created this simple linear regression model in Python. We'll change up the data to make it a little more
14:26
interesting. Now let's perform a regression analysis on the money supply and the S&P 500 price.
14:32
Now remember, we just tested out the Warren Buffett indicator, which assumes that the amount of output, or
14:37
production, GDP, affects the index, and it seems to, but we'll get more into whether that has a lot of
14:43
predictive power later on. But maybe let's take another approach: maybe it's really more about the money in the
14:49
system, and the Federal Reserve controls that. In fact, they control it in three ways. First, reserve ratios: how much of
14:55
their deposits banks can lend out. Second, the discount rate:
15:00
the rate banks can borrow at in the short term, to then lend out to their
15:06
borrowers. And then finally, federal open market operations. This is a big one:
15:11
this is buying and selling government securities, and it's what you can think of when you hear about QE, or
15:18
quantitative easing. And now you can see the table is very similar to what we used before, only we're replacing GDP with the
15:24
currency in circulation. Now let's create a linear regression model using statsmodels. Now keep in mind, I'm just going to
15:31
show you the snippets here; it's super easy, relatively speaking, but if you're really
15:36
interested in getting your hands dirty, I'm going to have the Jupyter notebook with the full project at the end. In order to
15:42
come up with a model with statsmodels, we simply need two lines of code. We create the model using
15:49
sm.OLS, for ordinary least squares; we pass in what we're trying to predict, which is the S&P 500 values, and then we add our
15:56
independent predictor variable, which in this case is the currency in circulation. Then we fit it, and statsmodels
16:02
does its thing: it figures out the ordinary least squares fit that
16:08
minimizes the error, and then we print out the summary.
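Here's a minimal sketch of what that looks like. I'm assuming a DataFrame df with 'SP500' and 'Currency' columns; the actual column names in the notebook may differ, and I'm adding add_constant so the intercept term shows up, which the video may handle differently.

import statsmodels.api as sm

y = df['SP500']                      # what we're trying to predict
X = sm.add_constant(df['Currency'])  # predictor plus an intercept column

results = sm.OLS(y, X).fit()         # ordinary least squares fit
print(results.summary())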
16:14
That summary has all of the stuff we've already seen. We can then determine what equation statsmodels is giving us for the best
16:20
fit line, what the R-squared is, and all of that other stuff; again, if you're interested, you can check that out on my
16:26
website. But basically, what we see here is that this model is actually not as good
16:32
during this time frame as the prior GDP model was. And while it performed
16:37
worse, what happens if we take GDP and currency in circulation and combine them? Well, that's when we start
16:43
talking about multiple linear regression. Now, multiple linear regression is just like simple linear regression, except it
16:49
has two or more features instead of just one, like I just spoke about. And
16:54
just like before, the statsmodels implementation is really simple: all we need to do now is pass a list of
17:00
features instead of a single variable. So remember, we have the model equals
17:05
sm.OLS, because we're using ordinary least squares, and we pass in what we're trying to predict, the S&P
17:11
500 price, and now we'll add both GDP and currency in circulation. We then
17:16
fit it and print out the model summary.
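Again as a sketch, with the same assumed df and column names as before:

import statsmodels.api as sm

y = df['SP500']
X = sm.add_constant(df[['GDP', 'Currency']])  # now a list of predictor columns

results = sm.OLS(y, X).fit()
print(results.summary())  # one coefficient per predictor, plus the intercept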
17:23
And now we can see that the results do indeed seem better: having two variables improved the regression model. Our predicted values should now be improved, because we have a
17:29
higher R-squared value; our model explains more of the variation. Notice also
17:36
that we have one more coefficient in the model's coefficient section. Now the
17:41
next question is, how do we visualize this? Simple linear regression is a simple
17:47
line to visualize, right? We can easily draw a best fit line right through the data. But when there are
17:53
multiple variables, how do we think about that? We can perform simple linear regressions and graph each one separately, like this, but in truth, all
18:01
we're really doing there is what we did previously, twice: we're only creating a best fit line for each one of
18:07
the predictor variables. In order to do more than two columns, or two dimensions, we need to put on our 3D glasses. Let's
18:14
create a multiple linear regression 3D graph, where the y values are the S&P 500 and the x and z values are
18:22
GDP and currency in circulation, respectively.
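Something like this is all it takes; a sketch, assuming the same df as above, and plotting only the scatter, not the fitted plane.

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # 3D axes

# one axis per variable: the two predictors plus the target
ax.scatter(df['GDP'], df['Currency'], df['SP500'])
ax.set_xlabel('GDP')
ax.set_ylabel('Currency in circulation')
ax.set_zlabel('S&P 500')
plt.show()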
18:27
At first, visualizing three dimensions feels strange, but over time it becomes more natural. Now, if you're interested in the code, I
18:33
have all of this on my Jupyter notebook and GitHub, which you can see, but for now I just want to give
18:39
you guys an understanding of what this looks like. Now, the good news is that the line moves straight up and to the
18:45
right, which is my favorite direction (that's a trading joke), and we can see that both GDP and currency in
18:50
circulation increase the S&P 500, or at least as the S&P 500 increases, so do
18:57
both GDP and currency in circulation. So what about this: why don't we add some random data to see how it affects our
19:04
model? We know it really shouldn't, right? So let's add a random one-dimensional array of values between one and a thousand to our
19:11
linear regression model. Basically, all we're doing here is testing: we've seen that
19:18
both of these variables increase our predictive power, right, and we want to make sure that the random one does not. And
19:24
the R-squared obviously didn't improve, but that should be obvious. But how do we know if a feature is
19:30
statistically significant? We know that adding random data isn't going to help, but there's got to be a better
19:36
way than just guessing, right? Well, there's more to it than this, but a good rule of thumb is that if the p-value is 0.05 or lower, the coefficient
19:45
and independent variable are said to be statistically significant. In other words, if the p-value is small and the increase
19:52
in R-squared is large, then it makes sense to add the input feature; otherwise, discard it.
19:58
In this case, we can see that our p-value is way above what we need it to
20:04
be, at 0.785, so we should remove it from this model, even if it did improve our
20:10
R-squared, though obviously in this case it didn't. Now, I'm sure you're eager to jump in and create some code, but before we do,
20:16
there's one other issue that we need to talk about: multicollinearity in regression. You may say, multi-what? Well, when you perform
20:24
linear regression, the independent variables should be, well, independent. We should understand that a regression
20:30
coefficient represents the change in the predicted response, in our case the S&P 500 index, for each one-unit change in
20:36
the independent variable, holding all other independent variables constant. So let's pretend that GDP changes by 1 and
20:44
the coefficient is 0.3: then 0.3 will be added to the S&P index for every one-unit
20:50
change in GDP. But the key is that they should be independent; there are problems when there's multicollinearity,
20:58
and there are multiple types of it, but in short, you can't trust the p-values to tell you that everything's statistically
21:05
significant if you have multicollinearity. So how do we know if the independent
21:10
variables, or features, are really independent? Well, we can detect multicollinearity
21:16
in our model with VIF, the variance inflation factor. The variance inflation factor, or VIF, detects
21:23
multicollinearity in regression analysis. A VIF of 1 indicates two variables are not correlated, a VIF
21:31
greater than one and less than five indicates moderate correlation, and a VIF of five or above indicates high
21:37
correlation. And guess what: Python and statsmodels to the rescue again. We can easily calculate the VIF for each
21:43
feature using just a few lines of code.
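The code goes by quickly on screen, so here's a minimal sketch of the usual statsmodels approach; the df and column names are the same assumptions as before.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[['GDP', 'Currency']])  # predictors plus an intercept column
vif = pd.DataFrame({
    'feature': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)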
21:51
And the challenge is, we now see that GDP and currency in circulation are highly correlated. So how do we handle this? Well,
21:58
whenever we have such a small amount of data, in this case roughly 20 data points, you can add data to see
22:04
if it changes, but most of the time you're going to have to drop the feature whose removal most improves the model; in this
22:11
case, currency in circulation. Now it's time for a pop quiz: how many values have we predicted so far?
22:18
If you said none, you're absolutely correct, and now it's time for cross-validation. If you think about it, the goal of
22:25
most regression models is to predict the future, and if you think about it further,
22:31
what we've actually done is use all of the data so far to train our model, and then we're turning right back
22:37
around and asking how well it will predict that very same historical data. That's the definition of overfitting.
22:44
If we used our linear regression model with next quarter's GDP to
22:50
predict the future S&P 500 price, then we'd finally be making a prediction. But how do we make predictions when
22:56
we've only got historical data? Well, what we need to do is break up the data into a training and
23:03
test set, or better yet, training and test sets. What we'll do is use different slices of history, the
23:09
training sets, to make predictions about different periods in history, which will be the testing sets. This will help us
23:15
determine whether currency in circulation or GDP is better for predicting equity prices. As we saw, GDP
23:22
bested it, or was the winner, in our previous example, but what happens if GDP
23:28
was historically the best performer, but recently currency in circulation was actually the better
23:35
performer? So it's plain to see that this type of train and test setup is more robust and often comes up with a better
23:42
regression model, leading to more accurate predicted responses, and this is actually a common practice in scientific
23:48
computing and machine learning. The only concern with machine learning models is that they're prone to overfitting, and
23:54
we'll discuss that later. And now, with all of the core concepts behind you, it's finally time to create some code. We're
24:00
going to jump into the Jupyter notebook, and we're going to walk through an entire machine learning pipeline, essentially,
24:06
for linear regression using sklearn. I hope you're excited; I am. I'll see you in a second.
24:15
So, as you can see, I've already created an outline of what we're going to accomplish today. This outline is a pretty standard process that you're going to
24:23
follow whenever doing one of these projects. So first we'll grab our imports, and I'll explain each import as we go. We'll
24:28
import numpy as np, because we'll need numpy arrays, and import pandas as pd
24:34
as well; pandas is pandas, need I say more. Then import
24:40
matplotlib, if I can type it: matplotlib.pyplot as plt.
24:45
You're probably already familiar with those. And now what we want to do is import the sklearn stuff, so
24:51
we'll do from sklearn.metrics import
24:56
r2_score; this will give us the ability to get an R-squared from our model. Then from
25:04
sklearn.model_selection import train_test_split; this will allow us to
25:11
separate our data into train and test splits. Then we'll do preprocessing; we need
25:17
a min-max scaler, as you'll see in a second: from sklearn.preprocessing
25:23
import MinMaxScaler. Basically we're going to import this and use it in
25:30
the next section, because we want to scale our data. And then we'll do from sklearn.
25:40
feature_selection import RFE, or recursive feature elimination; this allows us to
25:47
select only the features that make the most sense, or improve the model most effectively.
25:53
Then from sklearn.linear_model import LinearRegression;
25:59
we know what that is, it creates a linear regression model. Then two more: from sklearn.
26:06
model_selection import cross_val_score, for our cross-validation
26:12
scores, and from sklearn.model_selection import KFold. I'm quick to make typos,
26:20
and of course I did: sklearn, not sk-learn. Okay,
26:26
and apparently I'm tired: model_selection.
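In case the live typing was hard to follow, here's the whole import block in one place, as I understand it from the video:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression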
26:33
There we go. It's always fun doing this stuff live; maybe I should just copy and paste next
26:39
time. But anyway, if that's the case and you don't care whether I'm typing it in while chatting with you, let me know and I'll
26:44
just copy and paste next time. So let's see if I can do this one without any issues. We want to select our columns; we already know what our columns
26:51
are: gdp, currency in circulation, rand (that random variable), and sp500.
26:59
Almost did it again, okay. All right, and then let's check
27:05
df[columns] and see if that gives us the data we want in the right order. Perfect, it
27:12
does. Okay, so now that we have our data, what we want to do
27:17
is normalize it. Okay, so we want to scale and normalize because machine learning
27:23
algorithms work better when all of the data is on a similar scale. If you
27:29
don't scale and normalize the data, essentially different features will have
27:34
more or less of an impact than other features, and we don't necessarily want that. So I think it makes intuitive sense
27:41
why all of the data should share a similar scale, so that's what we'll do now. We'll type scaler = MinMax
27:50
Scaler() to create that scaler object, then create the scaled data:
27:55
scaled = scaler.fit_transform(df),
28:00
and we already selected our columns, so we don't need to do that again. And then what we'll do is create
28:06
df2, because what we're going to do is create a second DataFrame; I never like overwriting the
28:11
original DataFrame: pd.DataFrame(scaled).
28:17
And because that destroys the column names, I'll show you what I mean instead of just describing it:
28:23
type in df2, and you can see all of the data now falls between 0 and 1, which is
28:28
exactly what we want. And then what we want to do
28:34
is fix the column names, so we'll do df2.columns = columns,
28:41
and then df2. Okay, now we see that we've got the column headings, and all of the data is scaled between 0 and 1.
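Pulling that together as a sketch (columns is the list we defined above; here I pass the column names straight to the DataFrame constructor instead of setting them afterwards):

scaler = MinMaxScaler()            # scales each column to the [0, 1] range
scaled = scaler.fit_transform(df)  # learn min/max from df, then transform it

# a second DataFrame, so we don't overwrite the original df
df2 = pd.DataFrame(scaled, columns=columns)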
28:47
Remember how I said we're going to talk a little bit more about overfitting? Well, guess what, we just did
28:53
it. So think about what we just did: we scaled this data right here, which is all
28:59
of the data, based on all of the data, right? But that isn't necessarily what we want to do,
29:04
because if we're scaling between the minimum and the maximum, maybe some of these maximums
29:11
are coming out here at the end, and we should only have this much data. Let me clarify that one more time. Think about this: let's
29:18
pretend we want to predict the future, okay, and we only have data up to point 10.
29:24
These later data points are a lot higher, but we just scaled everything using the entire distribution, so that's a problem.
29:31
So whenever we're calling fit_transform, we should only ever fit on the training
29:37
data, and merely transform the test data. Hopefully that makes sense, and I'll explain it again as
29:42
I do it. So let's go ahead and split the data into train and test sets. So
29:47
we'll say, let's see here, train_size = 0.7, or 70 percent, and then df_train,
29:55
df_test = train_test_split;
30:02
we just pass the DataFrame, and we say what the train size is,
30:07
which we already selected with train_size, and then the test size; you could just
30:14
punch these in, but I always like to put them in as variables: round(1 - train_size,
30:21
2), and then shuffle=False, because I don't want to shuffle these values.
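As one block, a sketch of that split; note we're splitting df, the original unscaled DataFrame, which matters in a second:

train_size = 0.7  # 70% of the rows for training
df_train, df_test = train_test_split(
    df,                                   # the original, unscaled DataFrame
    train_size=train_size,
    test_size=round(1 - train_size, 2),
    shuffle=False,                        # keep the rows in chronological order
)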
30:27
And I think that looks good. Now we're going to go ahead and do what we previously did; we're going to
30:32
repeat that scaling. Remember, this df is the non-normalized one; df2 was.
30:37
So we'll say scaler = MinMaxScaler().
30:42
Okay, now df_train[columns] = scaler
30:49
.fit_transform(df_train[columns]). So remember, what we're now doing
30:56
is normalizing over only that 70 percent. Okay, and then df_test[
31:02
columns] = scaler.transform(
31:08
df_test[columns]). This is super important: notice the fit doesn't exist here;
31:15
we're only fitting on the training set, we're never fitting on the test set. Fitting is the thing that
31:21
calculates the statistics, the min and the max for this scaler, and all of that other stuff, and then
31:26
adjusts the values based on that, right? So we obviously want to use only the statistics from
31:32
our training set, and they should not be affected by our test set. Hopefully that makes sense.
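Sketched out, the pattern is:

scaler = MinMaxScaler()
# fit_transform on the training slice: learn min/max from training data only
df_train[columns] = scaler.fit_transform(df_train[columns])
# transform (no fit!) on the test slice, reusing the training min/max
df_test[columns] = scaler.transform(df_test[columns])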
31:38
Now what we'll do is go ahead and separate the
31:43
training data: we're going to pop what we're trying to predict off into a separate Series. So we'll do df_train
31:50
.pop('sp500'), right, so this will put the S&P 500, or
31:55
what we're trying to predict, into y_train, and then X_train = df_train, or the remaining columns.
32:02
And this is just convention: y (or y-hat) is what you're predicting, and X is your array of independent variables, typically
32:10
capitalized. We do the same thing with our test set: y_test = df_test
32:16
.pop('sp500'), so now only our S&P 500 prices are in there, and
32:23
then X_test will be our df_test. Okay, and then let's just print this out so you can get an understanding: X_train
32:31
.head(), and we'll also print out, let's see, y_test,
32:37
to give you an understanding. So you've got X_train, or the independent
32:42
variables, right here, and then, since I printed the
32:48
test set, we've got the y values right here. But they don't
32:53
align, because obviously the training dates and the test dates are two separate sets of dates.
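Written out as a sketch ('sp500' is whatever the target column is actually named in the notebook):

y_train = df_train.pop('sp500')  # pop removes the target column and returns it
X_train = df_train               # what's left are the independent variables
y_test = df_test.pop('sp500')
X_test = df_test

print(X_train.head())
print(y_test.head())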
32:59
So hopefully that makes sense. Now, remember we have that issue with multicollinearity,
33:04
so let's go ahead and fix that. But instead of manually reviewing our features, what
33:10
happens if we had tons of features and we weren't sure which ones to eliminate? We can use machine learning to the
33:17
rescue: we can use recursive feature elimination, and it's pretty simple to do. We just furnish a hyper-
33:23
parameter for the number of features we want; a hyperparameter is just a setting we choose for the model, rather than something the model learns,
33:30
and then that will essentially give us exactly what we need. So
33:35
let's go ahead and type this out, and it will make more sense as we do. From sklearn.
33:40
feature_selection import RFE; that's our recursive feature elimination.
33:47
We'll do lm = LinearRegression(). This has been a long video, so if you're still with me, you're awesome.
33:53
Okay, rfe = RFE: we take our linear model, which is just the linear regression
34:00
model we created in the step above, we do n_features_to_select=1,
34:08
and then rfe = rfe.fit(X_train, y_train). So let's think about what we're
34:15
doing: we are just fitting this RFE object, and RFE is just like a linear regression,
34:21
only it knows to select just one feature. We're doing that because we already know
34:28
rand is terrible; it's just random, it's not going to impact our values. But we
34:33
also know that we can only have one feature, because based on our VIF above, there's
34:40
too much multicollinearity, so we need to get rid of one, right? We don't want to
34:45
mess up our model and assume, like we talked about with those p-
34:50
values, that our model works when we know it shouldn't, and we need to get rid of a feature because
34:57
we've got too much multicollinearity. All right, so what do we do now? We'll go ahead and print X_train.columns,
35:05
print rfe.support_, and print
35:10
rfe.ranking_; typing these out will make it easier to understand what we're doing here. So
35:16
we can see that we've got the train columns: gdp,
35:21
currency in circulation, and rand, but it only selects
35:27
the one, which is gdp; the next two are False and False, and the ranking shows the
35:32
priority order, if that makes sense. Which is pretty awesome, because we could have hundreds of features
35:39
and we could use this to figure out which features work best, and we don't have to
35:45
create a linear regression and an R-squared and all of that other stuff for each and every single one of them
35:50
and compare them by hand; we can just use RFE. So, pretty awesome there.
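Here's that RFE step as one sketch, with the column names assumed as before:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=1)  # keep only the single best feature
rfe = rfe.fit(X_train, y_train)

print(X_train.columns)  # e.g. gdp, currency in circulation, rand
print(rfe.support_)     # True for the selected feature, False for the rest
print(rfe.ranking_)     # 1 = selected; higher numbers were eliminated earlier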
35:56
Now let's understand a little bit more about what we did above: let's create another linear regression model. So we'll
36:03
do lm = LinearRegression(), and really walk through, step by step,
36:09
what we're doing here. So we already know that the gdp
36:16
column is the only feature we want, because we don't want the multicollinearity, and the random value
36:22
is random and didn't positively impact our prediction. So what we do is create this linear
36:29
model object, and we then pass it the training data, which would be
36:34
X_train's gdp column: lm.fit(X_train['gdp']
36:40
.values.reshape(-1, 1)). All the reshape is doing here is
36:48
giving sklearn what it wants: the linear model expects two-dimensional data, samples by features, and that's why we're reshaping;
36:55
reshape(-1, 1) just means one column. Okay, and then what we need to do is train it
37:01
on y_train, which goes in as the second argument. Okay, so we take that, we fit on this data, and this will, again like I
37:08
said, calculate everything it needs to configure and
37:15
fit the model, to figure out what the best model is using OLS, or
37:21
least squares, like we talked about above. And now what we do is take this fitted model and predict with it.
37:27
So we'll do y_pred = lm.predict, but think about it: this time we don't need
37:34
the y test data; all we need to do is take our test data and make predictions
37:41
from it. So this will be X_test['gdp'],
37:47
and we can do the same thing: .values.reshape(-1, 1).
37:52
Again, we're obviously not going to put y_test in here, because that's what we need to test
37:57
against, to see whether or not our predictions are good. We'll hit enter here and make sure there are no typos,
38:03
and now we'll check the R-squared. I'll also mark this up with some comments when I upload it, so
38:09
that if you're not working alongside me right now, you can read and review it.
38:14
But anyway: r2_score, you pass in y_test,
38:20
which is the actual values, and then pass in the predicted values,
38:26
and then print(r2). Let's see what we get.
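Putting those steps together as a sketch ('gdp' again being the assumed column name):

lm = LinearRegression()

# reshape(-1, 1): sklearn expects a 2D array of shape (n_samples, n_features)
lm.fit(X_train['gdp'].values.reshape(-1, 1), y_train)

# predict from the test features; y_test is held back for scoring
y_pred = lm.predict(X_test['gdp'].values.reshape(-1, 1))

r2 = r2_score(y_test, y_pred)  # actual values first, then predictions
print(r2)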
38:34
We get a 0.4 R-squared, so obviously a massive difference compared to what we saw above, which was like 92 or 94 percent. So
38:41
that's why I said before that we were overfitting, and I see this rampant in various places: "oh, this is
38:48
an amazing model, look, it fits 98 percent, only two
38:54
percent of the variation is unexplained." Well, no: you're predicting using the same data that you trained on. Hopefully
39:00
you know and understand that by now, as I've said it a few times even, and now I'm rambling. So anyway,
39:07
like I said before, RFE recursively selects the best features, we talked about that, and now it's time to use k-folds,
39:15
right? So right now we did one test and train split, but as I said before, even
39:23
better is when you have multiple test and train splits, so this is where we
39:28
use KFold. We'll do kf = KFold,
39:34
right, with n_splits=4, so this will split the data into four different splits, and then
39:40
we'll do: for train_index, test_
39:46
index in kf.split(X_train):
40:06
I don't know if you can tell, but at the beginning of this video I was pretty amped up, and I'm still excited about this video, but
40:12
at this point I'm pretty tired, because it's getting late and this is taking longer than I expected. But
40:19
anyway, that's okay; hopefully you guys will enjoy the hard work. Okay, so what did I do here?
40:25
KFold, there it is: KFold, not KFolds. There we go. All right,
40:31
so now you can see it separates the train and test sets, right? You can see
40:36
the test set indices running from 0 through 13, so those are
40:43
essentially sequentially separated, but you can now see that the first fold trains
40:51
on four through thirteen, while for the next fold we actually remove four,
40:58
five, six, and seven from the training set and train on data from earlier and after, and we just do that throughout,
41:04
right? So the next one covers up through seven, then you have eight, nine, ten as a test set, and then 11, 12, 13; in the
41:11
last one, you're just training on all of the prior data, and the last three
41:17
points are the test set.
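That loop, sketched out:

kf = KFold(n_splits=4)  # four sequential folds (shuffle defaults to False)

# print the row indices each fold trains and tests on
for train_index, test_index in kf.split(X_train):
    print('TRAIN:', train_index, 'TEST:', test_index)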
41:23
So now let's see: what we'll do is go ahead and get scores for each one of these folds, essentially R-squareds. So,
41:29
since we imported cross_val_score previously, that's pretty easy: we just put in the linear model, X_
41:36
train, y_train, scoring='r2', and then
41:42
our cross-validation folds, and then scores =
41:48
cross_val_score; oops, there we are.
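In one piece, a sketch; the transcript is ambiguous about the cv argument, but since we see one score per fold, I'm passing the KFold object from above:

scores = cross_val_score(lm, X_train, y_train, scoring='r2', cv=kf)
print(scores)  # one R-squared per fold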
41:55
Okay, now we can see that there are four different R-squareds, right? And that makes sense, because we
42:01
created four different folds right here; we calculated an R-squared
42:07
for each one. And what's interesting is that our prior model obviously overfit: now that
42:12
we understand that you can't train and test on the same data, it's not as great as we once thought it was, right? We can see
42:19
that in the second slice here, roughly 95 percent of the variation is not explained by our
42:25
model, so that's pretty crazy. So would I essentially put money
42:32
on this model? Absolutely not. And what am I trying to teach here? Two things: one,
42:37
obviously, how to do linear regression properly in a
42:42
machine learning environment, with multiple variables, normalization, and all that stuff; but also to be really, really
42:47
careful not to overfit. And I'm going to go back over the problem areas: one, you need to think about
42:55
the data that you're bringing in, right, and make sure that the train and test sets aren't
43:01
somehow overlapping; and two, I'm going to say this one more time, because I've seen it
43:07
in many places, even on Kaggle, right: you do not want to fit_transform, or ever fit,
43:15
I should say, you never want to fit on your test data. You do want to transform
43:21
it; transforming just puts it into that zero-to-one range, while fitting calculates the scaling statistics
43:27
and adjusts the values based on them, and
43:33
you don't want to do that on your test data. So again, those are the big overfitting questions: what are
43:39
you really testing on, are you actually fitting to past data, and are you accidentally fitting when you shouldn't
43:44
be? So that's it, and if you made it to the end, you're awesome, because I know I
43:50
barely did. And I've got a favor to ask you, and it's actually not to like and subscribe; that'll come next. This
43:56
video, now that I'm at the end of my first mini course, took a lot more effort than I realized, probably
44:03
three to five times more. So if seeing me, and some of the effects and all that other stuff, and being able to see
44:10
the overview first, instead of just going right to the code and walking through that without seeing me,
44:16
was valuable to you, let me know, because if it's not, maybe in the next one I just share my screen and go
44:22
that way. But I tried to make this a little bit higher-quality production,
44:29
something that might be able to help you learn even more and keep you interested, because sometimes this
44:35
material is maybe not the most exciting. For me, I like this stuff, but even more
44:42
I like the results of what I can get with machine learning; it's more of a means to an end than just loving
44:48
the stuff. But anyway, I'm starting to digress. If you liked this mini course, please subscribe and hit the thumbs up;
44:54
it lets the Google algorithm know this is a video worth sharing. And even more importantly, if there's another mini
45:00
course that you're interested in, maybe a Python course, pandas, Rust, whatever, leave
45:05
a suggestion or request in the comments below, and if there's enough
45:10
demand for it, I'll go ahead and create it. Okay, I hope to see you in the next video. Thanks for staying so long, I know
45:15
it was a long one. Have a great day. Bye!