Purpose of this notebook

The main purpose of this notebook is to learn how to get started in R and apply some basic commands like vectors, data importation, loops, if statements, and of course linear regression.

While this notebook is beginner friendly, it does require some basic understanding of how the OLS regression algorithm works.

Theory

Gross fixed capital formation (GFCF) includes spending on land improvements (fences, ditches, drains, and so on); plant, machinery, and equipment purchases; the construction of roads, railways, private residential dwellings, and commercial and industrial buildings.

Foreign direct investment (FDI) is an investment from a party in one country into a business or corporation in another country with the intention of establishing a lasting interest.

Data Importation

The Data was taken from the website “Perspective Monde”. Instead of downloading it as a CSV file and then importing it and extract vectors, they provide it in a ready format for R and Python as vectors and Data frames for convenience, which is what we did here:

Date_FDI=c(1970:2019)


# Vectors ( This writing is ignored while running the command because it's considered a comment )

FDI=c( 20000000, 23100000, 13000000, 5490000, -20400000, 5020000, 38014962.74, 7994056.273,
      11759988.24, 7437548.637, 89416222.59, 58581335.99, 79528177.1, 46123623.5, 46989196.56, 
      19975166.86, 549182.4961, 59574900.78, 84661627.57, 167056032.1, 165122977.8, 317462140.6,
      422470462.5, 491466064.6, 550924373.9, 334768272.9, 357393801.8, 1079341332, 308712164.4,
      826974026.9, 426553283.9, 2824557252, 480355698, 2312829823, 893325392.8, 1670609689, 
      2460787164, 2825801376, 2466288357, 1970323920, 1240625859, 2521362081, 2841954371, 
      3360909924, 3525384612, 3252913902, 2153363905, 2680109856, 3544387229, 1599761098)

Date_GFCF=c(1960:2018)

GFCF=c( 199877917.2, 229133346.5, 250410018.8, 306056694, 296850449.6, 313012548.2, 336528030.8, 
       415769192.8, 433356367.9,479596877.8, 590455488.6, 647326732.7, 690532081.4, 845163018.3, 
       1128655774, 2229981493, 2755187473, 3530966180, 3295653635, 3815239414, 5675430575, 
       5550459177, 5462138469, 4509216318, 3747639748, 3779465368, 4569944203, 
       4839690401, 5786813414, 7065573384, 7781959744, 8054243608, 8362423940, 8088129003, 
       8392956449, 9350603852, 9414815900, 8909512890, 10128138832, 10930128728, 10480934331, 
       10202074980, 11120357844, 13498857067,16273601240, 17759604422, 20021964801, 
       25416061424, 31838380450, 29413188368, 28576723851, 31926847056, 32032590051, 
       32894652311, 32860711609, 28703644911, 31025847566, 31424989682, 33556322647)

length=c(length(Date_FDI),length(FDI),length(Date_GFCF),length(GFCF))
print(length)
## [1] 50 50 59 59

If you want to import the data as an excel or csv file, use the following commands :

# For csv files
data = read.csv2("filename.csv")

# For Excel files, we have to install the 'readxl' package and then read it.
install.packages("readxl")
library(readxl)
data = read_excel("filename.xlsx", sheet="name")

# Extract vectors from csv file
vector = data$column_name

Data Cleaning

Note that the FDI has 50 observations starting from 1970 to 2019. The GFCF has 59 observations starting from 1960 to 2018. We will remove the first 10 observation from the GFCF as well as the last one from the FDI the last one so that the two vectors become equal in length:

GFCF_2 = GFCF[-(1:10)]
FDI_2 = FDI[-length(FDI)]
length_2=c(length(GFCF_2),length(FDI_2))
print(length_2)
## [1] 49 49

Now that the vectors are equal, we can bind them in a dataframe and visualize its head as follows:

data=data.frame(Year=c(1970:2018),GFCF_2,FDI_2)
print(head(data,2))
##   Year    GFCF_2    FDI_2
## 1 1970 590455489 20000000
## 2 1971 647326733 23100000
print(tail(data,2))
##    Year      GFCF_2      FDI_2
## 48 2017 31424989682 2680109856
## 49 2018 33556322647 3544387229

Data visualisation & Descriptive statistics

# Histograms

hist(GFCF_2,col='cornflowerblue')
hist(FDI_2,col='cornflowerblue')

# Line charts using the 'ggplot' library
library(ggplot2)
ggplot(data, aes(x=c(1970:2018), y=GFCF_2)) + geom_line(color="blue") + theme_bw()
ggplot(data, aes(x=c(1970:2018), y=FDI_2)) + geom_line(color="blue") + theme_bw()

The cat function allows you to print text next to actual commands, which is what we did to print descriptive statistics.

cat(cat('GFCF:'),cat(' min=',min(GFCF_2)),
     cat(' median=',median(GFCF_2)),
     cat(' max=',max(GFCF_2)),
     cat(' mean=',mean(GFCF_2)),
     cat(' standard_deviation=',sd(GFCF_2)))
## GFCF: min= 590455489 median= 8392956449 max= 33556322647 mean= 12835832651 standard_deviation= 11187944083
cat(cat('FDI:'),cat(' min=',min(FDI_2)),
     cat(' median=',median(FDI_2)),
     cat(' max=',max(FDI_2)),
     cat(' mean=',mean(FDI_2)),
     cat(' standard_deviation=',sd(FDI_2)))
## FDI: min= -20400000 median= 357393802 max= 3544387229 mean= 1001447986 standard_deviation= 1207213984
# Correlation
cat('Correlation between GFCF & FDI:', cor(FDI_2,GFCF_2))
## Correlation between GFCF & FDI: 0.9049102

Stationarity

The reason why I chose to work on Time series and not Cross sectional Data is because there is a strong chance that you will use R for Time series analysis if you’re reading this article.

There are tens of R libraries that are especially made for time series analysis. For now we will use the tseries library.

GFCF_3 = diff(GFCF_2,4)
tseries::adf.test(GFCF_3)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
##  Augmented Dickey-Fuller Test
## 
## data:  GFCF_3
## Dickey-Fuller = -3.6347, Lag order = 3, p-value = 0.04113
## alternative hypothesis: stationary
FDI_3 = diff(FDI_2,1)
tseries::adf.test(FDI_3)
## Warning in tseries::adf.test(FDI_3): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  FDI_3
## Dickey-Fuller = -7.1171, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary

The p-value is less than 5% for both the GFCF and the FDI after 4 and 1 differentiations (integreations) respectively.

Since the FDI vector has 3 more observations than the GFCF, we need to remove the first three values so they become equal in length while maintaining the same Date index.

FDI_3 = FDI_3[-(1:3)]
data_new = data.frame("Year"=c(1974:2018),"GFCF"=GFCF_3,"FDI"=FDI_3)
ggplot(data_new,aes(x=Year, y=GFCF)) + geom_line(color="blue") + theme_bw()
ggplot(data_new,aes(x=Year, y=FDI)) + geom_line(color="blue") + theme_bw()

Using the ggplot package requires binding the vectors in dataframes ( data_new ). You can see through the line charts that the series have become stationary. We can proceed to the OLS method.

Parameters Estimation and Prediction

We will now create a function that returns the value of these two functions:

  • Intercept : \(f(x,y)=\bar{Y}-\frac{Cov(X,Y)}{V(X)}\bar{X}\)
  • Slope : \(f(x,y)=\frac{Cov(X,Y)}{V(X)}\)

First we call the function function(X,Y){}, and then stock variables inside it using the X and Y variables you put as arguments. The return command is specific to the function command, it returns the values of the stocked variables.

GFCF_par=function(X,Y){ 
    Intercept=mean(Y)-(cov(X,Y)/var(X))*mean(X);Slope=(cov(X,Y)/var(X));
    return(cat(cat('Intercept:',Intercept),cat('   Slope:',Slope)))}
GFCF_par(FDI_3,GFCF_3)
## Intercept: 2706828303   Slope: 0.03674983

The below code is to create the Prediction vector \(\hat{Y}\) based on the calculated OLS parameters. Note that we will be referring to the intercept as \(\alpha\) and the slope as \(\beta\).

  • \(\hat{Y}=\hat{\alpha} + \hat{\beta}{X}\)
GFCF_hat=function(X,Y){ 
    Yhat=mean(Y)-(cov(X,Y)/var(X))*mean(X)+(cov(X,Y)/var(X))*X;
    return(Yhat)}

# show first 4 values
print(head(GFCF_hat(FDI_3,GFCF_3),4))
## [1] 2705876850 2707762483 2708040862 2705725040

The Sum of Squares

ESS

  • Estimated Sum of Squares = \(\sum{(\bar{Y}-{\hat{Y}_i})^2}\)

Loops are by far my favorite feature in programming in general, I use them all the time because they facilitate calculations and are so efficient.

There are three ways to create loops in R, that is for, while and repeat loops. We will break down the code below to explain how they work :

  • for loop:
    First we need to initialize a variable to stock the sums, let’s call it h. Inside the for loop, we set a list of indexes from 1 until the length of the Y vector, we call these indexes using the letter i in the loop. The h variable starts at 0, the first value it will take is \(g[1]+0\), the second value is \(g[2]+g[1]\), the third value is \(g[3]+g[2]\), and so on until we reach the last index.

  • while loop:
    The while loop is slightly different. You set a condition for the index i, in this case we calculate the sums as long as the index is less than the length of Y while(i<=length(Y)), and we set \(i=i+1\) to change the index after each iteration.

  • repeat loop:
    The repeat loop is kind of a combination of both the for and while loops. we repeat the sum operation like in the for loop, while maintaining the index up to iteration as in the while loop. What is specific for the repeat loop is that we have to set a condition where the loop would break, which is if the value of the index become higher than the length of the vector, stop the iterations if(i>length(Y)){break}}

# Using for loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                  h=0;
                  for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('ESS=',h))}
ESS(FDI_3,GFCF_3)
## ESS= 3.324419e+16
# using while loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                  h=0;i=1;
                  while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('   ESS=',h))}
ESS(FDI_3,GFCF_3)
##    ESS= 3.324419e+16
# using repeat loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('   ESS=',h))}
ESS(FDI_3,GFCF_3)
##    ESS= 3.324419e+16

RSS

  • Residual Sum of Squares = \(\sum{({Y_i}-{\hat{Y}_i})^2}\)
# Using for loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                   h=0;
                   for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('RSS=',h))}
RSS(FDI_3,GFCF_3)
## RSS= 6.297181e+20
# using while loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                    h=0;i=1;
                    while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('    RSS=',h))}
RSS(FDI_3,GFCF_3)
##     RSS= 6.297181e+20
# using repeat loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('    RSS=',h))}
RSS(FDI_3,GFCF_3)
##     RSS= 6.297181e+20

TSS

  • Total Sum of Squares = \(\sum{({Y_i}-{\bar{Y}})^2}\)
# using for loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                  h=0;
                  for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('TSS=',h))}
TSS(FDI_3,GFCF_3)
## TSS= 6.297513e+20
# using while loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                  h=0;i=1;
                  while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('    TSS=',h))}
TSS(FDI_3,GFCF_3)
##     TSS= 6.297513e+20
# using repeat loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('    TSS=',h))}
TSS(FDI_3,GFCF_3)
##     TSS= 6.297513e+20

Coefficient of determination

  • \(R^2=\frac{ESS}{TSS}\)

In these loops, we have stocked the value of the ESS in the \(g\) variable, and the value of the TSS in the k variable. We fixed \(h=0\) and \(z=0\) to stock the value of the iterations just like we did above, and finally we used the return function to return the fraction \(h/z\).

# using for loop
R_squared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                       k=(Y-mean(Y))^2;
                       h=0;z=0;
                       for ( i in 1:length(Y)){h=g[i]+h;z=k[i]+z};
                       return(cat('R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)
## R_squared= 5.27894e-05
# using while loop
Rsquared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                         k=(Y-mean(Y))^2;
                         h=0;z=0;i=1;
                         while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                       return(cat('      R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)
## R_squared= 5.27894e-05
# using repeat loop
Rsquared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                         k=(Y-mean(Y))^2;
                         h=0;z=0;i=1;
                         repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                       return(cat('      R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)
## R_squared= 5.27894e-05

Variances of the model’s parameters

Error variance: \(\sigma^2=\frac{RSS}{n-k-1}\) with \(n=\) numbers of observations and \(k=\) number of independent variables.

In these loops we simply used the code of the RSS, and divided its value by the degree of freedom in the return command.

# using for loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;
                       for ( i in 1:length(Y)){h=g[i]+h}
                       return(cat('Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)
## Variance_of_Error= 1.464461e+19
# using while loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;i=1;
                       while(i<=length(Y)){h=g[i]+h;i=i+1};
                       return(cat('    Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)
##     Variance_of_Error= 1.464461e+19
# using repeat loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;i=1;
                       repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                       return(cat('    Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)
##     Variance_of_Error= 1.464461e+19

Intercept Variance: \(\hat{\sigma}_{\hat{\alpha}}^2 = \sigma_{\epsilon}^2[\frac{1}{n}+\frac{\bar{X}^2}{\sum{(X_i-\bar{X})}^2}]\)

We used the code for the RSS for the variance of error, then we simply returned the formula in the return command.

# using for loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;
                           for ( i in 1:length(Y) ){h=g[i]+h;z=k[i]+z};
                           return(cat('Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                   +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)
## Variance_of_Intercept=: 3.291151e+17
# using while loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;i=1;
                           while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                           return(cat('  Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                     +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)
##   Variance_of_Intercept=: 3.291151e+17
# using repeat loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;i=1;
                           repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                           return(cat('  Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                     +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)
##   Variance_of_Intercept=: 3.291151e+17

Slope variance: \(\hat{\sigma}_{\hat{\beta}}^2 = \frac{\sigma_{\epsilon}^2}{\sum{(X_i-\bar{X})}^2}\)

Same method above. we used the return command to return the value of the function.

# using for loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;
                       for( i in 1:length(Y)){h=g[i]+h;z=k[i]+z};
                       return(cat('Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)
## Variance_of_Slope= 0.5949392
# using while loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;i=1;
                       while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                       return(cat('     Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)
##      Variance_of_Slope= 0.5949392
# using repeat loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;i=1;
                       repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                       return(cat('     Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)
##      Variance_of_Slope= 0.5949392

Hypothesis testing

Student test: \(t_{\hat{\rho}}= \mid{\frac{\hat{\rho}}{\hat{\sigma}_{\hat{\rho}}}}\mid\)

This code looks complicated but it is far from that. Let’s break it down:

  • First We stocked the values of the intercept and the slope.
  • The \(g\) variable is used to stock \((Y_i-\hat{Y_i})^2\)
  • The \(k\) variable is used to stock \((X_i-\bar{X})^2\)
  • \(h\) and \(z\) are initialized as 0. At the end of the iterations they will respectively equal \(\sum{(Y_i-\hat{Y_i})^2}\) and \(\sum{(X_i-\bar{X})^2}\).
  • SdIntercept is to calculate the standard deviation of the intercept.
  • SdSlope is to calculate the standard deviation of the slope.
  • tIntercept is the value of the student test for the intercept : \({\frac{\hat{\alpha}}{\hat{\sigma}_{\hat{\alpha}}}}\)
  • tSlope is the value of the student test for the slope : \({\frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}}\)
  • Finally we will use if statements to set the conditions for the test. For example if(abs(tIntercept)>qt(1-(alpha/2),length(Y)-2)) means that if the absolute value of the intercept’s t-test is higher than the Student theoretical value calculated using the qt function, then The intercept is statistically significant.
t_test=function(X,Y,alpha){Intercept=mean(Y)-(cov(X,Y)/var(X))*mean(X);
                           Slope=cov(X,Y)/var(X);
                           g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;
                           for ( i in 1:length(Y) ){h=g[i]+h;z=k[i]+z};
                           SdIntercept=sqrt((h/(length(Y)-2))*((1/length(Y))+(mean(X)^2/z)));
                           SdSlope=sqrt(h/(length(Y)-2)/z);
                           tIntercept=Intercept/SdIntercept;
                           tSlope=Slope/SdSlope;
                           if(abs(tIntercept)>qt(1-(alpha/2),length(Y)-2))
                           {print('The intercept is statistically significant')};
                           if(abs(tIntercept)<=qt(1-(alpha/2),length(Y)-2))
                           {print('The intercept is statistically *non* significant')};
                           if(abs(tSlope)>qt(1-(alpha/2),length(Y)-2))
                           {print('The slope is statistically significant')};
                           if(abs(tSlope)<=qt(1-(alpha/2),length(Y)-2))
                           {print('The slope is statistically *non* significant')}}
t_test(FDI_3,GFCF_3,0.05)
## [1] "The intercept is statistically significant"
## [1] "The slope is statistically *non* significant"

Note that we added an alpha as an argument to the function, which is the desired level of risk-tolerance.

# risk-tolerance level = 1%
t_test(FDI_3,GFCF_3,0.01)
## [1] "The intercept is statistically significant"
## [1] "The slope is statistically *non* significant"

Fisher test: \(F = \frac{R^2}{\frac{1-R^2}{n-k-1}}\)

For the Fisher test we calculated the test value using the \(R^2\), which is also the coeffcient of correlation squared ( another easy way to calculate it ).

To get the theoretical values for the Fisher test, we use the qf function.

F_test=function(X,Y,alpha){r=cov(X,Y)/(sd(X)*sd(Y));
                           Rsquared=r^2;
                           F=Rsquared/((1-Rsquared)/(length(Y)-2));
                           if(F>qf(1-alpha,1,length(Y)-2))
                           {print('The model is globally significative')};
                           if(F<=qf(1-alpha,1,length(Y)-2))
                           {print('The model is globally *non* significative')}}
F_test(FDI_3,GFCF_3,0.05)
## [1] "The model is globally *non* significative"
# risk-tolerance level = 1%
F_test(FDI_3,GFCF_3,0.01)
## [1] "The model is globally *non* significative"

ANOVA

Bonus: ANOVA table. To create an ANOVA from scratch like we did with all the above applications, we simply just organize the functions in a dataframe as follows :

i=1:length(GFCF_3)

ANOVA=data.frame(Source=c("x1,x2,...,xn","Residual","Total"),
                 Sum_Sq=c(sum((GFCF_hat(FDI_3,GFCF_3)[i]-mean(GFCF_3))^2),
                          sum((GFCF_3[i]-GFCF_hat(FDI_3,GFCF_3)[i])^2),
                          sum((GFCF_3[i]-mean(GFCF_3))^2)),
                 df=c(2,length(GFCF_3)-3,length(GFCF_3)-1),
                 Mean_Sq=c((sum((GFCF_hat(FDI_3,GFCF_3)[i]-mean(GFCF_3))^2))/2,
                          (sum((GFCF_3[i]-GFCF_hat(FDI_3,GFCF_3)[i])^2))/(length(GFCF_3)-3),
                           (sum((GFCF_3[i]-mean(GFCF_3))^2))/(length(GFCF_3)-1)))

print(ANOVA)
##         Source       Sum_Sq df      Mean_Sq
## 1 x1,x2,...,xn 3.324419e+16  2 1.662210e+16
## 2     Residual 6.297181e+20 42 1.499329e+19
## 3        Total 6.297513e+20 44 1.431253e+19

Why is Stationarity so important !

Stationarity means that the statistical properties of a time series do not change over time. If you keep the trend or the seasonal components in the series, your model will be wrong because it will model the actual components instead of the raw values, which will generate false high accuracy and biased insights.

Below is the built-in command to perform linear regression in R. Note that in the first regression we used non-stationary data and the model seemed to perform well and has high accuracy.

In the second regression we used stationary vectors and the model turned out to be insignificant. This doesn’t mean that the theory is wrong as we only used a simple model with two variables. Advanced time series models with multiple variables can accurately explain such theories and generate interesting insights.

model = lm(GFCF_2 ~ FDI_2)
summary(model)
## 
## Call:
## lm(formula = GFCF_2 ~ FDI_2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.792e+10 -2.001e+09 -3.150e+08  1.980e+09  1.374e+10 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.437e+09  8.970e+08   4.947 1.01e-05 ***
## FDI_2       8.386e+00  5.753e-01  14.576  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.812e+09 on 47 degrees of freedom
## Multiple R-squared:  0.8189, Adjusted R-squared:  0.815 
## F-statistic: 212.5 on 1 and 47 DF,  p-value: < 2.2e-16
model = lm(GFCF_3 ~ FDI_3)
summary(model)
## 
## Call:
## lm(formula = GFCF_3 ~ FDI_3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.920e+09 -2.012e+09 -9.433e+08  5.053e+08  1.287e+10 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.707e+09  5.737e+08   4.718 2.53e-05 ***
## FDI_3       3.675e-02  7.713e-01   0.048    0.962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.827e+09 on 43 degrees of freedom
## Multiple R-squared:  5.279e-05,  Adjusted R-squared:  -0.0232 
## F-statistic: 0.00227 on 1 and 43 DF,  p-value: 0.9622
---
title: "Learn R basics with Simple Linear Regression and Time series"
author: "Ibraheem El Ansari"
output: 
  html_document:
    toc_float: TRUE
    theme: default
    toc: TRUE
    code_download: TRUE
    highlight: tango
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{css echo=FALSE}
.columns {display : flex;}
h2 {color: darkblue; font-family: cursive }
p {font-family: Comic Sans MS}
.r {border-radius: 1em; font-family: Lucida Console}
.author {color: black; font-family: cursive; font-size: 16px; float:right;}
.title {color: #00264d; font-family: cursive; text-align: center }
```

## Purpose of this notebook

The main purpose of this notebook is to learn how to get started in R and apply some basic commands like vectors, data importation, loops, if statements, and of course linear regression.

While this notebook is beginner friendly, it does require some basic understanding of how the OLS regression algorithm works.

## Theory

**Gross fixed capital formation (GFCF)** includes spending on land improvements (fences, ditches, drains, and so on); plant, machinery, and equipment purchases; the construction of roads, railways, private residential dwellings, and commercial and industrial buildings.

**Foreign direct investment (FDI)** is an investment from a party in one country into a business or corporation in another country with the intention of establishing a lasting interest.

## Data Importation

The Data was taken from the website **"Perspective Monde"**. Instead of downloading it as a CSV file and then importing it and extract vectors, they provide it in a ready format for R and Python as vectors and Data frames for convenience, which is what we did here:

```{r}
Date_FDI=c(1970:2019)


# Vectors ( This writing is ignored while running the command because it's considered a comment )

FDI=c( 20000000, 23100000, 13000000, 5490000, -20400000, 5020000, 38014962.74, 7994056.273,
      11759988.24, 7437548.637, 89416222.59, 58581335.99, 79528177.1, 46123623.5, 46989196.56, 
      19975166.86, 549182.4961, 59574900.78, 84661627.57, 167056032.1, 165122977.8, 317462140.6,
      422470462.5, 491466064.6, 550924373.9, 334768272.9, 357393801.8, 1079341332, 308712164.4,
      826974026.9, 426553283.9, 2824557252, 480355698, 2312829823, 893325392.8, 1670609689, 
      2460787164, 2825801376, 2466288357, 1970323920, 1240625859, 2521362081, 2841954371, 
      3360909924, 3525384612, 3252913902, 2153363905, 2680109856, 3544387229, 1599761098)

Date_GFCF=c(1960:2018)

GFCF=c( 199877917.2, 229133346.5, 250410018.8, 306056694, 296850449.6, 313012548.2, 336528030.8, 
       415769192.8, 433356367.9,479596877.8, 590455488.6, 647326732.7, 690532081.4, 845163018.3, 
       1128655774, 2229981493, 2755187473, 3530966180, 3295653635, 3815239414, 5675430575, 
       5550459177, 5462138469, 4509216318, 3747639748, 3779465368, 4569944203, 
       4839690401, 5786813414, 7065573384, 7781959744, 8054243608, 8362423940, 8088129003, 
       8392956449, 9350603852, 9414815900, 8909512890, 10128138832, 10930128728, 10480934331, 
       10202074980, 11120357844, 13498857067,16273601240, 17759604422, 20021964801, 
       25416061424, 31838380450, 29413188368, 28576723851, 31926847056, 32032590051, 
       32894652311, 32860711609, 28703644911, 31025847566, 31424989682, 33556322647)

length=c(length(Date_FDI),length(FDI),length(Date_GFCF),length(GFCF))
print(length)
```

If you want to import the data as an excel or csv file, use the following commands :

```{r eval=FALSE}
# For csv files
data = read.csv2("filename.csv")

# For Excel files, we have to install the 'readxl' package and then read it.
install.packages("readxl")
library(readxl)
data = read_excel("filename.xlsx", sheet="name")

# Extract vectors from csv file
vector = data$column_name
```

## Data Cleaning

Note that the FDI has 50 observations starting from 1970 to 2019. The GFCF has 59 observations starting from 1960 to 2018. We will remove the first 10 observation from the GFCF as well as the last one from the FDI the last one so that the two vectors become equal in length:

```{r}
GFCF_2 = GFCF[-(1:10)]
FDI_2 = FDI[-length(FDI)]
length_2=c(length(GFCF_2),length(FDI_2))
print(length_2)
```

Now that the vectors are equal, we can bind them in a dataframe and visualize its head as follows:

```{r}
data=data.frame(Year=c(1970:2018),GFCF_2,FDI_2)
print(head(data,2))
print(tail(data,2))
```

## Data visualisation & Descriptive statistics

```{r fig.show="hold", out.width="50%"}
# Histograms

hist(GFCF_2,col='cornflowerblue')
hist(FDI_2,col='cornflowerblue')
```

```{r fig.show="hold", out.width="50%"}
# Line charts using the 'ggplot' library
library(ggplot2)
ggplot(data, aes(x=c(1970:2018), y=GFCF_2)) + geom_line(color="blue") + theme_bw()
ggplot(data, aes(x=c(1970:2018), y=FDI_2)) + geom_line(color="blue") + theme_bw()
```

The `cat` function allows you to print text next to actual commands, which is what we did to print descriptive statistics. 

```{r}
cat(cat('GFCF:'),cat(' min=',min(GFCF_2)),
     cat(' median=',median(GFCF_2)),
     cat(' max=',max(GFCF_2)),
     cat(' mean=',mean(GFCF_2)),
     cat(' standard_deviation=',sd(GFCF_2)))

cat(cat('FDI:'),cat(' min=',min(FDI_2)),
     cat(' median=',median(FDI_2)),
     cat(' max=',max(FDI_2)),
     cat(' mean=',mean(FDI_2)),
     cat(' standard_deviation=',sd(FDI_2)))
```

```{r}
# Correlation
cat('Correlation between GFCF & FDI:', cor(FDI_2,GFCF_2))
```

## Stationarity

The reason why I chose to work on Time series and not Cross sectional Data is because there is a strong chance that you will use R for Time series analysis if you're reading this article.

There are tens of R libraries that are especially made for time series analysis. For now we will use the `tseries` library.


```{r}
GFCF_3 = diff(GFCF_2,4)
tseries::adf.test(GFCF_3)
```

```{r}
FDI_3 = diff(FDI_2,1)
tseries::adf.test(FDI_3)
```

The *p-value* is less than 5% for both the GFCF and the FDI after 4 and 1 differentiations (integreations) respectively. 

Since the FDI vector has 3 more observations than the GFCF, we need to remove the first three values so they become equal in length while maintaining the same Date index.

```{r}
FDI_3 = FDI_3[-(1:3)]
data_new = data.frame("Year"=c(1974:2018),"GFCF"=GFCF_3,"FDI"=FDI_3)
```


```{r fig.show="hold", out.width="50%"}
ggplot(data_new,aes(x=Year, y=GFCF)) + geom_line(color="blue") + theme_bw()
ggplot(data_new,aes(x=Year, y=FDI)) + geom_line(color="blue") + theme_bw()
```

Using the `ggplot` package requires binding the vectors in dataframes ( data_new ). You can see through the line charts that the series have become stationary. We can proceed to the OLS method.

## Parameters Estimation and Prediction

We will now create a function that returns the value of these two functions:

- Intercept : $f(x,y)=\bar{Y}-\frac{Cov(X,Y)}{V(X)}\bar{X}$
- Slope : $f(x,y)=\frac{Cov(X,Y)}{V(X)}$ 

First we call the function `function(X,Y){}`, and then stock variables inside it using the X and Y variables you put as arguments. The `return` command is specific to the `function` command, it returns the values of the stocked variables.

```{r}
GFCF_par=function(X,Y){ 
    Intercept=mean(Y)-(cov(X,Y)/var(X))*mean(X);Slope=(cov(X,Y)/var(X));
    return(cat(cat('Intercept:',Intercept),cat('   Slope:',Slope)))}
GFCF_par(FDI_3,GFCF_3)
```

The below code is to create the Prediction vector $\hat{Y}$ based on the calculated OLS parameters. Note that we will be referring to the intercept as $\alpha$ and the slope as $\beta$.

- $\hat{Y}=\hat{\alpha} + \hat{\beta}{X}$

```{r}
GFCF_hat=function(X,Y){ 
    Yhat=mean(Y)-(cov(X,Y)/var(X))*mean(X)+(cov(X,Y)/var(X))*X;
    return(Yhat)}

# show first 4 values
print(head(GFCF_hat(FDI_3,GFCF_3),4))
```

## The Sum of Squares

### ESS

- Estimated Sum of Squares = $\sum{(\bar{Y}-{\hat{Y}_i})^2}$

Loops are by far my favorite feature in programming in general, I use them all the time because they facilitate calculations and are so efficient.

There are three ways to create loops in R, that is `for`, `while` and `repeat` loops. We will break down the code below to explain how they work :

- `for` loop: <br> First we need to initialize a variable to stock the sums, let's call it **h**. Inside the **for** loop, we set a list of indexes from 1 until the length of the Y vector, we call these indexes using the letter **i** in the loop.  The **h** variable starts at 0, the first value it will take is $g[1]+0$, the second value is $g[2]+g[1]$, the third value is $g[3]+g[2]$, and so on until we reach the last index.


- `while` loop: <br> The while loop is slightly different. You set a condition for the index **i**, in this case we calculate the sums as long as the index is less than the length of Y `while(i<=length(Y))`, and we set $i=i+1$ to change the index after each iteration.

- `repeat` loop: <br> The repeat loop is kind of a combination of both the for and while loops. we repeat the sum operation like in the for loop, while maintaining the index up to iteration as in the while loop. What is specific for the repeat loop is that we have to set a condition where the loop would `break`, which is if the value of the index become higher than the length of the vector, stop the iterations `if(i>length(Y)){break}}`


```{r}
# Using for loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                  h=0;
                  for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('ESS=',h))}
ESS(FDI_3,GFCF_3)

# using while loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                  h=0;i=1;
                  while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('   ESS=',h))}
ESS(FDI_3,GFCF_3)

# using repeat loop
ESS=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('   ESS=',h))}
ESS(FDI_3,GFCF_3)
```


### RSS

- Residual Sum of Squares = $\sum{({Y_i}-{\hat{Y}_i})^2}$

```{r}
# Using for loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                   h=0;
                   for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('RSS=',h))}
RSS(FDI_3,GFCF_3)

# using while loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                    h=0;i=1;
                    while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('    RSS=',h))}
RSS(FDI_3,GFCF_3)

# using repeat loop
RSS=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('    RSS=',h))}
RSS(FDI_3,GFCF_3)
```


### TSS

- Total Sum of Squares = $\sum{({Y_i}-{\bar{Y}})^2}$

```{r}
# using for loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                  h=0;
                  for ( i in 1:length(Y)){h=g[i]+h};
                  return(cat('TSS=',h))}
TSS(FDI_3,GFCF_3)

# using while loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                  h=0;i=1;
                  while(i<=length(Y)){h=g[i]+h;i=i+1};
                  return(cat('    TSS=',h))}
TSS(FDI_3,GFCF_3)

# using repeat loop
TSS=function(X,Y){g=(Y-mean(Y))^2;
                    h=0;i=1;
                    repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                  return(cat('    TSS=',h))}
TSS(FDI_3,GFCF_3)
```


## Coefficient of determination

- $R^2=\frac{ESS}{TSS}$

In these loops, we have stocked the value of the *ESS* in the $g$ variable, and the value of the *TSS* in the k variable.
We fixed $h=0$ and $z=0$ to stock the value of the iterations just like we did above, and finally we used the `return` function to return the fraction $h/z$.

```{r}
# using for loop
R_squared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                       k=(Y-mean(Y))^2;
                       h=0;z=0;
                       for ( i in 1:length(Y)){h=g[i]+h;z=k[i]+z};
                       return(cat('R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)

# using while loop
Rsquared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                         k=(Y-mean(Y))^2;
                         h=0;z=0;i=1;
                         while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                       return(cat('      R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)

# using repeat loop
Rsquared=function(X,Y){g=((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)-mean(Y))^2;
                         k=(Y-mean(Y))^2;
                         h=0;z=0;i=1;
                         repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                       return(cat('      R_squared=',h/z))}
R_squared(FDI_3,GFCF_3)
```


## Variances of the model's parameters

**Error variance: ** $\sigma^2=\frac{RSS}{n-k-1}$ with $n=$ numbers of observations and $k=$ number of independent variables.

In these loops we simply used the code of the RSS, and divided its value by the degree of freedom in the `return` command.

```{r}
# using for loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;
                       for ( i in 1:length(Y)){h=g[i]+h}
                       return(cat('Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)

# using while loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;i=1;
                       while(i<=length(Y)){h=g[i]+h;i=i+1};
                       return(cat('    Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)

# using repeat loop
VarError=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       h=0;i=1;
                       repeat{h=g[i]+h;i=i+1;if(i>length(Y)){break}};
                       return(cat('    Variance_of_Error=',(h/(length(Y)-2))))}
VarError(FDI_3,GFCF_3)
```


**Intercept Variance: ** $\hat{\sigma}_{\hat{\alpha}}^2 = \sigma_{\epsilon}^2[\frac{1}{n}+\frac{\bar{X}^2}{\sum{(X_i-\bar{X})}^2}]$

We used the code for the RSS for the variance of error, then we simply returned the formula in the `return` command.

```{r}
# using for loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;
                           for ( i in 1:length(Y) ){h=g[i]+h;z=k[i]+z};
                           return(cat('Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                   +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)

# using while loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;i=1;
                           while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                           return(cat('  Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                     +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)

# using repeat loop
VarIntercept=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;i=1;
                           repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                           return(cat('  Variance_of_Intercept=:',(h/(length(Y)-2))*((1/length(Y))
                                                                                     +(mean(X)^2/z))))}
VarIntercept(FDI_3,GFCF_3)
```


**Slope variance: ** $\hat{\sigma}_{\hat{\beta}}^2 = \frac{\sigma_{\epsilon}^2}{\sum{(X_i-\bar{X})}^2}$

Same method above. we used the `return` command to return the value of the function.

```{r}
# using for loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;
                       for( i in 1:length(Y)){h=g[i]+h;z=k[i]+z};
                       return(cat('Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)

# using while loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;i=1;
                       while(i<=length(Y)){h=g[i]+h;z=k[i]+z;i=i+1};
                       return(cat('     Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)

# using repeat loop
VarSlope=function(X,Y){g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                       k=(X-mean(X))^2;
                       h=0;z=0;i=1;
                       repeat{h=g[i]+h;z=k[i]+z;i=i+1;if(i>length(Y)){break}};
                       return(cat('     Variance_of_Slope=',(h/(length(Y)-2)/z)))}
VarSlope(FDI_3,GFCF_3)
```



## Hypothesis testing

**Student test:** $t_{\hat{\rho}}= \mid{\frac{\hat{\rho}}{\hat{\sigma}_{\hat{\rho}}}}\mid$

This code looks complicated but it is far from that. Let's break it down:

- First We stocked the values of the intercept and the slope.
- The $g$ variable is used to stock $(Y_i-\hat{Y_i})^2$
- The $k$ variable is used to stock $(X_i-\bar{X})^2$
- $h$ and $z$ are initialized as 0. At the end of the iterations they will respectively equal  $\sum{(Y_i-\hat{Y_i})^2}$ and $\sum{(X_i-\bar{X})^2}$.
- **SdIntercept** is to calculate the standard deviation of the intercept.
- **SdSlope** is to calculate the standard deviation of the slope.
- **tIntercept** is the value of the student test for the intercept : ${\frac{\hat{\alpha}}{\hat{\sigma}_{\hat{\alpha}}}}$
- **tSlope** is the value of the student test for the slope : ${\frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}}$
- Finally we will use `if statements` to set the conditions for the test. For example `if(abs(tIntercept)>qt(1-(alpha/2),length(Y)-2))` means that if the absolute value of the intercept's t-test is higher than the Student theoretical value calculated using the `qt` function, then The intercept is statistically significant.

```{r}
t_test=function(X,Y,alpha){Intercept=mean(Y)-(cov(X,Y)/var(X))*mean(X);
                           Slope=cov(X,Y)/var(X);
                           g=(Y-((mean(Y)-(cov(X,Y)/var(X))*mean(X))+((cov(X,Y)/var(X))*X)))^2;
                           k=(X-mean(X))^2;
                           h=0;z=0;
                           for ( i in 1:length(Y) ){h=g[i]+h;z=k[i]+z};
                           SdIntercept=sqrt((h/(length(Y)-2))*((1/length(Y))+(mean(X)^2/z)));
                           SdSlope=sqrt(h/(length(Y)-2)/z);
                           tIntercept=Intercept/SdIntercept;
                           tSlope=Slope/SdSlope;
                           if(abs(tIntercept)>qt(1-(alpha/2),length(Y)-2))
                           {print('The intercept is statistically significant')};
                           if(abs(tIntercept)<=qt(1-(alpha/2),length(Y)-2))
                           {print('The intercept is statistically *non* significant')};
                           if(abs(tSlope)>qt(1-(alpha/2),length(Y)-2))
                           {print('The slope is statistically significant')};
                           if(abs(tSlope)<=qt(1-(alpha/2),length(Y)-2))
                           {print('The slope is statistically *non* significant')}}
t_test(FDI_3,GFCF_3,0.05)
```

Note that we added an *alpha* as an argument to the function, which is the desired level of risk-tolerance.

```{r}
# risk-tolerance level = 1%
t_test(FDI_3,GFCF_3,0.01)
```

**Fisher test:** $F = \frac{R^2}{\frac{1-R^2}{n-k-1}}$

For the Fisher test we calculated the test value using the $R^2$, which is also the coeffcient of correlation squared ( another easy way to calculate it ).

To get the theoretical values for the Fisher test, we use the `qf` function.

```{r}
F_test=function(X,Y,alpha){r=cov(X,Y)/(sd(X)*sd(Y));
                           Rsquared=r^2;
                           F=Rsquared/((1-Rsquared)/(length(Y)-2));
                           if(F>qf(1-alpha,1,length(Y)-2))
                           {print('The model is globally significative')};
                           if(F<=qf(1-alpha,1,length(Y)-2))
                           {print('The model is globally *non* significative')}}
F_test(FDI_3,GFCF_3,0.05)
```

```{r}
# risk-tolerance level = 1%
F_test(FDI_3,GFCF_3,0.01)
```

## ANOVA

**Bonus:** ANOVA table. To create an ANOVA from scratch like we did with all the above applications, we simply just organize the functions in a dataframe as follows :

```{r}
i=1:length(GFCF_3)

ANOVA=data.frame(Source=c("x1,x2,...,xn","Residual","Total"),
                 Sum_Sq=c(sum((GFCF_hat(FDI_3,GFCF_3)[i]-mean(GFCF_3))^2),
                          sum((GFCF_3[i]-GFCF_hat(FDI_3,GFCF_3)[i])^2),
                          sum((GFCF_3[i]-mean(GFCF_3))^2)),
                 df=c(2,length(GFCF_3)-3,length(GFCF_3)-1),
                 Mean_Sq=c((sum((GFCF_hat(FDI_3,GFCF_3)[i]-mean(GFCF_3))^2))/2,
                          (sum((GFCF_3[i]-GFCF_hat(FDI_3,GFCF_3)[i])^2))/(length(GFCF_3)-3),
                           (sum((GFCF_3[i]-mean(GFCF_3))^2))/(length(GFCF_3)-1)))

print(ANOVA)
```

## Why is Stationarity so important !

Stationarity means that the statistical properties of a time series do not change over time. If you keep the trend or the seasonal components in the series, your model will be wrong because it will model the actual components instead of the raw values, which will generate false high accuracy and biased insights.

Below is the built-in command to perform linear regression in R. Note that in the first regression we used non-stationary data and the model seemed to perform well and has high accuracy.

In the second regression we used stationary vectors and the model turned out to be insignificant. This doesn't mean that the theory is wrong as we only used a simple model with two variables. Advanced time series models with multiple variables can accurately explain such theories and generate interesting insights. 

```{r}
model = lm(GFCF_2 ~ FDI_2)
summary(model)
```

```{r}
model = lm(GFCF_3 ~ FDI_3)
summary(model)
```
