Supervised learning: Linear Regression

Linear Regression is one of the simplest and most fundamental algorithms in machine learning and statistics, primarily used for predicting a quantitative response. It is a parametric approach, meaning it assumes a linear relationship between the input variables (independent variables) and a single output variable (dependent variable). Here is a more detailed look at Linear Regression, including its types, how it works, and its assumptions.

Types of Linear Regression

Simple Linear Regression: This involves a single independent variable used to predict a dependent variable. It attempts to establish a linear relationship between the two variables by fitting a linear equation to observed data. The equation of a simple linear regression line is:

y = β₀ + β₁x + ε

where y is the dependent variable, x is the independent variable, β₀ is the intercept, β₁ is the slope, and ε is the error term.
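To make this concrete, below is a minimal NumPy sketch that estimates β₀ and β₁ with the closed-form least-squares formulas. The x and y arrays are made-up illustrative values, not data from this article.

```python
import numpy as np

# Hypothetical 1-D data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Closed-form least-squares estimates for y = β0 + β1*x + ε:
#   β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², β0 = ȳ - β1 * x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(f"intercept β0 ≈ {beta0:.3f}, slope β1 ≈ {beta1:.3f}")
```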

Multiple Linear Regression: This involves two or more independent variables used to predict a dependent variable by fitting a linear equation to the observed data. The equation for multiple linear regression is:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ + ε

where each π‘₯ represents a different independent variable, and each 𝛽 represents the coefficient (or slope) of that variable.

How Linear Regression Works

Linear Regression works by estimating the coefficients of the linear equation that best predict the value of the dependent variable from one or more independent variables. The process involves:

  • Fitting the model: This involves determining the line of best fit through the data points. In the case of simple linear regression this is a straight line; for multiple linear regression it is a hyperplane.
  • Minimizing the error: The most common fitting method is ordinary least squares, which minimizes the sum of the squares of the residuals (the differences between observed and predicted values). A sketch of both steps follows this list.
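Here is one way both steps might look in practice, using scikit-learn's LinearRegression (one common ordinary-least-squares implementation). The data are synthetic, generated only for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 40 samples, 2 features, a known linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=40)

# Fitting the model: LinearRegression solves the ordinary least-squares
# problem, choosing coefficients that minimize the sum of squared residuals.
model = LinearRegression().fit(X, y)

# Minimizing the error: the residuals are observed minus predicted values.
residuals = y - model.predict(X)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("sum of squared residuals:", np.sum(residuals ** 2))
```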

Key Assumptions

Linear Regression is based on several key assumptions; a few informal checks are sketched after this list:

  • Linearity: The relationship between the independent and dependent variables is linear.
  • Homoscedasticity: The variance of the residuals is the same for any value of the independent variables.
  • Independence: Observations are independent of each other.
  • No multicollinearity: In multiple linear regression, the independent variables are not too highly correlated.
  • Normality: For any fixed value of an independent variable, the dependent variable is normally distributed.
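These assumptions are usually probed informally before trusting a fitted model. The sketch below shows a few heuristic, NumPy-only checks on synthetic data: residual-versus-fitted correlations as a rough probe of linearity and homoscedasticity, and a predictor correlation matrix as a rough probe of multicollinearity. The data are illustrative assumptions, not a formal test suite.

```python
import numpy as np

# Synthetic data for demonstration; any (X, y) pair would work here.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Fit by least squares and compute fitted values and residuals.
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
fitted = X_design @ beta
residuals = y - fitted

# Linearity / homoscedasticity heuristic: residuals should show no trend or
# fanning pattern against fitted values (both correlations near zero here).
print("corr(fitted, residuals):  ", np.corrcoef(fitted, residuals)[0, 1])
print("corr(fitted, |residuals|):", np.corrcoef(fitted, np.abs(residuals))[0, 1])

# Multicollinearity heuristic: pairwise correlations between predictors
# should not be close to ±1.
print("predictor correlation matrix:\n", np.corrcoef(X, rowvar=False))
```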