To set the stage, what is linear algebra? Well, it’s basically a fancy way of saying “how to do math with vectors and matrices” (and the lines and planes they describe). But don’t let that fool you: this stuff can be incredibly useful for data science!
So why should you care about linear algebra? For starters, it’s essential for understanding machine learning algorithms. Without it, you might as well be trying to solve a Rubik’s cube blindfolded (and trust us, that’s not easy).
We recommend checking out our new book “Linear Algebra for Machine Learning” which includes step-by-step tutorials and Python source code files.
Now, time to work through an example. Say you have a dataset with two features (x1 and x2) and one target variable (y). You want to find the line that best fits this data, in other words, the line of “best fit.” (With two features it’s technically a best-fit plane, but everyone calls it the line of best fit.) This is where linear algebra comes in handy!
To do this, we’ll set the problem up as matrices and solve it with least squares. First, let’s create a dataset with some made-up numbers:
# Import the libraries we need
import numpy as np                    # matrix operations
from matplotlib import pyplot as plt  # data visualization

# Create a tiny dataset with made-up numbers
x1 = [2, 4, 6]               # values for feature x1
x2 = [3, 5, 7]               # values for feature x2
y = [8, 10, 12]              # values for the target variable y
data = list(zip(x1, x2, y))  # combine the three lists into (x1, x2, y) rows
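If you want to eyeball the data before fitting anything, a quick scatter plot (using the matplotlib import above) does the trick. This plotting step is just for illustration and isn’t part of the fit:

# Plot the target against each feature to see the linear trend
plt.scatter(x1, y, label="x1 vs y")
plt.scatter(x2, y, label="x2 vs y")
plt.xlabel("feature value")
plt.ylabel("y")
plt.legend()
plt.show()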
Next, we’ll pack our data into X and Y matrices:
# Create X and Y matrices
# X is a matrix containing the values of x1 and x2 from the data
# Y is a matrix containing the values of y from the data
X = np.array([[xi1, xi2] for xi1, xi2, _ in data])  # each row of X is one observation: [x1, x2]
Y = np.array([yi for _, _, yi in data])             # Y holds the matching target values
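If you print them out, you should see X as a 3-by-2 matrix and Y as a length-3 vector:

print(X)  # [[2 3]
          #  [4 5]
          #  [6 7]]
print(Y)  # [ 8 10 12]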
Now we can solve for the line of best fit with NumPy’s least-squares routine:
# Add a column of ones to X so the model can learn an intercept term
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem X_design @ coefficients ≈ Y
# lstsq returns (solution, residuals, rank, singular values); we only need the solution
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X_design, Y, rcond=None)

# Print the intercept followed by the weights for x1 and x2
print(coefficients)
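If you want to see the matrix multiplication spelled out, the same coefficients fall out of the Moore-Penrose pseudoinverse. This is a sketch for intuition, not what lstsq does line-for-line under the hood; in practice, stick with lstsq:

# The pseudoinverse gives the minimum-norm least-squares solution
coefficients_via_pinv = np.linalg.pinv(X_design) @ Y
print(coefficients_via_pinv)  # should match the lstsq result (up to rounding)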
And that’s it! We now have the line of best fit for our data. One quirk of these made-up numbers: x2 = x1 + 1 and y = x1 + 6, so the features are perfectly collinear and many coefficient vectors fit the data exactly; lstsq returns the minimum-norm one, which works out to roughly:
y ≈ 3.67 - 1.33*x1 + 2.33*x2
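As a quick sanity check, you can plug a new point into the fitted line. Remember the leading 1 that pairs with the intercept; the expected value in the comment assumes the toy dataset above, so if you change the data, yours will differ:

# Predict y for a new observation with x1 = 8 and x2 = 9
new_point = np.array([1, 8, 9])
prediction = new_point @ coefficients
print(prediction)  # roughly 14.0, which continues the pattern y = x1 + 6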
It may seem intimidating at first, but with a little practice and some helpful resources (like our book), anyone can learn this stuff. And trust us, it’s worth the effort if you want to be a data scientist!