Introduction
Has machine learning ever felt like two somewhat random words put together? Everybody's talking about it, so it must be trendy. Let's learn how you can create your first machine learning model in Python (no, not the snake). If you're new to coding, don't worry. You already have everything you need on your computer!
Dependencies
You might be surprised: all you need to start coding this model is a Google account. Here we go!
Google Colab
How are we going to build this? The answer is Google Colab. Colab is a Python notebook environment that allows you to take advantage of Google's powerful computers for free! Let's start to set it up by adding the application in Google Drive.
Open up Drive and select the New button.
In the dropdown menu, select Connect more apps (if you don't already have Google Colab enabled), search for Colab, and add it to your Drive.
Finally, create a new Colab notebook by clicking the New button again and finding Colab under More in the dropdown menu.
Getting Started
Welcome to your Colab environment! You are now using a virtual machine (i.e. a computer) running inside of Google's servers! To start, we're going to import one of Google's sample datasets to explore. We need to move into the folder that holds the datasets and then use a library called Pandas to open one. Type the following lines of code and press Shift + Enter to run them.
# change your directory to where the data is stored
%cd sample_data
import pandas as pd # importing the library to your code
We've opened the sample_data folder on your Google computer and imported the Pandas library. cd stands for change directory, which is what we did by asking it to switch to the sample_data folder. The % marks what's called a magic command: it tells the notebook that this line is a special notebook command (here, one that works like the terminal's cd) rather than ordinary Python code. Pandas is not loaded automatically in Python, so it needs to be brought in with import, and as lets us refer to the library by the shorter name pd.
Next, let's open our dataset. It's a csv file that contains data on blocks of houses all over California. Run these lines of code to take a peek at the data!
data = pd.read_csv("california_housing_train.csv") # load in the file
data.head() # .head() shows the first 5 rows of the DataFrame
After you run this line, you should see an output like this!
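Beyond .head(), a few other Pandas calls give a quick feel for a dataset. Here is a minimal sketch using a tiny made-up DataFrame (the name sample and all of its values are hypothetical stand-ins, so it runs without the csv file). In your notebook you could call the same things on data.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for california_housing_train.csv
sample = pd.DataFrame({
    "median_income": [1.5, 3.2, 8.3, 4.1],
    "housing_median_age": [15.0, 19.0, 41.0, 52.0],
    "median_house_value": [66900.0, 80100.0, 452600.0, 341300.0],
})

n_rows, n_cols = sample.shape        # (number of rows, number of columns)
column_names = list(sample.columns)  # names you can select with sample["..."]
summary = sample.describe()          # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(column_names)
print(summary)
```

The describe() table is a handy first check for odd minimum or maximum values before you start plotting.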
Plotting
Now that we have our Pandas DataFrame loaded in, let's get a better understanding of our data by plotting some of it. We're going to use the matplotlib library to do this, so let's bring it in with import, just as we did for Pandas.
import matplotlib.pyplot as plt
Done! Now, say we want a histogram of the median house value across these blocks of houses in California. We can draw one and specify how many sections, or bins, it should have like this. If you want to learn the syntax to make more matplotlib charts like this one, you can dive into the documentation.
plt.hist(data["median_house_value"], bins=100)
We selected the column of data we wanted by telling matplotlib to use the "median_house_value" column in our DataFrame called data. We also set the number of bins to 100. Play around with this number and see if you can discover anything interesting about the skew of the data. Here's the graph with 100 bins.
I see something really interesting here. It looks like every block with a median value above roughly $500,000 was capped at that value, so they all pile up in a single bin. That produces a tall spike at the far right of the histogram. This might make it hard to get an accurate model without doing more cleaning, but for now let's move on.
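If you'd like to test that hunch first, one way is to count how many rows sit exactly at the largest value. This is only a sketch on made-up numbers: the Series values and the 500001.0 entries are hypothetical stand-ins. In your notebook you would use data["median_house_value"] instead.

```python
import pandas as pd

# Made-up values; 500001.0 plays the role of the suspected cap
values = pd.Series(
    [120000.0, 350000.0, 500001.0, 500001.0, 275000.0, 500001.0],
    name="median_house_value",
)

top_value = values.max()
at_cap = int((values == top_value).sum())  # rows sitting exactly at the maximum
share_at_cap = at_cap / len(values)        # fraction of all rows at the maximum

print(top_value, at_cap, share_at_cap)

# One possible cleaning step: keep only the rows below the cap
uncapped = values[values < top_value]
print(len(uncapped))
```

If a large share of rows lands on one exact value, that is a strong hint the data was capped rather than measured there.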
Machine Learning
Are you ready? We're going to use linear regression to see if we can predict the median house value given all of the other features in the data. To do this, we need yet another Python library, scikit-learn, which is imported under the name sklearn. Let's install the library using Python's package manager pip (Colab usually has it already, so this is mostly a safety net) and import the model we need.
!pip install scikit-learn # ! says to run in terminal
from sklearn.linear_model import LinearRegression # import the linear regression model
Great! Before we can train, or fit, our machine learning model, we need to separate the data in two ways: into training and testing data, and into input (X) and output (y). At its core, we're finding a line of best fit (remember y=mx+b from algebra?), but in more than the 2 dimensions you might be used to. Don't worry though, sklearn does this complicated math for us. To start, let's segment our data. The following code breaks up our data into training and testing portions.
X_test = pd.read_csv("california_housing_test.csv") # load in the separate testing data
y_test = X_test["median_house_value"] # set our output equal to the median house value column
X_test = X_test.drop(["median_house_value"], axis = 1) # remove that column from the input. axis = 1 means to remove the column
y_train = data["median_house_value"] # same thing for training data
X_train = data.drop(["median_house_value"], axis = 1)
We removed the output (which will be our predicted house value) from the data to give our model a chance to figure out what it should be. Now, the cool part. Let's fit our model and test it!
lm = LinearRegression().fit(X_train, y_train) # .fit() fits the data to the model
y_pred = lm.predict(X_test) # predict house values for the testing data
print("R-Squared value:",lm.score(X_test,y_test))
If you want to read more about the syntax we used here to fit and test our model, here is where you can find sklearn's documentation on the linear regressor.
Can you believe it? That's all we have to do to implement our model. When I did this, I got an R² value of ~0.6, but what does that mean? In short, it tells us how much of the variation in house values our model's best-fit line explains, where a score of 1.0 means the model perfectly explains the data. Pretty cool, huh? If you wanted to take this a step further, you could check whether dropping certain columns of your input gives you a better R².
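To make the R² score less mysterious, here is a small sketch that computes it by hand with the usual definition: 1 minus (squared prediction errors divided by squared distances from the average). The function name r_squared and the toy numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R-squared: 1 - (sum of squared errors / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's errors
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # spread around the average
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])   # every prediction is exact
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always guess the average
close = r_squared(y_true, [1.5, 2.0, 3.0, 3.5])     # small errors on two points

print(perfect, baseline, close)
```

Always guessing the average scores 0, and exact predictions score 1, so a value around 0.6 sits somewhere in between: better than guessing the average, but with plenty left unexplained.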
The last thing we will do is write the equation for our line. Don't worry about the code itself here. It's mainly formatting, but it's pretty cool to see the equation our model came up with. Run this code and take a look at what you get! What do you think the coefficients might represent?
# pair each coefficient with the name of the column it multiplies
terms = ['%.2f*%s' % (coef, name) for coef, name in zip(lm.coef_, X_train.columns)]
print("Formula is:\n y = ", ' + '.join(terms), ' + ', '%.2f' % lm.intercept_)
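To see how a formula like this turns into a prediction, here is a sketch with made-up numbers. The coefficients, the intercept, and the example block below are all hypothetical stand-ins for lm.coef_, lm.intercept_, and a row of your data, and only two features are used to keep it short.

```python
# Hypothetical coefficients and intercept (not the values your model will print)
coefs = {"median_income": 40000.0, "housing_median_age": 1000.0}
intercept = -20000.0

# One made-up block of houses
block = {"median_income": 3.0, "housing_median_age": 20.0}

# y = m1*x1 + m2*x2 + b: multiply each feature by its coefficient, then add the intercept
predicted_value = sum(coefs[name] * block[name] for name in coefs) + intercept

print(predicted_value)
```

This is the same arithmetic lm.predict() performs for every row, just written out for one.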
Conclusion
Congratulations! You've just explored a dataset and built a simple regression model for this data. Machine learning actually has some meaning to you now, and you can begin to explore some more interesting projects. Maybe your next project can be to code the model on your own rather than use an imported model?
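If you want a head start on that, here is one possible sketch of a line of best fit with a single input, written in plain Python. It uses the standard least-squares formulas for slope and intercept. The function name fit_line and the toy data are made up for illustration, and a real model with many columns needs more math than this.

```python
def fit_line(xs, ys):
    """Least-squares slope (m) and intercept (b) for y = m*x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # slope = how x and y vary together, divided by how much x varies
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data that lies exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```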