Abstract: These exercises will familiarize you with the basics of doing regression and classification using scikit-learn. You will also get an introduction to some techniques for estimating generalization performance.
Background reading: "A Few Useful Things to Know about Machine Learning" by Pedro Domingos.
Before beginning, you should verify that you have scikit-learn and matplotlib installed properly. To do so, run the following code block and make sure that you don't get any import errors. Additionally, the version of the sklearn module should be at least 0.13.
import sklearn
import matplotlib
print sklearn.__version__
If you don't have the appropriate version of sklearn installed, try executing one of the following commands at the Linux command prompt:
sudo apt-get install python-sklearn
sudo pip install -U scikit-learn
Tom Mitchell defines what it means for a computer program to learn in the following way: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
This definition highlights a key difference between machine learning and classical statistical methods. That is, machine learning is chiefly concerned with improving future performance based on prior experience.
Another key difference from classical statistical methods is that machine learning focuses on the computational efficiency (in both time and space) of algorithms. For instance, an active area of machine learning research is designing algorithms efficient enough to work with “big data”.
The specific software package we will be using to do machine learning is called scikit-learn. Scikit-learn is a very powerful package that supports a vast array of machine learning algorithms. To get a sense of the toolkit's capabilities, check out the examples page.
To better understand how machine learning differs from classical methods, let’s revisit multiple regression (see Lecture 12).
The first step will be to load a dataset to use for our analysis. Scikit-learn comes with several toy datasets that are quite useful for building intuition about machine learning. We will be working with a dataset of Boston real estate prices. To load the data and print out a detailed description of the dataset, use the following code:
from sklearn.datasets import *
data = load_boston()
print data.DESCR
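Before fitting anything, it can help to check how much data you are working with. The data and target fields are numpy arrays, so their shapes show the number of examples and features; a quick sketch:
from sklearn.datasets import *
data = load_boston()
# rows are houses, columns are the 13 features described in DESCR
print data.data.shape
print data.target.shape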
To learn a simple model of housing prices using multiple linear regression, print the model parameters, and print the coefficient of determination, use:
from sklearn.datasets import *
from sklearn.linear_model import LinearRegression
data = load_boston()
model = LinearRegression()
model.fit(data.data, data.target)
print model.__dict__
print model.score(data.data, data.target)
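If you only want the learned parameters rather than the full model.__dict__ dump, the fitted slopes and intercept are exposed as attributes of the model; a minimal sketch continuing from the code above:
from sklearn.datasets import *
from sklearn.linear_model import LinearRegression

data = load_boston()
model = LinearRegression()
model.fit(data.data, data.target)
# one learned slope per housing feature, plus a single intercept term
print model.coef_
print model.intercept_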
This code tells us how well the model explains the data we used to fit it. In machine learning, however, we care about how the model performs on unseen data. To estimate performance on unseen data, we can split the data into two sets: a training set and a test set. The following code fits a model using only the training data and prints the coefficient of determination for both the training and testing data:
from sklearn.datasets import *
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.5)
model = LinearRegression()
model.fit(X_train, y_train)
print "Train R2 %f"%model.score(X_train,y_train)
print "Test R2 %f"%model.score(X_test,y_test)
This simple experiment gets at a key idea: performance measured on the same data used to fit the model is not an accurate predictor of how well the model will do on new data.
Further, there is a relationship between model complexity, amount of training data, and the gap between the performance of a model on the training data versus the testing data.
To get a better handle on this relationship, run the Python script learn_dataset_linear_regression.py. This script generates a plot showing the R2 on the training and test sets versus the number of the 13 housing features included in the model. For instance, at the value 3 on the x-axis, a random subset of 3 of the original 13 features was selected for learning. This procedure was repeated 1,000 times to smooth out variability.
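You don't need to write this script yourself, but as a rough sketch of the kind of experiment it performs (with a smaller trial count here so it runs quickly), it looks something like the following:
import numpy
import matplotlib.pyplot as plt
from sklearn.datasets import *
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression

data = load_boston()
n_trials = 100  # the provided script uses 1,000 trials; fewer keeps this sketch fast
n_features = data.data.shape[1]
train_r2 = numpy.zeros(n_features)
test_r2 = numpy.zeros(n_features)

for k in range(1, n_features + 1):
    for n in range(n_trials):
        # choose a random subset of k of the 13 features
        subset = numpy.random.permutation(n_features)[:k]
        X = data.data[:, subset]
        X_train, X_test, y_train, y_test = train_test_split(X, data.target, train_size=0.5)
        model = LinearRegression()
        model.fit(X_train, y_train)
        train_r2[k - 1] += model.score(X_train, y_train)
        test_r2[k - 1] += model.score(X_test, y_test)

plt.plot(range(1, n_features + 1), train_r2 / n_trials, label='train')
plt.plot(range(1, n_features + 1), test_r2 / n_trials, label='test')
plt.xlabel('Number of features used')
plt.ylabel('R2')
plt.legend()
plt.show()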
Questions:
Why do the curves look the way they do?
Does this remind you of any of the statistics we computed when we first saw multiple linear regression?
In addition to multiple regression, scikit-learn supports many other learning algorithms for both regression and classification. In the classification setting, the goal is to assign a categorical label to an input rather than a continuous value (as in regression). When doing classification, you will want to use both a different evaluation function (to use the terminology of the Domingos paper) and a different learning algorithm. To get started, we will use multinomial logistic regression, which is built into scikit-learn. Specifically, we will build a model to classify images of handwritten digits.
To load the digits dataset and display 10 of the exemplars, use the following code:
%matplotlib inline
from sklearn.datasets import *
import matplotlib.pyplot as plt
import numpy
digits = load_digits()
print digits.DESCR
fig = plt.figure()
for i in range(10):
    subplot = fig.add_subplot(5, 2, i+1)
    subplot.matshow(numpy.reshape(digits.data[i], (8, 8)), cmap='gray')
plt.show()
Next, we will use multinomial logistic regression to learn to classify images of digits based on their pixel brightnesses. As before, we split the data into two sets in order to get an accurate estimate of how well our model will work on future images of digits.
from sklearn.datasets import *
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
data = load_digits()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.5)
model = LogisticRegression(C=10**-10)
model.fit(X_train, y_train)
print "Train accuracy %f" %model.score(X_train,y_train)
print "Test accuracy %f"%model.score(X_test,y_test)
Next, we will examine how the amount of training data influences the performance of the learned model. Run the following code to generate a learning curve showing the model's accuracy on a test set as a function of the number of training examples used:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy
from sklearn.datasets import *
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
data = load_digits()
print data.DESCR
n_trials = 5
train_percentages = range(5,95,5)
test_accuracies = numpy.zeros(len(train_percentages))
for (i, train_percent) in enumerate(train_percentages):
    test_accuracy = numpy.zeros(n_trials)
    for n in range(n_trials):
        X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=train_percent/100.0)
        model = LogisticRegression(C=10**-10)
        model.fit(X_train, y_train)
        test_accuracy[n] = model.score(X_test, y_test)
    test_accuracies[i] = test_accuracy.mean()

fig = plt.figure()
plt.plot(train_percentages, test_accuracies)
plt.xlabel('Percentage of Data Used for Training')
plt.ylabel('Accuracy on Test Set')
plt.show()
Questions:
What is the general trend?
Are there parts of the curve that appear to be noisier than others? Why?
To reduce the noise in the curve, increase the number of repeated random trials by editing the code.
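For example, changing the line near the top of the script averages over more random train/test splits at each training-set size (at the cost of a longer run time):
n_trials = 50  # was 5; more repeated random trials smooths the curve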
There are tons of datasets out there to use for learning. The easiest place to start is the other toy datasets built into scikit-learn. Beyond those, two good sources are the UCI Machine Learning Repository and Kaggle.
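For example, the iris flower dataset is another toy dataset bundled with scikit-learn, and the same split/fit/score pattern from above applies to it; a minimal sketch:
from sklearn.datasets import *
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.5)
model = LogisticRegression()
model.fit(X_train, y_train)
print "Test accuracy %f" % model.score(X_test, y_test)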