Statistical Machine Learning by using Jupyter Notebook Problem

17 Aug Statistical Machine Learning by using Jupyter Notebook Problem

Posted at 14:28h in computer science by

For this HW, we will pretend that we only have 1000 images of the MNIST dataset.

We will begin with the setup we had in our videos:

#commonly used imports
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#get data from sklearn
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

#get the attributes and labels
X, y = mnist["data"], mnist["target"]

#convert labels from characters to numbers
y = y.astype(np.uint8)

#split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Suppose we decide on a Decision Tree Classifier (covered later in Chapter 6). There are two hyperparameters we will fine-tune for now: max_leaf_nodes and min_samples_split. Let’s fine-tune these using the following code.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#setup the decision tree
dt_clf = DecisionTreeClassifier(random_state=30)

#the hyperparameters to search through
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}

#initialize the GridSearch with 3-fold cross validation
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#do the search (but only on the first 1000 images)
grid_search_cv.fit(X_train[:1000], y_train[:1000])

What are the best hyperparameters?
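One way to read off the answer is through the best_params_ and best_score_ attributes of a fitted GridSearchCV. The following is a minimal sketch, not the assignment's solution: it assumes the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST and uses a deliberately tiny grid so it runs in seconds. The names X_small, y_small, demo_params, and demo_search are made up for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits,
# used only so the sketch runs without downloading MNIST.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:300], y_small[:300]

# A deliberately tiny grid (assumption) so the search finishes quickly.
demo_params = {"max_leaf_nodes": [5, 10, 20], "min_samples_split": [3, 4, 5]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=30), demo_params, cv=3)
demo_search.fit(X_small, y_small)

# best_params_ is a dict holding the winning combination;
# best_score_ is that combination's mean cross-validated score.
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

On the homework's own search, the same two attributes are available on grid_search_cv after the call to fit.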

Now, let’s fit the training data (again pretending to only have the first 1000 images).

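#best_estimator_ was already refit on all 1000 images by GridSearchCV (refit=True is the default),
#so the explicit fit below is redundant but harmless
dt_clf = grid_search_cv.best_estimator_
dt_clf.fit(X_train[:1000], y_train[:1000])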

Let’s now obtain cross-validation accuracy scores with this fine-tuned model:

from sklearn.model_selection import cross_val_score
cross_val_score(dt_clf, X_train[:1000], y_train[:1000], cv=3, scoring="accuracy")

How accurate are your three folds?
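cross_val_score returns one score per fold as a NumPy array, so the folds can be summarized with a mean and a standard deviation. The sketch below shows this under stated assumptions: it uses scikit-learn's bundled 8x8 digits as a stand-in for MNIST, and the max_leaf_nodes value and the names demo_clf and fold_scores are chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled 8x8 digits, not MNIST.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:300], y_small[:300]

# Hyperparameter value picked arbitrarily for the demo.
demo_clf = DecisionTreeClassifier(max_leaf_nodes=20, random_state=30)

# One accuracy value per fold, returned as an array of length cv.
fold_scores = cross_val_score(demo_clf, X_small, y_small, cv=3, scoring="accuracy")

print(fold_scores)
print(f"mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A large spread between folds on only 1000 images would suggest that any single fold's accuracy should be read with caution.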

If we only had 1000 images and wanted more to train on, we could use what is known as data augmentation. The augmentation we will do here is to shift the images we already have: each of the 1000 images is shifted one pixel up, down, left, and right, which gives us four additional images per original to add to our training set. Data augmentation such as this is a very useful technique when there are not enough training instances. If you run the following code, you will see an example of the first image shifted (by five pixels here, so that the shift is easy to see).

#the scipy.ndimage.interpolation namespace is deprecated; import shift from scipy.ndimage directly
from scipy.ndimage import shift
def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

image = X_train[0]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()
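To make the sign convention of the shift explicit, here is a small sketch on a 5x5 toy array with a single bright pixel. It is only an illustration: the names toy and brightest are made up here. The shift argument is given as [rows, columns], so a positive first entry moves content down and a negative second entry moves content left, matching the "Shifted down" and "Shifted left" panels above.

```python
import numpy as np
from scipy.ndimage import shift

# Toy "image" (assumption): one bright pixel at row 2, column 2.
toy = np.zeros((5, 5))
toy[2, 2] = 1.0

# Same call pattern as shift_image: the shift is [dy, dx].
down = shift(toy, [1, 0], cval=0, mode="constant")   # dy = +1
left = shift(toy, [0, -1], cval=0, mode="constant")  # dx = -1

def brightest(a):
    """Return the (row, col) index of the largest value in a 2-D array."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(a), a.shape))

print(brightest(toy), brightest(down), brightest(left))
```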

To shift all of our images and add them to a training data set, we can use the following code:

X_train_augmented = [image for image in X_train[:1000]]
y_train_augmented = [label for label in y_train[:1000]]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train[:1000], y_train[:1000]):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

You now have 5000 images in X_train_augmented, with matching labels in y_train_augmented: the 1000 original images plus a copy of each shifted up, down, left, and right.
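A quick sanity check of the augmentation loop can be sketched on stand-in data. The example below assumes 10 random 784-pixel "images" instead of MNIST (X_demo, y_demo, X_aug, and y_aug are names made up for this illustration) and confirms that the result is five times the original size. The final shuffle is an optional extra that is not part of the assignment's code: the loop stores the originals first and then each block of shifts, and scikit-learn's default folds for a classifier do not shuffle, so some people prefer to permute images and labels together before cross-validating.

```python
import numpy as np
from scipy.ndimage import shift

def shift_image(image, dx, dy):
    # Same helper as in the assignment: shift a flattened 28x28 image.
    image = image.reshape((28, 28))
    return shift(image, [dy, dx], cval=0, mode="constant").reshape([-1])

# Stand-in data (assumption): 10 random images with labels 0..9.
rng = np.random.default_rng(0)
X_demo = rng.random((10, 784))
y_demo = np.arange(10, dtype=np.uint8)

# Originals first, then one block per shift direction.
X_aug = [img for img in X_demo]
y_aug = [lab for lab in y_demo]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for img, lab in zip(X_demo, y_demo):
        X_aug.append(shift_image(img, dx, dy))
        y_aug.append(lab)
X_aug, y_aug = np.array(X_aug), np.array(y_aug)

print(X_aug.shape, y_aug.shape)  # five times as many rows as X_demo

# Optional (not in the assignment): shuffle images and labels in unison.
perm = rng.permutation(len(X_aug))
X_shuf, y_shuf = X_aug[perm], y_aug[perm]
```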

Let’s now fine-tune and fit a decision tree classifier using this augmented training data.

dt_clf = DecisionTreeClassifier(random_state=30)
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [3, 4, 5]}
grid_search_cv = GridSearchCV(dt_clf, params, verbose=1, cv=3)

#This will take around 10 minutes to run
grid_search_cv.fit(X_train_augmented, y_train_augmented)

grid_search_cv.best_estimator_

Now, what are the best hyperparameters?

Finally, obtain cross-validation accuracy scores with this model trained on the augmented data.

What are the accuracy scores now?

Did the augmented data help?

Think of one downside to using augmented data and comment.

Submit your code in a Jupyter notebook. Include all code: the code examples above and what you write.
Put your answers to the questions above into markdown cells.
Use a single-hashmark (#) heading to label the problem.
