This page is for the course on Machine Learning
Class Notices:
The class outline is here as a PDF file, and here as HTML. Some of the information it contains is also given below.
The first text for the course is a book by Aurélien Géron, which I have found to be very useful for learning how to program machine-learning algorithms. The book was originally called Hands-On Machine Learning with Scikit-Learn and TensorFlow, and it is available from the O'Reilly website, to which I believe McGill people can get free access. However, the book has been updated, with the new title Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. Keras is a software layer that makes programming an algorithm even easier than with TensorFlow, which until very recently was the leading platform.
The second text for the course is Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. It is now quite old. I used to use it as the main text, and it is still useful for many theoretical considerations.
This link takes you to the website for the book. Its contents are completely available online. The book is also available in hardcover, published by the MIT Press.
The third text is one that I came across just recently. It is GANs in Action, by Jakub Langr and Vladimir Bok, Manning. I acquired it as an ebook, and I don't know if it is available in a hardcopy version. Although GANs are a quite advanced topic, the first part of this book is a rather good, non-mathematical introduction to many of the topics we will consider.
Provisionally, here is a summary of the topics I covered over the last few years, and hope to be able to cover this year as well, plus one addition, the last in the list.
Software:
In the last few years, I have been working with Python, an interpreted language that has, for the most part, a straightforward syntax, and can be learnt swiftly by anyone with even just a little experience of programming. (Prefer Python 3 to Python 2. The two versions are not completely inter-compatible, and Python 3 provides better functionality.)
The relative simplicity of programming in Python is probably the main reason why Python has quite the best set of libraries for machine learning, and not just for deep learning. Although deep learning will be the main focus of the course, I plan to look at some other machine-learning techniques, for which the Python libraries are equally useful. The study of some of these other techniques reveals how much all modern machine-learning approaches have in common, despite the fact that some are much better adapted than others for specific applications.
Resources
There is a super-abundance of resources available online for studying machine learning, and for implementing it. Machine learning is often coupled with the buzzword Big Data, and this is simply because machines usually learn better if they have a lot of data available to train their algorithms. Many big datasets are available online, the best known, and probably the most comprehensive, being
Here are some of the available resources which I found useful. I will add to the list as the term proceeds.
Log of material covered:
The first class was on August 28. We began with an overview of what will be covered in the course, along with a brief discussion of the history of machine learning and artificial intelligence. A look at the course outline allowed us to see a considerable number of machine architectures that can be used for various specific problems.
We looked very briefly at the preface of Géron's book, where he mentions many of the now everyday operations carried out by trained machines. The importance of the Python programming language was emphasised for accessing the numerous libraries that can be used in order to train models from datasets.
On September 2, we discussed two of the main sorts of tasks that can be tackled by machine-learning algorithms: classification and regression. There are many more, but these two are the most important. The result of a classification task is an assignment of an instance to one of a finite number of categories. For instance, does an image (the instance) depict a cat, a dog, or something else? A regression task returns a number, which could be an estimate, or a forecast, of some variable based on the values of some explanatory variables, where these values constitute the instance. The familiar idea of linear regression is an example of this, and linear regression is indeed used as a machine-learning algorithm in appropriate cases.
An important kind of learning is supervised learning. The machine is trained using a training set composed of several instances, each along with a label, in the case of a classification task, or a target, for a regression task. In both cases, the label or the target is the right answer, and the training proceeds by trying to minimise the loss function.
In addition to supervised learning, we can also consider unsupervised learning, semi-supervised learning, and self-supervised learning. Reinforcement learning is somewhat different. It is regularly used to train robots, which get a reward when they do the right thing, and a penalty for the wrong thing.
In his first chapter, Géron provides a long list of recent applications of machine learning, including image classification and natural language processing. We briefly discussed some of these.
We looked at the first bit of Python code provided by Géron, for running a linear regression. This gave us the opportunity to see some aspects of Python syntax, which flow naturally from the fact that, in Python, everything is an object.
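The snippet below is not Géron's code, but a minimal sketch of the same idea, using Scikit-Learn's LinearRegression on made-up numbers:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one explanatory variable
    y = np.array([2.1, 3.9, 6.2, 8.1])           # the target

    model = LinearRegression()
    model.fit(X, y)                       # training = estimating the coefficients
    print(model.intercept_, model.coef_)  # fitted parameters
    print(model.predict([[5.0]]))         # prediction for a new instance

The same fit/predict pattern recurs throughout Scikit-Learn, whatever the algorithm being used.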
We completed the study of Chapter 1 of Géron's book on September 4. This involved discussing over- and under-fitting, separating the training set and the test set, and using validation techniques in order to choose good settings for hyperparameters. Regularisation is a general term to describe imposing constraints on a model, so as to avoid over-fitting.
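As a hedged illustration of these ideas (the data and the hyperparameter values below are invented), one can split off a test set and then choose a regularisation strength by cross-validation on the training set only, keeping the test set untouched until the very end:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)

    # Try several values of the regularisation hyperparameter alpha
    for alpha in (0.01, 0.1, 1.0, 10.0):
        scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
        print(alpha, scores.mean())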
A long list of aspects of data sets that make them unsuitable for machine learning was given. Data may be unrepresentative or biased. It is important to have as much usable data as possible, but care must be taken to make the data appropriate for machine-learning algorithms.
Then we started on Chapter 2, in which we are taken, step by step, through an artificial project that uses data on house prices in California as determined in the 1990 census. After a first phase in which aims and objects are defined, and the data are obtained and downloaded, as much qualitative information as can be readily seen is used in the process of readying the data for training.
On September 9, we went on with Chapter 2 of the textbook. The aim of the artificial project is to predict the median house price in a particular district (geographical location) in California given various characteristics of the house, such as floor area, number of bedrooms, number of bathrooms, and a categorical variable that expresses how desirable the district is in terms of proximity to the ocean and a few other features.
A great many of the points raised in the chapter apply equally well to almost any empirical project, whether in economics (like this one) or another discipline. One thing specific to the housing project treated here is the dataset itself. Instructions are given for accessing it. While one may do so working from a computer linked to the Internet, it can also be done in the cloud, using Google Colab. The Colab does not save your data, and so, if you don't want to stay online until you complete the project, you must download the data either to your own machine or to somewhere else where you can store files.
Whether you work in the cloud or on your computer, a very convenient way of developing the project is by use of a Jupyter notebook. Such a notebook consists of cells, which may contain text, along with graphics, or Python code, or the results of running the code through the Python interpreter. The notebook serves as excellent documentation of the project, as its aims and objects can be explained in text cells, while the code cells permit replication of the actual computation. Both text and code cells can be updated, so that mistakes can be corrected easily.
Emphasis was put on aspects of Python syntax that are used in a typical machine-learning project, and on the various steps to be taken in getting data and handling them so as to produce a training set and a test set. Each set contains instances, each characterised by various attributes, which may be numerical or categorical.
We are skipping lightly over the example in Chapter 2 that Géron treats in detail, because it is an example of linear regression. But in fact linear regression illustrates many of the aspects of supervised learning, and it allows us to compare and contrast the familiar econometric terminology with that used in the context of machine learning. What is useful for later is to see how to use Python and its numerous libraries to handle the data used by a regression and to run the regression so as to get output of various sorts.
We spent some time on September 11 on Géron's example using housing data from California. The application used by Géron for the project is Scikit-Learn. Its API (application programming interface) is well thought-out, and is thus suitable for many small or medium-sized projects. Géron walks us through the numerous steps needed to prepare the dataset for use. After the choice of units of measurement, possible data transformations (such as taking logs of some variables), cleaning the data, and dealing with missing data, the complete dataset can be split into the training set and the test set, and training can begin.
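The following is only a rough sketch, not Géron's actual code, of the kind of Scikit-Learn pipeline used at this stage: impute missing values, standardise the numerical features, and one-hot encode a categorical feature. The tiny DataFrame below merely stands in for the real housing data, with illustrative column names.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # A tiny stand-in for the housing data
    housing = pd.DataFrame({
        "median_income": [8.3, 7.2, None, 5.6],
        "total_rooms": [880.0, 7099.0, 1467.0, None],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
    })

    num_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # fill missing values
        ("scale", StandardScaler()),                    # standardise the numbers
    ])

    preprocess = ColumnTransformer([
        ("num", num_pipeline, ["median_income", "total_rooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
    ])

    X_prepared = preprocess.fit_transform(housing)
    print(X_prepared.shape)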
We looked at some evaluation measures for the linear regression trained on the California housing data. The most straightforward of these is root mean squared error or RMSE.
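The RMSE is the square root of the average of the squared differences between predictions and targets. A minimal sketch of computing it with Scikit-Learn (the numbers are invented):

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([210000.0, 320000.0, 150000.0])   # actual median house prices
    y_pred = np.array([205000.0, 330000.0, 160000.0])   # model predictions

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(rmse)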
We now skipped over a number of chapters to Chapter 6, on Decision Trees. The Iris dataset is used in order to look at decision trees. A tree consists of several nodes, one of which is the root node. This node, and all others that are not leaf nodes, split into two branches, the chosen branch depending on a criterion, specific to the node and based on just one feature. Any instance follows the branches going from the root to some leaf node. The leaf node determines the predicted probabilities for the instance for each class, and the class with the highest probability is the predicted class.
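A minimal sketch, using Scikit-Learn, of training such a tree on the Iris data; max_depth limits the depth of the tree, and hence the number of successive splits along any path from root to leaf.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=2, random_state=42)
    tree.fit(X, y)

    # Class probabilities and predicted class for one instance, as determined
    # by the leaf node that the instance ends up in
    print(tree.predict_proba(X[:1]))
    print(tree.predict(X[:1]))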
On September 16, we continued work on Decision Trees. Training a decision tree can be achieved by use of the CART algorithm (Classification And Regression Tree). This makes use of the Gini impurity of each node.
An alternative to the Gini impurity is the entropy, but the choice of which of them to use for the CART algorithm seems to matter very little. We continued the study of Decision Trees by seeing how they can be regularised, and how regularisation performs its usual task of avoiding overfitting. A decision tree can be used for a regression task as well as for a classification task, by replacing the Gini impurity or entropy in the CART algorithm by the mean squared error (MSE). A final point about decision trees is that they depend heavily on the nature of the correlations among the features. Principal Components Analysis (PCA), to be discussed later, can be used advantageously in this context. See here for the mathematical background.
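As a small numerical illustration (the class proportions are invented), both the Gini impurity and the entropy of a node can be computed from the proportions of the training instances of each class that reach that node:

    import numpy as np

    p = np.array([0.0, 0.49, 0.51])     # class proportions at a node
    gini = 1.0 - np.sum(p ** 2)         # Gini impurity: 1 minus the sum of squared proportions
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy in bits, skipping empty classes
    print(gini, entropy)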
Chapter 7 begins the discussion of Ensemble Methods, by which is meant that the results of different predictors can be combined in order to yield more precise results. This can be done by either hard or soft voting. The former is the familiar majority voting, selecting the prediction that gets the most votes from the predictors; the latter instead averages the prediction probabilities produced by the predictors, and only then selects the class that gets the highest averaged probability.
The term bagging is short for bootstrap aggregating. It generates a set of predictions by training the chosen algorithm not only on the training set, but on bootstrap training sets, which are generated by resampling with replacement from the original training set. The predictions of the bootstrap sets can then be combined by hard or soft voting. Because a bootstrap set contains only a subset of the instances in the original set, the other instances, called oob (for out-of-bag) instances, can be used in the construction of a validation set. The most successful ensemble method so far is the Random Forest, created by bagging decision trees. An alternative to bagging is to train on random subsets of the original training set.
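A hedged sketch of bagging decision trees and of a Random Forest with Scikit-Learn, including the out-of-bag score mentioned above; the Iris data are used only because they are small and built in.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, oob_score=True, random_state=42)
    bag.fit(X, y)
    print(bag.oob_score_)       # accuracy estimated on the out-of-bag instances

    forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
    forest.fit(X, y)
    print(forest.oob_score_)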
On September 23, we continued study of Chapter 7. Another set of ensemble methods is covered by the term boosting. Unlike bagging and pasting, where the new training sets can be trained simultaneously, or concurrently, in parallel, the training sets generated by boosting must be trained sequentially. The method called AdaBoost, for adaptive boosting, works at each stage of an iterative procedure by weighting more heavily those instances that contributed most to the loss function at the previous stage. Once enough iterations have been performed, the results of all the stages can be combined, but with weights determined by how well they performed separately. Gradient boosting uses a different approach, in which at each stage one tries to train a model to fit the residuals of the previous stage. At the end, the predictions are summed to get the ensemble predictions.
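A minimal sketch of the two boosting methods with Scikit-Learn: AdaBoost for a classification task, and gradient boosting for a small, invented regression task.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor

    X, y = load_iris(return_X_y=True)
    ada = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)
    print(ada.score(X, y))

    rng = np.random.default_rng(0)
    Xr = rng.uniform(-3.0, 3.0, size=(200, 1))
    yr = np.sin(Xr).ravel() + rng.normal(scale=0.1, size=200)
    gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    random_state=42).fit(Xr, yr)
    print(gbr.predict(Xr[:3]))    # each prediction is a sum over the sequential trees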
With stacking, short for stacked generalisation, a model called a blender is trained in order to perform the aggregation of the results of all the predictors in the ensemble. Its inputs are the outputs of all these predictors, and its targets are unchanged from those of the original training set.
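A sketch of stacking in Scikit-Learn; the final_estimator plays the role of the blender, and the base predictors and data here are chosen only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    stack = StackingClassifier(
        estimators=[("forest", RandomForestClassifier(random_state=42)),
                    ("svc", SVC(probability=True, random_state=42))],
        final_estimator=LogisticRegression(),   # the blender
        cv=5,
    )
    stack.fit(X, y)
    print(stack.score(X, y))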
We then went back to Chapter 3, on classification. Most of this chapter is devoted to a study of the MNIST dataset of handwritten digits. Géron tells us how to download the dataset, to see its format, and to understand the meaning of the data. Although one would like to train an algorithm to say what digit each image in the dataset is supposed to represent, we start with a binary classifier, the aim of which is just to say whether a digit is a '5' or not.
The first classifier used is based on stochastic gradient descent. Its performance can be measured by four quantities: true and false positives, and true and false negatives. These in turn define the measures called precision and recall, and a combination of the two, called the F1 measure.
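A hedged sketch of the binary "is it a 5?" classifier and its evaluation; downloading MNIST takes a little while, and only a subset of the images is used here to keep the run short.

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import f1_score, precision_score, recall_score
    from sklearn.model_selection import cross_val_predict

    mnist = fetch_openml("mnist_784", as_frame=False)
    X, y = mnist.data[:10000], mnist.target[:10000]   # a subset of the images
    y_is_5 = (y == "5")                               # binary labels

    sgd = SGDClassifier(random_state=42)
    y_pred = cross_val_predict(sgd, X, y_is_5, cv=3)  # out-of-fold predictions

    print(precision_score(y_is_5, y_pred))
    print(recall_score(y_is_5, y_pred))
    print(f1_score(y_is_5, y_pred))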
We went quickly through the rest of Chapter 3 on September 25. For multiclass classification, the task can be assigned to a set of binary classifiers. There are two approaches: one-versus-all (or one-versus-the-rest), where each possible class gets a binary classifier, and one-versus-one, where each pair of classes is assigned a classifier. The second approach requires a lot more classifiers, but each is trained on only a subset of the complete dataset.
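Scikit-Learn provides wrappers for both strategies; a short sketch using the Iris data:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    X, y = load_iris(return_X_y=True)
    ova = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X, y)  # one classifier per class
    ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)   # one classifier per pair of classes
    print(len(ova.estimators_), len(ovo.estimators_))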
The last sort of task considered in this Chapter is multioutput classification. The example given is that of denoising images. The model is trained using inputs that are here a selection of the handwritten digits in the MNIST dataset, to which random noise has been added. The target for each instance is the clean image, which consists of 28×28 pixels, each with a numerical value in the range 0-255 - hence multioutput classification. After training, the model should be able to clean, or denoise, images.
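A rough sketch of the multioutput idea, with invented arrays standing in for the MNIST images: the noisy image is the input, and each pixel of the clean image is a separate output. A k-nearest-neighbours classifier is used here, in the spirit of the example described above.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(42)
    clean = rng.integers(0, 256, size=(500, 784))      # stand-in "clean" images
    noise = rng.integers(0, 100, size=clean.shape)
    noisy = np.clip(clean + noise, 0, 255)             # inputs with added noise

    knn = KNeighborsClassifier()
    knn.fit(noisy, clean)                 # each of the 784 pixel values is a separate target
    denoised = knn.predict(noisy[:1])     # predicted clean image for one instance
    print(denoised.shape)                 # (1, 784)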
Next we moved on to Chapter 8, on dimensionality reduction. This is made almost essential on account of the curse of dimensionality. After briefly discussing orthogonal projections, we talked about Principal Components Analysis, or PCA. This depends on the singular value decomposition or SVD, which is dealt with more mathematically in the notes I prepared on the topic.
These notes took up some time on September 30. The first topic treated in the notes is the singular value decomposition. An m × n matrix X is expressed as the product of three matrices: UWV⊤, where U and V have orthonormal columns and W is diagonal with non-negative diagonal elements. Existence is proved, but not uniqueness.
Next, we defined the concept of a generalised inverse, and the Moore-Penrose generalised inverse. This led on, finally, to the section on PCA. This analysis can be implemented very easily by use of the SVD, and we saw how it can be used for dimension reduction.
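A minimal sketch of PCA via the SVD, along the lines of the notes: centre the data, compute the SVD, and project onto the first d principal components. The last line checks that Scikit-Learn's PCA gives the same projections up to sign.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Xc = X - X.mean(axis=0)                               # centre the data

    U, W, Vt = np.linalg.svd(Xc, full_matrices=False)     # Xc = U diag(W) V^T
    d = 2
    X_reduced = Xc @ Vt[:d].T                             # projection onto first d components

    X_reduced_sk = PCA(n_components=d).fit_transform(Xc)
    print(np.allclose(np.abs(X_reduced), np.abs(X_reduced_sk)))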
After all this, we started on Chapter 10, where we are introduced to artificial neural networks (ANNs), which are what is used in deep learning. The perceptron was an early attempt to mimic the human (or animal) brain using artificial neurons. It was shown that a perceptron of the sort originally envisaged is unable to implement the XOR (exclusive or) operation.
The connections between layers are parametrised using connection weights and biases, which convert the linear functions used by a unit in a layer above the input layer into affine functions. One could have any number of layers with only linear functions taking inputs to outputs, but the end result would be equivalent to just one layer. It is necessary to act on the output of the affine functions with an activation function, which must be nonlinear. The first activation functions were step functions. But they have the disadvantage that their derivatives are zero almost everywhere.
On October 2nd, Géron showed us a multi-layer perceptron or MLP that does implement the XOR operation. It is of the simplest type: Above the input layer there is a hidden layer, above which is the output layer. In general, there may be more than one hidden layer.
Then we looked at various choices for activation functions: the logistic or sigmoid function, the hyperbolic tangent, and, the most widely used, the ReLU, or rectified linear unit function. This last is the easiest to compute, and its derivative is either zero or one.
The output layer itself in general uses an activation function before spitting out its outputs, which are the arguments of the loss function. For a regression task, this might just be the identity function, with no transformation at all, but, for a classification task, a good choice is the softmax function, which generates a set of probabilities that can be used for either hard or soft voting to choose the category of an instance.
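A hedged sketch, in Keras, of a small MLP of the kind just described: one hidden layer with ReLU activations and a softmax output layer producing class probabilities. The data are invented purely to show the training call.

    import numpy as np
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(4,)),                      # four input features
        keras.layers.Dense(16, activation="relu"),    # hidden layer
        keras.layers.Dense(3, activation="softmax"),  # output probabilities for 3 classes
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  metrics=["accuracy"])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4)).astype("float32")
    y = rng.integers(0, 3, size=150)
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)   # training in mini-batches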
In order to train a deep model, the training set is randomly split into mini-batches, which are successively fed into the model. There is first a feed-forward pass through the data, in which the inputs are converted into outputs using the affine functions in each unit (a better word than artificial neuron), and the activation functions. It is important to initialise the connection weights randomly, since otherwise all the nodes in a layer would be identical, thus defeating the idea of each node doing its own thing.
The outputs are used to evaluate the loss function, and then the partial derivatives of the loss function with respect to all of the parameters, the connection weights and the biases, are computed by backpropagation, which at heart is just the chain rule of the differential calculus. Follow this link for a proper mathematical account of backpropagation. Then, just as with stochastic gradient descent, the parameters are updated by following the negative of the gradient, multiplied by the learning rate η.
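Backpropagation itself is best left to the library, but the parameter update at the end of each pass is simple enough to show by hand. The toy example below applies the rule theta <- theta - eta * gradient to the MSE loss of a linear model (using the full batch rather than mini-batches, to keep it short):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.c_[np.ones(100), rng.normal(size=(100, 1))]   # column of ones for the bias
    y = 4.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    eta = 0.1                              # learning rate
    theta = rng.normal(size=2)             # random initialisation of the parameters
    for _ in range(200):
        gradient = (2.0 / len(X)) * X.T @ (X @ theta - y)   # gradient of the MSE loss
        theta = theta - eta * gradient                      # one gradient-descent step
    print(theta)                           # should be close to (4, 3)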
Géron provides a table with typical architectures for MLPs. Among things to be chosen are the depth (number of layers) and the width (number of nodes in each of the layers). Activation functions must also be specified, although for many regression tasks none is needed for the output layer. The loss function may be of different sorts, although the mean squared error is the commonest for regression tasks. If there are many outliers, mean absolute error may be more robust.
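As a sketch of a typical regression architecture in Keras (the layer sizes are arbitrary): two hidden layers, a single output unit with no activation function, and mean absolute error as the loss.

    from tensorflow import keras

    reg_model = keras.Sequential([
        keras.Input(shape=(8,)),                     # e.g. eight input features
        keras.layers.Dense(30, activation="relu"),
        keras.layers.Dense(30, activation="relu"),
        keras.layers.Dense(1),                       # no activation: plain regression output
    ])
    reg_model.compile(loss="mae", optimizer="sgd")   # MAE is more robust to outliers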
Assignments:
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
https://russell-davidson.research.mcgill.ca/e706/