This page is for the course on Machine Learning
Class Notices:
The class outline is here as a PDF file, and here as HTML. Some of the information it contains is also given below.
The first text for the course is a book by Aurélien Géron, which I have found to be very useful for learning how to program machine-learning algorithms. The book was originally called Hands-on Machine Learning with SciKit-Learn and TensorFlow , and it is available from the O'Reilly website, to which I believe McGill people can get free access. However, the book has been updated, with the new title Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition . Keras is a software layer that makes programming an algorithm even easier than with TensorFlow, the leading platform until very recently.
The second text for the course is Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. It is now quite old. I used to use it as the main text, and it is still useful for many theoretical considerations.
This link takes you to the website for the book. Its contents are completely available online. The book is also available in hardcover . It is published by the MIT Press.
The third text is one that I came across just recently. It is GANs in Action, by Jakub Langr and Vladimir Bok, Manning. I acquired it as an ebook, and I don't know if it is available in a hardcopy version. Although GANs are a quite advanced topic, the first part of this book is a rather good, non-mathematical introduction to many of the topics we will consider.
Provisionally, here is a summary of the topics I covered over the last few years, and hope to be able to cover this year as well, plus one addition, the last in the list.
This year, there is time to explore the emergence of ChatGPT, from OpenAI. We will use two texts available on the O'Reilly website. First, What are ChatGPT and its Friends?, by Mike Loukides. This is a very short book, of only 20 pages, but it sets the stage extremely well. The second text is a longer book, and so we will have time for only some of it. It is Developing Apps with GPT-4 and ChatGPT by Olivier Caelen and Marie-Alice Blete. Study of these should bring us some way towards the frontiers of this very fast-moving field.
Software:
In the last few years, I have been working with Python, an interpreted language that has, for the most part, a straightforward syntax, and can be learnt swiftly by anyone with even just a little experience of programming. (Prefer Python 3 to Python 2. The two versions are not completely inter-compatible, and Python 3 provides better functionality.)
The relative simplicity of programming in Python is probably the main reason for which Python has quite the best set of libraries for machine learning, and not just for deep learning. Although deep learning will be the main focus of the course, I plan to look at some other machine-learning techniques, for which the Python libraries are equally useful. The study of some of these other techniques reveals how much all modern machine-learning approaches have in common, despite the fact that some are much better adapted than others for specific applications.
Resources
There is a super-abundance of resources available online for studying
machine learning, and for implementing it. Machine learning is often
coupled with the buzzword Big Data
, and this is simply because
machines usually learn better if they have a lot of data available to
train their algorithms. Many big datasets are available online, the best
known, and probably the most comprehensive, being
Here are some of the available resources which I found useful. I will add to the list as the term proceeds.
Log of material covered:
The first class, on Wednesday August 30 was marred by the fact that, for unknown and mysterious reasons, I was unable to connect my small computer to the big screen in the classroom. The conventional lecture tried to cover more or less what I wanted to cover, using only marker and whiteboard.
After a brief discussion of the history of machine learning, with perceptrons, statistical learning, and deep learning, that last being based on the notion of the multilayer perceptron, we discussed two of the main sorts of tasks that can be tackled by machine-learning algorithms: classification and regression. There are many more, but these two are the most important. The result of a classication task is an assignment of an instance to one of a finite number of categories. For instance, does an image (the instance) depict a cat, a dog, of something else? A regression task returns a number, that could perhaps be an estimate, or a forecast, of some variable based on the values of some explanatory variables, where these values constitute the instance. The familiar idea of linear regression is an example of this, and linear regression is indeed used as a machine-learning algorithm in appropriate cases.
An important kind of learning is supervised learning. The machine
is trained, using a training set composed of several instances
along with a label, in the case of a classification task, or a
target, for a regression task. In both cases, the label or the
target is the right
answer, and the training proceeds by trying to
minimise the loss function, which measures the difference between
the right answer and the answer given by the algorithm. An essential
aspect of the training is backpropagation, whereby the partial
derivatives of the loss function with respect to all of the parameters of
the algorithm are calculated.
On September 6, we had a brief look at Chapter 5 of the Deep Learning book. This is the first chapter that touches on topics related to machine learning, as the previous chapters are mostly about mathematical prerequisites, useful in and of themselves of course, but not the focus of this course.
We moved on from there to Géron's Hands-On book, starting with the first chapter, which provides an overview of machine learning. It emphasises the need for lots of data, and, contrariwise, the lack of need for careful analysis of problems. There is a list of some of the salient applications of machine learning, complementing a similar, but not identical, list in the Deep Learning book. Géron provides some diagrams that illustrate the differences between an approach based on careful analysis and one based on lots of data.
In addition to supervised learning, we can also consider unsupervised learning, semi-supervised learning, and self-supervised learning. Reinforcement learning is somewhat different. It is regularly used to train robots, which get a reward when they do the right thing, and a penalty for the wrong thing.
We skipped lightly over an example that Géron treats in detail, because it is an example of linear regression. But in fact linear regression illustrates many of the aspects of supervised learning, and it allows us to compare and contrast the familiar econometric terminology and that used in the context of machine learning. What is useful for later is to see how to use Python and its numerous libraries to handle the data used by a regression and to run the regression so as to get output of various sorts.
We pursued the study of how to use Python and its libraries in order to run a linear regression on September 11. Then we carried on with Chapter 1 of the Hands-On book, looking at things that can make machine learning difficult. Of these, lack of suitable data has traditionally been the most important, but now, in the era of Big Data, this is less of a problem. The problem of unrepresentative datasets persists however, and so a machine may be trained to give results in a way a racist might do, for instance, if the training set was restricted to white people. Similar problems have attracted media attention.
An extremely important problem for the training of an algorithm is that of overfitting. In any dataset, we have to take account of the fact that what is observed is a mixture of signal and noise. Overfitting can be seen as learning as much about the noise as about the signal, and it leads to an inability of the trained model to generalise to other datasets.
The usual practice is to reserve a part of the dataset on which training
is to be based as the test set. This is never used for training,
only for evaluating how well a trained model can generalise. Sometimes,
another part of the overall dataset is reserved for validation.
Some aspects of the model to be trained as characterised as
hyperparameters that have to be chosen, or tuned. Bad
choices can lead to overfitting or underfitting. The validation set is
used to compare how well a number of competing models perform, and then,
once a best
model is found, the validation set is reincorporated
into the training set for final training. Only then can the success of the
procedure be judged using the test set.
Regularisation is the name given to the many techniques that can be used to guard against overfitting. By putting extra structure on the models considered, one can coax the training procedure to pay attention to the signal in the data rather than to the noise.
We moved on to the second chapter of Géron's Hands-On book on September 13. This chapter is a long and somewhat laborious account of a machine-learning project in which the model tries to predict the median house price in a particular district (geographical location) in California given various characteristics of the house, such as floor area, number of bedrooms, number of bathrooms, and a categorical variable that expresses how desirable the district is in terms of proximity to the ocean and a few other features.
A great many of the points raised in the chapter apply equally well to almost any empirical project, whether in economics (like this one) or another discipline. One thing specific to the housing project treated here is the dataset itself. Instructions are given for accessing it. While one may do so working from a computer linked to the Internet, it can also be done in the cloud, using the Google Colab. The Colab does not save your data, and so, if you don't want to stay online until you complete the project, you must download the data either to your own machine to to somewhere else where you can store files.
Whether you work in the cloud or on your computer, a very convenient way of developing the project is by use of a Jupyter notebook. Such a notebook is comprised of cells, which may contain text, along with graphics, or Python code, or the results of runniing the code through the Python interpreter. The notebook serves as excellent documentation of the project, as its aims and objects can be explained in text cells, while the code cells permit replication of the actual computation. Both text and code cells can be updated, so that mistakes can be corrected easily.
The application used by Géron for the project is SciKit Learn. Its API (application programming interface) is well thought-out, and is thus suitable for many small or medium-sized projects. Géron walks us through the numerous steps needed to prepare the datset for use. After the choice of units of measurement, possible data transformations, such as using logs of some variables, cleaning the data and dealing with missing data, the complete data set can be split into the training set and the test set, and training can begin.
The first model that is tried is good old-fashioned linear regression. In the first chapter, we saw how to set up such a model, and so this is another exercise with the same type of model. Next week we will see how to use other kinds of models, and evaluate their relative merits.
On September 18, we looked at some evaluation measures for the linear regression trained on the California housing data. The most straightforward of these is root mean squared error or RMSE. By this measure, the linear regression was pretty mediocre. When a Decision Tree was used instead, it was only a very slight improvement. However, use of a Random Forest led to much more convincing results.
Next, we moved on to Chapter 3 of Géron's book, on classification. Most of this chapter is devoted to a study of the MNIST dataset of handwritten digits. Géron tells us how to download the dataset, to see its format, and to understand what the meaning of the data is. Although one would like to train an algorithm to say what digit each image in the dataset is supposed to represent, we start with a binary-classifier, the aim of which is just to say whether a digit is a '5' or not.
The first classifier used is based on stochastic gradient descent. Its performance can be measured by four quantities: true and false positives, and true and false negatives. These in turn define the measures called precision and recall, and a combination of the two, called the F1 measure.
For multiple classification, the task can be assigned to a set of binary classifiers. There are two approaches: one-over-all, where each possible class gets a binary classifier, and one-over-one, where each pair of classes is assigned a classifier. The second approach requires a lot more classifiers, but each is trained on only a subset of the complete dataset.
We finished off Chapter 3 on September 20, talking about multioutput classification. The example given is that of denoising images. The model is trained using inputs that are here a selection of the handwritten digits in the MNIST dataset, to which random noise has been added. The target for each instance is the clean image, which consists of 28×28 pixels, each with a numerical value in the range 0-255 - hence multioutput classification. After training, the model should be able to clean, or denoise, images.
Chapter 4 is about Training Models. Although it discusses some very specific models, the general principles are widely applicable. The first model is our familiar linear regression, in which parameters are estimated by ordinary least squares (OLS). In the book, this terminology is not used; rather the criterion function to be minimised is called the mean squared error, or MSE. If the dataset is not too enormous, minimising the criterion function can be done analytically, by use of any of the numerous algorithms for OLS. One that is given privileged mention makes use of the singular value decomposition, or SVD; see this discussion.
Of more interest generally is the discussion of gradient descent. In the parameter space, the direction of steepest descent is the negative of the gradient of the criterion function. A hyperparameter needed for this method is the learning rate, which specifies how far each step should go in the direction of steepest descent. If the method is called batch gradient descent, then the gradient (the vector of partial derivatives with respect to the model parameters) is computed for the criterion function evaluated for every instance in the dataset, a potentially computationally intensive task. On the other hand, stochastic gradient descent uses the gradient of one single instance. This alleviates the computational burden, but leads to a somewhat random walk in the parameter space. What is used much more often is mini-batch gradient descent, for which the gradient is computed for a randomly chosen subset of instances. This can combine the advantages of batch and stochastic gradient descent.
It is pointed out that randomness can actually help in locating the global minimum of a criterion function. Methods which make full use of randomness are simulated annealing and genetic algorithms, which simulate biological evolution. But these methods are extremely computationally intensive.
In the effort of avoiding both underfitting and overfitting, we may use learning curves. A validation set is separated out of the training set, and the criterion function evaluated after each step of training. The parameters derived from the training set are then used to evaluate the criterion on the validation set. The learning curves are then plots of the criterion function for both the training and validation sets as training proceeds. If both converge quickly to a common value, this is an indication of underfitting. Normally, even after many epochs of training, there should be a gap between what is achieved by the training set and what is achieved by the validation set, because the training inevitably learns about the noise in the training set, which is independent of the noise in the validation set.
The next topic is regularisation, for which there are a great many techniques, all intended to avoid the demon of overfitting. The first of them presented in Géron's book is ridge regression. To the usual criterion function is added a penalty term, which for this technique is proportional to the squared norm of the parameter vector. In the context of polynomial regression, it is shown that ridge regression can smooth out what would otherwise be a very unsmooth regression line, or curve.
On September 25, we continued our study of regularisation. After ridge regression, which uses an L2 penalty term, we looked at Lasso regression, which uses an L1 penalty term. Because the absolute-value function is not differentiable at the origin, lasso regression can set some parameters exactly to zero, something that does not happen with ridge regression. There is also the Elastic net cost function, for which the penalty term is a convex combination of the L1 and L2 penalties.
Perhaps the most useful regularisation technique is early stopping, where the successive passes through the data performed by (stochastic) gradient descent are stopped, not when a minimum of the cost function is reached, but when the validation error starts to increase with successive passes.
The next topic, called logistic regression by Géron, takes us away from the linear regression model. It corresponds to the model called logit in econometrics, where it is used when the dependent variable is binary, True or False, zero or one. The cost function used for this sort of model is just the negative of the loglikelihood function for maximum likelihood estimation of a logit model. Models that can be estimated by logistic regression are called Binary Response Models. With large datasets, it can be useful to introduce some regularisation.
The Iris dataset contains 150 instances of flowers, of which the recorded features are petal length and petal width. It is used to train a binary classifier by logistic regression. The trained model yields a decision boundary in the two-dimensional space of the features. It is a straight line in that space.
There are three subspecies of instances in the dataset, and so we want to
go beyond binary classification. The method proposed is softmax
regression. For each class, a score
is calculated, as a linear
(affine) function of the features and parameters that are specific to each
class. The softmax function of these scores then predicts probabilities
for each class for any given instance. Training proceeds by minimising the
cross-entropy cost function, which depends on the actual targets
and the predicted probabilities. It is, in fact, a generalisation of the
cost function for a binary choice model to a setup with more than two
choices. Classification can be performed by
selecting the class with the greatest predicted probability.
On September 27, we covered all of Chapter 6 of Géron's book, on Decision Trees. The Iris dataset is used in order to look at decision trees. A tree consists of several nodes, one of which is the root node. This node, and all others that are not leaf nodes, split into two branches, the chosen branch depending on a criterion, specific to the node and based on just one feature. Any instance follows the branches going from the root to some leaf node. The leaf node determines the predicted probabilities for the instance for each class, and the class with the highest probability is the predicted class.
Training a decision tree can be achieved by use of the CART algorithm (Classification And Regression Tree). This makes use of the Gini impurity of each node.
An alternative to the Gini impurity is the entropy, but the choice
of which of them to use for the CART algorithm seems to matter very
little. We continued the study of Decision Trees by
seeing how they can be regularised, and how regularisation performs its
usual task of avoiding overfitting. A decision tree can be used for a
regression task as well as for a classification task, by replacing the
Gini impurity or entropy in the CART algorithm by the mean squared error
(MSE). A final point about decision trees is that they depend heavily on
the nature of the correlations among the features. Principal Components
Analysis (PCA) can be used advantageously to rotate
the features so
that they become mutually orthogonal.
Chapter 7 begins the discussion of Ensemble Methods, by which is
meant that the results of different predictors can be combined in order to
yield more precise results. This can be done by either hard or
soft voting. The former is the familiar majority voting - selecting
the prediction that gets the most votes
from the predictors; the
latter instead averages the prediction probabilities produced by the
predictors, and only then selects the class that gets the highest
averaged probability. There are very many ensemble methods, one of which
is the Random Forest, which can perform extremely well.
bootstrap aggregating. It generates a set of predictions by training the chosen algorithm not only on the training set, but on
bootstraptraining sets, which are generated by resampling with replacement from the original training set. The predictions of the bootstrap sets can then be combined by hard or soft voting. Because a bootstrap set contains only a subset of the instances in the original set, the other instances, called oob (for
out-of-bag) instances, can be used in the construction of a validation set. The most successful ensemble method so far is the Random Forest, created by bagging decision trees. An alternative to bagging is pasting, which differs only in that the resampling is done without replacement. This means that the new training sets are just permutations of the original training set, and this may have negative consequences relative to bagging.
Another set of ensemble methods is covered by the term boosting. Unlike bagging and pasting, where the new training sets can be trained simultaneously, or concurrently, the training sets generated by boosting must be trained sequentially. The method called AdaBoost, for adaptive boosting, works at each stage of an iterative procedure by weighting those instances more heavily which contributed most to the loss function on the previous stage. Once enough iterations have been performed, the results of all the stages can be combined, but with weights determined by how well they performed separately. Gradient boosting uses a different approach, in which at each stage one tries to train a model to fit the residuals of the previous stage. At the end, the predictions are summed to get the ensemble predictions.
With stacking, short for stacked generalisation, a model called a blender is trained in order to perform the aggregation of the results of all the predictors in the ensemble. Its inputs are the outputs of all these predictors, and its targets are unchanged from those of the original training set.
Since the online learning platform threw us off when I tried to move on to Chapter 8, we went on instead to the notes I had prepared on the singular value decomposition or SVD, which is the main tool used for principal components analysis or PCA, used mainly as a technique of dimension reduction. Having established the existence of the SVD, and considered some issues that arise if we also wish uniqueness, we proved some of the most important properties of the SVD, including its use in the implementation of ordinary least squares. Next, we will look at PCA.
On October 4, we continued the discussion contained in the notes on the SVD and other topics. In particular, we defined the concept of a generalised inverse, and the Moore-Penrose generalised inverse. This led on, finally, to the section on PCA. This analysis can be implemented very easily by use of the SVD, and we saw how it can be used for dimension reduction.
Since it seems that I am assaulted by gremlins at all turns, it was not possible to return to Chapter 8 of Géron's book, on dimension reduction. Why this happened remains a mystery. So, therefore, we went on to Chapter 9, on unsupervised learning.
The main idea that is exploited in unsupervised learning is that, by use of various techniques, especially clustering, we can generate labels for clusters of data. If a few instances are labelled, then we can extend the labels of these instances to all the other instances in the clusters in which they are found. The first implementation of clustering that we looked at is the k-means algorithm. Its most important hyperparameter is k, the number of clusters.
There are various ways in which k can be chosen, of which some work better than others. We can base the choice on the inertia of the clustering given with a range of values of k, or, better, by use of the silhouette score.
We continued looking at Chapter 9, on unsupervised learning, on October 16, after Thanksgiving and Reading Week. The first topic was image segmentation, where the pixels of an image are grouped into clusters, or segments, on the basis of various criteria. One obvious one is colour - the example given was an image of a flower against a background of greenery, and with a bright red bug sitting on the flower. A distinction was drawn between semantic segmentation and instance segmentation.
Next came semi-supervised learning, in which the first step is usually clustering. Labels could be assigned, essentially arbitrarily, to all instances in each cluster. Better would be the case in which there are a few labelled instances. These (few) instances can be used to train a model, which can then be used to classify all the instances in the complete training set.
Another way to proceed is to perform clustering at first as usual, and then to find a representative instance in each cluster. Then it may be possible to label the representative instances manually, and then train a model on just these. After that, one can do label propagation, by which is meant using the labels on the representative instances for all the instances in their respective clusters. More accuracy still can be achieved by omitting outliers from the training set, outliers being defined as those instances that are farthest away from the closest cluster centre.
Techniques of active learning include an interactive procedure, where human intervention provides labels for those instances about which the current model does not give a definite prediction. The density-based spatial clustering of applications with noise (DBSCAN) algorithm works well if clusters are well separated, and it requires only two hyperparameters.
Another clustering technique that has much flexibility makes use of Gaussian Mixture Models (GMM). These models have the enormous advantage of being generative, meaning that, after training, they can produce, or generate, other instances with specified features. Such models rely on a data-generation process, or DGP, which is what gives them the generative ability.
After all this, we just started on Chapter 10, where we are introduced to artificial neural networks (ANNs), which are what is used in deep learning. There was no time for anything other than a brief history of the waxing and waning of the fortunes of ANNs.
We learned much more about deep learning on October 18. The
perceptron was an early attempt to mimic the human (or animal)
brain using artificial neurons. It was shown that a perceptron of
the sort originally envisaged is unable to implement the XOR (for exclusive
or
) operation. But that problem was overcome when a multi-layer
perceptron was used. Above the input layer there is one or more
hidden layers, above which is the output layer. Géron
gives an example of such a perceptron that does indeed give the correct
output of the XOR operation.
The connections between layers are parametrised using connection weights and biases, which convert the linear functions used by a unit in a layer above the input layer into affine functions. One could have any number of layers with only linear functions taking inputs to outputs, but the end result would be equivalent to just one layer. It is necessary to act on the output of the affine functions with an activation function, which must be nonlinear. The first activation functions were step functions. But they have the disadvantage that their derivatives are zero almost everywhere.
Better choices for activation functions are the logistic or sigmoid function, the hyperbolic tangent, and, the most widely used, the ReLU, or rectified linear unit function. This last is the easiest to compute, and its derivative is either zero or one.
The output layer itself in general uses an activation function before spitting out its outputs, which are the arguments of the loss function. For a regression task, this might just be the identity function, with no transformation at all, but, for a classification task, a good choice is the softmax function, which generates a set of probabilities that can be used for either hard or soft voting to choose the category of an instance.
In order to train a deep model, there is first a feed-forward pass
through the data, in which the inputs are converted into outputs using
the affine functions in each unit (a better word than artificial
neuron
), and the activation functions. The outputs are used to
evaluate the loss function, and then the partial derivatives of the loss
function with respect to all of the parameters, the connection weights and
the biases, are computed by backpropagation, which at heart is just
the chain rule of the differential calculus. Follow this
link for a proper mathematical account of backpropagation. Then, just
as with stochastic gradient descent, the parameters are updated by
following the negative of the gradient, multiplied by the learning
rate η.
Although Scikit-Learn does have classifiers and regression models for multi-layer perceptrons, a much richer set of models is made available by Keras, an API that sits on top of tensorflow. We started on how to use Keras to set up a multi-layer perceptron that uses the Fashion MNIST dataset in order to undertake a classification task of ten sorts of images supposed to represent different items of clothing.
On October 23, we completed the study of Chapter 10 of Géron's
book. Most of it is about the APIs provided by Keras (horn
in
Ancient Greek κερας) for deep learning. The
Sequential class makes it easy to define the
architecture of an ANN. An object of this class can be constructed in two
ways, one in which the layers are set up using the
add function, the other in which the layers are passed
as a list to the Sequential constructor. After
construction, the model must be compiled (by function
compile), specifying the loss function, the optimiser
(like stochastic gradient descent), and, optionally, metrics that can be
used to evaluate the fit of the model. After compilation, the usual
fit function can be called. It generates the
history
of the fit, giving some details of the progress made at
each epoch, or pass through the data. If a validation set is
specified, its fit is also recorded.
Chapter 10 also describes the Functional API, which is declarative, like the Sequential API. Another possibility that offers almost infinite flexibility is to subclass the tf.keras.Model class, and to use the constructor to define the model architecture, and the call member function to define how the training proceeds.
Other topics covered in this chapter are TensorBoard, which provides a visual description of how training proceeds, and keras-tuner, which helps the user in fine-tuning hyperparameters.
We covered more or less everything I wanted to cover in Chapter 11 of the Hands-on book on October 25. We started with the random initialisation of the parameters (weights and biases) using either the normal or the uniform distribution, both centred at the origin and scaled inversely with the number of links in and out of a layer. After that, we were treated to a number of alternatives to the ReLU activation function, all in an attempt to avoid the vanishing-gradients problem that can arise when some units get stuck in negative numbers where the ReLU gradient is zero.
Next came batch normalisation, another technique of avoiding exploding or vanishing gradients in the hidden layers of a deep network. Another technique is gradient clipping. There followed a long list of optimisers, all based on gradient descent, but with numerous tweaks -- AdaGrad, RMSprop, Adam, Adamax, Nadam, AdamW..... One idea is momentum, and a variant of it called Nesterov momentum.
Various aspects of the important topic of transfer learning were introduced. One possibility is the use of pretrained networks; another is unsupervised learning by building up layers progressively, freezing each one before training the next. There will be other applications of the principle of transfer learning, which we can think of as learning from experience.
Then our attention went back to regularisation. After reminders of l1 and l2 regularisation, we looked at one of the most popular and powerful techniques: dropout. Although it seems at first sight that it is throwing away information, what it really does is to constitute an ensemble method, which combines the results of a huge number of submodels. Next week, we will skip Chapters 12 and 13 and move to Chapter 14, on convolutional neural nets (CNN).
We started on Chapter 14 on on October 30. A convolutional neural net is of great use in image processing, where instances have a two-dimensional structure. An image can be scanned using a kernel, which is a small rectangle that moves over the complete image, generating output based just on what is in the kernel. The kernel can move in jumps, the size of which is determined by the stride. Since the kernel could fall off the edge as it approaches the boundary of the image, there are various things that can happen. One of these is zero padding, and another is valid padding, where the size of the output image is smaller than that of the image itself.
The kernel is combined with a number of filters, for instance, horizontal and vertical filters. Each of these gives rise to a feature map that captures only some aspects of the input image, but there are usually many feature maps in each convolutional layer; often their number is an integer power of two.
Convolutional layers are often separated by a pooling layer. This loses information, but it can usefully reduce memory requirements, especially for large images, and can serve for regularisation.
Géron then provides a long list of various architectures, some of which won contests for processing various datasets. A few new ideas were introduced, such as the inception module. It is often possible to make use of a pretrained model with a prize-winning architecture, and to benefit from transfer learning. Next, we will look at examples of this for a variety of tasks.
The end of Chapter 14 on CNNs was covered quite quickly on November 1st. The main topics were identification and localisation of objects in an image using bounding boxes, semantic segmentation, and instance segmentation.
Chapter 15 is entitled Processing Sequences Using RNNs and CNNs.
A new type of neuron
, or unit, called a recurrent unit, is
introduced. The input to such a unit is usually a sequence, and, as each
element of the sequence is fed sequentially to the unit, it is combined
with something produced by the unit when the preceding element was fed in.
A sequence may be used as input into a whole layer of recurrent neurons,
and the output from the layer may be another sequence
(sequence-to-sequence), or just a single element, typically the last in a
sequence of which the other elements are discarded (sequence-to-vector).
It is also possible to output a sequence from a single vector
(vector-to-sequence), where the same vector is fed into each
recurrent unit. A brief mention was made of an encoder-decoder
network, of which much more later.
An important use of recurrent neurons is in forecasting time series. A specific example is given using data from the Chicago Transit Authority, in which it was necessary to take account of various types of seasonality. There is a very strong weekly pattern, and the suggestion is made that it can be removed by differencing the data, that is, looking at the series y(t)-y(t-7). It appeared that there was also a small amount of cyclical behaviour at the annual frequency, and it too can be removed by differencing. For forecasting, if a model is trained on the differenced data, it is necessary to add back the observations that were subtracted out in the differencing process.
This led on to discussion of various time-series models used in econometrics: ARMA, ARIMA, SARIMA models all try to take account of patterns of serial correlation in time series. Explicit use of such models, which involves choosing suitable values for a set of hyperparameters, leads to adequate, but not sensational, forecasts.
In order to train a model with a time series, one can construct instances that are consecutive subsequences, and give them targets that are subsequences one step ahead of the one that constitutes the instance. This allows the model to have some memory, although not a very long one. Methods that allow for longer memory will be discussed later.
On November 6, we began by seeing how RNNs can be used for forecasting time series. The SARIMA model was our reference, and we hope to do better than it with an RNN. The first attempt discussed has just one recurrent neuron, and it fails completely. But use of a layer of several (here 32) recurrent units leads to good results, better than those from the SARIMA model. One might think that having more than one recurrent layer would be even better, but that leads to overfitting and poorer generalisation error. What does help is a multivariate approach, in which both the bus and rail networks are forecast simultaneously.
Next we considered forecasting more than one day ahead. The best approaches are those that try to forecast results for each day in a longer horizon - here a horizon of 14 days. What seemed best was to use a sequence-to-sequence setup, where the output is a sequence of 14 elements, each for one, two, three .... steps ahead, with each element being either a scalar, if only one network is forecast, or a 2-vector, if both are forecast at the same time.
It seems that an RNN cannot cope satisfactorily with a very long series. Two types of memory cells have been shown to extend short-term memory considerably. The first is the LSTM (long short-term memory) cell; the second is the GRU (gated recurrent unit) cell, which, although it is a simplified version of the LSTM cell, seems to work equally well. After that, we saw how a CNN can be combined with an RNN, and even replace it in some circumstances.
Chapter 16 is about Natural Language Processing (NLP). The first sort of model considered is a char-RNN, a recurrent model which tries to predict the next character in a sequence of characters. All of Shakespeare's works can be downloaded, and used to train such a model. Formally, it is just the same as predicting one step ahead with a time series. The model can be used to generate fake text, which has some passing resemblance to Shakespearean English. A sequence, perhaps taken from one of Shakespeare's plays, is used to initialise the text to be generated. Then, starting from this sequence, one character at a time is appended. The best way to do this is to use a softmax activation on top of a dense layer that outputs a vector with one element for each admissible character, including spaces and punctuation. The probabilities thus obtained can then be combined with a temperature, where low temperatures lead to the next character being chosen as the most probable out of the full set, with the probabilities being made progressively more equal as the temperature rises.
Because we can have very long passages of text, it is desirable to train models that may have long memories. This can be achieved by the use of stateful RNNs, where the contents of the hidden layers are preserved from one training iteration to the next, and re-initialised only at the end of each epoch.
The next task treated in Chapter 16 is sentiment analysis. For this, we move from single characters to words. Even if we limit our vocabulary to a thousand words, it is desirable to use an embedding, where each word is represented by a vector of relatively few elements. The embedding is typically learned as training proceeds. There is a set of 50,000 movie reviews available in the Internet Movie Database (IMDb). These are either favourable or unfavourable, so that the targets for each review are binary: 0 for negative, or unfavourable, 1 for positive, or favourable. A model with a layer of GRU cells can be trained to output a binary result. It should have a dense layer on top of the GRU layer with just one output neuron, with sigmoid activation, so that we can assign probabilities to positive or negative sentiments.
On November 8 we continued our study of Chapter 16. After looking at a pretrained model for sentiment analysis, we moved on to the important task of machine translation. This involved learning about the encoder-decoder network, which has a more complicated structure than a straightforward deep model, since the encoder and the decoder are different models, which, although they interact, must be trained separately. The encoder converts its input into an embedding, which is an abstract representation of the signal, or meaning conveyed by the inputs.
The example we looked at involved translation from English to Spanish. The simple model had a very limited vocabulary for both languages, just short of 1000 words. If a substantially larger vocabulary is used, it may be necessary to use a sampled softmax layer for output, to avoid computing the softmax of thousands and thousands of words. This model cannot handle sentences of more than just a few words.
Context is all-important in translation, by machine or human. One way to enlarge context is to use a bidirectional recurrent layer. Note, though, that this is for training the encoder, not the decoder, which would learn nothing if it was allowed to cheat by looking ahead. In both training and inference, the decoder is fed just one word at a time. However, beam search is possible for the decoder. This lets it maintain various possibilities looking ahead, eliminating those that are rendered impossible as further words come in. This led to correct translation of somewhat longer sentences.
A huge step forward in machine translation, and in many other tasks, came
with attention. The slogan is All you need is attention
,
meaning that convolutional and recurrent units are no longer essential to
these tasks. Attention is implemented by use of an alignment model.
It generates probabilities that determine what inputs the main model is to
pay attention to. These may be from a long time back, so that genuine
long-term memory becomes possible, over and above what one can get from
LSTM cells or GRU cells. There exist a few different types of attention,
which use different measures of similarity between the encoder's
output and the decoder's previous hidden state.
Attention layers are a feature of the transformer architecture. Translation transforms sentences of one language into sentences in another, and so it is one example of a transformer. But there are many other tasks that can be handled by transformers. Usually an attention layer is a multi-head attention layer. This lets the transformer pay attention to more than one aspect of the input at the same time. It is important that the decoder remains causal, so that it does not cheat and learn nothing, and this is achieved by masking non-causal inputs.
In translation, word order is important, but is different for different languages. Positional encodings, of various sorts, are used to convey the ordering of words in a sentence.
On November 13, we continued our study of Chapter 16.
There has been what Géron calls an avalanche
of transformer models
following the successful use of the original transformer for machine
translation. In many cases, these models make use of self-supervised
training, something that is immensely useful whenever there are vast amounts of
unsupervised data, but only a little supervised data. Such models output
encodings that represent the main, or salient, properties of the inputs. They
can subsequently be fine-tuned for specific tasks, using whatever supervised
instances that happen to be available.
Vision transformers followed shortly after the transformers used for natural
language processing. A library of transformers is available from the
organisation called Hugging Face
.
The first major topic in Chapter 17 is autoencoders. In the most basic form, an autoencoder tries to learn the identity mapping from input to output - the input serves also as the target for training the autoencoder. This is useful, because the model has to produce an encoding in its hidden layer that is constrained to be of lower dimension than the input. This encoding should then be a concise, efficient, representation of the input, and this encoding can then be used for many purposes.
The first example is Principal Components Analysis. The data set consists of points in three-dimensional space, and the encoding is constrained to be two-dimensional. After training, the model yields the first two principal components of the 3D instances in the test set.
It can be advantageous to stack autoencoders. Such a model is used with the MNIST fashion set, and it is supposed to reproduce the input images in the output. The relatively simple autoencoder given in Géron's book does so, but not with great fidelity.
Autoencoders can also be used for unsupervised pretraining. A large data set can be used to train a stacked autoencoder, and its lower layers, leading to the encoding, can be reused to train a much smaller labelled data set for one or more specific tasks. It can be a good idea to tie the weights of the encoder and the decoder, transposing those from the encoder for use with the decoder. This saves on parameters and speeds up training.
A stacked autoencoder need not use dense layers. Especially when working with images, convolutional autoencoders can incorporate convolutional layers, interspersed with pooling layers, as in an ordinary CNN. This idea is exploited by denoising autoencoders. For training, random noise can be added to the pixels of an image, or alternatively, the input images can be subjected to dropout, so as to lower the signal-to-noise ratio. The untouched images then serve as targets to train a model to reconstruct a clean image from a noisy or blurred one. An example of this was given, where images in the MNIST fashion data set were reconstructed, with greater or lesser fidelity, from the images to which enough random noise was added as to make them largely unrecognisable.
On November 15, we explored more about autoencoders. The first variant we looked at is the sparse autoencoder, for which the encodings must be sparse in a certain sense. There are various ways of penalising non-sparsity in the loss function, in particular L1 and L2 penalties, and another one, the Kullback-Leibler divergence measuring the divergence from a target sparsity to the actual sparsity at any stage of training.
Variational autoencoders are the first ones we have seen capable of generating outputs randomly. The encoder produces two layers in parallel, one for the mean, the other for the standard deviation, of Gaussian (normal) distribution. These combine to make the encodings look like random Gaussian noise. As a result, after training, the decoder, if provided with random Gaussian input, outputs things that look somewhat like the inputs used to train the autoencoder. A variational autoencoder can also perform semantic interpolation. This could be used to let an image morph progressively into another.
Next came the topic of Generative Adversarial Networks (GAN). This
architecture is quite different from that of an autoencoder. It has two
components: the generator and the discriminator. The generator takes a random
input and learns to transform it to a fake
version of the real inputs
that are used to train the discriminator. Phase one of training supplies
labelled inputs to the discriminator - they can be real
,
labelled 1, or fake
, labelled 0. The fakes are produced by the
generator, and the discriminator learns to distinguish real from fake. But in
phase two of training, the discriminator is frozen, while the generator
produces fakes that it falsely labels as real. The discriminator does its best
to tell the generator that they are in fact fake, and so in this phase the
generator learns how best to fool the discriminator.
If all goes well, the unique Nash equilibrium may be reached, in which the generator produces output that the discriminator cannot distinguish from real input: it assigns a probability of a half to each possibility. Unfortunately, many things can go wrong. Mode collapse occurs if both agents see only one sort of output, and forget all other sorts in the training set. Instability in the process of learning weights is another danger. Our knowledge of the theoretical properties of GANs is currently insufficient to diagnose and avoid these problems, and consequently there are various recipes that may or may not help lead to successful training.
The study of this began with Deep convolutional GANs (DCGAN), which are much less unstable during training than the generic GANs we looked at earlier. They can therefore be profitably trained using several epochs, and they yield much better results if this is done. An amusing example of how this can work showed how subtracting the encodings for images of the faces of men without glasses from those of men wearing glasses, and then adding encodings of women without glasses produced images of women wearing glasses. But the arithmetic must be done in the encoding space, not the pixel space.
Among several ad hoc methods that have been found to improve the training of GANs there is a very useful one for image processing in which the GAN grows progressively by adding convolutional layers that progressively increase the size of the image. Although this is a greedy procedure, it helps to keep generated images looking like those in the training set at all sizes of the image. The rather complicated architectures of StyleGANs can produce high-resolution images that are convincing to the human eye.
On November 22, we looked again at StyleGANS. Then the last topic in Chapter 17 is Diffusion models. These are easier to train than GANs, but take longer to generate images. The idea is to train a model to reduce entropy in the following sense: one starts with a clear image, and, over a great many steps, adds noise to the pixels until the image appears to be just random noise. The training goes backwards, starting with the noisy output, and going back progressively until the original clear image is reached.
A great improvement is obtained if the whole business takes place, not in the pixel space, but in a latent space of encodings generated by an autoencoder, with decoding the final step in the process of image generation. This not only leads to better results but also speeds up training.
After training, if the model is given random noise as its input, it generates images that resemble the training images. It works extremely well with the MNIST fashion dataset.
The last chapter of Géron's book that we study is Chapter 18, on Reinforcement Learning. This topic is different in many ways from image processing and natural-language processing. There is no loss function to be used for training; instead the model is thought of as an agent, whose actions lead to rewards, which may be positive or negative. Training is to lead to finding an optimal - or at any rate good - policy. A policy specifies what actions should be taken in various states of the environment.
On November 27, we continued the study of Chapter 18. An example of a policy given by Géron is a robot vacuum cleaner, which should learn how to pick up as much dust as possible in a given amount of time.
The OpenAI Gym is a software environment that simulates various real-world environments. The one studied in this chapter is the CartPole environment, where one tries to keep a pole balanced on a cart from falling over. The environment is completely characterised by four coordinates, two for the pole and two for the cart. At each time step there are only two possible actions, namely pushing the cart to the left or to the right. A possible policy would push left when the pole is angled to the left, and similarly for pushing right. This would be a deterministic policy. It is not very successful.
A policy with random elements is more successful. At each step, the policy provides only a probability for pushing right or left, and this probability depends on the state of the environment. The policy is optimised using policy gradients. These gradients are computed after a series of episodes that try to trade off the advantages of exploration and exploitation. A deep network is trained by making use of action advantages, which explore not just immediate rewards, but a discounted sum of future rewards. Although it is easy to explore the CartPole environment quite quickly, more realistic settings, such as playing board games, have too many possibilities for them to be explored completely in finite time.
More general settings can sometimes be expressed as Markov Decision Processes. They highlight the importance of the state-action pair, which specifies the probabilities assigned to the possible states (of the environment) in the next step. A value function that returns the optimised and discounted value of being in a particular state solves the Bellman equation of dynamic programming. This equation, and also a more useful one which gives the value of a state-action pair, can be solved by a straightforward iterative procedure. This leads to an optimal policy, with no need for learning.
On November 29, we completed the study of Chapter 18, at least as far as I wanted to go with it, since much of it is just a catalogue of the models developed by DeepMind, and its successors. What has to be learned in practice is the set of transition probabilities in the Markov chain, and the rewards associated with any triple of state, action, and following state, this last being a random function of the first two, the state and the action. This learning problem is difficult.
To tackle it, the basic idea is Bellman's dynamic programming algorithms. Equations define the optimal value functions, but, to solve them, iterative procedures are necessary. But even those are insufficient for training a reinforcement learning problem. This is simply because the transition probabilities for a Markov chain are initially unknown, as are the rewards and penalties that accrue as the Markov chain evolves therough time, and so they have to be learned as part of the training process.
The temporal difference (TD) learning algorithm is the basic tool used in order to do this. It assumes that initially all that it is known is the set of possible states of the environment and actions. Exploration is therefore necessary as the first phase of learning. If actions are chosen randomly, it is possible to use an iterative procedure to learn a value function for the different states, or, more usefully, to learn a Q-value function, which ascribes values to state and action pairs. The Q-value function can be used to select the action a that in state s maximises the Q-value. This algorithm converges extremely slowly.
Rather than exploring randomly, we can use an ε-greedy policy, where actions are chosen randomly with probability ε, but by maximising the current Q-value function with the complementary probability. This provides some exploitation along with the exploration. One starts with a value of ε close to one, and progressively reduces it.
A deep Q-network (DQN) tries to perform deep Q-learning using a deep neural network. The Bellman equation is used to generate targets for a state-action pair by using the Q-values currently defined. In order to provide the instances required for training, a replay buffer is kept of past experiences - that is, initial state, action chosen, next state, reward, and, perhaps, Boolean variables for the sequence of actions being done or truncated. Once this buffer contains enough experiences, random samples of its contents are used as training instances. Even so, convergence is very slow, and can be subject to catastrophic forgetting, where subsequent experiences upset what was learned from earlier ones.
The fact that the DQN is used to make predictions and also to set its own targets means that there is a potential feedback loop in the learning process, and this can lead to instability. Towards the end of the chapter, a set of variants of the basic DQN are described that try to alleviate the dangers and slowness of learning. A number of these have had satisfactory results, and are mentioned in a long list.
That brought us to the end of the study of Géron's book. I plan to spend what remains of the term looking at GPTs (generative pretrained transformers), and the user interface to them that is provided by the celebrated ChatGPT. We started on the short book What are ChatGPT and its Friends?, by Mike Loukides.
Although November 30 was a Thursday, it was deemed by McGill to be a Monday, and so we met as usual. The entire time was spent looking at Loukides's short book, which contains a list of advantages and disadvantages associated with the use of ChatGPT. But there is another list, of the numerous friends, rivals, and competitors of ChatGPT, all of which have very similar advantages and disadvantages.
No need to spell out the advantages. While not exactly disadvantages, there are limitations. To me, at least, the most amusing is that ChatGPT cannot really do arithmetic. It is a language model, and would need a sort of calculator plugin to be able to do sums.
A clear disadvantage is cost. Ordinary mortals cannot summon the computing resources need to train a model like ChatGPT, or the real model for which ChatGPT is a user interface, namely GPT-3.5. So long as the costs are borne by the huge corporations working on AI, this need not bother us. Further, we need not, yet, worry about being replaced as human intelligences by artificial intelligences.
Last class was December 4, the day of the first big snowstorm of the
winter. We completed reading through Loukides's short book, and moved on
to the book by Olivier Caelen and Marie-Alice Blete, looking at the
chapter on a Deep Dive
into ChatGPT and friends. We were introduced
to the OpenAI playground, which provides a graphical user interface
(GUI) to ChatGPT itself, and also to several other large language models
(LLMs) maintained by OpenAI. This interface can also provide Python code
to implement the query entered into the GUI, and this code can be used,
extended as desired from the command line in a terminal, or else
incorporated into a larger Python program.
We finished up by examining the Hello world
example for the
playground. This let us explore different ways of interfacing with one of
these LLMs. It was noted that, although the cost of playing in the
playground is very small, it is positive.
Assignments:
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
https://russell-davidson.research.mcgill.ca/e706/