This page is for the course on Machine Learning
Class Notices:
The class outline is here as a PDF file, and here as HTML. Some of the information it contains is also given below.
The first text for the course is a book by Aurélien Géron, which I have found to be very useful for learning how to program machine-learning algorithms. The book was originally called Hands-On Machine Learning with Scikit-Learn and TensorFlow, and it is available from the O'Reilly website, to which I believe McGill people can get free access. However, the book has been updated, with the new title Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. Keras is a software layer that makes programming an algorithm even easier than with TensorFlow, the leading platform until very recently.
The second text for the course is Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. It is now quite old. I used to use it as the main text, and it is still useful for many theoretical considerations.
This link takes you to the website for the book. Its contents are completely available online. The book is also available in hardcover. It is published by the MIT Press.
The third text is one that I came across just recently. It is GANs in Action, by Jakub Langr and Vladimir Bok, published by Manning. I acquired it as an ebook, and I don't know if it is available in a hardcopy version. Although GANs are quite an advanced topic, the first part of this book is a rather good, non-mathematical introduction to many of the topics we will consider.
Provisionally, here is a summary of the topics I covered over the last few years, and hope to be able to cover this year as well, plus one addition, the last in the list.
This year, there is time to explore the emergence of ChatGPT, from OpenAI. We will use two texts available on the O'Reilly website. First, What are ChatGPT and its Friends?, by Mike Loukides. This is a very short book, of only 20 pages, but it sets the stage extremely well. The second text is a longer book, and so we will have time for only some of it. It is Developing Apps with GPT-4 and ChatGPT by Olivier Caelen and Marie-Alice Blete. Study of these should bring us some way towards the frontiers of this very fast-moving field.
Software:
In the last few years, I have been working with Python, an interpreted language that has, for the most part, a straightforward syntax, and can be learnt swiftly by anyone with even just a little experience of programming. (Prefer Python 3 to Python 2. The two versions are not completely inter-compatible, and Python 3 provides better functionality.)
The relative simplicity of programming in Python is probably the main reason for which Python has quite the best set of libraries for machine learning, and not just for deep learning. Although deep learning will be the main focus of the course, I plan to look at some other machine-learning techniques, for which the Python libraries are equally useful. The study of some of these other techniques reveals how much all modern machine-learning approaches have in common, despite the fact that some are much better adapted than others for specific applications.
Resources
There is a super-abundance of resources available online for studying machine learning, and for implementing it. Machine learning is often coupled with the buzzword Big Data, and this is simply because machines usually learn better if they have a lot of data available to train their algorithms. Many big datasets are available online, the best known, and probably the most comprehensive, being
Here are some of the available resources which I found useful. I will add to the list as the term proceeds.
Log of material covered:
The first class was on August 29. After a brief discussion of the history of machine learning, with perceptrons, statistical learning, and deep learning, that last being based on the notion of the multilayer perceptron, we discussed two of the main sorts of tasks that can be tackled by machine-learning algorithms: classification and regression. There are many more, but these two are the most important. The result of a classification task is an assignment of an instance to one of a finite number of categories. For instance, does an image (the instance) depict a cat, a dog, or something else? A regression task returns a number, which could perhaps be an estimate, or a forecast, of some variable based on the values of some explanatory variables, where these values constitute the instance. The familiar idea of linear regression is an example of this, and linear regression is indeed used as a machine-learning algorithm in appropriate cases.
An important kind of learning is supervised learning. The machine is trained, using a training set composed of several instances along with a label, in the case of a classification task, or a target, for a regression task. In both cases, the label or the target is the "right" answer, and the training proceeds by trying to minimise the loss function.
On September 3, we resumed study of Chapter 1 of Géron's book. In addition to supervised learning, we can also consider unsupervised learning, semi-supervised learning, and self-supervised learning. Reinforcement learning is somewhat different. It is regularly used to train robots, which get a reward when they do the right thing, and a penalty for the wrong thing.
Then we moved on to Chapter 2. This chapter is a long and somewhat laborious account of a machine-learning project in which the model tries to predict the median house price in a particular district (geographical location) in California, given various characteristics of the district, such as the median income of its inhabitants, the median age of its houses, total numbers of rooms and bedrooms, its population, and a categorical variable that expresses how desirable the district is in terms of proximity to the ocean.
A great many of the points raised in the chapter apply equally well to almost any empirical project, whether in economics (like this one) or another discipline. One thing specific to the housing project treated here is the dataset itself. Instructions are given for accessing it. While one may do so working from a computer linked to the Internet, it can also be done in the cloud, using Google Colab. Colab does not save your data, and so, if you don't want to stay online until you complete the project, you must download the data either to your own machine or to somewhere else where you can store files.
Whether you work in the cloud or on your own computer, a very convenient way of developing the project is by use of a Jupyter notebook. Such a notebook consists of cells, which may contain text, along with graphics, or Python code, or the results of running the code through the Python interpreter. The notebook serves as excellent documentation of the project, as its aims and objectives can be explained in text cells, while the code cells permit replication of the actual computation. Both text and code cells can be updated, so that mistakes can be corrected easily.
Class on September 5 was devoted to parts of Chapter 2 of the "Hands-On" book. Emphasis was put on aspects of Python syntax that are used in a typical machine-learning project, and on the various steps to be taken in getting data and handling it so as to produce a training set and a test set. Each set contains instances, each characterised by various attributes, which may be numerical or categorical.
We skipped lightly over the example in Chapter 2 that Géron treats in detail, because it is an example of linear regression. But in fact linear regression illustrates many of the aspects of supervised learning, and it allows us to compare and contrast the familiar econometric terminology and that used in the context of machine learning. What is useful for later is to see how to use Python and its numerous libraries to handle the data used by a regression and to run the regression so as to get output of various sorts.
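To connect the machine-learning vocabulary with the familiar econometric one, here is a minimal sketch (not code from the book) of fitting a linear regression with Scikit-Learn; the data are synthetic placeholders generated just for the illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic placeholder data: 500 instances, 3 numerical features
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=500)

    # Split into a training set and a test set, as in any ML project
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression()
    model.fit(X_train, y_train)              # "training" here is just OLS estimation
    print(model.coef_, model.intercept_)     # estimated parameters
    print(model.score(X_test, y_test))       # R-squared on the held-out test set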
There was no class on September 10, because I was sick.
On September 12, we returned to Chapter 2 of Géron's book.
The application used by Géron for the project is Scikit-Learn. Its API (application programming interface) is well thought-out, and is thus suitable for many small or medium-sized projects. Géron walks us through the numerous steps needed to prepare the dataset for use. After the choice of units of measurement, possible data transformations, such as using logs of some variables, cleaning the data, and dealing with missing data, the complete dataset can be split into the training set and the test set, and training can begin.
We looked at some evaluation measures for the linear regression trained on the California housing data. The most straightforward of these is root mean squared error or RMSE. By this measure, the linear regression was pretty mediocre. When a Decision Tree was used instead, it was only a very slight improvement. However, use of a Random Forest led to much more convincing results.
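As an illustration of this comparison, here is a hedged sketch using Scikit-Learn's built-in version of the California housing data (similar in spirit to, though not identical with, the dataset used in the book); the cross-validation settings and hyperparameters are arbitrary choices, not Géron's.

    import numpy as np
    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = fetch_california_housing(return_X_y=True)

    models = {
        "linear regression": LinearRegression(),
        "decision tree": DecisionTreeRegressor(random_state=42),
        "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
    }
    for name, model in models.items():
        # cross_val_score returns negative MSEs; negate and take square roots for RMSE
        scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
        print(f"{name}: mean RMSE = {np.sqrt(-scores).mean():.3f}")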
On September 17, we covered Chapter 3 of the Hands-On book, on classification. Most of this chapter is devoted to a study of the MNIST dataset of handwritten digits. Géron tells us how to download the dataset, to see its format, and to understand what the meaning of the data is. Although one would like to train an algorithm to say what digit each image in the dataset is supposed to represent, we start with a binary classifier, the aim of which is just to say whether a digit is a '5' or not.
The first classifier used is based on stochastic gradient descent. Its performance can be measured by four quantities: true and false positives, and true and false negatives. These in turn define the measures called precision and recall, and a combination of the two, called the F1 measure.
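A sketch of such a "5-or-not" classifier, using Scikit-Learn's small built-in digits dataset as a stand-in for the full MNIST data (downloading MNIST itself works the same way but takes longer):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_score, recall_score, f1_score

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)                      # binary target: is this digit a 5?

    sgd_clf = SGDClassifier(random_state=42)
    # Out-of-sample predictions via cross-validation, then the derived measures
    y_pred = cross_val_predict(sgd_clf, X, y_is_5, cv=3)
    print("precision:", precision_score(y_is_5, y_pred))
    print("recall:   ", recall_score(y_is_5, y_pred))
    print("F1:       ", f1_score(y_is_5, y_pred))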
For multiclass classification, the task can be assigned to a set of binary classifiers. There are two approaches: one-versus-the-rest (also called one-versus-all), where each possible class gets a binary classifier, and one-versus-one, where each pair of classes is assigned a classifier. The second approach requires a lot more classifiers, but each is trained on only a subset of the complete dataset.
The last sort of task considered in this Chapter is multioutput classification. The example given is that of denoising images. The model is trained using inputs that are here a selection of the handwritten digits in the MNIST dataset, to which random noise has been added. The target for each instance is the clean image, which consists of 28×28 pixels, each with a numerical value in the range 0-255 - hence multioutput classification. After training, the model should be able to clean, or denoise, images.
Chapter 4 is about Training Models. We covered most of it on September 19. Although the chapter discusses some very specific models, the general principles are widely applicable. The first model is our familiar linear regression, in which parameters are estimated by ordinary least squares (OLS). In the book, this terminology is not used; rather the criterion function to be minimised is called the mean squared error, or MSE. If the dataset is not too enormous, minimising the criterion function can be done analytically, by use of any of the numerous algorithms for OLS. One that is given privileged mention makes use of the singular value decomposition, or SVD; see this discussion.
Of more interest generally is the discussion of gradient descent. In the parameter space, the direction of steepest descent is the negative of the gradient of the criterion function. A hyperparameter needed for this method is the learning rate, which specifies how far each step should go in the direction of steepest descent. If the method is called batch gradient descent, then the gradient (the vector of partial derivatives with respect to the model parameters) is computed for the criterion function evaluated for every instance in the dataset, a potentially computationally intensive task. On the other hand, stochastic gradient descent uses the gradient of one single instance. This alleviates the computational burden, but leads to a somewhat random walk in the parameter space. What is used much more often is mini-batch gradient descent, for which the gradient is computed for a randomly chosen subset of instances. This can combine the advantages of batch and stochastic gradient descent.
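The following bare-bones NumPy sketch shows mini-batch gradient descent applied to the MSE criterion of a linear model; the learning rate, batch size, and number of epochs are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 1000, 3
    X = np.c_[np.ones(n), rng.normal(size=(n, k))]        # prepend a column of ones (bias)
    true_theta = np.array([2.0, 1.5, -1.0, 0.5])
    y = X @ true_theta + rng.normal(scale=0.1, size=n)

    eta, batch_size, n_epochs = 0.1, 32, 50               # hyperparameters (arbitrary)
    theta = rng.normal(size=k + 1)                        # random initialisation

    for epoch in range(n_epochs):
        perm = rng.permutation(n)                         # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient of the mini-batch MSE: (2/m) X'(X theta - y)
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)
            theta -= eta * grad                           # step against the gradient

    print(theta)     # should end up close to true_theta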
It is pointed out that randomness can actually help in locating the global minimum of a criterion function. Methods which make full use of randomness are simulated annealing and genetic algorithms, which simulate biological evolution. But these methods are extremely computationally intensive.
In the effort to avoid both underfitting and overfitting, we may use learning curves. A validation set is separated out from the training set, and the criterion function is evaluated after each step of training. The parameters derived from the training set are then used to evaluate the criterion on the validation set. The learning curves are then plots of the criterion function for both the training and validation sets as training proceeds. If both converge quickly to a common value, this is an indication of underfitting. Normally, even after many epochs of training, there should be a gap between what is achieved by the training set and what is achieved by the validation set, because the training inevitably learns about the noise in the training set, which is independent of the noise in the validation set.
The next topic is regularisation, for which there are a great many techniques, all intended to avoid the demon of overfitting. The first of them presented in Géron's book is ridge regression. To the usual criterion function is added a penalty term, which for this technique is proportional to the squared norm of the parameter vector. In the context of polynomial regression, it is shown that ridge regression can smooth out what would otherwise be a very unsmooth regression line, or curve.
After ridge regression, which uses an L2 penalty term, we looked at Lasso regression, which uses an L1 penalty term. Because the absolute-value function is not differentiable at the origin, lasso regression can set some parameters exactly to zero, something that does not happen with ridge regression.
On September 25, we continued our study of regularisation. In addition to ridge regression and lasso regression, there is also the Elastic net cost function, for which the penalty term is a convex combination of the L1 and L2 penalties.
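A sketch of the three penalised regressions side by side in Scikit-Learn; the synthetic data and the penalty strengths (alpha) are purely illustrative.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    # only the first three features actually matter
    y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

    for model in (Ridge(alpha=1.0),                      # L2 penalty
                  Lasso(alpha=0.1),                      # L1 penalty
                  ElasticNet(alpha=0.1, l1_ratio=0.5)):  # convex combination of the two
        model.fit(X, y)
        print(type(model).__name__, np.round(model.coef_, 2))
    # Lasso and ElasticNet typically set the irrelevant coefficients exactly to zero;
    # Ridge only shrinks them towards zero.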
Perhaps the most useful regularisation technique is early stopping, where the successive passes through the data performed by (stochastic) gradient descent are stopped, not when a minimum of the cost function is reached, but when the validation error starts to increase with successive passes.
The next topic, called logistic regression by Géron, takes us away from the linear regression model. It corresponds to the model called logit in econometrics, where it is used when the dependent variable is binary, True or False, zero or one. The cost function used for this sort of model is just the negative of the loglikelihood function for maximum likelihood estimation of a logit model. Models that can be estimated by logistic regression are called Binary Response Models. With large datasets, it can be useful to introduce some regularisation.
The Iris dataset contains 150 instances of flowers; the features used here are petal length and petal width. It is used to train a binary classifier by logistic regression. The trained model yields a decision boundary in the two-dimensional space of the features. It is a straight line in that space.
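A minimal sketch of this binary classifier (Iris virginica or not) on the two petal features; the probe flower at the end is an arbitrary example.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X = iris.data[:, 2:4]                  # petal length and petal width
    y = (iris.target == 2).astype(int)     # 1 if Iris virginica, 0 otherwise

    log_reg = LogisticRegression()
    log_reg.fit(X, y)
    # Predicted probabilities for a flower with petals 5 cm long and 2 cm wide
    print(log_reg.predict_proba([[5.0, 2.0]]))
    # The decision boundary is the straight line on which the predicted probability is 0.5,
    # i.e. where coef_[0,0]*length + coef_[0,1]*width + intercept_ = 0.
    print(log_reg.coef_, log_reg.intercept_)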
There are three subspecies of instances in the dataset, and so we want to go beyond binary classification. The method proposed is softmax regression. For each class, a score is calculated, as a linear (affine) function of the features and parameters that are specific to each class. The softmax function of these scores then predicts probabilities for each class for any given instance. Training proceeds by minimising the cross-entropy cost function, which depends on the actual targets and the predicted probabilities. It is, in fact, a generalisation of the cost function for a binary choice model to a setup with more than two choices. Classification can be performed by selecting the class with the greatest predicted probability.
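In Scikit-Learn, the same LogisticRegression class performs softmax regression when there are more than two classes (with the default lbfgs solver it uses the multinomial formulation and minimises the cross-entropy). A short sketch, with an illustrative choice of the regularisation parameter C:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    X = iris.data[:, 2:4]          # petal length and petal width
    y = iris.target                # 0 = setosa, 1 = versicolor, 2 = virginica

    softmax_reg = LogisticRegression(C=10)     # C is illustrative; smaller C = more regularisation
    softmax_reg.fit(X, y)

    print(softmax_reg.predict([[5.0, 2.0]]))          # the class with the highest probability
    print(softmax_reg.predict_proba([[5.0, 2.0]]))    # one predicted probability per class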
Next, we started on Chapter 6 of Géron's book, on Decision Trees. The Iris dataset is used in order to look at decision trees. A tree consists of several nodes, one of which is the root node. This node, and all others that are not leaf nodes, split into two branches, the chosen branch depending on a criterion, specific to the node and based on just one feature. Any instance follows the branches going from the root to some leaf node. The leaf node determines the predicted probabilities for the instance for each class, and the class with the highest probability is the predicted class.
On September 26, we continued work on Decision Trees. Training a decision tree can be achieved by use of the CART algorithm (Classification And Regression Tree). This makes use of the Gini impurity of each node.
An alternative to the Gini impurity is the entropy, but the choice of which of them to use for the CART algorithm seems to matter very little. We continued the study of Decision Trees by seeing how they can be regularised, and how regularisation performs its usual task of avoiding overfitting. A decision tree can be used for a regression task as well as for a classification task, by replacing the Gini impurity or entropy in the CART algorithm by the mean squared error (MSE). A final point about decision trees is that they depend heavily on the nature of the correlations among the features. Principal Components Analysis (PCA) can be used advantageously to rotate the features so that they become mutually orthogonal.
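A sketch of a small decision tree on the Iris petal features; max_depth acts as a regulariser, and the criterion can be switched between "gini" and "entropy".

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X, y = iris.data[:, 2:4], iris.target      # petal length and width

    tree_clf = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=42)
    tree_clf.fit(X, y)

    # Text rendering of the tree: the root node, the splits, and the leaf nodes
    print(export_text(tree_clf, feature_names=["petal length", "petal width"]))
    print(tree_clf.predict_proba([[5.0, 1.5]]))   # class probabilities at the relevant leaf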
Chapter 7 begins the discussion of Ensemble Methods, by which is meant that the results of different predictors can be combined in order to yield more precise results. This can be done by either hard or soft voting. The former is the familiar majority voting, selecting the prediction that gets the most votes from the predictors; the latter instead averages the prediction probabilities produced by the predictors, and only then selects the class that gets the highest averaged probability.
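A sketch of hard and soft voting in Scikit-Learn, with three different classifiers on a synthetic dataset; the choice of base classifiers is illustrative.

    from sklearn.datasets import make_moons
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    estimators = [("lr", LogisticRegression()),
                  ("tree", DecisionTreeClassifier(random_state=42)),
                  ("svc", SVC(probability=True, random_state=42))]  # probabilities needed for soft voting

    for voting in ("hard", "soft"):
        clf = VotingClassifier(estimators=estimators, voting=voting)
        clf.fit(X_train, y_train)
        print(voting, "voting accuracy:", clf.score(X_test, y_test))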
The term bagging is short for bootstrap aggregating. It generates a set of predictions by training the chosen algorithm not only on the training set, but on bootstrap training sets, which are generated by resampling with replacement from the original training set. The predictions of the bootstrap sets can then be combined by hard or soft voting. Because a bootstrap set contains only a subset of the instances in the original set, the other instances, called oob (for out-of-bag) instances, can be used in the construction of a validation set. The most successful ensemble method so far is the Random Forest, created by bagging decision trees. An alternative to bagging is pasting, which differs only in that the resampling is done without replacement. This means that the new training sets are just permutations of the original training set, and so are useful only if different predictors are used for the pasted training sets.
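A sketch of bagging decision trees with out-of-bag evaluation, and the Random Forest shortcut; the numbers of estimators are arbitrary.

    from sklearn.datasets import make_moons
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

    bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=200,
                                bootstrap=True,     # with replacement; False gives pasting
                                oob_score=True,     # evaluate on the out-of-bag instances
                                random_state=42)
    bag_clf.fit(X, y)
    print("bagging oob score:", bag_clf.oob_score_)

    rnd_clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
    rnd_clf.fit(X, y)
    print("random forest oob score:", rnd_clf.oob_score_)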
On October 1st, we continued study of Chapter 7. Another set of ensemble methods is covered by the term boosting. Unlike bagging and pasting, where the new training sets can be trained simultaneously, or concurrently, in parallel, the training sets generated by boosting must be trained sequentially. The method called AdaBoost, for adaptive boosting, works at each stage of an iterative procedure by weighting those instances more heavily which contributed most to the loss function on the previous stage. Once enough iterations have been performed, the results of all the stages can be combined, but with weights determined by how well they performed separately. Gradient boosting uses a different approach, in which at each stage one tries to train a model to fit the residuals of the previous stage. At the end, the predictions are summed to get the ensemble predictions.
With stacking, short for stacked generalisation, a model called a blender is trained in order to perform the aggregation of the results of all the predictors in the ensemble. Its inputs are the outputs of all these predictors, and its targets are unchanged from those of the original training set.
Since the online learning platform threw us off when I tried to move on to Chapter 8, we went on instead to the notes I had prepared on the singular value decomposition or SVD, which is the main tool used for principal components analysis or PCA, used mainly as a technique of dimension reduction. Next, we will look more closely at the existence of the SVD, and consider some issues that arise if we also wish uniqueness.
Back in Géron's Chapter 8, a few other methods of dimension reduction are discussed, but PCA remains the method of choice in most cases. A topic that is introduced is the so-called curse of dimensionality.
Chapter 9 is on unsupervised learning, an extremely important topic for all machine-learning models.
The main idea that is exploited in unsupervised learning is that, by use of various techniques, especially clustering, we can generate labels for clusters of data. If a few instances are labelled, then we can extend the labels of these instances to all the other instances in the clusters in which they are found. The first implementation of clustering that we looked at is the k-means algorithm. Its most important hyperparameter is k, the number of clusters. There are various ways in which k can be chosen, of which some work better than others.
Class on October 8 was unusual, since on-campus activities were discouraged. Some of us met as usual, and a recording was made, and uploaded to myCourses, for those who were unable to attend in person.
We continued our study of k-means clustering by looking for the best way to choose k. We can base the choice on the inertia of the clusterings obtained with a range of values of k, or, better, on the silhouette score.
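A sketch of the comparison, using synthetic blobs of points rather than any dataset from the book:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = km.fit_predict(X)
        print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={silhouette_score(X, labels):.3f}")
    # The inertia always falls as k grows; the silhouette score typically peaks
    # near the appropriate number of clusters.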
A specific example of clustering is image segmentation, where the pixels of an image are grouped into clusters, or segments, on the basis of various criteria. An obvious criterion is colour: the example given was an image of a flower against a background of greenery, with a bright red bug sitting on the flower. A distinction was drawn between semantic segmentation and instance segmentation.
Next came semi-supervised learning, in which the first step is usually clustering. Labels could be assigned, essentially arbitrarily, to all instances in each cluster. Better would be the case in which there are a few labelled instances. These (few) instances can be used to train a model, which can then be used to classify all the instances in the complete training set.
Another way to proceed is to perform clustering at first as usual, and then to find a representative instance in each cluster. Then it may be possible to label the representative instances manually, and then train a model on just these. After that, one can do label propagation, by which is meant using the labels on the representative instances for all the instances in their respective clusters. More accuracy still can be achieved by omitting outliers from the training set, outliers being defined as those instances that are farthest away from the closest cluster centre.
Techniques of active learning include an interactive procedure, where human intervention provides labels for those instances about which the current model does not give a definite prediction. The density-based spatial clustering of applications with noise (DBSCAN) algorithm works well if clusters are well separated, and it requires only two hyperparameters.
Another clustering technique that has much flexibility makes use of Gaussian Mixture Models (GMM). These models have the enormous advantage of being generative, meaning that, after training, they can produce, or generate, other instances with specified features. Such models rely on a data-generation process, or DGP, which is what gives them the generative ability.
After all this, we just started on Chapter 10, where we are introduced to artificial neural networks (ANNs), which are what is used in deep learning. There was no time for anything other than a brief history of the waxing and waning of the fortunes of ANNs.
We learned much more about deep learning on October 10. The perceptron was an early attempt to mimic the human (or animal) brain using artificial neurons. It was shown that a perceptron of the sort originally envisaged is unable to implement the XOR (exclusive or) operation. But that problem was overcome when a multi-layer perceptron was used. Above the input layer there are one or more hidden layers, above which is the output layer. Géron gives an example of such a perceptron that does indeed give the correct output of the XOR operation.
The connections between layers are parametrised using connection weights and biases, which convert the linear functions used by a unit in a layer above the input layer into affine functions. One could have any number of layers with only linear functions taking inputs to outputs, but the end result would be equivalent to just one layer. It is necessary to act on the output of the affine functions with an activation function, which must be nonlinear. The first activation functions were step functions. But they have the disadvantage that their derivatives are zero almost everywhere.
Better choices for activation functions are the logistic or sigmoid function, the hyperbolic tangent, and, the most widely used, the ReLU, or rectified linear unit function. This last is the easiest to compute, and its derivative is either zero or one.
The output layer itself in general uses an activation function before spitting out its outputs, which are the arguments of the loss function. For a regression task, this might just be the identity function, with no transformation at all, but, for a classification task, a good choice is the softmax function, which generates a set of probabilities that can be used for either hard or soft voting to choose the category of an instance.
In order to train a deep model, there is first a feed-forward pass through the data, in which the inputs are converted into outputs using the affine functions in each unit (a better word than "artificial neuron"), and the activation functions. The outputs are used to evaluate the loss function, and then the partial derivatives of the loss function with respect to all of the parameters, the connection weights and the biases, are computed by backpropagation, which at heart is just the chain rule of the differential calculus. Follow this link for a proper mathematical account of backpropagation. Then, just as with stochastic gradient descent, the parameters are updated by following the negative of the gradient, multiplied by the learning rate η.
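To make the point that backpropagation is nothing but the chain rule, here is a tiny hand-written example in NumPy: a one-hidden-layer network learning the XOR function, with the gradients computed explicitly. The architecture and training settings are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

    W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))      # input -> hidden
    W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))      # hidden -> output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    eta = 1.0                                                # learning rate

    for step in range(5000):
        # feed-forward pass
        h = sigmoid(X @ W1 + b1)               # hidden-layer activations
        out = sigmoid(h @ W2 + b2)             # network output
        # backward pass: the chain rule applied to the squared-error loss
        d_out = (out - y) * out * (1 - out)    # derivative at the output unit
        d_h = (d_out @ W2.T) * h * (1 - h)     # propagated back to the hidden layer
        # gradient-descent updates of weights and biases
        W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0, keepdims=True)
        W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0, keepdims=True)

    print(out.round(3))    # should be close to the XOR targets 0, 1, 1, 0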
After reading week, we met again on October 22, and went on with Chapter 10. Although Scikit-Learn does have classifiers and regression models for multi-layer perceptrons, a much richer set of models is made available by Keras, an API that sits on top of TensorFlow. We started on how to use Keras to set up a multi-layer perceptron that uses the Fashion MNIST dataset in order to undertake a classification task with ten sorts of images supposed to represent different items of clothing.
Keras (κέρας, "horn" in Ancient Greek) provides a very flexible API for deep learning. The Sequential class makes it easy to define the architecture of an ANN. An object of this class can be constructed in two ways, one in which the layers are set up using the add function, the other in which the layers are passed as a list to the Sequential constructor. After construction, the model must be compiled (by the function compile), specifying the loss function, the optimiser (like stochastic gradient descent), and, optionally, metrics that can be used to evaluate the fit of the model. After compilation, the usual fit function can be called. It generates the "history" of the fit, giving some details of the progress made at each epoch, or pass through the data. If a validation set is specified, its fit is also recorded.
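A sketch of the kind of model just described, for the Fashion MNIST task; the layer sizes and number of epochs are illustrative choices rather than the book's exact ones.

    import tensorflow as tf

    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    X_train, X_test = X_train / 255.0, X_test / 255.0        # scale pixel values to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),                           # 28x28 image -> vector of 784
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),     # one probability per clothing class
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="sgd",
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train, epochs=5, validation_split=0.1)
    print(model.evaluate(X_test, y_test))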
Chapter 10 also describes the Functional API, which is declarative, like the Sequential API. Another possibility that offers almost infinite flexibility is to subclass the tf.keras.Model class, but we skipped over the details of this.
Other topics covered in this chapter are TensorBoard, which provides a visual description of how training proceeds, and keras-tuner, which helps the user in fine-tuning hyperparameters. We also discussed some of the intuition about the number of hidden layers in a model and the number of units in a layer. In general, depth, that is the number of hidden layers, is preferred to width, the number of units per layer. This corresponds to how lower layers learn about coarse aspects of the instances, with higher layers refining the pattern being learned.
Then, in Chapter 11, we started with the random initialisation of the parameters (connection weights and biases) using either the normal or the uniform distribution, both centred at the origin and scaled inversely with the number of links in and out of a layer. After that, we were treated to a number of alternatives to the ReLU activation function, all in an attempt to avoid the vanishing-gradients problem that can arise when some units get stuck in negative numbers where the ReLU gradient is zero.
Next came batch normalisation, another technique of avoiding exploding or vanishing gradients in the hidden layers of a deep network. This procedure also serves for regularisation of the model.
As we saw on October 24, a cruder way to deal with the vanishing or exploding gradient problem is gradient clipping.
Next to be studied in Chapter 11 is a long list of optimisers, all based on gradient descent, but with numerous tweaks: AdaGrad, RMSProp, Adam, Adamax, Nadam, AdamW, and so on. One idea is momentum, and a variant of it called Nesterov momentum. These are all attempts to do for machine-learning algorithms what is done in more conventional optimisation procedures, where one replaces Newton's method for solving first-order conditions, which requires second-order derivatives of the cost function, with quasi-Newton methods, which rely on first-order derivatives only.
Various aspects of the important topic of transfer learning were introduced. One possibility is the use of pretrained networks; another is unsupervised learning by building up layers progressively, freezing each one before training the next. There will be other applications of the principle of transfer learning, which we can think of as learning from experience.
Then our attention went back to regularisation. After reminders of L1 and L2 regularisation, we looked at one of the most popular and powerful techniques: dropout. Although it seems at first sight that it is throwing away information, what it really does is to constitute an ensemble method, which combines the results of a huge number of submodels.
On October 29, we finished Chapter 11, with the procedure called MC dropout (MC = Monte Carlo), which improves the accuracy of a model trained with dropout when it is applied to a test set. We then skipped Chapters 12 and 13, and moved on to Chapter 14.
A convolutional neural net (CNN) is of great use in image processing, where instances have a two-dimensional structure. An image can be scanned using a kernel, which is a small rectangle that moves over the complete image, generating output based just on what is under the kernel. The kernel can move in jumps, the size of which is determined by the stride. Since the kernel could fall off the edge as it approaches the boundary of the image, there are various things that can be done. One of these is zero padding, where the image is surrounded by a border of zeros; another is "valid" padding, where the kernel is not allowed to go beyond the edge, so that the output image is smaller than the input image.
A convolutional layer in fact applies a number of different filters, each defined by its own kernel of weights: for instance, filters that respond to horizontal or to vertical features. Each filter gives rise to a feature map that captures only some aspects of the input image, and there are usually many feature maps in each convolutional layer; often their number is an integer power of two. The filters themselves are learned as training proceeds.
Convolutional layers are often separated by a pooling layer. This loses information, but it can usefully reduce memory requirements, especially for large images, and can serve for regularisation.
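A sketch of a small CNN of the sort just described, for 28x28 greyscale images; the numbers of filters and the layer arrangement are illustrative assumptions.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        # 32 filters, each with a 3x3 kernel; "same" means zero padding, stride defaults to 1
        tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),           # pooling halves height and width
        tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.summary()    # shows the shapes of the feature maps and the parameter counts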
On October 31 (Hallowe'en), we were subjected to Géron's long list of various architectures, some of which won contests for processing various datasets. A few new ideas were introduced, such as the inception module or the SE block. It is often possible to make use of a pretrained model with a prize-winning architecture, and to benefit from transfer learning. We looked at examples of this for a variety of tasks.
The end of Chapter 14 on CNNs was covered quite quickly, so as to avoid tedium. The main topics were identification and localisation of objects in an image using bounding boxes, semantic segmentation, and instance segmentation.
On November 5, we embarked on Chapter 15, entitled Processing Sequences Using RNNs and CNNs. A new type of "neuron", or unit, called a recurrent unit, is introduced. The input to such a unit is usually a sequence, and, as each element of the sequence is fed sequentially to the unit, it is combined with something produced by the unit when the preceding element was fed in. A sequence may be used as input into a whole layer of recurrent neurons, and the output from the layer may be another sequence (sequence-to-sequence), or just a single element, typically the last in a sequence of which the other elements are discarded (sequence-to-vector). It is also possible to output a sequence from a single vector (vector-to-sequence), where the same vector is fed into each recurrent unit. A brief mention was made of an encoder-decoder network, of which much more later.
An important use of recurrent neurons is in forecasting time series. A specific example is given using data from the Chicago Transit Authority, in which it was necessary to take account of various types of seasonality. There is a very strong weekly pattern, and the suggestion is made that it can be removed by differencing the data, that is, looking at the series y(t)-y(t-7). It appeared that there was also a small amount of cyclical behaviour at the annual frequency, and it too can be removed by differencing. For forecasting, if a model is trained on the differenced data, it is necessary to add back the observations that were subtracted out in the differencing process.
This led on to discussion of various time-series models used in econometrics: ARMA, ARIMA, SARIMA models all try to take account of patterns of serial correlation in time series. Explicit use of such models, which involves choosing suitable values for a set of hyperparameters, leads to adequate, but not sensational, forecasts.
In order to train a model with a time series, one can construct instances that are consecutive subsequences, and give them targets that are subsequences one step ahead of the one that constitutes the instance. This allows the model to have some memory, although not a very long one. Methods that allow for longer memory will be discussed later.
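A NumPy sketch of the windowing just described, applied to a made-up series with a weekly pattern; the window length is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(0)
    # toy series with a period-7 pattern plus noise (a stand-in for real ridership data)
    series = np.sin(np.arange(1000) * 2 * np.pi / 7) + 0.1 * rng.normal(size=1000)

    window = 56     # length of each input subsequence
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]      # target = the value one step after each window

    print(X.shape, y.shape)  # (944, 56) and (944,)
    # After adding a final axis, X has shape [batch, time steps, 1] and can be fed
    # to a recurrent layer, with y as the one-step-ahead targets.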
We skipped quickly over much of Géron's treatment of the data from Chicago, but instead talked about how these methods could be used for economic forecasting and even weather forecasting. However, it appeared that an RNN cannot cope satisfactorily with a very long series.
On November 7, we saw that two types of memory cells have been shown to extend short-term memory considerably. The first is the LSTM (long short-term memory) cell; the second is the GRU (gated recurrent unit) cell, which, although it is a simplified version of the LSTM cell, seems to work equally well. After that, we saw how a CNN can be combined with an RNN, and even replace it in some circumstances.
Chapter 16 is about Natural Language Processing (NLP). The first sort of model considered is a char-RNN, a recurrent model which tries to predict the next character in a sequence of characters. All of Shakespeare's works can be downloaded, and used to train such a model. Formally, it is just the same as predicting one step ahead with a time series. The model can be used to generate fake text, which has some passing resemblance to Shakespearean English. A sequence, perhaps taken from one of Shakespeare's plays, is used to initialise the text to be generated. Then, starting from this sequence, one character at a time is appended. The best way to do this is to use a softmax activation on top of a dense layer that outputs a vector with one element for each admissible character, including spaces and punctuation. The probabilities thus obtained can then be combined with a temperature, where low temperatures lead to the next character being chosen as the most probable out of the full set, with the probabilities being made progressively more equal as the temperature rises.
Because we can have very long passages of text, it is desirable to train models that may have long memories. This can be achieved by the use of stateful RNNs, where the contents of the hidden layers are preserved from one training iteration to the next, and re-initialised only at the end of each epoch.
The next task treated in Chapter 16 is sentiment analysis. For this, we move from single characters to words. Even if we limit our vocabulary to a thousand words, it is desirable to use an embedding, where each word is represented by a vector in a Euclidean space, of which the dimension is a hyperparameter. The embedding is typically learned as training proceeds. There is a set of 50,000 movie reviews available in the Internet Movie Database (IMDb). These are either favourable or unfavourable, so that the targets for each review are binary: 0 for negative, or unfavourable, 1 for positive, or favourable. A model with a layer of GRU cells can be trained to output a binary result. It should have a dense layer on top of the GRU layer with just one output neuron, with sigmoid activation, so that we can assign probabilities to positive or negative sentiments.
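A sketch of such a model in Keras; the vocabulary size, embedding dimension, and GRU size are illustrative hyperparameters, and the tokenised IMDb reviews would be supplied to fit separately.

    import tensorflow as tf

    vocab_size, embed_dim = 1000, 128

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None,), dtype="int32"),   # sequences of word indices
        tf.keras.layers.Embedding(vocab_size, embed_dim),      # learned word embeddings
        tf.keras.layers.GRU(128),                              # keeps only its last hidden state
        tf.keras.layers.Dense(1, activation="sigmoid"),        # probability that the review is positive
    ])
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.summary()
    # model.fit(...) would then be called with the tokenised reviews and their 0/1 labels.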
Since training a model on all of Shakespeare's works takes a long time, it is preferable to use a pretrained model. After looking at a pretrained model for sentiment analysis, we moved on to the important task of machine translation. This involved learning about the encoder-decoder network, which has a more complicated structure than a straightforward deep model, since the encoder and the decoder are different models, which, although they interact, must be trained separately. The encoder converts its input into an embedding, which is an abstract representation of the signal, or meaning conveyed by the inputs.
The example we looked at involved translation from English to Spanish. The simple model had a very limited vocabulary for both languages, just short of 1000 words. If a substantially larger vocabulary is used, it may be necessary to use a sampled softmax layer for output, to avoid computing the softmax of thousands and thousands of words. This model cannot handle sentences of more than just a few words.
Context is all-important in translation, by machine or human. On November 12, we saw that one way to enlarge context is to use a bidirectional recurrent layer. Note, though, that this is for training the encoder, not the decoder, which would learn nothing if it was allowed to cheat by looking ahead. In both training and inference, the decoder is fed just one word at a time. However, beam search is possible for the decoder. This lets it maintain various possibilities looking ahead, eliminating those that are rendered impossible as further words come in. This led to correct translation of somewhat longer sentences.
A huge step forward in machine translation, and in many other tasks, came with attention. The slogan is "Attention is all you need", meaning that convolutional and recurrent units are no longer essential to these tasks. Attention is implemented by use of an alignment model. It generates probabilities that determine what inputs the main model is to pay attention to. These may be from a long time back, so that genuine long-term memory becomes possible, over and above what one can get from LSTM cells or GRU cells. There exist a few different types of attention, which use different measures of similarity between the encoder's output and the decoder's previous hidden state.
Attention layers are a feature of the transformer architecture. Translation transforms sentences of one language into sentences in another, and so it is one example of a transformer. But there are many other tasks that can be handled by transformers. Usually an attention layer is a multi-head attention layer. This lets the transformer pay attention to more than one aspect of the input at the same time. It is important that the decoder remains causal, so that it does not cheat and learn nothing, and this is achieved by masking non-causal inputs.
In translation, word order is important, but is different for different languages. Positional encodings, of various sorts, are used to convey the ordering of words in a sentence.
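As a small illustration of the central building block, here is a sketch using Keras's built-in multi-head attention layer for self-attention over a batch of randomly generated stand-in token embeddings; all the shapes are arbitrary examples.

    import tensorflow as tf

    batch, seq_len, d_model = 4, 10, 64
    x = tf.random.normal((batch, seq_len, d_model))      # stand-in for embedded tokens

    mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=d_model // 8)
    # Self-attention: queries, keys and values all come from the same sequence
    y = mha(query=x, value=x, key=x)
    print(y.shape)                                       # (4, 10, 64), the same as the input
    # In a decoder, a causal mask would also be applied, so that position t
    # cannot attend to positions later than t.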
There has been what Géron calls an "avalanche" of transformer models following the successful use of the original transformer for machine translation. In many cases, these models make use of self-supervised training, something that is immensely useful whenever there are vast amounts of unsupervised data, but only a little supervised data. Such models output encodings that represent the main, or salient, properties of the inputs. They can subsequently be fine-tuned for specific tasks, using whatever supervised instances happen to be available.
We resumed the study of transformers on November 14. Vision transformers work by breaking an image down into smaller patches, for instance 16×16 squares. Attention mechanisms then focus attention on specific regions of a 2-dimensional image. We looked briefly at all of the resources made available by the organisation called Hugging Face.
The first major topic in Chapter 17 is autoencoders. In the most basic form, an autoencoder tries to learn the identity mapping from input to output - the input serves also as the target for training the autoencoder. This is useful, because the model has to produce an encoding in its hidden layer that is constrained to be of lower dimension than the input. This encoding should then be a concise, efficient, representation of the input, and this encoding can then be used for many purposes.
The first example is Principal Components Analysis. The data set consists of points in three-dimensional space, and the encoding is constrained to be two-dimensional. After training, the model yields the first two principal components of the 3D instances in the test set.
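A sketch of such a linear autoencoder; the data-generating process is made up for the illustration, and the number of training epochs is arbitrary.

    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(42)
    # 3D points lying close to a 2D plane
    latent = rng.normal(size=(1000, 2))
    X = (latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(1000, 3))).astype("float32")

    encoder = tf.keras.Sequential([tf.keras.layers.Input(shape=(3,)),
                                   tf.keras.layers.Dense(2)])     # linear: no activation
    decoder = tf.keras.Sequential([tf.keras.layers.Input(shape=(2,)),
                                   tf.keras.layers.Dense(3)])
    autoencoder = tf.keras.Sequential([encoder, decoder])

    autoencoder.compile(loss="mse", optimizer="adam")
    # The inputs serve as their own targets: reproduce X from a 2D encoding
    autoencoder.fit(X, X, epochs=100, verbose=0)

    codings = encoder.predict(X)
    print(codings.shape)     # (1000, 2): an encoding spanning (roughly) the same plane
                             # as the first two principal components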
It can be advantageous to stack autoencoders. Such a model is used with the MNIST fashion set, and it is supposed to reproduce the input images in the output. The relatively simple autoencoder given in Géron's book does so, but not with great fidelity.
Autoencoders can also be used for unsupervised pretraining. A large data set can be used to train a stacked autoencoder, and its lower layers, leading to the encoding, can be reused to train a much smaller labelled data set for one or more specific tasks. It can be a good idea to tie the weights of the encoder and the decoder, transposing those from the encoder for use with the decoder. This saves on parameters and speeds up training.
A stacked autoencoder need not use dense layers. Especially when working with images, convolutional autoencoders can incorporate convolutional layers, interspersed with pooling layers, as in an ordinary CNN. This idea is exploited by denoising autoencoders. For training, random noise can be added to the pixels of an image, or alternatively, the input images can be subjected to dropout, so as to lower the signal-to-noise ratio. The untouched images then serve as targets to train a model to reconstruct a clean image from a noisy or blurred one. An example of this was given, where images in the MNIST fashion data set were reconstructed, with greater or lesser fidelity, from the images to which enough random noise was added as to make them largely unrecognisable.
Another variant is the sparse autoencoder, for which the encodings must be sparse in a certain sense. There are various ways of penalising non-sparsity in the loss function, in particular L1 and L2 penalties, and another one, the Kullback-Leibler divergence measuring the divergence from a target sparsity to the actual sparsity at any stage of training.
Next came the topic of Generative Adversarial Networks (GANs). This architecture is quite different from that of an autoencoder. It has two components: the generator and the discriminator. The generator takes a random input and learns to transform it to a "fake" version of the real inputs that are used to train the discriminator. Phase one of training supplies labelled inputs to the discriminator: they can be "real", labelled 1, or "fake", labelled 0. The fakes are produced by the generator, and the discriminator learns to distinguish real from fake. But in phase two of training, the discriminator is frozen, while the generator produces fakes that it falsely labels as real. The discriminator does its best to tell the generator that they are in fact fake, and so in this phase the generator learns how best to fool the discriminator.
If all goes well, the unique Nash equilibrium may be reached, in which the generator produces output that the discriminator cannot distinguish from real input: it assigns a probability of a half to each possibility. Unfortunately, many things can go wrong. Mode collapse occurs if the generator ends up producing only one sort of output, forgetting all the other sorts represented in the training set. Instability in the process of learning the weights is another danger. Our knowledge of the theoretical properties of GANs is currently insufficient to diagnose and avoid these problems, and consequently there are various recipes that may or may not help lead to successful training.
The study of this began with Deep convolutional GANs (DCGAN), which are much less unstable during training than the generic GANs we looked at earlier. They can therefore be profitably trained using several epochs, and they yield much better results if this is done. An amusing example of how this can work showed how subtracting the encodings for images of the faces of men without glasses from those of men wearing glasses, and then adding encodings of women without glasses produced images of women wearing glasses. But the arithmetic must be done in the encoding space, not the pixel space.
Among several ad hoc methods that have been found to improve the training of GANs there is a very useful one for image processing in which the GAN grows progressively by adding convolutional layers that progressively increase the size of the image. Although this is a greedy procedure, it helps to keep generated images looking like those in the training set at all sizes of the image. The rather complicated architectures of StyleGANs can produce high-resolution images that are convincing to the human eye.
The last topic in Chapter 17 is Diffusion models. These are easier to train than GANs, but take longer to generate images. The idea is to train a model to reduce entropy in the following sense: one starts with a clear image, and, over a great many steps, adds noise to the pixels until the image appears to be just random noise. The training goes backwards, starting with the noisy output, and going back progressively until the original clear image is reached.
A great improvement is obtained if the whole business takes place, not in the pixel space, but in a latent space of encodings generated by an autoencoder, with decoding the final step in the process of image generation. This not only leads to better results but also speeds up training.
After training, if the model is given random noise as its input, it generates images that resemble the training images. It works extremely well with the MNIST fashion dataset.
November 21 was the day of the Palestinian protest and the fire alarm. We got back to class in time for a little bit of work, starting on the new text by Olivier Caelen and Marie-Alice Blete. We read swiftly through the first chapter, which recounts some of the history of the development by OpenAI of versions of GPT and the interface ChatGPT. Of special interest was the way in which GPT-3 was transformed into InstructGPT, at the cost of a lot of human intervention and use of reinforcement learning, so that responses to questions, or prompts, submitted to GPT-3 could actually be answers to the questions. Next week, we will look at how GPT-4, the latest version of OpenAI's large language model (LLM), can be used for a great number of purposes.
On November 26, we were introduced to the OpenAI playground, which provides a graphical user interface (GUI) to ChatGPT itself, and also to several other large language models (LLMs) maintained by OpenAI. This interface can also provide Python code to implement the query entered into the GUI, and this code can be used, extended as desired from the command line in a terminal, or else incorporated into a larger Python program.
The "Hello world" example for the playground is quite unlike the usual idea of making the program say "Hello World" to you. Here the idea is to look for a completion of a conversation beginning with "Hello world". But this simple example let us explore different ways of interfacing with one of these LLMs. It was noted that, although the cost of playing in the playground is very small, it is positive.
On November 28 (American Thanksgiving) we continued our "Deep Dive" into ChatGPT and other LLMs maintained by OpenAI. The API lets us incorporate into our own code most of the things that can be done using the Chat interface, and probably more. We saw the distinction made between chat completion and text completion.
Then we embarked on a short book What are ChatGPT and its Friends?, by Mike Loukides. It contains a list of advantages and disadvantages associated with the use of ChatGPT. But there is another list, of the numerous friends, rivals, and competitors of ChatGPT, all of which have very similar advantages and disadvantages.
No need to spell out the advantages. While not exactly disadvantages, there are limitations. To me, at least, the most amusing is that ChatGPT cannot really do arithmetic. It is a language model, and would need a sort of calculator plugin to be able to do sums.
A clear disadvantage is cost. Ordinary mortals cannot summon the computing resources needed to train a model like ChatGPT, or the real models for which ChatGPT is a user interface, including GPT-3.5, GPT-3.5-turbo, GPT-4, and GPT-4o. So long as the costs are borne by the huge corporations working on AI, this need not bother us. Further, we need not, yet, worry about being replaced as human intelligences by artificial intelligences.
The last class was on December 3. We began with looking a little more at the writings of Mike Loukides, not only on ChatGPT and friends, but also his "Radar Trends" for November. He gives a link to a couple of videos showing a robot playing table tennis, quite successfully, against a male human opponent. I found these videos amusing.
The course was brought to an end by letting us interact with ChatGPT, trying to find either text questions or image descriptions that the model couldn't handle. We didn't have a lot of success, but the challenge to give an image of the McGill campus didn't succeed at first.
Assignments:
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
https://russell-davidson.research.mcgill.ca/e706/