This page is for the course on Machine Learning
Above are the new pictures generated for this course by ChatGPT. I think I prefer the old one!
Class Notices:
The class outline is here as a PDF file, and here as HTML. Some of the information it contains is also given below.
The first text for the course is a book by Aurélien Géron, which I have found to be very useful for learning how to program machine-learning algorithms. The book was originally called Hands-on Machine Learning with Scikit-Learn and TensorFlow, and it is available from the O'Reilly website, to which I believe McGill people can get free access. However, the book has been updated, with the new title Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. Keras is a software layer that makes programming an algorithm even easier than with TensorFlow, until very recently the leading platform.
The second text for the course is Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. It is now quite old. I used to use it as the main text, and it is still useful for many theoretical considerations.
This link takes you to the website for the book. Its contents are completely available online. The book is also available in hardcover. It is published by the MIT Press.
The third text is one that I came across just recently. It is GANs in Action, by Jakub Langr and Vladimir Bok, Manning. I acquired it as an ebook, and I don't know if it is available in a hardcopy version. Although GANs are a quite advanced topic, the first part of this book is a rather good, non-mathematical introduction to many of the topics we will consider.
Provisionally, here is a summary of the topics I covered over the last few years, and hope to be able to cover this year as well, plus one addition, the last in the list.
Software:
In the last few years, I have been working with Python, an interpreted language that has, for the most part, a straightforward syntax, and can be learnt swiftly by anyone with even just a little experience of programming. (Prefer Python 3 to Python 2. The two versions are not completely inter-compatible, and Python 3 provides better functionality.)
The relative simplicity of programming in Python is probably the main reason for which Python has quite the best set of libraries for machine learning, and not just for deep learning. Although deep learning will be the main focus of the course, I plan to look at some other machine-learning techniques, for which the Python libraries are equally useful. The study of some of these other techniques reveals how much all modern machine-learning approaches have in common, despite the fact that some are much better adapted than others for specific applications.
Resources
There is a super-abundance of resources available online for studying machine learning, and for implementing it. Machine learning is often coupled with the buzzword "Big Data", and this is simply because machines usually learn better if they have a lot of data available to train their algorithms. Many big datasets are available online, the best known, and probably the most comprehensive, being
Here are some of the available resources which I found useful. I will add to the list as the term proceeds.
Log of material covered:
The first class was on August 28. We began with an overview of what will be covered in the course, along with a brief discussion of the history of machine learning and artificial intelligence. A look at the course outline allowed us to see a considerable number of machine architectures that can be used for various specific problems.
We looked very briefly at the preface of Géron's book, where he mentions many of the now everyday operations carried out by trained machines. The importance of the Python programming language was emphasised for accessing the numerous libraries that can be used in order to train models from datasets.
On September 2, we discussed two of the main sorts of tasks that can be tackled by machine-learning algorithms: classification and regression. There are many more, but these two are the most important. The result of a classification task is an assignment of an instance to one of a finite number of categories. For instance, does an image (the instance) depict a cat, a dog, or something else? A regression task returns a number, which could be an estimate, or a forecast, of some variable based on the values of some explanatory variables, where these values constitute the instance. The familiar idea of linear regression is an example of this, and linear regression is indeed used as a machine-learning algorithm in appropriate cases.
An important kind of learning is supervised learning. The machine is trained using a training set composed of several instances, each accompanied by a label, in the case of a classification task, or a target, for a regression task. In both cases, the label or the target is the "right" answer, and training proceeds by trying to minimise the loss function.
In addition to supervised learning, we can also consider unsupervised learning, semi-supervised learning, and self-supervised learning. Reinforcement learning is somewhat different. It is regularly used to train robots, which get a reward when they do the right thing, and a penalty for the wrong thing.
In his first chapter, Géron provides a long list of recent applications of machine learning, including image classification and natural language processing. We briefly discussed some of these.
We looked at the first bit of Python code provided by Géron, for running a linear regression. This gave us the opportunity to see some aspects of Python syntax, which flow naturally from the fact that, in Python, everything is an object.
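As a rough illustration of the same idea (a minimal sketch, not Géron's actual code, with made-up numbers), fitting a linear regression with Scikit-Learn looks like this:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data, purely for illustration: one explanatory variable, one target
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # instances, one feature each
    y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])             # targets

    model = LinearRegression()     # the model, like everything in Python, is an object
    model.fit(X, y)                # training: estimate intercept and slope
    print(model.intercept_, model.coef_)
    print(model.predict([[6.0]]))  # prediction for a new instance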
We completed the study of Chapter 1 of Géron's book on September 4. This involved discussing over- and under-fitting, separating the training set and the test set, using validation techniques in order to choose good settings for hyperparameters. Regularisation is a general term to describe imposing constraints on a model, so as to avoid over-fitting.
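To illustrate the separation of the training set from the test set, and the use of cross-validation to choose a hyperparameter, here is a sketch on artificial data, with ridge regression standing in as an arbitrary model that has a regularisation hyperparameter:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                                   # artificial data
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

    # Keep the test set aside; it is used only for the final evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Cross-validation on the training set compares hyperparameter settings
    for alpha in (0.01, 0.1, 1.0, 10.0):
        scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
        print(alpha, scores.mean())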
A long list of aspects of data sets that make them unsuitable for machine learning was given. Data may be unrepresentative or biased. It is important to have as much usable data as possible, but care must be taken to make the data appropriate for machine-learning algorithms.
Then we started on Chapter 2, in which we are taken, step by step, through an artificial project that uses data on house prices in California as determined in the 1990 census. After a first phase in which aims and objects are defined, and the data are obtained and downloaded, as much qualitative information as can be readily seen is used in the process of readying the data for training.
On September 9, we went on with Chapter 2 of the textbook. The aim of the artificial project is to predict the median house price in a particular district (geographical location) in California given various characteristics of the house, such as floor area, number of bedrooms, number of bathrooms, and a categorical variable that expresses how desirable the district is in terms of proximity to the ocean and a few other features.
A great many of the points raised in the chapter apply equally well to almost any empirical project, whether in economics (like this one) or another discipline. One thing specific to the housing project treated here is the dataset itself. Instructions are given for accessing it. While one may do so working from a computer linked to the Internet, it can also be done in the cloud, using Google Colab. Colab does not save your data, and so, if you don't want to stay online until you complete the project, you must download the data either to your own machine or to somewhere else where you can store files.
Whether you work in the cloud or on your own computer, a very convenient way of developing the project is by use of a Jupyter notebook. Such a notebook is composed of cells, which may contain text, along with graphics, or Python code, or the results of running the code through the Python interpreter. The notebook serves as excellent documentation of the project, as its aims and objects can be explained in text cells, while the code cells permit replication of the actual computation. Both text and code cells can be updated, so that mistakes can be corrected easily.
Emphasis was put on aspects of Python syntax that are used in a typical machine-learning project, and on the various steps to be taken in getting data and handling them so as to produce a training set and a test set. Each set contains instances, each characterised by various attributes, which may be numerical or categorical.
We are skipping lightly over the example in Chapter 2 that Géron treats in detail, because it is an example of linear regression. But in fact linear regression illustrates many of the aspects of supervised learning, and it allows us to compare and contrast the familiar econometric terminology and that used in the context of machine learning. What is useful for later is to see how to use Python and its numerous libraries to handle the data used by a regression and to run the regression so as to get output of various sorts.
We spent some time on September 11 on Géron's example using housing data from California. The application used by Géron for the project is Scikit-Learn. Its API (application programming interface) is well thought-out, and is thus suitable for many small or medium-sized projects. Géron walks us through the numerous steps needed to prepare the dataset for use. After the choice of units of measurement, possible data transformations (such as using logs of some variables), cleaning the data, and dealing with missing data, the complete data set can be split into the training set and the test set, and training can begin.
We looked at some evaluation measures for the linear regression trained on the California housing data. The most straightforward of these is root mean squared error or RMSE.
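The flavour of the data-preparation pipeline, and of the RMSE computed on held-out data, can be conveyed by a sketch like the following. The column names ("rooms", "income", "ocean", "price") and the few rows of data are hypothetical, not the actual California housing variables:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical data set: two numerical attributes, one categorical, one target
    df = pd.DataFrame({
        "rooms":  [4, 5, np.nan, 3, 6, 5, 4, 7],
        "income": [2.5, 3.1, 1.8, 2.2, 4.0, 3.3, 2.9, 4.5],
        "ocean":  ["NEAR", "INLAND", "INLAND", "NEAR", "NEAR", "INLAND", "NEAR", "INLAND"],
        "price":  [210, 250, 150, 180, 320, 260, 230, 360],
    })
    X, y = df.drop(columns="price"), df["price"]

    # Impute and scale the numerical attributes, one-hot encode the categorical one
    prep = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["rooms", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean"]),
    ])

    model = Pipeline([("prep", prep), ("reg", LinearRegression())])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print("RMSE:", rmse)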
We now skipped over a number of chapters to Chapter 6, on Decision Trees. The Iris dataset is used in order to look at decision trees. A tree consists of several nodes, one of which is the root node. This node, and all others that are not leaf nodes, split into two branches, the chosen branch depending on a criterion, specific to the node and based on just one feature. Any instance follows the branches going from the root to some leaf node. The leaf node determines the predicted probabilities for the instance for each class, and the class with the highest probability is the predicted class.
On September 16, we continued work on Decision Trees. Training a decision tree can be achieved by use of the CART algorithm (Classification And Regression Tree). This makes use of the Gini impurity of each node.
An alternative to the Gini impurity is the entropy, but the choice of which of them to use for the CART algorithm seems to matter very little. We continued the study of Decision Trees by seeing how they can be regularised, and how regularisation performs its usual task of avoiding overfitting. A decision tree can be used for a regression task as well as for a classification task, by replacing the Gini impurity or entropy in the CART algorithm by the mean squared error (MSE). A final point about decision trees is that they depend heavily on the nature of the correlations among the features. Principal Components Analysis (PCA), to be discussed later, can be used advantageously in this context. See here for the mathematical background.
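A minimal sketch of a decision tree trained on the Iris data with Scikit-Learn, showing the choice of splitting criterion and max_depth as a regulariser (the particular settings are arbitrary, not those in the book):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Scikit-Learn trees use the CART algorithm; the criterion can be "gini"
    # (the default) or "entropy", and max_depth acts as a regulariser
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
    tree.fit(X_train, y_train)

    print(tree.score(X_test, y_test))        # accuracy on the test set
    print(tree.predict_proba(X_test[:1]))    # class probabilities from the leaf node reached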
Chapter 7 begins the discussion of Ensemble Methods, by which is meant that the results of different predictors can be combined in order to yield more precise results. This can be done by either hard or soft voting. The former is the familiar majority voting - selecting the prediction that gets the most "votes" from the predictors; the latter instead averages the prediction probabilities produced by the predictors, and only then selects the class that gets the highest averaged probability.
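A sketch of hard and soft voting with Scikit-Learn's VotingClassifier (the three predictors and the artificial "moons" data are arbitrary choices for illustration):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Three rather different predictors, combined by voting
    estimators = [("lr", LogisticRegression()),
                  ("tree", DecisionTreeClassifier(max_depth=4)),
                  ("svc", SVC(probability=True))]   # probabilities needed for soft voting

    for kind in ("hard", "soft"):
        ensemble = VotingClassifier(estimators, voting=kind)
        ensemble.fit(X_train, y_train)
        print(kind, ensemble.score(X_test, y_test))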
The term bagging is short for "bootstrap aggregating". It generates a set of predictions by training the chosen algorithm not only on the training set, but on "bootstrap" training sets, which are generated by resampling with replacement from the original training set. The predictions of the bootstrap sets can then be combined by hard or soft voting. Because a bootstrap set contains only a subset of the instances in the original set, the other instances, called oob (for "out-of-bag") instances, can be used in the construction of a validation set. The most successful ensemble method so far is the Random Forest, created by bagging decision trees. An alternative to bagging, called pasting, is to train on random subsets of the original training set, drawn without replacement.
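Bagging with an out-of-bag score, and a Random Forest, might be sketched as follows (again on artificial data, with arbitrary settings):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Bagging: each tree is trained on a bootstrap resample of the training set;
    # oob_score=True uses the out-of-bag instances as a built-in validation set
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            bootstrap=True, oob_score=True, random_state=42)
    bag.fit(X_train, y_train)
    print("oob estimate:", bag.oob_score_, " test:", bag.score(X_test, y_test))

    # A Random Forest is, in essence, bagged decision trees with extra randomness
    forest = RandomForestClassifier(n_estimators=200, random_state=42)
    forest.fit(X_train, y_train)
    print("forest test:", forest.score(X_test, y_test))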
On September 23, we continued study of Chapter 7. Another set of ensemble methods is covered by the term boosting. Unlike bagging and pasting, where the new training sets can be trained simultaneously, or concurrently, in parallel, the training sets generated by boosting must be trained sequentially. The method called AdaBoost, for adaptive boosting, works at each stage of an iterative procedure by weighting those instances more heavily which contributed most to the loss function on the previous stage. Once enough iterations have been performed, the results of all the stages can be combined, but with weights determined by how well they performed separately. Gradient boosting uses a different approach, in which at each stage one tries to train a model to fit the residuals of the previous stage. At the end, the predictions are summed to get the ensemble predictions.
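Both flavours of boosting are available in Scikit-Learn; a sketch, again with arbitrary settings and artificial data:

    from sklearn.datasets import make_moons
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # AdaBoost: each new predictor concentrates on the instances its predecessors got wrong
    ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
    ada.fit(X_train, y_train)
    print("AdaBoost:", ada.score(X_test, y_test))

    # Gradient boosting: each new tree is fitted to the residual errors of the ensemble so far
    gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
    gb.fit(X_train, y_train)
    print("Gradient boosting:", gb.score(X_test, y_test))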
With stacking, short for stacked generalisation, a model called a blender is trained in order to perform the aggregation of the results of all the predictors in the ensemble. Its inputs are the outputs of all these predictors, and its targets are unchanged from those of the original training set.
We then went back to Chapter 3, on classification. Most of this chapter is devoted to a study of the MNIST dataset of handwritten digits. Géron tells us how to download the dataset, see its format, and understand what the data mean. Although one would ultimately like to train an algorithm to say what digit each image in the dataset is supposed to represent, we start with a binary classifier, the aim of which is just to say whether a digit is a '5' or not.
The first classifier used is based on stochastic gradient descent. Its performance can be measured by four quantities: true and false positives, and true and false negatives. These in turn define the measures called precision and recall, and a combination of the two, called the F1 measure.
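The same ideas in code, as a sketch only: here Scikit-Learn's small built-in digits dataset (8×8 images) stands in for the full MNIST set that Géron uses, to avoid a large download:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
    from sklearn.model_selection import cross_val_predict

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)                      # binary target: is the digit a '5' or not?

    sgd = SGDClassifier(random_state=42)   # linear classifier trained by stochastic gradient descent
    y_pred = cross_val_predict(sgd, X, y_is_5, cv=3)

    print(confusion_matrix(y_is_5, y_pred))          # true/false positives and negatives
    print("precision:", precision_score(y_is_5, y_pred))
    print("recall:   ", recall_score(y_is_5, y_pred))
    print("F1:       ", f1_score(y_is_5, y_pred))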
We went quickly through the rest of Chapter 3 on September 25. For multiclass classification, the task can be assigned to a set of binary classifiers. There are two approaches: one-versus-the-rest (or one-versus-all), where each possible class gets its own binary classifier, and one-versus-one, where each pair of classes is assigned a classifier. The second approach requires many more classifiers, but each is trained on only a subset of the complete dataset.
The last sort of task considered in this Chapter is multioutput classification. The example given is that of denoising images. The model is trained using inputs that are here a selection of the handwritten digits in the MNIST dataset, to which random noise has been added. The target for each instance is the clean image, which consists of 28×28 pixels, each with a numerical value in the range 0-255 - hence multioutput classification. After training, the model should be able to clean, or denoise, images.
Next we moved on to Chapter 8, on dimensionality reduction. This is made almost essential on account of the curse of dimensionality. After briefly discussing orthogonal projections, we talked about Principal Components Analysis, or PCA. This depends on the singular value decomposition or SVD, which is dealt with more mathematically in the notes I prepared on the topic.
These notes took up some time on September 30. The first topic treated in the notes is the singular value decomposition. An m × n matrix X is expressed as the product of three matrices: UWV⊤, where U and V have orthonormal columns and W is diagonal with non-negative diagonal elements. Existence is proved, but not uniqueness.
Next, we defined the concept of a generalised inverse, and the Moore-Penrose generalised inverse. This led on, finally, to the section on PCA. This analysis can be implemented very easily by use of the SVD, and we saw how it can be used for dimension reduction.
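A sketch, on artificial data, of how the SVD delivers the principal components (the data must be centred first):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # artificial correlated data

    Xc = X - X.mean(axis=0)                               # centre the data
    U, w, Vt = np.linalg.svd(Xc, full_matrices=False)     # Xc = U diag(w) V'

    # The rows of Vt are the principal directions; project onto the first two
    X_reduced = Xc @ Vt[:2].T

    # Scikit-Learn's PCA does the same thing internally via the SVD
    X_reduced_skl = PCA(n_components=2).fit_transform(Xc)
    print(np.allclose(np.abs(X_reduced), np.abs(X_reduced_skl)))   # equal up to signs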
After all this, we started on Chapter 10, where we are introduced to artificial neural networks (ANNs), which are what is used in deep learning. The perceptron was an early attempt to mimic the human (or animal) brain using artificial neurons. It was shown that a perceptron of the sort originally envisaged is unable to implement the XOR (for "exclusive or") operation.
The connections between layers are parametrised using connection weights and biases, which convert the linear functions used by a unit in a layer above the input layer into affine functions. One could have any number of layers with only linear functions taking inputs to outputs, but the end result would be equivalent to just one layer. It is necessary to act on the output of the affine functions with an activation function, which must be nonlinear. The first activation functions were step functions. But they have the disadvantage that their derivatives are zero almost everywhere.
On October 2nd, Géron showed us a multi-layer perceptron or MLP that does implement the XOR operation. It is of the simplest type: Above the input layer there is a hidden layer, above which is the output layer. In general, there may be more than one hidden layer.
Then we looked at various choices for activation functions: the logistic or sigmoid function, the hyperbolic tangent, and, the most widely used, the ReLU, or rectified linear unit function. This last is the easiest to compute, and its derivative is either zero or one.
The output layer itself in general uses an activation function before spitting out its outputs, which are the arguments of the loss function. For a regression task, this might just be the identity function, with no transformation at all, but, for a classification task, a good choice is the softmax function, which generates a set of probabilities that can be used for either hard or soft voting to choose the category of an instance.
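For concreteness, the activation functions mentioned above can be written in a few lines of NumPy (a sketch; the deep-learning libraries of course provide them ready-made):

    import numpy as np

    def relu(z):                 # rectified linear unit: max(0, z), elementwise
        return np.maximum(0.0, z)

    def sigmoid(z):              # logistic function, output in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):              # turns a vector of scores into probabilities summing to 1
        e = np.exp(z - z.max())  # subtract the maximum for numerical stability
        return e / e.sum()

    z = np.array([2.0, -1.0, 0.5])
    print(relu(z), sigmoid(z), np.tanh(z), softmax(z))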
In order to train a deep model, the training set is randomly split into mini-batches, which are successively fed into the model. There is first a feed-forward pass through the data, in which the inputs are converted into outputs using the affine functions in each unit (a better word than "artificial neuron"), and the activation functions. It is important to initialise the connection weights randomly, since otherwise all the nodes in a layer would be identical, thus defeating the idea of each node doing its own thing.
The outputs are used to evaluate the loss function, and then the partial derivatives of the loss function with respect to all of the parameters, the connection weights and the biases, are computed by backpropagation, which at heart is just the chain rule of the differential calculus. Follow this link for a proper mathematical account of backpropagation. Then, just as with stochastic gradient descent, the parameters are updated by following the negative of the gradient, multiplied by the learning rate η.
Géron provides a table with "typical" architectures for MLPs. Among things to be chosen are the depth - number of layers - and the width - number of nodes in each one of the layers. Activation functions must also be specified, although for many regression tasks none is needed for the output layer. The loss function may be of different sorts, although the mean squared error is the commonest for regression tasks. If there are many outliers, mean absolute error may be more robust.
We began our study of what can be done with Keras on October 7. Although Scikit-Learn does have classifiers and regression models for multi-layer perceptrons, a much richer set of models is made available by Keras, an API that sits on top of TensorFlow. Our task was to set up an MLP for classification of the images in the Fashion MNIST dataset that includes ten sorts of images supposed to represent different items of clothing.
Keras ("horn" in Ancient Greek: κερας) provides a very flexible API for deep learning. The Sequential class makes it easy to define the architecture of an ANN. An object of this class can be constructed in two ways, one in which the layers are set up using the add function, the other in which the layers are passed as a list to the Sequential constructor. After construction, the model must be compiled (by the function compile), specifying the loss function, the optimiser (like stochastic gradient descent), and, optionally, metrics that can be used to evaluate the fit of the model. After compilation, the usual fit function can be called. It generates the "history" of the fit, giving some details of the progress made at each epoch, or pass through the data. If a validation set is specified, its fit is also recorded.
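In code, a sketch along the lines of the Fashion MNIST classifier in the book (the exact numbers of layers and units here are arbitrary choices):

    from tensorflow import keras

    # Fashion MNIST: 28x28 greyscale images of clothing items, 10 classes
    (X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
    X_train, X_test = X_train / 255.0, X_test / 255.0     # scale pixel values to [0, 1]

    # Second way of construction: pass the layers as a list to the Sequential constructor
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),       # 2-D image -> 1-D vector
        keras.layers.Dense(300, activation="relu"),       # hidden layers
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),     # one probability per class
    ])

    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="sgd", metrics=["accuracy"])

    history = model.fit(X_train, y_train, epochs=5, validation_split=0.1)
    print(model.evaluate(X_test, y_test))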
The California housing dataset was used to set up an MLP for a regression task. The Sequential class works in almost the same way as for the MNIST classification task. The main difference is that the output layer has only one node, instead of the ten needed for classifying the images, and has no activation function, instead of the softmax function.
Chapter 10 also describes the Functional API, which is declarative, like the Sequential API. We just started on seeing how to use it for a model with two channels, one deep, one wide.
The Functional API is almost infinitely adaptable, and can be used to construct models of great complexity. As we saw on October 9, the model with two channels can be extended so that the separate channels can have different inputs, and the model can generate more than one output. Although the models that can be constructed this way are very flexible, they are static, and for many purposes we want a dynamic model, which can respond differently in different situations. This can be achieved by subclassing the Keras Model class.
Subclassing is an essential operation for object-oriented programming. It allows the programmer to make use of the functionality of a base class, and add to it, or override it, in a subclass. Géron takes the two-input two-output model he had developed using the Functional API, and codes an equivalent model by subclassing. The necessary functions are the constructor and the call function. In this way, construction of the objects of the model is separated from the way they are put together. In general, the call function can be programmed to do almost anything in a dynamic fashion.
For purposes of observing the progress of the fit function, and for many others, one can define callbacks, that is, functions that are called when a certain point in the execution of the program is reached. In particular, such callbacks are used in conjunction with TensorBoard, which provides a visual description of how training is proceeding.
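Continuing the Fashion MNIST sketch above (with model, X_train and y_train as defined there), callbacks are simply passed to fit; the log directory name "logs/run1" is just an illustrative choice:

    from tensorflow import keras

    # Stop training when the validation loss stops improving, and log for TensorBoard
    early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    tensorboard = keras.callbacks.TensorBoard(log_dir="logs/run1")

    history = model.fit(X_train, y_train, epochs=100,
                        validation_split=0.1,
                        callbacks=[early_stop, tensorboard])
    # Then, from a terminal:  tensorboard --logdir logs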
On October 21, we went straight on to Chapter 11, on training neural network models. We started with the random initialisation of the parameters (connection weights and biases) using either the normal or the uniform distribution, both centred at the origin and scaled inversely with the number of links in and out of a layer. After that, we were treated to a number of alternatives to the ReLU activation function, all in an attempt to avoid the vanishing-gradients problem that can arise when some units get stuck in negative numbers where the ReLU gradient is zero.
Next came batch normalisation, another technique of avoiding exploding or vanishing gradients in the hidden layers of a deep network. This procedure also serves for regularisation of the model. A cruder way to deal with the vanishing or exploding gradient problem is gradient clipping.
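A sketch of both ideas in Keras (the placement of the batch-normalisation layers and the clipping threshold are arbitrary choices here):

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(300, activation="relu"),
        keras.layers.BatchNormalization(),               # normalise activations batch by batch
        keras.layers.Dense(100, activation="relu"),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # clipnorm rescales any gradient whose norm exceeds 1.0 - a crude but simple safeguard
    optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])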
Various aspects of the important topic of transfer learning were introduced. One possibility is the use of pretrained networks; another is unsupervised learning by building up layers progressively, freezing each one before training the next. There will be other applications of the principle of transfer learning, which we can think of as learning from experience.
On October 23, the next topic from Chapter 11 was a long list of optimisers, all based on gradient descent, but with numerous tweaks: AdaGrad, RMSprop, Adam, Adamax, Nadam, AdamW, and so on. One idea is momentum, and a variant of it called Nesterov momentum. These are all attempts to do for machine-learning algorithms what is done in more conventional optimisation procedures, where one replaces Newton's method for solving first-order conditions, which requires second-order derivatives of the cost function, with quasi-Newton methods, which rely on first-order derivatives only.
After looking at optimisers, the next topic logically is the learning rate. Here too there are numerous possibilities: exponential scheduling, piecewise constant rates, performance scheduling, and 1cycle scheduling. The last two are normally the best performers.
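For example (a sketch, with arbitrary numbers), exponential scheduling and a couple of the optimisers can be set up in Keras as follows:

    from tensorflow import keras

    # Exponential scheduling: the learning rate is multiplied by decay_rate
    # every decay_steps optimisation steps
    schedule = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.9)

    adam = keras.optimizers.Adam(learning_rate=schedule)
    # Nesterov momentum is available as an option of plain SGD
    sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)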
We spoke about various norms: ℓ1, ℓ2, up to ℓ∞. These are used in penalty terms added to the loss function and used for regularisation. Use of the LASSO (Least Absolute Shrinkage and Selection Operator) selection mechanism can lead to a sparse network. Ridge regression uses the ℓ2 norm for regularisation, but does not lead to some parameters being set to zero.
We continued the study of regularisation on October 28. After reminders of ℓ1 and ℓ2 regularisation, we looked at one of the most popular and powerful techniques: dropout. Although it seems at first sight that it is throwing away information, what it really does is to constitute an ensemble method, which combines the results of a huge number of submodels.
Next came the procedure called MC dropout (MC = Monte Carlo), which improves the accuracy of a model trained with dropout when it is applied to a test set. Basically, Monte Carlo methods are used to evaluate integrals by use of averaging over values of the integrand, so that MC dropout benefits from two different sources of randomness. We then skipped Chapters 12 and 13, and moved on to Chapter 14.
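A sketch of dropout layers, an ℓ2 penalty on one layer's weights, and MC dropout at prediction time (the model here is not trained; the sketch only shows where the pieces go):

    import numpy as np
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dropout(rate=0.2),                   # drop 20% of inputs at each training step
        keras.layers.Dense(300, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(1e-4)),  # ridge-style penalty
        keras.layers.Dropout(rate=0.2),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # MC dropout: keep dropout active at prediction time (training=True) and average
    # the predictions over many stochastic passes through the network
    def mc_predict(model, X, n_samples=100):
        # assumes the model has already been compiled and trained in the usual way
        probs = np.stack([model(X, training=True).numpy() for _ in range(n_samples)])
        return probs.mean(axis=0)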
A convolutional neural net (CNN) is of great use in image processing, where instances have a two-dimensional structure. An image can be scanned using a kernel, which is a small rectangle that moves over the complete image, generating output based just on what is in the kernel. The kernel can be thought of as a receptive field.
CNNs were the topic considered on October 30. The kernel can move in jumps, the size of which is determined by the stride. Since the kernel could fall off the edge as it approaches the boundary of the image, there are various things that can happen. One of these is zero padding, and another is valid padding, where the size of the output image is smaller than that of the image itself.
The kernel is combined with a number of filters, for instance, horizontal and vertical filters. Each of these gives rise to a feature map that captures only some aspects of the input image, but there are usually many feature maps in each convolutional layer; often their number is an integer power of two. The filters they use are learned as training proceeds.
Convolutional layers are often separated by a pooling layer. This loses information, but it can usefully reduce memory requirements, especially for large images, and can serve for regularisation.
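A small CNN of the kind discussed in Chapter 14, sketched in Keras with arbitrary sizes: 3×3 kernels, "same" (zero) padding, and pooling layers between the convolutional ones:

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu",
                            input_shape=[28, 28, 1]),     # 32 feature maps
        keras.layers.MaxPooling2D(pool_size=2),           # halves the spatial dimensions
        keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
        keras.layers.MaxPooling2D(pool_size=2),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.summary()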
On November 4, we looked quickly at the long list in Chapter 14 of various architectures, some of which won contests for processing various datasets. A few new ideas were introduced, such as the inception module, the residual unit, or the SE block. It is often possible to make use of a pretrained model with a prize-winning architecture, and to benefit from transfer learning. We looked at examples of this for a variety of tasks.
The end of Chapter 14 on CNNs was covered quite quickly, so as to avoid tedium. The main topics were identification and localisation of objects in an image using bounding boxes, semantic segmentation, and instance segmentation. In semantic segmentation, each pixel in an image is classified according to the sort of object to which it belongs.
Then we went on to Chapter 15, on recurrent neural nets (RNN), which are used for processing sequences, such as the time series used in econometrics. A new type of "neuron", or unit, called a recurrent unit, is introduced. The input to such a unit is usually a sequence, and, as each element of the sequence is fed sequentially to the unit, it is combined with something produced by the unit when the preceding element was fed in.
We more or less completed the study of Chapter 15 on November 6. A recurrent unit normally takes a sequence as input, and inputs can be fed into a layer of recurrent units. The output from a single recurrent unit may be another sequence (sequence to sequence), or just a single element, typically the last in a sequence of which the other elements are discarded (sequence-to-vector). It is also possible to output a sequence from a single vector (vector-to-sequence), where the same vector is fed into each recurrent unit. A brief mention was made of an encoder-decoder network, of which much more later.
An important use of recurrent neurons is in forecasting time series. A specific example is given using data from the Chicago Transit Authority, in which it was necessary to take account of various types of seasonality. There is a very strong weekly pattern, and the suggestion is made that it can be removed by differencing the data, that is, looking at the series y(t)-y(t-7). It appeared that there was also a small amount of cyclical behaviour at the annual frequency, and it too can be removed by differencing. For forecasting, if a model is trained on the differenced data, it is necessary to add back the observations that were subtracted out in the differencing process.
This led on to discussion of various time-series models used in econometrics: ARMA, ARIMA, SARIMA models all try to take account of patterns of serial correlation in time series. Explicit use of such models, which involves choosing suitable values for a set of hyperparameters, leads to adequate, but not sensational, forecasts.
In order to train a model with a time series, one can construct instances that are consecutive subsequences, and give them targets that are subsequences one step ahead of the one that constitutes the instance. This allows the model to have some memory, although not a very long one.
We skipped quickly over much of Géron's treatment of the data from Chicago, and instead talked about how these methods could be used for economic forecasting and even weather forecasting. It appeared, though, that a simple RNN cannot cope satisfactorily with a very long series.
However, we saw that two types of memory cells have been shown to extend short-term memory considerably. The first is the LSTM (long short-term memory) cell; the second is the GRU (gated recurrent unit) cell, which, although it is a simplified version of the LSTM cell, seems to work equally well. After that, we saw how a one-dimensional CNN can be combined with an RNN, and even replace it in some circumstances.
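As an illustration of sequence-to-vector forecasting with a memory cell (a sketch on an artificial series, not the Chicago data), each instance is a window of 30 consecutive values and the target is the value one step ahead:

    import numpy as np
    from tensorflow import keras

    series = np.sin(np.arange(2000) / 20.0) + 0.1 * np.random.randn(2000)   # artificial series
    window = 30
    X = np.stack([series[i:i + window] for i in range(len(series) - window - 1)])
    y = series[window:-1]                  # one-step-ahead targets
    X = X[..., np.newaxis]                 # shape (n_instances, window, 1 feature)

    model = keras.Sequential([
        keras.layers.LSTM(32, input_shape=[None, 1]),   # a GRU layer could be used instead
        keras.layers.Dense(1),                          # the forecast
    ])
    model.compile(loss="mse", optimizer="adam")
    model.fit(X, y, epochs=3, verbose=0)
    print(model.predict(X[-1:]))           # forecast from the most recent window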
We were again stuck with version 2 of Géron's text. I think I have resolved this problem, although I, like ChatGPT, can make mistakes! We embarked on Chapter 16, on Natural Language Processing. The first sort of model considered is a char-RNN, a recurrent model which tries to predict the next character in a sequence of characters. All of Shakespeare's works can be downloaded and used to train such a model. Formally, it is just the same as predicting one step ahead with a time series. The model can be used to generate fake text, which has some passing resemblance to Shakespearean English. A sequence, perhaps taken from one of Shakespeare's plays, is used to initialise the text to be generated. Then, starting from this sequence, one character at a time is appended. The best way to do this is to use a softmax activation on top of a dense layer that outputs a vector with one element for each admissible character, including spaces and punctuation. The probabilities thus obtained can then be combined with a temperature: low temperatures lead to the next character being chosen as the most probable out of the full set, with the probabilities being made progressively more equal as the temperature rises.
Because we can have very long passages of text, it is desirable to train models that may have long memories. This can be achieved by the use of stateful RNNs, where the contents of the hidden layers are preserved from one training iteration to the next, and re-initialised only at the end of each epoch.
The next task treated in Chapter 16 is sentiment analysis. For this, we move from single characters to words. Even if we limit our vocabulary to a thousand words, it is desirable to use an embedding, where each word is represented by a vector in a Euclidean space, of which the dimension is a hyperparameter. The embedding is typically learned as training proceeds. There is a set of 50,000 movie reviews available in the Internet Movie Database (IMDb). These are either favourable or unfavourable, so that the targets for each review are binary: 0 for negative, or unfavourable, 1 for positive, or favourable. A model with a layer of GRU cells can be trained to output a binary result. It should have a dense layer on top of the GRU layer with just one output neuron, with sigmoid activation, so that we can assign probabilities to positive or negative sentiments.
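A sketch of such a model, using the Keras built-in IMDb loader (the reviews come already encoded as sequences of word indices, so this is not Géron's exact preprocessing; the sequence length and layer sizes are arbitrary):

    from tensorflow import keras

    vocab_size = 1000        # keep only the 1000 most frequent words
    embed_dim = 16           # dimension of the embedding space (a hyperparameter)

    (X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=vocab_size)
    X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=200)
    X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=200)

    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embed_dim),    # learned word embeddings
        keras.layers.GRU(32),                             # sequence-to-vector
        keras.layers.Dense(1, activation="sigmoid"),      # probability of a positive review
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=2, validation_split=0.1)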
We continued the study of natural language processing on November 13. It is often preferable to use a pretrained model. Earlier versions of these suffer from the problem that a word can have quite different meanings in different contexts. More recent, more complicated, models can handle this.
Then we moved on to the important task of machine translation. This involved learning about the encoder-decoder network, which has a more complicated structure than a straightforward deep model, since the encoder and the decoder are different models, which, although they interact, must be trained separately. The encoder converts its input into an embedding, which is an abstract representation of the signal, or meaning conveyed by the inputs.
The example we looked at involved translation from English to Spanish. The simple model had a very limited vocabulary for both languages, just short of 1000 words. If a substantially larger vocabulary is used, it may be necessary to use a sampled softmax layer for output, to avoid computing the softmax of thousands and thousands of words. This model cannot handle sentences of more than just a few words.
Context is all-important in translation, by machine or human. On November 12, we saw that one way to enlarge context is to use a bidirectional recurrent layer. Note, though, that this is for training the encoder, not the decoder, which would learn nothing if it was allowed to cheat by looking ahead. In both training and inference, the decoder is fed just one word at a time. However, beam search is possible for the decoder. This lets it maintain various possibilities looking ahead, eliminating those that are rendered impossible as further words come in. This led to correct translation of somewhat longer sentences.
A huge step forward in machine translation, and in many other tasks, came with attention. The slogan is "Attention is all you need", meaning that convolutional and recurrent units are no longer essential to these tasks. Attention is implemented by use of an alignment model. It generates probabilities that determine what inputs the main model is to pay attention to. These may be from a long time back, so that genuine long-term memory becomes possible, over and above what one can get from LSTM cells or GRU cells. There exist a few different types of attention, which use different measures of similarity between the encoder's output and the decoder's previous hidden state.
Attention layers are a feature of the transformer architecture. Translation transforms sentences of one language into sentences in another, and so it is one example of a transformer. But there are many other tasks that can be handled by transformers. Usually an attention layer is a multi-head attention layer. This lets the transformer pay attention to more than one aspect of the input at the same time. It is important that the decoder remains causal, so that it does not cheat and learn nothing, and this is achieved by masking non-causal inputs.
In translation, word order is important, but is different for different languages. Positional encodings, of various sorts, are used to convey the ordering of words in a sentence.
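Keras provides a MultiHeadAttention layer; here is a sketch of causal self-attention over a sequence of fake token ids, with a learned positional embedding added to the token embeddings. The use_causal_mask argument assumes a reasonably recent version of Keras:

    import tensorflow as tf
    from tensorflow import keras

    seq_len, d_model = 10, 64
    tokens = tf.random.uniform((1, seq_len), maxval=1000, dtype=tf.int32)   # fake token ids

    # Token embeddings plus a learned positional embedding, so that word order matters
    token_emb = keras.layers.Embedding(1000, d_model)(tokens)
    pos_emb = keras.layers.Embedding(seq_len, d_model)(tf.range(seq_len))
    x = token_emb + pos_emb

    # Multi-head self-attention; the causal mask stops each position from attending
    # to later positions, as required in a decoder
    attn = keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)
    out = attn(query=x, value=x, key=x, use_causal_mask=True)
    print(out.shape)    # (1, seq_len, d_model)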
We spent perhaps too much time on November 18 on the second part of Chapter 16. There has been what Géron calls an "avalanche" of transformer models following the successful use of the original transformer for machine translation. In many cases, these models make use of self-supervised training, something that is immensely useful whenever there are vast amounts of unsupervised data, but only a little supervised data. Such models output encodings that represent the main, or salient, properties of the inputs. They can subsequently be fine-tuned for specific tasks, using whatever supervised instances happen to be available.
Vision transformers work by breaking an image down into smaller patches, for instance 16 × 16 squares. Attention mechanisms focus attention on specific regions of a 2-dimensional image.
We looked briefly at all of the resources made available by the organisation called Hugging Face. In addition to various data sets, it has a library of transformers. These can be imported and, with only a couple of lines of code, provide a model for numerous tasks, such as, for instance, sentiment analysis.
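For instance, sentiment analysis really does take only a couple of lines with the transformers library (the first call downloads a default pretrained model):

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("This course on machine learning is remarkably useful."))
    # prints something like [{'label': 'POSITIVE', 'score': 0.99...}]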
On November 20, we started on Chapter 17. The first major topic in the chapter is autoencoders. In the most basic form, an autoencoder tries to learn the identity mapping from input to output - the input serves also as the target for training the autoencoder. This is useful, because the model has to produce an encoding in its hidden layer that is constrained to be of lower dimension than the input. This encoding should then be a concise, efficient, representation of the input, and this encoding can then be used for many purposes. It is a version of the embeddings introduced earlier, and is also known as a latent representation.
The first example is Principal Components Analysis. The data set consists of points in three-dimensional space, and the encoding is constrained to be two-dimensional. After training, the model yields (approximately) the first two principal components of the 3D instances in the test set.
It can be advantageous to stack autoencoders. Such a model is used with the MNIST fashion set, and it is supposed to reproduce the input images in the output. The relatively simple autoencoder given in Géron's book does so, but not with great fidelity.
Autoencoders can also be used for unsupervised pretraining. A large data set can be used to train a stacked autoencoder, and its lower layers, leading to the encoding, can be reused to train a much smaller labelled data set for one or more specific tasks. It can be a good idea to tie the weights of the encoder and the decoder, transposing those from the encoder for use with the decoder. This saves on parameters and speeds up training.
A stacked autoencoder need not use dense layers. Especially when working with images, convolutional autoencoders can incorporate convolutional layers, interspersed with pooling layers, as in an ordinary CNN. This idea is exploited by denoising autoencoders. For training, random noise can be added to the pixels of an image, or alternatively, the input images can be subjected to dropout, so as to lower the signal-to-noise ratio. The untouched images then serve as targets to train a model to reconstruct a clean image from a noisy or blurred one. An example of this was given, where images in the MNIST fashion data set were reconstructed, with greater or lesser fidelity, from the images to which enough random noise was added as to make them largely unrecognisable.
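A sketch of a small stacked, denoising autoencoder for Fashion MNIST (the layer sizes and the amount of added noise are arbitrary): noisy images are the inputs, and the clean images the targets:

    import numpy as np
    from tensorflow import keras

    (X_train, _), _ = keras.datasets.fashion_mnist.load_data()
    X_train = X_train.astype("float32") / 255.0
    X_noisy = X_train + 0.3 * np.random.randn(*X_train.shape)    # corrupt the inputs

    # The 30-dimensional middle layer is the encoding (latent representation)
    encoder = keras.Sequential([
        keras.layers.Flatten(input_shape=[28, 28]),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(30, activation="relu"),
    ])
    decoder = keras.Sequential([
        keras.layers.Dense(100, activation="relu", input_shape=[30]),
        keras.layers.Dense(28 * 28, activation="sigmoid"),
        keras.layers.Reshape([28, 28]),
    ])
    autoencoder = keras.Sequential([encoder, decoder])

    autoencoder.compile(loss="mse", optimizer="adam")
    autoencoder.fit(X_noisy, X_train, epochs=3, validation_split=0.1)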
Another variant is the sparse autoencoder, for which the encodings must be sparse in a certain sense. There are various ways of penalising non-sparsity in the loss function, in particular L1 and L2 penalties, and another one, the Kullback-Leibler divergence measuring the divergence from a target sparsity to the actual sparsity at any stage of training.
Variational autoencoders, which we studied on November 25, are the first models we have seen that are capable of generating outputs randomly. The encoder produces two layers in parallel, one for the mean, the other for the standard deviation, of a Gaussian (normal) distribution. These combine to make the encodings look like random Gaussian noise. As a result, after training, the decoder, if provided with random Gaussian input, outputs things that look somewhat like the inputs used to train the autoencoder. A variational autoencoder can also perform semantic interpolation, which could be used to let one image morph progressively into another.
Next came the topic of Generative Adversarial Networks (GAN). This architecture is quite different from that of an autoencoder. It has two components: the generator and the discriminator. The generator takes a random input and learns to transform it into a "fake" version of the real inputs that are used to train the discriminator. Phase one of training supplies labelled inputs to the discriminator - they can be "real", labelled 1, or "fake", labelled 0. The fakes are produced by the generator, and the discriminator learns to distinguish real from fake. But in phase two of training, the discriminator is frozen, while the generator produces fakes that it falsely labels as real. The discriminator does its best to tell the generator that they are in fact fake, and so in this phase the generator learns how best to fool the discriminator.
If all goes well, the unique Nash equilibrium may be reached, in which the generator produces output that the discriminator cannot distinguish from real input: it assigns a probability of a half to each possibility. Unfortunately, many things can go wrong. Mode collapse occurs when the generator ends up producing only one sort of output, forgetting all the other sorts present in the training set. Instability in the process of learning weights is another danger. Our knowledge of the theoretical properties of GANs is currently insufficient to diagnose and avoid these problems, and consequently there are various recipes that may or may not help lead to successful training.
The study of this began with Deep convolutional GANs (DCGAN), which are much less unstable during training than the generic GANs we looked at earlier. They can therefore be profitably trained using several epochs, and they yield much better results if this is done.
November 27 was the second last class. We completed our study of Chapter 17, starting where we left off with DCGANs. An amusing example of arithmetic in embedding space showed how subtracting the encodings for images of the faces of men without glasses from those of men wearing glasses, and then adding encodings of women without glasses produced images of women wearing glasses. But the arithmetic must be done in the encoding space, not the pixel space.
Among several ad hoc methods that have been found to improve the training of GANs there is a very useful one for image processing in which the GAN grows progressively by adding convolutional layers that progressively increase the size of the image. Although this is a greedy procedure, it helps to keep generated images looking like those in the training set at all sizes of the image. The rather complicated architectures of StyleGANs can produce high-resolution images that are convincing to the human eye. They do this by introducing separate realisations of noise at different layers, and thus work at several levels of resolution. The idea is to make generated images resemble the training images locally as well as globally.
The last topic in Chapter 17 is Diffusion models. These are easier to train than GANs, but take longer to generate images. The idea is to train a model to reduce entropy in the following sense: one starts with a clear image, and, over a great many steps, adds noise to the pixels until the image appears to be just random noise. Generating an image goes backwards, starting with (Gaussian) noise, and going back progressively until a clear image is reached.
A great improvement is obtained if the whole business takes place, not in the pixel space, but in a latent space of encodings generated by an autoencoder, with decoding the final step in the process of image generation. This not only leads to better results but also speeds up training.
After training, if the model is given random noise as its input, it generates images that resemble the training images. It works extremely well with the MNIST fashion dataset.
Rather than going on to Géron's final chapter on Reinforcement Learning, we started on another "Hands-on" book, this time Hands-On Generative AI with Transformers and Diffusion Models, by Omar Sanseviero, Pedro Cuenca, Apolinário Passos, and Jonathan Whitaker. It is available on the O'Reilly platform.
It starts where we left off with Géron, on the topic of generating images. We are told how to download a pretrained model from Hugging Face, and use it to generate images specified by providing a text caption. The image we looked at came from the caption "a photograph of an astronaut riding a horse".
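A sketch of how this sort of thing is done with the Hugging Face diffusers library; the checkpoint name below is one commonly used Stable Diffusion model, not necessarily the one used in the book, and in practice a GPU is needed:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")                      # move the model to the GPU

    image = pipe("a photograph of an astronaut riding a horse").images[0]
    image.save("astronaut.png")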
The last class of the term was on December 2. We skimmed through several chapters of the second "Hands-on" textbook, looking mostly at pictures and diagrams, and occasional bits of code. The API used by this book is not Keras, but rather a competitor, PyTorch.
The book provides many references to resources that enable many tasks. Text classification and text generation are given extensive treatment, followed by similar things for images. Generating audio is another topic that is treated, although the results could not be presented in a book. The common theme of all these models is that of the transformer, although its realisations differ a great deal in the details.
Assignments:
The Bach assignments sent to me were all acceptable, and many of them were very good. There are just two points, made in a couple of the assignments, that are worth bringing to everyone's attention. In one, a clever bit of data augmentation was done, by taking each chorale and transposing it up and down various numbers of semitones while staying inside the range of notes in the training set. This multiplied the size of the training set by six or seven, and made Bach's musical language easier to learn. Another assignment also transposed the chorales, but so as to put all those in a major key into C major, and all those in a minor key into A minor (which is the relative minor of C major). This took out the noise associated with the fact that the chorales are in several different keys, and made it easier for the model to focus on the melodic and harmonic progressions.
Below you can hear the most successful of the generated Bach-like files. The music starts with the beginning of one of Bach's own chorales and then moves on to the generated chord sequences. Although Bach liked dissonance, the dissonances in the generated section get progressively more unlike Bach, and more harsh-sounding to the ear.
To send me email, click here or write directly to russell.davidson@mcgill.ca.
URL:
https://russell-davidson.research.mcgill.ca/e706/