Try drawing a digit on the canvas!
1v1 Least Squares (n/a ms)
In this article, we go over 3 basic models that tackle the MNIST dataset in distinct ways and compare their properties, strengths, and weaknesses. You can play with each one of these models above interactively and view their outputs in the bar graphs.
The Task
To those who aren’t familiar with machine learning, converting an image into a number might appear a daunting task. However, it becomes easier if we think of the problem in the following way:
A grayscale picture is simply a grid of pixel brightnesses, which are real values. That is, each image is some element in the set , where are the width and height of the image, respectively. So, we can solve the problem if we can find a function from .
To do this, we construct a model using our training images and labels .
Least Squares
This method involves creating 45 linear maps from for each unique pair selected from our categories 0..9 that infers whether an image most likely belongs to pair or . We can minimize Mean Squared Error (MSE) using some linear algebra. First, instead of dealing with images in , we can “flatten” them to .
Define the weights of as , a vector of length . To get the output of the model, we compute
where is a digit.
We want to minimize the MSE over all samples
To do this, we create a new matrix , which only contains images which belong to class or along with a column of for the bias, and a matrix , which similarly only contains labels in , but replaces with and with .
Now, our problem has been reduced to
The solution to this is given by , where is the pseudoinverse of the matrix (Proof left as exercise to the reader 😁).
Once we have for all pairs (45 in total), we can represent our desired function as
def f(x):
score = [0] * 10
for i, j, f_ij in pair_functions:
out_ij = f_ij(x)
if out_ij > 0:
score[i] += 1
score[j] -= 1
else:
score[j] += 1
score[i] -= 1
return argmax(score)
Each of the 45 models “votes” for either its
or
. The score
array is
what you see above in the bar graph.
Fully Connected Network
A Fully Connected Network, or FCN is a much larger model than the least squares model. Instead of projecting our labels onto the principal subspace of the data, we can directly learn a mapping from the input space to the output space.
For a single layer network, we assume that can be approximated by
where is some nonlinear function. It is possible to learn the matrix such that the error (Categorical Cross Entropy) is minimized in the local neighborhood through gradient descent. In the demo, we use a 2 layer network that maps the image to , and the result of that to . This is represented by
where we have to learn matrices and . In our case, and
converts the output to a probability distribution, which is shown above in the bar graphs.
Convolutional Network
A limitation of the two above models is that they don’t see visual features like
humans do. For example, a handwritten 1
is a
regardless of where it was
painted on the canvas. However, since the LS and FCN models do not have a notion
of space or proximity, they will simply point to the category which most likely
have those exact pixels on.
Here, we bring in convolutions. Convolutions take an image and a kernel, run the kernel through the image, and produce an output image that contains the weighted sum of the image pixels and the kernel values.
Notice how convolutions encode spatial data like plain networks do not. Since pixels nearby are usually highly correlated with each other, we can downsample the convolution output with a max pool and preserve most of the information. After passing the image through a bunch of (trained) kernels, we get a set of matrices that represent the existence of a learned spatial feature. Finally, we can flatten and pass these into an FCN, which can now map spatial data into categories.
The output of this FCN (with softmax activation) is shown above.
Model Comparison
Note: The last 3 columns are qualitative and relative to each other.
Model | Number of Parameters | Training Time | Inference Time | Accuracy |
---|---|---|---|---|
Least Squares | Low | Fast | Low | |
FCN | High | Fast | Good | |
Convolutional Network | Very High | Slow | Excellent |
Observations:
- Least Squares model is very fast but has a weak ability to generalize
- The CNN’s parameters are very efficient to store
- Relative to the inference time of the CNN, LS and FCN are very fast
Exercises
See how the models respond to these inputs:
- Empty canvas
- A
1
in the center - A
1
in the far left side - A
1
in the far right side - A
0
with a line/dot in the center - A
9
, with the top slightly disconnected - Slightly rotated digits
- Very thin digits
- Very thick digits
Can you find 2 inputs that have a 1 pixel difference that map to different categories?
Implementation Details
All three of the models are running on your browser in plain JavaScript; no frameworks or packages were used.
Canvas
The canvas is backed by an array of numbers that contain the alpha value that is displayed. Every time any pixel is updated, the entire thing is redrawn. The only other interesting detail is the brightness dropoff function I used:
const plateau = 0.3;
// dist is distance^2 from center
const alpha = Math.min(1 - dist / r2 + plateau, 1);
pixels[yc * 28 + xc] = Math.max(pixels[yc * 28 + xc], alpha);
I initially tried a dropoff of 1-dist/r2
but it faded the center too much. So
I added the plateau
variable that shifts the function up but clamped it down
with Math.min
so that alpha doesn’t exceed 1. This gives the brush a more
natural look.
Least Squares
I got the weights from a project I did in ECE 174 with Professor Piya Pal. Inference is simply 45 dot products and scoring
function evalLSModel(digit, weights) {
const scores = new Array(10).fill(0);
for (const pairConfig of weights) {
const [i, j, w] = pairConfig;
// Vector dot product
const result = vdot(digit, w);
if (result > 0) {
scores[i] += 1;
scores[j] -= 1;
} else {
scores[j] += 1;
scores[i] -= 1;
}
}
return scores;
}
Fully Connected Network
The main work with FCN inference is the matrix dot product, which I implemented in the standard way.
function matrixDot(matrix1, matrix2, rows1, cols1, rows2, cols2) {
// Check if the matrices can be multiplied
if (cols1 !== rows2) {
console.error("Invalid matrix dimensions for dot product");
return null;
}
// Initialize result matrix with zeros
const result = new Array(rows1 * cols2).fill(0);
// Perform dot product
for (let i = 0; i < rows1; i++) {
for (let j = 0; j < cols2; j++) {
for (let k = 0; k < cols1; k++) {
result[i * cols2 + j] +=
matrix1[i * cols1 + k] * matrix2[k * cols2 + j];
}
}
}
return result;
}
I stored matrices in a single 1D Array
for better cache locality and fewer
heap allocations. As per the formula above, inference consists of 2 matrix dot
products and 2 activation function applications. The push(1)
calls are to
calculate the bias.
function evalNN(digit, weights) {
const digitCopy = [...digit];
digitCopy.push(1);
// layer 1 params
const [w1, [rows1, cols1]] = weights[0];
const out1 = matrixDot(digitCopy, w1, 1, digitCopy.length, rows1, cols1).map(relu);
const [w2, [rows2, cols2]] = weights[1];
out1.push(1);
const out2 = matrixDot(out1, w2, 1, out1.length, rows2, cols2);
return softmax(out2);
}
Convolutional Network
The convnet here is quite small. In Pytorch it is
nn.Sequential(
nn.Conv2d(1, 32, kernel_size=3),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Conv2d(32, 64, kernel_size=3),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Dropout(0.5),
nn.Linear(1600, 10),
nn.Softmax(dim=1)
)
For inference, we just have to port the forward passes to JavaScript. Conv2d (with in/out channels) is given by
function conv2d(
nInChan,
nOutChan,
inputData,
inputHeight,
inputWidth,
kernel,
bias,
) {
if (inputData.length !== inputHeight * inputWidth * nInChan) {
console.error("Invalid input size");
return;
}
if (kernel.length !== 3 * 3 * nInChan * nOutChan) {
console.error("Invalid kernel size");
return;
}
const kernelHeight = 3;
const kernelWidth = 3;
// Compute output dimensions
const outputHeight = inputHeight - kernelHeight + 1;
const outputWidth = inputWidth - kernelWidth + 1;
const output = new Array(nOutChan * outputHeight * outputWidth).fill(0);
for (let i = 0; i < outputHeight; i++) {
for (let j = 0; j < outputWidth; j++) {
for (let outChan = 0; outChan < nOutChan; outChan++) {
let sum = 0;
// apply filter at single location over all input channels
for (let inChan = 0; inChan < nInChan; inChan++) {
for (let row = 0; row < 3; row++) {
for (let col = 0; col < 3; col++) {
const inI =
inChan * (inputHeight * inputWidth) +
(i + row) * inputWidth +
(j + col);
const kI =
outChan * (nInChan * 3 * 3) +
inChan * (3 * 3) +
row * 3 +
col;
sum += inputData[inI] * kernel[kI];
}
}
}
sum += bias[outChan];
const outI =
outChan * (outputHeight * outputWidth) +
i * outputWidth +
j;
output[outI] = sum;
}
}
}
return output;
}
I know it’s ugly. I’m just putting it here for reference. Heads up for maxpool:
function maxPool2d(nInChannels, inputData, inputHeight, inputWidth) {
if (inputData.length !== inputHeight * inputWidth * nInChannels) {
console.error("maxpool2d: invalid input height/width");
return;
}
const poolSize = 2;
const stride = 2;
const outputHeight = Math.floor((inputHeight - poolSize) / stride) + 1;
const outputWidth = Math.floor((inputWidth - poolSize) / stride) + 1;
const output = new Array(outputHeight * outputWidth * nInChannels).fill(0);
for (let chan = 0; chan < nInChannels; chan++) {
for (let i = 0; i < outputHeight; i++) {
for (let j = 0; j < outputWidth; j++) {
let m = 0;
for (let row = 0; row < poolSize; row++) {
for (let col = 0; col < poolSize; col++) {
const ind =
chan * (inputHeight * inputWidth) +
(i * stride + row) * inputWidth +
(j * stride + col);
m = Math.max(m, inputData[ind]);
}
}
const outI =
chan * (outputHeight * outputWidth) + i * outputWidth + j;
output[outI] = m;
}
}
}
return output;
}
And yes, the only reason I am dealing with that abomination of index calculating code is for that Blazing fast 🔥JavaScript🔥 Web App performance. And finally, here’s the function that ties it all together:
function evalConv(digit, weights) {
const [
[f1, fshape1], // conv filter weights
[b1, bshape1], // conv bias
[f2, fshape2],
[b2, fbshape2],
[w, wshape], // fcn weights
[b, bshape], // fcn bias
] = weights;
const x1 = conv2d(1, 32, digit, 28, 28, f1, b1).map(relu);
const x2 = maxPool2d(32, x1, 26, 26);
const x3 = conv2d(32, 64, x2, 13, 13, f2, b2).map(relu);
const x4 = maxPool2d(64, x3, 11, 11);
const x5 = matrixDot(w, x4, 10, 1600, 1600, 1);
const x6 = vsum(x5, b);
const out = softmax(x6);
return out;
}
Conclusion
I hope you all enjoy playing around with the app. If you have any questions or feedback, feel free to leave a comment below.