{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Tutorial 3: Dimensionality Reduction & Reconstruction\n",
"\n",
"**Week 1, Day 4: Dimensionality Reduction**\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"**Content creators:** Alex Cayco Gajic, John Murray\n",
"\n",
"**Content reviewers:** Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom, Siddharth Suresh, Natalie Schaworonkow, Ella Batty\n",
"\n",
"**Production editors:** Spiros Chavlis"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Tutorial Objectives\n",
"\n",
"*Estimated timing of tutorial: 50 minutes*\n",
"\n",
"In this notebook we'll learn to apply PCA for dimensionality reduction, using a classic dataset that is often used to benchmark machine learning algorithms: MNIST. We'll also learn how to use PCA for reconstruction and denoising.\n",
"\n",
"Overview:\n",
"- Perform PCA on MNIST\n",
"- Calculate the variance explained\n",
"- Reconstruct data with different numbers of PCs\n",
"- (Bonus) Examine denoising using PCA\n",
"\n",
"\n",
"You can learn more about the MNIST dataset [here](https://en.wikipedia.org/wiki/MNIST_database)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @markdown\n",
"from IPython.display import IFrame\n",
"from ipywidgets import widgets\n",
"out = widgets.Output()\n",
"with out:\n",
" print(f\"If you want to download the slides: https://osf.io/download/kaq2x/\")\n",
" display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/kaq2x/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
"display(out)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import feedback gadget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install and import feedback gadget\n",
"\n",
"!pip3 install vibecheck datatops --quiet\n",
"\n",
"from vibecheck import DatatopsContentReviewContainer\n",
"def content_review(notebook_section: str):\n",
" return DatatopsContentReviewContainer(\n",
" \"\", # No text prompt\n",
" notebook_section,\n",
" {\n",
" \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
" \"name\": \"neuromatch_cn\",\n",
" \"user_key\": \"y1x3mpx5\",\n",
" },\n",
" ).render()\n",
"\n",
"\n",
"feedback_prefix = \"W1D4_T3\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Imports\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Figure Settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Figure Settings\n",
"import logging\n",
"logging.getLogger('matplotlib.font_manager').disabled = True\n",
"\n",
"import ipywidgets as widgets # interactive display\n",
"%config InlineBackend.figure_format = 'retina'\n",
"plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting Functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Plotting Functions\n",
"\n",
"def plot_variance_explained(variance_explained):\n",
" \"\"\"\n",
" Plots eigenvalues.\n",
"\n",
" Args:\n",
" variance_explained (numpy array of floats) : Vector of variance explained\n",
" for each PC\n",
"\n",
" Returns:\n",
" Nothing.\n",
"\n",
" \"\"\"\n",
"\n",
" plt.figure()\n",
" plt.plot(np.arange(1, len(variance_explained) + 1), variance_explained,\n",
" '--k')\n",
" plt.xlabel('Number of components')\n",
" plt.ylabel('Variance explained')\n",
" plt.show()\n",
"\n",
"\n",
"def plot_MNIST_reconstruction(X, X_reconstructed, keep_dims):\n",
" \"\"\"\n",
" Plots 9 images in the MNIST dataset side-by-side with the reconstructed\n",
" images.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column\n",
" corresponds to a different\n",
" random variable\n",
" X_reconstructed (numpy array of floats) : Data matrix each column\n",
" corresponds to a different\n",
" random variable\n",
" keep_dims (int) : Dimensions to keep\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
"\n",
" plt.figure()\n",
" ax = plt.subplot(121)\n",
" k = 0\n",
" for k1 in range(3):\n",
" for k2 in range(3):\n",
" k = k + 1\n",
" plt.imshow(np.reshape(X[k, :], (28, 28)),\n",
" extent=[(k1 + 1) * 28, k1 * 28, (k2 + 1) * 28, k2 * 28],\n",
" vmin=0, vmax=255)\n",
" plt.xlim((3 * 28, 0))\n",
" plt.ylim((3 * 28, 0))\n",
" plt.tick_params(axis='both', which='both', bottom=False, top=False,\n",
" labelbottom=False)\n",
" ax.set_xticks([])\n",
" ax.set_yticks([])\n",
" plt.title('Data')\n",
" plt.clim([0, 250])\n",
" ax = plt.subplot(122)\n",
" k = 0\n",
" for k1 in range(3):\n",
" for k2 in range(3):\n",
" k = k + 1\n",
" plt.imshow(np.reshape(np.real(X_reconstructed[k, :]), (28, 28)),\n",
" extent=[(k1 + 1) * 28, k1 * 28, (k2 + 1) * 28, k2 * 28],\n",
" vmin=0, vmax=255)\n",
" plt.xlim((3 * 28, 0))\n",
" plt.ylim((3 * 28, 0))\n",
" plt.tick_params(axis='both', which='both', bottom=False, top=False,\n",
" labelbottom=False)\n",
" ax.set_xticks([])\n",
" ax.set_yticks([])\n",
" plt.clim([0, 250])\n",
" plt.title(f'Reconstructed K: {keep_dims}')\n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
"\n",
"def plot_MNIST_sample(X):\n",
" \"\"\"\n",
" Plots 9 images in the MNIST dataset.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" Nothing.\n",
"\n",
" \"\"\"\n",
"\n",
" fig, ax = plt.subplots()\n",
" k = 0\n",
" for k1 in range(3):\n",
" for k2 in range(3):\n",
" k = k + 1\n",
" plt.imshow(np.reshape(X[k, :], (28, 28)),\n",
" extent=[(k1 + 1) * 28, k1 * 28, (k2+1) * 28, k2 * 28],\n",
" vmin=0, vmax=255)\n",
" plt.xlim((3 * 28, 0))\n",
" plt.ylim((3 * 28, 0))\n",
" plt.tick_params(axis='both', which='both', bottom=False, top=False,\n",
" labelbottom=False)\n",
" plt.clim([0, 250])\n",
" ax.set_xticks([])\n",
" ax.set_yticks([])\n",
" plt.show()\n",
"\n",
"\n",
"def plot_MNIST_weights(weights):\n",
" \"\"\"\n",
" Visualize PCA basis vector weights for MNIST. Red = positive weights,\n",
" blue = negative weights, white = zero weight.\n",
"\n",
" Args:\n",
" weights (numpy array of floats) : PCA basis vector\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
"\n",
" fig, ax = plt.subplots()\n",
" plt.imshow(np.real(np.reshape(weights, (28, 28))), cmap='seismic')\n",
" plt.tick_params(axis='both', which='both', bottom=False, top=False,\n",
" labelbottom=False)\n",
" plt.clim(-.15, .15)\n",
" plt.colorbar(ticks=[-.15, -.1, -.05, 0, .05, .1, .15])\n",
" ax.set_xticks([])\n",
" ax.set_yticks([])\n",
" plt.show()\n",
"\n",
"\n",
"def plot_eigenvalues(evals, xlimit=False):\n",
" \"\"\"\n",
" Plots eigenvalues.\n",
"\n",
" Args:\n",
" (numpy array of floats) : Vector of eigenvalues\n",
" (boolean) : enable plt.show()\n",
" Returns:\n",
" Nothing.\n",
"\n",
" \"\"\"\n",
"\n",
" plt.figure()\n",
" plt.plot(np.arange(1, len(evals) + 1), evals, 'o-k')\n",
" plt.xlabel('Component')\n",
" plt.ylabel('Eigenvalue')\n",
" plt.title('Scree plot')\n",
" if xlimit:\n",
" plt.xlim([0, 100]) # limit x-axis up to 100 for zooming\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper Functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Helper Functions\n",
"\n",
"def add_noise(X, frac_noisy_pixels):\n",
" \"\"\"\n",
" Randomly corrupts a fraction of the pixels by setting them to random values.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix\n",
" frac_noisy_pixels (scalar) : Fraction of noisy pixels\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data matrix + noise\n",
"\n",
" \"\"\"\n",
"\n",
" X_noisy = np.reshape(X, (X.shape[0] * X.shape[1]))\n",
" N_noise_ixs = int(X_noisy.shape[0] * frac_noisy_pixels)\n",
" noise_ixs = np.random.choice(X_noisy.shape[0], size=N_noise_ixs,\n",
" replace=False)\n",
" X_noisy[noise_ixs] = np.random.uniform(0, 255, noise_ixs.shape)\n",
" X_noisy = np.reshape(X_noisy, (X.shape[0], X.shape[1]))\n",
"\n",
" return X_noisy\n",
"\n",
"\n",
"def change_of_basis(X, W):\n",
" \"\"\"\n",
" Projects data onto a new basis.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponding to a\n",
" different random variable\n",
" W (numpy array of floats) : new orthonormal basis columns correspond to\n",
" basis vectors\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data matrix expressed in new basis\n",
" \"\"\"\n",
"\n",
" Y = np.matmul(X, W)\n",
"\n",
" return Y\n",
"\n",
"\n",
"def get_sample_cov_matrix(X):\n",
" \"\"\"\n",
" Returns the sample covariance matrix of data X.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Covariance matrix\n",
"\"\"\"\n",
"\n",
" X = X - np.mean(X, 0)\n",
" cov_matrix = 1 / X.shape[0] * np.matmul(X.T, X)\n",
" return cov_matrix\n",
"\n",
"\n",
"def sort_evals_descending(evals, evectors):\n",
" \"\"\"\n",
" Sorts eigenvalues and eigenvectors in decreasing order. Also aligns first two\n",
" eigenvectors to be in first two quadrants (if 2D).\n",
"\n",
" Args:\n",
" evals (numpy array of floats) : Vector of eigenvalues\n",
" evectors (numpy array of floats) : Corresponding matrix of eigenvectors\n",
" each column corresponds to a different\n",
" eigenvalue\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Vector of eigenvalues after sorting\n",
" (numpy array of floats) : Matrix of eigenvectors after sorting\n",
" \"\"\"\n",
"\n",
" index = np.flip(np.argsort(evals))\n",
" evals = evals[index]\n",
" evectors = evectors[:, index]\n",
" if evals.shape[0] == 2:\n",
" if np.arccos(np.matmul(evectors[:, 0],\n",
" 1 / np.sqrt(2) * np.array([1, 1]))) > np.pi / 2:\n",
" evectors[:, 0] = -evectors[:, 0]\n",
" if np.arccos(np.matmul(evectors[:, 1],\n",
" 1 / np.sqrt(2)*np.array([-1, 1]))) > np.pi / 2:\n",
" evectors[:, 1] = -evectors[:, 1]\n",
"\n",
" return evals, evectors\n",
"\n",
"\n",
"def pca(X):\n",
" \"\"\"\n",
" Performs PCA on multivariate data. Eigenvalues are sorted in decreasing order\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data projected onto the new basis\n",
" (numpy array of floats) : Corresponding matrix of eigenvectors\n",
" (numpy array of floats) : Vector of eigenvalues\n",
"\n",
" \"\"\"\n",
"\n",
" X = X - np.mean(X, 0)\n",
" cov_matrix = get_sample_cov_matrix(X)\n",
" evals, evectors = np.linalg.eigh(cov_matrix)\n",
" evals, evectors = sort_evals_descending(evals, evectors)\n",
" score = change_of_basis(X, evectors)\n",
"\n",
" return score, evectors, evals"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 0: Intro to PCA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 1: PCA for dimensionality reduction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 1: PCA for dimensionality reduction\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'oO0bbInoO_0'), ('Bilibili', 'BV1up4y1S7xs')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_PCA_for_dimensionality_reduction_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 1: Perform PCA on MNIST\n",
"\n",
"The MNIST dataset consists of 70,000 images of individual handwritten digits. Each image is a 28x28 pixel grayscale image. For convenience, each 28x28 pixel image is often unravelled into a single 784 (=28x28) element vector, so that the whole dataset is represented as a 70,000 x 784 matrix. Each row represents a different image, and each column represents a different pixel.\n",
"\n",
"Enter the following cell to load the MNIST dataset and plot the first nine images. It may take a few minutes to load."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_openml\n",
"\n",
"# GET mnist data\n",
"mnist = fetch_openml(name='mnist_784', as_frame=False, parser='auto')\n",
"X = mnist.data\n",
"\n",
"# Visualize\n",
"plot_MNIST_sample(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"The MNIST dataset has an extrinsic dimensionality of 784, much higher than the 2-dimensional examples used in the previous tutorials! To make sense of this data, we'll use dimensionality reduction. But first, we need to determine the intrinsic dimensionality $K$ of the data. One way to do this is to look for an \"elbow\" in the scree plot, to determine which eigenvalues are significant."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 1: Scree plot of MNIST\n",
"\n",
"In this exercise you will examine the scree plot in the MNIST dataset.\n",
"\n",
"**Steps:**\n",
"- Perform PCA on the dataset using our function `pca` from tutorial 2 (already loaded in) and examine the scree plot.\n",
"- When do the eigenvalues appear (by eye) to reach zero? (**Hint:** use `plt.xlim` to zoom into a section of the plot).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(pca)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"#################################################\n",
"## TODO for students\n",
"# Fill out function and remove\n",
"raise NotImplementedError(\"Student exercise: perform PCA and visualize scree plot\")\n",
"#################################################\n",
"\n",
"# Perform PCA\n",
"score, evectors, evals = ...\n",
"\n",
"# Plot the eigenvalues\n",
"plot_eigenvalues(evals, xlimit=True) # limit x-axis up to 100 for zooming\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"# Perform PCA\n",
"score, evectors, evals = pca(X)\n",
"\n",
"# Plot the eigenvalues\n",
"with plt.xkcd():\n",
" plot_eigenvalues(evals, xlimit=True) # limit x-axis up to 100 for zooming"
]
},
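{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"As an optional cross-check (a sketch, not part of the exercise), `sklearn.decomposition.PCA` computes the same spectrum. Note that its `explained_variance_` uses the unbiased $1/(n-1)$ normalization, whereas our `pca` helper divides by $n$, so the two differ by a factor of $n/(n-1)$ (negligible with 70,000 samples). Fitting on the full dataset may take a little while.\n",
"\n",
"```python\n",
"from sklearn.decomposition import PCA\n",
"\n",
"sk_pca = PCA().fit(X)\n",
"n = X.shape[0]\n",
"\n",
"# Rescale sklearn's eigenvalues to our 1/n convention and compare the top few\n",
"print(sk_pca.explained_variance_[:5] * (n - 1) / n)\n",
"print(evals[:5])\n",
"```"
]
},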
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Scree_plot_of_MNIST_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 2: Calculate the variance explained\n",
"\n",
"*Estimated timing to here from start of tutorial: 15 min*\n",
"\n",
"The scree plot suggests that most of the eigenvalues are near zero, with fewer than 100 having large values. Another common way to determine the intrinsic dimensionality is by considering the variance explained. This can be examined with a cumulative plot of the fraction of the total variance explained by the top $K$ components, i.e.,\n",
"\n",
"\\begin{equation}\n",
"\\text{var explained} = \\frac{\\sum_{i=1}^K \\lambda_i}{\\sum_{i=1}^N \\lambda_i}\n",
"\\end{equation}\n",
"\n",
"where $\\lambda_i$ is the $i^{th}$ eigenvalue and $N$ is the total number of components (the original number of dimensions in the data).\n",
"\n",
"The intrinsic dimensionality is often quantified by the $K$ necessary to explain a large proportion of the total variance of the data (often a defined threshold, e.g., 90%)."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 2: Plot the explained variance\n",
"\n",
"In this exercise you will plot the explained variance.\n",
"\n",
"**Steps:**\n",
"- Fill in the function below to calculate the fraction variance explained as a function of the number of principal components. **Hint:** use `np.cumsum`.\n",
"- Plot the variance explained using `plot_variance_explained`.\n",
"\n",
"**Questions:**\n",
"- How many principal components are required to explain 90% of the variance?\n",
"- How does the intrinsic dimensionality of this dataset compare to its extrinsic dimensionality?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def get_variance_explained(evals):\n",
" \"\"\"\n",
" Calculates variance explained from the eigenvalues.\n",
"\n",
" Args:\n",
" evals (numpy array of floats) : Vector of eigenvalues\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Vector of variance explained\n",
"\n",
" \"\"\"\n",
"\n",
" #################################################\n",
" ## TO DO for students: calculate the explained variance using the equation\n",
" ## from Section 2.\n",
" # Comment once you've filled in the function\n",
" raise NotImplementedError(\"Student exercise: calculate explain variance!\")\n",
" #################################################\n",
"\n",
" # Cumulatively sum the eigenvalues\n",
" csum = ...\n",
"\n",
" # Normalize by the sum of eigenvalues\n",
" variance_explained = ...\n",
"\n",
" return variance_explained\n",
"\n",
"\n",
"# Calculate the variance explained\n",
"variance_explained = get_variance_explained(evals)\n",
"\n",
"# Visualize\n",
"plot_variance_explained(variance_explained)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"def get_variance_explained(evals):\n",
" \"\"\"\n",
" Plots eigenvalues.\n",
"\n",
" Args:\n",
" (numpy array of floats) : Vector of eigenvalues\n",
"\n",
" Returns:\n",
" Nothing.\n",
"\n",
" \"\"\"\n",
"\n",
" # Cumulatively sum the eigenvalues\n",
" csum = np.cumsum(evals)\n",
"\n",
" # Normalize by the sum of eigenvalues\n",
" variance_explained = csum / np.sum(evals)\n",
"\n",
" return variance_explained\n",
"\n",
"\n",
"# Calculate the variance explained\n",
"variance_explained = get_variance_explained(evals)\n",
"\n",
"# Visualize\n",
"with plt.xkcd():\n",
" plot_variance_explained(variance_explained)"
]
},
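{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To read the answer off the curve programmatically, here is a minimal sketch: because `variance_explained` is non-decreasing, `np.searchsorted` returns the first index at which it reaches the 90% threshold.\n",
"\n",
"```python\n",
"# Smallest K whose cumulative variance explained reaches 90%\n",
"K_90 = np.searchsorted(variance_explained, 0.9) + 1\n",
"print(f'{K_90} components explain 90% of the variance')\n",
"```"
]
},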
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Plot_the_explained_variance_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 3: Reconstruct data with different numbers of PCs\n",
"\n",
"*Estimated timing to here from start of tutorial: 25 min*\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 2: Data Reconstruction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 2: Data Reconstruction\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'ZCUhW26AdBQ'), ('Bilibili', 'BV1XK4y1s7KF')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Data_reconstruction_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"Now we have seen that the top 100 or so principal components of the data can explain most of the variance. We can use this fact to perform *dimensionality reduction*, i.e., by storing the data using only 100 components rather than the samples of all 784 pixels. Remarkably, we will be able to reconstruct much of the structure of the data using only the top 100 components. To see this, recall that to perform PCA we projected the data $\\bf X$ onto the eigenvectors of the covariance matrix:\n",
"\n",
"\\begin{equation}\n",
"\\bf S = X W\n",
"\\end{equation}\n",
"\n",
"Since $\\bf W$ is an orthogonal matrix, ${\\bf W}^{-1} = {\\bf W}^\\top$. So by multiplying by ${\\bf W}^\\top$ on each side we can rewrite this equation as \n",
"\n",
"\\begin{equation}\n",
"{\\bf X = S W}^\\top.\n",
"\\end{equation}\n",
"\n",
"This now gives us a way to reconstruct the data matrix from the scores and loadings. To reconstruct the data from a low-dimensional approximation, we just have to truncate these matrices. Let's denote ${\\bf S}_{1:K}$ and ${\\bf W}_{1:K}$ the matrices with only the first $K$ columns of $\\bf S$ and $\\bf W$, respectively. Then our reconstruction is:\n",
"\n",
"\\begin{equation}\n",
"{\\bf \\hat X = S}_{1:K} ({\\bf W}_{1:K})^\\top.\n",
"\\end{equation}"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 3: Data reconstruction\n",
"\n",
"Fill in the function below to reconstruct the data using different numbers of principal components.\n",
"\n",
"**Steps:**\n",
"\n",
"* Fill in the following function to reconstruct the data based on the weights and scores. Don't forget to add the mean!\n",
"* Make sure your function works by reconstructing the data with all $K=784$ components. The two images should look identical."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def reconstruct_data(score, evectors, X_mean, K):\n",
" \"\"\"\n",
" Reconstruct the data based on the top K components.\n",
"\n",
" Args:\n",
" score (numpy array of floats) : Score matrix\n",
" evectors (numpy array of floats) : Matrix of eigenvectors\n",
" X_mean (numpy array of floats) : Vector corresponding to data mean\n",
" K (scalar) : Number of components to include\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Matrix of reconstructed data\n",
"\n",
" \"\"\"\n",
"\n",
" #################################################\n",
" ## TO DO for students: Reconstruct the original data in X_reconstructed\n",
" # Comment once you've filled in the function\n",
" raise NotImplementedError(\"Student exercise: reconstructing data function!\")\n",
" #################################################\n",
"\n",
" # Reconstruct the data from the score and eigenvectors\n",
" # Don't forget to add the mean!!\n",
" X_reconstructed = ...\n",
"\n",
" return X_reconstructed\n",
"\n",
"\n",
"K = 784 # data dimensions\n",
"\n",
"# Reconstruct the data based on all components\n",
"X_mean = np.mean(X, 0)\n",
"X_reconstructed = reconstruct_data(score, evectors, X_mean, K)\n",
"\n",
"# Plot the data and reconstruction\n",
"plot_MNIST_reconstruction(X, X_reconstructed, K)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"def reconstruct_data(score, evectors, X_mean, K):\n",
" \"\"\"\n",
" Reconstruct the data based on the top K components.\n",
"\n",
" Args:\n",
" score (numpy array of floats) : Score matrix\n",
" evectors (numpy array of floats) : Matrix of eigenvectors\n",
" X_mean (numpy array of floats) : Vector corresponding to data mean\n",
" K (scalar) : Number of components to include\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Matrix of reconstructed data\n",
"\n",
" \"\"\"\n",
"\n",
" # Reconstruct the data from the score and eigenvectors\n",
" # Don't forget to add the mean!!\n",
" X_reconstructed = np.matmul(score[:, :K], evectors[:, :K].T) + X_mean\n",
"\n",
" return X_reconstructed\n",
"\n",
"\n",
"K = 784 # data dimensions\n",
"\n",
"# Reconstruct the data based on all components\n",
"X_mean = np.mean(X, 0)\n",
"X_reconstructed = reconstruct_data(score, evectors, X_mean, K)\n",
"\n",
"# Plot the data and reconstruction\n",
"with plt.xkcd():\n",
" plot_MNIST_reconstruction(X, X_reconstructed, K)"
]
},
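{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To quantify how the reconstruction degrades as components are dropped, one can compute the mean squared error for a few values of $K$. A quick sketch, reusing the solution's `reconstruct_data` on the first 1000 images to keep memory modest:\n",
"\n",
"```python\n",
"# Reconstruction MSE as a function of the number of components kept\n",
"for K in (1, 10, 50, 100, 300, 784):\n",
"  X_hat = reconstruct_data(score[:1000], evectors, X_mean, K)\n",
"  mse = np.mean((X[:1000] - X_hat) ** 2)\n",
"  print(f'K = {K:3d}: MSE = {mse:.1f}')\n",
"```"
]
},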
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Data_reconstruction_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Interactive Demo 3: Reconstruct the data matrix using different numbers of PCs\n",
"\n",
"Now run the code below and experiment with the slider to reconstruct the data matrix using different numbers of principal components.\n",
"\n",
"1. How many principal components are necessary to reconstruct the numbers (by eye)? How does this relate to the intrinsic dimensionality of the data?\n",
"2. Do you see any information in the data with only a single principal component?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Make sure you execute this cell to enable the widget!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Make sure you execute this cell to enable the widget!\n",
"\n",
"def refresh(K=100):\n",
" X_reconstructed = reconstruct_data(score, evectors, X_mean, K)\n",
" plot_MNIST_reconstruction(X, X_reconstructed, K)\n",
"\n",
"\n",
"_ = widgets.interact(refresh, K=(1, 784, 10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove explanation\n",
"\"\"\"\n",
"1) By around 91 components, they're pretty clear and similar to the original data.\n",
"This means that the intrinsic dimensionality of the data is less than 100.\n",
"\n",
"2) You can't see too much but you seem to be able to partially distinguish ones or other numbers with\n",
"a vertical straight line (like the 9) from the rest, which resembles 0 in the reconstruction. You could\n",
"probably usually distinguish 0s from 1s.\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Reconstruct_the_data_matrix_using_different_numbers_of_PCs_Interactive_Demo_and_discussion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 4: Visualize PCA components\n",
"\n",
"*Estimated timing to here from start of tutorial: 40 min*"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 4: Visualization of the weights\n",
"\n",
"Next, let's take a closer look at the first principal component by visualizing its corresponding weights.\n",
"\n",
"**Steps:**\n",
"\n",
"* Enter `plot_MNIST_weights` to visualize the weights of the first basis vector.\n",
"* What structure do you see? Which pixels have a strong positive weighting? Which have a strong negative weighting? What kinds of images would this basis vector differentiate (hint: think about the last demo with 1 component)?\n",
"* Try visualizing the second and third basis vectors. Do you see any structure? What about the 100th basis vector? 500th? 700th?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(plot_MNIST_weights)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"################################################################\n",
"# Comment once you've filled in the function\n",
"raise NotImplementedError(\"Student exercise: visualize PCA components\")\n",
"################################################################\n",
"\n",
"# Plot the weights of the first principal component\n",
"plot_MNIST_weights(...)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"# Plot the weights of the first principal component\n",
"with plt.xkcd():\n",
" plot_MNIST_weights(evectors[:, 0])"
]
},
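{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To explore the later components suggested in the exercise, the same function can be looped over other columns of `evectors` (e.g., the 2nd, 3rd, 100th, and 500th; the chosen indices are arbitrary):\n",
"\n",
"```python\n",
"# Visualize a few more basis vectors (columns are zero-indexed)\n",
"for i in (1, 2, 99, 499):\n",
"  plot_MNIST_weights(evectors[:, i])\n",
"```"
]
},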
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Visualization_of_the_weights_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"*Estimated timing of tutorial: 50 minutes*\n",
"\n",
"* In this tutorial, we learned how to use PCA for dimensionality reduction by selecting the top principal components. This can be useful as the intrinsic dimensionality ($K$) is often less than the extrinsic dimensionality ($N$) in neural data. $K$ can be inferred by choosing the number of eigenvalues necessary to capture some fraction of the variance.\n",
"* We also learned how to reconstruct an approximation of the original data using the top $K$ principal components. In fact, an alternate formulation of PCA is to find the $K$ dimensional space that minimizes the reconstruction error.\n",
"* Noise tends to inflate the apparent intrinsic dimensionality, however the higher components reflect noise rather than new structure in the data. PCA can be used for denoising data by removing noisy higher components.\n",
"* In MNIST, the weights corresponding to the first principal component appear to discriminate between a 0 and 1. We will discuss the implications of this for data visualization in the following tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Notation\n",
"\n",
"\\begin{align}\n",
"K &\\quad \\text{selected number of principal components}\\\\\n",
"N &\\quad \\text{total number of principal components}\\\\\n",
"\\bf W &\\quad \\text{weights, loadings matrix}\\\\\n",
"{\\bf X} &\\quad \\text{original data matrix}\\\\\n",
"\\bf S &\\quad \\text{projected matrix, scores}\\\\\n",
"{\\bf S}_{1:K} &\\quad \\text{first K columns of score matrix } \\bf S\\\\\n",
"{\\bf W}_{1:K} &\\quad \\text{first K columns of weight matrix } \\bf W\\\\\n",
"\\end{align}"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Bonus Section 1: Examine denoising using PCA\n",
"\n",
"In this lecture, we saw that PCA finds an optimal low-dimensional basis to minimize the reconstruction error. Because of this property, PCA can be useful for denoising corrupted samples of the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Bonus Coding Exercise 1: Add noise to the data\n",
"In this exercise you will add salt-and-pepper noise to the original data and see how that affects the eigenvalues.\n",
"\n",
"**Steps:**\n",
"- Use the function `add_noise` to add noise to 20% of the pixels.\n",
"- Then, perform PCA and plot the variance explained. How many principal components are required to explain 90% of the variance? How does this compare to the original data?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(add_noise)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"#################################################\n",
"## TO DO for students\n",
"# Comment once you've filled in the function\n",
"raise NotImplementedError(\"Student exercise: make MNIST noisy and compute PCA!\")\n",
"#################################################\n",
"\n",
"np.random.seed(2020) # set random seed\n",
"\n",
"# Add noise to data\n",
"X_noisy = ...\n",
"\n",
"# Perform PCA on noisy data\n",
"score_noisy, evectors_noisy, evals_noisy = ...\n",
"\n",
"# Compute variance explained\n",
"variance_explained_noisy = ...\n",
"\n",
"# Visualize\n",
"plot_MNIST_sample(X_noisy)\n",
"plot_variance_explained(variance_explained_noisy)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"np.random.seed(2020) # set random seed\n",
"\n",
"# Add noise to data\n",
"X_noisy = add_noise(X, .2)\n",
"\n",
"# Perform PCA on noisy data\n",
"score_noisy, evectors_noisy, evals_noisy = pca(X_noisy)\n",
"\n",
"# Compute variance explained\n",
"variance_explained_noisy = get_variance_explained(evals_noisy)\n",
"\n",
"# Visualize\n",
"with plt.xkcd():\n",
" plot_MNIST_sample(X_noisy)\n",
" plot_variance_explained(variance_explained_noisy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Add_noise_to_the_data_Bonus_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Bonus Coding Exercise 2: Denoising\n",
"\n",
"Next, use PCA to perform denoising by projecting the noise-corrupted data onto the basis vectors found from the original dataset. By taking the top K components of this projection, we can reduce noise in dimensions orthogonal to the K-dimensional latent space.\n",
"\n",
"**Steps:**\n",
"- Subtract the mean of the noise-corrupted data.\n",
"- Project the data onto the basis found with the original dataset (`evectors`, not `evectors_noisy`) and take the top $K$ components.\n",
"- Reconstruct the data as normal, using the top 50 components.\n",
"- Play around with the amount of noise and K to build intuition.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"#################################################\n",
"## TO DO for students\n",
"# Comment once you've filled in the function\n",
"raise NotImplementedError(\"Student exercise: reconstruct noisy data from PCA\")\n",
"#################################################\n",
"\n",
"np.random.seed(2020) # set random seed\n",
"\n",
"# Add noise to data\n",
"X_noisy = ...\n",
"\n",
"# Compute mean of noise-corrupted data\n",
"X_noisy_mean = ...\n",
"\n",
"# Project onto the original basis vectors\n",
"projX_noisy = ...\n",
"\n",
"# Reconstruct the data using the top 50 components\n",
"K = 50\n",
"X_reconstructed = ...\n",
"\n",
"# Visualize\n",
"plot_MNIST_reconstruction(X_noisy, X_reconstructed, K)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"np.random.seed(2020) # set random seed\n",
"\n",
"# Add noise to data\n",
"X_noisy = add_noise(X, .2)\n",
"\n",
"# Compute mean of noise-corrupted data\n",
"X_noisy_mean = np.mean(X_noisy, 0)\n",
"\n",
"# Project onto the original basis vectors\n",
"projX_noisy = np.matmul(X_noisy - X_noisy_mean, evectors)\n",
"\n",
"# Reconstruct the data using the top 50 components\n",
"K = 50\n",
"X_reconstructed = reconstruct_data(projX_noisy, evectors, X_noisy_mean, K)\n",
"\n",
"# Visualize\n",
"with plt.xkcd():\n",
" plot_MNIST_reconstruction(X_noisy, X_reconstructed, K)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Denoising_Bonus_Exercise\")"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"NgVDJxt9fWm9",
"KXKX7e8hfWnB"
],
"include_colab_link": true,
"name": "W1D4_Tutorial3",
"provenance": [],
"toc_visible": true
},
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
}
},
"nbformat": 4,
"nbformat_minor": 0
}