{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {},
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# Tutorial 2: Principal Component Analysis\n",
"\n",
"**Week 1, Day 4: Dimensionality Reduction**\n",
"\n",
"**By Neuromatch Academy**\n",
"\n",
"**Content creators:** Alex Cayco Gajic, John Murray\n",
"\n",
"**Content reviewers:** Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom, Siddharth Suresh, Natalie Schaworonkow, Ella Batty\n",
"\n",
"**Production editor:** Spiros Chavlis"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Tutorial Objectives\n",
"\n",
"*Estimated timing of tutorial: 45 minutes*\n",
"\n",
"In this notebook we'll learn how to perform PCA by projecting the data onto the eigenvectors of its covariance matrix.\n",
"\n",
"Overview:\n",
"- Calculate the eigenvectors of the sample covariance matrix.\n",
"- Perform PCA by projecting data onto the eigenvectors of the covariance matrix.\n",
"- Plot and explore the eigenvalues.\n",
"\n",
"To quickly refresh your knowledge of eigenvalues and eigenvectors, you can watch this [short video](https://www.youtube.com/watch?v=kwA3qM0rm7c) (4 minutes) for a geometrical explanation. For a deeper understanding, this [in-depth video](https://www.youtube.com/watch?v=PFDu9oVAE-g&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab&index=14) (17 minutes) provides an excellent basis and is beautifully illustrated."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @markdown\n",
"from IPython.display import IFrame\n",
"from ipywidgets import widgets\n",
"out = widgets.Output()\n",
"with out:\n",
" print(f\"If you want to download the slides: https://osf.io/download/kaq2x/\")\n",
" display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/kaq2x/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n",
"display(out)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Setup\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import feedback gadget\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Install and import feedback gadget\n",
"\n",
"!pip3 install vibecheck datatops --quiet\n",
"\n",
"from vibecheck import DatatopsContentReviewContainer\n",
"def content_review(notebook_section: str):\n",
" return DatatopsContentReviewContainer(\n",
" \"\", # No text prompt\n",
" notebook_section,\n",
" {\n",
" \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n",
" \"name\": \"neuromatch_cn\",\n",
" \"user_key\": \"y1x3mpx5\",\n",
" },\n",
" ).render()\n",
"\n",
"\n",
"feedback_prefix = \"W1D4_T2\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# Imports\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Figure Settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Figure Settings\n",
"import logging\n",
"logging.getLogger('matplotlib.font_manager').disabled = True\n",
"\n",
"import ipywidgets as widgets # interactive display\n",
"%config InlineBackend.figure_format = 'retina'\n",
"plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting Functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Plotting Functions\n",
"\n",
"def plot_eigenvalues(evals):\n",
" \"\"\"\n",
" Plots eigenvalues.\n",
"\n",
" Args:\n",
" (numpy array of floats) : Vector of eigenvalues\n",
"\n",
" Returns:\n",
" Nothing.\n",
"\n",
" \"\"\"\n",
" plt.figure(figsize=(4, 4))\n",
" plt.plot(np.arange(1, len(evals) + 1), evals, 'o-k')\n",
" plt.xlabel('Component')\n",
" plt.ylabel('Eigenvalue')\n",
" plt.title('Scree plot')\n",
" plt.xticks(np.arange(1, len(evals) + 1))\n",
" plt.ylim([0, 2.5])\n",
" plt.show()\n",
"\n",
"\n",
"def plot_data(X):\n",
" \"\"\"\n",
" Plots bivariate data. Includes a plot of each random variable, and a scatter\n",
" scatter plot of their joint activity. The title indicates the sample\n",
" correlation calculated from the data.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
"\n",
" fig = plt.figure(figsize=[8, 4])\n",
" gs = fig.add_gridspec(2, 2)\n",
" ax1 = fig.add_subplot(gs[0, 0])\n",
" ax1.plot(X[:, 0], color='k')\n",
" plt.ylabel('Neuron 1')\n",
" ax2 = fig.add_subplot(gs[1, 0])\n",
" ax2.plot(X[:, 1], color='k')\n",
" plt.xlabel('Sample Number (sorted)')\n",
" plt.ylabel('Neuron 2')\n",
" ax3 = fig.add_subplot(gs[:, 1])\n",
" ax3.plot(X[:, 0], X[:, 1], '.', markerfacecolor=[.5, .5, .5],\n",
" markeredgewidth=0)\n",
" ax3.axis('equal')\n",
" plt.xlabel('Neuron 1 activity')\n",
" plt.ylabel('Neuron 2 activity')\n",
" plt.title('Sample corr: {:.1f}'.format(np.corrcoef(X[:, 0], X[:, 1])[0, 1]))\n",
" plt.show()\n",
"\n",
"\n",
"def plot_data_new_basis(Y):\n",
" \"\"\"\n",
" Plots bivariate data after transformation to new bases. Similar to plot_data\n",
" but with colors corresponding to projections onto basis 1 (red) and\n",
" basis 2 (blue).\n",
" The title indicates the sample correlation calculated from the data.\n",
"\n",
" Note that samples are re-sorted in ascending order for the first random\n",
" variable.\n",
"\n",
" Args:\n",
" Y (numpy array of floats) : Data matrix in new basis each column\n",
" corresponds to a different random variable\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
"\n",
" fig = plt.figure(figsize=[8, 4])\n",
" gs = fig.add_gridspec(2, 2)\n",
" ax1 = fig.add_subplot(gs[0, 0])\n",
" ax1.plot(Y[:, 0], 'r')\n",
" plt.ylabel('Projection \\n basis vector 1')\n",
" ax2 = fig.add_subplot(gs[1, 0])\n",
" ax2.plot(Y[:, 1], 'b')\n",
" plt.xlabel('Sample number')\n",
" plt.ylabel('Projection \\n basis vector 2')\n",
" ax3 = fig.add_subplot(gs[:, 1])\n",
" ax3.plot(Y[:, 0], Y[:, 1], '.', color=[.5, .5, .5])\n",
" ax3.axis('equal')\n",
" plt.xlabel('Projection basis vector 1')\n",
" plt.ylabel('Projection basis vector 2')\n",
" plt.title('Sample corr: {:.1f}'.format(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]))\n",
" plt.show()\n",
"\n",
"\n",
"def plot_basis_vectors(X, W):\n",
" \"\"\"\n",
" Plots bivariate data as well as new basis vectors.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
" W (numpy array of floats) : Square matrix representing new orthonormal\n",
" basis each column represents a basis vector\n",
"\n",
" Returns:\n",
" Nothing.\n",
" \"\"\"\n",
"\n",
" plt.figure(figsize=[4, 4])\n",
" plt.plot(X[:, 0], X[:, 1], '.', color=[.5, .5, .5], label='Data')\n",
" plt.axis('equal')\n",
" plt.xlabel('Neuron 1 activity')\n",
" plt.ylabel('Neuron 2 activity')\n",
" plt.plot([0, W[0, 0]], [0, W[1, 0]], color='r', linewidth=3,\n",
" label='Basis vector 1')\n",
" plt.plot([0, W[0, 1]], [0, W[1, 1]], color='b', linewidth=3,\n",
" label='Basis vector 2')\n",
" plt.legend()\n",
" plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper functions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Helper functions\n",
"\n",
"def sort_evals_descending(evals, evectors):\n",
" \"\"\"\n",
" Sorts eigenvalues and eigenvectors in decreasing order. Also aligns first two\n",
" eigenvectors to be in first two quadrants (if 2D).\n",
"\n",
" Args:\n",
" evals (numpy array of floats) : Vector of eigenvalues\n",
" evectors (numpy array of floats) : Corresponding matrix of eigenvectors\n",
" each column corresponds to a different\n",
" eigenvalue\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Vector of eigenvalues after sorting\n",
" (numpy array of floats) : Matrix of eigenvectors after sorting\n",
" \"\"\"\n",
"\n",
" index = np.flip(np.argsort(evals))\n",
" evals = evals[index]\n",
" evectors = evectors[:, index]\n",
" if evals.shape[0] == 2:\n",
" if np.arccos(np.matmul(evectors[:, 0],\n",
" 1 / np.sqrt(2) * np.array([1, 1]))) > np.pi / 2:\n",
" evectors[:, 0] = -evectors[:, 0]\n",
" if np.arccos(np.matmul(evectors[:, 1],\n",
" 1 / np.sqrt(2) * np.array([-1, 1]))) > np.pi / 2:\n",
" evectors[:, 1] = -evectors[:, 1]\n",
" return evals, evectors\n",
"\n",
"\n",
"def get_data(cov_matrix):\n",
" \"\"\"\n",
" Returns a matrix of 1000 samples from a bivariate, zero-mean Gaussian\n",
"\n",
" Note that samples are sorted in ascending order for the first random\n",
" variable.\n",
"\n",
" Args:\n",
" var_1 (scalar) : variance of the first random variable\n",
" var_2 (scalar) : variance of the second random variable\n",
" cov_matrix (numpy array of floats) : desired covariance matrix\n",
"\n",
" Returns:\n",
" (numpy array of floats) : samples from the bivariate Gaussian,\n",
" with each column corresponding to a\n",
" different random variable\n",
" \"\"\"\n",
"\n",
" mean = np.array([0, 0])\n",
" X = np.random.multivariate_normal(mean, cov_matrix, size=1000)\n",
" indices_for_sorting = np.argsort(X[:, 0])\n",
" X = X[indices_for_sorting, :]\n",
" return X\n",
"\n",
"\n",
"def calculate_cov_matrix(var_1, var_2, corr_coef):\n",
" \"\"\"\n",
" Calculates the covariance matrix based on the variances and\n",
" correlation coefficient.\n",
"\n",
" Args:\n",
" var_1 (scalar) : variance of the first random variable\n",
" var_2 (scalar) : variance of the second random variable\n",
" corr_coef (scalar) : correlation coefficient\n",
"\n",
" Returns:\n",
" (numpy array of floats) : covariance matrix\n",
" \"\"\"\n",
" cov = corr_coef * np.sqrt(var_1 * var_2)\n",
" cov_matrix = np.array([[var_1, cov], [cov, var_2]])\n",
" return cov_matrix\n",
"\n",
"\n",
"def define_orthonormal_basis(u):\n",
" \"\"\"\n",
" Calculates an orthonormal basis given an arbitrary vector u.\n",
"\n",
" Args:\n",
" u (numpy array of floats) : arbitrary 2D vector used for new basis\n",
"\n",
" Returns:\n",
" (numpy array of floats) : new orthonormal basis columns correspond to\n",
" basis vectors\n",
" \"\"\"\n",
"\n",
" u = u / np.sqrt(u[0] ** 2 + u[1] ** 2)\n",
" w = np.array([-u[1], u[0]])\n",
" W = np.column_stack((u, w))\n",
" return W\n",
"\n",
"\n",
"def change_of_basis(X, W):\n",
" \"\"\"\n",
" Projects data onto a new basis.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponding to a\n",
" different random variable\n",
" W (numpy array of floats) : new orthonormal basis columns correspond to\n",
" basis vectors\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data matrix expressed in new basis\n",
" \"\"\"\n",
"\n",
" Y = np.matmul(X, W)\n",
" return Y"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 0: Intro to PCA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video 1: PCA\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 1: PCA\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', '-f6T9--oM0E'), ('Bilibili', 'BV1hK411H7Zi')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_PCA_Video\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 1: Calculate the eigenvectors of the the sample covariance matrix\n",
"\n",
"As we saw in the lecture, PCA represents data in a new orthonormal basis defined by the eigenvectors of the covariance matrix. Remember that in the previous tutorial, we generated bivariate normal data with a specified covariance matrix $\\bf \\Sigma$, whose $(i,j)$th element is:\n",
"\n",
"\\begin{equation}\n",
"\\Sigma_{ij} = \\mathbb{E}[ x_i x_j ] - \\mathbb{E}[ x_i] \\mathbb{E}[ x_j ] .\n",
"\\end{equation}\n",
"\n",
"However, in real life we don't have access to this ground-truth covariance\n",
"matrix. To get around this, we can use the sample covariance matrix, $\\bf\\hat\\Sigma$, which is calculated directly from the data. The $(i,j)$th element of the sample covariance matrix is:\n",
"\n",
"\\begin{equation}\n",
"\\hat \\Sigma_{ij} = \\frac{1}{N_\\text{samples}}{\\bf x}_i^\\top {\\bf x}_j - \\bar {\\bf x}_i \\bar{\\bf x}_j ,\n",
"\\end{equation}\n",
"\n",
"where ${\\bf x}_i = [ x_i(1), x_i(2), \\dots,x_i(N_\\text{samples})]^\\top$ is a column vector representing all measurements of neuron $i$, and $\\bar {\\bf x}_i$ is the mean of neuron $i$ across samples:\n",
"\n",
"\\begin{equation}\n",
"\\bar {\\bf x}_i = \\frac{1}{N_\\text{samples}} \\sum_{k=1}^{N_\\text{samples}} x_i(k).\n",
"\\end{equation}\n",
"\n",
"If we assume that the data has already been mean-subtracted, then we can write the sample covariance matrix in a much simpler matrix form:\n",
"\n",
"\\begin{equation}\n",
"{\\bf \\hat \\Sigma} = \\frac{1}{N_\\text{samples}} {\\bf X}^\\top {\\bf X}.\n",
"\\end{equation}\n",
"\n",
"where $\\bf X$ is the full data matrix (each column of $\\bf X$ is a different $\\bf x_i$)."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 1.1: Calculate the covariance matrix\n",
"\n",
"Before calculating the eigenvectors, you must first calculate the sample covariance matrix.\n",
"\n",
"**Steps**\n",
"* Complete the function `get_sample_cov_matrix` by first subtracting the sample mean of the data, then calculate $\\bf \\hat \\Sigma$ using the equation above.\n",
"* Use `get_data` to generate bivariate normal data, and calculate the sample covariance matrix with your finished `get_sample_cov_matrix`. Compare this estimate to the true covariate matrix using `calculate_cov_matrix`. You've seen both `get_data` and `calculate_cov_matrix` in Tutorial 1.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(get_data)\n",
"help(calculate_cov_matrix)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def get_sample_cov_matrix(X):\n",
" \"\"\"\n",
" Returns the sample covariance matrix of data X\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Covariance matrix\n",
" \"\"\"\n",
"\n",
" #################################################\n",
" ## TODO for students: calculate the covariance matrix\n",
" # Fill out function and remove\n",
" raise NotImplementedError(\"Student exercise: calculate the covariance matrix!\")\n",
" #################################################\n",
"\n",
" # Subtract the mean of X\n",
" X = ...\n",
"\n",
" # Calculate the covariance matrix (hint: use np.matmul)\n",
" cov_matrix = ...\n",
"\n",
" return cov_matrix\n",
"\n",
"\n",
"# Set parameters\n",
"np.random.seed(2020) # set random seed\n",
"variance_1 = 1\n",
"variance_2 = 1\n",
"corr_coef = 0.8\n",
"\n",
"# Calculate covariance matrix\n",
"cov_matrix = calculate_cov_matrix(variance_1, variance_2, corr_coef)\n",
"print(cov_matrix)\n",
"\n",
"# Generate data with that covariance matrix\n",
"X = get_data(cov_matrix)\n",
"\n",
"# Get sample covariance matrix\n",
"sample_cov_matrix = get_sample_cov_matrix(X)\n",
"print(sample_cov_matrix)\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"SAMPLE OUTPUT\n",
"```\n",
"[[1. 0.8]\n",
" [0.8 1. ]]\n",
"[[0.99315313 0.82347589]\n",
" [0.82347589 1.01281397]]\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"def get_sample_cov_matrix(X):\n",
" \"\"\"\n",
" Returns the sample covariance matrix of data X\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Covariance matrix\n",
" \"\"\"\n",
"\n",
" # Subtract the mean of X\n",
" X = X - np.mean(X, 0)\n",
"\n",
" # Calculate the covariance matrix (hint: use np.matmul)\n",
" cov_matrix = 1 / X.shape[0] * np.matmul(X.T, X)\n",
"\n",
" return cov_matrix\n",
"\n",
"\n",
"# Set parameters\n",
"np.random.seed(2020) # set random seed\n",
"variance_1 = 1\n",
"variance_2 = 1\n",
"corr_coef = 0.8\n",
"\n",
"# Calculate covariance matrix\n",
"cov_matrix = calculate_cov_matrix(variance_1, variance_2, corr_coef)\n",
"print(cov_matrix)\n",
"\n",
"# Generate data with that covariance matrix\n",
"X = get_data(cov_matrix)\n",
"\n",
"# Get sample covariance matrix\n",
"sample_cov_matrix = get_sample_cov_matrix(X)\n",
"print(sample_cov_matrix)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Calculate_the_covariance_matrix_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We get a sample covariance matrix that doesn't look too far off from the true covariance matrix, so that's a good sign!"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 1.2: Eigenvectors of the covariance matrix\n",
"\n",
"Next you will calculate the eigenvectors of the covariance matrix. Plot them along with the data to check that they align with the geometry of the data.\n",
"\n",
"**Steps:**\n",
"* Calculate the eigenvalues and eigenvectors of the sample covariance matrix. (**Hint:** use `np.linalg.eigh`, which finds the eigenvalues of a symmetric matrix).\n",
"* Use the provided code to sort the eigenvalues in descending order.\n",
"* Plot the eigenvectors on a scatter plot of the data, using the function `plot_basis_vectors`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(sort_evals_descending)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"#################################################\n",
"## TODO for students\n",
"# Fill out function and remove\n",
"raise NotImplementedError(\"Student exercise: calculate and sort eigenvalues\")\n",
"#################################################\n",
"\n",
"# Calculate the eigenvalues and eigenvectors\n",
"evals, evectors = ...\n",
"\n",
"# Sort the eigenvalues in descending order\n",
"evals, evectors = ...\n",
"\n",
"# Visualize\n",
"plot_basis_vectors(X, evectors)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"\n",
"# Calculate the eigenvalues and eigenvectors\n",
"evals, evectors = np.linalg.eigh(cov_matrix)\n",
"\n",
"# Sort the eigenvalues in descending order\n",
"evals, evectors = sort_evals_descending(evals, evectors)\n",
"\n",
"# Visualize\n",
"with plt.xkcd():\n",
" plot_basis_vectors(X, evectors)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Eigenvectors_of_the_covariance_matrix_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Section 2: Perform PCA by projecting data onto the eigenvectors\n",
"\n",
"*Estimated timing to here from start of tutorial: 25 min*\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To perform PCA, we will project the data onto the eigenvectors of the covariance matrix, i.e.,:\n",
"\n",
"\\begin{equation}\n",
"\\bf S = X W\n",
"\\end{equation}\n",
"\n",
"where $\\bf S$ is an $N_\\text{samples} \\times N$ matrix representing the projected data (also called *scores*), and $\\bf W$ is an $N\\times N$ orthogonal matrix, each of whose columns represents the eigenvectors of the covariance matrix (also called *weights* or *loadings*)."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Coding Exercise 2: PCA implementation\n",
"\n",
"You will now perform PCA on the data using the intuition and functions you have developed so far. Fill in the function below to carry out the steps to perform PCA by projecting the data onto the eigenvectors of its covariance matrix.\n",
"\n",
"**Steps:**\n",
"* First subtract the mean and calculate the sample covariance matrix.\n",
"* Then find the eigenvalues and eigenvectors and sort them in descending order.\n",
"* Finally project the mean-centered data onto the eigenvectors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"help(change_of_basis)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"```python\n",
"def pca(X):\n",
" \"\"\"\n",
" Sorts eigenvalues and eigenvectors in decreasing order.\n",
"\n",
" Args:\n",
" X (numpy array of floats): Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data projected onto the new basis\n",
" (numpy array of floats) : Vector of eigenvalues\n",
" (numpy array of floats) : Corresponding matrix of eigenvectors\n",
"\n",
" \"\"\"\n",
"\n",
" #################################################\n",
" ## TODO for students: calculate the covariance matrix\n",
" # Fill out function and remove\n",
" raise NotImplementedError(\"Student exercise: sort eigenvalues/eigenvectors!\")\n",
" #################################################\n",
"\n",
" # Calculate the sample covariance matrix\n",
" cov_matrix = ...\n",
"\n",
" # Calculate the eigenvalues and eigenvectors\n",
" evals, evectors = ...\n",
"\n",
" # Sort the eigenvalues in descending order\n",
" evals, evectors = ...\n",
"\n",
" # Project the data onto the new eigenvector basis\n",
" score = ...\n",
"\n",
" return score, evectors, evals\n",
"\n",
"\n",
"# Perform PCA on the data matrix X\n",
"score, evectors, evals = pca(X)\n",
"\n",
"# Plot the data projected into the new basis\n",
"plot_data_new_basis(score)\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove solution\n",
"def pca(X):\n",
" \"\"\"\n",
" Performs PCA on multivariate data.\n",
"\n",
" Args:\n",
" X (numpy array of floats) : Data matrix each column corresponds to a\n",
" different random variable\n",
"\n",
" Returns:\n",
" (numpy array of floats) : Data projected onto the new basis\n",
" (numpy array of floats) : Vector of eigenvalues\n",
" (numpy array of floats) : Corresponding matrix of eigenvectors\n",
"\n",
" \"\"\"\n",
"\n",
" # Calculate the sample covariance matrix\n",
" cov_matrix = get_sample_cov_matrix(X)\n",
"\n",
" # Calculate the eigenvalues and eigenvectors\n",
" evals, evectors = np.linalg.eigh(cov_matrix)\n",
"\n",
" # Sort the eigenvalues in descending order\n",
" evals, evectors = sort_evals_descending(evals, evectors)\n",
"\n",
" # Project the data onto the new eigenvector basis\n",
" score = change_of_basis(X, evectors)\n",
"\n",
" return score, evectors, evals\n",
"\n",
"\n",
"# Perform PCA on the data matrix X\n",
"score, evectors, evals = pca(X)\n",
"\n",
"# Plot the data projected into the new basis\n",
"with plt.xkcd():\n",
" plot_data_new_basis(score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_PCA_implementation_Exercise\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
" Finally, we will examine the eigenvalues of the covariance matrix. Remember that each eigenvalue describes the variance of the data projected onto its corresponding eigenvector. This is an important concept because it allows us to rank the PCA basis vectors based on how much variance each one can capture. First run the code below to plot the eigenvalues (sometimes called the \"scree plot\"). Which eigenvalue is larger?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"plot_eigenvalues(evals)"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Interactive Demo 2: Exploration of the correlation coefficient\n",
"\n",
"Run the following cell and use the slider to change the correlation coefficient in the data. You should see the scree plot and the plot of basis vectors updated.\n",
"\n",
"1. What happens to the eigenvalues as you change the correlation coefficient?\n",
"2. Can you find a value for which both eigenvalues are equal?\n",
"3. Can you find a value for which only one eigenvalue is nonzero?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Make sure you execute this cell to enable the widget!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @markdown Make sure you execute this cell to enable the widget!\n",
"\n",
"def refresh(corr_coef=.8):\n",
" cov_matrix = calculate_cov_matrix(variance_1, variance_2, corr_coef)\n",
" X = get_data(cov_matrix)\n",
" score, evectors, evals = pca(X)\n",
" plot_eigenvalues(evals)\n",
" plot_basis_vectors(X, evectors)\n",
"\n",
"\n",
"_ = widgets.interact(refresh, corr_coef=(-1, 1, .1))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {}
},
"outputs": [],
"source": [
"# to_remove explanation\n",
"\n",
"\"\"\"\n",
"1) As the absolute value of the correlation coefficient decreases, the two eigenvalues\n",
"become more similar (closer in value)\n",
"\n",
"2) The two eigenvalues are extremely similar (basically equal) with a correlation coefficient of 0.\n",
"\n",
"3) A correlation coefficient of 1 or -1 results in only one non-zero eigenvalues.\n",
"\"\"\";"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Exploration_of_the_correlation_coefficient_Interactive_Demo_and_Discussion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Summary\n",
"\n",
"*Estimated timing of tutorial: 45 minutes*\n",
"\n",
"- In this tutorial, we learned that the goal of PCA is to find an orthonormal basis capturing the directions of maximum variance of the data. More precisely, the $i$th basis vector is the direction that maximizes the projected variance, while being orthogonal to all previous basis vectors. Mathematically, these basis vectors are the eigenvectors of the covariance matrix (also called *loadings*).\n",
"- PCA also has the useful property that the projected data (*scores*) are uncorrelated.\n",
"- The projected variance along each basis vector is given by its corresponding eigenvalue. This is important because it allows us to rank the \"importance\" of each basis vector in terms of how much of the data variability it explains. An eigenvalue of zero means there is no variation along that direction so it can be dropped without losing any information about the original data.\n",
"- In the next tutorial, we will use this property to reduce the dimensionality of high-dimensional data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Notation\n",
"\n",
"\\begin{align}\n",
"\\mathbf{x_i} &\\quad \\text{all measurements of neuron } i\\\\\n",
"\\bar{\\bf x_i} &\\quad \\text{mean across samples for neuron } i \\\\\n",
"\\bf \\Sigma &\\quad \\text{covariance matrix}\\\\\n",
"\\bf \\hat \\Sigma &\\quad \\text{sample covariance matrix}\\\\\n",
"\\bf W &\\quad \\text{weights, loadings matrix}\\\\\n",
"{\\bf X} &\\quad \\text{original data matrix}\\\\\n",
"\\bf S &\\quad \\text{projected matrix, scores}\\\\\n",
"N &\\quad \\text{data dimensionality}\\\\\n",
"N_\\text{samples} &\\quad \\text{number of samples}\\\\\n",
"\\end{align}"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"---\n",
"# Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"## Bonus Section 1: Mathematical basis of PCA properties"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Video 2: Properties of PCA\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"# @title Video 2: Properties of PCA\n",
"from ipywidgets import widgets\n",
"from IPython.display import YouTubeVideo\n",
"from IPython.display import IFrame\n",
"from IPython.display import display\n",
"\n",
"\n",
"class PlayVideo(IFrame):\n",
" def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n",
" self.id = id\n",
" if source == 'Bilibili':\n",
" src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n",
" elif source == 'Osf':\n",
" src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n",
" super(PlayVideo, self).__init__(src, width, height, **kwargs)\n",
"\n",
"\n",
"def display_videos(video_ids, W=400, H=300, fs=1):\n",
" tab_contents = []\n",
" for i, video_id in enumerate(video_ids):\n",
" out = widgets.Output()\n",
" with out:\n",
" if video_ids[i][0] == 'Youtube':\n",
" video = YouTubeVideo(id=video_ids[i][1], width=W,\n",
" height=H, fs=fs, rel=0)\n",
" print(f'Video available at https://youtube.com/watch?v={video.id}')\n",
" else:\n",
" video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n",
" height=H, fs=fs, autoplay=False)\n",
" if video_ids[i][0] == 'Bilibili':\n",
" print(f'Video available at https://www.bilibili.com/video/{video.id}')\n",
" elif video_ids[i][0] == 'Osf':\n",
" print(f'Video available at https://osf.io/{video.id}')\n",
" display(video)\n",
" tab_contents.append(out)\n",
" return tab_contents\n",
"\n",
"\n",
"video_ids = [('Youtube', 'p56UrMRt6-U'), ('Bilibili', 'BV1xa4y1a77c')]\n",
"tab_contents = display_videos(video_ids, W=730, H=410)\n",
"tabs = widgets.Tab()\n",
"tabs.children = tab_contents\n",
"for i in range(len(tab_contents)):\n",
" tabs.set_title(i, video_ids[i][0])\n",
"display(tabs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Properties_of_PCA_Bonus_Video\")"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "W1D4_Tutorial2",
"provenance": [],
"toc_visible": true
},
"kernel": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
}
},
"nbformat": 4,
"nbformat_minor": 0
}