{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {}, "id": "view-in-github" }, "source": [] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Tutorial 4: Nonlinear Dimensionality Reduction\n", "\n", "**Week 1, Day 4: Dimensionality Reduction**\n", "\n", "**By Neuromatch Academy**\n", "\n", "**Content creators:** Alex Cayco Gajic, John Murray\n", "\n", "**Content reviewers:** Roozbeh Farhoudi, Matt Krause, Spiros Chavlis, Richard Gao, Michael Waskom, Siddharth Suresh, Natalie Schaworonkow, Ella Batty\n", "\n", "**Production editors:** Spiros Chavlis" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Tutorial Objectives\n", "\n", "*Estimated timing of tutorial: 35 minutes*\n", "\n", "In this notebook we'll explore how dimensionality reduction can be useful for visualizing and inferring structure in your data. To do this, we will compare PCA with t-SNE, a nonlinear dimensionality reduction method.\n", "\n", "Overview:\n", "- Visualize MNIST in 2D using PCA.\n", "- Visualize MNIST in 2D using t-SNE." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @markdown\n", "from IPython.display import IFrame\n", "from ipywidgets import widgets\n", "out = widgets.Output()\n", "with out:\n", " print(f\"If you want to download the slides: https://osf.io/download/kaq2x/\")\n", " display(IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/kaq2x/?direct%26mode=render%26action=download%26mode=render\", width=730, height=410))\n", "display(out)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Setup\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install and import feedback gadget\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Install and import feedback gadget\n", "\n", "!pip3 install vibecheck datatops --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", " return DatatopsContentReviewContainer(\n", " \"\", # No text prompt\n", " notebook_section,\n", " {\n", " \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n", " \"name\": \"neuromatch_cn\",\n", " \"user_key\": \"y1x3mpx5\",\n", " },\n", " ).render()\n", "\n", "\n", "feedback_prefix = \"W1D4_T4\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# Imports\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Figure Settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Figure Settings\n", "import 
logging\n", "logging.getLogger('matplotlib.font_manager').disabled = True\n", "\n", "import ipywidgets as widgets # interactive display\n", "%config InlineBackend.figure_format = 'retina'\n", "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting Functions\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Plotting Functions\n", "\n", "def visualize_components(component1, component2, labels, show=True):\n", " \"\"\"\n", " Plots a 2D representation of the data for visualization with categories\n", " labelled as different colors.\n", "\n", " Args:\n", " component1 (numpy array of floats) : Vector of component 1 scores\n", " component2 (numpy array of floats) : Vector of component 2 scores\n", " labels (numpy array of floats) : Vector corresponding to categories of\n", " samples\n", "\n", " Returns:\n", " Nothing.\n", "\n", " \"\"\"\n", "\n", " plt.figure()\n", " plt.scatter(x=component1, y=component2, c=labels, cmap='tab10')\n", " plt.xlabel('Component 1')\n", " plt.ylabel('Component 2')\n", " plt.colorbar(ticks=range(10))\n", " plt.clim(-0.5, 9.5)\n", " if show:\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 0: Intro to applications" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video 1: PCA Applications\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @title Video 1: PCA Applications\n", "from ipywidgets import widgets\n", "from IPython.display import YouTubeVideo\n", "from IPython.display import IFrame\n", "from IPython.display import display\n", "\n", "\n", "class PlayVideo(IFrame):\n", " def __init__(self, id, source, 
page=1, width=400, height=300, **kwargs):\n", " self.id = id\n", " if source == 'Bilibili':\n", " src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n", " elif source == 'Osf':\n", " src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n", " super(PlayVideo, self).__init__(src, width, height, **kwargs)\n", "\n", "\n", "def display_videos(video_ids, W=400, H=300, fs=1):\n", " tab_contents = []\n", " for i, video_id in enumerate(video_ids):\n", " out = widgets.Output()\n", " with out:\n", " if video_ids[i][0] == 'Youtube':\n", " video = YouTubeVideo(id=video_ids[i][1], width=W,\n", " height=H, fs=fs, rel=0)\n", " print(f'Video available at https://youtube.com/watch?v={video.id}')\n", " else:\n", " video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n", " height=H, fs=fs, autoplay=False)\n", " if video_ids[i][0] == 'Bilibili':\n", " print(f'Video available at https://www.bilibili.com/video/{video.id}')\n", " elif video_ids[i][0] == 'Osf':\n", " print(f'Video available at https://osf.io/{video.id}')\n", " display(video)\n", " tab_contents.append(out)\n", " return tab_contents\n", "\n", "\n", "video_ids = [('Youtube', '2Zb93aOWioM'), ('Bilibili', 'BV1Jf4y1R7UZ')]\n", "tab_contents = display_videos(video_ids, W=730, H=410)\n", "tabs = widgets.Tab()\n", "tabs.children = tab_contents\n", "for i in range(len(tab_contents)):\n", " tabs.set_title(i, video_ids[i][0])\n", "display(tabs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_PCA_Applications_Video\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 1: Visualize MNIST in 2D using PCA\n", "\n", "In this exercise, we'll 
visualize the first few components of the MNIST dataset to look for evidence of structure in the data. In this tutorial, we will also use the label of each image (i.e., which numeral it is, from 0 to 9). Start by running the following cell to reload the MNIST dataset (this takes a few seconds)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n", "\n", "# Get images\n", "mnist = fetch_openml(name='mnist_784', as_frame=False, parser='auto')\n", "X_all = mnist.data\n", "\n", "# Get labels\n", "labels_all = np.array([int(k) for k in mnist.target])" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "**Note:** We saved the complete dataset as `X_all` and the labels as `labels_all`." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "To perform PCA, we will now use the implementation in sklearn. Run the following cell to initialize PCA and fit it to the complete dataset. We keep only the top 2 components because we will be visualizing the data in 2D." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "# Initialize PCA with two components\n", "pca_model = PCA(n_components=2)\n", "\n", "# Fit PCA to the complete dataset\n", "pca_model.fit(X_all)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Coding Exercise 1: Visualization of MNIST in 2D using PCA\n", "\n", "Fill in the code below to perform PCA and visualize the top two components. For better visualization, take only the first 2,000 samples of the data (this will also make t-SNE much faster in the following section of the tutorial, so don't skip this step!).\n", "\n", "**Suggestions:**\n", "- Truncate the data matrix at 2,000 samples. 
You will also need to truncate the array of labels.\n", "- Perform PCA on the truncated data by projecting it onto the fitted components with `pca_model.transform`.\n", "- Use the function `visualize_components` to plot the labeled data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "help(visualize_components)\n", "help(pca_model.transform)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "```python\n", "#################################################\n", "## TODO for students: take only 2,000 samples and perform PCA\n", "# Comment once you've completed the code\n", "raise NotImplementedError(\"Student exercise: perform PCA\")\n", "#################################################\n", "\n", "# Take only the first 2000 samples with the corresponding labels\n", "X, labels = ...\n", "\n", "# Project the data onto the top two principal components\n", "scores = pca_model.transform(X)\n", "\n", "# Plot the two component scores, colored by label\n", "visualize_components(...)\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove solution\n", "\n", "# Take only the first 2000 samples with the corresponding labels\n", "X, labels = X_all[:2000, :], labels_all[:2000]\n", "\n", "# Project the data onto the top two principal components\n", "scores = pca_model.transform(X)\n", "\n", "# Plot the two component scores, colored by label\n", "with plt.xkcd():\n", " visualize_components(scores[:, 0], scores[:, 1], labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Visualization_of_MNIST_in_2D_using_PCA_Exercise\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Think! 1: PCA Visualization\n", "\n", "1. What do you see? 
Are different samples corresponding to the same numeral clustered together? Is there much overlap?\n", "2. Do some pairs of numerals appear to be more distinguishable than others?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove explanation\n", "\"\"\"\n", "1) Images corresponding to the same label (number) are loosely clustered together\n", "in some cases, but there's a lot of overlap and no clear separation between\n", "all the number clusters.\n", "\n", "2) The zeros and ones seem fairly non-overlapping.\n", "\"\"\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_PCA_Visualization_Discussion\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Section 2: Visualize MNIST in 2D using t-SNE\n", "\n", "*Estimated timing to here from start of tutorial: 15 min*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video 2: Nonlinear Methods\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @title Video 2: Nonlinear Methods\n", "from ipywidgets import widgets\n", "from IPython.display import YouTubeVideo\n", "from IPython.display import IFrame\n", "from IPython.display import display\n", "\n", "\n", "class PlayVideo(IFrame):\n", " def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n", " self.id = id\n", " if source == 'Bilibili':\n", " src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n", " elif source == 'Osf':\n", " src = 
f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n", " super(PlayVideo, self).__init__(src, width, height, **kwargs)\n", "\n", "\n", "def display_videos(video_ids, W=400, H=300, fs=1):\n", " tab_contents = []\n", " for i, video_id in enumerate(video_ids):\n", " out = widgets.Output()\n", " with out:\n", " if video_ids[i][0] == 'Youtube':\n", " video = YouTubeVideo(id=video_ids[i][1], width=W,\n", " height=H, fs=fs, rel=0)\n", " print(f'Video available at https://youtube.com/watch?v={video.id}')\n", " else:\n", " video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n", " height=H, fs=fs, autoplay=False)\n", " if video_ids[i][0] == 'Bilibili':\n", " print(f'Video available at https://www.bilibili.com/video/{video.id}')\n", " elif video_ids[i][0] == 'Osf':\n", " print(f'Video available at https://osf.io/{video.id}')\n", " display(video)\n", " tab_contents.append(out)\n", " return tab_contents\n", "\n", "\n", "video_ids = [('Youtube', '5Xpb0YaN5Ms'), ('Bilibili', 'BV14Z4y1u7HG')]\n", "tab_contents = display_videos(video_ids, W=730, H=410)\n", "tabs = widgets.Tab()\n", "tabs.children = tab_contents\n", "for i in range(len(tab_contents)):\n", " tabs.set_title(i, video_ids[i][0])\n", "display(tabs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Nonlinear_methods_Video\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Next we will analyze the same data using t-SNE, a nonlinear dimensionality reduction method that is useful for visualizing high dimensional data in 2D or 3D. Run the cell below to get started." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "tsne_model = TSNE(n_components=2, perplexity=30, random_state=2020)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Coding Exercise 2.1: Apply t-SNE on MNIST\n", "\n", "First, we'll run t-SNE on the data to explore whether we can see more structure. The cell above defined the parameters that we will use to find our embedding (i.e., the low-dimensional representation of the data) and stored them in `tsne_model`. To run t-SNE on our data, use the method `tsne_model.fit_transform`.\n", "\n", "**Suggestions:**\n", "- Run t-SNE using the method `tsne_model.fit_transform`.\n", "- Plot the resulting embedding using `visualize_components`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "help(tsne_model.fit_transform)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "```python\n", "#################################################\n", "## TODO for students\n", "# Comment once you've completed the code\n", "raise NotImplementedError(\"Student exercise: perform t-SNE\")\n", "#################################################\n", "\n", "# Perform t-SNE\n", "embed = ...\n", "\n", "# Visualize the data\n", "visualize_components(..., ..., labels)\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove solution\n", "\n", "# Perform t-SNE\n", "embed = tsne_model.fit_transform(X)\n", "\n", "# Visualize the data\n", "with plt.xkcd():\n", " visualize_components(embed[:, 0], embed[:, 1], labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], 
"source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Apply_tSNE_on_MNIST_Exercise\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Coding Exercise 2.2: Run t-SNE with different perplexities\n", "\n", "Unlike PCA, t-SNE has a free parameter (the perplexity) that roughly determines how global vs. local information is weighted. Here we'll take a look at how the perplexity affects our interpretation of the results.\n", "\n", "**Steps:**\n", "- Rerun t-SNE (don't forget to re-initialize the model using `TSNE` as above) with a perplexity of 50, 5, and 2." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "```python\n", "def explore_perplexity(values, X, labels):\n", " \"\"\"\n", " Plots a 2D representation of the data for visualization with categories\n", " labeled as different colors using different perplexities.\n", "\n", " Args:\n", " values (list of floats) : list with perplexities to be visualized\n", " X (np.ndarray of floats) : matrix with the dataset\n", " labels (np.ndarray of int) : array with the labels\n", "\n", " Returns:\n", " Nothing.\n", "\n", " \"\"\"\n", " for perp in values:\n", "\n", " #################################################\n", " ## TO DO for students: Insert your code here to redefine the t-SNE model\n", " ## with the given perplexity, perform t-SNE on the data, and plot the\n", " ## results for perplexity = 50, 5, and 2 (set random_state to 2020)\n", " # Comment these lines when you complete the function\n", " raise NotImplementedError(\"Student Exercise! 
Explore t-SNE with different perplexity\")\n", " #################################################\n", "\n", " # Perform t-SNE\n", " tsne_model = ...\n", "\n", " embed = tsne_model.fit_transform(X)\n", " visualize_components(embed[:, 0], embed[:, 1], labels, show=False)\n", " plt.title(f\"perplexity: {perp}\")\n", "\n", "\n", "# Visualize\n", "values = [50, 5, 2]\n", "explore_perplexity(values, X, labels)\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove solution\n", "def explore_perplexity(values, X, labels):\n", " \"\"\"\n", " Plots a 2D representation of the data for visualization with categories\n", " labeled as different colors using different perplexities.\n", "\n", " Args:\n", " values (list of floats) : list with perplexities to be visualized\n", " X (np.ndarray of floats) : matrix with the dataset\n", " labels (np.ndarray of int) : array with the labels\n", "\n", " Returns:\n", " Nothing.\n", "\n", " \"\"\"\n", "\n", " for perp in values:\n", "\n", " # Perform t-SNE\n", " tsne_model = TSNE(n_components=2, perplexity=perp, random_state=2020)\n", "\n", " embed = tsne_model.fit_transform(X)\n", " visualize_components(embed[:, 0], embed[:, 1], labels, show=False)\n", " plt.title(f\"perplexity: {perp}\")\n", " plt.show()\n", "\n", "\n", "# Visualize\n", "values = [50, 5, 2]\n", "with plt.xkcd():\n", " explore_perplexity(values, X, labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Run_tSNE_with_different_perplexities_Exercise\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Think! 2: t-SNE Visualization\n", "\n", "1. 
With perplexity equal to 50, what changed compared to your previous results (perplexity of 30)? Do you see any clusters that have a different structure than before?\n", "2. What changed in the embedding structure for perplexity equal to 5 or 2?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_tSNE_Visualization_Discussion\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "---\n", "# Summary" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "*Estimated timing of tutorial: 35 minutes*\n", "\n", "* We learned the difference between linear and nonlinear dimensionality reduction. While nonlinear methods can be more powerful, they can also be sensitive to noise. In contrast, linear methods are useful for their simplicity and robustness.\n", "* We compared PCA and t-SNE for data visualization. Using t-SNE, we could visualize clusters in the data corresponding to different digits. While PCA was able to separate some clusters (e.g., 0 vs. 1), it separated the digits poorly overall.\n", "* However, the results of t-SNE can change depending on the choice of perplexity. 
To learn more, we recommend this Distill paper by [Wattenberg, _et al._, 2016](http://doi.org/10.23915/distill.00002).\n" ] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "name": "W1D4_Tutorial4", "provenance": [], "toc_visible": true }, "kernel": { "display_name": "Python 3", "language": "python", "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" } }, "nbformat": 4, "nbformat_minor": 0 }