{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Unsupervised Machine Learning Exercise\n", "\n", "**Lecturer**: Ashish Mahabal
\n", "**Jupyter Notebook Author**: Umaa Rebbapragada, modified by Ashish Mahabal.\n", "\n", "This is a Jupyter notebook lesson extending the LSSTC Data Science Fellowship Program (https://ciera.northwestern.edu/programs/lsstc-data-science-fellowship-program/) Nov 2018 edition (session 7) and adapted for the NARIT-EACOA 2019 summer workshop.\n", "\n", "## Objective\n", "\n", "* Become familiar with the ZTF data, the examination of some of its features\n", "* Cluster a set of candidate sources from the Zwicky Transient Facility's (ZTF) image subtraction pipeline. All candidate features and postage stamps were extracted from ZTF's public alert stream. \n", "* Run sklearn's [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm on 2 or more features. \n", "\n", "## Key steps\n", "\n", "1. Load data\n", "2. Plot Features 'elong' and 'chipsf'\n", "3. Run KMeans on 2 Features\n", "4. Feature Scaling\n", "4. Evaluation Results Quantitatively\n", "5. Evaluate Results by Examining Postage Stamps\n", "6. Clustering in a Dimensionally-Reduced Space\n", "\n", "\n", "## Required dependencies\n", "\n", "Nominally, you should be able to install the python modules with `pip install `. The external astromatic packages are easiest installed using package managers (e.g., `rpm`, `apt-get`).\n", "\n", "### Python modules\n", "* python 3\n", "* astropy\n", "* numpy\n", "* matplotlib\n", "* scikit-learn\n", "\n", "### External packages\n", "None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0a. Imports\n", "\n", "These are all the imports that will be used in this notebook. All should be available in the DSFP conda environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import glob\n", "import os\n", "from time import time\n", "from matplotlib.pyplot import imshow\n", "from matplotlib.image import imread\n", "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n", "from sklearn import metrics\n", "from sklearn.metrics.pairwise import euclidean_distances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0b. Data Location\n", "\n", "You will need the following files in a subdir called data:\n", "- dsfp_ztf_meta.npy\n", "- dsfp_ztf_feats.npy\n", "- dsfp_ztf_png_stamps.tgz\n", "\n", "You will need to unzip and unpack this last file (a \"tarball\") called `dsfp_ztf_png_stamps.tar.gz`. Run the following commands in the data subdirectory of this notebook to unpack the png stamps:\n", "\n", " - tar -xzvf dsfp_ztf_png_stamps.tgz\n", " \n", "You should now have a subdirectory in your data directory called dsfp_ztf_png_stamps.\n", "\n", "Please specify the following file locations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "F_META = # complete\n", "F_FEATS = # complete\n", "D_STAMPS = # complete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Load Data\n", "\n", "We are ready to get started! :) Start by loading the data and confirming that feats has the same number of columns as COL_NAMES. Please note that the last columns is a class label with values {0, 1}, where 0=bogus, and 1=real. Today we are doing unsupervised learning, but some clustering evaluation methods use labels to quantitatively measure the quality of the clustering result." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "meta = np.load(F_META)\n", "feats = np.load(F_FEATS)\n", "\n", "COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',\n", " 'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',\n", " 'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',\n", " 'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',\n", " 'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',\n", " 'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']\n", " \n", "# INSTRUCTION: Verify that feats has the same number of columns as COL_NAMES\n", "#\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Plot Features\n", "\n", "We will perform K-means clustering using two features: 'chipsf' and 'elong'. Chipsf is the uncertainty associated with performing PSF-fit photometry. The higher the chi values, the more uncertainty associated with the source's PSF fit. Elong is a measure of how elongated the source is. A transient point source should have a spherical point spread function. An elongated point source may be a sign of a problem with image subtraction.\n", "\n", "Extract features chipsf and along from the data. Scatter plot them together, and also plot their histograms. \n", "\n", "#### Question: What do you notice about these features?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "featnames_to_select = ['chipsf', 'elong']\n", "\n", "# Extract the Correct Features\n", "# \n", "featidxs_to_select_indices = [ COL_NAMES.index(x) for x in featnames_to_select]\n", "feats_selected = feats[:,featidxs_to_select_indices]\n", "\n", "# Scatter Plot the Two Features\n", "#\n", "def plot_scatter(dat, xlabel, ylabel, xscale='linear', yscale='linear'):\n", " plt.plot(dat[:,0], dat[:,1], 'k.')\n", " plt.xlabel(xlabel)\n", " plt.ylabel(ylabel)\n", " plt.xscale(xscale)\n", " plt.yscale(yscale)\n", " plt.show()\n", " \n", "# Scatter Plot the Two Features\n", "#\n", "def plot_histogram(dat, bins, title, xscale='linear', yscale='linear'):\n", " plt.hist(dat, bins)\n", " plt.xscale(xscale)\n", " plt.yscale(yscale)\n", " plt.title(title)\n", " plt.show()\n", "\n", "# INSTRUCTION: Scatter Plot the Data\n", "# \n", "\n", "# INSTRUCTION: Plot the Histograms for both features. Hint, it may be helpful to plot some features on a log scale.\n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. KMeans Using Two Features\n", "\n", "We rarely ever cluster only two features from a dataset. However, the advantage of doing so is that we can readily visualize two-dimensional data. Let's start off by clustering features elong and chipsf with KMeans. The plotKMeans function below implements a visualization of KMean's partitioning that was used in sklearn's [KMean's demo](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html). \n", "\n", "#### Question: What do you think about the quality of the clusterings produced?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def runKMeans(dat, n_clusters=2, seed=0):\n", " return KMeans(n_clusters, random_state=seed).fit(dat) \n", "\n", "def plotKMeans(kmeans_res, reduced_dat, xlabel, ylabel, xscale='linear', yscale='linear'):\n", " \n", " # Plot the decision boundary. 
{ "cell_type": "markdown", "metadata": {}, "source": [
"## 3. KMeans Using Two Features\n",
"\n",
"We rarely ever cluster only two features from a dataset. However, the advantage of doing so is that we can readily visualize two-dimensional data. Let's start off by clustering features elong and chipsf with KMeans. The plotKMeans function below implements a visualization of KMeans' partitioning that was used in sklearn's [KMeans demo](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html).\n",
"\n",
"#### Question: What do you think about the quality of the clusterings produced?"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"\n",
"def runKMeans(dat, n_clusters=2, seed=0):\n",
"    return KMeans(n_clusters, random_state=seed).fit(dat)\n",
"\n",
"def plotKMeans(kmeans_res, reduced_dat, xlabel, ylabel, xscale='linear', yscale='linear'):\n",
"\n",
"    # Plot the decision boundary. For that, we will assign a color to each\n",
"    h = .02  # point in the mesh [x_min, x_max]x[y_min, y_max].\n",
"    x_min, x_max = reduced_dat[:, 0].min() - 1, reduced_dat[:, 0].max() + 1\n",
"    y_min, y_max = reduced_dat[:, 1].min() - 1, reduced_dat[:, 1].max() + 1\n",
"    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n",
"    # Obtain labels for each point in mesh. Use last trained model.\n",
"    Z = kmeans_res.predict(np.c_[xx.ravel(), yy.ravel()])\n",
"\n",
"    # Put the result into a color plot\n",
"    Z = Z.reshape(xx.shape)\n",
"    plt.imshow(Z, interpolation='nearest',\n",
"               extent=(xx.min(), xx.max(), yy.min(), yy.max()),\n",
"               cmap=plt.cm.Paired,\n",
"               aspect='auto', origin='lower')\n",
"    plt.plot(reduced_dat[:,0], reduced_dat[:,1], 'k.')\n",
"    plt.scatter(kmeans_res.cluster_centers_[:, 0], kmeans_res.cluster_centers_[:, 1],\n",
"                marker='x', s=169, linewidths=3,\n",
"                color='w', zorder=10)\n",
"    plt.xlabel(xlabel)\n",
"    plt.ylabel(ylabel)\n",
"    plt.xscale(xscale)\n",
"    plt.yscale(yscale)\n",
"    plt.show()\n",
"\n",
"# INSTRUCTION: Use the runKMeans and plotKMeans functions to cluster the data (feats_selected)\n",
"# with several values of k.\n"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"## 4. Feature Scaling\n",
"\n",
"We just discovered that distance metrics can be sensitive to the scale of your data (e.g., some features span large numeric ranges, but others don't). For machine learning methods that calculate similarity between feature vectors, it is important to normalize data within a standard range such as (0, 1) or with z-score normalization (scaling to zero mean and unit variance). Fortunately, sklearn also makes this quite easy. Please review sklearn's [preprocessing](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module options, specifically StandardScaler (which corresponds to z-score normalization) and MinMaxScaler, and implement one.\n",
"\n",
"After your data has been scaled, scatter plot your rescaled features and run KMeans on the transformed data. Compare the results on the transformed data with those above."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# INSTRUCTION: Re-scale your data using either the MinMaxScaler or StandardScaler from sklearn\n",
"#\n",
"\n",
"# INSTRUCTION: Scatter plot your rescaled data\n",
"#\n",
"\n",
"# INSTRUCTION: Retry KMeans with the same values of k used above.\n",
"#"
] },
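{ "cell_type": "markdown", "metadata": {}, "source": [
"*Example:* a minimal sketch of the scaling instructions above. MinMaxScaler is used here, but StandardScaler (z-score normalization) is an equally valid choice. The variable name `feats_selected_scaled` matches what `show_cluster_stamps` in Section 6 expects; the values of k are arbitrary choices."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# One possible way to complete the instructions above (a sketch).\n",
"# Rescale the two selected features to the range (0, 1).\n",
"scaler = MinMaxScaler()\n",
"feats_selected_scaled = scaler.fit_transform(feats_selected)\n",
"\n",
"# Scatter plot the rescaled features.\n",
"plot_scatter(feats_selected_scaled, xlabel='chipsf (scaled)', ylabel='elong (scaled)')\n",
"\n",
"# Retry KMeans on the scaled data for a few values of k and visualize the partitions.\n",
"for k in (2, 3, 5):\n",
"    kmeans_scaled = runKMeans(feats_selected_scaled, n_clusters=k)\n",
"    plotKMeans(kmeans_scaled, feats_selected_scaled, xlabel='chipsf (scaled)', ylabel='elong (scaled)')"
] },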
{ "cell_type": "markdown", "metadata": {}, "source": [
"## 5. Quantitative Cluster Evaluation\n",
"\n",
"So far, we've been visually verifying our clusters. Let's use quantitative methods to verify our results.\n",
"\n",
"The following scores do not require labels:\n",
"- inertia: \"Sum of squared distances of samples to their closest cluster center.\"\n",
"- silhouette coefficient: compares each sample's mean intra-cluster distance to its distance to the next nearest cluster. The score is higher as clusters become more compact and well-separated.\n",
"\n",
"The following scores do require labels, and are documented [here](http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation).\n",
"\n",
"- ARI and AMI measure the agreement between ground-truth labels and predicted cluster assignments: ARI in terms of pair-counting similarity, and AMI in terms of mutual information. Random assignments score close to 0, correct assignments close to 1.\n",
"- homogeneity: purity of the cluster (did all cluster members have the same label?). Scores in [0,1] where 0 is bad.\n",
"- completeness: did all members of a given label end up in a single cluster? Scores in [0,1] where 0 is bad.\n"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"sample_size = 300\n",
"\n",
"def bench_k_means(estimator, name, data, labels):\n",
"    t0 = time()\n",
"    estimator.fit(data)\n",
"    print('%-9s\\t%.2fs\\t%i\\t%.3f\\t%.3f\\t%.3f\\t%.3f\\t%.3f\\t%.3f'\n",
"          % (name, (time() - t0), estimator.inertia_,\n",
"             metrics.homogeneity_score(labels, estimator.labels_),\n",
"             metrics.completeness_score(labels, estimator.labels_),\n",
"             metrics.v_measure_score(labels, estimator.labels_),\n",
"             metrics.adjusted_rand_score(labels, estimator.labels_),\n",
"             metrics.adjusted_mutual_info_score(labels, estimator.labels_),\n",
"             metrics.silhouette_score(data, estimator.labels_,\n",
"                                      metric='euclidean',\n",
"                                      sample_size=sample_size)))\n",
"\n",
"labels = feats[:,-1]\n",
"print(82 * '_')\n",
"print('init\\t\\ttime\\tinertia\\thomo\\tcompl\\tv-meas\\tARI\\tAMI\\tsilhouette')\n",
"\n",
"# INSTRUCTIONS: Use the bench_k_means method to compare your clustering results\n",
"#\n"
] },
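{ "cell_type": "markdown", "metadata": {}, "source": [
"*Example:* one possible way to use `bench_k_means` to compare clusterings of the raw and scaled two-feature data. It assumes `feats_selected` from Section 2 and `feats_selected_scaled` from the Section 4 sketch; the values of k and the row names are arbitrary."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# One possible way to complete the instruction above (a sketch).\n",
"# Benchmark KMeans on the raw and scaled two-feature data for a few values of k.\n",
"for k in (2, 3, 5):\n",
"    bench_k_means(KMeans(n_clusters=k, random_state=0),\n",
"                  name='raw k=%d' % k, data=feats_selected, labels=labels)\n",
"    bench_k_means(KMeans(n_clusters=k, random_state=0),\n",
"                  name='scaled k=%d' % k, data=feats_selected_scaled, labels=labels)"
] },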
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def display_stamps(candids, fig_title):\n", " \n", " # display five across\n", " num_per_row = 5\n", " \n", " for i, candid in enumerate(candids):\n", " f_stamp = glob.glob(os.path.join(D_STAMPS, 'candid{}*.png'.format(candid)))[0] # there should only be one file returned!\n", " if (i % num_per_row) == 0:\n", " fig = plt.figure(figsize=(18, 3))\n", " fig.suptitle(fig_title) \n", "\n", " ax = fig.add_subplot(1, num_per_row, i%num_per_row + 1)\n", " ax.set_axis_off()\n", " ax.set_title(candid)\n", " stamp = imread(f_stamp)\n", " imshow(stamp)\n", " return\n", "\n", "def closest_to_centroid(centroid, cluster_feats, cluster_candids):\n", " \n", " dists = euclidean_distances(cluster_feats, centroid.reshape(1, -1))[:,0]\n", " closest_indices = np.argsort(dists)[:10]\n", " return cluster_candids[closest_indices]\n", "\n", "def show_cluster_stamps(kmeans_res, displayMode='closest', num_to_display=10):\n", " # spits out a random selection of stamps from each cluster\n", " \n", " \n", " for i in range(kmeans_res.n_clusters):\n", " centroid = kmeans_res.cluster_centers_[i, :]\n", " mask = kmeans_res.labels_ == i\n", " cluster_candids = meta[mask]['candid']\n", " cluster_feats = feats_selected_scaled[mask]\n", " if displayMode == 'near_centroid':\n", " selected_candids = closest_to_centroid(centroid, cluster_feats, cluster_candids)\n", " if displayMode == 'random':\n", " np.random.shuffle(cluster_candids)\n", " selected_candids = cluster_candids[:num_to_display]\n", " display_stamps(selected_candids, 'Cluster {}'.format(i))\n", "\n", "# INSTRUCTION: Use the show_cluster_stamps method to display cutouts associated with each cluster.\n", "# Do you see similar objects in each cluster?\n", "#\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Clustering in a Dimensionally-Reduced Space\n", "\n", "Given the tools seen above, starting clustering more than 2 features at a time. This work is free-form. I'll start you off with some suggested features. After plotting the feature distributions, you may choose to down-select further.\n", "\n", "Because we're now working with more than 2 features, use PCA to project the feature space onto its first two principal components. You may use the methods above to run KMeans in that reduced feature space and evaluate your results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "featnames_to_select = ['chipsf', 'elong', 'diffmaglim', 'magpsf', 'sigmapsf', \n", " 'chipsf', 'magap', 'sigmagap', 'sky', 'magdiff', 'fwhm', \n", " 'mindtoedge', 'magfromlim', 'seeratio', 'aimage', 'bimage',\n", " 'aimagerat', 'bimagerat', 'elong', 'nneg', 'nbad', 'sumrat', 'magapbig', 'sigmagapbig']\n", "\n", "# INSTRUCTION: Visualize these features. Discard any you consider to be problematic.\n", "\n", "# INSTRUCTION: Filter the feature space\n", "\n", "# INSTRUCTION: Run PCA on this feature space to reduce it to 2 principal components\n", "\n", "# INSTRUCTION: Run KMeans on this 2-dimensional PCA space, and evaluate your results both quantatively and qualitatively." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }