{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supervised Classification Module (exercise)\n",
"\n",
"**Lecturer:** Ashish Mahabal
\n",
"**Jupyter Notebook Authors:** Ashish Mahabal & Yuhan Yao\n",
"\n",
"This is a Jupyter notebook lesson extending the GROWTH Summer School 2019 and adapted for the NARIT-EACOA 2019 summer workshop.\n",
"\n",
"## Objective\n",
"Classify different classes using (a) decision trees and (b) random forest \n",
"\n",
"## Key steps\n",
"- Pick variable types\n",
"- Select training sample\n",
"- Select method\n",
"- Look at confusion matrix and details \n",
"\n",
"## Required dependencies\n",
"\n",
"See GROWTH school webpage for detailed instructions on how to install these modules and packages. Nominally, you should be able to install the python modules with `pip install `. The external astromatic packages are easiest installed using package managers (e.g., `rpm`, `apt-get`).\n",
"\n",
"### Python modules\n",
"* python 3\n",
"* astropy\n",
"* numpy\n",
"* astroquery\n",
"* pandas\n",
"* matplotlib\n",
"* pydotplus\n",
"* IPython.display\n",
"* sklearn\n",
"\n",
"### External packages\n",
"None\n",
"\n",
"### Partial Credits\n",
"Pavlos Protopapas (LSSDS notebook)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now you will do a similar set of analyses on a different dataset.\n",
"### Here you will use the light curves file to derive features\n",
"### And then use the resulting file to run decision trees and random forest on that for classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### import the required modules (exercise)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"#import ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### read the lightcurves file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"datadir = 'data'\n",
"lightcurves = datadir + '/CRTS_6varclasses.csv.gz'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" MJD | \n",
" Mag | \n",
" magerr | \n",
" RA | \n",
" Dec | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1109065026725 | \n",
" 53705.501925 | \n",
" 16.943797 | \n",
" 0.082004 | \n",
" 182.25871 | \n",
" 9.76580 | \n",
"
\n",
" \n",
" 1 | \n",
" 1109065026725 | \n",
" 53731.483314 | \n",
" 16.645102 | \n",
" 0.075203 | \n",
" 182.25867 | \n",
" 9.76585 | \n",
"
\n",
" \n",
" 2 | \n",
" 1109065026725 | \n",
" 53731.491406 | \n",
" 16.693791 | \n",
" 0.076497 | \n",
" 182.25870 | \n",
" 9.76574 | \n",
"
\n",
" \n",
" 3 | \n",
" 1109065026725 | \n",
" 53731.499465 | \n",
" 16.793651 | \n",
" 0.078755 | \n",
" 182.25869 | \n",
" 9.76576 | \n",
"
\n",
" \n",
" 4 | \n",
" 1109065026725 | \n",
" 53731.507529 | \n",
" 16.767817 | \n",
" 0.077436 | \n",
" 182.25878 | \n",
" 9.76581 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID MJD Mag magerr RA Dec\n",
"0 1109065026725 53705.501925 16.943797 0.082004 182.25871 9.76580\n",
"1 1109065026725 53731.483314 16.645102 0.075203 182.25867 9.76585\n",
"2 1109065026725 53731.491406 16.693791 0.076497 182.25870 9.76574\n",
"3 1109065026725 53731.499465 16.793651 0.078755 182.25869 9.76576\n",
"4 1109065026725 53731.507529 16.767817 0.077436 182.25878 9.76581"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lcs = pd.read_csv(lightcurves,\n",
" compression='gzip',\n",
" header=1,\n",
" sep=',',\n",
" skipinitialspace=True,\n",
" nrows=100000)\n",
" #skiprows=[4,5])\n",
" #,nrows=100000)\n",
"\n",
"lcs.columns = ['ID', 'MJD', 'Mag', 'magerr', 'RA', 'Dec']\n",
"lcs.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100000"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(lcs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We need classes, so load the catalog file too"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"catalog = datadir + '/CatalinaVars.tbl.gz'"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Catalina_Surveys_ID | \n",
" Numerical_ID | \n",
" RA_J2000 | \n",
" Dec | \n",
" V_mag | \n",
" Period_days | \n",
" Amplitude | \n",
" Number_Obs | \n",
" Var_Type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" CSS_J000020.4+103118 | \n",
" 1109001041232 | \n",
" 00:00:20.41 | \n",
" +10:31:18.9 | \n",
" 14.62 | \n",
" 1.491758 | \n",
" 2.39 | \n",
" 223 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" CSS_J000031.5-084652 | \n",
" 1009001044997 | \n",
" 00:00:31.50 | \n",
" -08:46:52.3 | \n",
" 14.14 | \n",
" 0.404185 | \n",
" 0.12 | \n",
" 163 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" CSS_J000036.9+412805 | \n",
" 1140001063366 | \n",
" 00:00:36.94 | \n",
" +41:28:05.7 | \n",
" 17.39 | \n",
" 0.274627 | \n",
" 0.73 | \n",
" 158 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" CSS_J000037.5+390308 | \n",
" 1138001069849 | \n",
" 00:00:37.55 | \n",
" +39:03:08.1 | \n",
" 17.74 | \n",
" 0.30691 | \n",
" 0.23 | \n",
" 219 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" CSS_J000103.3+105724 | \n",
" 1109001050739 | \n",
" 00:01:03.37 | \n",
" +10:57:24.4 | \n",
" 15.25 | \n",
" 1.5837582 | \n",
" 0.11 | \n",
" 223 | \n",
" 8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Catalina_Surveys_ID Numerical_ID RA_J2000 Dec V_mag \\\n",
"0 CSS_J000020.4+103118 1109001041232 00:00:20.41 +10:31:18.9 14.62 \n",
"1 CSS_J000031.5-084652 1009001044997 00:00:31.50 -08:46:52.3 14.14 \n",
"2 CSS_J000036.9+412805 1140001063366 00:00:36.94 +41:28:05.7 17.39 \n",
"3 CSS_J000037.5+390308 1138001069849 00:00:37.55 +39:03:08.1 17.74 \n",
"4 CSS_J000103.3+105724 1109001050739 00:01:03.37 +10:57:24.4 15.25 \n",
"\n",
" Period_days Amplitude Number_Obs Var_Type \n",
"0 1.491758 2.39 223 2 \n",
"1 0.404185 0.12 163 1 \n",
"2 0.274627 0.73 158 1 \n",
"3 0.30691 0.23 219 1 \n",
"4 1.5837582 0.11 223 8 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat = pd.read_csv(catalog,\n",
" compression='gzip',\n",
" header=5,\n",
" sep=' ',\n",
" skipinitialspace=True,\n",
" )\n",
"\n",
"columns = cat.columns[1:]\n",
"cat = cat[cat.columns[:-1]]\n",
"cat.columns = columns\n",
"\n",
"cat.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"RRd = cat[ cat['Var_Type'].isin([6]) & (cat['Number_Obs']>100) ]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Catalina_Surveys_ID | \n",
" Numerical_ID | \n",
" RA_J2000 | \n",
" Dec | \n",
" V_mag | \n",
" Period_days | \n",
" Amplitude | \n",
" Number_Obs | \n",
" Var_Type | \n",
"
\n",
" \n",
" \n",
" \n",
" 115 | \n",
" CSS_J001420.8+031214 | \n",
" 1104002007409 | \n",
" 00:14:20.84 | \n",
" +03:12:14.0 | \n",
" 17.45 | \n",
" 0.3871100 | \n",
" 0.56 | \n",
" 174 | \n",
" 6 | \n",
"
\n",
" \n",
" 198 | \n",
" CSS_J001724.9+200542 | \n",
" 1121002007726 | \n",
" 00:17:24.90 | \n",
" +20:05:42.2 | \n",
" 16.64 | \n",
" 0.3571291 | \n",
" 0.39 | \n",
" 224 | \n",
" 6 | \n",
"
\n",
" \n",
" 214 | \n",
" CSS_J001812.9+210201 | \n",
" 1121002027610 | \n",
" 00:18:12.97 | \n",
" +21:02:01.5 | \n",
" 14.54 | \n",
" 0.41616 | \n",
" 0.34 | \n",
" 224 | \n",
" 6 | \n",
"
\n",
" \n",
" 531 | \n",
" CSS_J003001.7+094947 | \n",
" 1109003028079 | \n",
" 00:30:01.71 | \n",
" +09:49:47.6 | \n",
" 16.91 | \n",
" 0.3729404 | \n",
" 0.36 | \n",
" 212 | \n",
" 6 | \n",
"
\n",
" \n",
" 640 | \n",
" CSS_J003359.4+022609 | \n",
" 1101004049971 | \n",
" 00:33:59.48 | \n",
" +02:26:09.0 | \n",
" 15.87 | \n",
" 0.3601025 | \n",
" 0.27 | \n",
" 195 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Catalina_Surveys_ID Numerical_ID RA_J2000 Dec V_mag \\\n",
"115 CSS_J001420.8+031214 1104002007409 00:14:20.84 +03:12:14.0 17.45 \n",
"198 CSS_J001724.9+200542 1121002007726 00:17:24.90 +20:05:42.2 16.64 \n",
"214 CSS_J001812.9+210201 1121002027610 00:18:12.97 +21:02:01.5 14.54 \n",
"531 CSS_J003001.7+094947 1109003028079 00:30:01.71 +09:49:47.6 16.91 \n",
"640 CSS_J003359.4+022609 1101004049971 00:33:59.48 +02:26:09.0 15.87 \n",
"\n",
" Period_days Amplitude Number_Obs Var_Type \n",
"115 0.3871100 0.56 174 6 \n",
"198 0.3571291 0.39 224 6 \n",
"214 0.41616 0.34 224 6 \n",
"531 0.3729404 0.36 212 6 \n",
"640 0.3601025 0.27 195 6 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RRd.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get numerical ids of objects belonging to the RRd class - call them RRds"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"RRds = RRd['Numerical_ID']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"115 1104002007409\n",
"198 1121002007726\n",
"214 1121002027610\n",
"531 1109003028079\n",
"640 1101004049971\n",
"Name: Numerical_ID, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RRds.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let us extract some features from the mags (lets ignore the mag errors for now)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### For a given id you could do it as follows. isin() accepts a list so you could use the entire RRds there\n",
"#### you will lose the id infor if you do that in a single step, so you could break it up"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 16.943797\n",
"1 16.645102\n",
"2 16.693791\n",
"3 16.793651\n",
"4 16.767817\n",
"5 16.885437\n",
"6 16.845561\n",
"7 16.888531\n",
"8 16.941978\n",
"9 16.822148\n",
"10 16.847925\n",
"11 16.878077\n",
"12 16.879755\n",
"13 16.889322\n",
"14 16.943497\n",
"15 16.883803\n",
"16 16.900262\n",
"17 16.742775\n",
"18 16.877542\n",
"19 16.952343\n",
"20 16.906344\n",
"21 16.713005\n",
"22 16.842459\n",
"23 16.895991\n",
"24 16.738557\n",
"25 16.860171\n",
"26 16.881559\n",
"27 16.753773\n",
"28 16.930406\n",
"29 16.613862\n",
" ... \n",
"369 16.476747\n",
"370 16.497292\n",
"371 16.866625\n",
"372 16.930688\n",
"373 16.877139\n",
"374 16.933141\n",
"375 16.885875\n",
"376 16.729198\n",
"377 16.851957\n",
"378 16.861757\n",
"379 16.702795\n",
"380 16.797014\n",
"381 16.781394\n",
"382 16.806467\n",
"383 16.517148\n",
"384 16.556145\n",
"385 16.533153\n",
"386 16.561263\n",
"387 16.462612\n",
"388 16.483990\n",
"389 16.489717\n",
"390 16.485392\n",
"391 16.487658\n",
"392 16.490796\n",
"393 16.474579\n",
"394 16.461535\n",
"395 16.891826\n",
"396 16.824519\n",
"397 16.787664\n",
"398 16.816125\n",
"Name: Mag, Length: 399, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lcs[lcs['ID'].isin(['1109065026725'])]['Mag']"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1212 14.998325\n",
"1213 14.984769\n",
"1214 15.010683\n",
"1215 14.984963\n",
"1216 14.817003\n",
"1217 14.818442\n",
"1218 14.824113\n",
"1219 14.816897\n",
"1220 14.828202\n",
"1221 14.826791\n",
"1222 14.808851\n",
"1223 14.838217\n",
"1224 14.800337\n",
"1225 14.813361\n",
"1226 14.785231\n",
"1227 14.799678\n",
"1228 14.925248\n",
"1229 14.925225\n",
"1230 14.914077\n",
"1231 14.905855\n",
"1232 14.830322\n",
"1233 14.843997\n",
"1234 14.837703\n",
"1235 14.818362\n",
"1236 14.961875\n",
"1237 14.988142\n",
"1238 14.979043\n",
"1239 14.979344\n",
"1240 14.869105\n",
"1241 14.854159\n",
" ... \n",
"98414 15.953603\n",
"98415 15.881508\n",
"98416 16.260605\n",
"98417 16.210558\n",
"98418 16.280644\n",
"98419 16.316527\n",
"98420 16.273537\n",
"98421 16.271487\n",
"98422 16.217838\n",
"98423 16.203245\n",
"98424 15.981133\n",
"98425 15.970665\n",
"98426 16.038118\n",
"98427 16.001187\n",
"98428 16.236077\n",
"98429 16.322878\n",
"98430 16.288100\n",
"98431 16.317787\n",
"98432 16.290396\n",
"98433 16.325196\n",
"98434 16.305619\n",
"98435 16.260267\n",
"98436 16.192046\n",
"98437 16.181945\n",
"98438 16.175815\n",
"98439 16.241040\n",
"98440 16.298755\n",
"98441 16.282934\n",
"98442 16.270512\n",
"98443 16.301170\n",
"Name: Mag, Length: 6340, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lcs[lcs['ID'].isin(RRds)]['Mag']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Lets assign mags for '1109065026725' to mags (a dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"mags = {}\n",
"mags['1109065026725'] = lcs[lcs['ID'].isin(['1109065026725'])]['Mag']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let us get the mean of mags for this one particular object: '1109065026725'"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16.717012834586466"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(mags['1109065026725'].values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Assign it to another dictionary with the same key"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"means = {}\n",
"means['1109065026725'] = np.mean(mags['1109065026725'].values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get mean, median, skew, kurtosis for all ids in our light curves set"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16.729198"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.stats import skew, kurtosis\n",
"skew(mags['1109065026725'])\n",
"kurtosis(mags['1109065026725'])\n",
"np.median(mags['1109065026725'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define the dictionaries that you need"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "can't assign to function call (, line 2)",
"output_type": "error",
"traceback": [
"\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m2\u001b[0m\n\u001b[0;31m mean(id) = ...\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m can't assign to function call\n"
]
}
],
"source": [
"for id in lcs['ID']:\n",
" mean(id) = ...\n",
" median(id) = ...\n",
" s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now create a CSV file with the following columns\n",
"### ID, mean, median, skew, Kurtosis, Class"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Now run decision tree and random forest using these variables by picking a couple of classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}