{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Principal component analysis (PCA)\n", "\n", "PCA is an unsupervised machine learning algorithm for reducing the dimension of your data, that is, the number of input features. It finds a small set of new features (the principal components), each a linear combination of the original features, that accounts for most of the variance in the data. This means that you can work with fewer features (smaller data) while keeping most of the information contained in the full set of input features.\n", "\n", "This can reduce your computing resource requirements, speed up computation, and make the data easier to visualize and interpret." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.decomposition import PCA\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "\n", "# Step 1: Load the digits dataset\n", "digits = load_digits()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up the figure\n", "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n", "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", "\n", "# plot the first 64 digits: each image is 8x8 pixels\n", "for i in range(64):\n", "    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n", "    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n", "\n", "    # label the image with the target value\n", "    ax.text(0, 7, str(digits.target[i]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the feature matrix and target vector\n", "X, y = digits.data, digits.target # X = images (flattened to 64 pixels), y = digit labels (targets)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These are all 8 x 8 images of the digits.\n", "# Visualize 64 features/pixels/dimensions of the first image.\n", "X[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "# Apply PCA to transform the feature matrix\n", "pca = PCA()\n", "X_pca = pca.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Plot cumulative explained variance\n", "plt.figure(figsize=(8,5))\n", "plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')\n", "plt.xlabel(\"Number of Principal Components\")\n", "plt.ylabel(\"Cumulative Explained Variance\")\n", "plt.title(\"Cumulative Explained Variance Plot\")\n", "plt.grid()\n", "plt.show()\n", "\n", "# Print max number of components\n", "print(f\"Maximum number of principal components: {X_pca.shape[1]}\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize PCA components 1 and 2\n", "plt.figure()\n", "colors = [\"navy\", \"turquoise\", \"darkorange\"]\n", "lw = 2\n", "\n", "for color, i, target_name in zip(colors, [0, 1, 2], y):\n", " plt.scatter(\n", " X_pca[y == i, 0], X_pca[y == i, 1], color=color, alpha=0.8, lw=lw, label=y\n", " )\n", "plt.legend(loc=\"best\", shadow=False, scatterpoints=1)\n", "plt.xlabel(\"Principal Component 1\")\n", "plt.ylabel(\"Principal Component 2\")\n", "plt.title(\"PCA of Digits dataset\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Split the data into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "# Apply PCA for Dimensionality Reduction\n", "# Fit on the training set only, then apply the same projection to the test set\n", "pca = PCA(n_components=20) # Reduce to 20 principal components\n", "X_train_pca = pca.fit_transform(X_train)\n", "X_test_pca = pca.transform(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "\n", "# Train Random Forest with PCA-transformed data\n", "rf_pca_model = RandomForestClassifier(n_estimators=100, random_state=42)\n", "rf_pca_model.fit(X_train_pca, y_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict on the PCA-transformed test set\n", "y_pred_pca = rf_pca_model.predict(X_test_pca)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay\n", "\n", "print(\"\\nClassification Report (With PCA):\")\n", "print(classification_report(y_test, y_pred_pca))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize the confusion matrix\n", "print(\"\\nConfusion Matrix (With PCA):\")\n", "cm = confusion_matrix(y_test, y_pred_pca, labels=rf_pca_model.classes_)\n", "\n", "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=digits.target_names)\n", "disp.plot(cmap=plt.cm.Blues)\n", "plt.title(\"Confusion Matrix\")\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignments" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assignment: use 1 PCA component. What accuracy do you get?\n", "# Accuracy = \n", "\n", "# Post your accuracy and an image of the confusion matrix in the chat." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assignment: use 2 PCA components. What accuracy do you get?\n", "# Accuracy = \n", "\n", "# Post your accuracy and an image of the confusion matrix in the chat." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assignment: use 5 PCA components. What accuracy do you get?\n", "# Accuracy = \n", "\n", "# Post your accuracy and an image of the confusion matrix in the chat." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 4 }