{ "cells": [ { "cell_type": "markdown", "id": "97e78e04", "metadata": {}, "source": [ "# Logistic Regression for Breast Cancer Classification\n", "\n", "In this notebook we will see how to solve a binary classification problem using logistic regression.\n", "\n", "We will:\n", "- Use the Python library `scikit-learn`\n", "- Import the breast cancer dataset from the `datasets` submodule\n", "- Use the `train_test_split` function from the `model_selection` submodule to split the dataset into training and testing subsets\n", "- Standardize the input features with `StandardScaler` from the `preprocessing` submodule\n", "- Create a `LogisticRegression` object from the `linear_model` submodule and train it on the training set\n", "- Predict on the test set and evaluate the results using the `metrics` submodule\n", "- Use `seaborn` and `matplotlib` to plot relevant model metrics" ] }, { "cell_type": "code", "execution_count": null, "id": "fdbb7235", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from sklearn import metrics\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "id": "8bed4f25", "metadata": {}, "source": [ "The breast cancer dataset is available in `scikit-learn`. Many machine learning libraries come with built-in datasets or expose an API with which you can download datasets to train and test your model.\n", "\n", "We set the input parameter `as_frame=True` in the `load_breast_cancer()` function to return the data as Pandas objects (a dataframe for the features and a series for the target). Many of the other loaders in `sklearn.datasets` accept the same parameter." 
] }, { "cell_type": "code", "execution_count": null, "id": "8244f2df", "metadata": {}, "outputs": [], "source": [ "# Load the breast cancer dataset as a dataframe\n", "bc_dataset = load_breast_cancer(as_frame=True)" ] }, { "cell_type": "markdown", "id": "db6a6fb7", "metadata": {}, "source": [ "`bc_dataset` is a dictionary-like `Bunch` object.\n", "\n", "To obtain the input features we index it with `bc_dataset[\"data\"]`.\n", "\n", "To obtain the output target we index it with `bc_dataset[\"target\"]`." ] }, { "cell_type": "code", "execution_count": null, "id": "e7f6d92c", "metadata": {}, "outputs": [], "source": [ "# X is a Pandas dataframe\n", "# The columns are the features\n", "X = bc_dataset[\"data\"]\n", "\n", "# y is a Pandas series with the target class labels (0 - malignant, 1 - benign)\n", "y = bc_dataset[\"target\"]\n", "\n", "# Explore these objects with the .head() method" ] }, { "cell_type": "code", "execution_count": null, "id": "7002c172", "metadata": { "scrolled": false }, "outputs": [], "source": [ "X.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "0bd8856a", "metadata": {}, "outputs": [], "source": [ "X.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "e8a477e6", "metadata": {}, "outputs": [], "source": [ "y.head(20)" ] }, { "cell_type": "code", "execution_count": null, "id": "1056429b", "metadata": {}, "outputs": [], "source": [ "# Using the train_test_split function we put 80% of the data into the X_train, y_train numpy arrays\n", "# The remaining 20% is our X_test and y_test\n", "# random_state fixes the shuffle so the split is reproducible\n", "X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.20, random_state=10)" ] }, { "cell_type": "code", "execution_count": null, "id": "b6f28d1a", "metadata": {}, "outputs": [], "source": [ "# Create a StandardScaler object\n", "sc = StandardScaler()\n", "\n", "# The StandardScaler standardizes features by removing the mean and scaling to unit variance\n", "# This keeps features with larger variances from dominating\n", "# Fit the scaler on the training inputs only, then reuse it on the test inputs\n", "# (refitting on the test set would scale the two sets inconsistently)\n", "X_train = sc.fit_transform(X_train)\n", "X_test = sc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "1cbd60be", "metadata": {}, "outputs": [], "source": [ "# Create our logistic regression object\n", "logistic_regr = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": null, "id": "18139f78", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Train the model on the standardized training set\n", "logistic_regr.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "id": "d927eaa9", "metadata": {}, "outputs": [], "source": [ "# Predict the class label of every sample in the test set\n", "predictions = logistic_regr.predict(X_test)\n", "print(predictions)" ] }, { "cell_type": "code", "execution_count": null, "id": "fabee930", "metadata": {}, "outputs": [], "source": [ "# Mean accuracy on the test set\n", "score = logistic_regr.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "627c1ef8", "metadata": {}, "outputs": [], "source": [ "print(\"The accuracy of the model is: \", score)" ] }, { "cell_type": "code", "execution_count": null, "id": "279dd49f", "metadata": {}, "outputs": [], "source": [ "# Using the metrics submodule we can compute the confusion matrix\n", "# Rows are the actual classes, columns are the predicted classes\n", "cm = metrics.confusion_matrix(y_test, predictions)" ] }, { "cell_type": "code", "execution_count": null, "id": "0d026048", "metadata": {}, "outputs": [], "source": [ "# Using matplotlib and seaborn we can display a heatmap of the confusion matrix\n", "plt.figure(figsize=(9, 9))\n", "sns.heatmap(cm, annot=True, fmt=\"d\", linewidths=.5, square=True, cmap='Blues_r');\n", "plt.ylabel('Actual label');\n", "plt.xlabel('Predicted label');\n", "all_sample_title = 'Accuracy Score: {0:.3f}'.format(score)\n", "plt.title(all_sample_title, size=15);" ] }, { "cell_type": "code", "execution_count": null, "id": "1a6fd8a0", "metadata": {}, "outputs": [], "source": [ "# True positives / (True positives + False positives)\n", "# Quality of a positive prediction\n", "# Answers what proportion of positive 
identifications were actually correct.\n", "# High precision means few false positives\n", "# Note: precision_score treats label 1 (benign in this dataset) as the positive class by default\n", "precision = metrics.precision_score(y_test, predictions)\n", "print(\"Precision score: \", precision)" ] }, { "cell_type": "code", "execution_count": null, "id": "ffa8b0a8", "metadata": {}, "outputs": [], "source": [ "# True positives / (True positives + False negatives)\n", "# Answers what proportion of the actual positives was identified correctly.\n", "# High recall means few false negatives\n", "recall = metrics.recall_score(y_test, predictions)\n", "print(\"Recall score: \", recall)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb1f7f93", "metadata": {}, "outputs": [], "source": [ "# ROC curve\n", "# fpr = FP / (FP + TN)\n", "# tpr = TP / (TP + FN)\n", "# predict_proba returns one column per class; column 1 holds the probability of class 1\n", "lr_probs = logistic_regr.predict_proba(X_test)\n", "lr_preds = lr_probs[:, 1]\n", "fpr, tpr, threshold = metrics.roc_curve(y_test, lr_preds)\n", "print(threshold)\n", "\n", "plt.title('Receiver Operating Characteristic')\n", "plt.plot(fpr, tpr, 'b')\n", "# Dashed diagonal: the chance-level baseline\n", "plt.plot([0, 1], [0, 1], 'r--')\n", "plt.xlim([-0.01, 1])\n", "plt.ylim([0, 1.01])\n", "plt.ylabel('True Positive Rate')\n", "plt.xlabel('False Positive Rate')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }