{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "97e78e04",
   "metadata": {},
   "source": [
    "# Logistic Regression for Breast Cancer Classification\n",
    "\n",
    "In this notebook we will solve a binary classification problem using logistic regression.\n",
    "\n",
    "We will use the Python library `scikit-learn` to:\n",
    "- Import the breast cancer dataset from the `datasets` submodule\n",
    "- Split the dataset into training and testing subsets with the `train_test_split` function from the `model_selection` submodule\n",
    "- Standardize the input features with `StandardScaler` from the `preprocessing` submodule\n",
    "- Create a `LogisticRegression` object from the `linear_model` submodule and train it on the training set\n",
    "- Predict on the test set and evaluate the results with the `metrics` submodule\n",
    "\n",
    "We will also use `matplotlib` and `seaborn` to plot relevant model metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdbb7235",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn import metrics\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bed4f25",
   "metadata": {},
   "source": [
    "The breast cancer dataset is available in `scikit-learn`. Many machine learning libraries come with built-in datasets or expose an API with which you can download datasets to train and test your model.\n",
    "\n",
    "We set the input parameter `as_frame=True` in the `load_breast_cancer()` function to return the data as a pandas DataFrame. Most of the loaders in `sklearn.datasets` behave in a similar fashion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8244f2df",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the breast cancer dataset as a dataframe\n",
    "bc_dataset = load_breast_cancer(as_frame=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db6a6fb7",
   "metadata": {},
   "source": [
    "`bc_dataset` is a `Bunch`, a dictionary-like object whose entries can be accessed by key.\n",
    "\n",
    "To obtain the input features we use `bc_dataset[\"data\"]`.\n",
    "\n",
    "To obtain the output target we use `bc_dataset[\"target\"]`."
   ]
  },
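  {
   "cell_type": "markdown",
   "id": "a1c3e5f7",
   "metadata": {},
   "source": [
    "As a quick illustration of the key-based access described above, the optional cell below lists the entries of `bc_dataset` and prints the class names behind the integer labels. It is a minimal sketch that assumes `bc_dataset` was loaded with `as_frame=True` as in the earlier cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2d4f6a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the entries stored in the dataset object\n",
    "print(list(bc_dataset.keys()))\n",
    "\n",
    "# The position of each name is its integer label in the target\n",
    "print(bc_dataset[\"target_names\"])\n",
    "\n",
    "# Shape of the feature table: (number of samples, number of features)\n",
    "print(bc_dataset[\"data\"].shape)"
   ]
  },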
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7f6d92c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# X is a pandas DataFrame\n",
    "# The columns are the features\n",
    "X = bc_dataset[\"data\"]\n",
    "\n",
    "# y is a pandas Series with the target class labels (0 = malignant, 1 = benign)\n",
    "y = bc_dataset[\"target\"]\n",
    "\n",
    "# Explore these objects with the .head() method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7002c172",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bd8856a",
   "metadata": {},
   "outputs": [],
   "source": [
    "X.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8a477e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "y.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1056429b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using the train_test_split function we put 80% of the data into the X_train, y_train NumPy arrays\n",
    "# The remaining 20% becomes X_test and y_test\n",
    "# Fixing random_state makes the split reproducible\n",
    "X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.20, random_state=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6f28d1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a StandardScaler object\n",
    "sc = StandardScaler()\n",
    "\n",
    "# The StandardScaler standardizes features by removing the mean and scaling to unit variance\n",
    "# This prevents features with larger variances from dominating\n",
    "# We fit the scaler on the training inputs only, then apply the same transformation\n",
    "# to the test inputs so that no information from the test set leaks into preprocessing\n",
    "X_train = sc.fit_transform(X_train)\n",
    "X_test = sc.transform(X_test)"
   ]
  },
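  {
   "cell_type": "markdown",
   "id": "c3e5a7b9",
   "metadata": {},
   "source": [
    "One way to keep the scaling step tied to the training data is to bundle it with the classifier. The optional cell below is a sketch of that idea using `make_pipeline` from `sklearn.pipeline`, which is not otherwise used in this notebook. The names `X_tr`, `X_te`, `y_tr`, `y_te` and `pipe` are introduced here for illustration only, and the split reuses the same `test_size` and `random_state` as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4f6b8c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# Re-split the unscaled data so this cell does not depend on the scaled arrays above\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=10)\n",
    "\n",
    "# fit() learns the scaling statistics from the training inputs only;\n",
    "# score() reuses those statistics when transforming the test inputs\n",
    "pipe = make_pipeline(StandardScaler(), LogisticRegression())\n",
    "pipe.fit(X_tr, y_tr)\n",
    "print(\"Pipeline test accuracy: \", pipe.score(X_te, y_te))"
   ]
  },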
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cbd60be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create our logistic regression object\n",
    "logistic_regr = LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18139f78",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "logistic_regr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d927eaa9",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = logistic_regr.predict(X_test)\n",
    "print(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fabee930",
   "metadata": {},
   "outputs": [],
   "source": [
    "score = logistic_regr.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "627c1ef8",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The accuracy of the model is: \", score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "279dd49f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using the metrics submodule we can compute the confusion matrix\n",
    "# Rows are the actual classes and columns are the predicted classes\n",
    "cm = metrics.confusion_matrix(y_test, predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d026048",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using matplotlib and seaborn we can display a heatmap of the confusion matrix\n",
    "plt.figure(figsize=(9, 9))\n",
    "# fmt=\"d\" prints the cell counts as integers\n",
    "sns.heatmap(cm, annot=True, fmt=\"d\", linewidths=.5, square=True, cmap='Blues_r')\n",
    "plt.ylabel('Actual label')\n",
    "plt.xlabel('Predicted label')\n",
    "all_sample_title = 'Accuracy Score: {0:.3f}'.format(score)\n",
    "plt.title(all_sample_title, size=15);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a6fd8a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Precision = True positives / (True positives + False positives)\n",
    "# Measures the quality of a positive prediction\n",
    "# Answers what proportion of positive identifications were actually correct\n",
    "# High precision means few false positives\n",
    "# By default the positive class is label 1 (benign in this dataset)\n",
    "precision = metrics.precision_score(y_test, predictions)\n",
    "print(\"Precision score: \", precision)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffa8b0a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recall = True positives / (True positives + False negatives)\n",
    "# Answers what proportion of the actual positives were identified correctly\n",
    "# High recall means few false negatives\n",
    "recall = metrics.recall_score(y_test, predictions)\n",
    "print(\"Recall score: \", recall)"
   ]
  },
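  {
   "cell_type": "markdown",
   "id": "e5a7c9d1",
   "metadata": {},
   "source": [
    "The precision and recall computed above can also be viewed per class. The optional cell below is a sketch using `metrics.classification_report`; it assumes `y_test`, `predictions` and `bc_dataset` from the earlier cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b8d0e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Per-class precision, recall and F1 score, labelled with the dataset's class names\n",
    "report = metrics.classification_report(y_test, predictions, target_names=bc_dataset[\"target_names\"])\n",
    "print(report)"
   ]
  },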
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb1f7f93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ROC curve: true positive rate against false positive rate as the decision threshold varies\n",
    "# fpr = FP / (FP + TN)\n",
    "# tpr = TP / (TP + FN)\n",
    "# predict_proba returns one column per class; column 1 is the probability of class 1\n",
    "lr_probs = logistic_regr.predict_proba(X_test)\n",
    "lr_preds = lr_probs[:, 1]\n",
    "fpr, tpr, threshold = metrics.roc_curve(y_test, lr_preds)\n",
    "print(threshold)\n",
    "\n",
    "plt.title('Receiver Operating Characteristic')\n",
    "plt.plot(fpr, tpr, 'b')\n",
    "plt.plot([0, 1], [0, 1],'r--')\n",
    "plt.xlim([-0.01, 1])\n",
    "plt.ylim([0, 1.01])\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}