CSC3105_Project/Project.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# CSC 3105 Project"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "cda961ffb493d00c"
  },
  {
   "cell_type": "code",
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:32:51.223079500Z",
     "start_time": "2024-02-25T05:32:51.218515100Z"
    }
   },
   "id": "bcd6cbaa5df10ce8",
   "execution_count": 20
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Load the data into a pandas dataframe\n",
    "\n",
    "The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n",
    "\n",
    "We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. "
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "bab890d7b05e347e"
  },
  {
   "cell_type": "code",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original data shape: (42000, 1031)\n",
      "Total number of missing values: 0\n",
      "Cleaned data shape: (42000, 1031)\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "\n",
    "def load_data(dataset_dir):\n",
    "    # Load the data\n",
    "    file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n",
    "    data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n",
    "    print(f\"Original data shape: {data.shape}\")\n",
    "    return data\n",
    "\n",
    "\n",
    "def clean_data(data):\n",
    "    # Calculate total number of missing values\n",
    "    total_missing = data.isnull().sum().sum()\n",
    "    print(f\"Total number of missing values: {total_missing}\")\n",
    "    \n",
    "    # Handle missing values\n",
    "    data = data.dropna()\n",
    "\n",
    "    # Remove duplicates\n",
    "    data = data.drop_duplicates()\n",
    "\n",
    "    # Convert data types\n",
    "    data['NLOS'] = data['NLOS'].astype(int)\n",
    "\n",
    "    # Outlier detection and removal\n",
    "    z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n",
    "    data = data[(z_scores < 3).any(axis=1)]\n",
    "\n",
    "    print(f\"Cleaned data shape: {data.shape}\")\n",
    "    return data\n",
    "\n",
    "\n",
    "# Use the functions\n",
    "data = load_data(DATASET_DIR)\n",
    "data = clean_data(data)\n",
    "\n",
    "# print(data.head())\n",
    "\n",
    "# Print Headers\n",
    "# print(data.columns)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:32:56.753045Z",
     "start_time": "2024-02-25T05:32:52.416237400Z"
    }
   },
   "id": "dd9657f5ec6d7754",
   "execution_count": 21
  },
  {
   "cell_type": "markdown",
   "source": [
    "The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n",
    "\n",
    "The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n",
    "\n",
    "The mathematical formulas used in this process are as follows:\n",
    "\n",
    "1. Calculate the mean (μ) of the data:\n",
    "\n",
    "$$\n",
    "\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
    "$$\n",
    "Where:\n",
    "- $n$ is the number of observations in the data\n",
    "- $x_i$ is the value of the $i$-th observation\n",
    "- $\\sum$ denotes the summation over all observations\n",
    "\n",
    "2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
    "\n",
    "$$\n",
    "\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
    "$$\n",
    "Where:\n",
    "- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n",
    "- $\\sigma$ is the standard deviation of the data\n",
    "- $x_i$ is the value of the $i$-th observation\n",
    "- $\\mu$ is the mean of the data\n",
    "\n",
    "The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n",
    "\n",
    "**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "2c13064e20601717"
  },
  {
   "cell_type": "code",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The number of principle components after PCA is 868\n"
     ]
    }
   ],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Standardize the data\n",
    "numerical_cols = data.select_dtypes(include=[np.number]).columns\n",
    "scaler = StandardScaler()\n",
    "data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n",
    "\n",
    "# Initialize PCA with the desired explained variance\n",
    "pca = PCA(0.95)\n",
    "\n",
    "# Fit PCA to your data\n",
    "pca.fit(data)\n",
    "\n",
    "# Get the number of components\n",
    "num_components = pca.n_components_\n",
    "\n",
    "print(f\"The number of principle components after PCA is {num_components}\")"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:33:01.393540600Z",
     "start_time": "2024-02-25T05:32:57.831719Z"
    }
   },
   "id": "7f9bec73a42f7bca",
   "execution_count": 22
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Perform Dimensionality Reduction with PCA\n",
    "\n",
    "We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "dc9f8c0e194dd07d"
  },
  {
   "cell_type": "code",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original number of components: 1031\n",
      "Number of components after PCA: 868\n",
      "PCA has successfully reduced the number of components.\n"
     ]
    }
   ],
   "source": [
    "# Project original data to PC with the highest eigenvalue\n",
    "data_pca = pca.transform(data)\n",
    "\n",
    "# Create a dataframe with the principal components\n",
    "data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
    "\n",
    "# Print the number of components in the original and PCA transformed data\n",
    "print(f\"Original number of components: {data.shape[1]}\")\n",
    "print(f\"Number of components after PCA: {num_components}\")\n",
    "\n",
    "# Compare the number of components in the original and PCA transformed data\n",
    "if data.shape[1] > num_components:\n",
    "    print(\"PCA has successfully reduced the number of components.\")\n",
    "elif data.shape[1] < num_components:\n",
    "    print(\"Unexpectedly, PCA has increased the number of components.\")\n",
    "else:\n",
    "    print(\"The number of components remains unchanged after PCA.\")"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:47:02.511321300Z",
     "start_time": "2024-02-25T05:47:01.989494800Z"
    }
   },
   "id": "96c62c50f8734a01",
   "execution_count": 29
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data Mining / Machine Learning\n",
    "\n",
    "### I. Supervised Learning\n",
    "- **Decision**: Supervised learning is used due to the labeled dataset.\n",
    "- **Algorithm**: Random Forest Classifier is preferred for its performance in classification tasks.\n",
    "\n",
    "### II. Training/Test Split Ratio\n",
    "- **Decision**: 70:30 split is chosen for training/test dataset.\n",
    "- **Reasoning**: This split ensures sufficient data for training and testing.\n",
    "\n",
    "### III. Performance Metrics\n",
    "- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n",
    "- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n",
    "- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n",
    "\n",
    "The Random Forest Classifier is trained on the training set and evaluated on the test set using accuracy and classification report metrics.\n"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "47d5cb383ce1f7ba"
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Split the data into training and testing sets\n",
    "\n",
    "The next step is to split the data into training and testing sets. This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n",
    "\n",
    "We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "576a6a92fc7fdbfd"
  },
  {
   "cell_type": "code",
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Split the data into training and test sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:33:06.047014800Z",
     "start_time": "2024-02-25T05:33:05.983534100Z"
    }
   },
   "id": "7db852fafd187d5a",
   "execution_count": 24
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Train a Random Forest Classifier\n",
    "\n",
    "The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n",
    "\n",
    "The random forest classifier is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n",
    "\n",
    "We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "5753cc6db18bac73"
  },
  {
   "cell_type": "code",
   "outputs": [
    {
     "data": {
      "text/plain": "RandomForestClassifier(random_state=42)",
      "text/html": "<style>#sk-container-id-1 {\n  /* Definition of color scheme common for light and dark mode */\n  --sklearn-color-text: black;\n  --sklearn-color-line: gray;\n  /* Definition of color scheme for unfitted estimators */\n  --sklearn-color-unfitted-level-0: #fff5e6;\n  --sklearn-color-unfitted-level-1: #f6e4d2;\n  --sklearn-color-unfitted-level-2: #ffe0b3;\n  --sklearn-color-unfitted-level-3: chocolate;\n  /* Definition of color scheme for fitted estimators */\n  --sklearn-color-fitted-level-0: #f0f8ff;\n  --sklearn-color-fitted-level-1: #d4ebff;\n  --sklearn-color-fitted-level-2: #b3dbfd;\n  --sklearn-color-fitted-level-3: cornflowerblue;\n\n  /* Specific color for light theme */\n  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n  --sklearn-color-icon: #696969;\n\n  @media (prefers-color-scheme: dark) {\n    /* Redefinition of color scheme for dark theme */\n    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n    --sklearn-color-icon: #878787;\n  }\n}\n\n#sk-container-id-1 {\n  color: var(--sklearn-color-text);\n}\n\n#sk-container-id-1 pre {\n  padding: 0;\n}\n\n#sk-container-id-1 input.sk-hidden--visually {\n  border: 0;\n  clip: rect(1px 1px 1px 1px);\n  clip: rect(1px, 1px, 1px, 1px);\n  height: 1px;\n  margin: -1px;\n  overflow: hidden;\n  padding: 0;\n  position: absolute;\n  width: 1px;\n}\n\n#sk-container-id-1 div.sk-dashed-wrapped {\n  border: 1px dashed var(--sklearn-color-line);\n  margin: 0 0.4em 0.5em 0.4em;\n  box-sizing: border-box;\n  padding-bottom: 0.4em;\n  background-color: var(--sklearn-color-background);\n}\n\n#sk-container-id-1 div.sk-container {\n  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n     but bootstrap.min.css set `[hidden] { display: none !important; }`\n     so we also need the `!important` here to be able to override the\n     default hidden behavior on the sphinx rendered scikit-learn.org.\n     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n  display: inline-block !important;\n  position: relative;\n}\n\n#sk-container-id-1 div.sk-text-repr-fallback {\n  display: none;\n}\n\ndiv.sk-parallel-item,\ndiv.sk-serial,\ndiv.sk-item {\n  /* draw centered vertical line to link estimators */\n  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n  background-size: 2px 100%;\n  background-repeat: no-repeat;\n  background-position: center center;\n}\n\n/* Parallel-specific style estimator block */\n\n#sk-container-id-1 div.sk-parallel-item::after {\n  content: \"\";\n  width: 100%;\n  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n  flex-grow: 1;\n}\n\n#sk-container-id-1 div.sk-parallel {\n  display: flex;\n  align-items: stretch;\n  justify-content: center;\n  background-color: var(--sklearn-color-background);\n  position: relative;\n}\n\n#sk-container-id-1 div.sk-parallel-item {\n  display: flex;\n  flex-direction: column;\n}\n\n#sk-container-id-1 div.sk-parallel-item:first-child::after {\n  align-self: flex-end;\n  width: 50%;\n}\n\n#sk-container-id-1 div.sk-parallel-item:last-child::after {\n  align-self: flex-start;\n  width: 50%;\n}\n\n#sk-container-id-1 div.sk-parallel-item:only-child::after {\n  width: 0;\n}\n\n/* Serial-specific style estimator block */\n\n#sk-container-id-1 div.sk-serial {\n  display: flex;\n  flex-direction: column;\n  align-items: center;\n  background-color: var(--sklearn-color-background);\n  padding-right: 1em;\n  padding-left: 1em;\n}\n\n\n/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\nclickable and can be expanded/collapsed.\n- Pipeline and ColumnTransformer use this feature and define the default style\n- Estimators will overwrite some part of the style using the `sk-estimator` class\n*/\n\n/* Pipeline and ColumnTransformer style (default) */\n\n#sk-container-id-1 div.sk-toggleable {\n  /* Default theme specific background. It is overwritten whether we have a\n  specific estimator or a Pipeline/ColumnTransformer */\n  background-color: var(--sklearn-color-background);\n}\n\n/* Toggleable label */\n#sk-container-id-1 label.sk-toggleable__label {\n  cursor: pointer;\n  display: block;\n  width: 100%;\n  margin-bottom: 0;\n  padding: 0.5em;\n  box-sizing: border-box;\n  text-align: center;\n}\n\n#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n  /* Arrow on the left of the label */\n  content: \"▸\";\n  float: left;\n  margin-right: 0.25em;\n  color: var(--sklearn-color-icon);\n}\n\n#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n  color: var(--sklearn-color-text);\n}\n\n/* Toggleable content - dropdown */\n\n#sk-container-id-1 div.sk-toggleable__content {\n  max-height: 0;\n  max-width: 0;\n  overflow: hidden;\n  text-align: left;\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content.fitted {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content pre {\n  margin: 0.2em;\n  border-radius: 0.25em;\n  color: var(--sklearn-color-text);\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n  /* unfitted */\n  background-color: var(--sklearn-color-fitted-level-0);\n}\n\n#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n  /* Expand drop-down */\n  max-height: 200px;\n  max-width: 100%;\n  overflow: auto;\n}\n\n#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n  content: \"▾\";\n}\n\n/* Pipeline/ColumnTransformer-specific style */\n\n#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n  color: var(--sklearn-color-text);\n  background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n  background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Estimator-specific style */\n\n/* Colorize estimator box */\n#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-2);\n}\n\n#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n#sk-container-id-1 div.sk-label label {\n  /* The background is the default theme color */\n  color: var(--sklearn-color-text-on-default-background);\n}\n\n/* On hover, darken the color of the background */\n#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n  color: var(--sklearn-color-text);\n  background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n/* Label box, darken color on hover, fitted */\n#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n  color: var(--sklearn-color-text);\n  background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Estimator label */\n\n#sk-container-id-1 div.sk-label label {\n  font-family: monospace;\n  font-weight: bold;\n  display: inline-block;\n  line-height: 1.2em;\n}\n\n#sk-container-id-1 div.sk-label-container {\n  text-align: center;\n}\n\n/* Estimator-specific */\n#sk-container-id-1 div.sk-estimator {\n  font-family: monospace;\n  border: 1px dotted var(--sklearn-color-border-box);\n  border-radius: 0.25em;\n  box-sizing: border-box;\n  margin-bottom: 0.5em;\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-0);\n}\n\n/* on hover */\n#sk-container-id-1 div.sk-estimator:hover {\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted:hover {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Specification for estimator info (e.g. \"i\" and \"?\") */\n\n/* Common style for \"i\" and \"?\" */\n\n.sk-estimator-doc-link,\na:link.sk-estimator-doc-link,\na:visited.sk-estimator-doc-link {\n  float: right;\n  font-size: smaller;\n  line-height: 1em;\n  font-family: monospace;\n  background-color: var(--sklearn-color-background);\n  border-radius: 1em;\n  height: 1em;\n  width: 1em;\n  text-decoration: none !important;\n  margin-left: 1ex;\n  /* unfitted */\n  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n  color: var(--sklearn-color-unfitted-level-1);\n}\n\n.sk-estimator-doc-link.fitted,\na:link.sk-estimator-doc-link.fitted,\na:visited.sk-estimator-doc-link.fitted {\n  /* fitted */\n  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n  color: var(--sklearn-color-fitted-level-1);\n}\n\n/* On hover */\ndiv.sk-estimator:hover .sk-estimator-doc-link:hover,\n.sk-estimator-doc-link:hover,\ndiv.sk-label-container:hover .sk-estimator-doc-link:hover,\n.sk-estimator-doc-link:hover {\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-3);\n  color: var(--sklearn-color-background);\n  text-decoration: none;\n}\n\ndiv.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n.sk-estimator-doc-link.fitted:hover,\ndiv.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n.sk-estimator-doc-link.fitted:hover {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-3);\n  color: var(--sklearn-color-background);\n  text-decoration: none;\n}\n\n/* Span, style for the box shown on hovering the info icon */\n.sk-estimator-doc-link span {\n  display: none;\n  z-index: 9999;\n  position: relative;\n  font-weight: normal;\n  right: .2ex;\n  padding: .5ex;\n  margin: .5ex;\n  width: min-content;\n  min-width: 20ex;\n  max-width: 50ex;\n  color: var(--sklearn-color-text);\n  box-shadow: 2pt 2pt 4pt #999;\n  /* unfitted */\n  background: var(--sklearn-color-unfitted-level-0);\n  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n}\n\n.sk-estimator-doc-link.fitted span {\n  /* fitted */\n  background: var(--sklearn-color-fitted-level-0);\n  border: var(--sklearn-color-fitted-level-3);\n}\n\n.sk-estimator-doc-link:hover span {\n  display: block;\n}\n\n/* \"?\"-specific style due to the `<a>` HTML tag */\n\n#sk-container-id-1 a.estimator_doc_link {\n  float: right;\n  font-size: 1rem;\n  line-height: 1em;\n  font-family: monospace;\n  background-color: var(--sklearn-color-background);\n  border-radius: 1rem;\n  height: 1rem;\n  width: 1rem;\n  text-decoration: none;\n  /* unfitted */\n  color: var(--sklearn-color-unfitted-level-1);\n  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n}\n\n#sk-container-id-1 a.estimator_doc_link.fitted {\n  /* fitted */\n  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n  color: var(--sklearn-color-fitted-level-1);\n}\n\n/* On hover */\n#sk-container-id-1 a.estimator_doc_link:hover {\n  /* unfitted */\n  background-color: var(--sklearn-color-unfitted-level-3);\n  color: var(--sklearn-color-background);\n  text-decoration: none;\n}\n\n#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n  /* fitted */\n  background-color: var(--sklearn-color-fitted-level-3);\n}\n</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>RandomForestClassifier(random_state=42)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;&nbsp;RandomForestClassifier<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.ensemble.RandomForestClassifier.html\">?<span>Documentation for RandomForestClassifier</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>RandomForestClassifier(random_state=42)</pre></div> </div></div></div></div>"
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Initialize the classifier\n",
    "classifier = RandomForestClassifier(n_estimators=100, random_state=42)\n",
    "\n",
    "# Train the classifier\n",
    "classifier.fit(X_train, y_train)"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:34:51.931355900Z",
     "start_time": "2024-02-25T05:33:08.961973100Z"
    }
   },
   "id": "b3617711d95450fb",
   "execution_count": 25
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Evaluate the Model\n",
    "\n",
    "To evaluate the performance of the trained model on the testing data, we will use the `predict` method of the `RandomForestClassifier` object to make predictions on the testing data. We will then use the `accuracy_score` and `classification_report` functions from the `sklearn.metrics` module to calculate the accuracy and generate a classification report.\n",
    "\n",
    "- **Accuracy:** The accuracy score function calculates the proportion of correctly classified instances.\n",
    "\n",
    "- **Precision:** The ratio of correctly predicted positive observations to the total predicted positive observations. It is calculated as:\n",
    "\n",
    "  $$\n",
    "  \\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Positives}}\n",
    "  $$\n",
    "\n",
    "- **Recall:** The ratio of correctly predicted positive observations to all observations in the actual class. It is calculated as:\n",
    "\n",
    "  $$\n",
    "  \\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Negatives}}\n",
    "  $$\n",
    "\n",
    "- **F1 Score:** The weighted average of precision and recall. It is calculated as:\n",
    "\n",
    "  $$\n",
    "  \\text{F1 Score} = 2 \\times \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\n",
    "  $$\n",
    "\n",
    "- **Support:** The number of actual occurrences of the class in the dataset.\n",
    "\n",
    "The classification report provides a summary of the precision, recall, F1-score, and support for each class in the testing data, giving insight into how well the model is performing for each class.\n"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "b63c56956f2f9620"
  },
  {
   "cell_type": "code",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.8491269841269842\n",
      "Classification Report:\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "        -1.0       0.81      0.90      0.86      6250\n",
      "         1.0       0.89      0.80      0.84      6350\n",
      "\n",
      "    accuracy                           0.85     12600\n",
      "   macro avg       0.85      0.85      0.85     12600\n",
      "weighted avg       0.85      0.85      0.85     12600\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score, classification_report\n",
    "\n",
    "# Make predictions on the test set\n",
    "y_pred = classifier.predict(X_test)\n",
    "\n",
    "# Evaluate the classifier\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "classification_rep = classification_report(y_test, y_pred)\n",
    "\n",
    "print(f\"Accuracy: {accuracy}\")\n",
    "print(f\"Classification Report:\\n{classification_rep}\")\n"
   ],
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-02-25T05:34:57.850908700Z",
     "start_time": "2024-02-25T05:34:57.294830200Z"
    }
   },
   "id": "1255f5a45a95e482",
   "execution_count": 26
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}