{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# CSC 3105 Project"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "cda961ffb493d00c"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"\n",
|
|
"DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:32:51.223079500Z",
|
|
"start_time": "2024-02-25T05:32:51.218515100Z"
|
|
}
|
|
},
|
|
"id": "bcd6cbaa5df10ce8",
|
|
"execution_count": 20
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# Load the data into a pandas dataframe\n",
|
|
"\n",
|
|
"The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n",
|
|
"\n",
|
"We then clean the data by dropping rows with missing values, removing duplicate rows, casting the `NLOS` label to an integer, and filtering outliers by z-score."
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "bab890d7b05e347e"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Original data shape: (42000, 1031)\n",
|
|
"Total number of missing values: 0\n",
|
|
"Cleaned data shape: (42000, 1031)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"\n",
"def load_data(dataset_dir):\n",
"    # Collect every file under the dataset directory and concatenate them into one frame\n",
"    file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n",
"    data = pd.concat((pd.read_csv(file_path) for file_path in file_paths), ignore_index=True)\n",
"    print(f\"Original data shape: {data.shape}\")\n",
"    return data\n",
"\n",
"\n",
"def clean_data(data):\n",
"    # Calculate total number of missing values\n",
"    total_missing = data.isnull().sum().sum()\n",
"    print(f\"Total number of missing values: {total_missing}\")\n",
"\n",
"    # Drop rows that contain missing values\n",
"    data = data.dropna()\n",
"\n",
"    # Remove duplicate rows\n",
"    data = data.drop_duplicates()\n",
"\n",
"    # Store the NLOS label as an integer\n",
"    data['NLOS'] = data['NLOS'].astype(int)\n",
"\n",
"    # Outlier filtering: keep a row if at least one numeric column has |z| < 3.\n",
"    # This is a lenient filter; in the recorded run it removes no rows.\n",
"    z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n",
"    data = data[(z_scores < 3).any(axis=1)]\n",
"\n",
"    print(f\"Cleaned data shape: {data.shape}\")\n",
"    return data\n",
"\n",
"\n",
"# Use the functions\n",
"data = load_data(DATASET_DIR)\n",
"data = clean_data(data)\n",
"\n",
"# print(data.head())\n",
"\n",
"# Print Headers\n",
"# print(data.columns)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:32:56.753045Z",
|
|
"start_time": "2024-02-25T05:32:52.416237400Z"
|
|
}
|
|
},
|
|
"id": "dd9657f5ec6d7754",
|
|
"execution_count": 21
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
"The next cell standardizes the data, a common preprocessing step in machine learning workflows, and then fits PCA to it.\n",
"\n",
"Standardization transforms each feature so that it has a mean of 0 and a standard deviation of 1. This puts all features on the same scale, which matters for scale-sensitive methods such as PCA: without it, features with large numeric ranges would dominate the principal components.\n",
"\n",
"The formulas used in this process are as follows:\n",
"\n",
"1. Calculate the mean (μ) of the data:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"Where:\n",
"- $n$ is the number of observations in the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\sum$ denotes the summation over all observations\n",
"\n",
"2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
"\n",
"$$\n",
"\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
"$$\n",
"Where:\n",
"- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n",
"- $\\sigma$ is the standard deviation of the data, $\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} (x_i - \\mu)^2}$\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\mu$ is the mean of the data\n",
"\n",
"The `StandardScaler` class from the `sklearn.preprocessing` module performs this standardization. Its `fit_transform` method calculates the mean and standard deviation of each column and then applies the standardization.\n",
"\n",
"**Note:** Passing `0.95` to `PCA` asks for the smallest number of principal components whose cumulative explained variance is at least 95% of the variance in the input data. The transformed data therefore retains 95% of the variance of the original while potentially having fewer dimensions.\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "2c13064e20601717"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
"The number of principal components after PCA is 868\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize every numeric column (this includes the NLOS label column)\n",
"numerical_cols = data.select_dtypes(include=[np.number]).columns\n",
"scaler = StandardScaler()\n",
"data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n",
"\n",
"# Initialize PCA to keep enough components to explain 95% of the variance\n",
"pca = PCA(0.95)\n",
"\n",
"# Fit PCA to the standardized data\n",
"pca.fit(data)\n",
"\n",
"# Get the number of components that PCA retained\n",
"num_components = pca.n_components_\n",
"\n",
"print(f\"The number of principal components after PCA is {num_components}\")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:33:01.393540600Z",
|
|
"start_time": "2024-02-25T05:32:57.831719Z"
|
|
}
|
|
},
|
|
"id": "7f9bec73a42f7bca",
|
|
"execution_count": 22
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# Perform Dimensionality Reduction with PCA\n",
|
|
"\n",
|
|
"We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components."
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "dc9f8c0e194dd07d"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Original number of components: 1031\n",
|
|
"Number of components after PCA: 868\n",
|
|
"PCA has successfully reduced the number of components.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"# Project the standardized data onto the retained principal components\n",
"data_pca = pca.transform(data)\n",
"\n",
"# Create a dataframe with the principal components\n",
"data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
"\n",
"# Print the number of columns before PCA and the number of components after it\n",
"print(f\"Original number of components: {data.shape[1]}\")\n",
"print(f\"Number of components after PCA: {num_components}\")\n",
"\n",
"# Compare the column count before PCA with the component count after it\n",
"if data.shape[1] > num_components:\n",
"    print(\"PCA has successfully reduced the number of components.\")\n",
"elif data.shape[1] < num_components:\n",
"    print(\"Unexpectedly, PCA has increased the number of components.\")\n",
"else:\n",
"    print(\"The number of components remains unchanged after PCA.\")"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:47:02.511321300Z",
|
|
"start_time": "2024-02-25T05:47:01.989494800Z"
|
|
}
|
|
},
|
|
"id": "96c62c50f8734a01",
|
|
"execution_count": 29
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
"## Data Mining / Machine Learning\n",
"\n",
"### I. Supervised Learning\n",
"- **Decision**: Supervised learning is used because every sample in the dataset carries an `NLOS` label.\n",
"- **Algorithm**: A Random Forest classifier is chosen as a strong general-purpose classifier that needs little tuning.\n",
"\n",
"### II. Training/Test Split Ratio\n",
"- **Decision**: A 70:30 split is chosen for the training/test datasets.\n",
"- **Reasoning**: With 42,000 samples, this leaves 29,400 for training and 12,600 for testing.\n",
"\n",
"### III. Performance Metrics\n",
"- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n",
"- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n",
"- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n",
"\n",
"The Random Forest classifier is trained on the training set and evaluated on the test set using accuracy and the classification report.\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "47d5cb383ce1f7ba"
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# Split the data into training and testing sets\n",
|
|
"\n",
|
|
"The next step is to split the data into training and testing sets. This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n",
|
|
"\n",
|
|
"We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio."
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "576a6a92fc7fdbfd"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"\n",
|
|
"# Split the data into training and test sets\n",
|
|
"X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:33:06.047014800Z",
|
|
"start_time": "2024-02-25T05:33:05.983534100Z"
|
|
}
|
|
},
|
|
"id": "7db852fafd187d5a",
|
|
"execution_count": 24
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# Train a Random Forest Classifier\n",
|
|
"\n",
|
|
"The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n",
|
|
"\n",
"A random forest is an ensemble learning method that builds many decision trees at training time. For classification, scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities and returning the class with the highest average.\n",
|
|
"\n",
|
|
"We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data."
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "5753cc6db18bac73"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": "RandomForestClassifier(random_state=42)",
|
|
"text/html": "<style>#sk-container-id-1 {\n /* Definition of color scheme common for light and dark mode */\n --sklearn-color-text: black;\n --sklearn-color-line: gray;\n /* Definition of color scheme for unfitted estimators */\n --sklearn-color-unfitted-level-0: #fff5e6;\n --sklearn-color-unfitted-level-1: #f6e4d2;\n --sklearn-color-unfitted-level-2: #ffe0b3;\n --sklearn-color-unfitted-level-3: chocolate;\n /* Definition of color scheme for fitted estimators */\n --sklearn-color-fitted-level-0: #f0f8ff;\n --sklearn-color-fitted-level-1: #d4ebff;\n --sklearn-color-fitted-level-2: #b3dbfd;\n --sklearn-color-fitted-level-3: cornflowerblue;\n\n /* Specific color for light theme */\n --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n --sklearn-color-icon: #696969;\n\n @media (prefers-color-scheme: dark) {\n /* Redefinition of color scheme for dark theme */\n --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n --sklearn-color-icon: #878787;\n }\n}\n\n#sk-container-id-1 {\n color: var(--sklearn-color-text);\n}\n\n#sk-container-id-1 pre {\n padding: 0;\n}\n\n#sk-container-id-1 input.sk-hidden--visually {\n border: 0;\n clip: rect(1px 1px 1px 1px);\n clip: rect(1px, 1px, 1px, 1px);\n height: 1px;\n margin: -1px;\n overflow: hidden;\n padding: 0;\n position: absolute;\n width: 1px;\n}\n\n#sk-container-id-1 div.sk-dashed-wrapped {\n border: 1px 
dashed var(--sklearn-color-line);\n margin: 0 0.4em 0.5em 0.4em;\n box-sizing: border-box;\n padding-bottom: 0.4em;\n background-color: var(--sklearn-color-background);\n}\n\n#sk-container-id-1 div.sk-container {\n /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n but bootstrap.min.css set `[hidden] { display: none !important; }`\n so we also need the `!important` here to be able to override the\n default hidden behavior on the sphinx rendered scikit-learn.org.\n See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n display: inline-block !important;\n position: relative;\n}\n\n#sk-container-id-1 div.sk-text-repr-fallback {\n display: none;\n}\n\ndiv.sk-parallel-item,\ndiv.sk-serial,\ndiv.sk-item {\n /* draw centered vertical line to link estimators */\n background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n background-size: 2px 100%;\n background-repeat: no-repeat;\n background-position: center center;\n}\n\n/* Parallel-specific style estimator block */\n\n#sk-container-id-1 div.sk-parallel-item::after {\n content: \"\";\n width: 100%;\n border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n flex-grow: 1;\n}\n\n#sk-container-id-1 div.sk-parallel {\n display: flex;\n align-items: stretch;\n justify-content: center;\n background-color: var(--sklearn-color-background);\n position: relative;\n}\n\n#sk-container-id-1 div.sk-parallel-item {\n display: flex;\n flex-direction: column;\n}\n\n#sk-container-id-1 div.sk-parallel-item:first-child::after {\n align-self: flex-end;\n width: 50%;\n}\n\n#sk-container-id-1 div.sk-parallel-item:last-child::after {\n align-self: flex-start;\n width: 50%;\n}\n\n#sk-container-id-1 div.sk-parallel-item:only-child::after {\n width: 0;\n}\n\n/* Serial-specific style estimator block */\n\n#sk-container-id-1 div.sk-serial {\n display: flex;\n flex-direction: column;\n align-items: center;\n background-color: 
var(--sklearn-color-background);\n padding-right: 1em;\n padding-left: 1em;\n}\n\n\n/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\nclickable and can be expanded/collapsed.\n- Pipeline and ColumnTransformer use this feature and define the default style\n- Estimators will overwrite some part of the style using the `sk-estimator` class\n*/\n\n/* Pipeline and ColumnTransformer style (default) */\n\n#sk-container-id-1 div.sk-toggleable {\n /* Default theme specific background. It is overwritten whether we have a\n specific estimator or a Pipeline/ColumnTransformer */\n background-color: var(--sklearn-color-background);\n}\n\n/* Toggleable label */\n#sk-container-id-1 label.sk-toggleable__label {\n cursor: pointer;\n display: block;\n width: 100%;\n margin-bottom: 0;\n padding: 0.5em;\n box-sizing: border-box;\n text-align: center;\n}\n\n#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n /* Arrow on the left of the label */\n content: \"▸\";\n float: left;\n margin-right: 0.25em;\n color: var(--sklearn-color-icon);\n}\n\n#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n color: var(--sklearn-color-text);\n}\n\n/* Toggleable content - dropdown */\n\n#sk-container-id-1 div.sk-toggleable__content {\n max-height: 0;\n max-width: 0;\n overflow: hidden;\n text-align: left;\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content.fitted {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content pre {\n margin: 0.2em;\n border-radius: 0.25em;\n color: var(--sklearn-color-text);\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n /* unfitted */\n background-color: var(--sklearn-color-fitted-level-0);\n}\n\n#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content 
{\n /* Expand drop-down */\n max-height: 200px;\n max-width: 100%;\n overflow: auto;\n}\n\n#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n content: \"▾\";\n}\n\n/* Pipeline/ColumnTransformer-specific style */\n\n#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n color: var(--sklearn-color-text);\n background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Estimator-specific style */\n\n/* Colorize estimator box */\n#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-2);\n}\n\n#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n#sk-container-id-1 div.sk-label label {\n /* The background is the default theme color */\n color: var(--sklearn-color-text-on-default-background);\n}\n\n/* On hover, darken the color of the background */\n#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n color: var(--sklearn-color-text);\n background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n/* Label box, darken color on hover, fitted */\n#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n color: var(--sklearn-color-text);\n background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Estimator label */\n\n#sk-container-id-1 div.sk-label label {\n font-family: monospace;\n font-weight: bold;\n display: inline-block;\n line-height: 1.2em;\n}\n\n#sk-container-id-1 div.sk-label-container {\n text-align: center;\n}\n\n/* Estimator-specific 
*/\n#sk-container-id-1 div.sk-estimator {\n font-family: monospace;\n border: 1px dotted var(--sklearn-color-border-box);\n border-radius: 0.25em;\n box-sizing: border-box;\n margin-bottom: 0.5em;\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-0);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-0);\n}\n\n/* on hover */\n#sk-container-id-1 div.sk-estimator:hover {\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-2);\n}\n\n#sk-container-id-1 div.sk-estimator.fitted:hover {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-2);\n}\n\n/* Specification for estimator info (e.g. \"i\" and \"?\") */\n\n/* Common style for \"i\" and \"?\" */\n\n.sk-estimator-doc-link,\na:link.sk-estimator-doc-link,\na:visited.sk-estimator-doc-link {\n float: right;\n font-size: smaller;\n line-height: 1em;\n font-family: monospace;\n background-color: var(--sklearn-color-background);\n border-radius: 1em;\n height: 1em;\n width: 1em;\n text-decoration: none !important;\n margin-left: 1ex;\n /* unfitted */\n border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n color: var(--sklearn-color-unfitted-level-1);\n}\n\n.sk-estimator-doc-link.fitted,\na:link.sk-estimator-doc-link.fitted,\na:visited.sk-estimator-doc-link.fitted {\n /* fitted */\n border: var(--sklearn-color-fitted-level-1) 1pt solid;\n color: var(--sklearn-color-fitted-level-1);\n}\n\n/* On hover */\ndiv.sk-estimator:hover .sk-estimator-doc-link:hover,\n.sk-estimator-doc-link:hover,\ndiv.sk-label-container:hover .sk-estimator-doc-link:hover,\n.sk-estimator-doc-link:hover {\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-3);\n color: var(--sklearn-color-background);\n text-decoration: none;\n}\n\ndiv.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n.sk-estimator-doc-link.fitted:hover,\ndiv.sk-label-container:hover 
.sk-estimator-doc-link.fitted:hover,\n.sk-estimator-doc-link.fitted:hover {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-3);\n color: var(--sklearn-color-background);\n text-decoration: none;\n}\n\n/* Span, style for the box shown on hovering the info icon */\n.sk-estimator-doc-link span {\n display: none;\n z-index: 9999;\n position: relative;\n font-weight: normal;\n right: .2ex;\n padding: .5ex;\n margin: .5ex;\n width: min-content;\n min-width: 20ex;\n max-width: 50ex;\n color: var(--sklearn-color-text);\n box-shadow: 2pt 2pt 4pt #999;\n /* unfitted */\n background: var(--sklearn-color-unfitted-level-0);\n border: .5pt solid var(--sklearn-color-unfitted-level-3);\n}\n\n.sk-estimator-doc-link.fitted span {\n /* fitted */\n background: var(--sklearn-color-fitted-level-0);\n border: var(--sklearn-color-fitted-level-3);\n}\n\n.sk-estimator-doc-link:hover span {\n display: block;\n}\n\n/* \"?\"-specific style due to the `<a>` HTML tag */\n\n#sk-container-id-1 a.estimator_doc_link {\n float: right;\n font-size: 1rem;\n line-height: 1em;\n font-family: monospace;\n background-color: var(--sklearn-color-background);\n border-radius: 1rem;\n height: 1rem;\n width: 1rem;\n text-decoration: none;\n /* unfitted */\n color: var(--sklearn-color-unfitted-level-1);\n border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n}\n\n#sk-container-id-1 a.estimator_doc_link.fitted {\n /* fitted */\n border: var(--sklearn-color-fitted-level-1) 1pt solid;\n color: var(--sklearn-color-fitted-level-1);\n}\n\n/* On hover */\n#sk-container-id-1 a.estimator_doc_link:hover {\n /* unfitted */\n background-color: var(--sklearn-color-unfitted-level-3);\n color: var(--sklearn-color-background);\n text-decoration: none;\n}\n\n#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n /* fitted */\n background-color: var(--sklearn-color-fitted-level-3);\n}\n</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div 
class=\"sk-text-repr-fallback\"><pre>RandomForestClassifier(random_state=42)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\"> RandomForestClassifier<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.ensemble.RandomForestClassifier.html\">?<span>Documentation for RandomForestClassifier</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>RandomForestClassifier(random_state=42)</pre></div> </div></div></div></div>"
|
|
},
|
|
"execution_count": 25,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.ensemble import RandomForestClassifier\n",
|
|
"\n",
|
|
"# Initialize the classifier\n",
|
|
"classifier = RandomForestClassifier(n_estimators=100, random_state=42)\n",
|
|
"\n",
|
|
"# Train the classifier\n",
|
|
"classifier.fit(X_train, y_train)"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:34:51.931355900Z",
|
|
"start_time": "2024-02-25T05:33:08.961973100Z"
|
|
}
|
|
},
|
|
"id": "b3617711d95450fb",
|
|
"execution_count": 25
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"# Evaluate the Model\n",
|
|
"\n",
|
|
"To evaluate the performance of the trained model on the testing data, we will use the `predict` method of the `RandomForestClassifier` object to make predictions on the testing data. We will then use the `accuracy_score` and `classification_report` functions from the `sklearn.metrics` module to calculate the accuracy and generate a classification report.\n",
|
|
"\n",
|
|
"- **Accuracy:** The accuracy score function calculates the proportion of correctly classified instances.\n",
|
|
"\n",
|
|
"- **Precision:** The ratio of correctly predicted positive observations to the total predicted positive observations. It is calculated as:\n",
|
|
"\n",
|
|
" $$\n",
|
|
" \\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Positives}}\n",
|
|
" $$\n",
|
|
"\n",
|
|
"- **Recall:** The ratio of correctly predicted positive observations to all observations in the actual class. It is calculated as:\n",
|
|
"\n",
|
|
" $$\n",
|
|
" \\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Negatives}}\n",
|
|
" $$\n",
|
|
"\n",
"- **F1 Score:** The harmonic mean of precision and recall. It is calculated as:\n",
|
|
"\n",
|
|
" $$\n",
|
|
" \\text{F1 Score} = 2 \\times \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\n",
|
|
" $$\n",
|
|
"\n",
|
|
"- **Support:** The number of actual occurrences of the class in the dataset.\n",
|
|
"\n",
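"The classification report summarizes the precision, recall, F1-score, and support for each class in the testing data, giving insight into how well the model performs on each class. Because the `NLOS` label column was standardized together with the features, the two classes appear in the report as -1.0 (LOS) and 1.0 (NLOS).\n"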
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "b63c56956f2f9620"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Accuracy: 0.8491269841269842\n",
|
|
"Classification Report:\n",
"              precision    recall  f1-score   support\n",
"\n",
"        -1.0       0.81      0.90      0.86      6250\n",
"         1.0       0.89      0.80      0.84      6350\n",
"\n",
"    accuracy                           0.85     12600\n",
"   macro avg       0.85      0.85      0.85     12600\n",
"weighted avg       0.85      0.85      0.85     12600\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.metrics import accuracy_score, classification_report\n",
|
|
"\n",
|
|
"# Make predictions on the test set\n",
|
|
"y_pred = classifier.predict(X_test)\n",
|
|
"\n",
|
|
"# Evaluate the classifier\n",
|
|
"accuracy = accuracy_score(y_test, y_pred)\n",
|
|
"classification_rep = classification_report(y_test, y_pred)\n",
|
|
"\n",
|
|
"print(f\"Accuracy: {accuracy}\")\n",
|
|
"print(f\"Classification Report:\\n{classification_rep}\")\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2024-02-25T05:34:57.850908700Z",
|
|
"start_time": "2024-02-25T05:34:57.294830200Z"
|
|
}
|
|
},
|
|
"id": "1255f5a45a95e482",
|
|
"execution_count": 26
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 2
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython2",
|
|
"version": "2.7.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|