{ "cells": [ { "cell_type": "markdown", "source": [ "# CSC 3105 Project" ], "metadata": { "collapsed": false }, "id": "cda961ffb493d00c" }, { "cell_type": "code", "outputs": [], "source": [ "import os\n", "\n", "DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T07:13:38.496447973Z", "start_time": "2024-02-25T07:13:38.427854435Z" } }, "id": "bcd6cbaa5df10ce8", "execution_count": 10 }, { "cell_type": "markdown", "source": [ "# Load the data into a pandas dataframe\n", "\n", "The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n", "\n", "We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. " ], "metadata": { "collapsed": false }, "id": "bab890d7b05e347e" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shape: (42000, 1031)\n", "Total number of missing values: 0\n", "Cleaned data shape: (42000, 1031)\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from scipy import stats\n", "\n", "\n", "def load_data(dataset_dir):\n", " # Load the data\n", " file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n", " data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n", " print(f\"Original data shape: {data.shape}\")\n", " return data\n", "\n", "\n", "def clean_data(data):\n", " # Calculate total number of missing values\n", " total_missing = data.isnull().sum().sum()\n", " print(f\"Total number of missing values: {total_missing}\")\n", " \n", " # Handle missing values\n", " data = data.dropna()\n", "\n", " # Remove duplicates\n", " data = data.drop_duplicates()\n", "\n", " # Convert data types\n", " data['NLOS'] = data['NLOS'].astype(int)\n", "\n", " # Outlier detection and removal\n", " z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n", " data = data[(z_scores < 3).any(axis=1)]\n", "\n", " print(f\"Cleaned data shape: {data.shape}\")\n", " return data\n", "\n", "\n", "# Use the functions\n", "data = load_data(DATASET_DIR)\n", "data = clean_data(data)\n", "\n", "# print(data.head())\n", "\n", "# Print Headers\n", "# print(data.columns)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T07:13:45.772199567Z", "start_time": "2024-02-25T07:13:38.503776674Z" } }, "id": "dd9657f5ec6d7754", "execution_count": 11 }, { "cell_type": "markdown", "source": [ "The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n", "\n", "The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n", "\n", "The mathematical formulas used in this process are as follows:\n", "\n", "1. Calculate the mean (μ) of the data:\n", "\n", "$$\n", "\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n", "$$\n", "Where:\n", "- $n$ is the number of observations in the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\sum$ denotes the summation over all observations\n", "\n", "2. 
Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n", "\n", "$$\n", "\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n", "$$\n", "Where:\n", "- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n", "- $\\sigma$ is the standard deviation of the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\mu$ is the mean of the data\n", "\n", "The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n", "\n", "**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n" ], "metadata": { "collapsed": false }, "id": "2c13064e20601717" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of principal components after PCA is 868\n" ] } ], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Standardize the features, excluding the NLOS class label so it is neither rescaled nor leaked into PCA\n", "numerical_cols = data.select_dtypes(include=[np.number]).columns.drop('NLOS')\n", "scaler = StandardScaler()\n", "data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n", "\n", "# Initialize PCA with the desired explained variance\n", "pca = PCA(0.95)\n", "\n", "# Fit PCA to the standardized features\n", "pca.fit(data[numerical_cols])\n", "\n", "# Get the number of components\n", "num_components = pca.n_components_\n", "\n", "print(f\"The number of principal components after PCA is {num_components}\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T07:13:52.965454157Z", "start_time": "2024-02-25T07:13:45.777004441Z" } }, "id": "7f9bec73a42f7bca", "execution_count": 12 }, { "cell_type": "markdown", "source": [ "# Perform Dimensionality Reduction with PCA\n", "\n", "We can use the `transform` method of the fitted `PCA` object to project the standardized features onto the retained principal components. This will give us the transformed data with the desired number of components."
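, "\n", "\n", "As an optional sanity check (a small sketch, not part of the original pipeline), the cumulative explained variance of the fitted `pca` object can be inspected to confirm that the retained components really do cover about 95% of the variance (using the `np` alias imported earlier):\n", "\n", "```python\n", "# Cumulative share of variance explained by the retained components\n", "cumulative = np.cumsum(pca.explained_variance_ratio_)\n", "print(f\"Variance retained by {pca.n_components_} components: {cumulative[-1]:.3f}\")\n", "```"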
], "metadata": { "collapsed": false }, "id": "dc9f8c0e194dd07d" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original number of components: 1031\n", "Number of components after PCA: 868\n", "PCA has successfully reduced the number of components.\n" ] } ], "source": [ "# Project original data to PC with the highest eigenvalue\n", "data_pca = pca.transform(data)\n", "\n", "# Create a dataframe with the principal components\n", "data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n", "\n", "# Print the number of components in the original and PCA transformed data\n", "print(f\"Original number of components: {data.shape[1]}\")\n", "print(f\"Number of components after PCA: {num_components}\")\n", "\n", "# Compare the number of components in the original and PCA transformed data\n", "if data.shape[1] > num_components:\n", " print(\"PCA has successfully reduced the number of components.\")\n", "elif data.shape[1] < num_components:\n", " print(\"Unexpectedly, PCA has increased the number of components.\")\n", "else:\n", " print(\"The number of components remains unchanged after PCA.\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T07:13:53.900567276Z", "start_time": "2024-02-25T07:13:52.962847354Z" } }, "id": "96c62c50f8734a01", "execution_count": 13 }, { "cell_type": "markdown", "source": [ "## Data Mining / Machine Learning\n", "\n", "### I. Supervised Learning\n", "- **Decision**: Supervised learning is used due to the labeled dataset.\n", "- **Algorithm**: Random Forest Classifier is preferred for its performance in classification tasks.\n", "\n", "### II. Training/Test Split Ratio\n", "- **Decision**: 70:30 split is chosen for training/test dataset.\n", "- **Reasoning**: This split ensures sufficient data for training and testing.\n", "\n", "### III. Performance Metrics\n", "- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n", "- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n", "- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n", "\n", "The Random Forest Classifier is trained on the training set and evaluated on the test set using accuracy and classification report metrics.\n" ], "metadata": { "collapsed": false }, "id": "47d5cb383ce1f7ba" }, { "cell_type": "markdown", "source": [ "# Split the data into training and testing sets\n", "\n", "The next step is to split the data into training and testing sets. This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n", "\n", "We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio." 
], "metadata": { "collapsed": false }, "id": "576a6a92fc7fdbfd" }, { "cell_type": "code", "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the data into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T07:13:54.013838929Z", "start_time": "2024-02-25T07:13:53.894714031Z" } }, "id": "7db852fafd187d5a", "execution_count": 14 }, { "cell_type": "markdown", "source": [ "# Train a Random Forest Classifier\n", "\n", "The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n", "\n", "The random forest classifier is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", "\n", "We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data." ], "metadata": { "collapsed": false }, "id": "5753cc6db18bac73" }, { "cell_type": "code", "outputs": [ { "data": { "text/plain": "RandomForestClassifier(random_state=42)", "text/html": "
Fitted estimators (recovered from the cell outputs):
RandomForestClassifier(random_state=42)
SVC(kernel='linear', random_state=42)
LogisticRegression(random_state=42)
GradientBoostingClassifier(random_state=42)
KNeighborsClassifier(n_neighbors=3)
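The estimator signatures above show that a Random Forest Classifier (plus several alternative models) was fitted on the PCA-transformed features. The following is a minimal sketch of the training and evaluation flow described in the sections above, not the original notebook cells; it assumes the `X_train`, `X_test`, `y_train`, `y_test` variables from the split cell and the metrics listed under Performance Metrics:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the Random Forest Classifier on the training set
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = rf_classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```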