{ "cells": [ { "cell_type": "markdown", "source": [ "# CSC 3105 Project" ], "metadata": { "collapsed": false }, "id": "cda961ffb493d00c" }, { "cell_type": "code", "outputs": [], "source": [ "import os\n", "\n", "DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-27T07:53:58.605705507Z", "start_time": "2024-02-27T07:53:58.539206652Z" } }, "id": "bcd6cbaa5df10ce8", "execution_count": 1 }, { "cell_type": "markdown", "source": [ "# Load and Clean the Data\n", "\n", "This code block performs the following operations:\n", "\n", "1. Imports necessary libraries for data handling and cleaning.\n", "2. Defines a function `load_data` to load the data from a given directory into a pandas dataframe.\n", "3. Defines a function `clean_data` to clean the loaded data. The cleaning process includes:\n", " - Handling missing values by dropping them.\n", " - Removing duplicate rows.\n", " - Converting the 'NLOS' column to integer data type.\n", " - Normalizing the 'Measured range (time of flight)' column.\n", " - Creating new features 'FP_SUM' and 'SNR'.\n", " - One-hot encoding categorical features.\n", " - Performing feature extraction on 'CIR' columns.\n", " - Dropping the original 'CIR' columns.\n", " - Checking for columns with only one unique value and dropping them.\n", "4. Checks if a pickle file with the cleaned data exists. If it does, it loads the data from the file. If it doesn't, it loads and cleans the data using the defined functions.\n", "5. Prints the first few rows of the cleaned data and its column headers." ], "metadata": { "collapsed": false }, "id": "73fe8802e95a784f" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " NLOS RANGE FP_IDX FP_AMP1 FP_AMP2 FP_AMP3 STDEV_NOISE CIR_PWR \\\n", "0 1 0.220557 749.0 4889.0 13876.0 10464.0 240.0 9048.0 \n", "1 1 0.162027 741.0 2474.0 2002.0 1593.0 68.0 6514.0 \n", "2 1 0.156674 744.0 1934.0 2615.0 4114.0 52.0 2880.0 \n", "3 1 0.045325 748.0 16031.0 17712.0 10420.0 64.0 12855.0 \n", "4 0 0.041399 743.0 20070.0 19886.0 15727.0 76.0 11607.0 \n", "\n", " MAX_NOISE RXPACC FRAME_LEN PREAM_LEN FP_SUM SNR CIR_MEAN \\\n", "0 3668.0 1024.0 2 0 29229.0 37.700000 768.607283 \n", "1 1031.0 1024.0 2 0 6069.0 95.794118 416.879921 \n", "2 796.0 1024.0 0 0 8663.0 55.384615 378.266732 \n", "3 1529.0 323.0 2 0 44163.0 200.859375 333.926181 \n", "4 2022.0 296.0 0 0 55683.0 152.723684 391.251969 \n", "\n", " CIR_STD CIR_SKEW CIR_KURT \n", "0 1122.978435 10.293579 125.637500 \n", "1 903.090169 9.653334 114.460327 \n", "2 592.272425 6.944884 60.312664 \n", "3 1304.198732 11.892028 150.305512 \n", "4 1305.050753 11.689941 151.846488 \n", "Index(['NLOS', 'RANGE', 'FP_IDX', 'FP_AMP1', 'FP_AMP2', 'FP_AMP3',\n", " 'STDEV_NOISE', 'CIR_PWR', 'MAX_NOISE', 'RXPACC', 'FRAME_LEN',\n", " 'PREAM_LEN', 'FP_SUM', 'SNR', 'CIR_MEAN', 'CIR_STD', 'CIR_SKEW',\n", " 'CIR_KURT'],\n", " dtype='object')\n" ] } ], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from scipy import stats\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.preprocessing import LabelEncoder\n", "import pickle\n", "\n", "def load_data(dataset_dir):\n", " # Load the data\n", " file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n", " data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n", " print(f\"Original data shape: {data.shape}\")\n", " return data\n", "\n", "def 
clean_data(data):\n", " # Calculate total number of missing values in the data\n", " # This is important to understand the quality of the data\n", " total_missing = data.isnull().sum().sum()\n", " print(f\"Total number of missing values: {total_missing}\")\n", "\n", " # The data has no missing values, but drop any as a safeguard\n", " data = data.dropna()\n", "\n", " # The data has no duplicate rows, but drop any as a safeguard\n", " data = data.drop_duplicates()\n", "\n", " # Convert 'NLOS' column to integer data type (0 for LOS, 1 for NLOS)\n", " # This is necessary for further analysis as some algorithms can only handle numeric data\n", " data['NLOS'] = data['NLOS'].astype(int)\n", "\n", " # Normalize 'Measured range (time of flight)' column using MinMaxScaler\n", " # Normalization ensures that all features have the same scale\n", " scaler = MinMaxScaler()\n", " data['RANGE'] = scaler.fit_transform(data[['RANGE']])\n", "\n", " # Create new feature 'FP_SUM' by adding 'FP_AMP1', 'FP_AMP2', and 'FP_AMP3'\n", " # This can potentially enhance the model's performance by introducing new meaningful information\n", " data['FP_SUM'] = data['FP_AMP1'] + data['FP_AMP2'] + data['FP_AMP3']\n", "\n", " # Create new feature 'SNR' by dividing 'CIR_PWR' by 'STDEV_NOISE'\n", " # This can potentially enhance the model's performance by introducing new meaningful information\n", " data['SNR'] = data['CIR_PWR'] / data['STDEV_NOISE']\n", "\n", " # Label encode categorical features with LabelEncoder\n", " # This is necessary as many machine learning algorithms cannot handle categorical data directly\n", " categorical_features = ['CH', 'FRAME_LEN', 'PREAM_LEN', 'BITRATE', 'PRFR']\n", " encoder = LabelEncoder()\n", " for feature in categorical_features:\n", " data[feature] = encoder.fit_transform(data[feature])\n", "\n", " # List of CIR columns\n", " cir_columns = [f\"CIR{i}\" for i in range(1016)]\n", "\n", " # Feature extraction on 'CIR' columns\n", " # This can potentially enhance the model's performance by introducing new meaningful information\n", " data['CIR_MEAN'] = data[cir_columns].mean(axis=1)\n", " data['CIR_STD'] = data[cir_columns].std(axis=1)\n", " data['CIR_SKEW'] = data[cir_columns].apply(stats.skew, axis=1)\n", " data['CIR_KURT'] = data[cir_columns].apply(stats.kurtosis, axis=1)\n", "\n", " # Drop the original 'CIR' columns\n", " # This is done to reduce the dimensionality of the data after extracting the necessary information\n", " data = data.drop(columns=cir_columns)\n", "\n", " # List of columns to check for unique values\n", " columns_to_check = ['CH', 'PREAM_LEN', 'BITRATE', 'PRFR']\n", "\n", " # Iterate over the columns\n", " for column in columns_to_check:\n", " # If the column has only one unique value, drop it\n", " # Columns with only one unique value do not contribute to the model's performance\n", " if data[column].nunique() == 1:\n", " data = data.drop(column, axis=1)\n", "\n", " # Print the shape of the cleaned data\n", " print(f\"Cleaned data shape: {data.shape}\")\n", " \n", " # Return the cleaned data\n", " return data\n", "\n", "# Check if the file exists\n", "if os.path.exists('data.pkl'):\n", " # If the file exists, load it\n", " with open('data.pkl', 'rb') as f:\n", " data = pickle.load(f)\n", "else:\n", " # If the file doesn't exist, load and clean the data\n", " data = load_data(DATASET_DIR)\n", " data = clean_data(data)\n", " with open('data.pkl', 'wb') as f:\n", " pickle.dump(data, f)\n", "\n", "print(data.head())\n", "\n", "# Print Headers\n", "print(data.columns)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": 
"2024-02-27T07:53:59.485922432Z", "start_time": "2024-02-27T07:53:58.561389045Z" } }, "id": "e01fe23e950f89a", "execution_count": 2 }, { "cell_type": "markdown", "source": [ "The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n", "\n", "The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n", "\n", "The mathematical formulas used in this process are as follows:\n", "\n", "1. Calculate the mean (μ) of the data:\n", "\n", "$$\n", "\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n", "$$\n", "Where:\n", "- $n$ is the number of observations in the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\sum$ denotes the summation over all observations\n", "\n", "2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n", "\n", "$$\n", "\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n", "$$\n", "Where:\n", "- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n", "- $\\sigma$ is the standard deviation of the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\mu$ is the mean of the data\n", "\n", "The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n", "\n", "**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n" ], "metadata": { "collapsed": false }, "id": "2c13064e20601717" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of principle components after PCA is 10\n" ] } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Standardize the data\n", "numerical_cols = data.select_dtypes(include=[np.number]).columns\n", "scaler = StandardScaler()\n", "data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n", "\n", "# Initialize PCA with the desired explained variance\n", "pca = PCA(0.95)\n", "\n", "# Fit PCA to your data\n", "pca.fit(data)\n", "\n", "# Get the number of components\n", "num_components = pca.n_components_\n", "\n", "print(f\"The number of principle components after PCA is {num_components}\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-27T07:53:59.786081548Z", "start_time": "2024-02-27T07:53:59.426356025Z" } }, "id": "7f9bec73a42f7bca", "execution_count": 3 }, { "cell_type": "markdown", "source": [ "# Perform Dimensionality Reduction with PCA\n", "\n", "We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components." 
], "metadata": { "collapsed": false }, "id": "dc9f8c0e194dd07d" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original number of components: 18\n", "Number of components after PCA: 10\n", "PCA has successfully reduced the number of components.\n" ] } ], "source": [ "# Project original data to PC with the highest eigenvalue\n", "data_pca = pca.transform(data)\n", "\n", "# Create a dataframe with the principal components\n", "data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n", "\n", "# Print the number of components in the original and PCA transformed data\n", "print(f\"Original number of components: {data.shape[1]}\")\n", "print(f\"Number of components after PCA: {num_components}\")\n", "\n", "# Compare the number of components in the original and PCA transformed data\n", "if data.shape[1] > num_components:\n", " print(\"PCA has successfully reduced the number of components.\")\n", "elif data.shape[1] < num_components:\n", " print(\"Unexpectedly, PCA has increased the number of components.\")\n", "else:\n", " print(\"The number of components remains unchanged after PCA.\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-27T07:53:59.879678825Z", "start_time": "2024-02-27T07:53:59.583880502Z" } }, "id": "96c62c50f8734a01", "execution_count": 4 }, { "cell_type": "markdown", "source": [ "## Data Mining / Machine Learning\n", "\n", "### I. Supervised Learning\n", "- **Decision**: Supervised learning is used due to the labeled dataset.\n", "- **Algorithm**: Random Forest Classifier is preferred for its performance in classification tasks.\n", "\n", "### II. Training/Test Split Ratio\n", "- **Decision**: 70:30 split is chosen for training/test dataset.\n", "- **Reasoning**: This split ensures sufficient data for training and testing.\n", "\n", "### III. Performance Metrics\n", "- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n", "- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n", "- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n", "\n", "The Random Forest Classifier is trained on the training set and evaluated on the test set using accuracy and classification report metrics.\n" ], "metadata": { "collapsed": false }, "id": "47d5cb383ce1f7ba" }, { "cell_type": "markdown", "source": [ "# Split the data into training and testing sets\n", "\n", "The next step is to split the data into training and testing sets. This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n", "\n", "We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio." 
], "metadata": { "collapsed": false }, "id": "576a6a92fc7fdbfd" }, { "cell_type": "code", "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the data into training and test sets\n", "# X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)\n", "X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)\n" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-27T07:53:59.913309014Z", "start_time": "2024-02-27T07:53:59.614411810Z" } }, "id": "7db852fafd187d5a", "execution_count": 5 }, { "cell_type": "markdown", "source": [ "# Train a Random Forest Classifier\n", "\n", "The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n", "\n", "The random forest classifier is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", "\n", "We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data." ], "metadata": { "collapsed": false }, "id": "5753cc6db18bac73" }, { "cell_type": "code", "outputs": [ { "data": { "text/plain": "RandomForestClassifier(random_state=42)", "text/html": "
[Recorded outputs of the remaining model cells, fitted estimators: SVC(kernel='linear', random_state=42), LogisticRegression(random_state=42), GradientBoostingClassifier(random_state=42), KNeighborsClassifier(n_neighbors=3)]
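Only the recorded estimator outputs of these later model cells are preserved above, not their source. The sketch below is therefore a hedged reconstruction rather than the notebook's original code: it assumes the `X_train`/`X_test`/`y_train`/`y_test` variables from the split cell, gathers the five estimators implied by the recorded outputs into a hypothetical `models` dictionary, and reports the classification accuracy, confusion matrix, and classification report listed under Performance Metrics.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Estimators taken from the recorded outputs above; the dictionary itself is hypothetical.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (linear kernel)": SVC(kernel='linear', random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

for name, model in models.items():
    # Train on the 70% training split and predict on the 30% test split.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Report the metrics described under "Performance Metrics".
    print(f"=== {name} ===")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print("Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```

Looping over a single dictionary keeps the training and evaluation steps identical for every classifier, so the reported metrics stay directly comparable.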