{ "cells": [ { "cell_type": "markdown", "source": [ "# CSC 3105 Project" ], "metadata": { "collapsed": false }, "id": "cda961ffb493d00c" }, { "cell_type": "code", "outputs": [], "source": [ "import os\n", "\n", "DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T03:26:58.464846949Z", "start_time": "2024-02-25T03:26:58.415028614Z" } }, "id": "bcd6cbaa5df10ce8", "execution_count": 73 }, { "cell_type": "markdown", "source": [ "# Load the data into a pandas dataframe\n", "\n", "The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n", "\n", "We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. " ], "metadata": { "collapsed": false }, "id": "bab890d7b05e347e" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shape: (42000, 1031)\n", "Cleaned data shape: (42000, 1031)\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "from scipy import stats\n", "\n", "\n", "def load_data(dataset_dir):\n", " # Load the data\n", " file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n", " data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n", " print(f\"Original data shape: {data.shape}\")\n", " return data\n", "\n", "\n", "def clean_data(data):\n", " # Handle missing values\n", " data = data.dropna()\n", "\n", " # Remove duplicates\n", " data = data.drop_duplicates()\n", "\n", " # Convert data types\n", " data['NLOS'] = data['NLOS'].astype(int)\n", "\n", " # Outlier detection and removal\n", " z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n", " data = data[(z_scores < 3).any(axis=1)]\n", "\n", " print(f\"Cleaned data shape: {data.shape}\")\n", " return data\n", "\n", "\n", "# Use the functions\n", "data = load_data(DATASET_DIR)\n", "data = clean_data(data)\n", "\n", "# print(data.head())\n", "\n", "# Print Headers\n", "# print(data.columns)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T03:27:06.334698247Z", "start_time": "2024-02-25T03:26:58.458307532Z" } }, "id": "dd9657f5ec6d7754", "execution_count": 74 }, { "cell_type": "markdown", "source": [ "The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n", "\n", "The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n", "\n", "The mathematical formulas used in this process are as follows:\n", "\n", "1. Calculate the mean (μ) of the data:\n", "\n", "$$\n", "\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n", "$$\n", "Where:\n", "- $n$ is the number of observations in the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\sum$ denotes the summation over all observations\n", "\n", "2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n", "\n", "$$\n", "\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n", "$$\n", "Where:\n", "- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n", "- $\\sigma$ is the standard deviation of the data\n", "- $x_i$ is the value of the $i$-th observation\n", "- $\\mu$ is the mean of the data\n", "\n", "The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n", "\n", "**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n" ], "metadata": { "collapsed": false }, "id": "2c13064e20601717" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The number of principle components after PCA is 868\n" ] } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Standardize the data\n", "numerical_cols = data.select_dtypes(include=[np.number]).columns\n", "scaler = StandardScaler()\n", "data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n", "\n", "# Initialize PCA with the desired explained variance\n", "pca = PCA(0.95)\n", "\n", "# Fit PCA to your data\n", "pca.fit(data)\n", "\n", "# Get the number of components\n", "num_components = pca.n_components_\n", "\n", "print(f\"The number of principle components after PCA is {num_components}\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T03:27:13.639843012Z", "start_time": "2024-02-25T03:27:06.336830842Z" } }, "id": "7f9bec73a42f7bca", "execution_count": 75 }, { "cell_type": "markdown", "source": [ "# Perform Dimensionality Reduction with PCA\n", "\n", "We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components." ], "metadata": { "collapsed": false }, "id": "dc9f8c0e194dd07d" }, { "cell_type": "code", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original number of components: 1031\n", "Number of components after PCA: 868\n", "PCA has successfully reduced the number of components.\n" ] } ], "source": [ "# Project original data to PC with the highest eigenvalue\n", "data_pca = pca.transform(data)\n", "\n", "# Create a dataframe with the principal components\n", "data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n", "\n", "# Print the number of components in the original and PCA transformed data\n", "print(f\"Original number of components: {data.shape[1]}\")\n", "print(f\"Number of components after PCA: {num_components}\")\n", "\n", "# Compare the number of components in the original and PCA transformed data\n", "if data.shape[1] > num_components:\n", " print(\"PCA has successfully reduced the number of components.\")\n", "elif data.shape[1] < num_components:\n", " print(\"Unexpectedly, PCA has increased the number of components.\")\n", "else:\n", " print(\"The number of components remains unchanged after PCA.\")" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-02-25T03:27:14.422886263Z", "start_time": "2024-02-25T03:27:13.660170622Z" } }, "id": "96c62c50f8734a01", "execution_count": 76 } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }