CSC3105_Project/Project.ipynb

{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CSC 3105 Project"
],
"metadata": {
"collapsed": false
},
"id": "cda961ffb493d00c"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import os\n",
"\n",
"DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:26:58.464846949Z",
"start_time": "2024-02-25T03:26:58.415028614Z"
}
},
"id": "bcd6cbaa5df10ce8",
"execution_count": 73
},
{
"cell_type": "markdown",
"source": [
"# Load the data into a pandas dataframe\n",
"\n",
"The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n",
"\n",
"We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. "
],
"metadata": {
"collapsed": false
},
"id": "bab890d7b05e347e"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original data shape: (42000, 1031)\n",
"Cleaned data shape: (42000, 1031)\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"\n",
"def load_data(dataset_dir):\n",
" # Load the data\n",
" file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n",
" data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n",
" print(f\"Original data shape: {data.shape}\")\n",
" return data\n",
"\n",
"\n",
"def clean_data(data):\n",
" # Handle missing values\n",
" data = data.dropna()\n",
"\n",
" # Remove duplicates\n",
" data = data.drop_duplicates()\n",
"\n",
" # Convert data types\n",
" data['NLOS'] = data['NLOS'].astype(int)\n",
"\n",
" # Outlier detection and removal\n",
" z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n",
" data = data[(z_scores < 3).any(axis=1)]\n",
"\n",
" print(f\"Cleaned data shape: {data.shape}\")\n",
" return data\n",
"\n",
"\n",
"# Use the functions\n",
"data = load_data(DATASET_DIR)\n",
"data = clean_data(data)\n",
"\n",
"# print(data.head())\n",
"\n",
"# Print Headers\n",
"# print(data.columns)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:06.334698247Z",
"start_time": "2024-02-25T03:26:58.458307532Z"
}
},
"id": "dd9657f5ec6d7754",
"execution_count": 74
},
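{
"cell_type": "markdown",
"source": [
"As a quick sanity check on the cleaning step (an illustrative addition, not part of the original pipeline), the sketch below verifies that no missing values or duplicate rows remain and looks at the class balance of the `NLOS` label. It only assumes the `data` DataFrame and the `NLOS` column used in the cell above, with the usual convention that 1 marks a non-line-of-sight measurement."
],
"metadata": {
"collapsed": false
},
"id": "added-cleaning-sanity-check-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Sanity checks on the cleaned data (illustrative; assumes `data` from the cell above)\n",
"print(f\"Remaining missing values: {data.isna().sum().sum()}\")\n",
"print(f\"Remaining duplicate rows: {data.duplicated().sum()}\")\n",
"\n",
"# Class balance of the NLOS label (assumed convention: 1 = NLOS, 0 = LOS)\n",
"print(data['NLOS'].value_counts(normalize=True))"
],
"metadata": {
"collapsed": false
},
"id": "added-cleaning-sanity-check",
"execution_count": null
},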
{
"cell_type": "markdown",
"source": [
"The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n",
"\n",
"The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n",
"\n",
"The mathematical formulas used in this process are as follows:\n",
"\n",
"1. Calculate the mean (μ) of the data:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"Where:\n",
"- $n$ is the number of observations in the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\sum$ denotes the summation over all observations\n",
"\n",
"2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
"\n",
"$$\n",
"\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
"$$\n",
"Where:\n",
"- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n",
"- $\\sigma$ is the standard deviation of the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\mu$ is the mean of the data\n",
"\n",
"The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n",
"\n",
"**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n"
],
"metadata": {
"collapsed": false
},
"id": "2c13064e20601717"
},
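{
"cell_type": "markdown",
"source": [
"As a small illustration of the formulas above (a minimal sketch, not part of the original pipeline), the next cell standardizes a tiny assumed array by hand with NumPy and confirms that `StandardScaler.fit_transform` produces the same z-scores."
],
"metadata": {
"collapsed": false
},
"id": "added-standardization-demo-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Minimal sketch: manual (x - mu) / sigma vs. StandardScaler on a small assumed array\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"x = np.array([[1.0], [2.0], [3.0], [4.0]])  # assumed toy data, one feature\n",
"\n",
"mu = x.mean(axis=0)\n",
"sigma = x.std(axis=0)  # population standard deviation, matching StandardScaler's default\n",
"z_manual = (x - mu) / sigma\n",
"\n",
"z_scaler = StandardScaler().fit_transform(x)\n",
"\n",
"print(np.allclose(z_manual, z_scaler))  # expected: True"
],
"metadata": {
"collapsed": false
},
"id": "added-standardization-demo",
"execution_count": null
},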
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of principle components after PCA is 868\n"
]
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize the data\n",
"numerical_cols = data.select_dtypes(include=[np.number]).columns\n",
"scaler = StandardScaler()\n",
"data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n",
"\n",
"# Initialize PCA with the desired explained variance\n",
"pca = PCA(0.95)\n",
"\n",
"# Fit PCA to your data\n",
"pca.fit(data)\n",
"\n",
"# Get the number of components\n",
"num_components = pca.n_components_\n",
"\n",
"print(f\"The number of principle components after PCA is {num_components}\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:13.639843012Z",
"start_time": "2024-02-25T03:27:06.336830842Z"
}
},
"id": "7f9bec73a42f7bca",
"execution_count": 75
},
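{
"cell_type": "markdown",
"source": [
"As a follow-up check (an illustrative addition, not part of the original analysis), we can inspect `pca.explained_variance_ratio_` to confirm that the retained components together capture at least 95% of the variance and to see how quickly the variance accumulates."
],
"metadata": {
"collapsed": false
},
"id": "added-explained-variance-check-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Illustrative check of cumulative explained variance (assumes `pca` from the cell above)\n",
"cumulative = np.cumsum(pca.explained_variance_ratio_)\n",
"\n",
"print(f\"Variance retained by {pca.n_components_} components: {cumulative[-1]:.4f}\")\n",
"print(f\"Components needed for 90% variance: {np.searchsorted(cumulative, 0.90) + 1}\")"
],
"metadata": {
"collapsed": false
},
"id": "added-explained-variance-check",
"execution_count": null
},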
{
"cell_type": "markdown",
"source": [
"# Perform Dimensionality Reduction with PCA\n",
"\n",
"We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components."
],
"metadata": {
"collapsed": false
},
"id": "dc9f8c0e194dd07d"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original number of components: 1031\n",
"Number of components after PCA: 868\n",
"PCA has successfully reduced the number of components.\n"
]
}
],
"source": [
"# Project original data to PC with the highest eigenvalue\n",
"data_pca = pca.transform(data)\n",
"\n",
"# Create a dataframe with the principal components\n",
"data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
"\n",
"# Print the number of components in the original and PCA transformed data\n",
"print(f\"Original number of components: {data.shape[1]}\")\n",
"print(f\"Number of components after PCA: {num_components}\")\n",
"\n",
"# Compare the number of components in the original and PCA transformed data\n",
"if data.shape[1] > num_components:\n",
" print(\"PCA has successfully reduced the number of components.\")\n",
"elif data.shape[1] < num_components:\n",
" print(\"Unexpectedly, PCA has increased the number of components.\")\n",
"else:\n",
" print(\"The number of components remains unchanged after PCA.\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:14.422886263Z",
"start_time": "2024-02-25T03:27:13.660170622Z"
}
},
"id": "96c62c50f8734a01",
"execution_count": 76
}
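,
{
"cell_type": "markdown",
"source": [
"To get a feel for how much information the 95% variance threshold discards (a minimal sketch, not part of the original analysis), we can map the PCA-transformed data back to the standardized feature space with `inverse_transform` and measure the mean squared reconstruction error."
],
"metadata": {
"collapsed": false
},
"id": "added-reconstruction-error-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Illustrative sketch: reconstruction error of the PCA projection\n",
"# (assumes `pca`, `data` and `data_pca` from the cells above)\n",
"data_reconstructed = pca.inverse_transform(data_pca)\n",
"mse = np.mean((data.to_numpy() - data_reconstructed) ** 2)\n",
"\n",
"print(f\"Mean squared reconstruction error (standardized units): {mse:.4f}\")"
],
"metadata": {
"collapsed": false
},
"id": "added-reconstruction-error",
"execution_count": null
}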
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}