CSC3105_Project/Project.ipynb

{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CSC 3105 Project"
],
"metadata": {
"collapsed": false
},
"id": "cda961ffb493d00c"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import os\n",
"\n",
"# Importing the libraries\n",
"import pandas as pd\n",
"\n",
"DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:46.399745088Z",
"start_time": "2024-02-25T02:29:46.386566147Z"
}
},
"id": "bcd6cbaa5df10ce8",
"execution_count": 39
},
{
"cell_type": "markdown",
"source": [
"# Load the data into a pandas dataframe"
],
"metadata": {
"collapsed": false
},
"id": "bab890d7b05e347e"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" NLOS RANGE FP_IDX FP_AMP1 FP_AMP2 FP_AMP3 STDEV_NOISE CIR_PWR \\\n",
"0 1.0 6.18 749.0 4889.0 13876.0 10464.0 240.0 9048.0 \n",
"1 1.0 4.54 741.0 2474.0 2002.0 1593.0 68.0 6514.0 \n",
"2 1.0 4.39 744.0 1934.0 2615.0 4114.0 52.0 2880.0 \n",
"3 1.0 1.27 748.0 16031.0 17712.0 10420.0 64.0 12855.0 \n",
"4 0.0 1.16 743.0 20070.0 19886.0 15727.0 76.0 11607.0 \n",
"\n",
" MAX_NOISE RXPACC ... CIR1006 CIR1007 CIR1008 CIR1009 CIR1010 \\\n",
"0 3668.0 1024.0 ... 818.0 938.0 588.0 277.0 727.0 \n",
"1 1031.0 1024.0 ... 289.0 228.0 107.0 487.0 491.0 \n",
"2 796.0 1024.0 ... 123.0 281.0 483.0 97.0 272.0 \n",
"3 1529.0 323.0 ... 169.0 138.0 219.0 94.0 225.0 \n",
"4 2022.0 296.0 ... 87.0 43.0 358.0 308.0 132.0 \n",
"\n",
" CIR1011 CIR1012 CIR1013 CIR1014 CIR1015 \n",
"0 367.0 803.0 819.0 467.0 768.0 \n",
"1 404.0 334.0 210.0 102.0 0.0 \n",
"2 73.0 125.0 169.0 182.0 0.0 \n",
"3 155.0 172.0 278.0 318.0 0.0 \n",
"4 131.0 102.0 126.0 163.0 0.0 \n",
"\n",
"[5 rows x 1031 columns]\n",
"Index(['NLOS', 'RANGE', 'FP_IDX', 'FP_AMP1', 'FP_AMP2', 'FP_AMP3',\n",
" 'STDEV_NOISE', 'CIR_PWR', 'MAX_NOISE', 'RXPACC',\n",
" ...\n",
" 'CIR1006', 'CIR1007', 'CIR1008', 'CIR1009', 'CIR1010', 'CIR1011',\n",
" 'CIR1012', 'CIR1013', 'CIR1014', 'CIR1015'],\n",
" dtype='object', length=1031)\n",
"No missing values\n"
]
}
],
"source": [
"def load_data(dataset_dir):\n",
" # Get all file paths in the directory\n",
" file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n",
"\n",
" # Load and concatenate all dataframes\n",
" data = pd.concat([pd.read_csv(file_path) for file_path in file_paths])\n",
"\n",
" return data\n",
"\n",
"\n",
"data = load_data(DATASET_DIR)\n",
"\n",
"print(data.head())\n",
"\n",
"# Print Headers\n",
"print(data.columns)\n",
"\n",
"# Check that there are no missing values\n",
"assert data.isnull().sum().sum() == 0\n",
"print(\"No missing values\")\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:51.084821398Z",
"start_time": "2024-02-25T02:29:46.405675293Z"
}
},
"id": "dd9657f5ec6d7754",
"execution_count": 40
},
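{
"cell_type": "markdown",
"source": [
"As a quick sanity check, the sketch below (illustrative, not part of the original pipeline; the names `file_paths` and `total_rows` are ours) confirms that the concatenated dataframe contains every row from the individual CSV files and shows the class balance of the `NLOS` label."
],
"metadata": {
"collapsed": false
},
"id": "3a9c1d2e4f5b6a70"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Illustrative sanity check: the concatenated frame should contain\n",
"# exactly as many rows as the individual CSV files combined.\n",
"file_paths = [os.path.join(dirpath, file)\n",
"              for dirpath, _, filenames in os.walk(DATASET_DIR)\n",
"              for file in filenames if file.endswith('.csv')]\n",
"\n",
"total_rows = sum(len(pd.read_csv(path)) for path in file_paths)\n",
"assert len(data) == total_rows, \"Row count mismatch after concatenation\"\n",
"\n",
"# NLOS is the class label (presumably 1 = NLOS, 0 = LOS)\n",
"print(data['NLOS'].value_counts())"
],
"metadata": {
"collapsed": false
},
"id": "5b4c3d2e1f0a9b88",
"execution_count": null
},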
{
"cell_type": "markdown",
"source": [
"The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n",
"\n",
"The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n",
"\n",
"The mathematical formulas used in this process are as follows:\n",
"\n",
"1. Calculate the mean (μ) of the data:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"Where:\n",
"- $n$ is the number of observations in the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\sum$ denotes the summation over all observations\n",
"\n",
"2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
"\n",
"$$\n",
"\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
"$$\n",
"Where:\n",
"- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n",
"- $\\sigma$ is the standard deviation of the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\mu$ is the mean of the data\n",
"\n",
"The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n",
"\n",
"**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n"
],
"metadata": {
"collapsed": false
},
"id": "2c13064e20601717"
},
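{
"cell_type": "markdown",
"source": [
"To make the formulas above concrete, the following sketch (illustrative only; `X`, `mu`, and `sigma` are our own names) computes $\\mu$ and $\\sigma$ with NumPy and checks that manual standardization matches `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`) and leaves zero-variance columns unscaled."
],
"metadata": {
"collapsed": false
},
"id": "7e6f5a4b3c2d1e09"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Illustrative check: manual z-score standardization vs. StandardScaler\n",
"X = data.to_numpy(dtype=float)\n",
"\n",
"mu = X.mean(axis=0)            # per-feature mean\n",
"sigma = X.std(axis=0, ddof=0)  # population standard deviation, as used by StandardScaler\n",
"sigma[sigma == 0] = 1.0        # constant columns: StandardScaler also leaves these unscaled\n",
"\n",
"manual = (X - mu) / sigma\n",
"scaled = StandardScaler().fit_transform(X)\n",
"\n",
"assert np.allclose(manual, scaled)\n",
"print(\"Manual standardization matches StandardScaler\")"
],
"metadata": {
"collapsed": false
},
"id": "9d8c7b6a5f4e3d21",
"execution_count": null
},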
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of principle components after PCA is 868\n"
]
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize the data\n",
"data_std = StandardScaler().fit_transform(data)\n",
"\n",
"# Initialize PCA with the desired explained variance\n",
"pca = PCA(0.95)\n",
"\n",
"# Fit PCA to your data\n",
"pca.fit(data_std)\n",
"\n",
"# Get the number of components\n",
"num_components = pca.n_components_\n",
"\n",
"print(f\"The number of principle components after PCA is {num_components}\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:58.267018142Z",
"start_time": "2024-02-25T02:29:51.084440279Z"
}
},
"id": "7f9bec73a42f7bca",
"execution_count": 41
},
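{
"cell_type": "markdown",
"source": [
"To see how the 0.95 threshold is satisfied, we can inspect the fitted model's `explained_variance_ratio_` (an illustrative check; `cumulative_variance` is our own name): the variance ratios of the retained components should sum to at least 0.95."
],
"metadata": {
"collapsed": false
},
"id": "1f2e3d4c5b6a7980"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Illustrative check: cumulative variance captured by the retained components\n",
"cumulative_variance = np.cumsum(pca.explained_variance_ratio_)\n",
"\n",
"print(f\"Variance retained by {num_components} components: {cumulative_variance[-1]:.4f}\")\n",
"assert cumulative_variance[-1] >= 0.95"
],
"metadata": {
"collapsed": false
},
"id": "2b3c4d5e6f708192",
"execution_count": null
},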
{
"cell_type": "markdown",
"source": [
"# Perform Dimensionality Reduction with PCA\n",
"\n",
"We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components."
],
"metadata": {
"collapsed": false
},
"id": "dc9f8c0e194dd07d"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original number of components: 1031\n",
"Number of components after PCA: 868\n",
"PCA has successfully reduced the number of components.\n"
]
}
],
"source": [
"# Project original data to PC with the highest eigenvalue\n",
"data_pca = pca.transform(data_std)\n",
"\n",
"# Create a dataframe with the principal components\n",
"data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
"\n",
"# Print the number of components in the original and PCA transformed data\n",
"print(f\"Original number of components: {data.shape[1]}\")\n",
"print(f\"Number of components after PCA: {num_components}\")\n",
"\n",
"# Compare the number of components in the original and PCA transformed data\n",
"if data.shape[1] > num_components:\n",
" print(\"PCA has successfully reduced the number of components.\")\n",
"elif data.shape[1] < num_components:\n",
" print(\"Unexpectedly, PCA has increased the number of components.\")\n",
"else:\n",
" print(\"The number of components remains unchanged after PCA.\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:59.029369440Z",
"start_time": "2024-02-25T02:29:58.266576678Z"
}
},
"id": "96c62c50f8734a01",
"execution_count": 42
}
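,
{
"cell_type": "markdown",
"source": [
"As a rough illustration of how much information the reduced representation keeps, we can map the PCA data back to the original feature space with `inverse_transform` and measure the reconstruction error (a sketch, not part of the original analysis; `data_reconstructed` and `mse` are our own names)."
],
"metadata": {
"collapsed": false
},
"id": "4c5d6e7f80912a3b"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Illustrative check: reconstruct the standardized data from the principal\n",
"# components and measure the mean squared error of the round trip.\n",
"data_reconstructed = pca.inverse_transform(data_pca)\n",
"\n",
"mse = np.mean((data_std - data_reconstructed) ** 2)\n",
"print(f\"Mean squared reconstruction error: {mse:.4f}\")"
],
"metadata": {
"collapsed": false
},
"id": "6e7f80912a3b4c5d",
"execution_count": null
}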
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}