{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CSC 3105 Project"
],
"metadata": {
"collapsed": false
},
"id": "cda961ffb493d00c"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import os\n",
"\n",
"# Importing the libraries\n",
"import pandas as pd\n",
"\n",
"DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:46.399745088Z",
"start_time": "2024-02-25T02:29:46.386566147Z"
}
},
"id": "bcd6cbaa5df10ce8",
"execution_count": 39
},
{
"cell_type": "markdown",
"source": [
"# Load the data into a pandas dataframe"
],
"metadata": {
"collapsed": false
},
"id": "bab890d7b05e347e"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" NLOS RANGE FP_IDX FP_AMP1 FP_AMP2 FP_AMP3 STDEV_NOISE CIR_PWR \\\n",
"0 1.0 6.18 749.0 4889.0 13876.0 10464.0 240.0 9048.0 \n",
"1 1.0 4.54 741.0 2474.0 2002.0 1593.0 68.0 6514.0 \n",
"2 1.0 4.39 744.0 1934.0 2615.0 4114.0 52.0 2880.0 \n",
"3 1.0 1.27 748.0 16031.0 17712.0 10420.0 64.0 12855.0 \n",
"4 0.0 1.16 743.0 20070.0 19886.0 15727.0 76.0 11607.0 \n",
"\n",
" MAX_NOISE RXPACC ... CIR1006 CIR1007 CIR1008 CIR1009 CIR1010 \\\n",
"0 3668.0 1024.0 ... 818.0 938.0 588.0 277.0 727.0 \n",
"1 1031.0 1024.0 ... 289.0 228.0 107.0 487.0 491.0 \n",
"2 796.0 1024.0 ... 123.0 281.0 483.0 97.0 272.0 \n",
"3 1529.0 323.0 ... 169.0 138.0 219.0 94.0 225.0 \n",
"4 2022.0 296.0 ... 87.0 43.0 358.0 308.0 132.0 \n",
"\n",
" CIR1011 CIR1012 CIR1013 CIR1014 CIR1015 \n",
"0 367.0 803.0 819.0 467.0 768.0 \n",
"1 404.0 334.0 210.0 102.0 0.0 \n",
"2 73.0 125.0 169.0 182.0 0.0 \n",
"3 155.0 172.0 278.0 318.0 0.0 \n",
"4 131.0 102.0 126.0 163.0 0.0 \n",
"\n",
"[5 rows x 1031 columns]\n",
"Index(['NLOS', 'RANGE', 'FP_IDX', 'FP_AMP1', 'FP_AMP2', 'FP_AMP3',\n",
" 'STDEV_NOISE', 'CIR_PWR', 'MAX_NOISE', 'RXPACC',\n",
" ...\n",
" 'CIR1006', 'CIR1007', 'CIR1008', 'CIR1009', 'CIR1010', 'CIR1011',\n",
" 'CIR1012', 'CIR1013', 'CIR1014', 'CIR1015'],\n",
" dtype='object', length=1031)\n",
"No missing values\n"
]
}
],
"source": [
"def load_data(dataset_dir):\n",
"    # Collect the path of every CSV file under the dataset directory\n",
"    file_paths = [\n",
"        os.path.join(dirpath, file)\n",
"        for dirpath, _, filenames in os.walk(dataset_dir)\n",
"        for file in filenames\n",
"        if file.endswith('.csv')\n",
"    ]\n",
"\n",
"    # Load each file and concatenate into one dataframe with a fresh 0..n-1 index\n",
"    data = pd.concat([pd.read_csv(file_path) for file_path in file_paths], ignore_index=True)\n",
"\n",
"    return data\n",
"\n",
"\n",
"data = load_data(DATASET_DIR)\n",
"\n",
"print(data.head())\n",
"\n",
"# Print Headers\n",
"print(data.columns)\n",
"\n",
"# Check that there are no missing values\n",
"assert data.isnull().sum().sum() == 0\n",
"print(\"No missing values\")\n"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:51.084821398Z",
"start_time": "2024-02-25T02:29:46.405675293Z"
}
},
"id": "dd9657f5ec6d7754",
"execution_count": 40
},
{
"cell_type": "markdown",
"source": [
"The next cell standardizes the data, a common preprocessing step in machine learning workflows, and then fits PCA to the standardized data.\n",
"\n",
"Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1. This puts all features on the same scale, which matters for PCA: PCA looks for the directions of largest variance, so a feature with a large numeric range would otherwise dominate the principal components.\n",
"\n",
"The formulas below are applied to each feature (column) separately:\n",
"\n",
"1. Calculate the mean ($\\mu$) of the feature:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"Where:\n",
"- $n$ is the number of observations in the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\sum$ denotes the summation over all observations\n",
"\n",
"2. Calculate the standard deviation ($\\sigma$) of the feature:\n",
"\n",
"$$\n",
"\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} (x_i - \\mu)^2}\n",
"$$\n",
"\n",
"3. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
"\n",
"$$\n",
"\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
"$$\n",
"Where $\\text{Data}_i$ is the standardized value of the $i$-th observation, and $x_i$, $\\mu$, and $\\sigma$ are as defined above.\n",
"\n",
"The `StandardScaler` class from the `sklearn.preprocessing` module performs this standardization. Its `fit_transform` method calculates the mean and standard deviation of each feature and then applies the transformation in one step.\n",
"\n",
"**Note:** Passing 0.95 to `PCA` asks for the smallest number of principal components that together explain at least 95% of the variance in the standardized data. The transformed data therefore keeps about 95% of the variance while potentially having fewer dimensions.\n"
],
"metadata": {
"collapsed": false
},
"id": "2c13064e20601717"
},
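{
"cell_type": "markdown",
"source": [
"As a minimal sketch of the formulas above, the snippet below standardizes a single toy column in plain Python. The `standardize` helper and the `sample` values are hypothetical and are not part of the project code or data. It divides by $n$ (the population standard deviation), which is also what `StandardScaler` does.\n",
"\n",
"```python\n",
"from statistics import fmean, pstdev\n",
"\n",
"\n",
"def standardize(values):\n",
"    # Mean and population standard deviation (divide by n, not n - 1)\n",
"    mu = fmean(values)\n",
"    sigma = pstdev(values, mu)\n",
"    if sigma == 0:\n",
"        # A constant column has no spread; return zeros instead of dividing by 0\n",
"        return [0.0 for _ in values]\n",
"    return [(x - mu) / sigma for x in values]\n",
"\n",
"\n",
"# Hypothetical toy column, not taken from the UWB data set\n",
"sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]\n",
"z_scores = standardize(sample)\n",
"print(z_scores)\n",
"```\n",
"\n",
"For this sample the mean is 5 and the standard deviation is 2, so the value 9 maps to 2.0 and the value 2 maps to -1.5."
],
"metadata": {
"collapsed": false
},
"id": "sketch-standardize-by-hand"
},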
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of principal components after PCA is 868\n"
]
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize the data\n",
"data_std = StandardScaler().fit_transform(data)\n",
"\n",
"# Initialize PCA with the desired explained variance\n",
"pca = PCA(n_components=0.95)\n",
"\n",
"# Fit PCA to the standardized data\n",
"pca.fit(data_std)\n",
"\n",
"# Get the number of components\n",
"num_components = pca.n_components_\n",
"\n",
"print(f\"The number of principal components after PCA is {num_components}\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:58.267018142Z",
"start_time": "2024-02-25T02:29:51.084440279Z"
}
},
"id": "7f9bec73a42f7bca",
"execution_count": 41
},
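{
"cell_type": "markdown",
"source": [
"The 0.95 passed to `PCA` in the cell above is a variance target rather than a component count. As a rough sketch of the idea (not scikit-learn's actual implementation), the hypothetical helper `components_for_variance` below picks the smallest number of components whose cumulative share of the variance reaches the target. The eigenvalues are made up for illustration.\n",
"\n",
"```python\n",
"from itertools import accumulate\n",
"\n",
"\n",
"def components_for_variance(eigenvalues, target=0.95):\n",
"    # Largest eigenvalues first: each one is the variance along one principal axis\n",
"    ordered = sorted(eigenvalues, reverse=True)\n",
"    total = sum(ordered)\n",
"    for k, cumulative in enumerate(accumulate(ordered), start=1):\n",
"        # Stop at the first k whose top-k components explain enough variance\n",
"        if cumulative / total >= target:\n",
"            return k\n",
"    return len(ordered)\n",
"\n",
"\n",
"# Hypothetical eigenvalues, not taken from the UWB data set\n",
"eigenvalues = [4.0, 3.0, 2.0, 0.75, 0.25]\n",
"k = components_for_variance(eigenvalues)\n",
"print(f'{k} of {len(eigenvalues)} components reach the 95% target')\n",
"```\n",
"\n",
"With these values the first three components explain 90% of the variance and the first four explain 97.5%, so four components are kept."
],
"metadata": {
"collapsed": false
},
"id": "sketch-variance-target"
},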
{
"cell_type": "markdown",
"source": [
"# Perform Dimensionality Reduction with PCA\n",
"\n",
"We can use the `transform` method of the fitted `PCA` object to project the standardized data onto the retained principal components. The result has one column per retained component (868 here) instead of one per original column (1031)."
],
"metadata": {
"collapsed": false
},
"id": "dc9f8c0e194dd07d"
},
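{
"cell_type": "markdown",
"source": [
"As a sketch of what `transform` computes, the snippet below centers a small random matrix, finds its principal axes with NumPy's SVD, and projects the data onto the first `k` of them. The toy data, the variable names, and the choice `k = 2` are assumptions for illustration. scikit-learn's `PCA` additionally handles details such as sign conventions and solver choice.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(6, 3))  # hypothetical toy data: 6 samples, 3 features\n",
"\n",
"# PCA works on centered data\n",
"X_centered = X - X.mean(axis=0)\n",
"\n",
"# Rows of Vt are the principal axes, ordered by decreasing singular value\n",
"_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)\n",
"\n",
"k = 2\n",
"components = Vt[:k]  # shape (k, n_features)\n",
"\n",
"# Projection: one row per sample, one column per retained component\n",
"X_projected = X_centered @ components.T\n",
"print(X_projected.shape)\n",
"```\n",
"\n",
"The projected columns are uncorrelated with each other, and the first column carries the most variance."
],
"metadata": {
"collapsed": false
},
"id": "sketch-projection-with-svd"
},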
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original number of components: 1031\n",
"Number of components after PCA: 868\n",
"PCA has successfully reduced the number of components.\n"
]
}
],
"source": [
"# Project the standardized data onto the retained principal components\n",
"data_pca = pca.transform(data_std)\n",
"\n",
"# Create a dataframe with the principal components\n",
"data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
"\n",
"# Print the number of components in the original and PCA transformed data\n",
"print(f\"Original number of components: {data.shape[1]}\")\n",
"print(f\"Number of components after PCA: {num_components}\")\n",
"\n",
"# Compare the number of components in the original and PCA transformed data\n",
"if data.shape[1] > num_components:\n",
"    print(\"PCA has successfully reduced the number of components.\")\n",
"elif data.shape[1] < num_components:\n",
"    print(\"Unexpectedly, PCA has increased the number of components.\")\n",
"else:\n",
"    print(\"The number of components remains unchanged after PCA.\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T02:29:59.029369440Z",
"start_time": "2024-02-25T02:29:58.266576678Z"
}
},
"id": "96c62c50f8734a01",
"execution_count": 42
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}