CSC3105_Project/Project.ipynb

{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CSC 3105 Project"
],
"metadata": {
"collapsed": false
},
"id": "cda961ffb493d00c"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"import os\n",
"\n",
"DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:26:58.464846949Z",
"start_time": "2024-02-25T03:26:58.415028614Z"
}
},
"id": "bcd6cbaa5df10ce8",
"execution_count": 73
},
{
"cell_type": "markdown",
"source": [
"# Load the data into a pandas dataframe\n",
"\n",
"The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.\n",
"\n",
"We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. "
],
"metadata": {
"collapsed": false
},
"id": "bab890d7b05e347e"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original data shape: (42000, 1031)\n",
"Cleaned data shape: (42000, 1031)\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"\n",
"def load_data(dataset_dir):\n",
" # Load the data\n",
" file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]\n",
" data = pd.concat((pd.read_csv(file_path) for file_path in file_paths))\n",
" print(f\"Original data shape: {data.shape}\")\n",
" return data\n",
"\n",
"\n",
"def clean_data(data):\n",
" # Handle missing values\n",
" data = data.dropna()\n",
"\n",
" # Remove duplicates\n",
" data = data.drop_duplicates()\n",
"\n",
" # Convert data types\n",
" data['NLOS'] = data['NLOS'].astype(int)\n",
"\n",
" # Outlier detection and removal\n",
" z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))\n",
" data = data[(z_scores < 3).any(axis=1)]\n",
"\n",
" print(f\"Cleaned data shape: {data.shape}\")\n",
" return data\n",
"\n",
"\n",
"# Use the functions\n",
"data = load_data(DATASET_DIR)\n",
"data = clean_data(data)\n",
"\n",
"# print(data.head())\n",
"\n",
"# Print Headers\n",
"# print(data.columns)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:06.334698247Z",
"start_time": "2024-02-25T03:26:58.458307532Z"
}
},
"id": "dd9657f5ec6d7754",
"execution_count": 74
},
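{
"cell_type": "markdown",
"source": [
"As a quick sanity check on the cleaning step (an illustrative addition, not part of the original pipeline), the sketch below verifies that no missing values or duplicate rows remain and looks at the class balance of the `NLOS` label. It only assumes the `data` DataFrame and the `NLOS` column used in the cell above, with the usual convention that 1 marks a non-line-of-sight measurement."
],
"metadata": {
"collapsed": false
},
"id": "added-cleaning-sanity-check-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Sanity checks on the cleaned data (illustrative; assumes `data` from the cell above)\n",
"print(f\"Remaining missing values: {data.isna().sum().sum()}\")\n",
"print(f\"Remaining duplicate rows: {data.duplicated().sum()}\")\n",
"\n",
"# Class balance of the NLOS label (assumed convention: 1 = NLOS, 0 = LOS)\n",
"print(data['NLOS'].value_counts(normalize=True))"
],
"metadata": {
"collapsed": false
},
"id": "added-cleaning-sanity-check",
"execution_count": null
},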
{
"cell_type": "markdown",
"source": [
"The selected code is performing data standardization, which is a common preprocessing step in many machine learning workflows. \n",
"\n",
"The purpose of standardization is to transform the data such that it has a mean of 0 and a standard deviation of 1. This is done to ensure that all features have the same scale, which is a requirement for many machine learning algorithms.\n",
"\n",
"The mathematical formulas used in this process are as follows:\n",
"\n",
"1. Calculate the mean (μ) of the data:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"Where:\n",
"- $n$ is the number of observations in the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\sum$ denotes the summation over all observations\n",
"\n",
"2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:\n",
"\n",
"$$\n",
"\\text{Data}_i = \\frac{x_i - \\mu}{\\sigma}\n",
"$$\n",
"Where:\n",
"- $\\text{Data}_i$ is the standardized value of the $i$-th observation\n",
"- $\\sigma$ is the standard deviation of the data\n",
"- $x_i$ is the value of the $i$-th observation\n",
"- $\\mu$ is the mean of the data\n",
"\n",
"The `StandardScaler` class from the `sklearn.preprocessing` module is used to perform this standardization. The `fit_transform` method is used to calculate the mean and standard deviation of the data and then perform the standardization.\n",
"\n",
"**Note:** By setting the explained variance to 0.95, we are saying that we want to choose the smallest number of principal components such that 95% of the variance in the original data is retained. This means that the transformed data will retain 95% of the information of the original data, while potentially having fewer dimensions.\n"
],
"metadata": {
"collapsed": false
},
"id": "2c13064e20601717"
},
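{
"cell_type": "markdown",
"source": [
"As a small illustration of the formulas above (a minimal sketch, not part of the original pipeline), the next cell standardizes a tiny assumed array by hand with NumPy and confirms that `StandardScaler.fit_transform` produces the same z-scores."
],
"metadata": {
"collapsed": false
},
"id": "added-standardization-demo-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Minimal sketch: manual (x - mu) / sigma vs. StandardScaler on a small assumed array\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"x = np.array([[1.0], [2.0], [3.0], [4.0]])  # assumed toy data, one feature\n",
"\n",
"mu = x.mean(axis=0)\n",
"sigma = x.std(axis=0)  # population standard deviation, matching StandardScaler's default\n",
"z_manual = (x - mu) / sigma\n",
"\n",
"z_scaler = StandardScaler().fit_transform(x)\n",
"\n",
"print(np.allclose(z_manual, z_scaler))  # expected: True"
],
"metadata": {
"collapsed": false
},
"id": "added-standardization-demo",
"execution_count": null
},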
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of principle components after PCA is 868\n"
]
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Standardize the data\n",
"numerical_cols = data.select_dtypes(include=[np.number]).columns\n",
"scaler = StandardScaler()\n",
"data[numerical_cols] = scaler.fit_transform(data[numerical_cols])\n",
"\n",
"# Initialize PCA with the desired explained variance\n",
"pca = PCA(0.95)\n",
"\n",
"# Fit PCA to your data\n",
"pca.fit(data)\n",
"\n",
"# Get the number of components\n",
"num_components = pca.n_components_\n",
"\n",
"print(f\"The number of principle components after PCA is {num_components}\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:13.639843012Z",
"start_time": "2024-02-25T03:27:06.336830842Z"
}
},
"id": "7f9bec73a42f7bca",
"execution_count": 75
},
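{
"cell_type": "markdown",
"source": [
"As a follow-up check (an illustrative addition, not part of the original analysis), we can inspect `pca.explained_variance_ratio_` to confirm that the retained components together capture at least 95% of the variance and to see how quickly the variance accumulates."
],
"metadata": {
"collapsed": false
},
"id": "added-explained-variance-check-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Illustrative check of cumulative explained variance (assumes `pca` from the cell above)\n",
"cumulative = np.cumsum(pca.explained_variance_ratio_)\n",
"\n",
"print(f\"Variance retained by {pca.n_components_} components: {cumulative[-1]:.4f}\")\n",
"print(f\"Components needed for 90% variance: {np.searchsorted(cumulative, 0.90) + 1}\")"
],
"metadata": {
"collapsed": false
},
"id": "added-explained-variance-check",
"execution_count": null
},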
{
"cell_type": "markdown",
"source": [
"# Perform Dimensionality Reduction with PCA\n",
"\n",
"We can use the `transform` method of the `PCA` object to project the original data onto the principal components. This will give us the transformed data with the desired number of components."
],
"metadata": {
"collapsed": false
},
"id": "dc9f8c0e194dd07d"
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original number of components: 1031\n",
"Number of components after PCA: 868\n",
"PCA has successfully reduced the number of components.\n"
]
}
],
"source": [
"# Project original data to PC with the highest eigenvalue\n",
"data_pca = pca.transform(data)\n",
"\n",
"# Create a dataframe with the principal components\n",
"data_pca_df = pd.DataFrame(data_pca, columns=[f\"PC{i}\" for i in range(1, num_components + 1)])\n",
"\n",
"# Print the number of components in the original and PCA transformed data\n",
"print(f\"Original number of components: {data.shape[1]}\")\n",
"print(f\"Number of components after PCA: {num_components}\")\n",
"\n",
"# Compare the number of components in the original and PCA transformed data\n",
"if data.shape[1] > num_components:\n",
" print(\"PCA has successfully reduced the number of components.\")\n",
"elif data.shape[1] < num_components:\n",
" print(\"Unexpectedly, PCA has increased the number of components.\")\n",
"else:\n",
" print(\"The number of components remains unchanged after PCA.\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-25T03:27:14.422886263Z",
"start_time": "2024-02-25T03:27:13.660170622Z"
}
},
"id": "96c62c50f8734a01",
"execution_count": 76
}
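,
{
"cell_type": "markdown",
"source": [
"To get a feel for how much information the 95% variance threshold discards (a minimal sketch, not part of the original analysis), we can map the PCA-transformed data back to the standardized feature space with `inverse_transform` and measure the mean squared reconstruction error."
],
"metadata": {
"collapsed": false
},
"id": "added-reconstruction-error-md"
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# Illustrative sketch: reconstruction error of the PCA projection\n",
"# (assumes `pca`, `data` and `data_pca` from the cells above)\n",
"data_reconstructed = pca.inverse_transform(data_pca)\n",
"mse = np.mean((data.to_numpy() - data_reconstructed) ** 2)\n",
"\n",
"print(f\"Mean squared reconstruction error (standardized units): {mse:.4f}\")"
],
"metadata": {
"collapsed": false
},
"id": "added-reconstruction-error",
"execution_count": null
}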
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}