# CSC 3105 Project

In [73]:
import os

DATASET_DIR = './UWB-LOS-NLOS-Data-Set/dataset'

# Load the data into a pandas dataframe

The first step in any data analysis project is to load the data into a suitable data structure. In this case, we will use the `pandas` library to load the data into a dataframe.

We then clean the data by handling missing values, removing duplicates, converting data types, and performing outlier detection and removal. 

In [74]:
import pandas as pd
import numpy as np
from scipy import stats


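def load_data(dataset_dir):
    # Collect every file under dataset_dir and concatenate them into one dataframe
    file_paths = [os.path.join(dirpath, file) for dirpath, _, filenames in os.walk(dataset_dir) for file in filenames]
    # ignore_index=True gives the combined frame a unique 0..n-1 index
    data = pd.concat((pd.read_csv(file_path) for file_path in file_paths), ignore_index=True)
    print(f"Original data shape: {data.shape}")
    return data


def clean_data(data):
    # Handle missing values
    data = data.dropna()

    # Remove duplicates
    data = data.drop_duplicates()

    # Convert data types
    data['NLOS'] = data['NLOS'].astype(int)

    # Outlier detection and removal
    # NOTE: .any(axis=1) keeps a row if at least one numeric column has |z| < 3,
    # so this filter is very permissive (the cleaned shape printed below is unchanged).
    # Using .all(axis=1) instead would drop rows with an outlier in any column.
    z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
    data = data[(z_scores < 3).any(axis=1)]

    print(f"Cleaned data shape: {data.shape}")
    return data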


# Use the functions
data = load_data(DATASET_DIR)
data = clean_data(data)

# print(data.head())

# Print Headers
# print(data.columns)

Original data shape: (42000, 1031)
Cleaned data shape: (42000, 1031)
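
The cleaned shape matches the original shape, so the z-score filter removed no rows. The following toy sketch (hypothetical data, not the UWB dataset) illustrates why a row mask built with `.any(axis=1)` is far more permissive than one built with `.all(axis=1)`:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (hypothetical): 20 ordinary rows plus one row with an extreme value in column "a"
toy = pd.DataFrame({"a": np.linspace(-1, 1, 20), "b": np.linspace(0, 1, 20)})
toy.loc[len(toy)] = [50.0, 0.5]

# Column-wise absolute z-scores as a plain NumPy array
z_scores = np.abs(stats.zscore(toy.to_numpy(), axis=0))

kept_any = toy[(z_scores < 3).any(axis=1)]  # kept if at least one column is within 3 sigma
kept_all = toy[(z_scores < 3).all(axis=1)]  # kept only if every column is within 3 sigma

print(len(toy), len(kept_any), len(kept_all))  # 21 21 20
```

With `.any`, the extreme row survives because its `b` value is ordinary; with `.all`, it is dropped.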


The next cell standardizes the data, a common preprocessing step in machine learning workflows.

Standardization transforms each feature so that it has a mean of 0 and a standard deviation of 1. This puts all features on the same scale, which matters for scale-sensitive methods such as PCA: without it, features with large numeric ranges dominate the variance.

The mathematical formulas used in this process are as follows:

1. Calculate the mean (μ) of the data:

$$
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
Where:
- $n$ is the number of observations in the data
- $x_i$ is the value of the $i$-th observation
- $\sum$ denotes the summation over all observations

2. Standardize the data by subtracting the mean from each observation and dividing by the standard deviation:

$$
\text{Data}_i = \frac{x_i - \mu}{\sigma}
$$
Where:
- $\text{Data}_i$ is the standardized value of the $i$-th observation
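- $\sigma$ is the standard deviation of the data, $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$ (the population form, which `StandardScaler` uses)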
- $x_i$ is the value of the $i$-th observation
- $\mu$ is the mean of the data

The `StandardScaler` class from the `sklearn.preprocessing` module performs this standardization column by column. Its `fit_transform` method computes the mean and standard deviation of each column and then applies the transformation.
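
As a minimal sketch of the two formulas, the snippet below standardizes a small made-up matrix by hand and compares the result with `StandardScaler`. The array `X` is purely illustrative and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small hypothetical feature matrix: 4 observations, 2 features on different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

mu = X.mean(axis=0)    # per-column mean
sigma = X.std(axis=0)  # per-column population standard deviation (ddof=0)
X_manual = (X - mu) / sigma

X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))  # True
```

After scaling, both columns have mean 0 and standard deviation 1, even though their original ranges differ by a factor of 100.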

**Note:** The same cell then fits PCA. Passing 0.95 to `PCA` selects the smallest number of principal components whose cumulative explained variance reaches 95% of the variance in the standardized data. The transformed data therefore retains about 95% of the variance (often treated as a proxy for information) while potentially having fewer dimensions.
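
To make the 0.95 threshold concrete, here is a sketch on synthetic data (the `scales` values are arbitrary assumptions). It counts components by hand from the cumulative explained variance ratio and compares that count with the one scikit-learn chooses. `svd_solver="full"` is set explicitly so both fits use the same solver.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 10 independent features with decreasing spread
rng = np.random.default_rng(0)
scales = np.array([10, 5, 3, 2, 1, 1, 0.5, 0.5, 0.1, 0.1])
X = rng.normal(size=(500, 10)) * scales

# Full PCA, then count components by hand from the cumulative explained variance
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
k_manual = int(np.argmax(cumulative >= 0.95)) + 1

# Let scikit-learn choose the count from the 0.95 threshold
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print(k_manual, pca_95.n_components_)  # the two counts should agree
```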


In [75]:
from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

# Standardize the data
numerical_cols = data.select_dtypes(include=[np.number]).columns
scaler = StandardScaler()
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])

# Initialize PCA with the desired explained variance
pca = PCA(0.95)

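# Fit PCA to the standardized data.
# NOTE: `data` still contains the NLOS label column (cast to int in clean_data),
# so it is standardized and included in the PCA alongside the features.
pca.fit(data)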

# Get the number of components
num_components = pca.n_components_

print(f"The number of principal components after PCA is {num_components}")

The number of principal components after PCA is 868


# Perform Dimensionality Reduction with PCA

We can use the `transform` method of the fitted `PCA` object to project the standardized data onto the retained principal components. The result has one column per retained component.
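
As a sketch of what `transform` computes (with the default `whiten=False`), the projection is the mean-centered data multiplied by the transposed component matrix. The data below is synthetic, and `n_components=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (hypothetical): 200 observations, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5, 3, 1, 0.5, 0.1])

pca = PCA(n_components=2).fit(X)

projected = pca.transform(X)                  # scikit-learn's projection
manual = (X - pca.mean_) @ pca.components_.T  # center, then project by hand

print(projected.shape)                 # (200, 2)
print(np.allclose(projected, manual))  # True
```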

In [76]:
# Project the standardized data onto all retained principal components
data_pca = pca.transform(data)

# Create a dataframe with the principal components
data_pca_df = pd.DataFrame(data_pca, columns=[f"PC{i}" for i in range(1, num_components + 1)])

# Print the number of components in the original and PCA transformed data
print(f"Original number of components: {data.shape[1]}")
print(f"Number of components after PCA: {num_components}")

# Compare the number of components in the original and PCA transformed data
if data.shape[1] > num_components:
    print("PCA has successfully reduced the number of components.")
elif data.shape[1] < num_components:
    print("Unexpectedly, PCA has increased the number of components.")
else:
    print("The number of components remains unchanged after PCA.")

Original number of components: 1031
Number of components after PCA: 868
PCA has successfully reduced the number of components.
