diff --git a/Project.ipynb b/Project.ipynb index 89d48a9..681cd39 100644 --- a/Project.ipynb +++ b/Project.ipynb @@ -21,12 +21,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:26:58.464846949Z", - "start_time": "2024-02-25T03:26:58.415028614Z" + "end_time": "2024-02-25T05:32:51.223079500Z", + "start_time": "2024-02-25T05:32:51.218515100Z" } }, "id": "bcd6cbaa5df10ce8", - "execution_count": 73 + "execution_count": 20 }, { "cell_type": "markdown", @@ -50,6 +50,7 @@ "output_type": "stream", "text": [ "Original data shape: (42000, 1031)\n", + "Total number of missing values: 0\n", "Cleaned data shape: (42000, 1031)\n" ] } @@ -69,6 +70,10 @@ "\n", "\n", "def clean_data(data):\n", + " # Calculate total number of missing values\n", + " total_missing = data.isnull().sum().sum()\n", + " print(f\"Total number of missing values: {total_missing}\")\n", + " \n", " # Handle missing values\n", " data = data.dropna()\n", "\n", @@ -98,12 +103,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:06.334698247Z", - "start_time": "2024-02-25T03:26:58.458307532Z" + "end_time": "2024-02-25T05:32:56.753045Z", + "start_time": "2024-02-25T05:32:52.416237400Z" } }, "id": "dd9657f5ec6d7754", - "execution_count": 74 + "execution_count": 21 }, { "cell_type": "markdown", @@ -179,12 +184,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:13.639843012Z", - "start_time": "2024-02-25T03:27:06.336830842Z" + "end_time": "2024-02-25T05:33:01.393540600Z", + "start_time": "2024-02-25T05:32:57.831719Z" } }, "id": "7f9bec73a42f7bca", - "execution_count": 75 + "execution_count": 22 }, { "cell_type": "markdown", @@ -233,12 +238,197 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:14.422886263Z", - "start_time": "2024-02-25T03:27:13.660170622Z" + "end_time": "2024-02-25T05:47:02.511321300Z", + "start_time": "2024-02-25T05:47:01.989494800Z" } }, "id": "96c62c50f8734a01", - "execution_count": 76 + "execution_count": 29 + }, + { + "cell_type": "markdown", + "source": [ + "## Data Mining / Machine Learning\n", + "\n", + "### I. Supervised Learning\n", + "- **Decision**: Supervised learning is used due to the labeled dataset.\n", + "- **Algorithm**: Random Forest Classifier is preferred for its performance in classification tasks.\n", + "\n", + "### II. Training/Test Split Ratio\n", + "- **Decision**: 70:30 split is chosen for training/test dataset.\n", + "- **Reasoning**: This split ensures sufficient data for training and testing.\n", + "\n", + "### III. Performance Metrics\n", + "- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n", + "- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n", + "- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n", + "\n", + "The Random Forest Classifier is trained on the training set and evaluated on the test set using accuracy and classification report metrics.\n" + ], + "metadata": { + "collapsed": false + }, + "id": "47d5cb383ce1f7ba" + }, + { + "cell_type": "markdown", + "source": [ + "# Split the data into training and testing sets\n", + "\n", + "The next step is to split the data into training and testing sets. 
This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n", + "\n", + "We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio." + ], + "metadata": { + "collapsed": false + }, + "id": "576a6a92fc7fdbfd" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Split the data into training and test sets\n", + "X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2024-02-25T05:33:06.047014800Z", + "start_time": "2024-02-25T05:33:05.983534100Z" + } + }, + "id": "7db852fafd187d5a", + "execution_count": 24 + }, + { + "cell_type": "markdown", + "source": [ + "# Train a Random Forest Classifier\n", + "\n", + "The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n", + "\n", + "The random forest classifier is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", + "\n", + "We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data." + ], + "metadata": { + "collapsed": false + }, + "id": "5753cc6db18bac73" + }, + { + "cell_type": "code", + "outputs": [ + { + "data": { + "text/plain": "RandomForestClassifier(random_state=42)", + "text/html": "
RandomForestClassifier(random_state=42)
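The diff is cut off before the source of the training cell is shown, so here is a minimal sketch of the fit-and-evaluate step that the "Train a Random Forest Classifier" and "Performance Metrics" markdown above describes. It reuses `X_train`/`X_test`/`y_train`/`y_test` from the split cell; the variable names `rf_clf` and `y_pred` and the exact print formatting are assumptions, since only the fitted estimator's repr (`RandomForestClassifier(random_state=42)`) appears in the diff output.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the random forest on the PCA-transformed training data (70% split)
rf_clf = RandomForestClassifier(random_state=42)  # name assumed; only the repr is visible in the diff
rf_clf.fit(X_train, y_train)

# Evaluate on the held-out 30% test split with the metrics listed in the markdown
y_pred = rf_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification report:")
print(classification_report(y_test, y_pred))
```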