From 1154e2641e51ff0d32652e1260f6a3bca01c318d Mon Sep 17 00:00:00 2001 From: Benjamin Loh Date: Sun, 25 Feb 2024 13:48:30 +0800 Subject: [PATCH] ml done using random forest classifier ml done using random forest classifier --- Project.ipynb | 214 +++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 202 insertions(+), 12 deletions(-) diff --git a/Project.ipynb b/Project.ipynb index 89d48a9..681cd39 100644 --- a/Project.ipynb +++ b/Project.ipynb @@ -21,12 +21,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:26:58.464846949Z", - "start_time": "2024-02-25T03:26:58.415028614Z" + "end_time": "2024-02-25T05:32:51.223079500Z", + "start_time": "2024-02-25T05:32:51.218515100Z" } }, "id": "bcd6cbaa5df10ce8", - "execution_count": 73 + "execution_count": 20 }, { "cell_type": "markdown", @@ -50,6 +50,7 @@ "output_type": "stream", "text": [ "Original data shape: (42000, 1031)\n", + "Total number of missing values: 0\n", "Cleaned data shape: (42000, 1031)\n" ] } @@ -69,6 +70,10 @@ "\n", "\n", "def clean_data(data):\n", + " # Calculate total number of missing values\n", + " total_missing = data.isnull().sum().sum()\n", + " print(f\"Total number of missing values: {total_missing}\")\n", + " \n", " # Handle missing values\n", " data = data.dropna()\n", "\n", @@ -98,12 +103,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:06.334698247Z", - "start_time": "2024-02-25T03:26:58.458307532Z" + "end_time": "2024-02-25T05:32:56.753045Z", + "start_time": "2024-02-25T05:32:52.416237400Z" } }, "id": "dd9657f5ec6d7754", - "execution_count": 74 + "execution_count": 21 }, { "cell_type": "markdown", @@ -179,12 +184,12 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:13.639843012Z", - "start_time": "2024-02-25T03:27:06.336830842Z" + "end_time": "2024-02-25T05:33:01.393540600Z", + "start_time": "2024-02-25T05:32:57.831719Z" } }, "id": "7f9bec73a42f7bca", - "execution_count": 75 + "execution_count": 22 }, { "cell_type": "markdown", @@ -233,12 +238,197 @@ "metadata": { "collapsed": false, "ExecuteTime": { - "end_time": "2024-02-25T03:27:14.422886263Z", - "start_time": "2024-02-25T03:27:13.660170622Z" + "end_time": "2024-02-25T05:47:02.511321300Z", + "start_time": "2024-02-25T05:47:01.989494800Z" } }, "id": "96c62c50f8734a01", - "execution_count": 76 + "execution_count": 29 + }, + { + "cell_type": "markdown", + "source": [ + "## Data Mining / Machine Learning\n", + "\n", + "### I. Supervised Learning\n", + "- **Decision**: Supervised learning is used due to the labeled dataset.\n", + "- **Algorithm**: Random Forest Classifier is preferred for its performance in classification tasks.\n", + "\n", + "### II. Training/Test Split Ratio\n", + "- **Decision**: 70:30 split is chosen for training/test dataset.\n", + "- **Reasoning**: This split ensures sufficient data for training and testing.\n", + "\n", + "### III. 
Performance Metrics\n", + "- **Classification Accuracy**: Measures the proportion of correctly classified instances.\n", + "- **Confusion Matrix**: Provides a summary of predicted and actual classes.\n", + "- **Classification Report**: Provides detailed metrics such as precision, recall, F1-score, and support for each class.\n", + "\n", + "The Random Forest Classifier is trained on the training set and evaluated on the test set using accuracy and classification report metrics.\n" + ], + "metadata": { + "collapsed": false + }, + "id": "47d5cb383ce1f7ba" + }, + { + "cell_type": "markdown", + "source": [ + "# Split the data into training and testing sets\n", + "\n", + "The next step is to split the data into training and testing sets. This is a common practice in machine learning, where the training set is used to train the model, and the testing set is used to evaluate its performance.\n", + "\n", + "We will use the `train_test_split` function from the `sklearn.model_selection` module to split the data into training and testing sets. We will use 70% of the data for training and 30% for testing, which is a common split ratio." + ], + "metadata": { + "collapsed": false + }, + "id": "576a6a92fc7fdbfd" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Split the data into training and test sets\n", + "X_train, X_test, y_train, y_test = train_test_split(data_pca_df, data['NLOS'], test_size=0.3, random_state=42)" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2024-02-25T05:33:06.047014800Z", + "start_time": "2024-02-25T05:33:05.983534100Z" + } + }, + "id": "7db852fafd187d5a", + "execution_count": 24 + }, + { + "cell_type": "markdown", + "source": [ + "# Train a Random Forest Classifier\n", + "\n", + "The next step is to train a machine learning model on the training data. We will use the `RandomForestClassifier` class from the `sklearn.ensemble` module to train a random forest classifier.\n", + "\n", + "The random forest classifier is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.\n", + "\n", + "We will use the `fit` method of the `RandomForestClassifier` object to train the model on the training data." + ], + "metadata": { + "collapsed": false + }, + "id": "5753cc6db18bac73" + }, + { + "cell_type": "code", + "outputs": [ + { + "data": { + "text/plain": "RandomForestClassifier(random_state=42)", + "text/html": "
RandomForestClassifier(random_state=42)
" + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# Initialize the classifier\n", + "classifier = RandomForestClassifier(n_estimators=100, random_state=42)\n", + "\n", + "# Train the classifier\n", + "classifier.fit(X_train, y_train)" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2024-02-25T05:34:51.931355900Z", + "start_time": "2024-02-25T05:33:08.961973100Z" + } + }, + "id": "b3617711d95450fb", + "execution_count": 25 + }, + { + "cell_type": "markdown", + "source": [ + "# Evaluate the Model\n", + "\n", + "To evaluate the performance of the trained model on the testing data, we will use the `predict` method of the `RandomForestClassifier` object to make predictions on the testing data. We will then use the `accuracy_score` and `classification_report` functions from the `sklearn.metrics` module to calculate the accuracy and generate a classification report.\n", + "\n", + "- **Accuracy:** The accuracy score function calculates the proportion of correctly classified instances.\n", + "\n", + "- **Precision:** The ratio of correctly predicted positive observations to the total predicted positive observations. It is calculated as:\n", + "\n", + " $$\n", + " \\text{Precision} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Positives}}\n", + " $$\n", + "\n", + "- **Recall:** The ratio of correctly predicted positive observations to all observations in the actual class. It is calculated as:\n", + "\n", + " $$\n", + " \\text{Recall} = \\frac{\\text{True Positives}}{\\text{True Positives} + \\text{False Negatives}}\n", + " $$\n", + "\n", + "- **F1 Score:** The weighted average of precision and recall. It is calculated as:\n", + "\n", + " $$\n", + " \\text{F1 Score} = 2 \\times \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\n", + " $$\n", + "\n", + "- **Support:** The number of actual occurrences of the class in the dataset.\n", + "\n", + "The classification report provides a summary of the precision, recall, F1-score, and support for each class in the testing data, giving insight into how well the model is performing for each class.\n" + ], + "metadata": { + "collapsed": false + }, + "id": "b63c56956f2f9620" + }, + { + "cell_type": "code", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Accuracy: 0.8491269841269842\n", + "Classification Report:\n", + " precision recall f1-score support\n", + "\n", + " -1.0 0.81 0.90 0.86 6250\n", + " 1.0 0.89 0.80 0.84 6350\n", + "\n", + " accuracy 0.85 12600\n", + " macro avg 0.85 0.85 0.85 12600\n", + "weighted avg 0.85 0.85 0.85 12600\n" + ] + } + ], + "source": [ + "from sklearn.metrics import accuracy_score, classification_report\n", + "\n", + "# Make predictions on the test set\n", + "y_pred = classifier.predict(X_test)\n", + "\n", + "# Evaluate the classifier\n", + "accuracy = accuracy_score(y_test, y_pred)\n", + "classification_rep = classification_report(y_test, y_pred)\n", + "\n", + "print(f\"Accuracy: {accuracy}\")\n", + "print(f\"Classification Report:\\n{classification_rep}\")\n" + ], + "metadata": { + "collapsed": false, + "ExecuteTime": { + "end_time": "2024-02-25T05:34:57.850908700Z", + "start_time": "2024-02-25T05:34:57.294830200Z" + } + }, + "id": "1255f5a45a95e482", + "execution_count": 26 } ], "metadata": {