1. Introduction

Mental health is a crucial aspect of overall well-being, yet diagnosing and managing mental health conditions remain challenging. This project explores the use of machine learning models for classifying mental health severity using socio-demographic and behavioral data.

Objective: The aim is to evaluate and compare the performance of different machine learning models in predicting the severity of mental health conditions.

2. Dataset

Source: The dataset used in this study consists of 1,000 data points covering socio-demographic and behavioral attributes.

Target: The target variable represents the severity of mental health conditions categorized into four levels: None, Mild, Moderate, and Severe.

2.1 Dataset Features

Age: Integer value representing the respondent’s age.
Gender: Categorical variable indicating gender identity.
Occupation: Job role of the respondent.
Stress Levels: Self-reported stress levels.
Sleep Patterns: Average daily sleep duration.
Physical Activity: Frequency of physical exercise.
Work Hours: Number of work hours per week.

3. Data Preprocessing

3.1 Data Cleaning

Handling Missing Values: Missing numerical values were filled with the mean, while categorical features were imputed with the mode.
Encoding Categorical Variables: Gender and occupation were converted to numerical data using one-hot encoding.
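Normalization: Min-max scaling was applied to continuous variables so that they share a common range (typically [0, 1]). This rescales the values but does not change the shape of their distributions.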
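The cleaning steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the column names and the toy data frame are hypothetical, and the real dataset's headers may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical column names; the real dataset's headers may differ.
NUMERIC_COLS = ["age", "sleep_hours", "work_hours"]
CATEGORICAL_COLS = ["gender", "occupation"]


def build_preprocessor() -> ColumnTransformer:
    """Mean-impute and min-max scale numeric columns;
    mode-impute and one-hot encode categorical columns."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", MinMaxScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0 forces a dense NumPy array as output.
    return ColumnTransformer(
        [("num", numeric, NUMERIC_COLS), ("cat", categorical, CATEGORICAL_COLS)],
        sparse_threshold=0,
    )


# Tiny made-up frame, only to show the shapes involved.
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 31],
    "sleep_hours": [7.0, 5.5, 6.0, np.nan],
    "work_hours": [40, 55, 38, 45],
    "gender": ["F", "M", np.nan, "F"],
    "occupation": ["Engineer", "Teacher", "Engineer", "Nurse"],
})

features = build_preprocessor().fit_transform(toy)
print(features.shape)  # 3 scaled numeric columns + 2 gender + 3 occupation columns
```

In a real run the preprocessor would be fitted on the training split only and then applied to the test split, so that imputation means and scaling ranges are not influenced by test data.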

3.2 Dimensionality Reduction

Principal Component Analysis (PCA): Applied to reduce feature dimensionality while preserving key variance in the data.
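A minimal sketch of this step, assuming scikit-learn's PCA and a 95% explained-variance threshold (the number of components actually kept in the project is not stated here). The feature matrix is synthetic and only stands in for the preprocessed data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed feature matrix (200 rows, 10 columns).
X = rng.normal(size=(200, 10))
# Make the last five columns near-copies of the first five, so half are redundant.
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(200, 5))

# A float n_components asks PCA to keep the fewest components whose
# cumulative explained variance reaches that fraction (0.95 is an assumption).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Because half of the synthetic columns are redundant, PCA can drop them while keeping nearly all of the variance, which is the effect the step relies on.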

3.3 Addressing Class Imbalance

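Synthetic Minority Over-sampling Technique (SMOTE): Used to generate synthetic samples for underrepresented severity classes so that the classifiers are not dominated by the majority classes. Oversampling of this kind is normally applied to the training split only, so that the test set contains no synthetic samples.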
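To make the idea concrete, the sketch below shows the core of SMOTE: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbors. It is a simplified NumPy illustration, not the implementation used in the project; the function name `smote_sketch` and its parameters are hypothetical. In practice the `SMOTE` class from imbalanced-learn (`imblearn.over_sampling.SMOTE`, used through `fit_resample`) is a common choice.

```python
import numpy as np


def smote_sketch(X_minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between minority
    samples and their k nearest minority neighbors (needs >= 2 rows)."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority sample per new row
    pick = rng.integers(0, k, size=n_new)   # one of its k neighbors
    neighbor = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    return X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])


# Four made-up minority samples at the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic row is a convex combination of two real minority samples, so the new points stay inside the region the minority class already occupies.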

4. Model Development

4.1 Algorithms Used

Logistic Regression: A simple linear classifier, used here in its multi-class form (the target has four levels) as a baseline.
LightGBM: A gradient boosting model designed for efficient tree-based learning.
Support Vector Machine (SVM): A model effective in high-dimensional data spaces.

4.2 Model Training

Data Split: The dataset was split into 80% training and 20% testing.
Hyperparameter Tuning: Grid search was applied to optimize hyperparameters for SVM and LightGBM.
Cross-Validation: 5-fold cross-validation was used to obtain a more reliable estimate of generalization performance than a single split provides.
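The training procedure above might look like the following for the SVM. This is a sketch under stated assumptions: the data is synthetic, the parameter grid is invented for illustration (the values actually searched are not listed here), and stratifying the split is an assumption rather than something the project confirms.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic four-class data standing in for the real 1,000-row dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# 80/20 split; stratify=y (an assumption) keeps class shares equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

# Assumed search grid, for illustration only.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# cv=5 runs 5-fold cross-validation for every parameter combination,
# using the training split only.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)  # held-out test split, used once
print(search.best_params_)
print(test_accuracy)
```

Keeping the test split out of the grid search means the final score is measured on data that played no part in choosing the hyperparameters.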

4.3 Evaluation Metrics

Accuracy: Percentage of correctly classified instances.
Precision: Of the instances predicted as a given class, the fraction that actually belong to it.
Recall: Of the instances that actually belong to a given class, the fraction that were predicted as that class.
F1-Score: Harmonic mean of precision and recall.
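With four severity levels, precision, recall, and F1-score are computed per class and then averaged. The sketch below uses macro averaging (an unweighted mean over classes), which is an assumption; the averaging method behind the reported numbers is not stated. The function name `macro_scores` and the example predictions are made up for illustration.

```python
from collections import Counter

LABELS = ["None", "Mild", "Moderate", "Severe"]


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    pairs = list(zip(y_true, y_pred))
    correct = Counter(t for t, p in pairs if t == p)  # true positives per class
    predicted = Counter(y_pred)                       # predictions per class
    actual = Counter(y_true)                          # true instances per class

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = correct[label] / predicted[label] if predicted[label] else 0.0
        rec = correct[label] / actual[label] if actual[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(correct.values()) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels: two instances per class, six of eight predicted correctly.
y_true = ["None", "None", "Mild", "Mild", "Moderate", "Moderate", "Severe", "Severe"]
y_pred = ["None", "Mild", "Mild", "Mild", "Moderate", "Severe", "Severe", "Severe"]

accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this example accuracy is 6/8 = 0.75, while macro precision is about 0.83, macro recall is 0.75, and macro F1 is about 0.73, which shows that the metrics can diverge even on a small, balanced sample.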

5. Results and Comparison

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	50.3%	0.49	0.50	0.46
LightGBM	51.0%	0.52	0.52	0.48
SVM with PCA+SMOTE	55.5%	0.56	0.56	0.55

Key Observations:

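SVM with PCA and SMOTE performed best on all four metrics (55.5% accuracy and an F1-score of 0.55), roughly 5 percentage points of accuracy ahead of the other two models, which suggests that dimensionality reduction and class balancing helped on this imbalanced data.
Logistic Regression and LightGBM both stayed near 50% accuracy (50.3% and 51.0%), suggesting that more advanced preprocessing and feature engineering are necessary.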

6. Conclusion

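This study evaluated three machine learning models for classifying the severity of mental health conditions. The SVM model with PCA and SMOTE outperformed the other models, reaching 55.5% accuracy, which points to the value of dimensionality reduction and class balancing techniques. The modest absolute scores across all models leave considerable room for improvement.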

Future Work

Feature Engineering: Exploring additional socio-demographic and behavioral variables.
Ensemble Models: Combining multiple models to enhance predictive accuracy.
Larger Dataset: Using a more extensive dataset to improve generalizability.

7. How to Run the Project

Prerequisites

Python 3.x
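Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn, lightgbm, plus a SMOTE implementation such as imbalanced-learn (SMOTE is not part of scikit-learn)
Jupyter Notebook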

Steps

Clone the repository:

git clone https://github.com/username/mental-health-classification.git
cd mental-health-classification

Run the Jupyter notebook:

jupyter notebook mental_health_classification.ipynb