This repository contains a model for predicting if an individual earns above or below an income threshold

rasmodev/Income-Prediction-Challenge-For-Azubian

Income-Prediction-ML-Project

This repository contains a machine learning project that predicts income levels and deploys the trained model with a Streamlit frontend and a FastAPI backend.

The goal is to predict whether an individual's income falls above or below a threshold, addressing the challenges of income inequality and providing insights for policymakers.

Summary

| Jupyter Notebook | Power BI Dashboard | Published Article | Deployed App on Hugging Face | Deployed FastAPI on Hugging Face |
|---|---|---|---|---|
| Notebook with analysis and model development | Interactive Dashboard | Published Article on Medium | Link to Deployed Streamlit App | Link to Deployed FastAPI |

App Interface

Open the deployed app using the link above, provide the required details, and click the "PREDICT" button.

App Screenshot

Before Prediction

App Screenshot

After Prediction

App Screenshot

Repository Contents:

Project Overview:

i. Data Collection and Preprocessing: I loaded and preprocessed a comprehensive dataset containing income-related data to train and evaluate the income prediction model.

ii. Machine Learning Model: I trained a CatBoost classifier to predict whether an individual's income is above or below the threshold. The model reaches an accuracy of 89.38% and an F1-Score of 0.89.

iii. FastAPI Integration: I integrated the trained machine learning model into a web application using FastAPI. This web application allows users to input individual data and receive instant predictions regarding income levels.

iv. Usage and Deployment: In this README file, you will find detailed instructions on how to use and deploy this web application, making it user-friendly for both developers and policymakers.

Project Setup:

To set up the project environment, follow these steps:

i. Clone the repository:

git clone https://github.com/rasmodev/Income-Prediction-Challenge-For-Azubian.git

ii. Create a virtual environment and install the required dependencies:

  • Windows:

    python -m venv venv; venv\Scripts\activate; python -m pip install -q --upgrade pip; python -m pip install -qr requirements.txt
  • Linux & MacOS:

    python3 -m venv venv; source venv/bin/activate; python -m pip install -q --upgrade pip; python -m pip install -qr requirements.txt  

Data Fields

The data used in this project consists of a diverse collection of income-related attributes obtained from Zindi.

| Column Name | Data Type | Description |
|---|---|---|
| Age | Numeric | Age of the individual |
| Gender | Categorical | Gender of the individual |
| Education | Categorical | Education level of the individual |
| Class Of Worker | Categorical | Class of worker |
| Education Institute | Categorical | Enrollment status in an educational institution in the last week |
| Marital Status | Categorical | Marital status |
| Race | Categorical | Race |
| Hispanic Origin | Categorical | Hispanic origin |
| Employment Commitment | Categorical | Full or part-time employment status |
| Unemployment Reason | Categorical | Reason for unemployment |
| Employment Stat | Categorical | Owns a business or is self-employed |
| Wage Per Hour | Numeric | Wage per hour |
| Labor Union Membership | Categorical | Member of a labor union |
| Weeks Worked In A Year | Numeric | Weeks worked in a year |
| Industry Code | Categorical | Industry category |
| Major Industry Code | Categorical | Major industry category |
| Occupation Code | Categorical | Occupation category |
| Major Occupation Code | Categorical | Major occupation category |
| Num Persons Worked For Employer | Numeric | Number of persons worked for employer |
| Household and Family Stat | Categorical | Detailed household and family status |
| Household Summary | Categorical | Detailed household summary |
| Under 18 Family | Categorical | Family members under 18 |
| Veterans Admin Questionnaire | Categorical | Filled income questionnaire for Veterans Admin |
| Vet Benefit | Categorical | Veteran benefits |
| Tax Filer Status | Categorical | Tax filer status |
| Gains | Numeric | Gains from financial investments |
| Losses | Numeric | Losses from financial investments |
| Stocks Status | Categorical | Dividends from stocks |
| Citizenship | Categorical | Citizenship status |
| Migration Year | Numeric | Year of migration |
| Country Of Birth - Individual | Categorical | Individual's birth country |
| Country Of Birth - Father | Categorical | Father's birth country |
| Country Of Birth - Mother | Categorical | Mother's birth country |
| Migration Code Change In MSA | Categorical | Migration code - Change in MSA |
| Migration Prev Sunbelt | Categorical | Migration previous Sunbelt |
| Migration Code Move Within Reg | Categorical | Migration code - Move within region |
| Migration Code Change In Reg | Categorical | Migration code - Change in region |
| Residence 1 Year Ago | Categorical | Lived in this house one year ago |
| Old Residence Region | Categorical | Region of previous residence |
| Old Residence State | Categorical | State of previous residence |
| Importance Of Record | Numeric | Weight of the instance |
| Income Above 50k | Categorical | Binary indicator if income is above $50,000 |

Machine Learning Lifecycle

I employed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology in this project.

Here are the steps I undertook:

Business Understanding:

I began by understanding the problem domain, which involved predicting income levels. I defined the project goals and objectives, such as addressing income inequality through data-driven insights.

Data Understanding:

I collected the dataset from Zindi, which included various income-related attributes. After an overview of the first few columns, I formulated hypotheses and key analytical questions that would guide the understanding of the dataset.

Hypothesis:

Null Hypothesis (H0): There is no significant association between the individual's age and income level.

Alternative Hypothesis (H1): There is a significant association between the individual's age and income level.
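
The README does not state which statistical test was used to evaluate this hypothesis. As one possible approach, the sketch below compares the mean age of the two income groups using Welch's t-statistic, with a normal approximation for the p-value that is only reasonable for large samples. The ages are made-up illustrative values, not data from the project, and `welch_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(group_a, group_b):
    """Return (t_statistic, approximate two-sided p-value) for two samples."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    t_stat = (mean_a - mean_b) / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Illustrative ages only (not real data from the project).
ages_below = [22, 25, 31, 19, 28, 35, 24, 27]
ages_above = [41, 38, 52, 47, 36, 44, 50, 39]

t_stat, p_value = welch_test(ages_above, ages_below)
reject_h0 = p_value < 0.05  # conventional 5% significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

A small p-value would lead to rejecting H0 in favor of H1. For a categorical treatment of age (for example, age bands), a chi-square test of independence would be a natural alternative.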

Key Analytical Questions and Answers

Does a higher education level correspond to a higher likelihood of having incomes above the threshold?

Yes. Higher education is positively correlated with income. The dataset exhibits substantial income inequality, especially at lower education tiers, and this inequality persists even at higher education levels, suggesting that other factors also contribute to it.

How does age relate to income levels in the dataset?

Lower-income individuals are younger on average than higher-income individuals, and older individuals are more likely to have incomes above the threshold. The higher-income group also spans a broader, more diverse age range.

Is there a significant gender-based income disparity?

Women are more likely to be below the income threshold than men, indicating a significant gender disparity in income levels.

Are there differences in employment status between the two income groups?

Income inequality is present across all employment statuses, with individuals in full-time schedules, part-time roles, and unemployment facing financial challenges.

How do race and ethnicity correlate with income levels in the dataset?

Racial income disparities exist, with White individuals having a higher count above the income threshold than other racial groups. Citizens have a more diverse income distribution than foreigners.

Is citizenship status associated with income levels?

The majority of foreigners in the dataset are concentrated below the income threshold, indicating a potential association between foreign status and lower income levels.

What is the relationship between occupation and income categories?

The majority of individuals with income below the threshold are in occupations categorized as "Unknown," indicating potential associations between specific occupations and higher income levels.

How does tax status correspond to income levels?

Nonfilers seem to have a disproportionately higher representation in the below-income threshold category, indicating a potential income disparity among nonfilers.

Data Preparation

Feature Engineering

Performed unique value exploration, column renaming, missing value imputation, column dropping, target column extraction, and balancing the target column to address class imbalance.

Balancing The Target Variable

There was a significant class imbalance in the target variable, with a relatively small number of participants in the high-income category compared to the low-income category. This significant disparity in class distribution may have had implications for modeling and predictive accuracy. Class imbalances can lead to models that are biased toward the majority class, potentially impacting the model's ability to accurately predict the minority class (Above Limit). I addressed this class imbalance through oversampling.
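
The README does not specify which oversampling method or library was used. As a minimal sketch of the general idea, plain random oversampling can be written with the standard library alone: minority-class rows are duplicated at random until every class is as large as the biggest one. The helper name `random_oversample`, the toy rows, and the label strings are all illustrative assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate random minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())

    balanced_rows, balanced_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows belonging to this class form the pool to sample from.
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        # Sampling with replacement; yields nothing for the majority class.
        extra = rng.choices(pool, k=target_size - count)
        balanced_rows.extend(extra)
        balanced_labels.extend([label] * len(extra))
    return balanced_rows, balanced_labels


# Toy example: 6 "Below limit" rows versus 2 "Above limit" rows.
rows = [{"age": a} for a in (23, 31, 27, 45, 38, 29, 52, 48)]
labels = ["Below limit"] * 6 + ["Above limit"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

After the call, both classes contain six rows, so a classifier trained on the balanced data no longer sees the majority class more often than the minority class.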

Modeling

The training dataset for this income prediction problem contains numerous categorical features, some of which have a large number of unique values. This can pose challenges in terms of encoding and model performance. To address these issues, I opted for the CatBoost classifier as my modeling solution:

1. Automatic Categorical Feature Handling: CatBoost can work directly with categorical columns once they are declared through the cat_features parameter. Unlike traditional models that require manual feature encoding using techniques like One-Hot Encoding or Label Encoding, CatBoost encodes these columns internally, which simplifies preprocessing for features with many unique values.

2. Handling Missing Values: CatBoost handles missing values in numeric features natively. Its nan_mode setting controls the behavior; by default, missing values are treated as smaller than every observed value of the feature, so the model can train without a separate imputation step for those columns.

3. No Feature Scaling Required: As a gradient-boosted decision tree method, CatBoost does not need numeric features to be standardized or normalized, because tree splits depend on the ordering of values rather than their magnitude. This removes a manual step from the data preparation phase.

4. Built-in Cross-Validation: CatBoost includes a built-in cross-validation function (cv) as well as grid and randomized hyperparameter search, which reduces the amount of manual code needed to compare hyperparameter settings.

5. Regularization Techniques: CatBoost applies L2 regularization to leaf values through the l2_leaf_reg parameter. This is valuable for reducing overfitting and enhancing the model's ability to generalize well to unseen data.

By choosing CatBoost, I aimed to efficiently address the challenges posed by my dataset, particularly the extensive set of categorical features with many unique values which would have posed challenges during encoding. CatBoost not only simplifies the modeling process but also enhances the model's performance. It's a robust solution for the income prediction problem.

Dataset Splitting

Split the preprocessed training dataset into training and evaluation sets (80% training, 20% evaluation) using train_test_split.
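
To illustrate what an 80/20 hold-out split does, the sketch below shuffles row indices and sets aside 20% of them for evaluation. It is a standard-library stand-in written for this README, not the project's code, which used scikit-learn's train_test_split; the helper name `holdout_split` and the toy data are assumptions.

```python
import random


def holdout_split(rows, labels, eval_fraction=0.2, seed=42):
    """Shuffle indices and hold out `eval_fraction` of the rows for evaluation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_eval = round(len(rows) * eval_fraction)
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, rows), pick(eval_idx, rows),
            pick(train_idx, labels), pick(eval_idx, labels))


rows = list(range(100))          # stand-in feature rows
labels = [i % 2 for i in rows]   # stand-in binary target
X_train, X_eval, y_train, y_eval = holdout_split(rows, labels)
print(len(X_train), len(X_eval))  # 80 20
```

With scikit-learn, the corresponding call is `train_test_split(X, y, test_size=0.2)`, optionally with `random_state` for reproducibility and `stratify=y` to keep class proportions equal in both sets.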

Model Training and Evaluation

Achieved an Accuracy of 89.38% and an F1-Score of 0.89.
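
For readers unfamiliar with the two metrics, the sketch below shows how accuracy and the F1-score of the positive class are derived from true and predicted labels. The labels are toy values rather than the project's predictions, and the README does not say whether its reported F1-Score is the binary, macro, or weighted variant; the binary form is assumed here.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels (1 = above the threshold, 0 = below); not the project's predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}, F1-score: {f1:.2f}")
```

Because F1 combines precision and recall, it is a useful complement to accuracy when the two classes are imbalanced.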

Saving The Model and Key Components

Saved the model, unique values, encoder, and scaler in a single pickle file for later use.
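
A minimal sketch of bundling several components into one pickle file is shown below. The dictionary keys mirror the four components listed above, but the placeholder values, the key names, and the file name are assumptions made for illustration; the README does not state the actual ones.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder objects standing in for the trained model and preprocessing pieces.
components = {
    "model": {"name": "placeholder-model"},
    "unique_values": {"gender": ["Female", "Male"]},
    "encoder": None,
    "scaler": None,
}

# Hypothetical file name, written to the system temp directory for this demo.
artifact_path = Path(tempfile.gettempdir()) / "model_and_key_components.pkl"

# Save everything in a single file.
with artifact_path.open("wb") as f:
    pickle.dump(components, f)

# Load it back, as the serving code would do at startup.
with artifact_path.open("rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded))  # ['encoder', 'model', 'scaler', 'unique_values']
```

Keeping the components together in one file means the serving code loads the model and its matching preprocessing objects in a single step. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.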

Deployment

Utilized Streamlit for a user-friendly interface and FastAPI for serving predictions. Keeping the frontend and backend separate allows for flexibility in deployment, scalability, high performance, and easy integration.

Why Streamlit + FastAPI?

  • Asynchronous processing
  • Scalability
  • High performance
  • Easy integration

Streamlit provides the user-facing interface, while FastAPI serves predictions through an asynchronous, high-performance API that can be scaled independently of the frontend.

Linking The Streamlit App with The FastAPI Backend

Connected the Streamlit app with the FastAPI backend. The app sends the user's inputs to the FastAPI server in a POST request, receives the prediction response, and displays the prediction result to the user.
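
The request side of this flow can be sketched with the standard library as follows. The endpoint URL, the `/predict` path, and the field names in the payload are hypothetical; the real API defines its own address and schema, and the project may use a different HTTP client.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the address of the running FastAPI server.
API_URL = "http://localhost:8000/predict"


def build_prediction_request(features, url=API_URL):
    """Build a JSON POST request carrying the user's inputs."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(features, url=API_URL, timeout=10):
    """Send the request and decode the JSON response (needs a live server)."""
    req = build_prediction_request(features, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Field names are illustrative only.
req = build_prediction_request({"age": 35, "gender": "Female", "education": "Bachelors"})
print(req.get_method(), req.full_url)
```

In a Streamlit page, a function like `request_prediction` would typically be called when the user clicks the "PREDICT" button, with the returned result rendered on the page.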

App Layout - Homepage, Solution & EDA

The app comprises four pages: Homepage, Solution, EDA, and Prediction Page. Each page serves a specific purpose, from introducing the user to the problem to providing a PowerBI dashboard and allowing for predictions.

App Layout - Prediction Page

The Prediction Page allows users to input data such as age, gender, education, etc. They submit the data and receive an instant prediction response. The page provides descriptions of the different inputs and allows users to view and select them.

FastAPI Backend

The FastAPI backend accepts user input data, preprocesses it, utilizes a trained machine learning model to predict income categories, calculates prediction probability, formats the prediction result, and returns the prediction response.
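
The steps listed above can be sketched independently of the web framework. In the example below, the preprocessing function and the model are simple stand-ins, and the helper names, response keys, label strings, and the 0.5 decision threshold are all assumptions rather than the project's actual implementation.

```python
def format_prediction(probability_above, threshold=0.5):
    """Turn the model's probability for the 'above' class into an API response."""
    is_above = probability_above >= threshold
    label = "Above limit" if is_above else "Below limit"
    # Report the probability of whichever class was predicted.
    confidence = probability_above if is_above else 1 - probability_above
    return {
        "income_prediction": label,
        "prediction_probability": round(confidence, 4),
    }


def predict_income(features, preprocess, predict_proba):
    """Run the backend steps in order: preprocess, score, format."""
    model_input = preprocess(features)
    probability_above = predict_proba(model_input)
    return format_prediction(probability_above)


# Stand-ins for the real preprocessing pipeline and trained model.
def fake_preprocess(features):
    return [features["age"]]


def fake_predict_proba(model_input):
    return 0.82 if model_input[0] >= 40 else 0.10


print(predict_income({"age": 45}, fake_preprocess, fake_predict_proba))
print(predict_income({"age": 22}, fake_preprocess, fake_predict_proba))
```

In the deployed service, a FastAPI route (for example, one registered with `@app.post`) would wrap a function like `predict_income`, passing in the loaded preprocessing components and model.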

Author

Rasmo Wanyama

Data Analyst/Data Scientist

Let's connect on LinkedIn:

LinkedIn

Acknowledgments:

I would like to thank the open-source community and the data providers who contributed to the dataset used in this project. Their efforts have made advancements in income prediction possible.

Feel free to explore the code, use the web application, and contribute to the project's development. Data-driven insights can contribute to a more equitable society, and together, we can make a difference.


