GitHub - Haripprasath-M/translation-and-content-moderation-application: This repository houses a full-stack application designed to translate English text into French while ensuring content safety. The backend is powered by FastAPI, and the frontend is built using Streamlit. A deep learning model acts as a content moderator, preventing the translation of inappropriate content.

Translation and Content Moderation Application

Project Overview

This project builds a full-stack application that translates English text to French while maintaining content safety. It leverages FastAPI for the backend and Streamlit for the user-friendly frontend. A deep learning model acts as a content moderator, preventing the translation of inappropriate content.

Data

Source: Jigsaw Toxic Comment Classification Challenge (Kaggle)
Features: Text data (comment_text) with six binary labels: toxic, severe_toxic, obscene, threat, insult, and identity_hate.
Translation Model: No separate dataset used; a pre-trained transformer model is employed.

Exploratory Data Analysis (EDA)

Performed a comprehensive EDA to understand the comment data:

Identified null and duplicate values.
Generated descriptive statistics.
Visualized class imbalance using count plots.
Analyzed text length distribution with histograms.
Explored feature correlations using a heatmap.
Created word clouds to identify content patterns within each toxicity class.

Data Preprocessing

Cleaned the text data:
- Lowercased characters.
- Removed special characters and stopwords.
- Applied lemmatization.
Filtered out comments exceeding 500 characters.
Enriched the data with:
- Text length.
- Sentiment polarity and subjectivity scores.
- Readability metrics.
Split the data into training and testing sets (X_train, X_test, y_train, y_test).
Standardized numerical features (text length, sentiment, and readability).
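
The cleaning and filtering steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the stopword list is a tiny hypothetical stand-in, lemmatization is omitted because it needs an external NLP library, and it is an assumption that the 500-character filter applies to the raw comment and that text length is measured on the cleaned text.

```python
import re

# Hypothetical, deliberately tiny stopword list; the project's real list is not given.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "this"}
MAX_CHARS = 500  # comments longer than this are dropped


def clean_text(text: str) -> str:
    """Lowercase, replace special characters with spaces, and drop stopwords."""
    text = text.lower()
    # Keep letters, digits, and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def preprocess(comments: list[str]) -> list[dict]:
    """Skip over-long comments, then attach cleaned text and a length feature."""
    rows = []
    for comment in comments:
        if len(comment) > MAX_CHARS:  # assumption: filter on raw length
            continue
        cleaned = clean_text(comment)
        rows.append({"clean_text": cleaned, "text_length": len(cleaned)})
    return rows
```

Sentiment, readability, and standardization would be added as further columns on the same rows before the train/test split.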

Machine Learning Model

Input Features: Cleaned text (TF-IDF vectors), text length, sentiment, and readability metrics.
Models: Random Forest, Gradient Boosting, and AdaBoost, each wrapped in OneVsRestClassifier for multi-label classification.
Evaluation: F1-score and accuracy for each model.

Model Performances:

| Model             | Accuracy | F1-Score |
|-------------------|----------|----------|
| Random Forest     | 0.906    | 0.637    |
| Gradient Boosting | 0.906    | 0.637    |
| AdaBoost          | 0.904    | 0.647    |
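
The gap between accuracy and F1 in these results is what one would expect on imbalanced multi-label data, where most comments carry no toxic label at all. The sketch below shows how the two metrics can diverge on toy data. It assumes exact-match (subset) accuracy and micro-averaged F1; the README does not state which variants were actually used, and the sample labels are invented for illustration.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 from true/false positives and false negatives pooled over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 3 labels, 8 clean comments and 2 toxic ones (invented values).
y_true = [[0, 0, 0]] * 8 + [[1, 1, 0], [1, 0, 0]]
y_pred = [[0, 0, 0]] * 8 + [[1, 0, 0], [0, 0, 0]]

print(subset_accuracy(y_true, y_pred))  # 0.8
print(micro_f1(y_true, y_pred))         # 0.5
```

Predicting the clean majority correctly keeps accuracy high, while the missed positive labels pull F1 down.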

Deep Learning Model

Utilized the pre-trained FastText Wiki Subwords 300 model for word embeddings.
Built and trained an LSTM model for content moderation, saving it as lstm.h5.

LSTM Model Performance:

Accuracy: 0.916
F1-Score: 0.755

Machine Translation

Employed a pre-trained transformer model Helsinki-NLP/opus-mt-en-fr for English-to-French translation.
Saved the tokenizer and the model locally so that text can be encoded and decoded without downloading them again.
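
A minimal sketch of saving and reusing the translation model with the Hugging Face transformers library might look like the following. It assumes transformers, sentencepiece, and PyTorch are installed; the local directory name and the function names are hypothetical and not taken from the repository.

```python
MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"
LOCAL_DIR = "models/opus-mt-en-fr"  # hypothetical subdirectory name


def save_locally(target_dir: str = LOCAL_DIR) -> None:
    """Download the pre-trained checkpoint once and write it to disk."""
    # Imported lazily so this module loads without the heavy dependency.
    from transformers import MarianMTModel, MarianTokenizer

    MarianTokenizer.from_pretrained(MODEL_NAME).save_pretrained(target_dir)
    MarianMTModel.from_pretrained(MODEL_NAME).save_pretrained(target_dir)


def translate(texts: list[str], model_dir: str = LOCAL_DIR) -> list[str]:
    """Translate a batch of English sentences using the locally saved files."""
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_dir)
    model = MarianMTModel.from_pretrained(model_dir)
    # Encode to padded PyTorch tensors, generate, then decode back to text.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `save_locally()` once and then `translate(["Hello, how are you?"])` would return a list with one French string. In a long-running backend the tokenizer and model would normally be loaded once at startup rather than on every call.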

Challenges

Embedding Matrix: FastText had no vectors for numeric tokens, so their rows in the embedding matrix were left as zero vectors.
Model Storage: Large pre-trained models are hosted on OneDrive due to GitHub storage limitations.
Docker: Attempted implementation but storage constraints prevented completion.
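
The embedding matrix issue above can be reproduced in miniature. The sketch below builds a matrix from a word index and a vector lookup, and reports which words end up with all-zero rows. The vocabulary, the vectors, and the 4-dimensional size are invented for illustration (the real FastText vectors are 300-dimensional), and the function name is hypothetical.

```python
EMBED_DIM = 4  # toy size; the pre-trained FastText vectors have 300 dimensions


def build_embedding_matrix(word_index, vectors, dim=EMBED_DIM):
    """Return (matrix, missing); row i holds the vector for the word with index i.

    Row 0 is reserved for padding. Words absent from `vectors` keep a zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # no pre-trained vector: row stays all zeros
        else:
            matrix[idx] = list(vec)
    return matrix, missing


# Hypothetical toy vocabulary and vectors.
word_index = {"hello": 1, "world": 2, "2023": 3}
vectors = {"hello": [0.1, 0.2, 0.3, 0.4], "world": [0.5, 0.6, 0.7, 0.8]}

matrix, missing = build_embedding_matrix(word_index, vectors)
print(missing)  # ['2023']
```

Logging the `missing` list makes it easy to see how much of the vocabulary is affected before training the LSTM.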

Deployment (Local)

Not deployed to a server yet. The project includes:
- A FastAPI backend with endpoints for content validation and translation.
- A Streamlit frontend offering a user-friendly interface for text input, validation, and translation.
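
The core flow behind the two endpoints, validating first and translating only if the text passes, can be sketched independently of any web framework. The function names and the response shape below are assumptions for illustration, and the moderator and translator are trivial stand-ins for the LSTM and the transformer model.

```python
from typing import Callable, Optional


def moderate_then_translate(
    text: str,
    is_inappropriate: Callable[[str], bool],
    translate: Callable[[str], str],
) -> dict:
    """Run the moderation check and translate only when the text passes."""
    if is_inappropriate(text):
        # Blocked content is never passed to the translation model.
        return {"allowed": False, "translation": None}
    return {"allowed": True, "translation": translate(text)}


# Stand-ins for the real models (hypothetical behaviour).
def fake_moderator(text: str) -> bool:
    return "badword" in text.lower()


def fake_translator(text: str) -> Optional[str]:
    return f"[fr] {text}"


print(moderate_then_translate("Good morning", fake_moderator, fake_translator))
print(moderate_then_translate("some BADWORD here", fake_moderator, fake_translator))
```

Passing the models in as callables keeps the gate easy to test without loading either model.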

To run locally:

Clone the repository:

git clone https://github.com/Haripprasath-M/translation-and-content-moderation-application

Install dependencies:
```
pip install -r requirements.txt
```
Download the models from OneDrive and place them in the models/ directory: https://onedrive.live.com/?redeem=aHR0cHM6Ly8xZHJ2Lm1zL2YvYy84ZDc0M2QzMGMzYzMzZGZmL0VrUlVqLWc4amZsRG5TR2JjODlnUDdJQmVpcVVmSmFTazNZMnVSUmRmbzNsWUE%5FZT1mR1VKZ2M&id=8D743D30C3C33DFF%21se88f54448d3c43f99d219b73cf603fb2&cid=8D743D30C3C33DFF
Run the FastAPI backend:
```
python src/fastapi_app.py
```
Run the Streamlit frontend:
```
streamlit run src/streamlit_app.py
```

Future Work

Complete Docker implementation for deployment.
Further optimize the deep learning model.
Add more translation options.

Contributions

Developed entirely by Haripprasath M.

License

This project is open-source under the MIT License.
