Rare Named Entity Recognition using DistilBERT

This research applies DistilBERT, a smaller and more efficient distilled variant of the BERT model, to rare Named Entity Recognition (NER). Its low computational cost and strong transfer learning performance make it a promising tool for improving the detection of rare entities.

Dataset used - WNUT_17 Dataset

The WNUT_17 dataset, obtained from the Hugging Face Hub, was chosen because it focuses on rare and emerging named entities that formal corpora often miss. It consists of social media text, which captures informal language, including uncommon idioms and colloquial expressions.

Tag Categories:

B-: Beginning of the Named Entity
I-: Inside the Named Entity
O: Outside of the Named Entity

Each entity type has a Beginning tag (B) and an Inside tag (I), which makes it possible to recognize entities that span multiple tokens. In total the dataset has 13 BIO tags: one O tag plus a B- and an I- tag for each of six entity types. The ner_tags field (a list of class labels) holds the tag of each token in IOB2 format, with possible values:

0: O
1: B-corporation
2: I-corporation
3: B-creative-work
4: I-creative-work
5: B-group
6: I-group
7: B-location
8: I-location
9: B-person
10: I-person
11: B-product
12: I-product

An example of train data:

{
  "id": "0",
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
  "tokens": ["@paulwalk", "It", "'s", "the", "view", "from", "where", "I", "'m", "living", "for", "two", "weeks", ".", "Empire", "State", "Building", "=", "ESB", ".", "Pretty", "bad", "storm", "here", "last", "evening", "."]
}
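To make the tagging scheme concrete, the sketch below decodes part of the training example above into entity spans. The helper name `decode_entities` is hypothetical and not taken from the repository; the label list simply follows the index order given earlier.

```python
# Label names in the index order listed above (0 = O, 1 = B-corporation, ...).
LABELS = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]


def decode_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = LABELS[tag_id]
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Tokens 14 to 19 of the training example, with their ner_tags.
tokens = ["Empire", "State", "Building", "=", "ESB", "."]
tag_ids = [7, 8, 8, 0, 7, 0]
print(decode_entities(tokens, tag_ids))
# [('Empire State Building', 'location'), ('ESB', 'location')]
```

The three-token span shows why both tags are needed: B-location opens the entity and the two I-location tags extend it, while the second B-location starts a separate one-token entity.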

Model:

DistilBERT was chosen for its efficiency and robust performance in NLP tasks. The pre-trained DistilBERT model, distilbert-base-uncased, was fine-tuned for the specific task of NER.

Tokenizer and Model Definition: The AutoTokenizer and AutoModelForTokenClassification classes from the Hugging Face Transformers library were used. The tokenizer converts input text into tokens, while the model assigns a label to each token.
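A minimal sketch of this step might look as follows. The function name `load_tokenizer_and_model` and the constant names are hypothetical; only the two `Auto*` classes and the checkpoint name come from the text above, and the label count of 13 from the tag list.

```python
MODEL_CHECKPOINT = "distilbert-base-uncased"
NUM_LABELS = 13  # O plus a B- and an I- tag for each of six entity types


def load_tokenizer_and_model(checkpoint=MODEL_CHECKPOINT, num_labels=NUM_LABELS):
    """Load the pre-trained tokenizer and a token-classification model."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # A new classification head with `num_labels` outputs is added on top
    # of the pre-trained DistilBERT encoder.
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_tokenizer_and_model()` downloads the checkpoint on first use, so it needs network access or a local cache.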

Tokenization and Label Alignment: The text data was tokenized, and the NER labels were aligned with the tokenized words. The function tokenize_and_align_labels ensured that the labels corresponded correctly to the subword tokens generated by the tokenizer.
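The core of the alignment can be sketched in plain Python. This is an illustration under assumptions, not the repository's tokenize_and_align_labels: the helper name `align_labels_with_word_ids` is hypothetical, and the choice to label only the first subword of each word and mask the rest with -100 is one common convention that the source does not confirm. In a real pipeline the `word_ids` list would come from a fast tokenizer called with `is_split_into_words=True`, via the `word_ids()` method of its output; here it is written by hand.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def align_labels_with_word_ids(word_ids, word_labels):
    """Map word-level labels onto subword tokens.

    word_ids[i] is the index of the word that token i came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens carry no label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a word receives the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are masked out (assumed convention).
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hand-written example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [7, 8, 8]  # B-location, I-location, I-location
print(align_labels_with_word_ids(word_ids, word_labels))
# [-100, 7, 8, -100, 8, -100]
```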

Data Collation: A data collator was defined to batch the data efficiently during training and evaluation.
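The source does not name the collator. Transformers ships `DataCollatorForTokenClassification`, which pads each batch to its longest sequence and pads labels with -100; the plain-Python sketch below mimics that behaviour for illustration only. The function name `pad_batch` and the token ids in the example are made up, and a pad token id of 0 is an assumption.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded positions get the ignore label so they do not affect the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 0, 7, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 9, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["labels"])     # [[-100, 0, 7, -100], [-100, 9, -100, -100]]
```

With the batch size of 1 listed under Hyperparameters, each batch holds a single sequence, so no padding is added at this stage.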

Training Configuration: The TrainingArguments class was used to set up the training configuration, including the learning rate, batch size, number of epochs, and strategies for evaluation and model saving.

Evaluation Metrics: A custom evaluation function compute_metrics was defined to calculate precision, recall, F1 score, and accuracy using the predictions and true labels.
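How such a function could work is sketched below in plain Python. It is an assumption-laden illustration, not the repository's compute_metrics: it takes already-argmaxed prediction ids rather than the logits the Trainer passes, it scores precision, recall, and F1 at the entity-span level and accuracy at the token level, and the helper `extract_spans` is hypothetical. The actual script may rely on a library such as seqeval and may average differently.

```python
IGNORE_INDEX = -100


def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO tag sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # Stray I- tags with no open span are ignored in this sketch.
    return spans


def compute_metrics(pred_ids, label_ids, label_names):
    """Entity-level precision/recall/F1 and token-level accuracy."""
    true_spans, pred_spans = set(), set()
    correct_tokens = total_tokens = 0
    for row, (preds, labels) in enumerate(zip(pred_ids, label_ids)):
        # Drop special tokens and masked subwords before scoring.
        kept = [(p, l) for p, l in zip(preds, labels) if l != IGNORE_INDEX]
        correct_tokens += sum(p == l for p, l in kept)
        total_tokens += len(kept)
        pred_tags = [label_names[p] for p, _ in kept]
        true_tags = [label_names[l] for _, l in kept]
        pred_spans |= {(row, *span) for span in extract_spans(pred_tags)}
        true_spans |= {(row, *span) for span in extract_spans(true_tags)}
    tp = len(pred_spans & true_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct_tokens / total_tokens if total_tokens else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy label set and one sentence: the model finds the location but misses the person.
names = ["O", "B-location", "I-location", "B-person", "I-person"]
metrics = compute_metrics(
    pred_ids=[[0, 1, 2, 0, 0, 0]],
    label_ids=[[-100, 1, 2, 0, 3, -100]],
    label_names=names,
)
print(metrics)
```

In the toy example one of two gold entities is recovered and the single predicted entity is correct, so precision is 1.0, recall is 0.5, and three of the four scored tokens match.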

Model Training and Evaluation: The Trainer class from the Transformers library handled the training and evaluation process. The model was trained on the tokenized training data and evaluated on the tokenized test data.

Hyperparameters:

Learning Rate: 2e-5
Batch Size: 1 per device for both training and evaluation
Number of Epochs: 2
Weight Decay: 0.01
Evaluation Strategy: Evaluation at the end of each epoch
Save Strategy: Save the model at the end of each epoch
Maximum Sequence Length: 128
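Expressed as keyword arguments for `TrainingArguments`, the settings above might look like this. The names `HYPERPARAMETERS`, `build_training_arguments`, and the output directory are hypothetical. The `eval_strategy` key is the name used by recent Transformers releases; older releases call it `evaluation_strategy`, so the key may need adjusting for the installed version.

```python
HYPERPARAMETERS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "num_train_epochs": 2,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # `evaluation_strategy` in older Transformers releases
    "save_strategy": "epoch",
}

# The maximum sequence length is not a TrainingArguments field; it would be
# passed to the tokenizer (for example as max_length with truncation enabled).
MAX_SEQ_LENGTH = 128


def build_training_arguments(output_dir="distilbert-wnut17"):
    """Create TrainingArguments from the settings above (output_dir is a placeholder)."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)
```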

Evaluation:

Precision: 93.61
Recall: 94.39
F1: 92.75
Accuracy: 94.8

AswinKumar1/Rare_named_entity_recognition
