Rare Named Entity Recognition using DistilBERT

This research focuses on leveraging DistilBERT, a smaller and more efficient variant of the BERT model, for Rare Named Entity Recognition (NER). DistilBERT's efficiency and robust transfer learning capabilities make it a promising tool for enhancing the detection of rare entities.

Dataset used: WNUT_17

More specifically, the WNUT_17 dataset from the Hugging Face Hub was chosen because it focuses on lesser-known or unusual named entities that are often absent from formal corpora. The dataset consists of social media text, which is rich in informal, colloquial, and uncommon expressions.

Tag Categories:

  • B-: beginning of the named entity
  • I-: inside the named entity
  • O: outside of a named entity

Each entity type has a Beginning (B-) tag and an Inside (I-) tag, which makes it possible to recognize entities that span multiple tokens. Together with the O tag, the dataset uses 13 BIO tag classes in total. ner_tags (list of class labels): the NER tags of the tokens (IOB2 format), with the following possible values:

  • 0: O
  • 1: B-corporation
  • 2: I-corporation
  • 3: B-creative-work
  • 4: I-creative-work
  • 5: B-group
  • 6: I-group
  • 7: B-location
  • 8: I-location
  • 9: B-person
  • 10: I-person
  • 11: B-product
  • 12: I-product

An example from the training data:

{
  "id": "0",
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
  "tokens": ["@paulwalk", "It", "'s", "the", "view", "from", "where", "I", "'m", "living", "for", "two", "weeks", ".", "Empire", "State", "Building", "=", "ESB", ".", "Pretty", "bad", "storm", "here", "last", "evening", "."]
}
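
For reference, a minimal sketch of loading WNUT_17 with the Hugging Face datasets library (split and field names follow the public dataset card); label_names is reused by the later sketches:

from datasets import load_dataset

# Load WNUT_17 from the Hugging Face Hub (train / validation / test splits).
wnut = load_dataset("wnut_17")

# Inspect the first training example: raw tokens and their integer NER tags.
example = wnut["train"][0]
print(example["tokens"])
print(example["ner_tags"])

# Human-readable tag names, e.g. 0 -> "O", 7 -> "B-location".
label_names = wnut["train"].features["ner_tags"].feature.names
print(label_names)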

Model:

DistilBERT was chosen for its efficiency and robust performance in NLP tasks. The pre-trained DistilBERT model, distilbert-base-uncased, was fine-tuned for the specific task of NER.

Tokenizer and Model Definition: The AutoTokenizer and AutoModelForTokenClassification classes from the Hugging Face Transformers library were used. The tokenizer converts input text into tokens, while the model performs token classification.
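
A minimal sketch of this step, assuming the label_names list loaded above; the id2label/label2id mappings are optional but make predictions human-readable:

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "distilbert-base-uncased"

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_names),  # 13 BIO classes
    id2label=id2label,
    label2id=label2id,
)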

Tokenization and Label Alignment: The text data was tokenized, and the NER labels were aligned with the tokenized words. The function tokenize_and_align_labels ensured that the labels corresponded correctly to the subword tokens generated by the tokenizer.
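
A sketch of one common alignment strategy (the repository's exact implementation may differ): the first subword of each word keeps that word's label, while special tokens and remaining subwords get -100 so the loss ignores them. max_length=128 matches the maximum sequence length listed under Hyperparameters:

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        aligned = []
        for word_id in word_ids:
            if word_id is None:
                aligned.append(-100)             # special tokens such as [CLS], [SEP]
            elif word_id != previous_word_id:
                aligned.append(labels[word_id])  # first subword of a word keeps its label
            else:
                aligned.append(-100)             # later subwords are ignored by the loss
            previous_word_id = word_id
        all_labels.append(aligned)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = wnut.map(tokenize_and_align_labels, batched=True)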

Data Collation: A data collator was defined to batch the data efficiently during training and evaluation.
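
The stock collator for token classification pads both the input ids and the label sequences to the longest example in each batch; a one-line sketch:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)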

Training Configuration: The TrainingArguments class was used to set up the training configuration, including the learning rate, batch size, number of epochs, and strategies for evaluation and model saving.
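
A sketch of that configuration, using the hyperparameter values listed under Hyperparameters below; the output directory name distilbert-wnut17 is illustrative, not taken from the repository:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-wnut17",  # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",     # evaluate at the end of each epoch
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
)

(In recent transformers releases the evaluation_strategy argument has been renamed to eval_strategy.)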

Evaluation Metrics: A custom evaluation function compute_metrics was defined to calculate precision, recall, F1 score, and accuracy using the predictions and true labels.
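
One common way to implement this, sketched here with the seqeval metric from the evaluate library (the repository's own compute_metrics may be written differently); positions labelled -100 are dropped before scoring and label ids are mapped back to tag strings:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Keep only real tokens (label != -100) and convert ids to tag strings.
    true_labels = [
        [label_names[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }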

Model Training and Evaluation: The Trainer class from the Transformers library was used to handle the training and evaluation process. The model was trained on the tokenized training data and evaluated on the tokenized test data.
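
A sketch of wiring the pieces above into the Trainer, assuming the objects defined in the earlier sketches (model, training_args, tokenized_datasets, tokenizer, data_collator, compute_metrics):

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()      # fine-tune on the tokenized training split
trainer.evaluate()   # report precision, recall, F1 and accuracy on the test split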

Hyperparameters:

  • Learning Rate: 2e-5
  • Batch Size: 1 per device for both training and evaluation
  • Number of Epochs: 2
  • Weight Decay: 0.01
  • Evaluation Strategy: Evaluation at the end of each epoch
  • Save Strategy: Save the model at the end of each epoch
  • Maximum Sequence Length: 128

Evaluation results (in %):

  • Precision: 93.61
  • Recall: 94.39
  • F1: 92.75
  • Accuracy: 94.8

About

A research project developed as part of the Information Extraction course. The project was carried out by a three-member team; my contribution was applying DistilBERT to rare named entity recognition (NER).
