This research focuses on leveraging DistilBERT, a smaller and more efficient variant of the BERT model, for Rare Named Entity Recognition (NER). DistilBERT's efficiency and robust transfer learning capabilities make it a promising tool for enhancing the detection of rare entities.
More specifically, the WNUT_17 dataset, obtained from the Hugging Face Hub, was chosen because it focuses on lesser-known or unusual named entities that formal corpora often ignore. The dataset consists of social media samples, which capture a wide range of informal language, including uncommon idioms and colloquial expressions.
Tag Categories:
- B-: Beginning of a named entity
- I-: Inside a named entity
- O: Outside any named entity
Each entity type has a Beginning tag (B-) and an Inside tag (I-), which makes it possible to recognize entities that span multiple tokens. With six entity types plus the O tag, the dataset has 13 BIO tags in total. The ner_tags field (a list of class labels) holds the tag of each token in IOB2 format, with possible values:
- 0: O
- 1: B-corporation
- 2: I-corporation
- 3: B-creative-work
- 4: I-creative-work
- 5: B-group
- 6: I-group
- 7: B-location
- 8: I-location
- 9: B-person
- 10: I-person
- 11: B-product
- 12: I-product
An example from the training data:
{
  "id": "0",
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
  "tokens": ["@paulwalk", "It", "'s", "the", "view", "from", "where", "I", "'m", "living", "for", "two", "weeks", ".", "Empire", "State", "Building", "=", "ESB", ".", "Pretty", "bad", "storm", "here", "last", "evening", "."]
}
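To make the tagging scheme concrete, the training example above can be decoded back into entity spans. The following is a minimal sketch, not part of the original pipeline: the helper name `decode_entities` is hypothetical, and it assumes that an I- tag which does not continue an open entity of the same type is simply treated as outside.

```python
# Label names in the id order listed above (0 = O, ..., 12 = I-product).
LABEL_NAMES = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]


def decode_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = LABEL_NAMES[tag_id]
        if tag.startswith("B-"):
            # A B- tag always closes any open entity and starts a new one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["@paulwalk", "It", "'s", "the", "view", "from", "where", "I", "'m",
          "living", "for", "two", "weeks", ".", "Empire", "State", "Building",
          "=", "ESB", ".", "Pretty", "bad", "storm", "here", "last", "evening", "."]
ner_tags = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]

print(decode_entities(tokens, ner_tags))
# [('location', 'Empire State Building'), ('location', 'ESB')]
```

The three-token span "Empire State Building" is recovered as a single location because its first token carries B-location (7) and the following two carry I-location (8).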
DistilBERT was chosen for its efficiency and robust performance in NLP tasks. The pre-trained DistilBERT model, distilbert-base-uncased, was fine-tuned for the specific task of NER.
Tokenizer and Model Definition: The AutoTokenizer and AutoModelForTokenClassification classes from the Hugging Face Transformers library were used. The tokenizer converts input text into tokens, while the model performs token classification.
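A setup along these lines might look as follows. This is a sketch under assumptions rather than the original code: the function name `load_tokenizer_and_model` is hypothetical, and passing `id2label` and `label2id` is an optional convenience that the source does not mention. The Transformers import is kept inside the function so that the label maps can be inspected without downloading the checkpoint.

```python
# The 13 WNUT_17 tags, in id order.
LABEL_NAMES = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]
id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}


def load_tokenizer_and_model(checkpoint="distilbert-base-uncased"):
    """Load the pre-trained tokenizer and a token-classification head.

    Calling this requires the `transformers` package and downloads the
    checkpoint on first use.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABEL_NAMES),  # one output logit per BIO tag
        id2label=id2label,            # assumption: readable labels in the config
        label2id=label2id,
    )
    return tokenizer, model
```

Storing the label maps in the model configuration means that predictions can later be reported as tag names such as B-location instead of raw class ids.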
Tokenization and Label Alignment: The text data was tokenized, and the NER labels were aligned with the tokenized words. The function tokenize_and_align_labels ensured that the labels corresponded correctly to the subword tokens generated by the tokenizer.
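The core of such an alignment step can be sketched without the tokenizer itself. The example below assumes the common strategy of labeling only the first subword of each word and masking special tokens and later subwords with -100, which PyTorch's cross-entropy loss ignores by default; the source does not state which strategy tokenize_and_align_labels actually used. The helper name and the `word_ids` values are hypothetical; in practice they would come from the `word_ids()` method of a fast tokenizer's output.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def align_labels_with_word_ids(word_ids, word_labels):
    """Map word-level labels onto subword tokens.

    word_ids: for each subword token, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word = word_id
    return aligned


# Words: "Empire", "State", "Building", "=", "ESB" with their WNUT_17 tag ids.
word_labels = [7, 8, 8, 0, 7]
# Hypothetical tokenization in which the last word is split into two pieces.
word_ids = [None, 0, 1, 2, 3, 4, 4, None]

print(align_labels_with_word_ids(word_ids, word_labels))
# [-100, 7, 8, 8, 0, 7, -100, -100]
```

An alternative design propagates the word's label to every subword; labeling only the first piece keeps the number of scored positions equal to the number of original words.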
Data Collation: A data collator was defined to batch the data efficiently during training and evaluation.
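The source does not name the collator; in Transformers the usual choice for this task is DataCollatorForTokenClassification, which pads each batch to its longest sequence and pads labels with -100. The plain-Python sketch below imitates that behavior on lists so the effect is visible. The function `pad_batch` and the token ids in the example are made up for illustration; 0 is the [PAD] id in the distilbert-base-uncased vocabulary.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded positions get -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two examples of different lengths (token ids are illustrative only).
features = [
    {"input_ids": [101, 11, 12, 13, 102], "labels": [-100, 7, 8, 8, -100]},
    {"input_ids": [101, 21, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])       # [101, 21, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(batch["labels"][1])          # [-100, 0, -100, -100, -100]
```

With a batch size of 1 per device, as used in this work, every batch contains a single sequence, so no padding is actually added.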
Training Configuration: The TrainingArguments class was used to set up the training configuration, including the learning rate, batch size, number of epochs, and strategies for evaluation and model saving.
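The values listed under Hyperparameters can be collected as keyword arguments for TrainingArguments. The sketch below keeps them in a plain dictionary so it runs without Transformers installed. Two points are assumptions: the evaluation setting is spelled `eval_strategy` in recent Transformers releases and `evaluation_strategy` in older ones, and the maximum sequence length is assumed to be applied at tokenization time rather than through TrainingArguments. The helper `total_optimizer_steps` is hypothetical and assumes no gradient accumulation.

```python
import math

# Keys follow TrainingArguments keyword names; values come from this report.
training_config = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "num_train_epochs": 2,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # "evaluation_strategy" in older releases
    "save_strategy": "epoch",
}

# Assumption: enforced by the tokenizer (truncation), not by TrainingArguments.
MAX_SEQUENCE_LENGTH = 128


def total_optimizer_steps(num_examples, config, num_devices=1):
    """Number of parameter updates, assuming no gradient accumulation."""
    examples_per_step = config["per_device_train_batch_size"] * num_devices
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    return steps_per_epoch * config["num_train_epochs"]


# With a batch size of 1, each example is its own update step.
# The figure of 1000 examples is illustrative, not the dataset size.
print(total_optimizer_steps(1000, training_config))  # 2000
```

The dictionary could then be unpacked as `TrainingArguments(output_dir=..., **training_config)`, with an output directory of the user's choosing.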
Evaluation Metrics: A custom evaluation function compute_metrics was defined to calculate precision, recall, F1 score, and accuracy using the predictions and true labels.
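The source does not specify whether these scores were computed per token or per entity, nor how they were averaged. As one plausible reading, the sketch below computes entity-level, micro-averaged precision, recall, and F1 (an entity counts as correct only if its type and both boundaries match) together with token-level accuracy. The function names are hypothetical, the inputs are assumed to be tag strings from which positions labeled -100 have already been removed, and stray I- tags are ignored, which is stricter than some evaluation libraries.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def compute_scores(true_seqs, pred_seqs):
    """Entity-level precision/recall/F1 and token-level accuracy."""
    tp = n_pred = n_true = correct = total = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)  # exact type and boundary match
        n_pred += len(pred_spans)
        n_true += len(true_spans)
        correct += sum(t == p for t, p in zip(true_tags, pred_tags))
        total += len(true_tags)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# The prediction cuts "Empire State Building" one token short but finds "ESB".
true_seqs = [["B-location", "I-location", "I-location", "O", "B-location"]]
pred_seqs = [["B-location", "I-location", "O", "O", "B-location"]]
print(compute_scores(true_seqs, pred_seqs))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.8}
```

The example shows why the two kinds of score can diverge: four of five tokens are tagged correctly, yet only one of the two entities is recovered exactly.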
Model Training and Evaluation: The Trainer class from the Transformers library handled the training and evaluation process. The model was trained on the tokenized training data and evaluated on the tokenized test data.
Hyperparameters:
- Learning Rate: 2e-5
- Batch Size: 1 per device for both training and evaluation
- Number of Epochs: 2
- Weight Decay: 0.01
- Evaluation Strategy: Evaluation at the end of each epoch
- Save Strategy: Save the model at the end of each epoch
- Maximum Sequence Length: 128
Results (%):
- Precision: 93.61
- Recall: 94.39
- F1: 92.75
- Accuracy: 94.8