- Complete the overview of distributed training on WikiText-103
- Reproduce the results on WikiText-103 (comparing the dense model and the MoE)
- Train the model on other datasets
- Pre-train a 1.5B model on the 2024 subset of FineWeb
An implementation of the paper *Mixture of A Million Experts* (Xu Owen He, 2024), by Phan Nhat Huy.
To launch single-node distributed training (replace `N` with the number of GPUs per node):

```bash
torchrun --nproc_per_node=N --nnodes=1 main.py
```
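As a rough sketch, an entry point launched this way typically sets up the process group as below; this is illustrative only and not the actual contents of `main.py` (the placeholder model and the single dummy step are assumptions):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model standing in for the transformer built in main.py.
    model = nn.Linear(256, 256).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # synchronize gradients across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # One dummy training step per rank, just to show the flow.
    x = torch.randn(8, 256, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```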
WikiText-103: a 2.2B-parameter model with 8 layers, 8 heads, model dimension 256, and a 512×512 product-key grid of experts (262,144 experts).
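For context, below is a minimal, self-contained sketch of a PEER-style layer with a 512×512 product-key grid. It is an illustrative reimplementation of the retrieval-plus-single-neuron-expert idea from the paper, not the code used in this repo; all hyperparameter names and defaults are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Illustrative PEER layer: product-key retrieval over single-neuron experts."""

    def __init__(self, dim=256, heads=8, num_experts=512 * 512, top_k=16, dim_key=128):
        super().__init__()
        self.heads, self.top_k = heads, top_k
        self.n_sub = int(math.isqrt(num_experts))        # 512 sub-keys per axis
        assert self.n_sub ** 2 == num_experts

        # One query per head, split into two halves for the product-key lookup.
        self.to_queries = nn.Linear(dim, heads * dim_key * 2, bias=False)
        self.sub_keys = nn.Parameter(torch.randn(2, heads, self.n_sub, dim_key) * 0.02)

        # Each expert is a single hidden neuron: one down- and one up-projection vector.
        self.w_down = nn.Embedding(num_experts, dim)
        self.w_up = nn.Embedding(num_experts, dim)

    def forward(self, x):                                # x: (batch, seq, dim)
        b, n, _ = x.shape
        q = self.to_queries(x).view(b, n, self.heads, 2, -1)

        # Scores against the two sub-key sets: (batch, seq, heads, n_sub) each.
        s1 = torch.einsum('bnhd,hkd->bnhk', q[..., 0, :], self.sub_keys[0])
        s2 = torch.einsum('bnhd,hkd->bnhk', q[..., 1, :], self.sub_keys[1])

        v1, i1 = s1.topk(self.top_k, dim=-1)             # top-k along each grid axis
        v2, i2 = s2.topk(self.top_k, dim=-1)

        # Cartesian product of the two top-k sets -> k*k candidate experts.
        scores = v1.unsqueeze(-1) + v2.unsqueeze(-2)
        ids = i1.unsqueeze(-1) * self.n_sub + i2.unsqueeze(-2)
        scores, ids = scores.flatten(-2), ids.flatten(-2)

        # Keep the overall top-k experts per head and gate them with a softmax.
        scores, pos = scores.topk(self.top_k, dim=-1)
        ids = ids.gather(-1, pos)
        gates = scores.softmax(dim=-1)

        # Single-neuron experts: out = sum_e gate_e * gelu(x . w_down_e) * w_up_e
        hidden = torch.einsum('bnd,bnhkd->bnhk', x, self.w_down(ids))
        return torch.einsum('bnhk,bnhkd->bnd', gates * F.gelu(hidden), self.w_up(ids))
```

With the defaults above, `PEERSketch()(torch.randn(2, 16, 256))` returns a tensor of shape `(2, 16, 256)`, so the layer slots in where a transformer's feed-forward block would normally sit.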
Validation perplexity:

| Method | WikiText-103 Perplexity |
|---|---|
| PEER | 7.19 |
| FFW | in progress |
Citation for the original paper:

```bibtex
@inproceedings{He2024MixtureOA,
  title  = {Mixture of A Million Experts},
  author = {Xu Owen He},
  year   = {2024},
  url    = {https://api.semanticscholar.org/CorpusID:271038610}
}
```
Thanks to lucidrains for the PEER layer implementation: https://github.com/lucidrains/PEER-pytorch