bpe
Here are 77 public repositories matching this topic...
In this repo I will share different topics on anything I want to know in NLP and LLMs.
Updated Sep 16, 2024
Simple-to-use scoring function for arbitrarily tokenized texts.
Updated Sep 12, 2024 - Python
Fast and customizable text tokenization library with BPE and SentencePiece support
Updated Sep 3, 2024 - C++
Fast bare-bones BPE for modern tokenizer training
Updated Aug 29, 2024 - Python
This repository provides a clear, educational implementation of Byte Pair Encoding (BPE) tokenization in plain Python. The focus is on algorithmic understanding, not raw performance.
Updated Aug 28, 2024 - Python
Fast and versatile tokenizer for language models with BPE, Unigram and WordPiece tokenization. Compatible with SentencePiece, Tokenizers, Tiktoken and more.
Updated Oct 1, 2024 - Rust
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Updated Aug 7, 2024 - Python
Zero-dependency implementation of BitNet neural network training and BPE tokenization in C
Updated Aug 3, 2024 - C
Translating Indian Names to Hindi, a sequence-to-sequence modeling task, using character-level conditional language models.
Updated Jul 19, 2024 - Jupyter Notebook
(Python package) Train your own tokenizer based on the BPE algorithm for LLMs (supports regex patterns and special tokens).
Updated Jun 7, 2024 - Jupyter Notebook
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Updated Jun 4, 2024 - Python
String tokenization with Byte Pair Encoding.
Updated May 29, 2024 - TypeScript
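For orientation, the byte-level BPE technique that several of these repositories describe can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code of any listed repository: the function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, there is no regex pre-splitting or special-token handling, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text` (hypothetical helper).

    Returns a dict mapping (left_id, right_id) -> new_id, in learned order.
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so merging gains nothing
            break
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

A short usage example: `merges = train_bpe("aaabdaaabac", 3)` learns three merges, after which `encode("aaabdaaabac", merges)` yields a shorter id sequence that `decode` maps back to the original string. Production tokenizers typically add pre-tokenization, special tokens and much faster pair counting on top of this core loop.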