Improve auto-generated readme with section on tokenizer, fix error in example
xhluca committed Sep 22, 2024
1 parent 8f1ca84 commit e58ae2d
Showing 2 changed files with 27 additions and 1 deletion.
26 changes: 26 additions & 0 deletions bm25s/hf.py
@@ -107,6 +107,32 @@
retriever = BM25HF.load_from_hub("{username}/{repo_name}", token=token)
```
## Tokenizer
If you have saved a `Tokenizer` object alongside the index using the following approach:
```python
from bm25s.hf import TokenizerHF
token = "your_hugging_face_token"
tokenizer = TokenizerHF(corpus=corpus, stopwords="english")
tokenizer.save_vocab_to_hub("{username}/{repo_name}", token=token)
# and the stopwords too
tokenizer.save_stopwords_to_hub("{username}/{repo_name}", token=token)
```
then you can load the tokenizer back with the following code:
```python
from bm25s.hf import TokenizerHF
tokenizer = TokenizerHF(corpus=corpus, stopwords=[])
tokenizer.load_vocab_from_hub("{username}/{repo_name}", token=token)
tokenizer.load_stopwords_from_hub("{username}/{repo_name}", token=token)
```
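For intuition, the stopword filtering that the tokenizer applies can be sketched in plain Python. This is a hypothetical, simplified illustration only (the real `Tokenizer` also handles stemming, splitting rules, and vocabulary ids), and `simple_tokenize` is not part of the bm25s API:

```python
# Hypothetical sketch of stopword-aware tokenization; the actual bm25s
# Tokenizer is more sophisticated (stemming, regex splitting, vocab ids).
def simple_tokenize(text, stopwords):
    # Lowercase, split on whitespace, then drop stopwords.
    tokens = text.lower().split()
    return [t for t in tokens if t not in stopwords]

english_stopwords = {"the", "a", "is", "on"}
print(simple_tokenize("The cat is on a mat", english_stopwords))
# → ['cat', 'mat']
```

Passing `stopwords=[]` when loading (as above) simply means no tokens are filtered until the saved stopword list is fetched from the hub.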
## Stats
This dataset was created using the following data:
2 changes: 1 addition & 1 deletion examples/index_to_hf.py
@@ -55,7 +55,7 @@ def main(user, save_dir="datasets", repo_name="bm25s-scifact-testing", dataset="
tokenizer.save_vocab_to_hub(repo_id=f"{user}/{repo_name}", token=hf_token)

# you can also load the retriever and tokenizer from the hub
- tokenizer_new = bm25s.hf.TokenizerHF(stemmer=stemmer)
+ tokenizer_new = bm25s.hf.TokenizerHF(stemmer=stemmer, stopwords=[])
tokenizer_new.load_vocab_from_hub(repo_id=f"{user}/{repo_name}", token=hf_token)

# You can do the same for stopwords
