Word Representation in Biomedical Domain

This is the repository for the NLP project of the 2023 Imperial Data Science Winter School. In this project, we use the CORD-19 dataset to study word representations in the biomedical domain.

Setup

Create a virtual environment with Python 3.7 or newer:
```
python3 -m venv .venv
```
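
Since the step above asks for Python 3.7 or newer, a quick check of the interpreter version can save a failed install later. The following is only a sketch: `is_supported_version` is a hypothetical helper written for this README, not part of the project.

```shell
# Hypothetical helper: succeed if a "major.minor[.patch]" version is >= 3.7.
is_supported_version() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text before the next dot, if any
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 7 ]; }
}

# Report the version of whichever python3 is first on PATH.
if command -v python3 >/dev/null 2>&1; then
    ver=$(python3 -c 'import platform; print(platform.python_version())')
    if is_supported_version "$ver"; then
        echo "Python $ver is supported"
    else
        echo "Python $ver is older than 3.7"
    fi
else
    echo "python3 not found on PATH"
fi
```

If the check reports an older version, create the virtual environment with a newer interpreter instead.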
Activate the virtual environment:
```
source .venv/bin/activate
```
Install the requirements:
```
pip3 install -r requirements.txt
```

Download the CORD-19 dataset:
```
wget -P ./data/ https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz
tar -xf ./data/document_parses.tar.gz
```
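
Because the `tar` command above has no `-C` option, the archive is unpacked into the current working directory rather than into `./data/`. A minimal sketch for checking the result follows. It assumes the archive unpacks into a `document_parses` directory containing `pdf_json/` and `pmc_json/` subdirectories; that layout is an assumption, and `count_json_files` is a hypothetical helper.

```shell
# Hypothetical helper: count the .json files under a directory tree.
count_json_files() {
    find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# Assumption: the archive unpacks into ./document_parses with
# pdf_json/ and pmc_json/ subdirectories; adjust if your layout differs.
for sub in pdf_json pmc_json; do
    dir="./document_parses/$sub"
    if [ -d "$dir" ]; then
        echo "$sub: $(count_json_files "$dir") files"
    else
        echo "$sub: not found at $dir"
    fi
done
```

If the files should live under `./data/` instead, adding `-C ./data` to the `tar` command is one way to do that. Check which path the notebook reads from before changing it.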

Download the NLTK data:
```
python3 -m nltk.downloader -d ./data/nltk_data all
```
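
The command above stores the NLTK data in a custom directory, which NLTK does not search by default. One way to make it discoverable is the `NLTK_DATA` environment variable, which NLTK reads when building its data search path. This is a sketch, assuming you launch Jupyter from the repository root in the same shell. The notebook may already configure the path itself, in which case this step is unnecessary.

```shell
# Point NLTK at the custom download directory used above.
# NLTK_DATA is the environment variable NLTK consults for extra data paths.
export NLTK_DATA="$PWD/data/nltk_data"

# Optional check that the directory exists before launching the notebook.
if [ -d "$NLTK_DATA" ]; then
    echo "NLTK data directory: $NLTK_DATA"
else
    echo "NLTK data directory not found: $NLTK_DATA"
fi
```

The variable only applies to the current shell session, so set it again (or add it to your shell profile) when you open a new terminal.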

Download the WikiText-103 dataset:
```
wget -P ./data/ https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip ./data/wikitext-103-raw-v1.zip
```

Run the notebook:
```
jupyter notebook
```
