This is the repository for the NLP project of the 2023 Imperial Data Science Winter School. In this project, we use the CORD-19 dataset to study word representations in the biomedical domain.
-
create a virtual environment with Python 3.7 or newer
python3 -m venv .venv
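
Since the project needs Python 3.7 or newer, it can help to confirm which version `python3` resolves to. The following is a minimal sketch, not part of the original setup; the variable name `py_ok` is purely illustrative.

```shell
# Check whether python3 is version 3.7 or newer.
# py_ok is a hypothetical helper variable, not something the project defines.
if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 7) else 1)' 2>/dev/null; then
    py_ok=yes
else
    py_ok=no
fi
echo "Python 3.7+ available: $py_ok"
```

If this prints `no`, install a newer Python before creating the virtual environment.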
-
activate the virtual environment
source .venv/bin/activate
-
install the requirements
pip3 install -r requirements.txt
-
download the CORD-19 dataset
wget -P ./data/ https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz
tar -xf ./data/document_parses.tar.gz
-
download the NLTK data
python3 -m nltk.downloader -d ./data/nltk_data all
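
Because the data is downloaded to `./data/nltk_data` rather than one of NLTK's default locations, NLTK needs to be told where to look. One option is the `NLTK_DATA` environment variable, which NLTK adds to its search path. This is a sketch that assumes the command is run from the repository root; the notebook may already configure the path itself, in which case this step is unnecessary.

```shell
# Assumption: the current directory is the repository root,
# so ./data/nltk_data is the directory used by the download command.
export NLTK_DATA="$PWD/data/nltk_data"
echo "NLTK_DATA=$NLTK_DATA"
```

The variable must be set in the same shell session that later starts Jupyter, otherwise the notebook kernel will not see it.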
-
download the wikitext-103 dataset
wget -P ./data/ https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip ./data/wikitext-103-raw-v1.zip
-
run the notebook
jupyter notebook