lonelyotter/CORD-19_Word_Representation

Word Representation in Biomedical Domain

This is the repository for the NLP project of the 2023 Imperial Data Science Winter School. In this project, we use the CORD-19 dataset to study word representations in the biomedical domain.

Setup

  1. create a virtual environment with Python 3.7 or newer

    python3 -m venv .venv
  2. activate the virtual environment

    source .venv/bin/activate
  3. install the requirements

    pip3 install -r requirements.txt
  4. download the CORD-19 dataset

    wget -P ./data/ https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2021-07-26/document_parses.tar.gz
    tar -xf ./data/document_parses.tar.gz
  5. download the NLTK data

    python3 -m nltk.downloader -d ./data/nltk_data all
  6. download the WikiText-103 dataset

    wget -P ./data/ https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
    unzip ./data/wikitext-103-raw-v1.zip
  7. run the notebook

    jupyter notebook
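
Before opening the notebook, it can help to confirm that the downloads above landed where you expect. The following is a minimal sketch, not part of the repository: the helper names `missing_paths` and `use_local_nltk_data` are hypothetical, and the extracted folder names (`document_parses`, `wikitext-103-raw`) are assumptions about the archive contents, so adjust them to match your layout. Because step 5 stores the NLTK data in a non-default directory, the sketch also sets the `NLTK_DATA` environment variable so that NLTK can find it.

```python
import os
from pathlib import Path

# Assumed locations, relative to the repository root. The extracted folder
# names are assumptions about the archive contents; adjust them if your
# layout differs.
EXPECTED_PATHS = [
    "data/document_parses.tar.gz",
    "data/wikitext-103-raw-v1.zip",
    "data/nltk_data",
    "document_parses",
    "wikitext-103-raw",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


def use_local_nltk_data(root):
    """Point NLTK at <root>/data/nltk_data via the NLTK_DATA variable.

    Call this before `import nltk`, because NLTK reads the variable
    when the package is imported.
    """
    nltk_dir = str(Path(root) / "data" / "nltk_data")
    os.environ["NLTK_DATA"] = nltk_dir
    return nltk_dir


if __name__ == "__main__":
    use_local_nltk_data(".")
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and folders are present.")
```

Run the script from the repository root, or call `use_local_nltk_data(".")` in the first notebook cell before importing `nltk`.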
