DocQA

Ask questions on your documents.

This repo contains various tools for building a document QA app: from your raw text files all the way to a RAG chatbot.

Installation

  • Get the source code

    # clone the repo (with a submodule)
    git clone --recurse-submodules https://github.com/lone17/docqa.git
    cd docqa
  • It is recommended to create a virtual environment

    python -m venv env
    . env/bin/activate
  • First, let's install Marker (following its instructions)

    cd marker
    #  Install ghostscript > 9.55 by following https://ghostscript.readthedocs.io/en/latest/Install.html
    scripts/install/ghostscript_install.sh
    # install other requirements
    cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y
    pip install .
  • Then install docqa

    cd ..
    pip install -e .[dev]
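  • Optionally, verify that the install can be imported (the importable module name docqa is an assumption based on the repo layout)

    python -c "import docqa"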

Demo

This repo contains a demo of the whole pipeline: a QA chatbot on Generative Agents, built from the information in this paper.

For information about the development process, please refer to the technical report.

UI

Try the Demo

From source

To use this app, you need an OpenAI API key.

Before playing with the demo, please populate your key and secrets in the .env file:

OPENAI_API_KEY=...
OPENAI_MODEL=...
OPENAI_SEED=...
WANDB_API_KEY=... # only needed if you want to fine-tune the model and use WandB
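
As a rough sketch of how these variables might be consumed (python-dotenv and the v1 openai Python client here are assumptions, not necessarily what the app uses):

    # sketch: read the .env values before creating the OpenAI client
    import os

    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # picks up OPENAI_API_KEY, OPENAI_MODEL, OPENAI_SEED from .env

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    model = os.getenv("OPENAI_MODEL", "gpt-3.5-turbo")  # fallback value is illustrative
    seed = int(os.getenv("OPENAI_SEED", "0"))            # fallback value is illustrative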

All the scripts for the full pipeline as well as generated artifacts are in the demo folder.

  • create_dataset.py: This script handles the full data processing pipeline:
    • parse the pdf file
    • convert it to markdown
    • chunk the content while preserving the document structure
    • generate question-answer pairs
    • prepare data for the later steps: fine-tuning OpenAI models and adding to vector stores.
  • finetune_openai.py: As the name suggests, this script is used to fine-tune the OpenAI model using the data generated in create_dataset.py.
    • Also includes WandB logging.
  • pipeline.py: Declares the QA pipeline with semantic retrieval using ChromaDB.
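
For a sense of what the semantic retrieval in pipeline.py boils down to, here is a minimal sketch using the ChromaDB client API; the collection name, example chunks, and query are illustrative placeholders, not the repo's actual values.

    import chromadb

    # sketch: an in-memory collection standing in for the demo's vector store
    client = chromadb.Client()
    collection = client.get_or_create_collection("docqa-demo")  # name is illustrative

    # add a couple of section-bounded chunks; ChromaDB embeds them with its default embedding function
    collection.add(
        ids=["chunk-0", "chunk-1"],
        documents=[
            "Generative agents are computational agents that simulate believable human behavior.",
            "The agent architecture combines a memory stream with reflection and planning.",
        ],
    )

    # retrieve the chunks most similar to a question
    results = collection.query(query_texts=["What are generative agents?"], n_results=2)
    print(results["documents"][0])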

The main.py script is the entry point for running the backend app:

python main.py

And to run the front end:

streamlit run frontend.py

Using Docker

Alternatively, you can get the image from Docker Hub.

docker pull el117/docqa
docker run --rm -p 8000:8000 -e OPENAI_API_KEY=<...> el117/docqa

Note that the Docker image does not contain the front end. To run the front end locally:

pip install streamlit
streamlit run frontend.py

Architecture

Data Pipeline

The diagram below describes the data life cycle. Results from each step can be found at docqa/demo/data/generative_agent.

flowchart LR
    subgraph pdf-to-md[PDF to Markdown]
        direction RL
        pdf[PDF] --> raw-md(raw\nmarkdown)
        raw-md --> tidied-md([tidied\nmarkdown])
    end

    subgraph create-dataset[Create Dataset]
        tidied-md --> sections([markdown\nsections])
        sections --> doc-tree([doc\ntree])
        doc-tree --> top-lv([top-level\nsections])
        doc-tree --> chunks([section-bounded\nchunks])
        top-lv --> top-lv-qa([top-level sections\nQA pairs])
        top-lv-qa --> finetune-data([fine-tuning\ndata])
    end


        finetune-data --> lm{{language\nmodel}}

        top-lv-qa --> vector-store[(vector\nstore)]
        chunks ----> vector-store
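
To make the "section-bounded chunks" step concrete, here is a minimal sketch of heading-based markdown chunking; the real logic lives in create_dataset.py and the helper below is purely illustrative.

    import re

    def split_markdown_by_heading(markdown_text: str) -> list[str]:
        """Illustrative: split tidied markdown into one chunk per heading-bounded section."""
        sections, current = [], []
        for line in markdown_text.splitlines():
            # start a new chunk whenever a markdown heading (#, ##, ...) begins
            if re.match(r"^#{1,6}\s", line) and current:
                sections.append("\n".join(current).strip())
                current = []
            current.append(line)
        if current:
            sections.append("\n".join(current).strip())
        return sections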

App

The diagram below describes the app's internal workings, from receiving a question to answering it.

flowchart LR
    query(query) --> emb{{embedding\nmodel}}

    subgraph retriever[SemanticRetriever]
        direction LR
        vector-store[(vector\nstore)]

        emb --> vector-store
        vector-store --> chunks([related\nchunks])
        vector-store --> questions([similar\nquestions])
        questions --> sections([related\nsections])
    end

    sections --> ref([references])
    chunks --> ref

    query --> thresh{similarity > threshold}
    questions --> thresh

    thresh -- true --> answer(((answer &\nreferences)))
    thresh -- false --> answerer

    ref --> prompt(prompt)
    query --> prompt

    subgraph answerer[AnswerGenerator]
        direction LR
        prompt --> llm{{language\nmodel}}
    end

    llm --> answer
    ref --> answer
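
Read as code, the flow above looks roughly like the following; the retriever/generator interfaces, attribute names, and threshold value are assumptions for illustration, not the actual APIs of SemanticRetriever and AnswerGenerator.

    # illustrative sketch of the answering flow; method and attribute names are hypothetical
    def answer_question(query: str, retriever, generator, threshold: float = 0.8):
        # embed the query and search the vector store for similar questions and related chunks
        hits = retriever.retrieve(query)  # hypothetical interface
        references = hits.related_chunks + hits.related_sections

        # if a stored question is similar enough, the diagram short-circuits to an answer
        if hits.best_question_similarity > threshold:
            return hits.stored_answer, references

        # otherwise build a prompt from the query and references and ask the language model
        prompt = (
            "Answer the question using only the references below.\n\n"
            f"Question: {query}\n\nReferences:\n" + "\n".join(references)
        )
        return generator.generate(prompt), references  # hypothetical interface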