
A python program for indexing documents and querying them with expansion over Elasticsearch engine


mohsenMahmoodzadeh/query-expansion-with-elasticsearch


Query Expansion with Elasticsearch & NLTK

This project uses Python, NLTK, and Elasticsearch to perform query expansion over a dataset. The data was crawled from Snopes Fact Checks; the crawler's design and implementation are available in this repository.

Snopes Fact Checks collects rumors and questionable claims of the day. After gathering the data with the crawler, the next step is to index it in a search engine so that it can be retrieved. We use Elasticsearch, which has a large community and builds on the Apache Lucene indexing and search library.

We use query expansion, a technique for improving the quality of search results, and rely on the WordNet database to find semantic relations between words. To simplify access to WordNet, and to tokenize and otherwise preprocess the queries, we use the NLTK Python module.

The general idea behind query expansion is that, for every token in the query, the token and its synonyms are combined with OR, and the resulting per-token groups are combined with AND.
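To make the OR/AND rule concrete, here is a minimal sketch rather than the project's actual code. It assumes a small hand-written synonym table (`TOY_SYNONYMS`, hypothetical) in place of WordNet, and the function name `expand_query` is also made up for illustration. In the project itself the synonyms would come from NLTK's WordNet interface instead.

```python
import re

# Hypothetical stand-in for WordNet lookups; the real project would get
# synonyms from NLTK's WordNet interface instead of a hard-coded table.
TOY_SYNONYMS = {
    "fake": ["false", "bogus"],
    "photo": ["picture", "photograph"],
}


def expand_query(query, synonyms):
    """Build a boolean query string: OR within a token group, AND between groups."""
    tokens = re.findall(r"\w+", query.lower())
    groups = []
    for token in tokens:
        # Keep the original token first, then add each unseen synonym in order.
        alternatives = [token]
        for candidate in synonyms.get(token, []):
            if candidate not in alternatives:
                alternatives.append(candidate)
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)


print(expand_query("Fake photo", TOY_SYNONYMS))
# (fake OR false OR bogus) AND (photo OR picture OR photograph)
```

A token with no known synonyms simply forms a one-element group, so the query is never narrowed by a missing entry.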

Environment

  • Python: 3.7.0
  • Elasticsearch: 7.16.0
  • NLTK: 3.6.7

Installation Guide

Clone the repository:

git clone https://github.com/mohsenMahmoodzadeh/query-expansion-with-elasticsearch.git

Create a virtual environment (to avoid conflicts):

virtualenv -p python3.7 fcquery

# this may vary depending on your shell
. fcquery/bin/activate 

Install the dependencies:

pip install -r requirements.txt

The dataset is accessible from here. Put it in the root directory of your project.

Usage Guide

First, download the Elasticsearch configuration from here and run it according to the installation guide on the website.

After setting up the Elasticsearch service, run the following command to index the data into Elasticsearch:

python create_index.py
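As a rough illustration of what an indexing step like this typically involves, the sketch below turns crawled records into the action dictionaries that the `bulk` helper of the Elasticsearch Python client accepts. It is not taken from `create_index.py`: the index name `fact_checks`, the field names `title` and `claim`, and the function `to_bulk_actions` are all assumptions made for the example.

```python
def to_bulk_actions(records, index_name):
    """Yield one bulk-index action per record, using its position as the document id."""
    for doc_id, record in enumerate(records):
        yield {"_index": index_name, "_id": doc_id, "_source": record}


# Hypothetical records; the real dataset's fields may differ.
sample_records = [
    {"title": "First claim", "claim": "Text of the first claim"},
    {"title": "Second claim", "claim": "Text of the second claim"},
]

actions = list(to_bulk_actions(sample_records, "fact_checks"))
print(actions[0])

# With a running Elasticsearch service, actions like these are usually sent with:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```

Generating actions lazily keeps memory use low for larger datasets; the list conversion here is only for printing.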

After the indexing phase completes, run the script below to query the data, expand the queries, and save the results to the result/ directory.

python search_index.py
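One way an expanded query could be expressed in the Elasticsearch query DSL is as a `bool` query: each token group becomes a `should` clause (OR), and the groups are placed under `must` (AND). The sketch below is an assumption about how this might look, not the contents of `search_index.py`. The field name `claim` and the function `build_bool_query` are hypothetical.

```python
import json


def build_bool_query(groups, field):
    """Map synonym groups to a bool query: OR inside each group, AND across groups."""
    must = []
    for alternatives in groups:
        should = [{"match": {field: term}} for term in alternatives]
        # At least one alternative in the group has to match.
        must.append({"bool": {"should": should, "minimum_should_match": 1}})
    return {"query": {"bool": {"must": must}}}


# Each inner list holds a query token followed by its synonyms.
groups = [["fake", "false"], ["photo", "picture"]]
body = build_bool_query(groups, "claim")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the request body of a search call against the index.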

Future Work

  • Preprocess the data to prepare it for analysis. This can include tasks such as encoding, lowercasing, converting values to numeric types, etc.

  • Analyze the data by applying DSL queries and creating dashboards.

Contributing

Fixes and improvements are more than welcome, so raise an issue or send a PR!
