This project uses Python, NLTK, and Elasticsearch to perform query expansion over a dataset crawled from Snopes fact checks. The crawler's design and implementation are accessible from this repository.
Snopes fact checks cover rumors and questionable claims of the day. After gathering the data via the crawler, we index it into a search engine so it can be used for information retrieval. We use Elasticsearch, which has a large community and builds on the power of the Apache Lucene indexing and search library.
We use query expansion, a technique for improving the quality of search results, and rely on the WordNet database to find semantic relations between words. To simplify working with WordNet, as well as tokenizing the queries and other preprocessing, we use the NLTK Python module.
The general idea behind query expansion here is that, for every token in the query, the synonyms are combined with OR, and the resulting per-token clauses are combined with AND.
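The OR-of-synonyms / AND-of-tokens idea can be sketched as an Elasticsearch bool query built from plain dictionaries (an illustrative sketch: the function name, synonym map, and the `claim` field are assumptions, not the repository's actual code):

```python
# Sketch: build an Elasticsearch bool query that ORs each token with its
# synonyms and ANDs the per-token clauses together. The `field` name is an
# assumption about the index schema.

def expand_query(tokens, synonyms, field="claim"):
    must_clauses = []
    for token in tokens:
        variants = [token] + synonyms.get(token, [])
        # OR over the token and its synonyms: bool/should with
        # minimum_should_match=1 means "at least one variant must match".
        must_clauses.append({
            "bool": {
                "should": [{"match": {field: v}} for v in variants],
                "minimum_should_match": 1,
            }
        })
    # AND across the query tokens: bool/must requires every clause to match.
    return {"query": {"bool": {"must": must_clauses}}}


query = expand_query(["fake", "news"], {"fake": ["bogus", "phony"]})
```

The resulting dictionary can be passed directly as the request body of an Elasticsearch search call.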
- Python: 3.7.0
- Elasticsearch: 7.16.0
- NLTK: 3.6.7
Clone the repository:
git clone https://github.com/mohsenMahmoodzadeh/query-expansion-with-elasticsearch.git
Create a virtual environment (to avoid conflicts):
virtualenv -p python3.7 fcquery
# this may vary depending on your shell
. fcquery/bin/activate
Install the dependencies:
pip install -r requirements.txt
The dataset is accessible from here. Put it in the root directory of the project.
First, download the Elasticsearch distribution from here and run it according to the installation guide on the website.
Once the Elasticsearch service is up, run the following command to index the data into the engine:
python create_index.py
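Conceptually, the indexing step turns crawled fact-check records into bulk actions for Elasticsearch. A hedged sketch (the index name `snopes` and the record fields are assumptions about the dataset, not the script's exact code):

```python
# Sketch: convert crawled records into the action dicts consumed by
# elasticsearch.helpers.bulk(). Index and field names are illustrative.

def bulk_actions(records, index="snopes"):
    for i, rec in enumerate(records):
        yield {
            "_index": index,
            "_id": i,  # stable id so re-runs overwrite instead of duplicating
            "_source": {"claim": rec["claim"], "rating": rec["rating"]},
        }


records = [{"claim": "The moon is made of cheese", "rating": "False"}]
actions = list(bulk_actions(records))
# With a running cluster, something like
#   helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
# would submit them in one round trip.
```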
After the indexing phase completes, run the script below to query the data, expand the queries, and save the results into the result/ directory.
python search_index.py
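The result-saving step might look like the following sketch (the helper name and output file name are assumptions; the script's actual layout may differ):

```python
# Sketch: persist search hits as JSON under the result/ directory,
# mirroring what search_index.py is described to do.
import json
import os


def save_results(query, hits, out_dir="result"):
    os.makedirs(out_dir, exist_ok=True)  # create result/ if it doesn't exist
    path = os.path.join(out_dir, "results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"query": query, "hits": hits}, f, indent=2)
    return path
```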
- Preprocess the data so it is ready for analysis. This can include tasks such as encoding, lowercasing, converting to numeric types, etc.
- Analyze the data by applying DSL queries and creating dashboards.
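A first cut at the planned preprocessing step might look like this (purely illustrative; the record fields are assumptions):

```python
# Sketch of the planned preprocessing: normalize string fields to lowercase
# and coerce numeric-looking strings to integers.

def preprocess(record):
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip().lower()
            if value.isdigit():
                value = int(value)  # e.g. "2021" -> 2021
        out[key] = value
    return out
```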
Fixes and improvements are more than welcome, so raise an issue or send a PR!