Skip to content
This repository has been archived by the owner on Jan 15, 2020. It is now read-only.

DEPRECATED - no longer actively maintained

License

Notifications You must be signed in to change notification settings

nhsuk/elasticsearch-updater

Repository files navigation

DEPRECATED - no longer actively maintained


Elasticsearch Updater

Scheduled service to load data from a URL into Elasticsearch

GitHub Release Greenkeeper badge Build Status Coverage Status

The elasticsearch-updater is a dockerised application that will update an Elasticsearch instance on a regular basis using JSON data from a URL.

The file specified in the JSON_FILE_URL environment variable will be used as the source of the update if it is available, is valid JSON, and if the total count has not dropped by a significant amount as described in CHANGE_THRESHOLD below.

The destination instance is specified in the ES_HOST and ES_PORT variables.

As bespoke code is required to create mappings and perform transforms, Elasticsearch configurations must be provided within the elasticsearch-updater repository. ES configurations are stored on the config/esConfig object, and can be selected with the ES_INDEX parameter. Mappings and transforms for each available index are held in a folder with the same name as the configuration, i.e. profiles/mapping.json.

An ES configuration must provide a type, an idKey, a mapping definition and an optional transform function. The type is the index type used in the mapping, and the idKey identifies the unique id in the data. For the profiles configuration these are gps and choicesId respectively.

The file download and Elasticsearch update will run on startup, then on a daily schedule while the container continues to run.

The time of day defaults to 7am, and can be changed via the UPDATE_SCHEDULE environment variable. The schedule is run using node-schedule which uses a Cron-like syntax. Further details on node-schedule available here Note: the container time is GMT and does not take account of daylight saving, you may need to subtract an hour from the time if it is currently BST.

When updating the Elasticsearch instance the new data will be inserted into a date stamped index and validated against the existing index. Once validation passes the existing index will be deleted and an alias set up to the new index, i.e. profiles_20170629140702 will be aliased to profiles upon successful validation.

Validation will fail if the count of records drops significantly. The allowable drop in record count is controlled by the CHANGE_THRESHOLD environment variable. By default this is set to 0.99 which prevents the data being loaded if the new count is less than 99% of the previous count.

Environment variables

Environment variables are expected to be managed by the environment in which the application is being run. This is best practice as described by twelve-factor.

Variable Description Default Required
NODE_ENV Node environment development
LOG_LEVEL log level Depends on NODE_ENV
JSON_FILE_URL Publicly available URL of JSON data yes
ES_HOST Host name of Elasticsearch server yes
ES_INDEX Elasticsearch configuration to read yes
ES_PORT Port of Elasticsearch server 27017
ES_REPLICAS Number of replicas configured for the index 1
ES_TIMEOUT_SECONDS Maximum time in seconds to wait for response from Elasticsearch 180
ES_SHARDS Number of shards for the index 5
CHANGE_THRESHOLD Factor the data count can change by before erroring 0.99
UPDATE_SCHEDULE Time of day to run the update 0 7 * * * (7 am)

Docker Compose Structure for Deployment and Development

The docker-compose.yml used for development and deployment via Rancher have a similar structure. A stack is run with three elasticsearch-updater images having different configurations pointing at the same Elasticsearch instance.

The convention for environment variables used in the Rancher configuration is to add a suffix to each of the variables in the table above. These are then mapped to the appropriate suffix-less variable in the container, i.e. for the pharmacies container the JSON_FILE_URL_PHARMACIES is mapped to JSON_FILE_URL, ES_HOST_PHARMACIES is mapped to ES_HOST and so on.

Architecture Decision Records

This repo uses Architecture Decision Records to record architectural decisions for this project. They are stored in doc/adr.