Scheduled service to load data from a URL into Elasticsearch
The elasticsearch-updater is a dockerised application that will update an Elasticsearch instance on a regular basis using JSON data from a URL.
The file specified in the `JSON_FILE_URL` environment variable will be used as the source of the update if it is available, is valid JSON, and if the total count has not dropped by a significant amount as described in `CHANGE_THRESHOLD` below. The destination instance is specified in the `ES_HOST` and `ES_PORT` variables.
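
The "is valid JSON" condition can be sketched as below. This is a minimal illustration rather than the repository's code: `parseSourceData` is a hypothetical helper, and it assumes the source document is a JSON array of records.

```javascript
// Hypothetical helper: returns the parsed records, or null when the body
// cannot be used as the source of an update.
// Assumption: the source document is a JSON array of records.
function parseSourceData(body) {
  let data;
  try {
    data = JSON.parse(body);
  } catch (err) {
    return null; // not valid JSON
  }
  return Array.isArray(data) ? data : null;
}
```

Returning `null` instead of throwing lets a caller skip the update and leave the existing index untouched.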
As bespoke code is required to create mappings and perform transforms, Elasticsearch configurations must be provided within the `elasticsearch-updater` repository. ES configurations are stored on the `config/esConfig` object, and can be selected with the `ES_INDEX` parameter. Mappings and transforms for each available index are held in a folder with the same name as the configuration, e.g. `profiles/mapping.json`.
An ES configuration must provide a `type`, an `idKey` and a mapping definition, and may provide an optional `transform` function. The `type` is the index type used in the mapping, and the `idKey` identifies the unique id in the data. For the `profiles` configuration these are `gps` and `choicesId` respectively.
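
As an illustration of the shape described above, a configuration entry might look like the following sketch. Only the `type` and `idKey` values come from this README. The mapping body, the `name` field, the `transform` signature and the `validateEsConfig` helper are assumptions made for the example.

```javascript
// Illustrative shape of an ES configuration entry.
// `type` and `idKey` are the values stated for the profiles configuration;
// the mapping body and transform below are hypothetical placeholders.
const profilesConfig = {
  type: 'gps',
  idKey: 'choicesId',
  mapping: {
    mappings: {
      gps: {
        properties: {
          name: { type: 'text' }, // hypothetical field
        },
      },
    },
  },
  // Optional: applied to each record before indexing (assumed signature).
  transform: (record) => record,
};

// Hypothetical check that a configuration provides everything required.
function validateEsConfig(config) {
  const missing = ['type', 'idKey', 'mapping'].filter((key) => !config[key]);
  if (missing.length > 0) {
    throw new Error(`ES configuration is missing: ${missing.join(', ')}`);
  }
  if (config.transform !== undefined && typeof config.transform !== 'function') {
    throw new Error('transform must be a function when provided');
  }
  return true;
}
```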
The file download and Elasticsearch update will run on startup, then on a daily schedule while the container continues to run.
The time of day defaults to 7am, and can be changed via the `UPDATE_SCHEDULE` environment variable. The schedule is run using node-schedule, which uses a Cron-like syntax. Further details are available in the node-schedule documentation.
Note: the container time is GMT and does not take account of daylight saving, so you may need to subtract an hour from the time if it is currently BST.
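
The BST adjustment can be made concrete with a small sketch. The helper names here are hypothetical and not part of the repository. The default expression is the one listed for `UPDATE_SCHEDULE`.

```javascript
// Default from this README: run daily at 7am (container time, GMT).
const DEFAULT_SCHEDULE = '0 7 * * *';

// Hypothetical helper: build a daily cron expression for a desired UK
// local hour, subtracting an hour when BST is in effect because the
// container clock stays on GMT.
function dailyScheduleForUkHour(localHour, isBst) {
  const gmtHour = (localHour - (isBst ? 1 : 0) + 24) % 24;
  return `0 ${gmtHour} * * *`;
}

// UPDATE_SCHEDULE overrides the default when it is set.
function resolveSchedule(env) {
  return env.UPDATE_SCHEDULE || DEFAULT_SCHEDULE;
}
```

For example, to run at 7am UK time during BST, `dailyScheduleForUkHour(7, true)` gives `0 6 * * *`, which could be supplied as `UPDATE_SCHEDULE`.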
When updating the Elasticsearch instance the new data will be inserted into a date stamped index and validated against the existing index. Once validation passes the existing index will be deleted and an alias set up to the new index, e.g. `profiles_20170629140702` will be aliased to `profiles` upon successful validation.
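
The naming of the date stamped index can be sketched as follows. The format `<alias>_YYYYMMDDHHmmss` is inferred from the example name above, and the use of UTC is an assumption. `timestampedIndexName` is a hypothetical helper.

```javascript
// Hypothetical helper: build a date stamped index name such as
// profiles_20170629140702.
// Assumptions: format is <alias>_YYYYMMDDHHmmss and the time is UTC.
function timestampedIndexName(alias, date) {
  // toISOString() yields e.g. 2017-06-29T14:07:02.000Z; strip the
  // separators and keep the first 14 digits.
  const stamp = date.toISOString().replace(/[-:T]/g, '').slice(0, 14);
  return `${alias}_${stamp}`;
}
```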
Validation will fail if the count of records drops significantly. The allowable drop in record count is controlled by the `CHANGE_THRESHOLD` environment variable. By default this is set to `0.99`, which prevents the data being loaded if the new count is less than 99% of the previous count.
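
The threshold rule amounts to a single comparison. The sketch below uses a hypothetical function name and is not taken from the repository.

```javascript
// Hypothetical helper: the new data is acceptable when its record count
// is at least `threshold` times the previous count (0.99 by default).
function isCountAcceptable(previousCount, newCount, threshold = 0.99) {
  return newCount >= previousCount * threshold;
}
```

With the default threshold and 1000 existing records, a new file containing 995 records would pass and one containing 980 would be rejected.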
Environment variables are expected to be managed by the environment in which the application is being run. This is best practice as described by the twelve-factor app methodology.
Variable | Description | Default | Required |
---|---|---|---|
`NODE_ENV` | Node environment | development | |
`LOG_LEVEL` | Log level | Depends on `NODE_ENV` | |
`JSON_FILE_URL` | Publicly available URL of JSON data | | yes |
`ES_HOST` | Host name of Elasticsearch server | | yes |
`ES_INDEX` | Elasticsearch configuration to read | | yes |
`ES_PORT` | Port of Elasticsearch server | 27017 | |
`ES_REPLICAS` | Number of replicas configured for the index | 1 | |
`ES_TIMEOUT_SECONDS` | Maximum time in seconds to wait for response from Elasticsearch | 180 | |
`ES_SHARDS` | Number of shards for the index | 5 | |
`CHANGE_THRESHOLD` | Factor the data count can change by before erroring | 0.99 | |
`UPDATE_SCHEDULE` | Time of day to run the update | `0 7 * * *` (7 am) | |
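
As a rough sketch of how these variables and defaults could be resolved at startup, consider the following. `loadConfig` and the property names on its result are hypothetical. The defaults and required variables are those listed in the table. `LOG_LEVEL` is left unset when absent, since its default depends on `NODE_ENV` and is not specified here.

```javascript
// Defaults as listed in the table above.
const DEFAULTS = {
  NODE_ENV: 'development',
  ES_PORT: '27017',
  ES_REPLICAS: '1',
  ES_TIMEOUT_SECONDS: '180',
  ES_SHARDS: '5',
  CHANGE_THRESHOLD: '0.99',
  UPDATE_SCHEDULE: '0 7 * * *',
};

// Variables marked as required in the table above.
const REQUIRED = ['JSON_FILE_URL', 'ES_HOST', 'ES_INDEX'];

// Hypothetical loader: validates required variables and applies defaults.
function loadConfig(env) {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  const get = (name) => env[name] || DEFAULTS[name];
  return {
    nodeEnv: get('NODE_ENV'),
    logLevel: env.LOG_LEVEL, // default depends on NODE_ENV, not modelled here
    jsonFileUrl: env.JSON_FILE_URL,
    esHost: env.ES_HOST,
    esIndex: env.ES_INDEX,
    esPort: Number(get('ES_PORT')),
    esReplicas: Number(get('ES_REPLICAS')),
    esTimeoutSeconds: Number(get('ES_TIMEOUT_SECONDS')),
    esShards: Number(get('ES_SHARDS')),
    changeThreshold: Number(get('CHANGE_THRESHOLD')),
    updateSchedule: get('UPDATE_SCHEDULE'),
  };
}
```

In a Node application this would typically be called as `loadConfig(process.env)`.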
The `docker-compose.yml` files used for development and for deployment via Rancher have a similar structure. A stack is run with three `elasticsearch-updater` images having different configurations pointing at the same Elasticsearch instance. The convention for environment variables used in the Rancher configuration is to add a suffix to each of the variables in the table above. These are then mapped to the appropriate suffix-less variable in the container, e.g. for the `pharmacies` container `JSON_FILE_URL_PHARMACIES` is mapped to `JSON_FILE_URL`, `ES_HOST_PHARMACIES` is mapped to `ES_HOST`, and so on.
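
The suffix convention can be illustrated with a short sketch. The actual mapping is part of the Rancher configuration as described above, so the function below is purely a hypothetical model of the naming rule, not code from the repository.

```javascript
// Hypothetical helper illustrating the naming convention only: given the
// stack-level variables and a container name, return the suffix-less
// variables that container would see.
function containerEnvFor(containerName, stackEnv) {
  const tail = `_${containerName.toUpperCase()}`;
  const containerEnv = {};
  for (const [name, value] of Object.entries(stackEnv)) {
    if (name.endsWith(tail)) {
      containerEnv[name.slice(0, -tail.length)] = value;
    }
  }
  return containerEnv;
}
```

Variables carrying another container's suffix are ignored, so each container receives only its own configuration.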
This repo uses Architecture Decision Records to record architectural decisions for this project. They are stored in doc/adr.