Loki-Canary: Add query spot checking and metric count checking #2344

Merged
merged 10 commits into from
Jul 13, 2020
17 changes: 14 additions & 3 deletions cmd/loki-canary/main.go
@@ -39,13 +39,24 @@ func main() {
tls := flag.Bool("tls", false, "Does the loki connection use TLS?")
user := flag.String("user", "", "Loki username")
pass := flag.String("pass", "", "Loki password")
queryTimeout := flag.Duration("query-timeout", 10*time.Second, "How long to wait for a query response from Loki")

interval := flag.Duration("interval", 1000*time.Millisecond, "Duration between log entries")
size := flag.Int("size", 100, "Size in bytes of each log line")
wait := flag.Duration("wait", 60*time.Second, "Duration to wait for log entries before reporting them lost")
pruneInterval := flag.Duration("pruneinterval", 60*time.Second, "Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki")
pruneInterval := flag.Duration("pruneinterval", 60*time.Second, "Frequency to check sent vs received logs, "+
"also the frequency which queries for missing logs will be dispatched to loki, and the frequency spot check queries are run")
buckets := flag.Int("buckets", 10, "Number of buckets in the response_latency histogram")

metricTestInterval := flag.Duration("metric-test-interval", 1*time.Hour, "The interval the metric test query should be run")
metricTestQueryRange := flag.Duration("metric-test-range", 24*time.Hour, "The range value [24h] used in the metric test instant-query."+
" Note: this value is truncated to the running time of the canary until this value is reached")

spotCheckInterval := flag.Duration("spot-check-interval", 15*time.Minute, "Interval that a single result will be kept from sent entries and spot-checked against Loki, "+
"e.g. 15min default one entry every 15 min will be saved and then queried again every 15min until spot-check-max is reached")
spotCheckMax := flag.Duration("spot-check-max", 4*time.Hour, "How far back to check a spot check entry before dropping it")
spotCheckQueryRate := flag.Duration("spot-check-query-rate", 1*time.Minute, "Interval that the canary will query Loki for the current list of all spot check entries")

printVersion := flag.Bool("version", false, "Print this builds version information")

flag.Parse()
@@ -71,8 +82,8 @@ func main() {
defer c.lock.Unlock()

c.writer = writer.NewWriter(os.Stdout, sentChan, *interval, *size)
c.reader = reader.NewReader(os.Stderr, receivedChan, *tls, *addr, *user, *pass, *lName, *lVal, *sName, *sValue)
c.comparator = comparator.NewComparator(os.Stderr, *wait, *pruneInterval, *buckets, sentChan, receivedChan, c.reader, true)
c.reader = reader.NewReader(os.Stderr, receivedChan, *tls, *addr, *user, *pass, *queryTimeout, *lName, *lVal, *sName, *sValue)
c.comparator = comparator.NewComparator(os.Stderr, *wait, *pruneInterval, *spotCheckInterval, *spotCheckMax, *spotCheckQueryRate, *metricTestInterval, *metricTestQueryRange, *interval, *buckets, sentChan, receivedChan, c.reader, true)
}

startCanary()
60 changes: 59 additions & 1 deletion docs/operations/loki-canary.md
@@ -47,6 +47,52 @@ determine if they are truly missing or only missing from the WebSocket. If
missing entries are not found in the direct query, the `missing_entries` counter
is incremented.

### Additional Queries

#### Spot Check

Starting with version 1.6.0, the canary spot checks certain results over time
to make sure they are still present in Loki. This is helpful for testing the
transition of in-memory logs in the ingester to the store, to make sure nothing
is lost.

`-spot-check-interval` and `-spot-check-max` are used to tune this feature.
Every `-spot-check-interval`, the canary pulls a log entry from the stream
and saves it in a separate list. Entries stay in the list until they are older
than `-spot-check-max`.

Every `-spot-check-query-rate`, Loki is queried for each entry in this list and
`loki_canary_spot_check_entries_total` is incremented. If a result
is missing, `loki_canary_spot_check_missing_entries_total` is incremented.

With the defaults of `15m` for `-spot-check-interval` and `4h` for `-spot-check-max`,
after 4 hours of running the canary will have a list of 16 entries, each of
which it queries every minute (the default `-spot-check-query-rate` is `1m`).
Be aware of the query load this can put on Loki if you run a lot of canaries.

#### Metric Test

Starting with version 1.6.0, the canary runs a `count_over_time` metric query to
verify that the rate of logs being stored in Loki corresponds to the rate at
which they are being created by the canary.

`-metric-test-interval` and `-metric-test-range` are used to tune this feature.
By default, every `1h` the canary runs a `count_over_time` instant-query
against Loki for a range of `24h`.

If the canary has not yet run for `-metric-test-range` (`24h`), the query range
is reduced to the amount of time the canary has been running, so that the rate
is calculated only over the time since the canary was started.

The canary calculates what the expected count of logs would be for the range
(also adjusting this based on canary runtime) and compares the expected result with
the actual result returned from Loki. The _difference_ is stored as the value in
the gauge `loki_canary_metric_test_deviation`.

Some deviation is expected: calculating an expected count from the rate and
comparing it with actual query data is imperfect and will lead to a deviation
of a few log entries.

A deviation of more than 3-4 log entries is not expected.

### Control

Loki Canary responds to two endpoints to allow dynamic suspending/resuming of the
@@ -246,14 +292,26 @@ All options:
The label name for this instance of loki-canary to use in the log selector (default "name")
-labelvalue string
The unique label value for this instance of loki-canary to use in the log selector (default "loki-canary")
-metric-test-interval duration
The interval the metric test query should be run (default 1h0m0s)
-metric-test-range duration
The range value [24h] used in the metric test instant-query. Note: this value is truncated to the running time of the canary until this value is reached (default 24h0m0s)
-pass string
Loki password
-port int
Port which loki-canary should expose metrics (default 3500)
-pruneinterval duration
Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki (default 1m0s)
Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki, and the frequency spot check queries are run (default 1m0s)
-query-timeout duration
How long to wait for a query response from Loki (default 10s)
-size int
Size in bytes of each log line (default 100)
-spot-check-interval duration
Interval that a single result will be kept from sent entries and spot-checked against Loki, e.g. 15min default one entry every 15 min will be saved and then queried again every 15min until spot-check-max is reached (default 15m0s)
-spot-check-max duration
How far back to check a spot check entry before dropping it (default 4h0m0s)
-spot-check-query-rate duration
Interval that the canary will query Loki for the current list of all spot check entries (default 1m0s)
-streamname string
The stream name for this instance of loki-canary to use in the log selector (default "stream")
-streamvalue string