moving loki-canary project under loki repo #772

Merged · 19 commits · Jul 17, 2019
4 changes: 3 additions & 1 deletion .gitignore
@@ -11,8 +11,10 @@ cmd/promtail/promtail
cmd/loki/loki-debug
cmd/promtail/promtail-debug
cmd/docker-driver/docker-driver
cmd/loki-canary/loki-canary
/loki
/promtail
/logcli
/loki-canary
dlv
rootfs/
1 change: 1 addition & 0 deletions README.md
@@ -39,6 +39,7 @@ Once you have promtail, Loki, and Grafana running, continue with [our usage docs
- [Promtail](./docs/promtail.md) is an agent which can tail your log files and push them to Loki.
- [Docker Logging Driver](./cmd/docker-driver/README.md) is a docker plugin to send logs directly to Loki from Docker containers.
- [Logcli](./docs/logcli.md) on how to query your logs without Grafana.
- [Loki Canary](./docs/canary/README.md) for monitoring your Loki installation for missing logs.
- [Troubleshooting](./docs/troubleshooting.md) for help around frequent error messages.
- [Usage](./docs/usage.md) for how to set up a Loki datasource in Grafana and query your logs.

4 changes: 4 additions & 0 deletions cmd/loki-canary/Dockerfile
@@ -0,0 +1,4 @@
FROM alpine:3.9
RUN apk add --update --no-cache ca-certificates
ADD loki-canary /usr/bin
ENTRYPOINT [ "/usr/bin/loki-canary" ]
78 changes: 78 additions & 0 deletions cmd/loki-canary/main.go
@@ -0,0 +1,78 @@
package main

import (
	"flag"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"strconv"
	"syscall"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"

	"github.com/grafana/loki/pkg/canary/comparator"
	"github.com/grafana/loki/pkg/canary/reader"
	"github.com/grafana/loki/pkg/canary/writer"
)

func main() {

	lName := flag.String("labelname", "name", "The label name for this instance of loki-canary to use in the log selector")
	lVal := flag.String("labelvalue", "loki-canary", "The unique label value for this instance of loki-canary to use in the log selector")
	port := flag.Int("port", 3500, "Port which loki-canary should expose metrics")
	addr := flag.String("addr", "", "The Loki server URL:Port, e.g. loki:3100")
	tls := flag.Bool("tls", false, "Does the loki connection use TLS?")
	user := flag.String("user", "", "Loki username")
	pass := flag.String("pass", "", "Loki password")

	interval := flag.Duration("interval", 1000*time.Millisecond, "Duration between log entries")
	size := flag.Int("size", 100, "Size in bytes of each log line")
	wait := flag.Duration("wait", 60*time.Second, "Duration to wait for log entries before reporting them lost")
	pruneInterval := flag.Duration("pruneinterval", 60*time.Second, "Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki")
	buckets := flag.Int("buckets", 10, "Number of buckets in the response_latency histogram")
	flag.Parse()

	if *addr == "" {
		_, _ = fmt.Fprintf(os.Stderr, "Must specify a Loki address with -addr\n")
		os.Exit(1)
	}

	sentChan := make(chan time.Time)
	receivedChan := make(chan time.Time)

	// The writer emits log lines, the reader tails them back from Loki over a
	// websocket, and the comparator matches sent timestamps against received ones.
	w := writer.NewWriter(os.Stdout, sentChan, *interval, *size)
	r := reader.NewReader(os.Stderr, receivedChan, *tls, *addr, *user, *pass, *lName, *lVal)
	c := comparator.NewComparator(os.Stderr, *wait, *pruneInterval, *buckets, sentChan, receivedChan, r)

	http.Handle("/metrics", promhttp.Handler())
	go func() {
		err := http.ListenAndServe(":"+strconv.Itoa(*port), nil)
		if err != nil {
			panic(err)
		}
	}()

	// SIGINT suspends the canary without exiting; SIGTERM shuts it down cleanly.
	interrupt := make(chan os.Signal, 1)
	terminate := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)
	signal.Notify(terminate, syscall.SIGTERM)

	for {
		select {
		case <-interrupt:
			_, _ = fmt.Fprintf(os.Stderr, "suspending indefinitely\n")
			w.Stop()
			r.Stop()
			c.Stop()
		case <-terminate:
			_, _ = fmt.Fprintf(os.Stderr, "shutting down\n")
			w.Stop()
			r.Stop()
			c.Stop()
			return
		}
	}

}
108 changes: 108 additions & 0 deletions docs/canary/README.md
@@ -0,0 +1,108 @@

# loki-canary

A standalone app to audit the log capturing performance of Loki.

## how it works

![block_diagram](block.png)

loki-canary writes a log to a file and stores the timestamp in an internal array. The contents look something like this:

```nohighlight
1557935669096040040 ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
```

The relevant part is the timestamp; the `p`'s are just filler bytes to make the size of the log configurable.
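
A rough sketch of how such a line can be produced, in Go (`formatLogLine` is a hypothetical helper for illustration, not the actual writer code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// formatLogLine prefixes the entry with a nanosecond timestamp and pads it
// with 'p' bytes up to the requested total size.
func formatLogLine(t time.Time, size int) string {
	ts := strconv.FormatInt(t.UnixNano(), 10)
	pad := size - len(ts) - 2 // leave room for the separating space and trailing newline
	if pad < 0 {
		pad = 0
	}
	return ts + " " + strings.Repeat("p", pad) + "\n"
}

func main() {
	fmt.Print(formatLogLine(time.Now(), 100))
}
```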

Promtail (or another agent) then reads the log file and ships it to Loki.

Meanwhile, loki-canary opens a websocket connection to Loki and listens for the logs it creates.

When a log is received on the websocket, the timestamp in the log message is compared to the internal array.

If the received log is:

* The next in the array to be received, it is removed from the array and the (current time - log timestamp) is recorded in the `response_latency` histogram; this is the expected behavior for well-behaved logs.
* Not the next in the array to be received, it is removed from the array, the response time is recorded in the `response_latency` histogram, and the `out_of_order_entries` counter is incremented.
* Not in the array at all, it is checked against a separate list of received logs to either increment the `duplicate_entries` counter or the `unexpected_entries` counter (a sketch of this classification logic follows below).
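
A minimal sketch of that classification logic in Go, assuming a mutex-guarded comparator holding an `entries` slice (sent but unseen, oldest first) and a `received` slice (already matched); the names and metric definitions are illustrative, not the actual comparator code:

```go
package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative stand-ins for the metrics described above (registration omitted).
var (
	responseLatency   = prometheus.NewHistogram(prometheus.HistogramOpts{Name: "response_latency", Help: "Time from send to receive."})
	outOfOrderEntries = prometheus.NewCounter(prometheus.CounterOpts{Name: "out_of_order_entries", Help: "Entries received out of order."})
	duplicateEntries  = prometheus.NewCounter(prometheus.CounterOpts{Name: "duplicate_entries", Help: "Entries received more than once."})
	unexpectedEntries = prometheus.NewCounter(prometheus.CounterOpts{Name: "unexpected_entries", Help: "Entries this canary never sent."})
)

type comparator struct {
	mtx      sync.Mutex
	entries  []time.Time // sent, not yet seen on the websocket (oldest first)
	received []time.Time // already matched, kept to detect duplicates
}

func (c *comparator) entryReceived(ts time.Time) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	for i, e := range c.entries {
		if !e.Equal(ts) {
			continue
		}
		responseLatency.Observe(time.Since(ts).Seconds())
		if i != 0 {
			outOfOrderEntries.Inc() // found, but not the oldest outstanding entry
		}
		c.entries = append(c.entries[:i], c.entries[i+1:]...)
		c.received = append(c.received, ts)
		return
	}
	// Not outstanding at all: either a duplicate of something already
	// received, or an entry this canary never sent.
	for _, r := range c.received {
		if r.Equal(ts) {
			duplicateEntries.Inc()
			return
		}
	}
	unexpectedEntries.Inc()
}

func main() {
	c := &comparator{}
	c.entryReceived(time.Now()) // never sent, so this increments unexpected_entries
}
```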

In the background, loki-canary also runs a timer which iterates through all the entries in the internal array. If any are older than the duration specified by the `-wait` flag (default 60s), they are removed from the array and the `websocket_missing_entries` counter is incremented. An additional query is then made directly to Loki for these missing entries, to determine whether they were actually lost or just didn't make it down the websocket. If they are not found in the follow-up query, the `missing_entries` counter is incremented.
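
Under the same illustrative assumptions as the sketch above (not the real implementation), the pruning pass might look like this; `query` stands in for the follow-up query made directly to Loki:

```go
package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	websocketMissingEntries = prometheus.NewCounter(prometheus.CounterOpts{Name: "websocket_missing_entries", Help: "Entries not seen on the websocket within -wait."})
	missingEntries          = prometheus.NewCounter(prometheus.CounterOpts{Name: "missing_entries", Help: "Entries the follow-up query could not find either."})
)

type comparator struct {
	mtx     sync.Mutex
	entries []time.Time // sent but not yet received
}

// prune runs every -pruneinterval: entries older than wait are declared
// missing from the websocket and re-queried directly against Loki; anything
// the query cannot find is counted as truly missing.
func (c *comparator) prune(wait time.Duration, query func([]time.Time) []time.Time) {
	c.mtx.Lock()
	var missing, keep []time.Time
	for _, e := range c.entries {
		if time.Since(e) > wait {
			missing = append(missing, e)
		} else {
			keep = append(keep, e)
		}
	}
	c.entries = keep
	c.mtx.Unlock()

	websocketMissingEntries.Add(float64(len(missing)))
	found := query(missing)
	missingEntries.Add(float64(len(missing) - len(found)))
}

func main() {
	c := &comparator{entries: []time.Time{time.Now().Add(-2 * time.Minute)}}
	c.prune(time.Minute, func([]time.Time) []time.Time { return nil })
}
```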

## building and running

`make` will run tests and build a docker image

`make build` will create a binary `loki-canary` alongside the makefile

To run the image, you can do something simple like:

`kubectl run loki-canary --generator=run-pod/v1 --image=grafana/loki-canary:latest --restart=Never --image-pull-policy=Never --labels=name=loki-canary -- -addr=loki:3100`

Or you can do something more complex, like deploying it as a daemonset; there is a ksonnet setup for this in the `production` folder, which you can import using jsonnet-bundler:

```shell
jb install github.com/grafana/loki-canary/production/ksonnet/loki-canary
```

Then, in your ksonnet environment's `main.jsonnet`, you'll want something like this:

```nohighlight
local loki_canary = import 'loki-canary/loki-canary.libsonnet';

loki_canary {
  loki_canary_args+:: {
    addr: "loki:3100",
    port: 80,
    labelname: "instance",
    interval: "100ms",
    size: 1024,
    wait: "3m",
  },
  _config+:: {
    namespace: "default",
  }
}

```

## config

You are required to pass the Loki address with the `-addr` flag; if your server uses TLS, also pass `-tls=true` (this will create a `wss://` connection instead of `ws://`).

You should also pass the `-labelname` and `-labelvalue` flags. These are used by loki-canary to filter the log stream so that it only processes logs for this instance of loki-canary, so they must be unique for each of your loki-canary instances; the ksonnet config in this project accomplishes this by passing in the pod name as the labelvalue.
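
For example, a typical invocation might look like the following (the flag values are illustrative; any label value unique to the instance works):

```shell
loki-canary -addr=loki:3100 -tls=true -labelname=name -labelvalue=$(hostname)
```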

If you get a high number of `unexpected_entries`, you may not be waiting long enough and should increase `-wait` from 60s to something larger.

__Be cognizant__ of the relationship between `pruneinterval` and the `interval`. For example, with an interval of 10ms (100 logs per second) and a prune interval of 60s, you will write 6000 logs per minute. If those logs are not received over the websocket, the canary will attempt to query Loki directly to see if they are completely lost. __However__, the query return is limited to 1000 results, so you will not be able to return all the logs even if they did make it to Loki.
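
Spelling out the arithmetic in that example:

```nohighlight
entries per prune pass = pruneinterval / interval = 60s / 10ms = 6000
6000 > 1000 (query result limit), so up to 5000 entries per pass cannot be verified
```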

__Likewise__, if you lower the `pruneinterval` you risk causing a denial-of-service attack, as all of your canaries will attempt to query for missing logs at whatever frequency your `pruneinterval` defines.

All options:

```nohighlight
  -addr string
        The Loki server URL:Port, e.g. loki:3100
  -buckets int
        Number of buckets in the response_latency histogram (default 10)
  -interval duration
        Duration between log entries (default 1s)
  -labelname string
        The label name for this instance of loki-canary to use in the log selector (default "name")
  -labelvalue string
        The unique label value for this instance of loki-canary to use in the log selector (default "loki-canary")
  -pass string
        Loki password
  -port int
        Port which loki-canary should expose metrics (default 3500)
  -pruneinterval duration
        Frequency to check sent vs received logs, also the frequency which queries for missing logs will be dispatched to loki (default 1m0s)
  -size int
        Size in bytes of each log line (default 100)
  -tls
        Does the loki connection use TLS?
  -user string
        Loki username
  -wait duration
        Duration to wait for log entries before reporting them lost (default 1m0s)
```
Binary file added docs/canary/block.png