Support configurable Zstd levels #81
Conversation
```yaml
      level: 10                   # describes gRPC-level compression level (default 5)
      payload_compression: zstd   # describes Arrow-IPC compression (default "none")
    otelarrow/fastest:
      compression: zstd
```
Random question - curious what would happen if `compression` (the gRPC level) was not zstd but `payload_compression` used zstd? i.e., should these compression methods always match?
These two settings are independent. The Arrow library exposes only a Zstd option, and it's OK to "double compress". It sounds funny, but we've seen the double compression outperform either by itself, so we are leaving the option open to experiment. However, now that Zstd levels are configurable, I believe the gRPC compression will start to be successful on its own -- this deserves to be validated before we change defaults.
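For illustration, a configuration enabling both layers at once ("double compression") might look like the following sketch; the exporter name and key nesting are assumptions based on the snippet above, not verified against the component's schema:

```yaml
exporters:
  otelarrow:
    compression: zstd              # gRPC-level compression
    arrow:
      payload_compression: zstd    # Arrow-IPC compression, independent of the gRPC setting
```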
```go
	defer staticInstances.lock.Unlock()

	for level := MinLevel; level <= MaxLevel; level++ {
		updateOne(&staticInstances.byLevel[level].dec)
```
Hmm, in `SetEncoderCfg(cfg EncoderConfig)` we only do `updateOne(&staticInstances.byLevel[cfg.Level].enc)`, but here for the decoder we are updating the config for all levels. Why update all levels for the decoder config?
Fine question! I have gone both ways on this question during development. Firstly, encoders have levels, but decoders only have limits.
As structured, the decoder level used in this code is an artifact of allowing the encoder to configure levels, because the name of the PAIR changes when the encoder level changes.
Given that the notion of decoder level serves only to match the decoder limits with encoder limits, it's worth asking whether per-level limits would be useful. Imagine configuring permissive limits for high-level compression and smaller limits for lower compression levels. This makes sense, but it might not protect the user in a hostile environment. The goal of these limits is to protect the user, so having one limit is just easier.
I originally coded multiple limits, then reasoned through what the user would see. The OTel-Arrow receiver component would be able to configure different limits for every level, so would have a slice of decoder configs. This would make it hard to simply configure a limit, because you'd have to configure them all.
The code could be restructured to not have 10 distinct decoders, or it could be extended back in the other direction, to allow distinct decoder configurations. For now, for simplicity, I chose this middle ground.
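The middle ground described above can be sketched as follows. This is an illustration only: the type names, field names, and the memory-limit field are assumptions, not the actual code. The encoder setter touches a single level, while the decoder setter fans the one shared limit out to every level:

```go
package main

import (
	"fmt"
	"sync"
)

const (
	MinLevel = 1
	MaxLevel = 10
)

// EncoderConfig and DecoderConfig are hypothetical stand-ins for the
// real configuration types.
type EncoderConfig struct {
	Level int
}

type DecoderConfig struct {
	MemoryLimitMiB uint32
}

type instance struct {
	enc EncoderConfig
	dec DecoderConfig
}

// One global registry, guarded by a mutex because gRPC compressor
// registration is process-wide.
var staticInstances = struct {
	sync.Mutex
	byLevel [MaxLevel + 1]instance
}{}

// SetEncoderConfig updates only the one level being configured.
func SetEncoderConfig(cfg EncoderConfig) {
	staticInstances.Lock()
	defer staticInstances.Unlock()
	staticInstances.byLevel[cfg.Level].enc = cfg
}

// SetDecoderConfig applies the single limit to every level, since the
// decoder "level" exists only to pair it with the encoder's name.
func SetDecoderConfig(cfg DecoderConfig) {
	staticInstances.Lock()
	defer staticInstances.Unlock()
	for level := MinLevel; level <= MaxLevel; level++ {
		staticInstances.byLevel[level].dec = cfg
	}
}

func main() {
	SetEncoderConfig(EncoderConfig{Level: 5})
	SetDecoderConfig(DecoderConfig{MemoryLimitMiB: 128})
	fmt.Println(staticInstances.byLevel[5].enc.Level)          // prints 5
	fmt.Println(staticInstances.byLevel[1].dec.MemoryLimitMiB) // prints 128
}
```

The asymmetry is the point: encoders carry levels, decoders carry only a protective limit, so a single decoder configuration covers all ten named pairs.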
- `level`: in the range 1-10, determines a number of defaults (default 5)
- `window_size_mib`: size of the Zstd window in MiB; 0 indicates to determine it based on level (default 0)
- `concurrency`: controls background CPU used for compression; 0 indicates to let the `zstd` library decide (default 1)
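Putting the three settings together, a sketch of how they might appear in configuration (the `zstd:` key placement is an assumption drawn from the option list above):

```yaml
zstd:
  level: 6            # 1-10; higher trades CPU for better compression
  window_size_mib: 8  # 0 = derive from level
  concurrency: 1      # 0 = let the zstd library decide
```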
The exporter supports configuring compression at the [Arrow columnar-protocol level](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).

- `payload_compression`: compression applied at the Arrow IPC level, "none" by default, "zstd" supported.
Hmm, I thought we used zstd compression at the Arrow level by default.
I have evidence that by configuring the Zstd level and window size in the encoder, the payload compression is not actually helping very much. All of our earlier experiments with gRPC-level Zstd compression had inadvertently used a relatively small window size, which defeated the gRPC-level compression and made Arrow-level payload compression relatively important.
Coincidentally, I audited the Arrow code that sets up Zstd and recognized that it uses single-record encode and decode operations: no Zstd streams are being built up, and there is also no configurability. I wonder whether the Arrow architects have considered using Zstd streams within the Arrow IPC streams, but it seems not to be that way presently.
Since #82 merged, I will repeat the experiment that led to this conclusion.
Sounds good.
Confirming that for the 30GB uncompressed data sample you and I have been using, @lquerel, for a single stream lasting the entire input (with #82), single-zstd at the gRPC level outperforms double-zstd at various levels.
For the uncompressed size 2.9875869937e+10, the results:

| Configuration | Compressed size | Ratio |
|---|---|---|
| single-zstd, level 10 | 2.540397625e+09 | 0.085 |
| double-zstd, level 10 | 2.663619163e+09 | 0.089 |
| single-zstd, level 8 | 2.627058159e+09 | 0.088 |
| single-zstd, level 5 | 2.701691178e+09 | 0.090 |
| double-zstd, level 5 | 2.870220537e+09 | 0.096 |
Part of #35.
This is somewhat complex because gRPC supports only static registration of compression objects, while collectors naturally let you provide per-component configuration. Per-component compression levels are supported with a caveat, as follows:
There are 10 distinct compression levels, 1 through 10. Any one of them can be configured through any component, but there is only one global encoder configuration permitted per level, and the decoder configuration is singular (to keep it simple; it could be per level as well, but I didn't go that way).
Updated README to explain the caveat.
Note: the implementation in `mru.go` is derived from a production-tested implementation used internally. I've edited it for minor details, added a `Reset` method, and written new tests.

These configurations are independent of the OTel-collector-wide compression setting controlled by https://github.com/mostynb/go-grpc-compression. Unlike that library, the new support is thread-safe, and reconfiguration at runtime resets the MRU cache.
The names used by these compression settings are "zstdarrow1" through "zstdarrow10".
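A minimal sketch of the naming scheme; the helper function is hypothetical, and only the "zstdarrow" prefix and the 1-10 level range come from the text above:

```go
package main

import "fmt"

const (
	minLevel = 1
	maxLevel = 10
)

// compressorName builds the per-level gRPC compressor name,
// e.g. "zstdarrow1" through "zstdarrow10". A client or server
// selects one of these names to pick a specific Zstd level.
func compressorName(level int) string {
	return fmt.Sprintf("zstdarrow%d", level)
}

func main() {
	for level := minLevel; level <= maxLevel; level++ {
		fmt.Println(compressorName(level))
	}
}
```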