Poor compression & quality for difficult-to-compress data #87

Open · lindstro opened this issue Mar 25, 2022 · 17 comments
@lindstro

I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map xᵢ₊₁ = 4xᵢ(1 − xᵢ):

#include <cstdio>

int main()
{
  // iterate the chaotic logistic map and stream each double to stdout
  double x = 1. / 3;
  for (int i = 0; i < 256 * 256 * 256; i++) {
    fwrite(&x, sizeof(x), 1, stdout);
    x = 4 * x * (1 - x);
  }
  return 0;
}

We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between (RMS) error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as

α = log₂(σ / E) - R.

Each one-bit increase in rate, R, should then halve E, so that α is essentially constant. The limiting behavior differs slightly as R → 0 or E → 0, but over a large range α ought to be constant.
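For reference, a minimal sketch of how α can be computed from the original and reconstructed arrays together with the compressed size (assuming R is measured in bits per value):

```cpp
// Sketch: compute the rate R (bits/value), RMS error E, and accuracy gain
// alpha = log2(sigma / E) - R for an original/reconstructed pair.
#include <cmath>
#include <cstddef>

double accuracy_gain(const double* orig, const double* recon, size_t n,
                     size_t compressed_bytes)
{
  double mean = 0;
  for (size_t i = 0; i < n; i++)
    mean += orig[i];
  mean /= n;

  double var = 0, mse = 0;
  for (size_t i = 0; i < n; i++) {
    var += (orig[i] - mean) * (orig[i] - mean);
    mse += (orig[i] - recon[i]) * (orig[i] - recon[i]);
  }
  double sigma = std::sqrt(var / n);      // ~1/sqrt(8) for the logistic-map data
  double E = std::sqrt(mse / n);          // RMS error
  double R = 8.0 * compressed_bytes / n;  // bits per value
  return std::log2(sigma / E) - R;
}
```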

Below is a plot of α(R) for SZ 2.1.12.3 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here SZ's absolute error tolerance mode was used: sz -d -3 256 256 256 -M ABS -A tolerance -i input.bin -z output.sz. The tolerance was halved for each subsequent data point, starting with tolerance = 1.

The plot suggests an odd relationship between R and E, with very poor compression observed for small tolerances. For instance, when the tolerance is in {2⁻¹³, 2⁻¹⁴, 2⁻¹⁵, 2⁻¹⁶}, the corresponding rates are {13.9, 15.3, 18.2, 30.8}, whereas we would expect R to increase by only one bit in each step. Is this perhaps a bug in SZ? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).

[Figure: accuracy gain α(R) for the logistic-map data, comparing SZ with other compressors]

@disheng222 (Collaborator)

Hi Peter,
For the difficult-to-compress cases (or with a fairly low error bound), I suggest increasing the number of quantization bins. To this end, you can modify sz.config as follows:
max_quant_intervals = 67108864
(The default value was 65536)
Then, use the following command to compress the dataset again:
sz -c sz.config -z -d -i chaotic_data.dat -M ABS -A tolerance -3 256 256 256

I tested that double-type dataset (its values look very random, so the compression ratio is low).

| ABS error bound | bit rate | PSNR (dB) |
| --- | --- | --- |
| 2^-14 | 13.9 | 83.043196 |
| 2^-15 | 15.3 | 89.058258 |
| 2^-16 | 16.5 | 95.080285 |
| 2^-17 | 17.9 | 101.100309 |
| 2^-18 | 22.4 | 113.142963 |
| 2^-19 | 26.3 | 119.162523 |
| 2^-20 | 32.6 | 125.183684 |

I think this result makes more sense.
Increasing max_quant_intervals can mitigate the issue. Note that max_quant_intervals is stored as an integer, so there is an upper limit on this setting; once the error bound is small enough, the bit rate will jump again.
Hope this answer is helpful to your research.

Best,
Sheng

@lindstro (Author)

Sheng, thanks for the suggestion. As someone who has not played around with SZ parameter settings much, is there a way to algorithmically choose the number of quantization bins, or is this basically a trial and error process that one has to go through for each data set?

As evidenced by the table you listed, the rate starts climbing again for the last few tolerances (ideally, the rate would not increase by more than one bit per halved tolerance), which suggests to me that one would then have to increase the number of quantization bins even further. Is there a limit on the number of bins? Should it be a power of two?

I imagine that increasing the number of bins may adversely affect the compression ratio for large tolerances, i.e., there may be no single setting that works well for a wide range of tolerances. Otherwise, would there not be a better default setting?

Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

@robertu94 (Collaborator) commented Mar 25, 2022

I'll leave the rest to Sheng.

> Finally, is there a way to set this parameter without using an sz.config file? It is a bit cumbersome to have to generate such a text file when using the SZ command-line utility in shell scripts.

@lindstro Not with the sz command line, but the libpressio command line from my libpressio-tools spack package provides a way to do that for SZ:

git clone https://github.com/robertu94/spack_packages robertu94_packages
spack repo add ./robertu94_packages
spack install libpressio-tools ^ libpressio+sz+zfp
#if your compiler is older (pre c++17) you might need this instead
spack install libpressio-tools ^ libpressio+sz+zfp ^ libstdcompat+boost
spack load libpressio-tools

# doesn't matter here, but dims are in fortran order for all compressors; last 3 args print metrics
pressio -i ./chaotic_data.dat -t double -d 256 -d 256 -d 256 -b compressor=sz -o sz:max_quant_intervals=67108864 -o pressio:abs=1e-14 -m time -m size -M all

#print help for a compressor
pressio -b compressor=sz -a help

@lindstro (Author)

@robertu94 Thanks for pointing that out. That is very convenient and another reason why I need to take a closer look at libpressio.

@disheng222 (Collaborator)

Hi Peter,
As for the sharp increase in bit rate at an accuracy of 2^-20, that is because of the Huffman tree overhead.
I checked the content of the compressed data. Basically, when the accuracy is 2^-12 to 2^-16, the bulk of the compressed data consists of the Huffman-encoded bytes and the Huffman tree overhead is tiny. However, when the accuracy is 2^-20, the Huffman tree itself is about 1.5X as large as the Huffman-encoded bytes. In fact, SZ does not try to store the Huffman tree as compactly as possible, because SZ mainly targets lossy compression use cases and we did not design for extremely small error bounds such as 2^-20. That is, if the error bound is very small, or if the original data file is small (e.g., around 1 MB), the Huffman tree overhead is not negligible, and we should study how to minimize it.

To check the detailed component of the compressed data in SZ, you can do the following things:
./configure --enable-writestats
make

Then, when you use the sz command, you can add the option -q to print stats such as the Huffman tree size and the size of the Huffman-encoded bytes.
BTW, you can use -p to print the actual number of quantization bins used in the compression, e.g., sz -p -s chaotic_data.dat.sz.
If "actual used # intervals" doesn't reach 67108864, the quantization bin count is large enough to cover all the predicted values. BTW, predThreshold in sz.config is best set to 0.999 or larger to make sure more data points are covered by the quantization bins. All of the above parameters are also supported by LibPressio.

Best,
Sheng

@robertu94 (Collaborator)

@lindstro to elaborate on this:

> BTW, you can use -p to print the actual number of quantization bins used in the compression, e.g., sz -p -s chaotic_data.dat.sz.

LibPressio's CLI is capable of outputting this information as well. It will automatically detect that you compiled sz+stats and add these metrics to the output above. The entire install command then would be:

```
spack install libpressio-tools ^ libpressio+sz+zfp ^ sz+stats
```

@lindstro (Author)

> In fact, SZ does not try to store the Huffman tree as compactly as possible, because SZ mainly targets lossy compression use cases and we did not design for extremely small error bounds such as 2^-20.

Thanks for the explanation. I would argue that a tolerance of 2⁻²⁰ ≈ 10⁻⁶ for data in (0, 1) is not all that low; it provides less accuracy than single precision. But it is good to know that this limitation is not unique to this data set, even if such difficult-to-compress data may require a larger number of quantization bins.

Just to make sure I fully understand: using 2ⁿ bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). But for very large residuals (or small n) outside the range of available bins, SZ may have to record the residual as a full floating-point value, which generally is more expensive. And for random data, SZ is likely to give poor predictions that result in frequent, large residuals. Do I have that right?

> To check the detailed component of the compressed data in SZ, you can do the following things: ./configure --enable-writestats; make

That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

> BTW, you can use -p to print the actual number of quantization bins used in the compression, e.g., sz -p -s chaotic_data.dat.sz. If "actual used # intervals" doesn't reach 67108864, the quantization bin count is large enough to cover all the predicted values.

I see. In this case, should one rerun with the number of bins reported?

@robertu94 (Collaborator)

Leaving the rest to @disheng222

> That seems handy. Is there an analogous setting for CMake builds? cmake -L does not reveal anything obvious.

@lindstro the corresponding setting is BUILD_STATS:BOOL=ON

That is what libpressio uses when it builds with sz+stats.

@lindstro (Author)

> @lindstro the corresponding setting is BUILD_STATS:BOOL=ON

Doh! Not sure how I missed that. Thanks.

@disheng222 (Collaborator)

@lindstro
"I would argue that a tolerance of 2-20 ≈ 10-6 for data in (0, 1) is not all that low"
Yes. But the chaotic dataset you are using here is essentially a random data, so the prediction in SZ works very inefficiently. This is why we need to use a larger number of quantization bins to cover it. This situation is not expected by SZ. :-)

"Just to make sure I fully understand, using 2^n bins allows SZ to encode residuals in roughly n bits (depending on entropy, of course). ......."
This understanding is basically correct. Please kindly note that the SZ significantly depends on the smoothness of the data ('cause it uses prediction method and encode the residuals). By the 'entropy' here, if you mean the entropy of the residuals, we can say that the compression results depend on it (instead of entropy of original data).
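To make the mechanism concrete, here is an illustrative sketch of linear-scaling quantization (a simplified picture, not SZ's actual code): the prediction residual is mapped to an integer bin of width twice the tolerance, and a value whose bin index falls outside the available range has to be stored separately as an "unpredictable" literal.

```cpp
// Illustrative linear-scaling quantization (not SZ's implementation).
#include <cmath>
#include <cstdlib>

// Map a prediction residual to a small integer code, or report that the value
// is "unpredictable" and must be stored verbatim. num_bins plays the role of
// max_quant_intervals; the bins are centered on the prediction.
bool quantize(double predicted, double value, double tol, long num_bins, long* code)
{
  long q = std::lround((value - predicted) / (2 * tol)); // bin width 2*tol keeps the error <= tol
  if (std::labs(q) >= num_bins / 2)
    return false;  // residual too large for the available bins
  *code = q;       // small, frequently repeated codes compress well with Huffman coding
  return true;
}
```

With random data the residuals are large and spread over many bins, so either the bin count must grow (enlarging the Huffman tree) or many values fall through as unpredictable.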

You can use -q or -p or both in your compression operation as follows:

[sdi@localhost example]$ sz -p -q -z -d -c sz.config -i chaotic_data.dat -M ABS -A 1E-5 -3 256 256 256
===============stats about sz================
Constant data? : NO
use_mean: YES
blockSize 6
lorenzoPercent 0.955958
regressionPercent 0.044042
lorenzoBlocks 70825
regressionBlocks 3263
totalBlocks 74088
huffmanTreeSize 7792474
huffmanCodingSize 36967490
huffmanCompressionRatio 1.499306
huffmanNodeCount 599421
unpredictCount 0
unpredictPercent 0.000000
compression time = 1.063208
compressed data file: chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010
[sdi@localhost example]$
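A quick back-of-the-envelope calculation on the -q stats printed above (ABS = 1E-5) puts the overhead in perspective:

```cpp
// Arithmetic on the -q stats above (ABS = 1E-5).
#include <cstdio>

int main()
{
  const double huffmanTreeSize   = 7792474;   // bytes
  const double huffmanCodingSize = 36967490;  // bytes
  const double huffmanNodeCount  = 599421;
  std::printf("tree / coded bytes  = %.2f\n", huffmanTreeSize / huffmanCodingSize); // ~0.21
  std::printf("bytes per tree node = %.1f\n", huffmanTreeSize / huffmanNodeCount);  // ~13
  return 0;
}
```

So even at this moderate tolerance the tree is already about a fifth of the Huffman payload, consistent with the observation above that at 2^-20 it grows to roughly 1.5X the encoded bytes.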

The -p metadata is stored in the compressed data, so you can also use -p to inspect a compressed file after the fact:

[sdi@localhost example]$ sz -p -s chaotic_data.dat.sz
=================SZ Compression Meta Data=================
Version: 2.1.12
Constant data?: NO
Lossless?: NO
Size type (size of # elements): 8 bytes
Num of elements: 16777216
compressor Name: SZ
Data type: DOUBLE
min value of raw data: 0.000000
max value of raw data: 1.000000
quantization_intervals: 0
max_quant_intervals: 67108864
actual used # intervals: 524288
dataEndianType (prior raw data): LITTLE_ENDIAN
sysEndianType (at compression): LITTLE_ENDIAN
sampleDistance: 100
predThreshold: 0.999000
szMode: SZ_BEST_COMPRESSION (with Zstd or Gzip)
gzipMode: Z_BEST_SPEED
errBoundMode: ABS
absErrBound: 0.000010

@lindstro (Author)

@disheng222 I tried your proposed fix of using 2²⁶ quantization bins. This does improve things a little for this type of random data at high rates (low tolerances). There is still a huge jump in rate, though, from about 32 to 60 bits, when halving the error tolerance. That's a bit surprising.

[Figure: α(R) for the logistic-map data with 2²⁶ quantization bins]

I'm also including the results of this change in the number of bins for compressible data: the Miranda viscosity field from SDRBench. Using more bins doesn't seem to help at high rates here; in fact, SZ does worse when the number of bins is increased. I'm just curious whether there are other parameters that might help.

[Figures: α(R) with default and increased bin counts for the logistic-map data and the Miranda viscosity field]

@disheng222 (Collaborator)

I tested SZ2 (github.com/szcompressor/SZ) and SZ3 (github.com/szcompressor/SZ3) using Miranda's viscocity.d64.
I got the following results for the accuracy gain, which I computed as
α = log₂(σ / E) − R, where σ is the standard deviation of the original dataset, E is the root-mean-square error, and R is the bit rate (with respect to 64 bits). Hope this calculation is correct.
Then, I got the following results:
[Figure: accuracy gain vs. bit rate for SZ2 and SZ3 on the Miranda viscosity field]
I used a fixed number of quantization bins (65536) to run SZ2 and SZ3.
It seems that the accuracy gain I got is slightly different from your result.
Another observation is that SZ3 is better than SZ2.

The drop in SZ's accuracy gain is likely due to the drop in its compression ratio when the error bound is very small. One can increase the maximum number of quantization bins via the configuration file, but this also increases the Huffman tree size, which may offset the compression ratio improvement. Storing the Huffman tree more compactly was not a focus of our earlier design, because we did not consider such small error bounds before; I simply used a naive implementation to store the Huffman tree. Improving the Huffman tree storage could increase the accuracy gain to a certain extent, but that has not been done yet.
In fact, we recently developed an improved version of SZ3 (called QoZ), which gets better quality for high-error cases but not for low-error cases. It hasn't been integrated into the SZ3 GitHub repo yet.

@disheng222 (Collaborator)

I tested SZ3 with the maximum number of quantization bins set to 1048576, which can be configured in sz3.config.
Then, I got the following results:

[Figure: accuracy gain vs. bit rate for SZ2, SZ3 (default), and SZ3 with 1048576 bins (SZ3_1m)]

The result (i.e., SZ3_1m) looks better than both SZ2 and SZ3 (default).
If I use a larger number of quantization bins, the result gets worse, probably because of the Huffman tree storage overhead mentioned in my previous comment. The Huffman tree size has an upper bound determined by the maximum number of quantization bins, and this overhead is not negligible when the original data size (e.g., 256×384×384 in this case) is not very big. That said, if the original data size were 1024³, the impact of the Huffman tree overhead might not be that large.
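For a rough sense of that upper bound, here is a back-of-the-envelope sketch, assuming the ~13 bytes per serialized node implied by huffmanTreeSize / huffmanNodeCount in the -q output earlier in this thread; a Huffman tree over k leaf codes has at most 2k − 1 nodes:

```cpp
// Worst-case serialized Huffman tree size for various bin counts
// (assumption: ~13 bytes/node, taken from the -q stats earlier in this thread).
#include <cstdio>
#include <initializer_list>

int main()
{
  const double bytes_per_node = 13.0;
  for (long bins : {65536L, 1048576L, 67108864L}) {
    long max_nodes = 2 * bins - 1;  // full binary tree with `bins` leaves
    std::printf("bins = %8ld  ->  tree <= %.0f MiB\n",
                bins, max_nodes * bytes_per_node / (1 << 20));
  }
  return 0;
}
```

In practice the tree only covers the codes actually used (599421 nodes for 524288 used intervals in the run above), so these are worst-case figures, but they show why a multi-million-bin tree is noticeable next to a 256×384×384 field and much less so next to a 1024³ one.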

Best,
Sheng

@lindstro (Author)

@disheng222 Thanks for your suggestions. We decided to go with SZ2 because it supports a pointwise relative error bound. AFAIK, SZ3 does not.

Your Miranda plots look much like the "default" curve I included and less like the S-shaped curve I got after increasing the number of bins.

Do you know why there's such a large jump in rate for the more random data?

@disheng222 (Collaborator) commented Jul 28, 2022 via email

@lindstro (Author) commented Aug 4, 2022

> SZ3 also supports point-wise relative error bounds, but we didn't release this function in API. We can do it soon, and will let you know when it's ready.

@disheng222 Thanks--that would be great. If you don't mind, can we leave this issue open until you have added support to SZ3? I may have some follow-up questions once that's working.

@disheng222 (Collaborator) commented Aug 4, 2022 via email
