
Poor compression & quality for difficult-to-compress data #189

Open
lindstro opened this issue Mar 25, 2022 · 12 comments · May be fixed by #196

@lindstro

I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map xᵢ₊₁ = 4xᵢ(1 − xᵢ):

#include <cstdio>

int main()
{
  double x = 1. / 3;
  for (int i = 0; i < 256 * 256 * 256; i++) {
    fwrite(&x, sizeof(x), 1, stdout);
    x = 4 * x * (1 - x);
  }
  return 0;
}
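
Compiled and run with output redirected to a file (e.g., ./logistic > input.bin), this writes 256³ = 16,777,216 doubles, i.e., the 128 MiB input.bin used in the commands below.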

We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between L2 error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as

α = log₂(σ / E) - R.

Then each increment in storage, R, by one bit should result in a halving of E, so that α is essentially constant. The limit behavior is slightly different as R → 0 or E → 0, but over a large range α ought to be constant.
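
For concreteness, here is a small sketch of how α is computed for a single run from the original data, the reconstructed data, and the compressed size (an illustrative helper of my own, not code from any of the compressors discussed):

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helper (hypothetical; not part of MGARD or zfp): compute
// alpha = log2(sigma / E) - R for one compressed/decompressed pair, where
// sigma is the standard deviation of the input, E is the L2 (RMS) error,
// and R is the rate in bits per value.
double accuracy_gain(const std::vector<double> &orig,
                     const std::vector<double> &recon,
                     std::size_t compressed_bytes)
{
  const std::size_t n = orig.size();
  double mean = 0;
  for (double x : orig)
    mean += x;
  mean /= n;
  double var = 0;
  double err2 = 0;
  for (std::size_t i = 0; i < n; i++) {
    var += (orig[i] - mean) * (orig[i] - mean);
    err2 += (orig[i] - recon[i]) * (orig[i] - recon[i]);
  }
  const double sigma = std::sqrt(var / n);     // standard deviation of the input
  const double E = std::sqrt(err2 / n);        // L2 (RMS) error
  const double R = 8.0 * compressed_bytes / n; // rate in bits per value
  return std::log2(sigma / E) - R;
}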

Below is a plot of α(R) for MGARD 1.2.0 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here I used a smoothness parameter of 0, which should result in an L2 optimal reconstruction: mgard compress --smoothness 0 --tolerance tolerance --datatype double --shape 256x256x256 --input input.bin --output output.mgard. The tolerance was halved for each subsequent data point, starting with tolerance = 1.

The plot suggests an odd relationship between R and α, where α is far from stable when R > 17. Is this perhaps a bug in MGARD? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).

[plot: logistic]

@lindstro
Author

Just a friendly reminder to look into this issue when you get a chance.

@ben-e-whitney added the "bug" and "question" labels on Apr 27, 2022
@ben-e-whitney
Collaborator

ben-e-whitney commented Apr 27, 2022

Apologies for the delay. I have begun working on this and have reproduced the problem.

image

I'm guessing there is some issue with the coefficient quantization. Looking into it now.

In the meantime, I have a question about accuracy gain.

… the inherent randomness at least suggests a predictable relationship between L₂ error, E, and rate … each increment in storage, R, by one bit should result in a halving of E, so that α is essentially constant. The limit behavior is slightly different as R → 0 or E → 0, but over a large range α ought to be constant.

Under what circumstances "should" α(R) be about constant with respect to R? Is that the best possible behavior, or will it be achieved by a compressor well-suited to chaotic data or something? I am unfamiliar with accuracy gain and am trying to determine the point at which I should be pointing the finger at the algorithm rather than its implementation.

@lindstro
Author

Under what circumstances "should" α(R) be about constant with respect to R? Is that the best possible behavior, or will it be achieved by a compressor well-suited to chaotic data or something? I am unfamiliar with accuracy gain and am trying to determine the point at which I should be pointing the finger at the algorithm rather than its implementation.

When R = 0, no compressed data is stored, so the best you can hope for is that the compression algorithm (having no knowledge of the input data) happens to reconstruct every value as the constant μ, the mean of the input data. In this case, E = σ, so α = 0. Any other constant reconstruction yields α < 0.

For large R, either E = 0 or E converges to some small positive value, e.g., due to roundoff error in the compression scheme. Hence, α either approaches +∞ (E = 0) or -∞ (E > 0) as R grows.

In between these limit behaviors, we would expect α to initially rise when R is small, indicating that the data is effectively being compressed, until it reaches a plateau, where α is constant. On this plateau, each increase in R by one bit/value results in a halving of E, so α remains constant. The value of this plateau is the number of per-value bits of information that the compressor has inferred and that need not be coded; it is a measure of how well the data has been compressed (higher is better). Each additional bit coded is at this point random and so cannot be compressed (one bit of output stored for each bit of input consumed). Eventually, α either goes to infinity or drops linearly.

The plot below illustrates this behavior for some compressors (in particular, SPERR and zfp) using the rather smooth Miranda viscosity data from SDRBench, while other compressors do not reach a stable plateau (e.g., MGARD, SZ, TTHRESH). Perhaps MGARD would if it weren't for this apparent bug. We have diagnosed the TTHRESH bug and are working on a fix.

Of course, for random data, we do not expect any bits to be inferred, so the plateau usually sits at or below zero. It falls below zero because of overhead (e.g., metadata) and the pigeonhole principle: in any compression scheme, some outputs must be larger than their inputs. That's what we're seeing for all compressors in the plot above.

Think of accuracy gain as simply "rotating" (or, more accurately, "shearing") a PSNR vs. R plot so that the expected 6.02 dB/bit slope turns into a slope of zero.
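
To make that shearing explicit (using the usual definition PSNR = 20 log₁₀(peak / E) ≈ 6.02 log₂(peak / E) dB): since log₂(peak / E) = log₂(peak / σ) + log₂(σ / E) = log₂(peak / σ) + α + R, we get PSNR ≈ 6.02 (α + R + log₂(peak / σ)). So for a fixed data set (fixed peak and σ), α is constant exactly when PSNR grows by about 6.02 dB per bit of rate.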

[plot: viscosity]

@ben-e-whitney
Collaborator

@lindstro Just an update: I'm working on a bug in the coefficient quantization code. See #190. Once that's done, I'll return to this.

@ben-e-whitney ben-e-whitney linked a pull request Jun 28, 2022 that will close this issue
@ben-e-whitney
Collaborator

Hi @lindstro. Apologies again for how long this is taking. Fixing the Huffman code ended up entailing a lot of work; see #196. Running on that branch, the accuracy gain plot looks a bit better.

image

Comparison with what you were seeing:

  • The initial plateau now extends to around 20 bits/double. In your tests, it only extended to around 17 bits/double.
  • α falls after the initial plateau and then flattens out again around 28 bits/value, in keeping with your tests. In your tests, α then cratered and MGARD wasn't able to achieve compression ratios below 2. Now, we have a second 'plateau' (not that flat) lasting until about 58 bits/double, beyond which MGARD's quantization routine fails because its output can't be stored in the designated integer type.
  • In the second plateau there's some strange degradation of α as the rate increases. It looks like the error tolerance decreases slightly and the compression ratio falls and/or the achieved error jumps. (I'm making the data for these plots by running MGARD with a sequence of error tolerances, while α is defined in terms of the compression ratio and the achieved error.) Then the tolerance decreases a bit more and things return to normal. I don't know what's going on here.

Here's a summary of how the compression happens:

  • After the decorrelation stage, the transform coefficients are quantized, yielding a sequence of integers.
  • A Huffman code is built for the integer sequence. A certain range of integers is eligible for a Huffman codeword. In the configuration used to produce the plot above, that range is {−2^17, …, 2^17 − 1}. Integers inside this range are represented in a 'hit' buffer by the corresponding codeword. Integers outside this range are instead stored verbatim in a 'missed' buffer (this split is sketched below the list).
  • The 'hit' buffer, the 'missed' buffer, and a bunch of metadata are written to a serialization buffer. This serialization buffer is then compressed by ZSTD.
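
For concreteness, the hit/missed split in the Huffman step amounts to something like the following (an illustrative sketch with hypothetical names, not MGARD's actual code):

#include <cstdint>
#include <vector>

// Illustrative sketch of the hit/missed split described above. Quantized
// coefficients inside the range {-2^17, ..., 2^17 - 1} go to the 'hit'
// buffer to receive Huffman codewords; the rest go to the 'missed' buffer
// to be stored verbatim.
void split_quantized(const std::vector<std::int64_t> &quantized,
                     std::vector<std::int64_t> &hit,
                     std::vector<std::int64_t> &missed)
{
  const std::int64_t lo = -(std::int64_t{1} << 17);
  const std::int64_t hi = (std::int64_t{1} << 17) - 1;
  for (const std::int64_t q : quantized) {
    if (lo <= q && q <= hi)
      hit.push_back(q);    // eligible for a Huffman codeword
    else
      missed.push_back(q); // outside the range; written verbatim
  }
}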

As the error tolerance decreases, the quantized coefficients become larger, and as a result more of them fall outside the range {−2^17, …, 2^17 − 1} and end up in the 'missed' buffer. I guessed that the first plateau might end when this happened to some critical fraction of the quantized coefficients, so I repeated the experiment with some different integer ranges.

image

I think we can say that this range parameter has some effect on the accuracy gain, but it's not the whole story.

How's all this looking to you?

@lindstro
Author

lindstro commented Jul 5, 2022

Thanks, @ben-e-whitney, for all your work in addressing this issue. The accuracy gain plot looks a bit odd, but at least it doesn't go off a cliff, so that's a welcome improvement. Let me play with this a bit and see how MGARD does on a few data sets. Unfortunately, I cannot get the huffman-decoding branch to build on my Mac:

MGARD/include/huffman.tpp:183:14: fatal error: no viable conversion from 'std::shared_ptr<CodeCreationTreeNode>' to 'const bool'
  const bool children = node->left;
             ^          ~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/c++/v1/memory:2879:22: note: explicit conversion function is not a candidate
    _LIBCPP_EXPLICIT operator bool() const _NOEXCEPT {return get() != nullptr;}

@ben-e-whitney
Collaborator

It looks like my compiler is letting me get away with something a little improper. Once the branch is ready to be merged I'll have someone with a Mac try building and iron out any issues that come up. In the meantime, I think it'll compile with static_cast<bool>(node->left) in place of node->left.
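
For context, std::shared_ptr's operator bool is explicit, so copy-initializing a bool from it is ill-formed in standard C++ (some compilers accept it anyway). A minimal reproduction of the error and the fix:

#include <memory>

int main()
{
  std::shared_ptr<int> p;
  // const bool a = p;                 // ill-formed: operator bool is explicit,
  //                                   // so copy-initialization can't use it
  const bool b = static_cast<bool>(p); // OK: the conversion is requested explicitly
  return b ? 1 : 0;
}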

@lindstro
Author

lindstro commented Jul 9, 2022

Thanks, but this leads to another issue:

MGARD/src/lossless_dispatcher.cpp:69:3: fatal error: static_assert failed due to requirement 'std::is_same<long long, long>::value' "deprecated Huffman coding written with assumption that `std::int64_t` is `long int`"
  static_assert(std::is_same<std::int64_t, long int>::value,
  ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

It would be great if you could ensure MGARD builds on Mac. I've heard from others that there might be linker issues as well.

@lindstro
Author

Here's an update. I was able to get around the int64_t issue just by commenting the assertion out, which should be safe given that these types have the same size on my platform (Apple clang). The MGARD huffman-decoding branch now builds, albeit slowly (took 36 minutes!).
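
For what it's worth, an assertion on size rather than the exact type would still guard the cast widths while passing on LP64 platforms; just a suggestion:

#include <cstdint>

// Size-based alternative to the exact-type assertion (a suggestion only):
// on LP64 platforms such as macOS, std::int64_t is long long rather than
// long int, but both are 64 bits wide, so a width check still passes while
// catching genuinely narrower long types.
static_assert(sizeof(std::int64_t) == sizeof(long int),
              "deprecated Huffman coding assumes `long int` is 64 bits wide");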

I'm including two plots below using the SDRBench Miranda viscosity field and this command line:

mgard-x -z -s 0 -m abs -e TOLERANCE -t d -n 3 256 384 384 -r 0 -l 3 -d serial -i viscosity.raw -c viscosity.mgard

where TOLERANCE = 2.62133 / 2ⁱ for nonnegative integers i. The first plot shows rate vs. accuracy gain. MGARD goes off the cliff, as before, but at a lower rate this time. The second plot shows that the absolute error tolerance is almost always exceeded.

By the way, the -r option is required but not documented. I guessed that a value of zero would be a good default.
[plots: viscosity, mgard-error]

@lindstro
Author

@ben-e-whitney It's been some time since I saw progress on this issue or #186. Do you have any updates to share?

@lindstro
Author

lindstro commented Oct 7, 2022

@ben-e-whitney Any progress on this issue?

@lindstro
Author

Can someone on the MGARD team please comment on the status of this issue? It appears unresolved in the latest 1.3 release.
