Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit the OpenSSL interop exception model #55973

Closed
Tracked by #64488
bartonjs opened this issue Jul 19, 2021 · 6 comments · Fixed by #65148
Closed
Tracked by #64488

Revisit the OpenSSL interop exception model #55973

bartonjs opened this issue Jul 19, 2021 · 6 comments · Fixed by #65148
Labels
area-System.Security os-linux Linux OS (any supported distro)
Milestone

Comments

@bartonjs
Copy link
Member

For various reasons, the model we ended up with for OpenSSL is that when we decide it's time to throw an exception we drain out the error queue and throw for whatever the last error was. Largely this worked because each operation (probably) only really reports one error. Any time the error queue had more than one thing in it was because of spillover from a recoverable failure.

e.g.: new X509Certificate2(bytes)

  • call d2i_X509. If that failed, continue to next format
    • If it failed, it probably wrote some ASN decode errors... but all we checked was that it returned NULL. Error queue length 1.
  • call PEM_read_bio_X509.
    • Error queue length is probably 2.
  • call d2i_PKCS7
    • Probably 3.
  • PEM_read_bio_PKCS7
    • ~4.
  • Oh, look, it's a PKCS#12.

So we succeeded, and left 4 errors in the queue (unless we cleared them, but I doubt it).

Later, when we run into some other error (e.g. signing with a public key) we go "oh, exception throwing time... drain down to the last error and throw on that, since it's what we just encountered". Assuming that the operation only assigned one error code, that's true.

The "only one error" didn't quite hold up in an unfortunate way in #55787. Calling EVP_PKEY2PKCS8 with a public-only DSA key on OpenSSL 3 Beta 1 produced two errors: 1) ERR_LIB_PROV | ERR_R_NOT_A_PRIVATE_KEY, and 2) ERR_LIB_PROV | ERR_R_MALLOC_FAILURE. While the second error seems pretty spurious for the situation, it shows that we really wanted "the first error that happened during our operation".

Suggested changes:

  • Each operation probably wants to start off with ERR_clear_error(). It's relatively cheap these days.
    • (Probably worth a perf measurement to quantify this).
  • Change CreateOpenSslCryptographicException to call one shim function that returns the oldest error, uses a ref/out value to indicate anything special (e.g. malloc failure => OOM), and clears the queue
    • Perf on exception paths is less important, but reducing N calls down to 1 seems like goodness either way.
  • Do a one-time test where each P/Invoke ends with reporting if there was more than one error enqueued. For each place that we see multiple errors enqueued check that it is, in fact, the oldest one that looks like the best exception. (Or at least looks like a suitable exception)
    • Checking on 1.0.2 (Ubuntu 16.04), 1.1.0 (Ubuntu 18.04), 1.1.1 (Ubuntu 20.04) and 3.0 (?), of course.
@bartonjs bartonjs added area-System.Security os-linux Linux OS (any supported distro) labels Jul 19, 2021
@bartonjs bartonjs added this to the 7.0.0 milestone Jul 19, 2021
@ghost
Copy link

ghost commented Jul 19, 2021

Tagging subscribers to this area: @bartonjs, @vcsjones, @krwq, @GrabYourPitchforks
See info in area-owners.md if you want to be subscribed.

Issue Details

For various reasons, the model we ended up with for OpenSSL is that when we decide it's time to throw an exception we drain out the error queue and throw for whatever the last error was. Largely this worked because each operation (probably) only really reports one error. Any time the error queue had more than one thing in it was because of spillover from a recoverable failure.

e.g.: new X509Certificate2(bytes)

  • call d2i_X509. If that failed, continue to next format
    • If it failed, it probably wrote some ASN decode errors... but all we checked was that it returned NULL. Error queue length 1.
  • call PEM_read_bio_X509.
    • Error queue length is probably 2.
  • call d2i_PKCS7
    • Probably 3.
  • PEM_read_bio_PKCS7
    • ~4.
  • Oh, look, it's a PKCS#12.

So we succeeded, and left 4 errors in the queue (unless we cleared them, but I doubt it).

Later, when we run into some other error (e.g. signing with a public key) we go "oh, exception throwing time... drain down to the last error and throw on that, since it's what we just encountered". Assuming that the operation only assigned one error code, that's true.

The "only one error" didn't quite hold up in an unfortunate way in #55787. Calling EVP_PKEY2PKCS8 with a public-only DSA key on OpenSSL 3 Beta 1 produced two errors: 1) ERR_LIB_PROV | ERR_R_NOT_A_PRIVATE_KEY, and 2) ERR_LIB_PROV | ERR_R_MALLOC_FAILURE. While the second error seems pretty spurious for the situation, it shows that we really wanted "the first error that happened during our operation".

Suggested changes:

  • Each operation probably wants to start off with ERR_clear_error(). It's relatively cheap these days.
    • (Probably worth a perf measurement to quantify this).
  • Change CreateOpenSslCryptographicException to call one shim function that returns the oldest error, uses a ref/out value to indicate anything special (e.g. malloc failure => OOM), and clears the queue
    • Perf on exception paths is less important, but reducing N calls down to 1 seems like goodness either way.
  • Do a one-time test where each P/Invoke ends with reporting if there was more than one error enqueued. For each place that we see multiple errors enqueued check that it is, in fact, the oldest one that looks like the best exception. (Or at least looks like a suitable exception)
    • Checking on 1.0.2 (Ubuntu 16.04), 1.1.0 (Ubuntu 18.04), 1.1.1 (Ubuntu 20.04) and 3.0 (?), of course.
Author: bartonjs
Assignees: -
Labels:

area-System.Security, os-linux

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jul 19, 2021
@stephentoub
Copy link
Member

Suggested changes

Sounds reasonable, thanks.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 21, 2021
@bartonjs
Copy link
Member Author

bartonjs commented Feb 8, 2022

Sadly, this doesn't appear to be a clear improvement.

For diagnostics, I changed OpenSslCryptographicException to directly extend Exception. This made tests angry, so I captured their output and diffed it from before and after the model change.

Most of our existing tests weren't affected, indicating that the last error and the first error are the same (either a single error or an error code sandwich), e.g.

System.Security.Cryptography.Dsa.Tests.DSASignVerify_Span.PublicKey_CannotSign - match -  Interop+Crypto+OpenSslCryptographicException : error:0A070065:dsa routines:DSA_do_sign:missing parameters

All other "match" cases are hereby handwaved away, the ones that are different are more interesting

Tests where the exception message/code changed.

ecdhPrivate.DeriveKeyFromHash(otherPublic, ...), curve "mismatch"

System.Security.Cryptography.EcDiffieHellman.Tests.ECDiffieHellmanTests.ValidateNistP384_0
-    Interop+Crypto+OpenSslCryptographicException : error:0609B099:digital envelope routines:EVP_PKEY_derive_set_peer:different parameters
+    Interop+Crypto+OpenSslCryptographicException : error:10071065:elliptic curve routines:EC_POINT_cmp:incompatible objects
System.Security.Cryptography.EcDiffieHellman.Tests.ECDiffieHellmanTests.ValidateNistP384_1
-    Interop+Crypto+OpenSslCryptographicException : error:0609B099:digital envelope routines:EVP_PKEY_derive_set_peer:different parameters
+    Interop+Crypto+OpenSslCryptographicException : error:10071065:elliptic curve routines:EC_POINT_cmp:incompatible objects

The error queue:

(lldb) call (void)ERR_print_errors_fp(stdout)
140734362056448:error:10071065:elliptic curve routines:EC_POINT_cmp:incompatible objects:../crypto/ec/ec_lib.c:959:
140734362056448:error:0609B099:digital envelope routines:EVP_PKEY_derive_set_peer:different parameters:../crypto/evp/pmeth_fn.c:266:

in this case, the latter error ("different parameters") is probably the more friendly error.

rsaPrivate.Decrypt(..., RSASignaturePadding.Pkcs1), wrong key / corrupted data

System.Security.Cryptography.Rsa.Tests.EncryptDecrypt_Span.Decrypt_WrongKey_Pkcs7
-    Interop+Crypto+OpenSslCryptographicException : error:04065072:rsa routines:rsa_ossl_private_decrypt:padding check failed
+    Interop+Crypto+OpenSslCryptographicException : error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
System.Security.Cryptography.Rsa.Tests.RSAKeyExchangeFormatterTests.VerifyDecryptKeyExchangePkcs1
-    Interop+Crypto+OpenSslCryptographicException : error:04065072:rsa routines:rsa_ossl_private_decrypt:padding check failed
+    Interop+Crypto+OpenSslCryptographicException : error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error

The error queue:

(lldb) call (void)ERR_print_errors_fp(stdout)
140733788972800:error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error:../crypto/rsa/rsa_pk1.c:244:
140733788972800:error:04065072:rsa routines:rsa_ossl_private_decrypt:padding check failed:../crypto/rsa/rsa_ossl.c:485:

The latter error is... slightly... more friendly.

rsa.ImportRSAPublicKey(badBytes)

System.Security.Cryptography.Rsa.Tests.RSAKeyFileTests.NoFuzzyRSAPrivateKey
-     Interop+Crypto+OpenSslCryptographicException : error:0606F091:digital envelope routines:EVP_PKCS82PKEY:private key decode error
+     Interop+Crypto+OpenSslCryptographicException : error:0D0650DF:asn1 encoding routines:c2i_uint64_int:too large
System.Security.Cryptography.Rsa.Tests.RSAKeyFileTests.ReadWriteBigExponentPrivatePkcs1
-     Interop+Crypto+OpenSslCryptographicException : error:0606F091:digital envelope routines:EVP_PKCS82PKEY:private key decode error
+     Interop+Crypto+OpenSslCryptographicException : error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag
System.Security.Cryptography.Rsa.Tests.RSAKeyFileTests.ReadWriteDiminishedDPPrivatePkcs1
-     Interop+Crypto+OpenSslCryptographicException : error:0606F091:digital envelope routines:EVP_PKCS82PKEY:private key decode error
+     Interop+Crypto+OpenSslCryptographicException : error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag

The error queue:

(lldb) call (void)ERR_print_errors_fp(stdout)
140734362056448:error:0D0650DF:asn1 encoding routines:c2i_uint64_int:too large:../crypto/asn1/a_int.c:615:
140734362056448:error:0D08303A:asn1 encoding routines:asn1_template_noexp_d2i:nested asn1 error:../crypto/asn1/tasn_dec.c:646:Field=version, Type=RSAPrivateKey
140734362056448:error:04096004:rsa routines:rsa_priv_decode:RSA lib:../crypto/rsa/rsa_ameth.c:180:
140734362056448:error:0606F091:digital envelope routines:EVP_PKCS82PKEY:private key decode error:../crypto/evp/evp_pkey.c:44:

Honestly, all of the errors here are less than ideal. Basically, our implementation for this is "take the goo they give us, validate it's DER, and if it is, just wrap it in a PKCS8 header assuming it's an RSAPrivateKey... then call ImportPkcs8PrivateKey".

The outermost error "private key decode error" is the most useful, even though it says it came from PKCS8 ('cuz, well, it did). We /could/ validate the ASN structure before doing the data wrapping and improve this exception in managed code.

For NoFuzzyRSAPrivateKey the input is a valid DER payload, but from a different file format. For the rest, the test actually has valid input, but the test driver moves on from success to try things like input.Slice(1) and input[0..^1], and that's what's triggering the failure.

rsa.ImportRSAPublicKey(badBytes)

System.Security.Cryptography.Rsa.Tests.RSAKeyFileTests.NoFuzzyRSAPublicKey
-     Interop+Crypto+OpenSslCryptographicException : error:0B09407D:x509 certificate routines:x509_pubkey_decode:public key decode error
+     Interop+Crypto+OpenSslCryptographicException : error:0D078094:asn1 encoding routines:asn1_item_embed_d2i:sequence length mismatch
System.Security.Cryptography.Rsa.Tests.RSAKeyFileTests.ReadWritePublicPkcs1
-     Interop+Crypto+OpenSslCryptographicException : error:0B09407D:x509 certificate routines:x509_pubkey_decode:public key decode error
+     Interop+Crypto+OpenSslCryptographicException : error:0D0680A8:asn1 encoding routines:asn1_check_tlen:wrong tag

The error queue:

(lldb) call (void)ERR_print_errors_fp(stdout)
140734362056448:error:0D078094:asn1 encoding routines:asn1_item_embed_d2i:sequence length mismatch:../crypto/asn1/tasn_dec.c:405:Type=RSAPublicKey
140734362056448:error:0408B004:rsa routines:rsa_pub_decode:RSA lib:../crypto/rsa/rsa_ameth.c:105:
140734362056448:error:0B09407D:x509 certificate routines:x509_pubkey_decode:public key decode error:../crypto/x509/x_pubkey.c:124:

Similar to the one for RSAPrivateKey, the last one is not great, but probably the best of the three, and for similar reasons.

rsa.ImportParameters(badParameters)

System.Security.Cryptography.Rsa.Tests.RSAXml.FromNonsenseXml
-     Interop+Crypto+OpenSslCryptographicException : error:040A007E:rsa routines:RSA_check_key_ex:iqmp not inverse of q
+     Interop+Crypto+OpenSslCryptographicException : error:040A0080:rsa routines:RSA_check_key_ex:p not prime

The error queue

(lldb) call (void)ERR_print_errors_fp(stdout)
140734362056448:error:040A0080:rsa routines:RSA_check_key_ex:p not prime:../crypto/rsa/rsa_chk.c:67:
140734362056448:error:040A007F:rsa routines:RSA_check_key_ex:n does not equal p q:../crypto/rsa/rsa_chk.c:103:
140734362056448:error:040A007B:rsa routines:RSA_check_key_ex:d e not congruent to 1:../crypto/rsa/rsa_chk.c:151:
140734362056448:error:040A007C:rsa routines:RSA_check_key_ex:dmp1 not congruent to d:../crypto/rsa/rsa_chk.c:166:
140734362056448:error:040A007E:rsa routines:RSA_check_key_ex:iqmp not inverse of q:../crypto/rsa/rsa_chk.c:190:

These errors are all good, actually (aside maybe from 040A07B, since n/e/d are consistent, it's only p that's out of place... but it probably used "calculated N" instead of "the given N")

This particular test uses some valid parameters, but then copies D over top P, which produces some gibberish results.

Potential paths forward

(DSA and ECDSA would likely change more once they move to EVP_PKEY operations)

  1. Use the first code instead of the last code. It's probably just as valid, and they're all hard to reason about anyways.
  2. Use the last code still... but if the last code says there's a malloc failure, use the penultimate code. (If there isn't one, then OOM)
  3. Use the last code, except in the DSA case that triggered this behavior. Do some manual error code sniffing in that routine and prune out the malloc error.
  4. Return all the codes, using nested unthrown exceptions. For the FromNonsenseXml case, it'd look like throw new OpenSslCryptographicException(040A007E, new OpenSslCryptographicException(040A007C, ..., new OpenSslCryptographicException(040A0080))))))
1. Other suggestions welcome.

@vcsjones
Copy link
Member

vcsjones commented Feb 8, 2022

Use the last code, except in the DSA case that triggered this behavior. Do some manual error code sniffing in that routine and prune out the malloc error

Isn't that more or less what we already did for the RSA case in #63804?

Each operation probably wants to start off with ERR_clear_error()

I think there is still merit in this, maybe each shim function should

  1. assert the error queue is empty.
    Rationale being that other calls didn't properly check the error queue after calling out to the shims. I think that makes sense in the crypto primitive shims, less sure about the SSL/TLS ones.
  2. Clear the queue.

We can bundle all that up in to a macro or something:

#define error_check_preflight() \
    do { assert(ERR_peek_error() == 0); ERR_clear_error(); } while(0)

@bartonjs
Copy link
Member Author

bartonjs commented Feb 8, 2022

Isn't that more or less what we already did for the RSA case in #63804?

Hm. Interestingly, I think that's the same function that failed (in the same way) for DSA. Wonder what changed that it had been passing for me with RSA... and then the behaviors unified.

assert the error queue is empty.

Asserting seems a bit harsh, especially since we accept/expose SafeEvpPKeyHandle... debug builds could randomly shut down because of some other interop conditions.

maybe each shim function should ... Clear the queue.

That's what I've wired up in https://github.com/dotnet/runtime/compare/main...bartonjs:ossl_exception_model?expand=1. Since we're back to using the last error it's mostly unnecessary, but it signals intent and probably makes some debugging easier. I'll probably open that as a PR tomorrow if I don't get an overnight feeling of it being a bad idea.

@vcsjones
Copy link
Member

vcsjones commented Feb 8, 2022

Hm. Interestingly, I think that's the same function that failed

Your description in #55787 (comment) does sound very reminiscent of what was happening with RSA. I would not be surprised if the issue is "fixed" for the DSA portion now.

Asserting seems a bit harsh, especially since we accept/expose SafeEvpPKeyHandle

Fair.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 15, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Security os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants