Correlation id for response (526263) does not match request (0) error after IAM is enabled #51

Closed
mandy-yan-liu opened this issue Jan 20, 2022 · 6 comments

Comments

@mandy-yan-liu

Our application ran without any issue before we enabled IAM. After IAM was configured, we started getting the errors below at a very low frequency.

"java.lang.IllegalStateException: Correlation id for response (526263) does not match request (0), request header: RequestHeader(apiKey=SASL_HANDSHAKE, apiVersion=1, clientId=ef958fc7-4d43-490d-aa3c-c1ba0d189003-StreamThread-1-restore-consumer, correlationId=0)

Encountered the following unexpected Kafka exception during processing, this usually indicate Streams internal errors:","error":{"stack":"org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'responses': Error reading array of size 1398754643, only 5 bytes available

Some threads indicate that a mismatch between the Kafka server and client versions can cause the above errors, but why does it only happen when IAM is enabled?

Versions we use:
Kafka cluster version: 2.6.2
kafka-client version: 2.5.0
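
One detail in the second error may help with diagnosis: the reported array size, 1398754643, looks more like text than a real length. The sketch below (not from the original logs, just arithmetic on the number in the message) reinterprets that 32-bit value as four ASCII bytes. The class and method names are made up for illustration. The result, "S_MS", happens to be a substring of the mechanism name AWS_MSK_IAM, which would be consistent with the client parsing a SASL handshake response while expecting a different response. That reading is an inference, not something the logs confirm.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ArraySizeDecode {
    // Reinterpret a 32-bit integer as four ASCII characters.
    // ByteBuffer writes big-endian by default, which matches the
    // network byte order used by the Kafka wire protocol.
    public static String asAscii(int value) {
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Array size taken from the SchemaException message above.
        System.out.println(asAscii(1398754643)); // prints "S_MS"
    }
}
```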

@mandy-yan-liu mandy-yan-liu changed the title Correlation id for response (526263) does not match request (0) error only after IAM is enabled Correlation id for response (526263) does not match request (0) error after IAM is enabled Jan 20, 2022
@sayantacC
Contributor

sayantacC commented Jan 28, 2022

Sorry about the delayed reply.
Would it be possible for you to provide debug logs for the application when running with IAM?

@mandy-yan-liu
Author

Thanks for getting back! @sayantacC

Here is the full error log for the two errors. Three minutes before the errors, I can see a successful SASL authentication. Events were also being processed after the re-authentication until the failure.
error_logs.txt
Successful_sasl_prior_error.txt

Since the issue happens very infrequently, I'm waiting to see whether it happens again and whether there is a similar pattern.

@sayantacC
Contributor

@mandy-yan-liu Thanks for the logs. As you saw in the logs, when IAM is used, clients are required to re-authenticate periodically. This periodic re-authentication was not happening before you turned on IAM. I wonder if the low rate of failures is related to these periodic re-authentications.

I had a few more clarifying questions:

  • What is the size of the instance in your MSK cluster?
  • What is the approximate rate of failure, either per unit time or as a fraction of total (re)authentication calls?
  • What is the impact of these failures on the application? Does it crash and stop or does it automatically retry and recover from the failure?

@mandy-yan-liu
Author

@sayantacC I also suspect the error is related to re-authentication. The correlationId error appears to come from the client-server exchange for the handshake API call, and it usually fails after the authentication call is made.

  • The log I sent was from our testing environment, where the broker type is kafka.m5.large, there are 3 brokers, and the EBS storage volume per broker is 100 GiB. We also see the error in our production environment, where the broker type is kafka.m5.xlarge, also with 3 brokers, and the EBS storage volume per broker is 1000 GiB.
  • The rate of failure is difficult to determine because it is very inconsistent. Over the last 6 days, the service we use to monitor this error has failed once in the testing environment and twice in production. The IAM role the service uses is set with Maximum session duration=1hr.
  • With this error the service crashes and stops; if we restart the service, it works fine until the error shows up again. I don't see any automatic retry in the code or logs, and the service cannot recover from the failure. We would much appreciate any guidance on whether there is a way to add a retry.

Thanks a lot for your help!

@sayantacC
Contributor

sayantacC commented Feb 8, 2022

@mandy-yan-liu In version 2.5 you might be able to catch the uncaught exception with KafkaStreams::setUncaughtExceptionHandler.
However, that handler seems appropriate mostly for logging or emitting metrics. It does not seem suited to replacing the thread.

Kafka Streams added an easier-to-use setUncaughtExceptionHandler in version 2.8. The new version returns an explicit response that lets you choose to replace the thread, shut down the client, or shut down the application.
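
As a rough illustration of the 2.8-style handler, the sketch below shows only the decision logic, written without a Kafka dependency so it compiles on its own. The `Response` enum is a hypothetical stand-in that mirrors the values of `StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse`; the class name, `isCorrelationMismatch`, `decide`, and the message-prefix match are assumptions made for this example, not part of Kafka Streams or of this library.

```java
public class HandlerSketch {
    // Hypothetical stand-in for Kafka Streams 2.8+
    // StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.
    public enum Response { REPLACE_THREAD, SHUTDOWN_CLIENT, SHUTDOWN_APPLICATION }

    // Walk the cause chain (bounded, in case of unusual cause cycles) looking
    // for the IllegalStateException reported in this issue.
    public static boolean isCorrelationMismatch(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            String msg = c.getMessage();
            if (c instanceof IllegalStateException
                    && msg != null
                    && msg.startsWith("Correlation id for response")) {
                return true;
            }
        }
        return false;
    }

    // Replace the stream thread for the suspected transient error;
    // shut the client down for anything else.
    public static Response decide(Throwable t) {
        return isCorrelationMismatch(t) ? Response.REPLACE_THREAD : Response.SHUTDOWN_CLIENT;
    }

    public static void main(String[] args) {
        Throwable e = new IllegalStateException(
                "Correlation id for response (526263) does not match request (0)");
        System.out.println(decide(e));
        // With Kafka Streams 2.8+ on the classpath, the same decision would be
        // returned from the real handler, mapping Response to
        // StreamThreadExceptionResponse:
        //   streams.setUncaughtExceptionHandler(ex -> /* mapped decide(ex) */);
    }
}
```

Replacing the thread only helps if the failure really is transient. If the same exception recurs immediately, a real handler would probably need a counter or back-off before falling back to a shutdown response.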

Hope this helps.

@sayantacC
Contributor

Closing due to inactivity. Please feel free to reopen.
