Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ibc: can't find consensus state for counterparty client #2612

Closed
conorsch opened this issue May 19, 2023 · 7 comments
Closed

ibc: can't find consensus state for counterparty client #2612

conorsch opened this issue May 19, 2023 · 7 comments
Assignees

Comments

@conorsch
Copy link
Contributor

While trying to re-enable the relayer, I encountered some trouble getting relaying to work. As we saw with #2472, testing against a clean devnet and a fresh preview environment shows no problems. The problem seems to occur when previously configured relayer client state exists on at least one of the chains. Detailed STR below.

Steps to reproduce

# make sure you have the latest `rly` bin on your PATH
git checkout main
cargo run --release --bin pd -- testnet unsafe-reset-all
./deployments/scripts/relayer-local-devnet
# wait until script reports devnet is up and running.
# then, in a separate terminal:
cd deployments/relayer
./build-path

You should see a successful run, ending in:

2023-05-19T21:10:35.151977Z     info    Found termination condition for channel handshake       {"path_name": "penumbra_path", "chain_id": "penumbra-devnet-local-b196f370", "client_id": "07-tendermint-0"}
2023-05-19T21:10:35.151985Z     debug   No cached IBC header for client trusted height  {"chain_id": "penumbra-devnet-local-b196f370", "client_id": "07-tendermint-0", "height": 184}
2023-05-19T21:10:35.152058Z     debug   Context done, quitting PathProcessor    {"chain_id_1": "penumbra-testnet-amalthe-413ed5d0", "chain_id_2": "penumbra-devnet-local-b196f370", "client_id_1": "07-tendermint-0", "client_id_2": "07-tendermint-0", "error": "context canceled"}
Finished! No errors encountered setting up link. Available paths:
 0: penumbra_path        -> chns(✔) clnts(✔) conn(✔) (penumbra-testnet-amalthe-413ed5d0<>penumbra-devnet-local-b196f370)

At that point, the following queries will all report clean results:

  • pcli -n http://localhost:8080 q ibc client 07-tendermint-0
  • pcli -n http://localhost:8080 q ibc connection connection-0
  • pcli -n http://localhost:8080 q ibc channel transfer channel-0

You can try rerunning build-path and it should continue to succeed. Now, let's try reproducing the error. Cancel (i.e. ctrl+c) the devnet script in the first terminal session, and rerun those setup steps. This will create a new local devnet, but against the same preview instance. (N.B. if preview has changed out from under you while you were following these steps, you'll have to reset and go from the top again.)

Then, when you rerun the build-path script, you'll see an error:

2023-05-19T22:11:48.654343Z     debug   Queried latest height   {"chain_name": "penumbra-local", "chain_id": "penumbra-devnet-local-76f29ec1", "latest_height": 570}
2023-05-19T22:11:48.679504Z     debug   Queried latest height   {"chain_name": "penumbra-preview", "chain_id": "penumbra-testnet-amalthe-413ed5d0", "latest_height": 869}
2023-05-19T22:11:49.655293Z     debug   Queried latest height   {"chain_name": "penumbra-local", "chain_id": "penumbra-devnet-local-76f29ec1", "latest_height": 571}
2023-05-19T22:11:49.680543Z     debug   Queried latest height   {"chain_name": "penumbra-preview", "chain_id": "penumbra-testnet-amalthe-413ed5d0", "latest_height": 869}
2023-05-19T22:11:50.655429Z     debug   Queried latest height   {"chain_name": "penumbra-local", "chain_id": "penumbra-devnet-local-76f29ec1", "latest_height": 571}
2023-05-19T22:11:50.679901Z     debug   Queried latest height   {"chain_name": "penumbra-preview", "chain_id": "penumbra-testnet-amalthe-413ed5d0", "latest_height": 869}
2023-05-19T22:11:51.655823Z     debug   Queried latest height   {"chain_name": "penumbra-local", "chain_id": "penumbra-devnet-local-76f29ec1", "latest_height": 571}
2023-05-19T22:11:51.681176Z     debug   Queried latest height   {"chain_name": "penumbra-preview", "chain_id": "penumbra-testnet-amalthe-413ed5d0", "latest_height": 869}
2023-05-19T22:11:52.550073Z     debug   Context done, quitting PathProcessor    {"chain_id_1": "penumbra-testnet-amalthe-413ed5d0", "chain_id_2": "penumbra-devnet-local-76f29ec1", "client_id_1": "07-tendermint-1", "client_id_2": "07-tendermint-0", "error": "context deadline exceeded"}
Error: identifier cannot be blank: invalid identifier

Inspecting the pd logs for the local devnet, I see:

2023-05-19T22:11:44.894860Z DEBUG abci:CheckTx{kind=New txid="dbb610d7f626fd9608ce33916a1e2c22a4621c4002415ef3a82b12813d7c4ca7"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"block_height" key_hash=KeyHash("c4b4e25a8824650a238d3451f754dcda1551dec8bec1a96abe23d2ff8b454db8")
2023-05-19T22:11:44.894905Z DEBUG abci:CheckTx{kind=New txid="dbb610d7f626fd9608ce33916a1e2c22a4621c4002415ef3a82b12813d7c4ca7"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"block_height" key_hash=KeyHash("c4b4e25a8824650a238d3451f754dcda1551dec8bec1a96abe23d2ff8b454db8")
2023-05-19T22:11:44.894947Z DEBUG abci:CheckTx{kind=New txid="dbb610d7f626fd9608ce33916a1e2c22a4621c4002415ef3a82b12813d7c4ca7"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"chain_params" key_hash=KeyHash("111493a3b144602d58199e6bc86ce9cd08f3bceebe0582f9c8d249c274480732")
2023-05-19T22:11:44.895011Z DEBUG abci:CheckTx{kind=New txid="dbb610d7f626fd9608ce33916a1e2c22a4621c4002415ef3a82b12813d7c4ca7"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"clients/07-tendermint-1/consensusStates/0-868" key_hash=KeyHash("c78b9eb6b705b053dd6b20d9f0f3c30724653899e7b0e23856b4fb34d21ed064")
2023-05-19T22:11:44.895063Z  INFO abci:CheckTx{kind=New txid="dbb610d7f626fd9608ce33916a1e2c22a4621c4002415ef3a82b12813d7c4ca7"}: tx rejected e=counterparty consensus state not found for client 07-tendermint-1 at height 0-868 elapsed=2.847708ms

Judging by the "counterparty consensus state not found for client 07-tendermint-1" message, we should check whether that client exists on the counterparty chain. Indeed, it does: running pcli -n https://grpc.testnet-preview.penumbra.zone q ibc client 07-tendermint-1 shows that client exists and appears to be configured well.

@conorsch
Copy link
Contributor Author

conorsch commented Jun 8, 2023

We discussed this ticket last week and decided that it should come after higher priority work on completing the ibc-types refactor. After we've cleared that hurdle, @avahowell plans to take a closer look at this report.

@conorsch
Copy link
Contributor Author

conorsch commented Jul 6, 2023

Fired up the tooling again today, and received a different error from pd:

2023-07-06T15:52:30.913327Z DEBUG abci:CheckTx{kind=New txid="0b4c41c0e0606d0aff39f4b6a0b2655c862e243b059ea926392bb47992493334"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"block_height" key_hash=KeyHash("c4b4e25a8824650a238d3451f754dcda15
51dec8bec1a96abe23d2ff8b454db8")                                                                                                                                                                                                                                                
2023-07-06T15:52:30.913415Z DEBUG abci:CheckTx{kind=New txid="0b4c41c0e0606d0aff39f4b6a0b2655c862e243b059ea926392bb47992493334"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"block_height" key_hash=KeyHash("c4b4e25a8824650a238d3451f754dcda15
51dec8bec1a96abe23d2ff8b454db8")                                                                                                                                                                                                                                                
2023-07-06T15:52:30.913456Z DEBUG abci:CheckTx{kind=New txid="0b4c41c0e0606d0aff39f4b6a0b2655c862e243b059ea926392bb47992493334"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: key=b"chain_params" key_hash=KeyHash("111493a3b144602d58199e6bc86ce9cd08
f3bceebe0582f9c8d249c274480732")                                                                                                                                                                                                                                                
2023-07-06T15:52:30.913533Z  INFO abci:CheckTx{kind=New txid="0b4c41c0e0606d0aff39f4b6a0b2655c862e243b059ea926392bb47992493334"}: tx rejected e=invalid client state: proof specs do not match elapsed=2.981365m

which points to

// check client proof specs match penumbra proof specs
if PENUMBRA_PROOF_SPECS.clone() != tm_client_state.proof_specs {
return Err(anyhow::anyhow!(
"invalid client state: proof specs do not match"
));
}

The error from relayer remains unchanged:

2023-07-06T15:52:39.572324Z     debug   Context done, quitting PathProcessor    {"chain_id_1": "penumbra-testnet-callisto-a4709f2c", "chain_id_2": "penumbra-devnet-local-52880d19", "client_id_1": "07-tendermint-0", "client_id_2": "07-tendermint-0", "error": "context deadline exceeded"}
Error: identifier cannot be blank: invalid identifier

Let's see if we can reproduce that error with further testing today.

@conorsch
Copy link
Contributor Author

conorsch commented Jul 6, 2023

Additionally logging output shows that the proof specs are indeed dissimilar:

2023-07-06T21:42:53.081146Z DEBUG abci:CheckTx{kind=New txid="c29e5a767a887785306520ce7677f12afb3c471c5709829a77b5bebab37d9543"}:execute:IbcAction{idx=1}:ConnectionOpenTry{client_id=07-tendermint-0}: p1=[ProofSpec { leaf_spec: Some(LeafOp { hash: Sha256, prehash_key: Sha256, prehash_value: Sha256, length: NoPrefix, prefix: [74, 77, 84, 58, 58, 76, 101, 97, 102, 78, 111, 100, 101] }), inner_spec: Some(InnerSpec { child_order: [0, 1], child_size: 32, min_prefix_length: 16, max_prefix_length: 16, empty_child: [83, 80, 65, 82, 83, 69, 95, 77, 69, 82, 75, 76, 69, 95, 80, 76, 65, 67, 69, 72, 79, 76, 68, 69, 82, 95, 72, 65, 83, 72, 95, 95], hash: Sha256 }), max_depth: 64, min_depth: 0, prehash_key_before_comparison: true }, ProofSpec { leaf_spec: Some(LeafOp { hash: Sha256, prehash_key: NoHash, prehash_value: NoHash, length: NoPrefix, prefix: [] }), inner_spec: Some(InnerSpec { child_order: [0, 1], child_size: 32, min_prefix_length: 0, max_prefix_length: 0, empty_child: [], hash: Sha256 }), max_depth: 1, min_depth: 0, prehash_key_before_comparison: true }] p2=[ProofSpec { leaf_spec: Some(LeafOp { hash: Sha256, prehash_key: Sha256, prehash_value: Sha256, length: NoPrefix, prefix: [74, 77, 84, 58, 58, 76, 101, 97, 102, 78, 111, 100, 101] }), inner_spec: Some(InnerSpec { child_order: [0, 1], child_size: 32, min_prefix_length: 16, max_prefix_length: 48, empty_child: [], hash: Sha256 }), max_depth: 64, min_depth: 0, prehash_key_before_comparison: false }, ProofSpec { leaf_spec: Some(LeafOp { hash: Sha256, prehash_key: NoHash, prehash_value: NoHash, length: NoPrefix, prefix: [] }), inner_spec: Some(InnerSpec { child_order: [0, 1], child_size: 32, min_prefix_length: 0, max_prefix_length: 0, empty_child: [], hash: Sha256 }), max_depth: 1, min_depth: 0, prehash_key_before_comparison: false }]

Specifically, the proof spec from the const in our Rust code does have empty_child defined, whereas the proof spec received from the relayer client does not. Addtionally, prehash_key_before_comparison is set to true vs false, respectively. We think this means that the two ends are using different versions of the JMT, but we don't understand how that's possible, given that we're running between a local devnet and preview, both of which are based on most recent HEAD of main branch.

@conorsch
Copy link
Contributor Author

conorsch commented Jul 6, 2023

We think this means that the two ends are using different versions of the JMT, but we don't understand how that's possible

Found it: https://github.com/cosmos/relayer/blob/1bfe06cdec3bdd1e0ba64fafac2b6f39cf495e70/relayer/chains/penumbra/tx.go#L1932-L1950

So the go code is hardcoding a shape to the proofspec that isn't compatible with our most recent changes. To be compatible, we need the relayer code to increment the version of its ics23 dependency to at least v0.10. @avahowell is going to take a stab at that change and we'll circle back on testing to see if it resolves our problems. (It's possible it'll resolve one problem, and we'll have more to knock flat.) Here's a PL branch for the relayer to collab on for that fix: https://github.com/penumbra-zone/relayer/tree/update-penumbra-protos-2023-07

@conorsch
Copy link
Contributor Author

conorsch commented Jul 6, 2023

Re-running the test based on the updated golang code, I no longer see the mismatched proofs, which is fantastic. However, the relayer ultimately fails with the message in the OP, with pd logging the same msg:

2023-07-06T23:41:27.271428Z DEBUG flushing response response=Ok(CheckTx(CheckTx { code: Err(1), data: b"", log: "counterparty consensus state not found for client 07-tendermint-2 at height 0-479", info: "", gas_wanted: 0, gas_used: 0, events: [], codespace: "", sender: "", priority: 0, mempool_error: "" }))

So, one problem down, at least one more to go!

@conorsch
Copy link
Contributor Author

conorsch commented Jul 7, 2023

We have a hunch about another off-by-one error here, specifically on the height between chains. Likely the patch to resolve will be in the relayer code. If it's in the Rust code, it'd be in the query sections, not in the state machine. @avahowell is currently adding more logging messages to try to isolate the specific conditions that cause this problem. We believe it occurs

// get the stored consensus state for the counterparty
let trusted_consensus_state = state
.get_verified_consensus_state(
self.proofs_height_on_a,
self.counterparty.client_id.clone(),
)
.await?;
but more detailed logging would be awesome for a lot of reasons. As for whether this will make it into 56, 🤞

conorsch added a commit that referenced this issue Jul 7, 2023
We encountered a long (~20s) timeout on all `rly` commands, and traced
it to a context timeout on a bad URL for the bablyon testchain. We're
still focusing on preview <-> devnet functionality, so let's exclude the
external testnet chains from relayer configuration until we're ready.

Also includes some documentation updates so it's a bit easier to
understand how test the relayer interactively. Eventually we'll want all
this logic bottled up in integration tests, e.g. `interchaintest`, but
still working toward that goal.

Refs #2612.
@conorsch
Copy link
Contributor Author

conorsch commented Jul 7, 2023

Success! We've re-establishing working clients, channels, and connections. The changes required were:

After those changes landed, we were able to get a working connection between a local devnet and preview. There are some small docs tweaks in #2804. Next week, we'll work on manual testing of packet transfers between the two chains. Once we have that working, we'll evaluate running a relayer instance between preview & testnet (which mostly depends on whether further code changes are needed in the Penumbra repo, in which case the target will be 57).

@conorsch conorsch closed this as completed Jul 7, 2023
conorsch added a commit that referenced this issue Jul 7, 2023
We encountered a long (~20s) timeout on all `rly` commands, and traced
it to a context timeout on a bad URL for the bablyon testchain. We're
still focusing on preview <-> devnet functionality, so let's exclude the
external testnet chains from relayer configuration until we're ready.

Also includes some documentation updates so it's a bit easier to
understand how test the relayer interactively. Eventually we'll want all
this logic bottled up in integration tests, e.g. `interchaintest`, but
still working toward that goal.

Refs #2612.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Testnet 56: Callisto
Development

No branches or pull requests

2 participants