Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: implement routing in xDS resolver with config selector API #7275

Merged

Conversation

voidzcy
Copy link
Contributor

@voidzcy voidzcy commented Jul 31, 2020

Implement xDS request routing with new architecture: make route decision in a config selector emitted by the xDS resolver.

Tests are subtle. Refactored to use a fake XdsClient for testing, which significantly reduces the complexity for triggering code under test. Further cleanup might be necessary.

Should not be merged until Channel/ClientCall's integration of config selector is done. Otherwise, the whole xDS LB code path will be broken.

@voidzcy voidzcy marked this pull request as ready for review August 3, 2020 17:59
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch 3 times, most recently from 86fb85f to 5e46290 Compare August 7, 2020 09:13
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch from 5e46290 to 1d6a137 Compare August 7, 2020 09:14
@dapengzhang0
Copy link
Member

XdsNameResolver2 looks a bit involved, still reviewing, adding @ejona86 to review. Other parts LGTM.

xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
}
}
}
if (cluster == null) { // should not happen if routing rules are configured correctly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible to get here with a null cluster? Aren't we guaranteed to have a cluster result here? How is it possible to go through the if or the else and not choose a cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this should never ever happen. The converted data should be guaranteed to be valid (aka, the weighted clusters list should have at least one cluster). I added that validation in the logic of parsing weighted clusters.

}

String serviceConfigJson =
generateServiceConfigWithMethodConfig(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do the service config generation and parsing for each RPC. That code should not be placed on the critical path, as we don't want performance pressure placed on it. It seems we should generate them ahead-of-time and just use the config object here.

This can be fine as a first cut, but we should consider when we'll improve it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putted a TODO item there.

xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
}

@SuppressWarnings("ModifyCollectionInEnhancedForLoop") // ok for concurrent map
private class ConfigWatcherImpl implements ConfigWatcher {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a weird place for this class. I'd expect it to be either near its usage (start()) or after the normal methods (shutdown()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address and format classes/methods correctly after all the other comments are addresses (just to avoid moving blocks of code around and make the review hard).

}
// Make newly added clusters selectable by config selector.
routes = update.getRoutes();
// Drops reference for deleted clusters, update service config to remove deleted clusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says what it does but does not provide a hint as to why. The ordering here is very important.

"Make newly added clusters selectable by config selector." could maybe change to "Make newly added clusters selectable by config selector and deleted clusters no longer selectable."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Put them together to reflect the ordering.

// Drops reference for deleted clusters, update service config to remove deleted clusters
// not in use.
boolean shouldUpdateResult = false;
for (Map.Entry<String, AtomicInteger> entry : clusterRefs.entrySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't right. This loops through "all deleted clusters," not "all newly deleted clusters." Consider receiving the same update twice; both updates will run decrementAndGet() for the same clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I missed this consideration. Fixed now and added a test case to cover such case.

xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
String methodName = fullMethodName.substring(index + 1);
String timeout = timeoutNano / 1_000_000_000.0 + "s";
Map<String, String> serviceMethod = new HashMap<>();
serviceMethod.put("service", serviceName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned before, why not use wildcard service/method name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch from d31a06b to f7c6bd8 Compare August 14, 2020 20:34
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch 2 times, most recently from 8c94094 to 5baf3f1 Compare August 14, 2020 21:51
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch from 5baf3f1 to 391c2e4 Compare August 14, 2020 22:16
xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
xds/src/main/java/io/grpc/xds/XdsNameResolver2.java Outdated Show resolved Hide resolved
private void releaseCluster(final String cluster) {
int count = clusterRefs.get(cluster).decrementAndGet();
if (count == 0) {
syncContext.execute(new Runnable() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster is added back by onConfigChanged() in between above two lines, it will not putIfAbsent() to clusterRefs, and will be removed later by this runnable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster is added back by onConfigChanged() in between above two lines, it will not putIfAbsent() to clusterRefs

Good point. I'll note that it is a problem even before remove() was moved into the sync context.

It looks like the for (String cluster : addedClusters) { is broken because it never calls retainCluster() if the cluster is already present; it had had zero references for the clusters in the set so it must acquire a reference for every cluster in addedClusters. In this case, retainCluster() would return false because the reference count couldn't be increased from 0→1.

I'll say that I was bothered that releaseCluster() is passed a string instead of a specific object. That make would make fixing the bug easier. I wished the reference count was for a specific object whose reference was tied 1:1 with the reference count. That object could be AtomicInteger, but that seems a bit weird and hard to debug; we may want to extend AtomicInteger (or use AtomicIntegerFieldUpdater) to have it store the cluster name String as well.

If we leave things mostly-alone, then we could fix this by adding the missing retainCluster() to the addedClusters loop. If retainCluster() fails it would do the put(). We'd then also change release cluster to use the two-argument remove: clusterRefs.remove(cluster, atomicInteger).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before remove() is moved into the sync context, the race window is small. Yeah, but the problem does exists. But now, I think the fix isn't that complicated as add and remove operations for clusterRefs are all done in the sync context now.

putIfAbsent() is no longer necessary (but clusterRefs still needs to be a concurrent map), we can just check if clusterRefs contains the cluster to be added. If it does, increment its ref count; if not, put AtomicInteger(1). These two steps do not need to be atomic. This fixes the problem of failing to add back clusters before the previous deleted one is removed from clusterRefs.

When trying to remove the cluster with ref count 0 (also in the sync context), we need to check again its ref count is indeed 0. If it is, remove it from clusterRefs; if it is not, do nothing. These two steps do not need to be atomic as the the only chance that could increment the ref count from 0 to 1 is receiving a new update that adds the cluster back, which happens in the sync context. This prevents removing the cluster that is added back again after its ref count is 0.

I added a test case (resolve_raceBetweenClusterReleasedAndResourceUpdateAddBackAgain) that reproduces this issue and now it is fixed.

… retained, and the cluster is removed and added back before the call finishes.
@voidzcy voidzcy force-pushed the impl/resolver_routing_with_config_selector branch from 1aaff29 to 549da89 Compare August 18, 2020 20:31
@voidzcy
Copy link
Contributor Author

voidzcy commented Aug 20, 2020

All comments have been addressed. PTAL.

TODO: rearrange methods and inner classes in XdsNameResolver2 (and maybe its tests as well).

Copy link
Member

@dapengzhang0 dapengzhang0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@voidzcy
Copy link
Contributor Author

voidzcy commented Aug 24, 2020

The last commit rearranged the ordering of inner classes and methods.

@voidzcy voidzcy merged commit 292f3b9 into grpc:master Aug 24, 2020
dfawley pushed a commit to dfawley/grpc-java that referenced this pull request Jan 15, 2021
…#7275)

Name resolver implementation for performing xDS request routing before the call is made: the resolver emits a config selector to the Channel to let calls make routing decision before delegating to the corresponding cluster load balancer's picker.

This is a branched xDS name resolver implementation. It will replace the existing xDS resolver once the Channel's integration for using config selector is done.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants