AlchemiscaleClient get_tasks_status and set_tasks_status slow for many tasks #148

Closed
dotsdl opened this issue Jun 20, 2023 · 6 comments

@dotsdl
Member

dotsdl commented Jun 20, 2023

When resetting many tasks (e.g. 1000) from error to waiting, get_tasks_status and set_tasks_status take quite a bit of time. Both methods currently loop over the Task ScopedKeys they are given and perform their operation one key at a time.
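To make the cost concrete, here is a minimal sketch comparing one call per Task key with a single batched call. `FakeServer`, `get_status`, and `get_statuses` are hypothetical stand-ins invented for illustration, not the alchemiscale API; the only point is the number of round trips.

```python
from typing import Dict, List, Optional


class FakeServer:
    """Hypothetical stand-in for the API server; it only counts round trips."""

    def __init__(self, statuses: Dict[str, str]) -> None:
        self.statuses = dict(statuses)
        self.round_trips = 0

    def get_status(self, key: str) -> Optional[str]:
        # One round trip per Task key (the one-at-a-time pattern).
        self.round_trips += 1
        return self.statuses.get(key)

    def get_statuses(self, keys: List[str]) -> List[Optional[str]]:
        # One round trip carrying every key (an assumed batched endpoint).
        self.round_trips += 1
        return [self.statuses.get(key) for key in keys]


keys = [f"Task-{i}" for i in range(1000)]
server = FakeServer({key: "error" for key in keys})

# Looping: 1000 keys cost 1000 round trips.
looped = [server.get_status(key) for key in keys]
looped_trips = server.round_trips

# Batching: the same answer in a single round trip.
server.round_trips = 0
batched = server.get_statuses(keys)
batched_trips = server.round_trips
```

Both approaches return the same statuses; only the number of round trips differs, and that count is what grows with the number of tasks.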

We should identify ways to speed this up, including:

@dotsdl
Member Author

dotsdl commented Jun 20, 2023

@hmacdope can you drop any findings you come across on this question here?

@hmacdope
Collaborator

Google Firebase and Compute seem to offer batching for their endpoints, and both also support async modes.

Mailchimp also seems to have an async batched workflow, as does Meta's ad REST API.

Some info on doing batching within Neo4j queries themselves is here and here; IMO the second one is better. These make heavy use of UNWIND. There is also a general guide to Cypher performance here, which recommends dropping unused identifiers, which is something I think we do a little bit.
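As a rough sketch of the UNWIND approach from Python: all updates go into one list parameter and are sent as one query. The `Task` label and the `key` and `status` properties are placeholders, not the actual alchemiscale schema, and `set_statuses` is a hypothetical helper. `run` is assumed to follow the calling convention of the official driver's `session.run(query, **parameters)`; a recording fake stands in for it here so the example runs without a database.

```python
from typing import Any, Callable, Dict, List

# Placeholder label and property names; adjust to the real schema.
UNWIND_SET_STATUS = """
UNWIND $rows AS row
MATCH (t:Task {key: row.key})
SET t.status = row.status
RETURN t.key AS key, t.status AS status
"""


def set_statuses(run: Callable[..., Any], updates: Dict[str, str]) -> Any:
    """Send every status update in a single parameterized query."""
    rows: List[Dict[str, str]] = [
        {"key": key, "status": status} for key, status in updates.items()
    ]
    return run(UNWIND_SET_STATUS, rows=rows)


# Recording fake in place of session.run, so no database is needed.
calls = []


def fake_run(query: str, **params: Any) -> Any:
    calls.append((query, params))
    return params["rows"]


result = set_statuses(fake_run, {"Task-1": "waiting", "Task-2": "waiting"})
```

With a real session, the same helper would be called as `set_statuses(session.run, updates)`, so the database sees one query regardless of how many tasks are updated.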

Another (IMO great) option is to use APOC and its apoc.periodic.iterate functionality, documented here:

CALL apoc.periodic.iterate(
  "UNWIND $nodes AS node
   RETURN node",
  "MATCH (n:Label {id: node.id})
   SET n.property = node.property",
  {batchSize: 100, parallel: true, params: {nodes: $nodeList}}
)

@hmacdope
Collaborator

This may require the official drivers rather than py2neo, but something like:

from neo4j import GraphDatabase

# Assuming you have a list of unique IDs
id_list = [1, 2, 3, 4]

# Connect to Neo4j (URI and credentials are placeholders)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Fetch all matching nodes with a single query instead of one query per ID
with driver.session() as session:
    result = session.run("MATCH (n:Label) WHERE n.id IN $idList RETURN n", idList=id_list)
    nodes = result.data()

# Release the driver's connections
driver.close()

# Print the retrieved nodes
for node in nodes:
    print(node)

This may work, according to Copilot X.

@hmacdope
Collaborator

I will say that most of the batched APIs have a hardcoded limit on the batch size.
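If we adopt such a limit, the client could split a large request into chunks before sending. A minimal sketch, where `chunked` is a hypothetical helper and the limit of 300 is an arbitrary illustrative value, not a number taken from any of those APIs:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 1000 task keys with an assumed server-side limit of 300 per request.
keys = [f"Task-{i}" for i in range(1000)]
batches = list(chunked(keys, 300))
batch_sizes = [len(batch) for batch in batches]
```

Each batch would then go out as one request, so 1000 tasks cost four round trips under this assumed limit.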

@dotsdl
Member Author

dotsdl commented Jun 20, 2023

Awesome, thanks for these notes @hmacdope! This is great fodder for a decision on how to proceed for these methods.

dotsdl added a commit that referenced this issue Sep 8, 2023
@dotsdl
Member Author

dotsdl commented Jun 5, 2024

Closed by #150.

@dotsdl dotsdl closed this as completed Jun 5, 2024