WIP - NNFDM implementation in AXL #131

Draft
wants to merge 39 commits into
base: main

mcfadden8
Contributor

It basically compiles, but it cannot be fully tested until we have an
operational server.

src/axl_async_nnfdm.c Outdated
@adammoody
Contributor

Thanks @mcfadden8 . Nicely done.

I haven't checked things under this context, but it would be good to think through whether the HPE API supports SCR's scalable restart and scavenge operations.

In the case of a scalable restart, we normally try to cancel any outstanding flush. Since there is no way to cancel, I think we'd need the restarted job to be able to resume and/or wait on any outstanding flush that was started from a previous run, i.e., I don't think we'd want the restarted job to initiate a new flush of the same files that are already in progress from a flush in a prior run.

For scavenge, is there a way for the job script to see the status of a flush started by the last run? If not, will there be problems if we try to copy the files again while a flush may still be ongoing?

@mcfadden8
Contributor Author

mcfadden8 commented Aug 2, 2022

Hi @adammoody, for scalable restart, I agree that we would need an API to cancel any outstanding flushes.

For both scalable restart and scavenge, I think we will need a way to list the requests that are still in progress from any previous runs.

I think that these requests may already be documented, but we should discuss to be sure I am understanding things correctly.

@mcfadden8
Contributor Author

@adammoody - I've integrated with the latest C++ API provided by HPE. My next step will be to add their new API for canceling and enumerating old jobs, which should allow us to support scalable restart and scavenge.

Comment on lines +37 to +40
IF(HAVE_NNFDM)
LIST(APPEND libaxl_srcs nnfdm.cpp)
ENDIF(HAVE_NNFDM)

Contributor

I think we'll also want this in src/dist/CMakeLists.txt.

src/nnfdm.cpp Outdated
Comment on lines 185 to 187
do {
status = nnfdm_stat(uid, max_seconds_to_wait);
} while (status == AXL_STATUS_INPROG);
Contributor

For future work, we may want to add a sleep() in here to avoid thrashing the server with poll requests.
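
As a rough illustration of that suggestion, here is a minimal sketch of a polling loop that sleeps between status checks and doubles the delay up to a cap. It is not the PR's code: `poll_with_backoff`, the `SketchStatus` values, and the delay parameters are all hypothetical names chosen for this example, and the status callback stands in for a call such as `nnfdm_stat(uid, max_seconds_to_wait)`.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Stand-in status codes for this sketch only (assumption); the real AXL
// status values such as AXL_STATUS_INPROG are defined in the AXL sources.
enum SketchStatus { SKETCH_INPROG = 0, SKETCH_DONE = 1 };

// Hypothetical helper: call `stat_fn` until it stops reporting "in progress",
// sleeping between polls. The delay doubles after each poll, capped at
// `max_delay`, so a long transfer does not flood the server with requests.
int poll_with_backoff(const std::function<int()>& stat_fn,
                      std::chrono::milliseconds initial_delay,
                      std::chrono::milliseconds max_delay,
                      int* polls_out = nullptr)
{
    std::chrono::milliseconds delay = initial_delay;
    int polls = 0;
    int status = SKETCH_INPROG;

    while (true) {
        status = stat_fn();
        polls++;
        if (status != SKETCH_INPROG) {
            break;  // finished (or failed): stop polling
        }
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }

    if (polls_out != nullptr) {
        *polls_out = polls;  // report how many status calls were made
    }
    return status;
}
```

In the loop above, the callback would wrap the existing status call. If `nnfdm_stat` already blocks for up to `max_seconds_to_wait` on the server side, a fixed short sleep may be enough and the backoff could be dropped.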

@adammoody
Copy link
Contributor

Nice. Thanks, @mcfadden8

src/axl.c Outdated
Comment on lines 850 to 853
#ifdef HAVE_NNFDM
case AXL_XFER_ASYNC_NNFDM:
break;
#endif /* HAVE_NNFDM */
Contributor

Let's include a comment here about why we don't have to do anything.

@mcfadden8 mcfadden8 changed the title Prototype for NNFDM implementation in AXL NNFDM implementation in AXL Aug 7, 2023
@mcfadden8 mcfadden8 changed the title NNFDM implementation in AXL WIP - NNFDM implementation in AXL Jun 26, 2024
@gonsie gonsie mentioned this pull request Jun 26, 2024
8 tasks