
Add option to seed numba RNG #89

Open
wants to merge 1 commit into main

Conversation

juliustao
Collaborator

Calling numpy.random.seed at the start of a program sets the global seed for all numpy.random methods.
However, calling this method from regular Python code does not seed the numba generator, so numba JIT-compiled code is nondeterministic (e.g., the cutout_square() function returned by generate_code() in the Cutout operation).
See this numba documentation for further details.
Adding an optional seed argument to such random transforms allows reproducibility across runs.

@GuillaumeLeclerc
Collaborator

Hi! I'm not sure doing this as part of the transforms is optimal, since every transform with randomness would have to add it, and users would have to set the seed many times. Moreover, it would run at every batch, which would slow things down. Maybe there is a better place.
Any idea how long the random state in numba lives? Is it per thread or per process?

@juliustao
Collaborator Author

Thanks for the quick response Guillaume!

I agree that this is not a great way to seed the random state in numba. I wasn't sure how to modify the Operation parent class so that np.random.seed(seed) could be called in an arbitrary function returned by generate_code.

A nicer solution would be to seed the numba random state once for all future JIT-compiled functions.

I'm not too familiar with numba, but the documentation linked above says that

Since version 0.28.0, the generator is thread-safe and fork-safe. Each thread and each process will produce independent streams of random numbers.

This numba thread suggests that it's possible to set the numba random state once at the start for determinism in single-threaded code. I'll dig into this more and run some tests once the Slurm cluster is back up.

Hope that clarified some questions :)

@GuillaumeLeclerc
Collaborator

I think the best approach would be to do it at the start of loading, using the seed argument. Doing it in the operations means you get the same random sequence at each batch, which will most likely produce adverse results during training.
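To illustrate this point with a toy sketch (the transform below is hypothetical, not ffcv code): seeding inside the transform itself resets the generator on every call, so every batch receives the identical "random" augmentation.

```python
import numpy as np

def noisy_transform(batch, seed=None):
    # Hypothetical per-batch transform: seeding here resets the
    # generator on every call, so each batch gets the same noise.
    if seed is not None:
        np.random.seed(seed)
    return batch + np.random.random(batch.shape)

batch = np.zeros((2, 3))
out1 = noisy_transform(batch, seed=0)
out2 = noisy_transform(batch, seed=0)
assert (out1 == out2).all()  # augmentation is identical for every batch
```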

@juliustao juliustao reopened this Jan 24, 2022
@juliustao
Collaborator Author

I was able to get determinism after setting training.num_workers = 1 with

```python
torch.backends.cudnn.deterministic = True
torch.manual_seed(SEED)
np.random.seed(SEED)
```

in train_cifar.py and setting the numba seed inside the EpochIterator thread.

I cannot set the numba seed in train_cifar.py as I do for numpy or torch, since every thread has an independent numba state.
This solution is still suboptimal, and I hope there's a simple fix that I overlooked.

@juliustao
Collaborator Author

Also, I'm confused about the threading in ffcv: why is the EpochIterator object returned by iter(Loader()) implemented as a Thread? Is it because waiting for the CUDA stream takes significant time during which we can perform other CPU operations?

@GuillaumeLeclerc
Collaborator

I suspect that it would work with multiple workers too, as workers are only active in the body of the transforms. Have you tried that?

EpochIterator is implemented as a thread so that the augmentations (especially the CPU ones) are not blocking the main training loop of the user

@juliustao
Collaborator Author

With the code above, setting training.num_workers > 1 does not give deterministic results :(

I haven't figured out exactly why that's the case, but I suspect the cause is numba threads interleaving randomly.

@GuillaumeLeclerc
Collaborator

Oh, that's a good call in my opinion. I'm not sure yet how to get around that problem. Do you personally have use cases where determinism is needed? Determinism usually doesn't play well with high-performance code (cuDNN's deterministic mode can be significantly slower, too).

@juliustao
Collaborator Author

My current work looks at how fixing different sources of randomness affects training outcomes, and data augmentations are one such source. Maybe this use case is rather niche, and the changes are not worth the performance hit. Hopefully this thread can at least help others with similar issues :)

@juliustao
Collaborator Author

On a related note, is the desired default behavior of the Random TraversalOrder to have the same shuffle order across independent runs? The default is self.seed = self.loader.seed = 0, which implies the above since the seed for each epoch is always self.seed + epoch.
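The per-epoch seeding scheme described above can be sketched as follows (the helper function is illustrative, not ffcv's implementation):

```python
import numpy as np

def epoch_order(base_seed, epoch, n):
    # Mirrors the TraversalOrder scheme described above:
    # the per-epoch seed is always base_seed + epoch.
    rng = np.random.default_rng(base_seed + epoch)
    return rng.permutation(n)

# With the default base seed of 0, two independent runs shuffle
# every epoch identically:
run1 = epoch_order(0, epoch=5, n=8)
run2 = epoch_order(0, epoch=5, n=8)
assert (run1 == run2).all()
```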

@GuillaumeLeclerc
Collaborator

GuillaumeLeclerc commented Jan 24, 2022 via email

@heitorrapela

Did you succeed in running ffcv deterministically, @juliustao? I am facing a similar problem, with a gap of more than 5 points in my metric between runs of the same code. I seed everything, but I was looking for a way to seed the workers, as with PyTorch dataloaders, and could not find one.
