
BiocParallel fail to start with MPI #120

Open
raffaelepotami opened this issue Jul 28, 2020 · 2 comments
@raffaelepotami

Hello Everyone,
We are having trouble running BiocParallel within our SLURM cluster environment.

The foo.R script we are trying to run is:

library("BiocParallel")
library("Rmpi")

param <- SnowParam(workers = 3, type = "MPI")
FUN <- function(i) system("hostname", intern=TRUE)
bplapply(1:6, FUN, BPPARAM = param)

If we request an interactive job allocation, for example with salloc -p mpi -N 2 -n 4 -t 1:00:00, then start R with mpiexec -np 1 R --no-save and run the above script from the interactive shell, we get the expected result:

> library("BiocParallel")
> library("Rmpi")
> param <- SnowParam(workers = 3, type = "MPI")
> FUN <- function(i) system("hostname", intern=TRUE)
> bplapply(1:6, FUN, BPPARAM = param)
	3 slaves are spawned successfully. 0 failed.
[[1]]
[1] "compute-a-16-21"

[[2]]
[1] "compute-a-16-21"

[[3]]
[1] "compute-a-16-22"

[[4]]
[1] "compute-a-16-22"

[[5]]
[1] "compute-a-16-22"

[[6]]
[1] "compute-a-16-22"

However, if we try to run the same R script from within an sbatch job with:

#!/bin/bash

#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -t 2:00:00

mpiexec -np 1 Rscript foo.R  # or R CMD BATCH foo.R 

The execution hangs for several seconds and eventually fails with the MPI error:

[compute-a-16-21:10780] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Does anyone have any idea why the primary R process is failing to start the other tasks?

Thank you
Raffaele

@raffaelepotami changed the title from "BiocParallel fail to" to "BiocParallel fail to start with MPI" on Jul 28, 2020
@raffaelepotami (Author)

Update:
starting the batch job with

mpiexec -np 1 R --no-save --file=foo.R

instead of R CMD BATCH or Rscript seems to work.
The execution still ends with an Open MPI error, since the task simply dies at the end rather than shutting down cleanly, but at least it does run hostname on the distributed nodes.
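For reference, combining the original submission script with the invocation that works for us gives the batch script below (the partition name, node counts, and time limit are the same site-specific values as in the original report):

#!/bin/bash

#SBATCH -p mpi
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -t 2:00:00

# Launch a single primary R process; Rmpi then spawns the workers.
# "R --no-save --file=foo.R" works where "Rscript foo.R" and
# "R CMD BATCH foo.R" fail with the MPI_INIT timeout.
mpiexec -np 1 R --no-save --file=foo.R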

@nturaga (Contributor)

nturaga commented Aug 7, 2020

Can you try the BiocParallel::BatchtoolsParam() interface on your SLURM cluster?
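A minimal sketch of what that might look like, untested on your cluster: BatchtoolsParam() submits each task through batchtools as a SLURM job, avoiding MPI spawning from within R entirely. The template file name and the resources list below are site-specific placeholders, not values from this thread.

## Run the same bplapply() via batchtools/SLURM instead of Rmpi.
library(BiocParallel)

param <- BatchtoolsParam(
    workers   = 3,
    cluster   = "slurm",
    template  = "slurm-simple.tmpl",   # placeholder: your batchtools SLURM template
    resources = list(walltime = 3600)  # placeholder: whatever your template expects
)

FUN <- function(i) system("hostname", intern = TRUE)
bplapply(1:6, FUN, BPPARAM = param)

Each of the six tasks then runs inside its own SLURM job, so no MPI_Init is needed in the workers.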
