[QE-users] Issue with running parallel version of QE 6.3 on more than 16 CPUs
Martina Lessio
ml4132 at columbia.edu
Wed Aug 29 23:49:05 CEST 2018
Dear all,
I have been successfully using QE 5.4 for a while now, but I recently
decided to install the newest version, hoping that some issues I have been
experiencing with 5.4 would be resolved. However, I now have problems
running version 6.3 in parallel. In particular, if I run a sample
calculation (input file provided below) on more than 16 processors, the
calculation crashes after printing the line "Starting wfcs are random",
and the following error message is printed in the output file:
[compute-0-5.local:5241] *** An error occurred in MPI_Bcast
[compute-0-5.local:5241] *** on communicator MPI COMMUNICATOR 20 SPLIT FROM 18
[compute-0-5.local:5241] *** MPI_ERR_TRUNCATE: message truncated
[compute-0-5.local:5241] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpirun has exited due to process rank 16 with PID 5243 on
node compute-0-5.local exiting improperly. There are two reasons this could
occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-5.local:05226] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-5.local:05226] Set MCA parameter "orte_base_help_aggregate" to 0
to see all help / error messages
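For completeness, this is essentially how I launch the job (the process
count and file names below are placeholders rather than the exact ones from
my submission script):

# Hypothetical launch line; pw.x is the QE plane-wave executable and
# MoTe2.in stands for the input file copied at the end of this email.
mpirun -np 24 pw.x -in MoTe2.in > MoTe2.out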
Note that I have been running QE 5.4 on 24 CPUs on this same cluster
without any issues. I am copying my input file at the end of this email.
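In case it is relevant, pw.x also accepts command-line flags that set the
parallelization levels, e.g. -nk for the number of k-point pools and -nd
for the size of the diagonalization group, so I could also try rerunning
with something like the following (the flag values are purely illustrative,
not settings I have already tested):

# Illustrative only: same 24 processes, split into 4 k-point pools,
# with subspace diagonalization restricted to a single process (-nd 1).
mpirun -np 24 pw.x -nk 4 -nd 1 -in MoTe2.in > MoTe2.out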
Any help with this would be greatly appreciated.
Thank you in advance.
All the best,
Martina
Martina Lessio
Department of Chemistry
Columbia University
Input file:
&control
calculation = 'relax'
restart_mode='from_scratch',
prefix='MoTe2_bulk_opt_1',
pseudo_dir = '/home/mlessio/espresso-5.4.0/pseudo/',
outdir='/home/mlessio/espresso-5.4.0/tempdir/'
/
&system
ibrav= 4, A=3.530, B=3.530, C=13.882, cosAB=-0.5, cosAC=0, cosBC=0,
nat= 6, ntyp= 2,
ecutwfc =60.
occupations='smearing', smearing='gaussian', degauss=0.01
nspin =1
/
&electrons
mixing_mode = 'plain'
mixing_beta = 0.7
conv_thr = 1.0d-10
/
&ions
/
ATOMIC_SPECIES
Mo 95.96 Mo_ONCV_PBE_FR-1.0.upf
Te 127.6 Te_ONCV_PBE_FR-1.1.upf
ATOMIC_POSITIONS {crystal}
Te 0.333333334 0.666666643 0.625000034
Te 0.666666641 0.333333282 0.375000000
Te 0.666666641 0.333333282 0.125000000
Te 0.333333334 0.666666643 0.874999966
Mo 0.333333334 0.666666643 0.250000000
Mo 0.666666641 0.333333282 0.750000000
K_POINTS {automatic}
8 8 2 0 0 0