[QE-developers] [QE-GPU] Computing Pc [DH,Drho] |psi>

Vladislav Olegovich KHAUSTOV vladislav.khaustov at sns.it
Sun May 26 14:46:51 CEST 2024


Dear QE developers,
I am trying to perform Raman simulations for a 24-atom system on GPU nodes,
but I run into a problem at the "Computing Pc [DH,Drho] |psi>" calculation step.
The problem seems to be related to memory allocation on the device. I can use
at most 16 GPUs (NVIDIA Tesla V100, 16 GB each), and with or without the
"-npool" parallelization option the run prints the following lines and then
shuts down (the Raman-related part of my ph.x input is sketched after this
output):

"Computing Pc [DH,Drho] |psi>

     Derivative coefficient:  0.001000    Threshold: 1.00E-12
     kpoint   1 ibnd**** pcgreen: root not converged 1.654E-12
     kpoint   1 ibnd**** pcgreen: root not converged 3.354E-12
     kpoint   1 ibnd**** pcgreen: root not converged 1.592E-09
     kpoint   1 ibnd**** pcgreen: root not converged 5.693E-11
     kpoint   1 ibnd**** pcgreen: root not converged 2.128E-10
     kpoint   1 ibnd**** pcgreen: root not converged 1.017E-11
     kpoint   1 ibnd**** pcgreen: root not converged 7.944E-12
     kpoint   1 ibnd**** pcgreen: root not converged 2.466E-11
"

I can also add CPUs (up to 80) and run with "-npool 16"; in that case there are
no convergence messages, but the run still stops after some time. Setting
verbosity='high' does not reveal anything useful at this calculation step, but
the system log of the GPU job reports the following after the shutdown (the
launch command I use is sketched after this log):

"
Loading pgi-23.5/quantum-espresso/7.2
  Loading requirement: nvhpc-nompi/23.5 nvidia/cuda-12.0.0 gcc-8.5.0/hwloc-2.9.0
    gcc-8.5.0/gdrcopy-2.3.1 gcc-8.5.0/ucx-1.14.1_gdr gcc-8.5.0/pmix-4.2.2
    gcc-8.5.0/ucc-1.2.0_ucx-gdr pgi-23.5/ompi-4.1.4_nccl_pbs
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

...
deleted block   device:0x150274000000 size:132475392 threadid=1
deleted block   device:0x1502d2000000 size:147194880 threadid=1
deleted block   device:0x150280000000 size:147194880 threadid=1
deleted block   device:0x1502ee000000 size:147194880 threadid=1
deleted block   device:0x1502be000000 size:147194880 threadid=1
deleted block   device:0x1502c8000000 size:147194880 threadid=1
deleted block   device:0x1501bc000000 size:518400000 threadid=1
deleted block   device:0x150246000000 size:618218496 threadid=1
FATAL ERROR: data in use_device clause was not found on device 4:
host:0x149a8210

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:

  Process name: [[28735,1],7]
  Exit code:    1
"
The simulation works fine with the CPU version of QE on 256 CPUs; a successful
run prints the following:

"     Computing Pc [DH,Drho] |psi>

     Derivative coefficient:  0.001000    Threshold: 1.00E-12
     Non-scf  u_k: avg # of iterations = 45.7
     Non-scf Du_k: avg # of iterations = 65.1
"
Remarkably, the preceding "Electric Fields Calculation" step and the phonon
calculations show a speed-up factor of 10-20 when moving from 256 CPUs to 16
GPUs.

Is there a way to overcome this issue at the "Computing Pc [DH,Drho] |psi>"
calculation step?

-- 

Best regards,

Vladislav Khaustov.