[QE-users] A strange error when using GPU accelerated ph.x

lq1998 1148330678 at qq.com
Fri Apr 19 18:00:40 CEST 2024


Dear developers and users,


I tried to run the GPU version of QE for an electron-phonon coupling calculation on an A100 card. The structure relaxation and the self-consistent calculation were successful. However, when I ran the phonon calculation, my job crashed with a strange error:


##################
[m005:65520:0:65520] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffffc)


/fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phq_setup.f90: [ phq_setup_() ]
      ...
      322   !     nat_todo, atomo, comp_irr
      323 
      324   DO irr=0,nirr
==>   325      comp_irr(irr)=comp_irr_iq(irr,current_iq)
      326      IF (elph .AND. irr>0) comp_elph(irr)=comp_irr(irr)
      327   ENDDO
      328   !


==== backtrace (tid:  65520) ====
 0 0x00000000004a2780 phq_setup_()  /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phq_setup.f90:325
 1 0x00000000004700e1 initialize_ph_()  /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/initialize_ph.f90:79
 2 0x000000000041a811 do_phonon_()  /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/do_phonon.f90:100
 3 0x0000000000413d25 MAIN_()  /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phonon.f90:78
 4 0x0000000000413c71 main()  ???:0
 5 0x0000000000022555 __libc_start_main()  ???:0
 6 0x000000000040cd8d _start()  ???:0
=================================
[m005:65520] *** Process received signal ***
[m005:65520] Signal: Segmentation fault (11)
[m005:65520] Signal code:  (-6)
[m005:65520] Failing at address: 0x6a80000fff0
[m005:65520] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2ac4edf09630]
[m005:65520] [ 1] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x4a2780]
[m005:65520] [ 2] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x4700e1]
[m005:65520] [ 3] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x41a811]
[m005:65520] [ 4] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x413d25]
[m005:65520] [ 5] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x413c71]
[m005:65520] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac4ee9e7555]
[m005:65520] [ 7] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x40cd8d]
[m005:65520] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node m005 exited on signal 11 (Segmentation fault).

#########################


I am not an expert in coding, but it seems that line 325 of phq_setup.f90 is where the job fails, which is fairly strange. I don't know how to solve this problem, and I would be glad if anyone could help me.


Yours,
Qing Lu

