[QE-users] A strange error when using GPU accelerated ph.x
lq1998
1148330678 at qq.com
Fri Apr 19 18:00:40 CEST 2024
Dear developers and users,
I tried to run the GPU version of QE for an electron-phonon coupling calculation on an A100 card. The structure relaxation and the self-consistent calculation were successful. However, when I ran the phonon calculation, my job crashed with a strange error:
##################
[m005:65520:0:65520] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffffc)
/fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phq_setup.f90: [ phq_setup_() ]
...
322 ! nat_todo, atomo, comp_irr
323
324 DO irr=0,nirr
==> 325 comp_irr(irr)=comp_irr_iq(irr,current_iq)
326 IF (elph .AND. irr>0) comp_elph(irr)=comp_irr(irr)
327 ENDDO
328 !
==== backtrace (tid: 65520) ====
0 0x00000000004a2780 phq_setup_() /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phq_setup.f90:325
1 0x00000000004700e1 initialize_ph_() /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/initialize_ph.f90:79
2 0x000000000041a811 do_phonon_() /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/do_phonon.f90:100
3 0x0000000000413d25 MAIN_() /fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phonon.f90:78
4 0x0000000000413c71 main() ???:0
5 0x0000000000022555 __libc_start_main() ???:0
6 0x000000000040cd8d _start() ???:0
=================================
[m005:65520] *** Process received signal ***
[m005:65520] Signal: Segmentation fault (11)
[m005:65520] Signal code: (-6)
[m005:65520] Failing at address: 0x6a80000fff0
[m005:65520] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2ac4edf09630]
[m005:65520] [ 1] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x4a2780]
[m005:65520] [ 2] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x4700e1]
[m005:65520] [ 3] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x41a811]
[m005:65520] [ 4] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x413d25]
[m005:65520] [ 5] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x413c71]
[m005:65520] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac4ee9e7555]
[m005:65520] [ 7] /fs08/home/js_luqing/src/qe-7.2/bin/ph.x[0x40cd8d]
[m005:65520] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node m005 exited on signal 11 (Segmentation fault).
#########################
I am not an expert in coding, but it seems that line 325 was not recognized, which is fairly strange. I don't know how to solve this problem, and I would be glad if anyone could help me.
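For anyone reading the trace, here is a minimal Python sketch that pulls two facts out of the log above: the faulting address read as a signed 64-bit number, and the routine, file, and line of the top backtrace frame. The helper names `faulting_address` and `top_frame` are hypothetical and are not part of QE or of any MPI tool. The two input strings are copied verbatim from the log. A backtrace pointing at line 325 means the crash happened while that line was executing, not that the compiler failed to recognize it. An address just below zero is consistent with one of the arrays on that line not being allocated at that point, but this is only an assumption and not a confirmed diagnosis.

```python
import re

# Two lines copied verbatim from the crash log above.
SIGNAL_LINE = ("[m005:65520:0:65520] Caught signal 11 (Segmentation fault: "
               "address not mapped to object at address 0xfffffffffffffffc)")
FRAME_LINE = ("0 0x00000000004a2780 phq_setup_() "
              "/fs08/home/js_luqing/src/qe-7.2/PHonon/PH/phq_setup.f90:325")


def faulting_address(signal_line):
    """Return the faulting address as a signed 64-bit integer (hypothetical helper)."""
    match = re.search(r"at address (0x[0-9a-fA-F]+)", signal_line)
    value = int(match.group(1), 16)
    # Reinterpret the unsigned 64-bit value as two's complement.
    return value - (1 << 64) if value >= (1 << 63) else value


def top_frame(frame_line):
    """Return (routine, source file name, line number) for one backtrace frame."""
    match = re.match(r"\d+\s+0x[0-9a-fA-F]+\s+(\w+)\(\)\s+(\S+):(\d+)$", frame_line)
    routine, path, line = match.groups()
    return routine, path.rsplit("/", 1)[-1], int(line)


print(faulting_address(SIGNAL_LINE))  # -4
print(top_frame(FRAME_LINE))          # ('phq_setup_', 'phq_setup.f90', 325)
```

The small negative address means the program tried to read or write 4 bytes below address zero, which is why the two arrays on line 325, `comp_irr` and `comp_irr_iq`, are the first things worth checking.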
Yours,
Qing Lu