[Pw_forum] Error "Not diagonalizing because representation xx is not done" in "image" parallelization by ph.x

Ye Luo xw111luoye at gmail.com
Thu May 5 17:24:10 CEST 2016

Hi Coiby,

Using -ntg and -ndiag is not the same as using a threaded library.
Those options are useful when a calculation runs on more than about 1k
processors without image parallelization (-ni), and they need to be tested
carefully with pw.x before you start any large, rigorous calculations.
I don't think you need them.

What I meant is using the threaded math library MKL. You need to build the QE
suite with OpenMP (check make.sys to see whether the threaded MKL is linked).
I recommend using at least QE version 5.3.
If you have 32 nodes with 24 cores each, set
export OMP_NUM_THREADS=4 # each MPI rank has 4 threads
and make sure each node gets only 6 MPI ranks (32 nodes x 6 ranks = 192 ranks).
Then still run the following:
mpirun -np 192 ph.x -ni 4 -nk 3 -inp your_ph.input
You don't even need to redo your pw calculation if the only change is that
the threads and the total number of nodes are increased correspondingly.
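
The rank and thread counts above can be worked out with a short script. This
is only a sketch using the numbers from this message (32 nodes, 24 cores per
node, 4 threads per rank). The variable names are made up for illustration,
and the command is printed rather than executed. How to place exactly 6 ranks
on each node depends on your MPI launcher or batch system and is not shown.

```shell
#!/bin/sh
# Hypothetical variable names; values come from the example in this message.
nodes=32
cores_per_node=24
threads_per_rank=4

# Each rank occupies threads_per_rank cores, so a 24-core node holds 24/4 = 6 ranks.
ranks_per_node=$((cores_per_node / threads_per_rank))
total_ranks=$((nodes * ranks_per_node))

export OMP_NUM_THREADS=$threads_per_rank

# Build the ph.x command line; it is echoed here, not run.
cmd="mpirun -np $total_ranks ph.x -ni 4 -nk 3 -inp your_ph.input"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS ranks_per_node=$ranks_per_node"
echo "$cmd"
```

If you change the node count or the threads per rank, check that the printed
-np value still matches the layout you used for pw.x.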

For phonon calculations, I also often run into disk quota issues. ph.x uses a
lot of disk space even when reduce_io=.true. is set.
If you have more than 3 k-points, try to maximize your -nk option and
reduce -ni correspondingly.
A larger -nk uses more disk, but a smaller -ni uses less. Overall, you should
need less disk space without compromising parallel efficiency.
If you still have disk issues, use fewer images and more threads.
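
To see which splits are available, the sketch below lists the -ni/-nk pairs
that keep the same 192 MPI ranks and the same 16 ranks per pool as the
-ni 4 -nk 3 example above. It only does the arithmetic. It assumes that -nk
should not exceed the number of k-points (nkpts is a hypothetical value that
you must set for your system), and it does not check whether a split is
compatible with an existing pw.x run.

```shell
#!/bin/sh
total_ranks=192     # from the mpirun example in this message
ranks_per_pool=16   # 192 / (4 images * 3 pools)
nkpts=12            # hypothetical number of k-points; set this for your system

# ni * nk must equal this product to keep total_ranks and ranks_per_pool fixed.
product=$((total_ranks / ranks_per_pool))

splits=""
ni=1
while [ "$ni" -le "$product" ]; do
  # Only divisors of the product give a whole number of pools per image.
  if [ $((product % ni)) -eq 0 ]; then
    nk=$((product / ni))
    # Skip splits that would ask for more pools than there are k-points.
    if [ "$nk" -le "$nkpts" ]; then
      splits="$splits -ni $ni -nk $nk;"
      echo "mpirun -np $total_ranks ph.x -ni $ni -nk $nk -inp your_ph.input"
    fi
  fi
  ni=$((ni + 1))
done
```

The pairs are printed in order of increasing -ni, so the lines at the top
(few images, many pools) are the ones to try first when disk space is the
limit.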


Ye Luo, Ph.D.
Leadership Computing Facility
Argonne National Laboratory

2016-05-05 8:42 GMT-05:00 Coiby Xu <coiby.xu at gmail.com>:

> Dear Dr. Luo,
> Thank you for your detailed reply!
> I'm sorry I disabled Mail delivery before so I didn't receive the email
> until I checked the mailing list archive.
> I've successfully run the phonon calculation without using *wf_collect=.true.*
> following your advice. This helps reduce the size of *outdir* from 142G
> to 48G.
> For threaded MKL and FFT, I tested one case (-nimage 48 -npool 3 -ntg 2
> -ndiag 4). To my surprise, it's marginally slower than the calculation
> without *-ntg 2 -ndiag 4*. In PHonon/examples/Image_example, I didn't
> find any useful info.
> In the file environment_variables, no info about ntg and ndiag is given:
>> PARA_POSTFIX=" -nk 1 -nd 1 -nb 1 -nt 1 "
>> PARA_IMAGE_PREFIX="mpirun -np 4"
> I also checked the job log for failed calculation ("Not diagonalizing
> because representation xx is not done"). Maybe ph.x crashes due to I/O
> problem (the size of outdir was 142G).
> forrtl: No such file or directory
>> forrtl: No such file or directory
>> forrtl: severe (28): CLOSE error, unit 20, file "Unknown"
>> Image              PC                Routine            Line
>> Source
>> ph.x               000000000088A00F  Unknown               Unknown
>> Unknown
>> ph.x               0000000000517B26  buffers_mp_close_         620
>> buffers.f90
>> ph.x               00000000004B85E8  close_phq_                 39
>> close_phq.f90
>> ph.x               00000000004B7888  clean_pw_ph_               41
>> clean_pw_ph.f90
>> ph.x               000000000042E5EF  do_phonon_                126
>> do_phonon.f90
>> ph.x               000000000042A554  MAIN__                     78
>> phonon.f90
>> ph.x               000000000042A4B6  Unknown               Unknown
>> Unknown
>> libc.so.6          0000003921A1ED1D  Unknown               Unknown
>> Unknown
>> ph.x               000000000042A3A9  Unknown               Unknown
>> Unknown
> forrtl: severe (28): CLOSE error, unit 20, file "Unknown"
> Btw, I'm from School of Earth and Space Science of USTC.
> On Wed, May 4, 2016 at 07:41:30 CEST, Ye Luo <xw111luoye at gmail.com> wrote:
>> Hi Coiby,
>> "it seems to be one requirement to let ph.x and pw.x have the same number
>> of processors."
>> This is not true.
>> If you are using image parallelization in your phonon calculation, you need
>> to maintain the same amount of processes per image as your pw calculation.
>> In this way, wf_collect=.true. is not needed.
>> Here is an example. I assume you use k point parallelization (-nk).
>> 1, mpirun -np 48 pw.x -nk 12 -inp your_pw.input
>> 2, mpirun -np 192 ph.x -ni 4 -nk 12 -inp your_ph.input
>> In this step, you might notice "Not diagonalizing because representation
>> xx is not done" which is normal.
>> The code should not abort because of this.
>> 3, After all the representations belonging to a given q or q-mesh have
>>     been calculated, just add "recover = .true." to your_ph.input and run
>>     mpirun -np 48 ph.x -nk  12 -inp your_ph.input
>>     The dynamical matrix will be computed for that q.
>> If you are confident with threaded pw.x, ph.x also benefits from
>> threaded MKL and FFT, and the time to solution is further reduced.
>> For more details, you can look into PHonon/examples/Image_example.
>> P.S.
>> Your affiliation is missing.
>> ===================
>> Ye Luo, Ph.D.
>> Leadership Computing Facility
>> Argonne National Laboratory
>> On Wed, May 4, 2016 at 11:33 AM, Coiby Xu <coiby.xu at gmail.com> wrote:
>>> Dear Quantum Espresso Developers and Users,
>>> I'm running a phonon calculation parallelizing over the
>>> representations/q vectors. For my cluster, there are 24 cores per node. I
>>> want to use as many nodes as possible to speed up the calculation.
>>> I set the number of parallelizations to be the number of nodes,
>>>> mpirun -np NUMBER_OF_NODESx24  ph.x -nimage NUMBER_OF_NODES
>>> If I use only 4 nodes (4 images) or 8 nodes (8 images), the calculation
>>> finishes successfully. However, when more than 8 nodes, say 16 or 32,
>>> are used, every run gives the following error:
>>>> Not diagonalizing because representation  xx is not done
>>> Btw, I want to reduce I/O overhead by dropping the `wf_collect` option,
>>> but the following way doesn't work (the number of processors and pools for
>>> the scf calculation is the same as in the phonon calculation):
>>> mpirun -np NUMBER_OF_NODESx24  pw.x
>>> ph.x complains,
>>>> Error in routine phq_readin (1):pw.x run with a different number of
>>>> processors.
>>>> Use wf_collect=.true.
>>> The beginning output of pw.x,
>>>>     Parallel version (MPI), running on    96 processors
>>>>      R & G space division:  proc/nbgrp/npool/nimage =      96
>>>>      Waiting for input...
>>>>      Reading input from standard input
>>> and the beginning output of ph.x,
>>>>  Parallel version (MPI), running on    96 processors
>>>>      path-images division:  nimage    =       4
>>>>      R & G space division:  proc/nbgrp/npool/nimage =      24
>>> Do I miss something? I know it's inefficient to let pw.x use so many
>>> processors, but it seems to be one requirement to let ph.x and pw.x have
>>> the same number of processors.
>>> Thank you!
>>> --
>>> *Best regards,*
>>> *Coiby*
> --
> *Best regards,*
> *Coiby*