[Pw_forum] Error "Not diagonalizing because representation xx is not done" in "image" parallelization by ph.x
xw111luoye at gmail.com
Thu May 5 17:24:10 CEST 2016
Using -ntg and -ndiag doesn't necessarily mean using a threaded library.
Those options are useful when your calculation involves over 1k processors
without image parallelization (-ni),
and they need to be tested carefully with pw.x before performing any large run.
I don't think you need them.
What I meant was using the threaded math library MKL. You need to build the QE suite
with OpenMP (check make.sys to see whether the threaded MKL is linked). I
recommend using at least QE version 5.3.
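As a sketch of that check (--enable-openmp is the usual QE configure switch, but the exact options depend on your compiler and MPI stack):

```shell
# Rebuild QE with OpenMP enabled (common invocation; adjust for your
# compiler/MPI stack).
./configure --enable-openmp
# Then inspect make.sys: you should see an OpenMP flag and the threaded
# MKL library (e.g. mkl_intel_thread) rather than the sequential one.
grep -i openmp make.sys
grep -i mkl make.sys
```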
If you have 32 nodes with 24 cores each:
export OMP_NUM_THREADS=4 # each MPI rank runs 4 threads. You need to ensure
each node gets only 6 MPI ranks (6 ranks x 4 threads = 24 cores per node).
Then run the following:
mpirun -np 192 ph.x -ni 4 -nk 3 -inp your_ph.input
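The arithmetic behind that layout can be sanity-checked with a few lines of shell (the node and core counts are the ones assumed above):

```shell
# 32 nodes x 24 cores, 4 OpenMP threads per MPI rank
# -> 6 ranks per node, 192 ranks in total.
NODES=32
CORES_PER_NODE=24
THREADS_PER_RANK=4            # what OMP_NUM_THREADS is set to
RANKS_PER_NODE=$((CORES_PER_NODE / THREADS_PER_RANK))
TOTAL_RANKS=$((NODES * RANKS_PER_NODE))
echo "ranks/node=$RANKS_PER_NODE total_ranks=$TOTAL_RANKS"
```

With -ni 4 -nk 3, those 192 ranks divide as 4 images x 3 pools x 16 ranks per pool, so both flags divide the rank count evenly.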
You don't even need to redo your pw calculation as long as the threads and
total nodes used are increased correspondingly.
For phonon calculations, I also often run into disk quota issues. ph.x eats a lot
of disk even when reduce_io=.true. is set.
If you have more than 3 k-points, try to maximize your -nk option and
reduce the corresponding -ni.
A larger -nk uses more disk per image, but fewer images (-ni) use less disk
overall; in sum, you should need less disk without compromising parallel
efficiency.
If you still run into disk issues, use fewer images and more threads.
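To keep an eye on how fast the scratch space grows, something like the following works. The demo runs on a throwaway directory; in practice you would point OUTDIR at the outdir from your ph.x input (the _ph0 subdirectory name is where ph.x writes its scratch, and dummy.dat is just a stand-in file for the demo):

```shell
# Check how much disk the phonon scratch directory is using.
# Demo on a throwaway directory; in practice set OUTDIR to the outdir
# from your ph.x input.
OUTDIR=$(mktemp -d)
mkdir -p "$OUTDIR/_ph0"
head -c 16384 /dev/zero > "$OUTDIR/_ph0/dummy.dat"   # stand-in data file
SIZE_KB=$(du -sk "$OUTDIR" | cut -f1)
echo "outdir size: ${SIZE_KB}K"
rm -r "$OUTDIR"
```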
Ye Luo, Ph.D.
Leadership Computing Facility
Argonne National Laboratory
2016-05-05 8:42 GMT-05:00 Coiby Xu <coiby.xu at gmail.com>:
> Dear Dr. Luo,
> Thank you for your detailed reply!
> I'm sorry, I disabled mail delivery before, so I didn't receive the email
> until I checked the mailing list archive.
> I've successfully run the phonon calculation without *wf_collect=.true.*
> following your advice. This helped reduce the size of *outdir* from 142G
> to 48G.
> For threaded MKL and FFT, I tested one case (-nimage 48 -npool 3 -ntg 2
> -ndiag 4). To my surprise, it was marginally slower than the calculation
> without *-ntg 2 -ndiag 4*. In PHonon/examples/Image_example, I didn't
> find any useful info.
>> PH_IMAGE_COMMAND="$PARA_IMAGE_PREFIX $BIN_DIR/ph.x $PARA_IMAGE_POSTFIX"
> In the file environment_variables, no info about ntg and ndiag is given:
>> PARA_POSTFIX=" -nk 1 -nd 1 -nb 1 -nt 1 "
>> PARA_IMAGE_POSTFIX="-ni 2 $PARA_POSTFIX"
>> PARA_IMAGE_PREFIX="mpirun -np 4"
> I also checked the job log for the failed calculation ("Not diagonalizing
> because representation xx is not done"). Maybe ph.x crashed due to an I/O
> problem (the size of outdir was 142G).
>> forrtl: No such file or directory
>> forrtl: severe (28): CLOSE error, unit 20, file "Unknown"
>> Image PC Routine Line
>> ph.x 000000000088A00F Unknown Unknown
>> ph.x 0000000000517B26 buffers_mp_close_ 620
>> ph.x 00000000004B85E8 close_phq_ 39
>> ph.x 00000000004B7888 clean_pw_ph_ 41
>> ph.x 000000000042E5EF do_phonon_ 126
>> ph.x 000000000042A554 MAIN__ 78
>> ph.x 000000000042A4B6 Unknown Unknown
>> libc.so.6 0000003921A1ED1D Unknown Unknown
>> ph.x 000000000042A3A9 Unknown Unknown
> Btw, I'm from School of Earth and Space Science of USTC.
> On Wed, May 4, 2016 at 07:41:30 CEST, Ye Luo <xw111luoye at gmail.com> wrote:
>> Hi Coiby,
>> "it seems to be one requirement to let ph.x and pw.x have the same number
>> of processors."
>> This is not true.
>> If you are using image parallelization in your phonon calculation, you need
>> to maintain the same number of processes per image as in your pw calculation.
>> In this way, wf_collect=.true. is not needed.
>> Here is an example. I assume you use k point parallelization (-nk).
>> 1, mpirun -np 48 pw.x -nk 12 -inp your_pw.input
>> 2, mpirun -np 192 ph.x -ni 4 -nk 12 -inp your_ph.input
>> In this step, you might notice "Not diagonalizing because representation
>> xx is not done" which is normal.
>> The code should not abort because of this.
>> 3, After calculating all the representations belonging to a given q or q-mesh,
>> just add "recover = .true." in your_ph.input and run
>> mpirun -np 48 ph.x -nk 12 -inp your_ph.input
>> The dynamical matrix will be computed for that q.
>> If you are comfortable with threaded pw.x, ph.x also benefits from
>> threaded MKL and FFT, and the time to solution is further reduced.
>> For more details, you can look into PHonon/examples/Image_example.
>> Your affiliation is missing.
>> Ye Luo, Ph.D.
>> Leadership Computing Facility
>> Argonne National Laboratory
>> On Wed, May 4, 2016 at 11:33 AM, Coiby Xu <coiby.xu at gmail.com> wrote:
>>> Dear Quantum Espresso Developers and Users,
>>> I'm running a phonon calculation parallelized over the
>>> representations/q vectors. For my cluster, there are 24 cores per node. I
>>> want to use as many nodes as possible to speed up the calculation.
>>> I set the number of parallelizations to be the number of nodes,
>>>> mpirun -np NUMBER_OF_NODESx24 ph.x -nimage NUMBER_OF_NODES
>>> If I use only 4 nodes (4 images) or 8 nodes (8 images), the calculation
>>> finishes successfully. However, when more than 8 nodes, say 16 or 32,
>>> are used, every run of the calculation produces the error
>>>> Not diagonalizing because representation xx is not done
>>> Btw, I want to reduce I/O overhead by dropping the `wf_collect` option,
>>> but the following doesn't work (the number of processors and pools for the
>>> scf calculation is the same as in the phonon calculation):
>>> mpirun -np NUMBER_OF_NODESx24 pw.x
>>> ph.x complains,
>>>> Error in routine phq_readin (1): pw.x run with a different number of
>>>> processors. Use wf_collect=.true.
>>> The beginning output of pw.x,
>>>> Parallel version (MPI), running on 96 processors
>>>> R & G space division: proc/nbgrp/npool/nimage = 96
>>>> Waiting for input...
>>>> Reading input from standard input
>>> and the beginning output of ph.x,
>>>> Parallel version (MPI), running on 96 processors
>>>> path-images division: nimage = 4
>>>> R & G space division: proc/nbgrp/npool/nimage = 24
>>> Am I missing something? I know it's inefficient to let pw.x use so many
>>> processors, but it seems to be a requirement that ph.x and pw.x use the
>>> same number of processors.
>>> Thank you!
>>> *Best regards,*
> *Best regards,*