#1 Low CPU utilisation by Terry 11.01.2022 11:59

Hi folks

I am continuing my investigation of ZnO slabs. For the attached inputs I have 32 k-points, so I run with 32 MPI processes. For similar (but different) slabs on the machine I am running on, I can get good CPU utilisation if I assign up to 14 cores to each MPI process. (FWIW, the next compatible graining puts 1 process on each 28-core node, which does not perform well.)

For the attached job I don't seem to be able to get more than around 5-6 cores working effectively on each k-point. Trying to give fleur more cores does not decrease wall times. Is there something wrong with the way I am setting this up, or am I just hitting some inherent scaling limit within fleur for the peculiarities of this job? (All this with 5.1 compiled with OpenMPI 4.1 & Intel 2019.5 compilers.)


#2 RE: Low CPU utilisation by Gregor 11.01.2022 20:02

I have a few questions to better understand the problem.

1. On how many nodes do you run the calculation, and how many CPU cores are available on each node?
2. How many OpenMP threads do you use?
3. There is a file juDFT_times.json created. Could you attach that?

Best, Gregor.

#3 RE: Low CPU utilisation by Terry 12.01.2022 11:46

Here's juDFT_times.json (from a slightly different geometry). Run with 32 MPI processes, each allocated 9 cores (9 OpenMP threads) on 28-core Broadwell nodes. So up to three MPI processes per node, 11 nodes.
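
For reference, the layout arithmetic above can be sketched as follows. This is only a back-of-the-envelope helper: the function name `node_layout` is made up for illustration and is not part of fleur or of any scheduler, and it assumes every MPI process gets the same number of cores and that processes do not span nodes.

```python
import math


def node_layout(n_procs, threads_per_proc, cores_per_node):
    """Return (procs_per_node, n_nodes, core_utilisation) for a hybrid MPI/OpenMP job.

    Hypothetical helper: assumes identical processes that never span two nodes.
    """
    procs_per_node = cores_per_node // threads_per_proc
    if procs_per_node == 0:
        raise ValueError("more threads per process than cores per node")
    n_nodes = math.ceil(n_procs / procs_per_node)
    # Fraction of the allocated cores that actually has a thread pinned to it.
    utilisation = n_procs * threads_per_proc / (n_nodes * cores_per_node)
    return procs_per_node, n_nodes, utilisation


ppn, nodes, used = node_layout(32, 9, 28)
print(ppn, nodes, round(used, 3))  # 3 11 0.935
```

The same function can be called with other thread counts to compare how many nodes a given split would occupy.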

#4 RE: Low CPU utilisation by Gregor 12.01.2022 17:52

The diagonalization seems to consume a larger fraction of the runtime than I would expect. This may or may not be acceptable, but it should be understood. So two more questions:

1. Which library do you use for the diagonalization?
2. What is the output of "grep nvd out.xml"?

#5 RE: Low CPU utilisation by Terry 13.01.2022 12:12

In out.xml we have:
<basis nvd="23313" lmaxd="14" nlotot="0"/>

I didn't specify any options for the diagonalisation when building. The configuration output tells me that SCALAPACK was found (via MKL), but nothing else that looks like diagonalisation. Is that the info you want?

#6 RE: Low CPU utilisation by Gregor 13.01.2022 12:44

Here is what I can say:

1. nvd is roughly the number of LAPW basis functions (without LOs) that you have in your system. 23,000 LAPW basis functions really is a lot for 82 atoms (typical sizes are about 100/atom). This seems to be due to a very high choice for Kmax. Was this automatically generated or the result of a convergence study? It may be reasonable because you have rather small bond lengths in your system and thus also rather small MT spheres.

2. It is good that you mix MPI parallelization with OpenMP parallelization. Maybe a little bit less OpenMP would improve the CPU utilization. In that plot you can see that at the moment we would expect about a factor 5 speedup when 12 OpenMP threads are used, but with 6 OpenMP threads you also nearly get this speedup. I typically use something like 4 to 8 OpenMP threads. If you can rebalance the different parallelization schemes in that direction you might benefit. You can also use more than 32 MPI processes. 64 or 128 may be reasonable. You can also see in the plot that the diagonalization actually is the limiting code part when it comes to the OpenMP parallelization. In your system you have many LAPW basis functions for not so many atoms. This leads to a large runtime demand for the diagonalization and thus this becomes the determining bottleneck.

3. Since the CPU utilization is strongly determined by the diagonalization, which is processed in a separate library, there are only a few measures you can take to improve this situation. Rebalancing the different parallelization schemes is one approach; another would be the use of a more efficient diagonalization library. It may be a good idea to try out the ELPA library. This may give you some boost.
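
To put the numbers from point 1 in perspective, here is a rough estimate. It is a sketch under stated assumptions, not something taken from the fleur output: it assumes a dense matrix of complex double-precision numbers (16 bytes per element) and that the diagonalization cost grows with the cube of the matrix dimension, as is usual for dense eigensolvers. The function name `basis_estimate` is hypothetical.

```python
def basis_estimate(nvd, n_atoms, typical_per_atom=100, bytes_per_element=16):
    """Rough size and cost figures for a dense nvd x nvd eigenproblem.

    Assumptions (not from fleur): complex double storage, cubic cost scaling.
    """
    per_atom = nvd / n_atoms
    # Memory for one dense matrix (Hamiltonian or overlap), in GiB.
    matrix_gib = nvd ** 2 * bytes_per_element / 2 ** 30
    # Cost relative to a basis of the "typical" size for the same atom count.
    typical_n = typical_per_atom * n_atoms
    cost_ratio = (nvd / typical_n) ** 3
    return per_atom, matrix_gib, cost_ratio


per_atom, matrix_gib, cost_ratio = basis_estimate(nvd=23313, n_atoms=82)
print(f"{per_atom:.0f} basis functions per atom")
print(f"{matrix_gib:.1f} GiB per dense matrix")
print(f"{cost_ratio:.0f}x the cost of a 100/atom basis")
```

With nvd=23313 and 82 atoms this gives roughly 284 basis functions per atom, on the order of 8 GiB per matrix, and a diagonalization cost somewhat over 20 times that of a 100/atom basis under the cubic assumption.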

#7 RE: Low CPU utilisation by Gregor 13.01.2022 13:02

Another point: At the moment I am doing convergence studies on numerous oxide materials, also to prepare a more efficient default parameter setup... With respect to convergence it may be a good idea to increase the oxygen MT radius. In my studies I find that this radius should be at least 1.3 Bohr radii. Zinc can probably be slightly smaller (you have to make compromises on the other atoms). I don't know what demands hydrogen has.

#8 RE: Low CPU utilisation by Terry 14.01.2022 06:37

Hi Gregor

Thanks for the advice. I will explore those options.

