Slow LCAO/GPU on large systems: CPU pdgemm as the bottleneck

## Description
I'm running a 1536-atom system, and the SCF is slow on GPU. I tested CPU with `ks_solver=genelpa`, and GPU with `ks_solver=cusolvermp` or `ks_solver=elpa`, each on a single node.

The GPU calculation is slower than expected. Profiling shows that `cal_dm_psi` in `source_estate/module_dm/cal_dm_psi.cpp` calls `ScalapackConnector::gemm`, which runs on the CPU via `pdgemm`. This step has not been ported to GPU and becomes the bottleneck.
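
For context, my understanding is that the density-matrix build reduces to a product of the occupation-weighted wavefunction coefficients with the coefficients themselves, which is why it ends up as one large GEMM. The serial sketch below only illustrates that operation. The function name `density_matrix`, the row-major layout, and the real-valued coefficients are assumptions made for illustration; this is not the ABACUS implementation, which works on block-cyclic distributed matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial illustration (not ABACUS code).
// c : nbands x nbasis coefficients, row-major, c[n * nbasis + mu]
// f : nbands occupation weights
// Returns the nbasis x nbasis matrix dm[mu][nu] = sum_n f[n] * c[n][mu] * c[n][nu].
std::vector<double> density_matrix(const std::vector<double>& c,
                                   const std::vector<double>& f,
                                   std::size_t nbands,
                                   std::size_t nbasis)
{
    assert(c.size() == nbands * nbasis);
    assert(f.size() == nbands);

    // Step 1: scale every band by its occupation weight, wc = diag(f) * c.
    std::vector<double> wc(c.size());
    for (std::size_t n = 0; n < nbands; ++n) {
        for (std::size_t mu = 0; mu < nbasis; ++mu) {
            wc[n * nbasis + mu] = f[n] * c[n * nbasis + mu];
        }
    }

    // Step 2: dm = wc^T * c. This is the GEMM-shaped part of the work.
    std::vector<double> dm(nbasis * nbasis, 0.0);
    for (std::size_t n = 0; n < nbands; ++n) {
        for (std::size_t mu = 0; mu < nbasis; ++mu) {
            const double w = wc[n * nbasis + mu];
            for (std::size_t nu = 0; nu < nbasis; ++nu) {
                dm[mu * nbasis + nu] += w * c[n * nbasis + nu];
            }
        }
    }
    return dm;
}
```

Step 2 costs on the order of `nbasis * nbasis * nbands` multiply-adds, so it grows quickly with system size and is a natural candidate for a GPU GEMM.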

CPU: Intel(R) Xeon(R) Platinum 8462Y+ (x2)
ks_solver=genelpa
244 steps in total; average time per call and share of total time:
 DiagoElpa     elpa_solve   85.10 s   80.77%
 elecstate     cal_dm_psi    8.88 s    8.46%
 psiMulPsiMpi  pdgemm        8.79 s    8.38%

```
 TIME STATISTICS
----------------------------------------------------------------
  CLASS_NAME       NAME        TIME/s   CALLS    AVG/s   PER/%  
----------------------------------------------------------------
              total           25709.13 13       1977.63  100.00 
 Driver       atomic_world    25709.13 1        25709.13 100.00 
 Relax_Driver relax_driver    25686.79 1        25686.79 99.91  
 ESolver_KS   runner          25585.71 1        25585.71 99.52  
 HSolverLCAO  solve           25347.74 244      103.88   98.59  
 HamiltLCAO   updateHk        1211.63  244      4.97     4.71   
 OperatorLCAO init            1211.63  1220     0.99     4.71   
 Veff         contributeHR    1189.46  244      4.87     4.63   
 Gint         cal_gint_vl     1189.46  244      4.87     4.63   
 HSolverLCAO  hamiltSolvePsiK 20779.55 244      85.16    80.83  
 DiagoElpa    elpa_solve      20764.22 244      85.10    80.77  
 elecstate    cal_dm_psi      2174.60  245      8.88     8.46   
 psiMulPsiMpi pdgemm          2153.94  245      8.79     8.38   
 LCAO_domain  dm2rho          1183.56  244      4.85     4.60   
 Gint         cal_gint_rho    1151.72  244      4.72     4.48   
----------------------------------------------------------------
```

GPU: 1 node with Tesla V100-SXM2-32GB (x4), CPU Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz (x2)
ks_solver=cusolvermp
5 steps in total; average time per call and share of total time:
 DiagoCusolverMP  Diag_CusolverMP_gvd  73.98 s  34.63%
 elecstate        cal_dm_psi           80.25 s  45.08%
 psiMulPsiMpi     pdgemm               78.91 s  44.33%
```
TIME STATISTICS
-----------------------------------------------------------------------
    CLASS_NAME            NAME         TIME/s   CALLS    AVG/s  PER/%  
-----------------------------------------------------------------------
                   total               1068.13 13       82.16   100.00 
 Driver            atomic_world        1068.13 1        1068.13 100.00 
 ESolver_KS_LCAO   before_all_runners  53.67   1        53.67   5.02   
 Structure_Factor  setup               32.63   1        32.63   3.06   
 Charge            atomic_rho          19.13   2        9.57    1.79   
 Relax_Driver      relax_driver        1013.84 1        1013.84 94.92  
 ESolver_KS        runner              864.02  1        864.02  80.89  
 Potential         cal_veff            21.10   6        3.52    1.98   
 PotXC             cal_veff            17.44   6        2.91    1.63   
 XC_Functional     v_xc                17.33   6        2.89    1.62   
 HSolverLCAO       solve               822.14  5        164.43  76.97  
 HamiltLCAO        updateHk            32.77   5        6.55    3.07   
 OperatorLCAO      init                14.67   15       0.98    1.37   
 Nonlocal          contributeHR        14.54   5        2.91    1.36   
 Nonlocal          calculate_HR        14.37   1        14.37   1.34   
 TD_Efficiency     Gint                24.12   11       2.19    2.26   
 HSolverLCAO       hamiltSolvePsiK     382.06  5        76.41   35.77  
 DiagoCusolverMP   Diag_CusolverMP_gvd 369.90  5        73.98   34.63  
 elecstate         cal_dm_psi          481.52  6        80.25   45.08  
 psiMulPsiMpi      pdgemm              473.46  6        78.91   44.33  
 ESolver_KS_LCAO   after_scf           15.03   1        15.03   1.41   
 ESolver_KS_LCAO   cal_force           149.82  1        149.82  14.03  
 Force_Stress_LCAO getForceStress      149.74  1        149.74  14.02  
 Stress            stress_loc          12.90   1        12.90   1.21   
 Stress            stress_ewa          26.47   1        26.47   2.48   
-----------------------------------------------------------------------
```
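On GPU, `psiMulPsiMpi pdgemm` takes 44.33% of the total time, more than the diagonalization itself (34.63%). On CPU it takes only 8.38% of the total, about 10% of the diagonalization time (8.79 s vs. 85.10 s per call).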


## INPUT
https://github.com/MCresearch/TEAS/tree/main/2025-APL-MXenes/1_MH
