Description
I'm calculating a system with 1536 atoms, and the SCF is slow on GPU. I ran the same calculation on a CPU node with ks_solver=genelpa and on a GPU node with ks_solver=cusolvermp or ks_solver=elpa.
The GPU calculation is slower than expected. Profiling shows that cal_dm_psi in source_estate/module_dm/cal_dm_psi.cpp calls ScalapackConnector::gemm, and this step runs on the CPU through pdgemm. It has not been ported to GPU, so it becomes the bottleneck.
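For readers unfamiliar with this step, here is a minimal serial sketch of the kind of product I understand cal_dm_psi to perform: an occupation-weighted multiplication of the wavefunction coefficients with themselves. It assumes real coefficients and a single process; the function name density_matrix and the toy numbers are hypothetical and not taken from the ABACUS source, where the matrices are distributed and multiplied with pdgemm.

```python
def density_matrix(psi, occ):
    """Return DM[mu][nu] = sum_i occ[i] * psi[i][mu] * psi[i][nu].

    psi[i][mu] is the coefficient of basis function mu in band i,
    occ[i] is the occupation weight of band i (hypothetical layout).
    """
    nbands = len(psi)
    nbasis = len(psi[0])
    # Scale every band by its occupation once, so the rest is a plain
    # matrix product (this product is what a GEMM call would compute).
    weighted = [[occ[i] * psi[i][mu] for mu in range(nbasis)]
                for i in range(nbands)]
    return [[sum(weighted[i][mu] * psi[i][nu] for i in range(nbands))
             for nu in range(nbasis)]
            for mu in range(nbasis)]


# Toy example: two orthonormal bands in a two-function basis,
# with only the first band occupied by two electrons.
psi = [[0.6, 0.8], [0.8, -0.6]]
occ = [2.0, 0.0]
dm = density_matrix(psi, occ)
trace = dm[0][0] + dm[1][1]
print(dm, trace)
```

With an orthonormal basis, as in this toy case, the trace of the result equals the number of electrons. The cost of the product grows as nbasis² × nbands, which is why it matters for a 1536-atom system.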
CPU: Intel(R) Xeon(R) Platinum 8462Y+ (x2)
ks_solver=genelpa
244 steps in total
DiagoElpa elpa_solve 85.10s 80.77%
elecstate cal_dm_psi 8.88s 8.46%
psiMulPsiMpi pdgemm 8.79s 8.38%
TIME STATISTICS
---------------------------------------------------------------------------
CLASS_NAME        NAME                      TIME/s  CALLS     AVG/s   PER/%
---------------------------------------------------------------------------
                  total                   25709.13     13   1977.63  100.00
Driver            atomic_world            25709.13      1  25709.13  100.00
Relax_Driver      relax_driver            25686.79      1  25686.79   99.91
ESolver_KS        runner                  25585.71      1  25585.71   99.52
HSolverLCAO       solve                   25347.74    244    103.88   98.59
HamiltLCAO        updateHk                 1211.63    244      4.97    4.71
OperatorLCAO      init                     1211.63   1220      0.99    4.71
Veff              contributeHR             1189.46    244      4.87    4.63
Gint              cal_gint_vl              1189.46    244      4.87    4.63
HSolverLCAO       hamiltSolvePsiK         20779.55    244     85.16   80.83
DiagoElpa         elpa_solve              20764.22    244     85.10   80.77
elecstate         cal_dm_psi               2174.60    245      8.88    8.46
psiMulPsiMpi      pdgemm                   2153.94    245      8.79    8.38
LCAO_domain       dm2rho                   1183.56    244      4.85    4.60
Gint              cal_gint_rho             1151.72    244      4.72    4.48
---------------------------------------------------------------------------
GPU: 1 node Tesla V100-SXM2-32GB (x4) with CPU Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz (x2)
ks_solver=cusolvermp
5 steps
DiagoCusolverMP Diag_CusolverMP_gvd 73.98s 34.63%
elecstate cal_dm_psi 80.25s 45.08%
psiMulPsiMpi pdgemm 78.91s 44.33%
TIME STATISTICS
---------------------------------------------------------------------------
CLASS_NAME        NAME                      TIME/s  CALLS     AVG/s   PER/%
---------------------------------------------------------------------------
                  total                    1068.13     13     82.16  100.00
Driver            atomic_world             1068.13      1   1068.13  100.00
ESolver_KS_LCAO   before_all_runners         53.67      1     53.67    5.02
Structure_Factor  setup                      32.63      1     32.63    3.06
Charge            atomic_rho                 19.13      2      9.57    1.79
Relax_Driver      relax_driver             1013.84      1   1013.84   94.92
ESolver_KS        runner                    864.02      1    864.02   80.89
Potential         cal_veff                   21.10      6      3.52    1.98
PotXC             cal_veff                   17.44      6      2.91    1.63
XC_Functional     v_xc                       17.33      6      2.89    1.62
HSolverLCAO       solve                     822.14      5    164.43   76.97
HamiltLCAO        updateHk                   32.77      5      6.55    3.07
OperatorLCAO      init                       14.67     15      0.98    1.37
Nonlocal          contributeHR               14.54      5      2.91    1.36
Nonlocal          calculate_HR               14.37      1     14.37    1.34
TD_Efficiency     Gint                       24.12     11      2.19    2.26
HSolverLCAO       hamiltSolvePsiK           382.06      5     76.41   35.77
DiagoCusolverMP   Diag_CusolverMP_gvd       369.90      5     73.98   34.63
elecstate         cal_dm_psi                481.52      6     80.25   45.08
psiMulPsiMpi      pdgemm                    473.46      6     78.91   44.33
ESolver_KS_LCAO   after_scf                  15.03      1     15.03    1.41
ESolver_KS_LCAO   cal_force                 149.82      1    149.82   14.03
Force_Stress_LCAO getForceStress            149.74      1    149.74   14.02
Stress            stress_loc                 12.90      1     12.90    1.21
Stress            stress_ewa                 26.47      1     26.47    2.48
---------------------------------------------------------------------------
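On the GPU node, psiMulPsiMpi pdgemm takes 44.33% of the total time, which is more than the diagonalization (34.63%). On the CPU node it takes only about 10% of the diagonalization time (8.79 s vs. 85.10 s per call).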
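The comparison can be checked directly from the per-call averages (AVG/s) in the two tables. The short script below only repeats that arithmetic; the variable names are mine and the numbers are copied from the profiles above.

```python
# Average seconds per call, copied from the AVG/s column of each profile.
cpu = {"diag": 85.10, "pdgemm": 8.79}   # ks_solver=genelpa, elpa_solve
gpu = {"diag": 73.98, "pdgemm": 78.91}  # ks_solver=cusolvermp, Diag_CusolverMP_gvd


def pdgemm_to_diag(times):
    """Ratio of pdgemm time to diagonalization time for one run."""
    return times["pdgemm"] / times["diag"]


cpu_ratio = pdgemm_to_diag(cpu)              # pdgemm relative to diag, CPU node
gpu_ratio = pdgemm_to_diag(gpu)              # pdgemm relative to diag, GPU node
slowdown = gpu["pdgemm"] / cpu["pdgemm"]     # pdgemm on GPU node vs. CPU node

print(f"CPU node: pdgemm/diag = {cpu_ratio:.3f}")
print(f"GPU node: pdgemm/diag = {gpu_ratio:.3f}")
print(f"pdgemm per call, GPU node vs. CPU node: {slowdown:.2f}x")
```

Note that the two nodes have different host CPUs, so part of the pdgemm slowdown on the GPU node may come from the weaker host rather than from the solver choice itself.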
INPUT
https://github.com/MCresearch/TEAS/tree/main/2025-APL-MXenes/1_MH