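This all started with a transition state optimization in ORCA. The system has only fifty-odd atoms, and the basis set is merely def2-SVP. Then I looked at how long the Hessian calculation took: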

----------------------------------------------
Solving the CP-SCF equations (RIJCOSX)           ...
----------------------------------------------
IBatch 1 (of 2)
     CP-SCF ITERATION   0:   5.7177e-01 (1494.1 sec   0/117 done)
     CP-SCF ITERATION   1:   6.5125e-02 (1485.8 sec   0/117 done)
     CP-SCF ITERATION   2:   3.5169e-02 (1991.6 sec   0/117 done)
     CP-SCF ITERATION   3:   1.1599e-02 (2049.5 sec   0/117 done)
     CP-SCF ITERATION   4:   5.3287e-03 (1982.7 sec   1/117 done)
     CP-SCF ITERATION   5:   1.6394e-03 (1696.5 sec  15/117 done)
     CP-SCF ITERATION   6:   5.6406e-04 (1484.0 sec  50/117 done)
     CP-SCF ITERATION   7:   1.7323e-04 (1140.9 sec  98/117 done)
     CP-SCF ITERATION   8:   6.6552e-05 ( 454.4 sec 117/117 done)
                    *** THE CP-SCF HAS CONVERGED ***
IBatch 2 (of 2)
     CP-SCF ITERATION   0:   3.0999e-01 ( 579.4 sec   0/ 27 done)
     CP-SCF ITERATION   1:   5.0835e-02 ( 509.2 sec   0/ 27 done)
     CP-SCF ITERATION   2:   2.2800e-02 ( 638.6 sec   0/ 27 done)
     CP-SCF ITERATION   3:   8.6497e-03 ( 478.2 sec   0/ 27 done)
     CP-SCF ITERATION   4:   3.2066e-03 ( 587.7 sec   0/ 27 done)
     CP-SCF ITERATION   5:   8.8847e-04 ( 477.8 sec  10/ 27 done)
     CP-SCF ITERATION   6:   3.5159e-04 ( 402.2 sec  24/ 27 done)
     CP-SCF ITERATION   7:   9.5689e-05 (  47.9 sec  27/ 27 done)
                    *** THE CP-SCF HAS CONVERGED ***

                                                 ... done   (  17523.4 sec)

(omitted ……)

Total SCF Hessian time: 0 days 8 hours 59 min 36 sec 
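
To see how much of that time goes into the CP-SCF step, the per-iteration timings can be added up. The following is only a rough sketch: the function name sum_cpscf_seconds is made up for illustration, and it assumes each timing is printed as "(<seconds> sec" exactly as in the excerpt above.

```shell
#!/bin/sh
# Sum the wall times of all "CP-SCF ITERATION" lines read from stdin.
# Assumption: the timing is printed as "(<seconds> sec", possibly with
# padding spaces after the "(", as in the ORCA excerpt above.
sum_cpscf_seconds() {
    awk '
        /CP-SCF ITERATION/ && match($0, /\( *[0-9.]+ sec/) {
            # Drop the leading "(" and the trailing " sec"; +0 forces a number.
            total += substr($0, RSTART + 1, RLENGTH - 5) + 0
            n++
        }
        END { printf "%d iterations, %.1f sec\n", n, total }
    '
}
```

Called as `sum_cpscf_seconds < job.out` (job.out being a hypothetical ORCA output file), it should report 17 iterations and 17500.5 sec for the two batches shown above, which accounts for nearly all of the 17523.4 sec printed for the CP-SCF step.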

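I was not feeling great about this. Luckily the initial guess structure was good enough that the job converged after three rounds of frequency calculations. Checking CPU usage with top: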

top - 06:27:33 up 5 days, 18:17,  1 user,  load average: 7.61, 8.08, 8.46
Tasks:  46 total,   7 running,  39 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.1 us,  0.0 sy,  0.0 ni, 84.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  56320.0 total,  46277.6 free,   5924.6 used,   4117.8 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  50395.4 avail Mem
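
If this check needs to be repeated while a job runs, the iowait value can be pulled out of the summary line with a small helper. This is a minimal sketch: cpu_iowait is a hypothetical name, and it assumes the procps-ng layout shown above, where each number comes before its label and is separated from it by a space.

```shell
#!/bin/sh
# Print the iowait percentage from a top "%Cpu(s):" summary line on stdin.
# Assumption: values and labels are whitespace-separated, e.g. "0.0 wa,".
cpu_iowait() {
    awk '/^%Cpu\(s\):/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^wa,?$/) { print $(i - 1); exit }
    }'
}
```

One way to use it would be `top -bn1 | cpu_iowait`, which runs top once in batch mode. For the summary line above it prints 0.0.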

Everything looks normal. The wa value is essentially zero, so there is no disk I/O bottleneck. Next, checking the CPU frequency with lscpu:

…………
CPU(s):                          40
On-line CPU(s) list:             0-39
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       2
NUMA node(s):                    2

…………

Model name:                      Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping:                        4
CPU MHz:                         2792.916
CPU max MHz:                     2800.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5585.83
Virtualization:                  VT-x
L1d cache:                       640 KiB
L1i cache:                       640 KiB
L2 cache:                        5 MiB
L3 cache:                        50 MiB
NUMA node0 CPU(s):               0-9,20-29
NUMA node1 CPU(s):               10-19,30-39

…………

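The CPU frequency looks fine as well: the CPU is held at its base clock as expected. Since this is an HP DL380p G8 rack server, I checked iLO and found a memory error: one DIMM had been marked degraded and taken offline. Yet on the Proxmox VE virtualization platform installed on the physical host, dmidecode reports all DIMMs as present. The Linux kernel version of Proxmox VE in this case is 5.4.140-1-pve.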

root@dl380p:~# dmidecode -t memory | grep Size
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
        Size: No Module Installed
        Size: No Module Installed
        Size: 8192 MB
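
A listing like this is easier to compare against iLO once it is condensed into counts. The sketch below is an illustration only: summarize_dimms is a hypothetical helper, and it assumes sizes are reported in MB as above (other dmidecode versions may print other units, which this does not handle). Keep in mind that dmidecode only decodes the SMBIOS table supplied by the firmware, so it says nothing about the runtime health of a module.

```shell
#!/bin/sh
# Condense the "Size:" lines of `dmidecode -t memory` (read from stdin)
# into populated/empty slot counts and the total capacity.
# Assumption: populated slots look like "Size: 8192 MB".
summarize_dimms() {
    awk '
        $1 == "Size:" && $2 ~ /^[0-9]+$/ && $3 == "MB" { populated++; total_mb += $2 }
        $1 == "Size:" && $2 == "No"                    { empty++ }
        END { printf "%d populated, %d empty, %d MB total\n", populated, empty, total_mb }
    '
}
```

Run as `dmidecode -t memory | summarize_dimms`, the listing above would come out as 8 populated, 16 empty, 65536 MB total.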

Checking the memory read/write speed of this physical host at this point shows a substantial drop:

root@dl380p:~# dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 5.22757 s, 2.1 GB/s

The overall timings of the transition state job are as follows:

Timings for individual modules:

Sum of individual times         ...   173331.616 sec (=2888.860 min)
GTO integral calculation        ...      227.706 sec (=   3.795 min)   0.1 %
SCF iterations                  ...    18792.324 sec (= 313.205 min)  10.8 %
SCF Gradient evaluation         ...     6624.362 sec (= 110.406 min)   3.8 %
Geometry relaxation             ...       12.350 sec (=   0.206 min)   0.0 %
Analytical frequency calculation...   147674.875 sec (=2461.248 min)  85.2 %
                             ****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 2 days 0 hours 9 minutes 44 seconds 730 msec

Note that this machine is fitted with DDR3-1333. For comparison, here are the test results from an identically configured HP DL360p G8:

root@dl360p:~# dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 0.853596 s, 12.6 GB/s

The overall timings of a similar transition state job are as follows:

Timings for individual modules:

Sum of individual times         ...    25829.721 sec (= 430.495 min)
GTO integral calculation        ...       36.398 sec (=   0.607 min)   0.1 %
SCF iterations                  ...     4242.422 sec (=  70.707 min)  16.4 %
SCF Gradient evaluation         ...     1492.312 sec (=  24.872 min)   5.8 %
Geometry relaxation             ...        8.789 sec (=   0.146 min)   0.0 %
Analytical frequency calculation...    20049.801 sec (= 334.163 min)  77.6 %
                             ****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 0 days 7 hours 10 minutes 56 seconds 41 msec

In addition, here are the test results from an E5-2630L v4 on an X99 platform with dual-channel DDR4-2133:

hzy@orca5:~$ dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.24068 s, 8.7 GB/s

The overall timings of a similar transition state job are as follows:

Timings for individual modules:

Sum of individual times         ...    37455.445 sec (= 624.257 min)
GTO integral calculation        ...       29.563 sec (=   0.493 min)   0.1 %
SCF iterations                  ...     6818.118 sec (= 113.635 min)  18.2 %
SCF Gradient evaluation         ...     2312.956 sec (=  38.549 min)   6.2 %
Geometry relaxation             ...        7.646 sec (=   0.127 min)   0.0 %
Analytical frequency calculation...    28287.161 sec (= 471.453 min)  75.5 %
                             ****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 0 days 10 hours 24 minutes 40 seconds 729 msec

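Note that on the machine with the degraded DIMM, the memory-bound Hessian (analytical frequency) calculation takes a markedly larger share of the run time: 85.2%, versus 77.6% and 75.5% on the other two machines.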
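
To put the gap between the two otherwise identical G8 machines into numbers, the figures quoted above can simply be divided. This is a small sketch with a hypothetical helper named ratio; the inputs are copied from the logs in this post. Since the two jobs were similar rather than identical, the ratios are only indicative.

```shell
#!/bin/sh
# ratio A B: print A divided by B, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Degraded DL380p versus healthy DL360p, numbers taken from the logs above.
echo "dd throughput:  $(ratio 12.6 2.1)x lower"
echo "total run time: $(ratio 173331.616 25829.721)x longer"
echo "frequency step: $(ratio 147674.875 20049.801)x longer"
```

This gives roughly 6.00x for dd throughput, 6.71x for total run time and 7.37x for the analytical frequency step, so the frequency calculation slows down somewhat more than the job as a whole.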