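This all started with a transition-state optimization in ORCA. The system has only fifty-odd atoms, and the basis set was nothing heavier than def2-SVP. Then I looked at how long the Hessian calculation took: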
----------------------------------------------
Solving the CP-SCF equations (RIJCOSX) ...
----------------------------------------------
IBatch 1 (of 2)
CP-SCF ITERATION 0: 5.7177e-01 (1494.1 sec 0/117 done)
CP-SCF ITERATION 1: 6.5125e-02 (1485.8 sec 0/117 done)
CP-SCF ITERATION 2: 3.5169e-02 (1991.6 sec 0/117 done)
CP-SCF ITERATION 3: 1.1599e-02 (2049.5 sec 0/117 done)
CP-SCF ITERATION 4: 5.3287e-03 (1982.7 sec 1/117 done)
CP-SCF ITERATION 5: 1.6394e-03 (1696.5 sec 15/117 done)
CP-SCF ITERATION 6: 5.6406e-04 (1484.0 sec 50/117 done)
CP-SCF ITERATION 7: 1.7323e-04 (1140.9 sec 98/117 done)
CP-SCF ITERATION 8: 6.6552e-05 ( 454.4 sec 117/117 done)
*** THE CP-SCF HAS CONVERGED ***
IBatch 2 (of 2)
CP-SCF ITERATION 0: 3.0999e-01 ( 579.4 sec 0/ 27 done)
CP-SCF ITERATION 1: 5.0835e-02 ( 509.2 sec 0/ 27 done)
CP-SCF ITERATION 2: 2.2800e-02 ( 638.6 sec 0/ 27 done)
CP-SCF ITERATION 3: 8.6497e-03 ( 478.2 sec 0/ 27 done)
CP-SCF ITERATION 4: 3.2066e-03 ( 587.7 sec 0/ 27 done)
CP-SCF ITERATION 5: 8.8847e-04 ( 477.8 sec 10/ 27 done)
CP-SCF ITERATION 6: 3.5159e-04 ( 402.2 sec 24/ 27 done)
CP-SCF ITERATION 7: 9.5689e-05 ( 47.9 sec 27/ 27 done)
*** THE CP-SCF HAS CONVERGED ***
... done ( 17523.4 sec)
(output omitted) ……
Total SCF Hessian time: 0 days 8 hours 59 min 36 sec
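I was floored. Luckily the initial guess structure was good enough that the optimization converged after three rounds of frequency calculations.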
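To see where the time goes, the per-iteration timings in the CP-SCF log above can be added up. The following is a minimal sketch rather than an official ORCA parser: the regular expression and the function name `cpscf_seconds` are my own assumptions, written only against the line format shown in this log.

```python
import re

# CP-SCF iteration lines copied from the log above (batch 1, then batch 2).
LOG = """\
CP-SCF ITERATION 0: 5.7177e-01 (1494.1 sec 0/117 done)
CP-SCF ITERATION 1: 6.5125e-02 (1485.8 sec 0/117 done)
CP-SCF ITERATION 2: 3.5169e-02 (1991.6 sec 0/117 done)
CP-SCF ITERATION 3: 1.1599e-02 (2049.5 sec 0/117 done)
CP-SCF ITERATION 4: 5.3287e-03 (1982.7 sec 1/117 done)
CP-SCF ITERATION 5: 1.6394e-03 (1696.5 sec 15/117 done)
CP-SCF ITERATION 6: 5.6406e-04 (1484.0 sec 50/117 done)
CP-SCF ITERATION 7: 1.7323e-04 (1140.9 sec 98/117 done)
CP-SCF ITERATION 8: 6.6552e-05 ( 454.4 sec 117/117 done)
CP-SCF ITERATION 0: 3.0999e-01 ( 579.4 sec 0/ 27 done)
CP-SCF ITERATION 1: 5.0835e-02 ( 509.2 sec 0/ 27 done)
CP-SCF ITERATION 2: 2.2800e-02 ( 638.6 sec 0/ 27 done)
CP-SCF ITERATION 3: 8.6497e-03 ( 478.2 sec 0/ 27 done)
CP-SCF ITERATION 4: 3.2066e-03 ( 587.7 sec 0/ 27 done)
CP-SCF ITERATION 5: 8.8847e-04 ( 477.8 sec 10/ 27 done)
CP-SCF ITERATION 6: 3.5159e-04 ( 402.2 sec 24/ 27 done)
CP-SCF ITERATION 7: 9.5689e-05 ( 47.9 sec 27/ 27 done)
"""

# Captures the wall time in seconds that follows the opening parenthesis.
# The pattern is an assumption based only on the lines shown above.
ITER_RE = re.compile(r"CP-SCF ITERATION\s+\d+:\s+\S+\s+\(\s*([\d.]+)\s+sec")


def cpscf_seconds(log_text):
    """Return the per-iteration wall times (seconds) found in the log text."""
    return [float(m.group(1)) for m in ITER_RE.finditer(log_text)]


times = cpscf_seconds(LOG)
total = sum(times)
print(f"{len(times)} iterations, {total:.1f} s total, {total / 3600:.2f} h")
```

The 17 iterations add up to roughly 17500 s, which is within about 23 s of the 17523.4 s that ORCA reports for the whole CP-SCF step, so nearly five of the nine hours are spent inside these iterations alone.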
Checking CPU usage with the top command:
top - 06:27:33 up 5 days, 18:17, 1 user, load average: 7.61, 8.08, 8.46
Tasks: 46 total, 7 running, 39 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.1 us, 0.0 sy, 0.0 ni, 84.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 56320.0 total, 46277.6 free, 5924.6 used, 4117.8 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 50395.4 avail Mem
Everything looks normal here. The wa% field is essentially zero, so there is no disk I/O bottleneck.
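If you want to run this check repeatedly, the `%Cpu(s)` line can be parsed instead of read by eye. This is a small sketch under my own assumptions: the helper name `parse_cpu_line` and the 5% iowait threshold are hypothetical, and the regular expression is only matched against the line format shown above.

```python
import re

# The %Cpu(s) line copied from the top output above.
CPU_LINE = "%Cpu(s): 15.1 us, 0.0 sy, 0.0 ni, 84.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st"

# Hypothetical threshold: above this iowait percentage, disk I/O deserves a look.
IOWAIT_THRESHOLD = 5.0


def parse_cpu_line(line):
    """Map each two-letter field of top's %Cpu(s) line to its percentage."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


fields = parse_cpu_line(CPU_LINE)
verdict = "check disk I/O" if fields["wa"] > IOWAIT_THRESHOLD else "no I/O wait"
print(f"us={fields['us']}  id={fields['id']}  wa={fields['wa']}  -> {verdict}")
```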
Checking the CPU frequency with the lscpu command:
…………
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
…………
Model name: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping: 4
CPU MHz: 2792.916
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 5585.83
Virtualization: VT-x
L1d cache: 640 KiB
L1i cache: 640 KiB
L2 cache: 5 MiB
L3 cache: 50 MiB
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
…………
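The CPU frequency is fine as well: the CPU is sitting at its base clock, as expected.
The machine is an HP DL380p G8 rack server, and iLO turned out to be reporting a memory error: one DIMM had been marked degraded and taken offline.
Yet on the Proxmox VE virtualization platform installed on the physical host, checking the memory with the dmidecode command shows every module as present.
In this case the Linux kernel version of Proxmox VE is 5.4.140-1-pve.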
root@dl380p:~# dmidecode -t memory | grep Size
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
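The slot list above is easy to miscount by eye, so here is a minimal sketch that tallies it. dmidecode reads the SMBIOS tables, which describe the installed modules, so this count by itself says nothing about the degraded state iLO reported. The function name `summarize_slots` is my own, and the handling of a `GB` unit is an assumption for dmidecode builds that print sizes that way.

```python
# "Size:" lines copied from the dmidecode output above, in slot order.
DMIDECODE_SIZES = """\
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
Size: No Module Installed
Size: No Module Installed
Size: 8192 MB
"""

# Unit multipliers to MB; the GB entry is an assumption for other dmidecode builds.
UNIT_TO_MB = {"MB": 1, "GB": 1024}


def summarize_slots(text):
    """Return (list of populated module sizes in MB, number of empty slots)."""
    populated, empty = [], 0
    for line in text.splitlines():
        value = line.split(":", 1)[1].strip()
        if value == "No Module Installed":
            empty += 1
        else:
            size, unit = value.split()
            populated.append(int(size) * UNIT_TO_MB[unit])
    return populated, empty


populated, empty = summarize_slots(DMIDECODE_SIZES)
print(f"{len(populated)} modules, {sum(populated)} MB installed, {empty} empty slots")
```

For this host the tally is 8 modules of 8192 MB in 24 slots, which is the full set of installed DIMMs, including the one that iLO had already taken offline.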
A quick dd test of memory read/write throughput on this physical host shows that it has dropped considerably:
root@dl380p:~# dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 5.22757 s, 2.1 GB/s
The overall timings of the transition-state job are as follows:
Timings for individual modules:
Sum of individual times ... 173331.616 sec (=2888.860 min)
GTO integral calculation ... 227.706 sec (= 3.795 min) 0.1 %
SCF iterations ... 18792.324 sec (= 313.205 min) 10.8 %
SCF Gradient evaluation ... 6624.362 sec (= 110.406 min) 3.8 %
Geometry relaxation ... 12.350 sec (= 0.206 min) 0.0 %
Analytical frequency calculation... 147674.875 sec (=2461.248 min) 85.2 %
****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 2 days 0 hours 9 minutes 44 seconds 730 msec
Note that this machine is fitted with DDR3-1333. For comparison, an HP DL360p G8 with the same configuration gives the following:
root@dl360p:~# dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 0.853596 s, 12.6 GB/s
The overall timings of a similar transition-state job are as follows:
Timings for individual modules:
Sum of individual times ... 25829.721 sec (= 430.495 min)
GTO integral calculation ... 36.398 sec (= 0.607 min) 0.1 %
SCF iterations ... 4242.422 sec (= 70.707 min) 16.4 %
SCF Gradient evaluation ... 1492.312 sec (= 24.872 min) 5.8 %
Geometry relaxation ... 8.789 sec (= 0.146 min) 0.0 %
Analytical frequency calculation... 20049.801 sec (= 334.163 min) 77.6 %
****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 0 days 7 hours 10 minutes 56 seconds 41 msec
In addition, an E5-2630L v4 on an X99 platform with dual-channel DDR4-2133 gives the following:
hzy@orca5:~$ dd if=/dev/zero of=/dev/null bs=10M count=1024
1024+0 records in
1024+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 1.24068 s, 8.7 GB/s
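To put the three dd runs side by side, the throughput can be recomputed from the byte count and elapsed time that dd prints. This is only a sketch: the host labels are my own shorthand, and a single-threaded copy from /dev/zero to /dev/null is a rough indicator rather than a proper memory bandwidth benchmark.

```python
# Bytes copied in every run above: 1024 blocks of 10 MiB.
BYTES_COPIED = 10737418240

# Elapsed seconds reported by dd on each host (labels are my own shorthand).
DD_SECONDS = {
    "DL380p G8 (degraded DIMM)": 5.22757,
    "DL360p G8": 0.853596,
    "E5-2630L v4 / X99": 1.24068,
}


def throughput_gb_s(seconds, nbytes=BYTES_COPIED):
    """Throughput in decimal GB/s (1 GB = 1e9 bytes), the unit dd prints."""
    return nbytes / seconds / 1e9


rates = {host: throughput_gb_s(sec) for host, sec in DD_SECONDS.items()}
baseline = rates["DL360p G8"]
for host, rate in rates.items():
    print(f"{host:28s} {rate:5.1f} GB/s  ({rate / baseline:.2f}x of the DL360p G8)")
```

By this measure the degraded DL380p reaches only about one sixth of the throughput of the otherwise identical DL360p.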
The overall timings of a similar transition-state job are as follows:
Timings for individual modules:
Sum of individual times ... 37455.445 sec (= 624.257 min)
GTO integral calculation ... 29.563 sec (= 0.493 min) 0.1 %
SCF iterations ... 6818.118 sec (= 113.635 min) 18.2 %
SCF Gradient evaluation ... 2312.956 sec (= 38.549 min) 6.2 %
Geometry relaxation ... 7.646 sec (= 0.127 min) 0.0 %
Analytical frequency calculation... 28287.161 sec (= 471.453 min) 75.5 %
****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 0 days 10 hours 24 minutes 40 seconds 729 msec
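Note that the memory-critical Hessian calculation takes a noticeably larger share of the run time on the machine with the degraded DIMM: 85.2% there, against 77.6% and 75.5% on the other two.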
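The shares quoted from the three ORCA timing summaries can be recomputed from the raw seconds. The sketch below uses only numbers from the summaries above, with host labels of my own. Keep in mind that the three jobs are similar but not identical, so the slowdown factors are indicative at best.

```python
# Seconds taken from the "Timings for individual modules" blocks above.
RUNS = {
    "DL380p G8 (degraded DIMM)": {"total": 173331.616, "freq": 147674.875},
    "DL360p G8": {"total": 25829.721, "freq": 20049.801},
    "E5-2630L v4 / X99": {"total": 37455.445, "freq": 28287.161},
}


def hessian_share(run):
    """Fraction of the summed module time spent in analytical frequencies."""
    return run["freq"] / run["total"]


reference = RUNS["DL360p G8"]
for host, run in RUNS.items():
    share = 100 * hessian_share(run)
    total_ratio = run["total"] / reference["total"]
    freq_ratio = run["freq"] / reference["freq"]
    print(f"{host:28s} Hessian {share:4.1f}%  "
          f"total x{total_ratio:.2f}  Hessian x{freq_ratio:.2f}")
```

The frequency step slows down by a larger factor than the job as a whole, which is what you would expect if the Hessian part depends most heavily on memory throughput.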