High CPU load, low core usage, (ECC) memory errors in the kernel log

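I'm seeing very strange behavior: my computer's CPU load goes through the roof (> 4 on an 8-core machine), yet no process is using much CPU (see the attached screenshots), even though all 8 cores are under load (htop shows them all oscillating between 30% and 70%).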

[Screenshots: CPU load graph; top output]

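This behavior shows up after the computer has been in use for a while (at random, anywhere from a few minutes to a few hours). After that, the computer eventually freezes.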

I'm lost here. I ran into this on 15.04 and upgraded to 15.10, with the same result.

The machine has these parts:
Motherboard: ASUS Z10PE-D8 WS
CPU: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Memory: 2x Kingston 16 GB PC4-2133 CL15, ECC Registered (KVR21R15D4/16)
Hard disks: 2x 2 TB ATA ST2000DM001-1ER1 in RAID 0

The only odd thing I have found is these lines in the kernel log:

Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17386.894665] CMCI storm detected: switching to poll mode
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299974] EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1042 offset:0x100 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299989] EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x85392b offset:0xa80 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299999] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x850da9 offset:0x580 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300009] EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x85f599 offset:0x100 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300018] EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x11b2 offset:0x780 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300022] EDAC MC0: 2 CE Error at MMIOH area, on addr 0x000000087fd43a40 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300032] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8474e2 offset:0xf00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300042] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8476f8 offset:0xd80 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300051] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8466eb offset:0x500 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300060] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x846b23 offset:0x7c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300070] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x846b23 offset:0xcc0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300080] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x846d32 offset:0xe40 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300089] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x5c251b offset:0x640 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300099] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8474e3 offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300108] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x847711 offset:0xf40 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891537] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891561] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc08388000010090
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891566] EDAC sbridge MC0: TSC 0
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891569] EDAC sbridge MC0: ADDR 87fc60500 EDAC sbridge MC0: MISC 14032b286
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17387.891576] EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1455579963 SOCKET 0 APIC 0
Feb 15 18:46:03 XXXX-Z10PE-D8-WS kernel: [17388.299184] EDAC MC0: 8418 CE Error at MMIOH area, on addr 0x000000087fc60500 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
Feb 15 18:51:03 XXXX-Z10PE-D8-WS kernel: [17687.707744] CMCI storm subsided: switching to interrupt mode

These lines repeat many times:

Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236569] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236586] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00064000010090
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236589] EDAC sbridge MC0: TSC 0
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236592] EDAC sbridge MC0: ADDR 103fb00 EDAC sbridge MC0: MISC 4062e286
Feb 15 19:07:47 XXXX-Z10PE-D8-WS kernel: [18691.236597] EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1455581267 SOCKET 0 APIC 0

interspersed with lines like these:

Feb 15 19:07:48 XXXX-Z10PE-D8-WS kernel: [18692.381405] EDAC MC0: 26415 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1042 offset:0xa00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
Feb 15 19:07:48 XXXX-Z10PE-D8-WS kernel: [18692.381481] EDAC MC0: 4 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7c5acf offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)

Help!

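Are you still chasing this problem? It looks like you have a bad memory module, and the machine stalls while it waits for the hardware to correct the errors on its own. You may want to try removing or replacing the memory at the first CPU, second channel, first slot. See: https://serverfault.com/questions/569289/server-freezes-completely-in-unknown-condition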
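For what it's worth, the raw status words in the "Machine Check Event" lines point the same way. Below is a minimal Python sketch, assuming the standard IA32_MCi_STATUS bit layout from the Intel SDM; the function name, the dictionary keys and the request-type table are my own choices for illustration, not part of any tool.

```python
# Assumed layout (Intel SDM, IA32_MCi_STATUS):
#   bit 63 VAL, bit 62 OVER, bit 61 UC,
#   bits 52:38 corrected error count, bits 31:16 model-specific code,
#   bits 15:0 MCA error code.
# Memory controller errors use the MCA code form 0000 0000 1MMM CCCC,
# where MMM is the request type.
REQUEST_TYPES = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into a few useful fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "request_type": None,
    }
    # Only decode MMM when the code has the memory-controller form.
    if (mca_code & 0xFF80) == 0x0080:
        fields["request_type"] = REQUEST_TYPES.get((mca_code >> 4) & 0x7, "reserved")
    return fields


# The two "Bank 7" status words quoted in the question.
for raw in ("cc08388000010090", "cc00064000010090"):
    print(raw, decode_mci_status(int(raw, 16)))
```

For the first status word the UC bit is clear and the corrected-error count field decodes to 8418, which lines up with the "8418 CE Error" entry logged right after it. In other words, the hardware is correcting these errors, just a very large number of them.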

Hope this helps.

Thanks for reminding me to wrap this up!

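Indeed, after going through the log lines I noticed that slot:0 was the problem. Assuming it was bad memory, I pulled that module (the slot numbering is assigned by your motherboard; at least on mine, slot zero is the board's slot 1).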
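Reading the slot off the log can also be done mechanically. Here is a rough Python sketch, assuming the EDAC line format shown in the question; the regular expression and the function name are illustrative and not taken from any existing tool. It sums the corrected-error counts per memory controller, channel and slot, and skips lines (such as the MMIOH ones) that carry no channel or slot.

```python
import re
from collections import Counter

# Matches lines like:
#   EDAC MC0: 4 CE memory read error on ... (channel:1 slot:0 ... rank:0)
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<kind>.+?) on \S+ "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) .*?rank:(?P<rank>\d+)\)"
)


def tally_corrected_errors(lines):
    """Sum corrected-error counts keyed by (mc, channel, slot)."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match is None:
            continue  # not a per-DIMM CE line
        key = (int(match["mc"]), int(match["channel"]), int(match["slot"]))
        totals[key] += int(match["count"])
    return totals


# A few lines copied from the log in the question.
SAMPLE = [
    "Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299974] EDAC MC0: 4 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x1042 offset:0x100 grain:32 syndrome:0x0 - "
    "OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:0)",
    "Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299989] EDAC MC0: 4 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x85392b offset:0xa80 grain:32 syndrome:0x0 - "
    "OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:1)",
    "Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.299999] EDAC MC0: 2 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x850da9 offset:0x580 grain:32 syndrome:0x0 - "
    "OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:2 rank:1)",
    "Feb 15 18:46:02 XXXX-Z10PE-D8-WS kernel: [17387.300022] EDAC MC0: 2 CE Error at MMIOH area, on addr "
    "0x000000087fd43a40 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)",
]

for (mc, channel, slot), total in sorted(tally_corrected_errors(SAMPLE).items()):
    print(f"MC{mc} channel {channel} slot {slot}: {total} corrected errors")
```

To run it over a real log, pass the open file instead of `SAMPLE`, for example `tally_corrected_errors(open("/var/log/kern.log"))`; that path is the usual Ubuntu location but may differ on your system.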

So I took it out and tested for 48 hours without a single error. I sent the RAM back under warranty and got a replacement.
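One way to confirm a clean run like this without grepping the kernel log is to read the EDAC counters from sysfs. This is a small sketch, assuming the EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN/` as described in the kernel's EDAC documentation; the function name is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": n, "ue_count": m}, ...} from EDAC sysfs."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

The counters accumulate from the time the driver was loaded, so after a reboot with the suspect module removed, a `ce_count` that stays at 0 over a long test is the result you are hoping for.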

Everything is perfect in wonderland now!