David Madore's WebLog: Memory sticks, ECC, and northbridge oddities

I just upgraded my home PC's RAM with four (Kingston) 1GB memory sticks (the older sticks will go in another PC). DDR2 ECC RAM isn't easy to come by in the Chinese-owned computer hardware shops (that's Chinese-owned computer-hardware-shops, not Chinese-owned-computer hardware-shops) of Paris's rue Montgallet, so I bought them online from RAMShopping.fr. Perhaps I didn't really need 4GB (it's strange to think that I now have as much RAM in my PC as I had hard disk space in '97), but disk cache is always useful, and since the PC in question operates in 64-bit mode there is no reason not to go beyond 3GB.

Of course, there is a rule of the Universe which says that the first time memory sticks are inserted in their socket they will always fail because they weren't pushed hard enough (even if the plastic thingy clicked satisfactorily). I still don't understand why they can't put a minimal amount of very slow fail-safe RAM directly on the motherboard, which would let the BIOS boot far enough to print “your system RAM is not responding” or something. The first time I tried, the machine beeped forever on boot, and the second time it didn't even do that: I had to unplug every cable on the computer, lay it horizontally, and reinsert the sticks in the socket before the system finally agreed to boot successfully.

My chipset's northbridge (an Intel 82955X) is ECC-capable (otherwise there would be little point in buying ECC RAM, of course), so I'd like to have a Linux driver to warn of (corrected or detected) ECC errors. Unfortunately, no driver presently seems to exist, even though the chipset's specs are public. I thought I might try writing one myself, but the chip is reacting in a bizarre way that I can't make sense of: it constantly reports multiple-bit ECC errors (as well as “LOCK to Non-DRAM Memory” errors, which I don't understand either), even though an extensive memory test shows nothing wrong. And these errors seem to occur at somewhat magical-seeming memory locations, such as just before or just after a gigabyte boundary: 0xff6bb980 (which may, or may not, really mean 0x13f6bb980, because there's the PCI I/O space at 0xc0000000–0xffafffff or something), 0xffc7db00, 0xffc86080, 0xbffe5000, 0xe8bd2f80, 0x00750580. They are not scattered randomly as one might expect from faulty RAM (and the previous memory sticks gave a similar result, but at different memory locations such as 0x3ffe5000). So I think there's nothing wrong with the RAM itself, but I can't figure out what these error codes mean and, more importantly, how I can filter them out to prevent them from hiding the potential real ECC errors.
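To put a number on the "near a gigabyte boundary" impression, here is a small Python sketch that computes how far each reported address lies from the nearest multiple of 1 GiB. The helper name `offset_from_nearest_gib` is made up for this illustration, and the only inputs are the addresses quoted above.

```python
GIB = 1 << 30  # one gibibyte

# Addresses reported by the northbridge, as quoted in the text
# (the last one comes from the previous memory sticks).
reported = [0xff6bb980, 0xffc7db00, 0xffc86080, 0xbffe5000,
            0xe8bd2f80, 0x00750580, 0x3ffe5000]


def offset_from_nearest_gib(addr):
    """Signed distance in bytes from the nearest GiB multiple to addr.

    A negative result means the address lies just below the boundary,
    a positive one that it lies just above.
    """
    above = addr % GIB
    return above if above <= GIB // 2 else above - GIB


for addr in reported:
    off = offset_from_nearest_gib(addr)
    print(f"{addr:#012x}  {off:+#x}  ({off / 2**20:+.2f} MiB)")
```

Two things stand out from the arithmetic. First, 0xbffe5000 and 0x3ffe5000 both sit exactly 0x1b000 bytes (108 KiB) below a gigabyte boundary, which looks more like a property of the memory map than of any particular stick. Second, 0xe8bd2f80 is hundreds of megabytes away from any boundary, so the pattern is not perfect. Adding 1 GiB to 0xff6bb980 also gives the 0x13f6bb980 mentioned above.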

vega david ~ $ sudo lspci -xxx -s 0:0.0
00:00.0 Host bridge: Intel Corporation 955X Express Memory Controller Hub (rev 81)
00: 86 80 74 27 06 00 90 20 81 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 78 81
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 90 d1 fe 01 40 d1 fe 05 00 00 f0 01 80 d1 fe
50: 00 00 02 00 03 00 00 00 01 50 fe bf ff 00 00 00
60: 00 30 d1 fe 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 10 11 11 11 11 33 33 00 40 00 4f 00 c0 0a 38 00
a0: 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 03 02 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 20 01 00 00
e0: 09 00 09 21 c9 40 1a 98 0c 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 01 00 00 00 00 00

(Here the northbridge is indicating a multiple-bit ECC error and a “LOCK to Non-DRAM Memory” error at memory location 0xbffe5000, and the memory at that location seems quite normal when I look at it.)
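For anyone who wants to pick the relevant registers out of such a dump, the following Python sketch parses rows in the `lspci -xxx` format and extracts three fields. The register names, offsets (0x58, 0x5c, 0xc8) and bit positions used here are assumptions, based on my reading of the dump and on how similar Intel memory controller hubs are laid out, so they should be checked against the datasheet before being trusted.

```python
# Config-space rows copied from the lspci dump above (only the ones used here).
DUMP = """\
50: 00 00 02 00 03 00 00 00 01 50 fe bf ff 00 00 00
c0: 00 00 00 00 00 00 00 00 03 02 00 00 00 00 00 00
"""


def parse_dump(text):
    """Return {offset: byte value} for every byte in lspci -x style rows."""
    regs = {}
    for line in text.splitlines():
        base, _, rest = line.partition(":")
        for i, byte in enumerate(rest.split()):
            regs[int(base, 16) + i] = int(byte, 16)
    return regs


def read_le(regs, offset, size):
    """Read a little-endian register of `size` bytes starting at `offset`."""
    return sum(regs[offset + i] << (8 * i) for i in range(size))


# ASSUMED register layout; verify against the chipset datasheet.
DEAP = 0x58     # error address pointer, 32 bits
DERRSYN = 0x5C  # error syndrome, 8 bits
ERRSTS = 0xC8   # error status, 16 bits
ERRSTS_FLAGS = {  # assumed bit positions
    0: "single-bit ECC error",
    1: "multiple-bit ECC error",
    9: "LOCK to non-DRAM memory",
}

regs = parse_dump(DUMP)
deap = read_le(regs, DEAP, 4)
syndrome = read_le(regs, DERRSYN, 1)
errsts = read_le(regs, ERRSTS, 2)
address = deap & ~0x7F  # assumed: bits 31:7 hold the address
flags = [name for bit, name in ERRSTS_FLAGS.items() if (errsts >> bit) & 1]

print(f"DEAP={deap:#010x} address={address:#010x} syndrome={syndrome:#04x}")
print(f"ERRSTS={errsts:#06x} flags={flags}")
```

The raw values are not in doubt: the four bytes at 0x58 read as 0xbffe5001 and the two bytes at 0xc8 read as 0x0203. Masking off the low bits of the former gives the 0xbffe5000 quoted above. What each set bit means is the assumed part; under the layout guessed here, the low bit of ERRSTS would also flag a single-bit error alongside the two errors already mentioned.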

I wonder how I could find more detailed information on my memory controller than what Intel's datasheet gives.

Incidentally, I added a little JavaScript magic to this blog's comment system so that comments' dates are now displayed in the client's timezone (rather than UTC).