I just upgraded my home PC's RAM
with four (Kingston) 1GB memory
sticks (the older sticks will go in
another PC). DDR2 ECC RAM
isn't easy to come by in the Chinese-owned computer hardware shops
(that's Chinese-owned computer-hardware-shops,
not Chinese-owned-computer hardware-shops) of
Paris's rue Montgallet, so I bought them online
from RAMShopping.fr.
Perhaps I didn't really need 4GB (it's strange to think that I
now have as much RAM in my PC as I
had hard disk space in '97), but disk cache is always useful, and
since the PC in question operates in 64-bit mode there is
no reason not to go beyond 3GB.
Of course, there is a rule of the Universe which says that the
first time memory sticks are inserted in their socket they will always
fail because they weren't pushed hard enough (even if the plastic
thingy clicked satisfactorily). I still don't understand why they
can't put a minimal amount of very slow
fail-safe RAM directly on the motherboard which
would enable the BIOS to boot enough to
print "your system RAM is not responding" or
something: the first time I tried, the machine beeped forever on boot,
and the second time it didn't even do that. I had to unplug every
cable on the computer, lay it horizontally, and reinsert the sticks in
their sockets before the system finally agreed to boot successfully.
My chipset's northbridge
(an Intel
82955X) is ECC-capable (otherwise there would be
little point in buying ECC RAM, of
course), so I'd like to have a Linux driver to warn of (corrected or
detected) ECC errors. Unfortunately, no driver presently
seems to exist, even though the chipset's specs are public. I thought
I might try writing one myself, but the chip is reacting in a bizarre
way: it constantly reports multiple-bit ECC errors (as well as
"LOCK to Non-DRAM Memory" errors, something I can't quite make sense
of), even though an extensive memory test shows nothing wrong. And
these errors seem to occur at somewhat magical-seeming memory
locations, like just before or just after a gigabyte boundary:
0xff6bb980 (which may, or may not, really mean 0x13f6bb980, because
there's the PCI I/O space at 0xc0000000–0xffafffff or something),
0xffc7db00, 0xffc86080, 0xbffe5000, 0xe8bd2f80, 0x00750580. That is
not the random scatter one might expect from faulty RAM (and the
previous memory sticks gave a similar result, but at different memory
locations such as 0x3ffe5000). So I think there's nothing wrong with
the RAM itself, but I can't figure out what these error codes mean
and, more importantly, how I can filter them out to prevent them from
hiding potential real ECC errors.
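To make the "near a gigabyte boundary" impression concrete, here is a small sketch that measures how far each reported address lies from the nearest multiple of 1 GiB. The function name is made up for illustration, and taking "gigabyte boundary" to mean a multiple of 2^30 bytes is an assumption.

```python
GIB = 1 << 30  # assumption: "gigabyte boundary" means a multiple of 2**30 bytes

# Addresses quoted in the post (the last one is from the previous sticks).
REPORTED = [
    0xff6bb980, 0xffc7db00, 0xffc86080,
    0xbffe5000, 0xe8bd2f80, 0x00750580,
    0x3ffe5000,
]


def distance_to_gib_boundary(addr):
    """Return (signed offset in bytes, boundary) for the nearest multiple of 1 GiB."""
    boundary = ((addr + GIB // 2) // GIB) * GIB
    return addr - boundary, boundary


for addr in REPORTED:
    offset, boundary = distance_to_gib_boundary(addr)
    print(f"{addr:#012x}: {offset:+#x} bytes from {boundary:#x} "
          f"({abs(offset) / (1 << 20):.2f} MiB)")

# The two candidate readings of the first address differ by exactly 1 GiB.
print(hex(0x13f6bb980 - 0xff6bb980))
```

Run this way, 0xbffe5000 (new sticks) and 0x3ffe5000 (old sticks) come out at the same distance, 0x1b000 bytes, below their respective boundaries, while 0xe8bd2f80 is not close to any boundary at all.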
vega david ~ $ sudo lspci -xxx -s 0:0.0
00:00.0 Host bridge: Intel Corporation 955X Express Memory Controller Hub (rev 81)
00: 86 80 74 27 06 00 90 20 81 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 78 81
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 90 d1 fe 01 40 d1 fe 05 00 00 f0 01 80 d1 fe
50: 00 00 02 00 03 00 00 00 01 50 fe bf ff 00 00 00
60: 00 30 d1 fe 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 10 11 11 11 11 33 33 00 40 00 4f 00 c0 0a 38 00
a0: 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 03 02 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 20 01 00 00
e0: 09 00 09 21 c9 40 1a 98 0c 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 01 00 00 00 00 00
(Here the northbridge is indicating a multiple-bit ECC error and a
"LOCK to Non-DRAM Memory" error at memory location 0xbffe5000, which
seems quite normal when I look at it.)
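As a rough sketch of how those two values can be pulled out of the dump: the bytes 01 50 fe bf at offset 0x58 read, little-endian, as 0xbffe5001, and the word at offset 0xc8 is 0x0203. Treating 0x58 as the error-address register and 0xc8 as the error-status register is an assumption read off this dump, and so is masking off the low 7 bits (every address quoted above is a multiple of 0x80). Which status bit stands for which error has to be checked against the datasheet. The helper names are invented for the example.

```python
import re

# Excerpt of the `lspci -xxx` output above: only the two rows used here.
DUMP = """\
50: 00 00 02 00 03 00 00 00 01 50 fe bf ff 00 00 00
c0: 00 00 00 00 00 00 00 00 03 02 00 00 00 00 00 00
"""

ERROR_ADDR_OFFSET = 0x58    # assumption: dword holding the failing address
ERROR_STATUS_OFFSET = 0xC8  # assumption: word holding the error flags
ADDR_FLAG_MASK = 0x7F       # assumption: low 7 bits are not address bits


def parse_lspci_hex(text):
    """Place each 'NN: xx xx ...' row into a 256-byte config-space image."""
    config = bytearray(256)
    for line in text.splitlines():
        m = re.fullmatch(r"([0-9a-f]{2}): ((?:[0-9a-f]{2} ?)+)", line.strip())
        if m:  # skips the device description line and anything else
            offset = int(m.group(1), 16)
            row = bytes.fromhex(m.group(2))
            config[offset:offset + len(row)] = row
    return config


def read_le(config, offset, size):
    """Read a little-endian unsigned integer of `size` bytes at `offset`."""
    return int.from_bytes(config[offset:offset + size], "little")


def set_bits(value):
    """List the positions of the bits that are set in `value`."""
    return [i for i in range(value.bit_length()) if value >> i & 1]


config = parse_lspci_hex(DUMP)
raw_addr = read_le(config, ERROR_ADDR_OFFSET, 4)
status = read_le(config, ERROR_STATUS_OFFSET, 2)

print(f"raw address register: {raw_addr:#010x}")
print(f"masked address:       {raw_addr & ~ADDR_FLAG_MASK:#010x}")
print(f"status register:      {status:#06x}, bits set: {set_bits(status)}")
```

Storing rows by their own offset, rather than concatenating them, lets the parser work on a partial dump like the excerpt here as well as on the full 256 bytes.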
I wonder how I could find more detailed information on my memory controller than is given in Intel's datasheet.
Incidentally, I added a little JavaScript magic to this blog's comment system so that comments' dates are now displayed in the client's timezone (rather than UTC).