And now for something equally geeky: more about ECC RAM.
First, I'm told that what I called memory sticks should properly be referred to as memory modules (memory sticks are, indeed, something rather different). Well, screw the Linux source for confusing me: I should have trusted my feelings in this respect. But no matter.
The mystery of my phony ECC errors has been solved and, of course, the answer is ridiculously simple: my BIOS was in a quick boot mode in which it does not initialize RAM at startup, so there was bogus ECC data causing numerous errors as long as the regions in question had not been written to. (Part of the mystery remains, however: if these regions of RAM had never been written to, why were they being read at all? They couldn't possibly contain anything useful… I suspect one of two things: either entire pages were being read, because that's simpler, or else a memory bus read operation always addresses a certain quantity (perhaps 128 bytes) which is greater than what a write operation addresses, so bogus data was being read at the edge of previously written valid data.)
The silver lining of this is that now I know the ECC reporting mechanism works, and I was able to use this to test a little ECC-checking Perl script I wrote for my chipset under Linux (you can find it here). So if an error is detected or corrected on any of the three computers I administer that have ECC RAM, I will get an email telling me about it. I feel my files are much safer now.
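For the curious, here is roughly what such a check can look like. This is not my Perl script (which is specific to my chipset) but a minimal Python sketch that assumes the kernel's generic EDAC driver is loaded and exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. The function names are made up for the illustration.

```python
import glob
import os


def _read_int(path):
    """Read a single integer from a sysfs-style text file."""
    with open(path) as f:
        return int(f.read().strip())


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} error counts.

    Assumes the EDAC sysfs layout: one mcN directory per memory
    controller, each holding ce_count and ue_count files.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        corrected = _read_int(os.path.join(mc_dir, "ce_count"))
        uncorrected = _read_int(os.path.join(mc_dir, "ue_count"))
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts


def format_report(counts):
    """Return one line per controller with errors, or '' if all is well."""
    lines = [
        f"{mc}: {ce} corrected, {ue} uncorrected"
        for mc, (ce, ue) in sorted(counts.items())
        if ce or ue
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    report = format_report(read_edac_counts())
    if report:
        # Printing only when there is something to say lets a cron job
        # stay silent unless errors were counted.
        print(report)
```

Run from cron, a script like this produces output only when some counter is nonzero, and cron is usually configured to mail any output to the job's owner, which is one simple way to get the email notification.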
Also along the lines of improving computer reliability, I discovered (almost by accident) that the same three computers, like most machines with recent Intel chipsets, have a hardware watchdog feature, which I activated. This means that if the computer hangs, the hardware will detect it (once the watchdog is started, it needs to be pinged at regular intervals) and cause a reboot. Hopefully this means I will no longer (or not so often) need to email my mother and ask her to reboot the computer in my room in Orsay.
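The pinging part is simple enough to sketch. The following Python is only an illustration, not what actually runs on my machines: it assumes the standard Linux watchdog device at `/dev/watchdog`, where any write counts as a ping and writing the character `V` just before closing asks the driver to disarm the timer (drivers built with the "no way out" option ignore that request). The function name and its parameters are invented for the example.

```python
import time


def feed_watchdog(device="/dev/watchdog", interval=10.0, pings=None):
    """Ping the watchdog every `interval` seconds.

    With pings=None the loop runs forever; otherwise it stops after
    that many pings and disarms the watchdog. Returns the ping count.
    """
    # Unbuffered binary mode so that each write reaches the driver at once.
    with open(device, "wb", buffering=0) as wd:
        sent = 0
        while pings is None or sent < pings:
            wd.write(b"\0")  # any byte resets the hardware countdown
            sent += 1
            time.sleep(interval)
        # Reached only on a deliberate, clean stop: the "magic close"
        # character tells the driver not to reboot after we close.
        wd.write(b"V")
    return sent


if __name__ == "__main__":
    feed_watchdog()
```

The point of the design is that if this process dies or the whole machine freezes, the pings stop, the `V` is never written, and the hardware reboots the computer when its countdown expires. The interval must of course be shorter than the watchdog's timeout.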