David Madore's WebLog: Needle in a haystack

I'm sure it's happened to all of us Web surfers: you stumble upon a Web page one day (while randomly clicking on links), you don't bother to bookmark it because it doesn't seem especially interesting, or because you don't think you'll need it again, or for any other reason, and then, a week or so later, you become obsessed with finding that site again because there's just a little something you wanted to check, or because it turns out to be very important after all… and then, of course, it's like finding a needle in a haystack. Google can't help, because you don't remember any specific phrase that was on the site, or perhaps because the page isn't indexed for some reason, or else because there are just too many pages matching anything you can think of. Your Web browser may keep a history of the last few thousand visited pages (mine does, at my request), but you can't really search their contents. And you can't search in the cache, either, because it doesn't last long enough (I use a very small browser cache, because I believe that beyond a few megabytes it just uses disk space needlessly).

My answer to this would be to keep a text-only cache/history combo of some kind. First, the browser would store just the text (tags deleted, no images) of the few thousand (or tens of thousands of) most recently visited pages, along with their URLs. Second, it would implement something which Unix grep currently does not have: a search for “at least n of the following m words” (it is not too difficult to grep for records containing all, or at least one, of a set of words, but it is quite a pain to find records containing at least five of a set of seven words, say). The point of this would be to help locate the appropriate page when the user cannot remember exactly which words might be in the text but has a certain number of reasonable candidates.
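
To make the idea a little more concrete, here is a minimal sketch in Python of both halves: stripping a page down to its text, and searching the stored records for “at least n of m words”. It is only an illustration under assumptions of mine, not something any browser actually does: the names `strip_tags`, `count_matches` and `search` are made up, the history is assumed to be a plain list of (URL, text) pairs held in memory, and a “word” is taken to be whatever the regular expression `\w+` matches, compared case-insensitively.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of enclosing <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML page with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def count_matches(text, words):
    """Count how many distinct candidate words occur in the text."""
    present = set(re.findall(r"\w+", text.lower()))
    return sum(1 for w in {w.lower() for w in words} if w in present)


def search(history, words, n):
    """Return the URLs of records containing at least n of the given words.

    `history` is an iterable of (url, text) pairs (an assumed storage format).
    Results are ordered with the best matches first.
    """
    hits = []
    for url, text in history:
        count = count_matches(text, words)
        if count >= n:
            hits.append((count, url))
    hits.sort(key=lambda hit: -hit[0])  # stable: ties keep history order
    return [url for _, url in hits]
```

With a history of three (hypothetical) pages, `search(history, ["quick", "brown", "fox", "lazy", "cat"], 3)` would return only the pages containing at least three of those five words, best matches first. Sorting by the number of matching words is an extra of mine: it seemed natural to show the likeliest candidates at the top, but a plain filter would do as well.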