On Fri, 22 Dec 2006, Timo Sirainen wrote:
> But I guess that's for a mailbox that's already in file cache? I'd think
> that in a real mail server most users' mailboxes need to be read from
> the disk as they're searched, and for a loaded server that can be even
> slower.
All true. But a loaded server is in for a world of hurt no matter what
steps you take. Calling upon it to do things and then abandoning the work
in progress makes the problem worse rather than better.
The solution to loaded servers is to do things that make them less loaded.
Indexing is certainly one such measure, although I think that I mentioned
in a previous message that it can be a double-edged sword.
> I've also heard of users whose INBOX is over 2 gigabytes..
The current UW imapd version won't let you get above 2GB for the flat
files. Some sysadmins consider that to be a feature, not a bug. ;-)
mix doesn't have that limit (although I haven't tested messages above
2GB).
> But for a standard search, yes, I'm converting mails to UTF-8 before
> doing any searching. I should add support for case-insensitive UTF-8
> searches also, but for now I'm doing it only for ASCII. No-one's
> complained yet though :)
OK, so you're doing i;ascii-casemap, which is what the COMPARATOR draft
suggests. I'm promoting i;unicode-casemap (I have an I-D on that) as
something that better servers should do, and trying to see if we can make
it a server option to do i;unicode-casemap even if the client doesn't
know any better and asks for i;ascii-casemap.
FWIW, UW imapd switched from i;ascii-casemap to i;unicode-casemap during
imap-2006 development.
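
To make the difference between the two comparators concrete, here is a
rough Python sketch. The function names are made up for illustration, and
the second function only approximates i;unicode-casemap: it uses Python's
case folding plus compatibility decomposition in place of the titlecase
mapping, so treat it as a demonstration of the idea and not as UW imapd's
code or the exact algorithm in the I-D.

```python
import unicodedata


def ascii_casemap(s: str) -> str:
    """Fold only ASCII a-z to A-Z; every other character is left alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def unicode_casemap_approx(s: str) -> str:
    """Hypothetical stand-in for i;unicode-casemap.

    Assumption: full Unicode case folding followed by compatibility
    decomposition (NFKD) is close enough to show the behaviour.
    """
    return unicodedata.normalize("NFKD", s.casefold())


def equal_under(casemap, a: str, b: str) -> bool:
    """Compare two strings after applying the given case mapping."""
    return casemap(a) == casemap(b)


if __name__ == "__main__":
    # Non-ASCII letters differ in case, so the ASCII comparator misses them.
    print(equal_under(ascii_casemap, "Müller", "MÜLLER"))
    print(equal_under(unicode_casemap_approx, "Müller", "MÜLLER"))
```

With the ASCII mapping, "Müller" and "MÜLLER" compare as different because
the "ü" is never folded; with the Unicode-aware mapping they compare equal,
as do precomposed and decomposed spellings of the same accented letter.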
> Anyway, yes, I could probably get my standard search code a lot faster
> (UW-IMAP searches mboxes 2-3 times faster),
Just for what it's worth, I don't promote UW imapd as being an example of
the ultimate in search performance. Rather, as a reference implementation
it's intended to be a baseline, as in "your implementation should perform
at least as well as UW imapd".
UW imapd does a rather basic Boyer-Moore search. The real cost in its
search code is in preparing the strings; it first converts both the
pattern and the text being searched into a canonicalized UTF-8 that has
been decomposed and coerced into titlecase. The current implementation of
that code is far more costly than the search code, and almost certainly
can be improved upon.
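
The two-stage shape described above (canonicalize both strings, then run a
byte-level search) can be sketched in Python as follows. This is not UW
imapd's code. The names canonicalize, horspool_find and search are
hypothetical, the canonicalization uses case folding plus NFKD as a
stand-in for the decompose-and-titlecase step, and the search is the
simplified Horspool variant of Boyer-Moore rather than the full algorithm.

```python
import unicodedata


def canonicalize(s: str) -> bytes:
    """Assumed canonical form: case-fold, decompose (NFKD), encode UTF-8."""
    return unicodedata.normalize("NFKD", s.casefold()).encode("utf-8")


def horspool_find(text: bytes, pattern: bytes) -> int:
    """Return the first offset of pattern in text, or -1 if absent.

    Boyer-Moore-Horspool: only the bad-character shift table is used.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each byte value, how far the window may slide when that byte
    # sits under the last position of the pattern.
    shift = [m] * 256
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += shift[text[pos + m - 1]]
    return -1


def search(message: str, pattern: str) -> bool:
    """Canonicalize both inputs, then search the resulting bytes."""
    return horspool_find(canonicalize(message), canonicalize(pattern)) >= 0
```

Note that in this sketch the message text is canonicalized on every search,
which is exactly the kind of preparation cost that dominates the search
itself; caching or streaming the canonical form is one obvious place to
look for savings.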
So, if UW imapd searches 2-3 times faster than your standard search code,
that's an indication that it's worth putting some work into search
performance. You should be able to beat UW imapd's times.
> but that won't help with
> disk I/O usage. Usually there's enough CPU to go around, but not that
> much available disk I/O. Indexing helps a lot with that. So it's not
> just for bringing down search times from a few seconds to zero, but also
> lowering the system load in general.
I agree. Don't let me talk you out of indexing. I'm just suggesting that
you cast a wider net, and hopefully bring in a lot more fish... ;-)
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.