[Imap-protocol] Searching

E-mail headers

From:	Timo Sirainen <tss@iki.fi>
To:	imap-protocol@u.washington.edu
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	1166603479.22214.298.camel@hurina permalink / raw / eml / mbox

RFC says: "In all search keys that use strings, a message matches the
key if the string is a substring of the field."

Strictly reading this, it means that the SEARCH command can't really be
optimized with most of the normal search index algorithms/libraries,
because they support only matching from the beginning of words.

So, in case I wanted to use such indexes, what do you think I should do?

1) Just use it for BODY and TEXT searches. Probably won't really break
anything? I think some servers already do this.

2) Add some X-NONEXACT-BODY search extension. Or use COMPARATOR somehow?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 196 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20061220/6b6b6991/attachment.sig>

Reply

E-mail headers

From:	arnt@gulbrandsen.priv.no
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	8j2+4TzX9tp2p0Ep9/IgQw.md5@libertango.oryx.com permalink / raw / eml / mbox

Timo Sirainen writes:
> RFC says: "In all search keys that use strings, a message matches the 
> key if the string is a substring of the field."
>
> Strictly reading this, it means that the SEARCH command can't really 
> be optimized with most of the normal search index 
> algorithms/libraries, because they support only matching from the 
> beginning of words.
>
> So, in case I wanted to use such indexes, what do you think I should do?
>
> 1) Just use it for BODY and TEXT searches. Probably won't really break 
> anything? I think some servers already do this.

As a first try, I would scan for spaces in the search key and doing an 
index-assisted search on whatever follows the first space, then looking 
at the matches more closely. If the command is ?SEARCH SUBJECT "a b"?, 
that implies that only messages whose subject contain a word starting 
with b can match, so use that to narrow down the possibilities. If that 
doesn't help your users enough, you could revisit the problem in a 
later server version.

> 2) Add some X-NONEXACT-BODY search extension. Or use COMPARATOR somehow?

You could define a new collation whose substring operation only chops at 
word boundaries, and then name that collation using COMPARATOR, but I 
doubt that would help you. Which clients would do it?

For me, the most important aspect of this problem is address searches. 
After discussing that with Mark Crispin, I implemented a little bit of 
address parsing: My code treats a ?SEARCH FROM "<tss@"? as a search for 
messages from addresses whose complete localpart is tss. My code will 
not find ?From: "<tss@" <any@thi.ng>?, but it finds mail from you 
really quickly.

Arnt

Reply

E-mail headers

From:	mrc@CAC.Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	alpine.OSX.0.81.0612201018450.10225@pangtzu.panda.com permalink / raw / eml / mbox

On Wed, 20 Dec 2006, Timo Sirainen wrote:
> RFC says: "In all search keys that use strings, a message matches the
> key if the string is a substring of the field."
> 
> Strictly reading this, it means that the SEARCH command can't really be
> optimized with most of the normal search index algorithms/libraries,
> because they support only matching from the beginning of words.
> 
> So, in case I wanted to use such indexes, what do you think I should do?

Implement SEARCH according to the IMAP specification.  If a library search 
routine does not fufill the IMAP specification, then it is unsuitable for 
use with IMAP SEARCH.

If you are worried about performance, I suggest that you look into one of 
the optimizing algorithms for fast search.  UW and Cyrus use a Boyer-Moore 
fast search which is quite a bit faster than the straightforward linear 
search.

> 1) Just use it for BODY and TEXT searches. Probably won't really break
> anything? I think some servers already do this.

There are clients that depend upon substring searches.  I also do 
substring searches on a regular basis.  Any server that does not do 
substring searches is broken and non-compliant.

I believe that you are a reasonable person, will accept this bad news with 
grace, and will implement a compliant SEARCH in your server in spite of 
the difficulty that it causes you.  There needs to be at least one 
compliant IMAP server that supports Maildir format (Courier is not 
compliant).  Please show that my belief is true.

> 2) Add some X-NONEXACT-BODY search extension. Or use COMPARATOR somehow?

The problem with one-server extensions is that nobody uses them.  Feel 
free to do so anyway, as long as you implement the standard SEARCH 
correctly; but it will languish in obscurity.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Reply

E-mail headers

From:	tss@iki.fi
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	1166620490.22214.327.camel@hurina permalink / raw / eml / mbox

On Wed, 2006-12-20 at 12:09 +0100, Arnt Gulbrandsen wrote:
> Timo Sirainen writes:
> > RFC says: "In all search keys that use strings, a message matches the 
> > key if the string is a substring of the field."
> >
> > Strictly reading this, it means that the SEARCH command can't really 
> > be optimized with most of the normal search index 
> > algorithms/libraries, because they support only matching from the 
> > beginning of words.
> >
> > So, in case I wanted to use such indexes, what do you think I should do?
> >
> > 1) Just use it for BODY and TEXT searches. Probably won't really break 
> > anything? I think some servers already do this.
> 
> As a first try, I would scan for spaces in the search key and doing an 
> index-assisted search on whatever follows the first space, then looking 
> at the matches more closely. If the command is ?SEARCH SUBJECT "a b"?, 
> that implies that only messages whose subject contain a word starting 
> with b can match, so use that to narrow down the possibilities. If that 
> doesn't help your users enough, you could revisit the problem in a 
> later server version.

Normally people search only single words, so this doesn't help much.

> > 2) Add some X-NONEXACT-BODY search extension. Or use COMPARATOR somehow?
> 
> You could define a new collation whose substring operation only chops at 
> word boundaries, and then name that collation using COMPARATOR, but I 
> doubt that would help you. Which clients would do it?

I mostly need it for one webmail installation where the webmail code can
be modified if needed, but getting it standardized would be nice. I
imagine other server implementors (and especially admins and users)
would be happy to have fast message body searches. At least I've
preferred grep for a long time since it was just too slow via IMAP (yea,
I should look into optimizing my search code too :).

Perhaps related to this issue is the possibility to allow some kind of
fuzzy matching, where the server may decide to show results for strings
which sort-of match, but not completely.

BTW. I did also implement Cyrus Squat like indexes which allow full
substring searches to be done fast, but they take more disk space and
they're slower to update than "normal" search indexes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 196 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20061220/405e2eb8/attachment.sig>

Reply

E-mail headers

From:	tss@iki.fi
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	1166643258.22214.350.camel@hurina permalink / raw / eml / mbox

On Wed, 2006-12-20 at 10:40 -0800, Mark Crispin wrote:
> On Wed, 20 Dec 2006, Timo Sirainen wrote:
> > RFC says: "In all search keys that use strings, a message matches the
> > key if the string is a substring of the field."
> > 
> > Strictly reading this, it means that the SEARCH command can't really be
> > optimized with most of the normal search index algorithms/libraries,
> > because they support only matching from the beginning of words.
> > 
> > So, in case I wanted to use such indexes, what do you think I should do?
> 
> Implement SEARCH according to the IMAP specification.  If a library search 
> routine does not fufill the IMAP specification, then it is unsuitable for 
> use with IMAP SEARCH.

That's what I was thinking, just wanted to make sure that I wouldn't
waste time on doing it in a more complex way than necessary.

> If you are worried about performance, I suggest that you look into one of 
> the optimizing algorithms for fast search.  UW and Cyrus use a Boyer-Moore 
> fast search which is quite a bit faster than the straightforward linear 
> search.

Optimizing the string search would help some, but for large mailboxes
it's still a bit too slow. People want instant search results
nowadays. :)

> > 2) Add some X-NONEXACT-BODY search extension. Or use COMPARATOR somehow?
> 
> The problem with one-server extensions is that nobody uses them.  Feel 
> free to do so anyway, as long as you implement the standard SEARCH 
> correctly; but it will languish in obscurity.

Perhaps. I think it depends on how badly mail admins want it. If it's
only a small s/BODY/X-NONEXACT-BODY/ replace for their webmail code,
it'll get usage at least within Dovecot community.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 196 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20061220/ad8620f1/attachment.sig>

Reply

E-mail headers

From:	arnt@gulbrandsen.priv.no
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	rBtlBkn+facAYdHBwRiNTg.md5@libertango.oryx.com permalink / raw / eml / mbox

Mark Crispin writes:
> If you are worried about performance, I suggest that you look into one 
> of the optimizing algorithms for fast search.  UW and Cyrus use a 
> Boyer-Moore fast search which is quite a bit faster than the 
> straightforward linear search.

Boyer-Moore is k*O(n) with an impressive k. People want k*O(log n) these 
days, if possible with an impressive k.

Arnt

Reply

E-mail headers

From:	MRC@CAC.Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	alpine.WNT.0.81.0612201412380.3856@Shimo-Tomobiki.panda.com permalink / raw / eml / mbox

On Wed, 20 Dec 2006, Timo Sirainen wrote:
> Optimizing the string search would help some, but for large mailboxes
> it's still a bit too slow. People want instant search results
> nowadays. :)

Please define what you mean by "large" and "instant".

It took some effort for me to construct a mailbox that was pathologically 
large enough for a search in UW imapd to take a whole 2 seconds.

> Perhaps. I think it depends on how badly mail admins want it. If it's
> only a small s/BODY/X-NONEXACT-BODY/ replace for their webmail code,
> it'll get usage at least within Dovecot community.

By the way, are you doing charset and i18n case-mapping in your 
"non-exact" search?  That, and not the searching, is what takes time.

IMAP2 searches didn't have any of that; but for some reason the people in 
non-English speaking countries didn't think that it was good enough.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Reply

E-mail headers

From:	mrc@CAC.Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	alpine.OSX.0.81.0612201155450.414@pangtzu.panda.com permalink / raw / eml / mbox

On Wed, 20 Dec 2006, Arnt Gulbrandsen wrote:
> Boyer-Moore is k*O(n) with an impressive k. People want k*O(log n) these 
> days, if possible with an impressive k.

I question that assumption for IMAP searches.  Even with large mailboxes, 
Boyer-Moore causes no user-visible delay in search time.

What causes search time delays is I/O time; and that is the attraction 
behind such things as indexed searching.  However, indexing can have its 
own costs (I turned off Spotlight on my Mac due to its indexing consuming 
most of the CPU for hours on end!), and it doesn't do any good if it 
violates the specification and leads to the failure of a client that 
relies upon the specification.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Reply

E-mail headers

From:	tss@iki.fi
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	1166744138.22214.523.camel@hurina permalink / raw / eml / mbox

On Wed, 2006-12-20 at 14:19 -0800, Mark Crispin wrote:
> On Wed, 20 Dec 2006, Timo Sirainen wrote:
> > Optimizing the string search would help some, but for large mailboxes
> > it's still a bit too slow. People want instant search results
> > nowadays. :)
> 
> Please define what you mean by "large" and "instant".

Some people would want to see the results as they keep typing the search
keyword. For that kind of a user interface the search can't really take
much longer than 0.1 seconds or it'll look slow.

> It took some effort for me to construct a mailbox that was pathologically 
> large enough for a search in UW imapd to take a whole 2 seconds.

But I guess that's for a mailbox that's already in file cache? I'd think
that in a real mail server most users' mailboxes need to be read from
the disk as they're searched, and for a loaded server that can be even
slower. I've also heard of users whose INBOX is over 2 gigabytes..

> > Perhaps. I think it depends on how badly mail admins want it. If it's
> > only a small s/BODY/X-NONEXACT-BODY/ replace for their webmail code,
> > it'll get usage at least within Dovecot community.
> 
> By the way, are you doing charset and i18n case-mapping in your 
> "non-exact" search?  That, and not the searching, is what takes time.

The "non-exact" naming means just that the text/body searching can
implement different search string matching rules than what IMAP RFC
defines. I couldn't think of a good name for it. Maybe
X-NONRFC-TEXT/BODY :)

But for a standard search, yes, I'm converting mails to UTF-8 before
doing any searching. I should add support for case-insensitive UTF-8
searches also, but for now I'm doing it only for ASCII. No-one's
complained yet though :)

Anyway, yes, I could probably get my standard search code a lot faster
(UW-IMAP searches mboxes 2-3 times faster), but that won't help with
disk I/O usage. Usually there's enough CPU to go around, but not that
much available disk I/O. Indexing helps a lot with that. So it's not
just for bringing down search times from a few seconds to zero, but also
lowering the system load in general.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 196 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20061222/07ac7c51/attachment.sig>

Reply

E-mail headers

From:	MRC@CAC.Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:38 -0000
Message-ID:	alpine.WNT.0.81.0612211836110.372@Tomobiki-Cho.CAC.Washington.EDU permalink / raw / eml / mbox

On Fri, 22 Dec 2006, Timo Sirainen wrote:
> But I guess that's for a mailbox that's already in file cache? I'd think
> that in a real mail server most users' mailboxes need to be read from
> the disk as they're searched, and for a loaded server that can be even
> slower.

All true.  But a loaded server is in for a world of hurt no matter what 
steps you take.  Calling upon it to do things, and abandoning the work in 
progress, makes the problem worse rather than better.

The solution to loaded servers is to do things that make them less loaded. 
Indexing is certainly one such measure, although I think that I mentioned 
in a previous message that it can be a double-edged sword.

> I've also heard of users whose INBOX is over 2 gigabytes..

The current UW imapd version won't let you get above 2GB for the flat 
files.  Some sysadmins consider that to be a feature, not a bug. ;-)

mix doesn't have that limit (although I haven't tested messages above 
2GB).

> But for a standard search, yes, I'm converting mails to UTF-8 before
> doing any searching. I should add support for case-insensitive UTF-8
> searches also, but for now I'm doing it only for ASCII. No-one's
> complained yet though :)

OK, so you're doing i;ascii-casemap which is what the COMPARATOR draft 
suggests.  I'm promoting i;unicode-casemap (I have an I-D on that) as 
something that better servers should do, and trying to see if we can make 
it a server option to do i;unicode-casemap even if the client doesn't 
know any better and asks for i;ascii-casemap.

FWIW, UW imapd switched from i;ascii-casemap to i;unicode-casemap during 
imap-2006 development.

> Anyway, yes, I could probably get my standard search code a lot faster
> (UW-IMAP searches mboxes 2-3 times faster),

Just for what it's worth, I don't promote UW imapd as being an example of 
the ultimate in search performance.  Rather, as a reference implementation 
it's intended to be a baseline, as in "your implementation should perform
at least as well as UW imapd".

UW imapd does a rather basic Boyer-Moore search.  The real cost in its 
search code is in preparing the strings; it first converts both the 
pattern and the search string into a canonicalized UTF-8 that has been 
decomposed and coerced into titlecase.  The current implementation of that 
code is far more costly than the search code, and almost certainly can be 
improved upon.

So, if UW imapd searches 2-3 times faster than your standard search code 
that's an indication that it's worth putting in some work into search 
performance.  You should be able to beat UW imapd's times.

> but that won't help with
> disk I/O usage. Usually there's enough CPU to go around, but not that
> much available disk I/O. Indexing helps a lot with that. So it's not
> just for bringing down search times from a few seconds to zero, but also
> lowering the system load in general.

I agree.  Don't let me talk you out of indexing.  I'm just suggesting that 
you cast a wider net, and hopefully bring in a lot more fish... ;-)

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

Reply

mailing list archives

[Imap-protocol] Searching