[Imap-protocol] SEARCH SUBJECT

E-mail headers

From:	Timo Sirainen <tss@iki.fi>
To:	imap-protocol@u.washington.edu
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	1203623023.4901.242.camel@hurina

SEARCH SUBJECT is defined to match against the envelope's subject field.

The ENVELOPE subject may not be an exact copy of the Subject: header
itself. RFC 3501 doesn't seem to define clearly how it should be
generated, so many servers at least compress LWSP to single spaces.

SEARCH HEADER is defined to match against the header's value.

So suppose we have a message:

Subject: hello<TAB>world

and the server returns the ENVELOPE subject field with the <TAB>
replaced by a space ("hello world").

Then I think the SEARCHes should work like this:

SEARCH SUBJECT "hello world" -> match
SEARCH SUBJECT "hello<TAB>world" -> non-match
SEARCH HEADER subject "hello world" -> non-match
SEARCH HEADER subject "hello<TAB>world" -> match

Right?
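
The four cases above can be sketched in Python. This is only an illustration, assuming a server that builds the ENVELOPE subject by collapsing each run of linear whitespace into one space (RFC 3501 does not mandate this) and that matches by case-insensitive substring. The function names are hypothetical.

```python
import re

def envelope_subject(raw_value: str) -> str:
    """Hypothetical ENVELOPE subject: each run of space/TAB/CR/LF becomes one space."""
    return re.sub(r"[ \t\r\n]+", " ", raw_value).strip()

def search_subject(raw_value: str, key: str) -> bool:
    # SEARCH SUBJECT: substring test against the (normalized) envelope subject
    return key.casefold() in envelope_subject(raw_value).casefold()

def search_header(raw_value: str, key: str) -> bool:
    # SEARCH HEADER: substring test against the header value as stored
    return key.casefold() in raw_value.casefold()

raw = "hello\tworld"  # the value of "Subject: hello<TAB>world"
results = {
    'SUBJECT "hello world"': search_subject(raw, "hello world"),
    'SUBJECT "hello<TAB>world"': search_subject(raw, "hello\tworld"),
    'HEADER subject "hello world"': search_header(raw, "hello world"),
    'HEADER subject "hello<TAB>world"': search_header(raw, "hello\tworld"),
}
```

Under these assumptions the two keys give opposite answers for the same message, which is the question being asked.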


E-mail headers

From:	leiba@watson.ibm.com
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	47BDE0F2.4070704@watson.ibm.com

My opinion is that most searches are user-originated.  A few, like flag 
searches, are used in server processes (to discover all the unseen 
messages, for instance, or to limit retrieval to a certain date 
threshold), but most are user-originated.  Because of that, the point is 
more to do what's likely to make users happy, to give them what they 
expect... than it is to exactly match some picky spec.

Therefore, I'd say that any server that normalized white space in a text 
search would be doing everyone a favour, whether or not it's 
to-the-letter "compliant" to anything.  The same for a search that 
spanned "lines" (CRLF boundaries), treated email addresses 
intelligently, or normalized parts of speech or conjugations (treating 
"swim", "swims", and "swam" as the same word, say).

None of that sort of thing is likely to have any interoperability 
effect.  It's just likely to help users find what they're looking for.

Barry


E-mail headers

From:	MRC@Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	alpine.WNT.1.00.0802211344070.4860@Tomobiki-Cho.CAC.Washignton.EDU

I agree.  I feel that a server is allowed to implement fuzzy string 
searching, with case-independence being the only absolute requirement.

More specifically, I never intended to forbid fuzzy matching, and 
deliberately left it open-ended to allow implementations to experiment 
with what worked best.  [Google considered it good news when I told them 
this was something in their server that I thought did NOT need fixing!]

This means that SEARCH compliance testing can only test for false 
negatives; that is, for failure to match cases that both a rigid and a 
fuzzy server would catch.

Clearly, if a message has
 	Subject: Hello<tab>world
then
 	tag SEARCH SUBJECT "Hello<tab>world"
and
 	tag SEARCH HEADER "SUBJECT" "Hello<tab>world"
and
 	tag SEARCH SUBJECT "HELLO<tab>WORLD"
and
 	tag SEARCH HEADER "SUBJECT" "hello<tab>WoRlD"

should all match, but it is server-dependent whether

 	tag SEARCH SUBJECT "Hello world"
and
 	tag SEARCH HEADER "SUBJECT" "Hello, world"
and
 	tag SEARCH SUBJECT "hi, planet!"
and
 	tag SEARCH HEADER "SUBJECT" "konnichi ha, seikai"
match.  [The last two being extreme examples that I wouldn't expect to 
work.]
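
A minimal sketch of a compliance check along these lines, assuming the required baseline is a case-insensitive substring match on the stored header value. The helper names and sample data are hypothetical. Extra hits from a fuzzy server are ignored; only missing required hits are reported.

```python
def required_match(header_value: str, key: str) -> bool:
    """Baseline assumed here: case-insensitive substring of the header value."""
    return key.casefold() in header_value.casefold()

def false_negatives(subjects: dict, key: str, server_hits: set) -> set:
    """Message numbers the baseline requires but the server did not return."""
    required = {num for num, value in subjects.items() if required_match(value, key)}
    return required - server_hits

# Hypothetical mailbox: message number -> Subject value
subjects = {1: "Hello\tworld", 2: "Greetings, planet"}

# A fuzzy server that also returns message 2 still passes.
fuzzy_ok = false_negatives(subjects, "HELLO\tWORLD", {1, 2})

# A server that returns nothing fails on message 1.
too_strict = false_negatives(subjects, "hello\tWoRlD", set())
```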

On Thu, 21 Feb 2008, Barry Leiba wrote:
> My opinion is that most searches are user-originated.  A few, like flag 
> searches, are used in server processes (to discover all the unseen messages, 
> for instance, or to limit retrieval to a certain date threshold), but most 
> are user-originated.  Because of that, the point is more to do what's likely 
> to make users happy, to give them what they expect... than it is to exactly 
> match some picky spec.
>
> Therefore, I'd say that any server that normalized white space in a text 
> search would be doing everyone a favour, whether or not it's to-the-letter 
> "compliant" to anything.  The same for a search that spanned "lines" (CRLF 
> boundaries), treated email addresses intelligently, or normalized parts of 
> speech or conjugations (treating "swim", "swims", and "swam" as the same 
> word, say).
>
> None of that sort of thing is likely to have any interoperability effect. 
> It's just likely to help users find what they're looking for.
>
> Barry

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.


E-mail headers

From:	tss@iki.fi
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	1203631656.4901.270.camel@hurina

On Thu, 2008-02-21 at 13:56 -0800, Mark Crispin wrote:
> I agree.  I feel that a server is allowed to implement fuzzy string 
> searching, with case-independence being the only absolute requirement.
> 
> More specifically, I never intended to forbid fuzzy matching, and 
> deliberately left it open-ended to allow implementations to experiment 
> with what worked best.  [Google considered it good news when I told them 
> this was something in their server that I thought did NOT need fixing!]

So, how is this related to what you said about substring searches a year
ago? 

http://mailman1.u.washington.edu/pipermail/imap-protocol/2006-December/000328.html

I doubt Google (or anyone else implementing fuzzy matching) supports
substring matching.


E-mail headers

From:	MRC@Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	alpine.WNT.1.00.0802211411191.4800@Tomobiki-Cho.CAC.Washignton.EDU

On Fri, 22 Feb 2008, Timo Sirainen wrote:
> So, how is this related to what you said about substring searches a year
> ago?
> http://mailman1.u.washington.edu/pipermail/imap-protocol/2006-December/000328.html

That dealt with false *negatives* due to failure to do substring matching.

I don't object to fuzzy matching that adds positives that a non-fuzzy 
search would not match.

But that is a good question.  It deserves clarification in the 
specification.  The principle should be "match what you are required to 
match, but if you have some fuzzy algorithms that produce useful 
additional matches, then go for it."

In spam filtering, we want to err on the side of false negatives.  But in 
IMAP searches, we err on the side of false positives.

> I doubt Google (or anyone else implementing fuzzy matching) supports
> substring matching.

They claimed that it works.

-- Mark --



E-mail headers

From:	tss@iki.fi
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	1203640514.4901.280.camel@hurina

On Thu, 2008-02-21 at 14:14 -0800, Mark Crispin wrote:
> On Fri, 22 Feb 2008, Timo Sirainen wrote:
> > So, how is this related to what you said about substring searches a year
> > ago?
> > http://mailman1.u.washington.edu/pipermail/imap-protocol/2006-December/000328.html
> 
> That dealt with false *negatives* due to failure to do substring matching.
> 
> I don't object to fuzzy matching that adds positives that a non-fuzzy 
> search would not match.
> 
> But that is a good question.  It deserves clarification in the 
> specification.  The principle should be "match what you are required to 
> match, but if you have some fuzzy algorithms that produce useful 
> additional matches, then go for it."

Which fields do you think could be matched fuzzily?

 - SUBJECT, TEXT, BODY at least
 - FROM, TO, CC, BCC real name fields, user@domain maybe?
 - HEADER x y? HEADER message-id, in-reply-to, references (and maybe
others?) probably a bad idea.
 - SMALLER, LARGER probably not? (So a server couldn't decide that 1MB+1
would match SMALLER 1048576)
 - Date searches not(?)
 - Keywords not

> > I doubt Google (or anyone else implementing fuzzy matching) supports
> > substring matching.
> 
> They claimed that it works.

At least not in the current public implementation:

x search subject different
* SEARCH 1
x OK SEARCH completed (Success)
x search subject ifferent
* SEARCH
x OK SEARCH completed (Success)

x search body thanks
* SEARCH 1 5 8
x OK SEARCH completed (Success)
x search body hanks
* SEARCH
x OK SEARCH completed (Success)
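
One way such results could arise is a whole-word index. That is an assumption for illustration, not something the transcript establishes about this server. The sketch below contrasts a hypothetical whole-word matcher with plain substring matching; the sample subject is made up.

```python
import re

def words(text: str) -> set:
    """Case-folded set of word tokens in the text."""
    return set(re.findall(r"\w+", text.casefold()))

def whole_word_match(text: str, key: str) -> bool:
    # Hypothetical index lookup: every word of the key must be a whole word of the text
    key_words = words(key)
    return bool(key_words) and key_words <= words(text)

def substring_match(text: str, key: str) -> bool:
    # Case-insensitive substring match
    return key.casefold() in text.casefold()

subject = "Something different"  # made-up sample subject

word_full = whole_word_match(subject, "different")
word_partial = whole_word_match(subject, "ifferent")
sub_partial = substring_match(subject, "ifferent")
```

A whole-word matcher misses "ifferent" where a substring matcher finds it, which is a false negative of the kind discussed earlier in the thread.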


E-mail headers

From:	MRC@Washington.EDU
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	alpine.WNT.1.00.0802211724261.4800@Tomobiki-Cho.CAC.Washignton.EDU

On Fri, 22 Feb 2008, Timo Sirainen wrote:
> - SUBJECT, TEXT, BODY at least

Yes.

> - FROM, TO, CC, BCC real name fields, user@domain maybe?

Yes.  For the address list, I canonicalize the names into RFC 2822 
shortest form.  I probably should always use phrase route-addr form in 
order to make "<user@example.com>" always match.
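
A rough sketch of the "always use phrase route-addr form" idea, using Python's standard `email.utils.getaddresses` to parse the address list. The function names and the choice to join addresses with ", " are assumptions for illustration, not a description of any particular server.

```python
from email.utils import getaddresses

def searchable_addresses(header_value: str) -> str:
    """Render every address as 'phrase <addr>', keeping the angle brackets
    even when there is no display name."""
    parts = []
    for name, addr in getaddresses([header_value]):
        parts.append(f"{name} <{addr}>".strip())
    return ", ".join(parts)

def address_search(header_value: str, key: str) -> bool:
    # Case-insensitive substring match against the canonical rendering
    return key.casefold() in searchable_addresses(header_value).casefold()

bare = "user@example.com"
named = '"Fred Foo" <fred@example.com>'

bare_hit = address_search(bare, "<user@example.com>")
named_hit = address_search(named, "fred foo <FRED@example.com>")
```

With this rendering a key written with angle brackets matches whether or not the original header used them.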

> - HEADER x y? HEADER message-id, in-reply-to, references (and maybe
> others?) probably a bad idea.

You're probably right here, but I don't want to commit to any definite 
statement here since I haven't thoroughly considered all the 
possibilities.

> - SMALLER, LARGER probably not? (So a server couldn't decide that 1MB+1
> would match SMALLER 1048576)
> - Date searches not(?)
> - Keywords not

Probably not, for any of these.  The client can easily broaden these 
itself if it wants a bit of fuzz.
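
Client-side broadening could look something like the following sketch. The helper names and the amount of slack are hypothetical. The date is written in the day-month-year form that IMAP search keys use.

```python
from datetime import date, timedelta

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def imap_date(d: date) -> str:
    """Format a date as an IMAP search date, e.g. 21-Feb-2008."""
    return f"{d.day}-{MONTHS[d.month - 1]}-{d.year}"

def broadened_smaller(limit: int, slack: int) -> str:
    # Raise the size limit by some slack so near misses still match
    return f"SMALLER {limit + slack}"

def broadened_since(day: date, slack_days: int) -> str:
    # Move the start date earlier by a few days
    return f"SINCE {imap_date(day - timedelta(days=slack_days))}"

size_key = broadened_smaller(1048576, 1024)
date_key = broadened_since(date(2008, 2, 21), 1)
```

The server then keeps exact semantics for sizes and dates, and the client decides how much fuzz it wants.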

>>> I doubt Google (or anyone else implementing fuzzy matching) supports
>>> substring matching.
> Not at least in the current public implementation:

Oh well...  I hope that they fix that.  I think that they would have a 
difficult time arguing that a search for "tokyo" should not match 
"neotokyo"... ;-)

-- Mark --



E-mail headers

From:	arnt@gulbrandsen.priv.no
To:	imap-protocol@localhost
Date:	Fri, 08 Jun 2018 12:34:41 -0000
Message-ID:	sWp9kbt6bwAYdSeuA+mt/w.md5@lochnagar.oryx.com

Timo Sirainen writes:
> What do you think the fuzzy matching fields could be?
>
> ...
>  - FROM, TO, CC, BCC real name fields, user@domain maybe?

Fuzzy, inexact, and exact matching are all useful. Addresses are 
important in email, which means there are many different useful things 
one can do with them ;)

> - HEADER x y?

I like the idea of fuzzy matching on unstructured fields. Not so keen on 
fuzzily matching structured fields.

>  ...
>  - Date searches not(?)

Date searches are already slightly fuzzy. I'm not sure my code handles 
time zones the way the RFC says to.
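
To illustrate where the fuzz can come from: RFC 3501 describes the SENT* keys as comparing the Date: header "disregarding time and timezone". The sketch below contrasts that reading with converting to UTC first. The function names and the sample header are made up for illustration.

```python
from datetime import date, timezone
from email.utils import parsedate_to_datetime

def sent_date_as_written(date_header: str) -> date:
    """Calendar date as written in the header, ignoring time and zone."""
    return parsedate_to_datetime(date_header).date()

def sent_date_in_utc(date_header: str) -> date:
    """Calendar date after converting the timestamp to UTC first."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc).date()

header = "Thu, 21 Feb 2008 23:30:00 -0800"  # made-up sample value

as_written = sent_date_as_written(header)
in_utc = sent_date_in_utc(header)
```

For a message sent late in the evening in a western time zone, the two approaches land on different days, so SENTON can give different answers on different servers.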

Arnt
