The "sort threads by their first message's date" behavior is bad because
new messages can't be easily noticed in the mailbox. So, there should be
some way to specify "sort thread by their latest message's received
date" (and date-header?).
There has been talk about this before, but I guess no-one's done
anything about this? How about a new extension, eg. THREADEXT which
allows specifying the thread sorting in some way, eg:
1 thread (references) (internaldate) us-ascii all
Also another separate thing related to threads: Has anyone tried
optimizing the THREAD command in a way that after the initial threading,
the results would be stored somewhere and later when a new mail arrives,
the new thread list could be generated without having to rebuild
everything?
I haven't thought about this much yet, but I was hoping to find out a
way where I could just store the final thread reply and insert the new
mail in there with a single scan through all the messages in the mailbox
(looking at their message-id, in-reply-to, references and subject
headers but without actually having to store most of them to memory).
On Fri, 31 Mar 2006, Timo Sirainen wrote:
> The "sort threads by their first message's date" behavior is bad because
> new messages can't be easily noticed in the mailbox.
My MUA has a way to take me to new messages regardless of how they are
ordered in the mailbox. That might be a useful feature for your MUA.
> So, there should be
> some way to specify "sort thread by their latest message's received
> date" (and date-header?).
> There has been talk about this before, but I guess no-one's done
> anything about this? How about a new extension, eg. THREADEXT which
> allows specifying the thread sorting in some way
The way to define a new threading algorithm is...to define a new threading
algorithm! There is no need for THREADEXT that adds suboptions to an
algorithm when you just just define a new THREAD=xxx.
There is already a very nasty comment in Wikipedia about IMAP's complexity
(at least it now longer says "20 orders of magnitude more complex than
POP3"). I don't think that we should increase complexity by adding
sub-options to sub-options, which is effectively what adding options to
modify threading does.
Put another way: if a way of threading is valuable enough, there is plenty
of room in the namespace to define a new algorithm. If it isn't valuable
enough to define a new algorithm, then that cost serves as a desirable
barrier to keep bloat out of the protocol.
> Also another separate thing related to threads: Has anyone tried
> optimizing the THREAD command in a way that after the initial threading,
> the results would be stored somewhere and later when a new mail arrives,
> the new thread list could be generated without having to rebuild
> everything?
People have tried to work out how to do insertion in THREAD results. It
is not a simple task. A single message can completely change the thread
results. It's not just one link tying together two threads; a message can
have references that tie together an unbounded number of links.
In fact, THREAD was what shot down an insertion mechanism for SORT, when
it was pointed out that most users will thread rather than sort.
> I haven't thought about this much yet, but I was hoping to find out a
> way where I could just store the final thread reply and insert the new
> mail in there with a single scan through all the messages in the mailbox
> (looking at their message-id, in-reply-to, references and subject
> headers but without actually having to store most of them to memory).
The cost is not memory in the server. The cost is the network I/O cost of
transmitting the new thread results.
If you have an insight on how to express alterations to a thread tree that
does not entail greater complexity that is saved, lots of people would be
delighted to hear it (myself included).
-- Mark --
http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.
On Fri, 2006-03-31 at 10:08 -0800, Mark Crispin wrote:
> On Fri, 31 Mar 2006, Timo Sirainen wrote:
> > The "sort threads by their first message's date" behavior is bad because
> > new messages can't be easily noticed in the mailbox.
>
> My MUA has a way to take me to new messages regardless of how they are
> ordered in the mailbox. That might be a useful feature for your MUA.
I like being able to see all the new mail at once without having to jump
around.
> > So, there should be
> > some way to specify "sort thread by their latest message's received
> > date" (and date-header?).
> > There has been talk about this before, but I guess no-one's done
> > anything about this? How about a new extension, eg. THREADEXT which
> > allows specifying the thread sorting in some way
>
> The way to define a new threading algorithm is...to define a new threading
> algorithm! There is no need for THREADEXT that adds suboptions to an
> algorithm when you just just define a new THREAD=xxx.
Well, that's fine with me too.
> > Also another separate thing related to threads: Has anyone tried
> > optimizing the THREAD command in a way that after the initial threading,
> > the results would be stored somewhere and later when a new mail arrives,
> > the new thread list could be generated without having to rebuild
> > everything?
>
> People have tried to work out how to do insertion in THREAD results. It
> is not a simple task. A single message can completely change the thread
> results. It's not just one link tying together two threads; a message can
> have references that tie together an unbounded number of links.
Aren't most new messages simple insertions? Either a reply to an
existing message, or a new message with no message-id references. Those
cases could be optimized and the other cases could then fallback to
slower way.
> If you have an insight on how to express alterations to a thread tree that
> does not entail greater complexity that is saved, lots of people would be
> delighted to hear it (myself included).
If most changes are simple insertions, then I think it should be pretty
easy for those cases. And in case something else happens to the threads,
it would return everything again.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20060401/bd20f284/attachment.sig>
I wrote this sort of code six years ago, for an offline reader. I
remember mrc telling me it was difficult, and then when I did it it was
not much more than 400 lines of code.
First, you need some threading code. The algorithm I used originally
came from trn, I think, although I pinched it from a friend of mine.
Once you have your threading algorithm (maybe half of my code back
then), you've two options for persistence:
1. Store everything. Use something like the THREAD=REFERENCES algorithm,
but leaving out steps 3 and 5, and make the main data structure
persistent. This is what I did in 2000.
2. Just store thread membership, not position. This is simpler. Make
your persistent data structure store a thread ID (e.g. an integer) and
the messages it has seen belonging to that thread. It should be able to
find the thread ID for a message-id quickly, and find all messages for
a thread ID. Updating this structure at injection time is easy. At read
time, you have to determine the detailed relationships within the
thread.
Arnt
On Sat, 1 Apr 2006, Timo Sirainen wrote:
> I like being able to see all the new mail at once without having to jump
> around.
In that case, you don't want a threaded view. You want a view that is in
message sequence (or UID) order.
> Aren't most new messages simple insertions? Either a reply to an
> existing message, or a new message with no message-id references.
I don't think that's a good assumption, but I haven't done a thorough
analysis.
> If most changes are simple insertions, then I think it should be pretty
> easy for those cases.
Not as easy as it sounds. The server must rethread the entire tree; the
message could have been received out of order and thus has pre-existing
messages that refer to it. Then the server must compare the new tree with
the old tree. Only then will it know if it is "just" a simple insertion.
Then, a means has to be defined by which the insertion is represented. Do
you do something like message part section numbers? Or do you send a
piece of tree.
Then it has to be defined how the client does this. And all of this has
to consider the various ways in which a tree can branch and where the
insertion takes place.
It's certainly possible, but it's not easy.
> And in case something else happens to the threads,
> it would return everything again.
Communication links are getting faster all the time. Against the ever
dwindling savings, you have to weigh the costs involved in getting a diff
mechanism defined and debugged (this includes getting multiple
implementations to interoperate correctly). We have a difficult time
getting multiple implementations of much simpler protocols to
interoperate.
Remember, just because we are technically competant enough to get such a
mechanism working does not mean that everybody who implements the protocol
will have the same competance. Past experience has made me quite
pessimistic.
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
On Fri, 2006-03-31 at 14:37 -0800, Mark Crispin wrote:
> On Sat, 1 Apr 2006, Timo Sirainen wrote:
> > I like being able to see all the new mail at once without having to jump
> > around.
>
> In that case, you don't want a threaded view. You want a view that is in
> message sequence (or UID) order.
Nope, it would still be nice to see the threads. I like pretty much how
OSX's Mail does it, except it uses ordererdsubject-like threading.
> > Aren't most new messages simple insertions? Either a reply to an
> > existing message, or a new message with no message-id references.
>
> I don't think that's a good assumption, but I haven't done a thorough
> analysis.
Here's an ugly analyzer script, I think it works correctly:
count=`printf "x select inbox\n"|./imap|grep EXISTS|sed 's/^\* \([0-9]*\) .*/\1/'`
echo "$count messages, threading.."
perl -e 'printf("x delete test\nx create test\n"); for ($i = 1; $i <= $ARGV[0]; $i++) { printf("x select inbox\nx copy $i test\nx select test\nx thread references us-ascii all\n"); }' $count|./imap > log
grep '^\* THREAD ' log|sed 's/^\* THREAD //'|tr -d '\r' > log2
echo "Out of $count new messages, there were this many non-insertions:"
perl -e '$prev = " "; $num = 1; while (<>) { s/[() \n]+/ /g; $new = $_; $new =~ s/$num //; if ($new ne $prev) { print "$new vs $prev\n"; } $prev = $_; $num++; }' < log2|wc -l
With my 787 message INBOX it found one non-insertion and with a 2000
mail mailing list it found 3 non-insertions.
> > If most changes are simple insertions, then I think it should be pretty
> > easy for those cases.
>
> Not as easy as it sounds. The server must rethread the entire tree; the
> message could have been received out of order and thus has pre-existing
> messages that refer to it. Then the server must compare the new tree with
> the old tree. Only then will it know if it is "just" a simple insertion.
Right. Although I think it's possible to do some optimizations in there
also, eg. if message has no message-id, in-reply-to or references
headers, and there's no existing message with same base subject, it can
always be inserted as a root node.
> Then, a means has to be defined by which the insertion is represented. Do
> you do something like message part section numbers? Or do you send a
> piece of tree.
>
> Then it has to be defined how the client does this. And all of this has
> to consider the various ways in which a tree can branch and where the
> insertion takes place.
I think two types of insertions would be enough: "insert as sibling to
message n" and "insert as the first child to message n". Root would
probably be message number 0. So an example reply would be:
* THREAD + 5=S3 6=C5
Meaning message 5 was added as a sibling to 3, and message 6 was added
as 5's first child.
I think that should be easy enough for the clients to handle?
> > And in case something else happens to the threads,
> > it would return everything again.
>
> Communication links are getting faster all the time. Against the ever
> dwindling savings, you have to weigh the costs involved in getting a diff
> mechanism defined and debugged (this includes getting multiple
> implementations to interoperate correctly). We have a difficult time
> getting multiple implementations of much simpler protocols to
> interoperate.
>
> Remember, just because we are technically competant enough to get such a
> mechanism working does not mean that everybody who implements the protocol
> will have the same competance. Past experience has made me quite
> pessimistic.
Well, I'm still mostly just interested in optimizing this internally in
my server so that the reply gets sent near instantly, but these are
somewhat related goals in that if I can figure out rules how to quickly
update the thread list internally, I can easily send out incremental
updates to clients as well.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <http://mailman13.u.washington.edu/pipermail/imap-protocol/attachments/20060401/5dae1623/attachment.sig>
Timo Sirainen writes:
> With my 787 message INBOX it found one non-insertion and with a 2000
> mail mailing list it found 3 non-insertions.
Let me guess: Mailing list threads where some thread participants are
cc'd, and mail servers delivering several messages in a thread
practically at the same time when a network connection comes up after
downtime.
Arnt
On Sat, 1 Apr 2006, Timo Sirainen wrote:
> I think two types of insertions would be enough: "insert as sibling to
> message n" and "insert as the first child to message n".
You need to define the semantics when UID THREAD is used. Also, you need
to define what happens if the new message has children (insertion as
opposed to append).
> I think that should be easy enough for the clients to handle?
You'd be surprised at what people find to be "not easy".
> Well, I'm still mostly just interested in optimizing this internally in
> my server so that the reply gets sent near instantly, but these are
> somewhat related goals in that if I can figure out rules how to quickly
> update the thread list internally, I can easily send out incremental
> updates to clients as well.
I don't understand. A thread should be nearly instantaneous, even for a
very large mailbox, if you use a hash table. It's probably also a good
idea to cache the data needed to thread so it's available the next time
you thread.
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Timo,
I have a patch for Pine that implements this idea for Pine by modifying
the client, not the server. What it done is to sort by Arrival and by
thread, then take the thread containing the last message in sort by
arrival and put that thread in the bottom of the list, then remove both
threads from both lists (threaded and by arrival) and repeat until we are
done. I call this sort the thread by arrival.
Replacing arrival sort by subject sort is also very useful, but it does
not seem to be very useful for other sort methods (e.g. From, or size).
You can find the patch in my web page (see below). Look for "Enhanced
Fancy Thread Interface".
--
Eduardo
http://www.math.washington.edu/~chappa/pine/
On Sat Apr 1 00:30:03 2006, Mark Crispin wrote:
> On Sat, 1 Apr 2006, Timo Sirainen wrote:
>> I think two types of insertions would be enough: "insert as
>> sibling to
>> message n" and "insert as the first child to message n".
>
> You need to define the semantics when UID THREAD is used. Also,
> you need to define what happens if the new message has children
> (insertion as opposed to append).
>
>
I vaguely suspect that rather than single identifiers (sequence
numbers or UIDs), you want tree fragments. That handles both "gap
plugging", where a message arrives out of order (and I see a few of
those on some mailing lists, especially where there's a mixture of
replying direct to me and only replying to the list going on), and
actual "grafting", where two previously unrelated trees are joined.
It also means you can unify most insertions down to a tree fragment
by the algorithm of finding the insertion point, and sending the tree
fragment starting at the parent, and ending at the first identified
descendent node. I'd imagine most server implementations would give
up and resend the whole tree in any complex cases.
>> I think that should be easy enough for the clients to handle?
>
> You'd be surprised at what people find to be "not easy".
>
>
True, but there's precious few clients which implement much more than
LITERAL+, CHILDREN, and perhaps one or two more. Better, these
messages are at least clear in their intent by a reasonable
application of common sense. ACL is fairly common, but I could only
find one client-side implementation of LISTRIGHTS.
>> Well, I'm still mostly just interested in optimizing this
>> internally in
>> my server so that the reply gets sent near instantly, but these are
>> somewhat related goals in that if I can figure out rules how to
>> quickly
>> update the thread list internally, I can easily send out
>> incremental
>> updates to clients as well.
>
> I don't understand. A thread should be nearly instantaneous, even
> for a very large mailbox, if you use a hash table. It's probably
> also a good idea to cache the data needed to thread so it's
> available the next time you thread.
JWZ's description of speeds on now fairly ancient hardware would
suggest that. In any case, optimizing to the level Mark suggests
would be a clearly useful first step.
Dave.
--
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw