On Monday, October 31, 2011 3:40 PM, "Brandon Long" <blong@google.com> wrote:
> On Sun, Oct 23, 2011 at 1:04 AM, Bron Gondwana <brong@fastmail.fm> wrote:
> > On Sat, Oct 22, 2011 at 01:56:55PM -0700, Brandon Long wrote:
> >> I think we have only one deliberate break from the spec, and that's that we
> >> don't do substring searches.
> >
> > I believe your envelopes are bogus too, and we had another issue which I
> > should see if I can dig up...
>
> Not bogus, but yes, there is a specific combination where we get it
> wrong. Thought it had been fixed, but I see the bug is still open.
I've seen it in the wild, so yeah - still there I think.
> > Oh, yeah:
> >
> > ======================
> > The bug is with gmail.
> >
> > http://tools.ietf.org/html/rfc3501#section-4.3.1
> >
> >   8-bit textual and binary mail is supported through the use of a
> >   [MIME-IMB] content transfer encoding.  IMAP4rev1 implementations MAY
> >   transmit 8-bit or multi-octet characters in literals, but SHOULD do
> >   so only when the [CHARSET] is identified.
> >
> > Basically, they can transfer that as utf-8, but only if they send it as
> > a literal.  Just quoting the string is bogus.
> > ======================
>
> I can believe this bug exists. It would be nice if we didn't have to
> do that (is there an extension for that?), but if you can give me
> specifics, I can file the bug.
I doubt there's an extension, and as Mark pointed out - you need to fall
back for other clients anyway.
> We don't live in a 7bit world, and haven't for a very long time.
> Having to scan outbound text for whether the 8bit is set will just be
> more lovely CPU down the drain.
Not really - you're already scanning to see if it's particularly long.
We shove everything more than 1024 bytes into a literal in Cyrus,
without even parsing it. And you need to parse the whole thing for
" characters anyway unless you're going to be REALLY bogus about it,
so your CPU argument doesn't hold any water. Easy to just set the
"must_send_literal" flag when you see the first 8 bit character, at
which point you can stop scanning and just shove the bytes onto the
wire with a {size}\r\n in front.
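A minimal sketch of that scan, in Python rather than the C that Cyrus
actually uses. The function name and the max_quoted parameter are made up
for illustration; only the 1024 byte threshold and the "must_send_literal"
flag come from the text above. This is not the Cyrus code, just the shape
of the idea:

```python
def encode_imap_string(data: bytes, max_quoted: int = 1024) -> bytes:
    """Encode bytes as an IMAP quoted string when safe, else as a literal."""
    # Long values go straight into a literal without being scanned at all.
    must_send_literal = len(data) > max_quoted
    escaped = bytearray()
    if not must_send_literal:
        for byte in data:
            # 8-bit bytes, NUL, CR and LF cannot appear in a quoted string.
            if byte >= 0x80 or byte in (0x00, 0x0D, 0x0A):
                must_send_literal = True
                break  # stop scanning, the raw bytes go out as a literal
            if byte in (0x22, 0x5C):  # '"' and '\' need a backslash escape
                escaped.append(0x5C)
            escaped.append(byte)
    if must_send_literal:
        # A literal is {size} CRLF followed by exactly that many raw bytes.
        return b"{%d}\r\n" % len(data) + data
    return b'"' + bytes(escaped) + b'"'
```

The pass that looks for " characters is the same pass that spots the first
8 bit byte, so the extra check costs one comparison per byte.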
> > You send 8bit stuff inside quotes.
>
> And probably in error messages, too. Not even sure what to do about that.
Ahh, error messages. No, I don't know what to do about that either - but
I'm not so concerned so long as you don't eject endline characters in them.
rfc3501 says:
text = 1*TEXT-CHAR
TEXT-CHAR = <any CHAR except CR and LF>
It also said "any char is 7 bit US-ASCII unless otherwise specified", but
you're less likely to get into hot water here, because most things are just
scanning to the \r\n.
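As a rough illustration of the two levels of "wrong" here, a sketch that
classifies a response text. The function name and the three labels are my
own invention, and it assumes CHAR means bytes 0x01-0x7F as in the rfc3501
grammar:

```python
def classify_resp_text(text: bytes) -> str:
    """Classify human-readable response text against the TEXT-CHAR rule."""
    # Empty text, or an embedded CR or LF, breaks clients that scan to \r\n.
    if not text or b"\r" in text or b"\n" in text:
        return "broken"
    # NUL or 8-bit bytes are outside CHAR, but line scanning still works.
    if any(byte == 0x00 or byte >= 0x80 for byte in text):
        return "non-conformant"
    return "ok"
```

Only the "broken" case is the one I'd really worry about.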
> >> I know there are other issues, like we sometimes don't get the size quite
> >> right (comes of storing the message with just LF line endings and having to
> >> convert to CRLF), but mostly we tend to bend the spec but not break them...
> >> matching the IMAP mailbox model with the gmail one is tough.
> >
> > Yeah, that's a bit messy.  Any particular reason, other than saving a
> > few bytes?
>
> I'm sure at the beginning it was "why do we need these anyways, it'll
> save a few bytes". Unfortunately, once N became big enough, adding
> 1/65th to N is also very big. Though, with compression, probably not
> appreciably so.
Yeah, that's a messy path to go down. I really do like saving the
exact RFC822 blob, or alternatively enough information that you can
reconstruct it precisely. I can see benefits in decoding base64 and
QP rubbish into the end result and storing that instead, but I'd want
to keep a reverse transcoder ID and binary diff (if required) to bring
back the original bytes if needed - and of course a checksum to make
sure they really were the original bytes.
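Something like the following sketch, for base64 only. Everything named
here (store_part, restore_part, the "b64-76-crlf" transcoder ID, the record
layout) is hypothetical, and instead of a binary diff it simply falls back
to keeping the original bytes whenever the canonical re-encoding does not
reproduce them exactly:

```python
import base64
import hashlib

TRANSCODER_ID = "b64-76-crlf"  # hypothetical ID: 76-char lines, CRLF endings


def _reencode(decoded: bytes) -> bytes:
    # base64.encodebytes emits LF-terminated 76-char lines; mail uses CRLF.
    return base64.encodebytes(decoded).replace(b"\n", b"\r\n")


def store_part(original: bytes) -> dict:
    """Keep the decoded payload only if it can be re-encoded byte for byte."""
    checksum = hashlib.sha256(original).hexdigest()
    try:
        decoded = base64.b64decode(original)
    except ValueError:  # binascii.Error is a ValueError subclass
        decoded = None
    if decoded is not None and _reencode(decoded) == original:
        return {"transcoder": TRANSCODER_ID, "payload": decoded, "sha256": checksum}
    # Not reproducible with this transcoder: keep the original bytes as-is.
    return {"transcoder": "raw", "payload": original, "sha256": checksum}


def restore_part(record: dict) -> bytes:
    """Rebuild the original bytes and verify them against the checksum."""
    if record["transcoder"] == TRANSCODER_ID:
        original = _reencode(record["payload"])
    else:
        original = record["payload"]
    if hashlib.sha256(original).hexdigest() != record["sha256"]:
        raise ValueError("restored bytes do not match the stored checksum")
    return original
```

A real version would want more transcoders (different line lengths, QP)
and the diff for near misses, but the checksum on the way back out is the
important part.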
> The more amazing thing is how far along we went in coding before we
> actually had issues with it. Almost no clients actually cared that we
> only sent LF, except Outlook would eat characters when fixing
> soft-line breaks for quoted-printable messages since it just blindly
> assumed CRLF.
Cyrus rejects messages with a bare \n I think.
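For what it's worth, getting the size right from an LF-only store is just a
count, no conversion needed. A sketch with made-up function names, which
also shows where a figure like 1/65th comes from if you assume lines of 64
characters plus the LF:

```python
def crlf_size(stored: bytes) -> int:
    """Size of the message once every bare LF is expanded to CRLF."""
    bare_lf = stored.count(b"\n") - stored.count(b"\r\n")
    return len(stored) + bare_lf


def to_crlf(stored: bytes) -> bytes:
    """Expand bare LF to CRLF, leaving existing CRLF pairs alone."""
    return stored.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Assumed average line: 64 characters plus LF, so 65 stored bytes per line.
stored = (b"x" * 64 + b"\n") * 100
overhead = crlf_size(stored) - len(stored)  # one extra CR per line
```

With those assumed line lengths the overhead works out to exactly one byte
in every 65 stored.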
> >> I tried running Timo's tests against gmail, but it's pretty hard to debug
> >> failures, and the set of tests isn't quite wide or separable.  It's somewhere
> >> on the todo list to work that again.
> >
> > Improving the tests would be nice too.  We have our own thing, cassandane,
> > which does a whole test harness for running multiple Cyrus instances and
> > testing things like replication and our "murder" clustering thing too.
> >
> >> I would say that our biggest issue with IMAP is that the amount of resources
> >> required for some of the commands is really high, which is hard to optimize
> >> for or to fairly share resources.
> >
> > body search is the real killer for us, everything else is pretty lightweight.
>
> Clients can ask for any random piece of information about every
> message in the store, and our stores can get very large. One of the
> syncing tools (offlineimap, iirc) randomly also asked for INTERNALDATE
> on every sync request (ie, UID FLAGS) even though it 1) never changes
> (by spec) and 2) they didn't actually use it. Doing that on a 1M
> message store every 5-10m when that data isn't in the "small" data,
> but required us to fetch the full meta data for every message...
> expensive. The smarter model is the adaptive meta-data model, but
> re-generating and re-storing the metadata for 1M messages also isn't
> cheap.
Yeah, bloody offlineimap. I'm glad that appears to be fixed in more
recent versions. Of course, Cyrus keeps the INTERNALDATE in the index
file, so it's trivially already present in the struct.
Of course offlineimap is probably a vanishingly small part of your
userbase, and also quite patchable - I suspect if you'd "fixed" it,
the patch would have been accepted.
> The real peach though is probably MULTIAPPEND which wants us to allow
> a virtually infinite amount of data upload in a single transaction.
> Even if we figured out a way to handle that given our constraints, we
> wouldn't want to do it since the worst possible outcome there is to
> throw the entire thing away on failure. Just spend a couple hours
> uploading it again!
Which is why I'd love to be able to upload the messages to a staging
location and then tag them with a:
(mailbox, uid, modseq, internaldate, flags, annotations)
set in a later small upload. But that's where having a custom replication
protocol wins over generic IMAP! It also means you can apply COPY and
UPLOAD intermingled, by calling RESERVE on the messages you're about
to copy.
Bron.
--
Bron Gondwana
brong@fastmail.fm