


From: Charles Kerr
Subject: [Pan-devel] RFC: Detecting multiparts (was: .94 weirdness with detecting attachments)
Date: Fri, 8 Aug 2003 10:49:09 -0700
User-agent: Mutt/1.5.4i

>>> Charles... I'm seeing the same behavior with non-binary groups. i.e.
>>> text-only documents published in multiple posts and labeled as such. For
>>> example, "some subject here 1/5" and "some subject here 2/5" and "some
>>> subject here 3/5" etc. will show up as "broken parts" because Pan is
>>> trying to "decode" non-binary multipart posts and is freaking out because
>>> there is no way to decode these....

>> I don't think this is the same bug: Chris is reporting that Pan is
>> incorrectly treating multiparts as non-multiparts; you're reporting that
>> Pan is incorrectly treating non-multiparts as multiparts. :)

> Oh, Ok... So, it's the "reverse" of Chris' bug :-)

Clearly Pan is letting both false positives and false negatives through
when looking for multiparts.  Maybe we should revise the multipart
detection code.

Here's a rough draft for a better detection scheme.  I'm posting it here
so that people can refine it and/or shoot holes in it.

background
----------

 * there are no standard headers, other than the Subject: header,
   to link multiparts together or even to denote binary attachments.

 * we can't thread properly until we've guessed the multipart state,
   so looking to other articles for context is problematic.

 * the best tools we have are the Subject: header, the group name,
   and the number of lines in the article.

tools
-----

 * likely_binary_group is true if the newsgroup name contains
   any of: "binaries", "fan", "mag", "sex", false otherwise

 * likely_binary_subject is true if the Subject: header contains
   any of: "jpeg" "jpg" "gif" "tiff" "png", false otherwise

 * part = x if either "(x/y)" or "[x/y]" is in Subject:, 0 otherwise.
   (Work backwards from the end of the string, in case someone's
   posting a set of multiparts and (x/y) appears in the Subject: twice)

 * parts = y if either "(x/y)" or "[x/y]" is in Subject:, 0 otherwise.
   (Work backwards here too)

 * lines = number of lines in article

 * is_reply = true if Subject: begins with "Re:", false otherwise

 * is_binary: true or false.  This is what we're trying to guess.
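
As a rough illustration of the tools above, here's a minimal Python sketch.
It's only a sketch, not Pan's code: the function names are made up for this
example, and the case-insensitive matching is my assumption, since the list
above doesn't say whether case matters.

```python
import re

# Keyword lists taken from the "tools" list above.
BINARY_GROUP_HINTS = ("binaries", "fan", "mag", "sex")
BINARY_SUBJECT_HINTS = ("jpeg", "jpg", "gif", "tiff", "png")

# Matches "(x/y)" or "[x/y]" where x and y are decimal numbers.
PART_MARKER = re.compile(r"[(\[](\d+)/(\d+)[)\]]")


def likely_binary_group(group_name):
    # Assumption: compare case-insensitively.
    name = group_name.lower()
    return any(hint in name for hint in BINARY_GROUP_HINTS)


def likely_binary_subject(subject):
    # Assumption: compare case-insensitively.
    text = subject.lower()
    return any(hint in text for hint in BINARY_SUBJECT_HINTS)


def parse_part_info(subject):
    """Return (part, parts), or (0, 0) if no marker is present.

    Uses the LAST marker in the subject, which has the same effect as
    working backwards from the end of the string.
    """
    matches = list(PART_MARKER.finditer(subject))
    if not matches:
        return 0, 0
    last = matches[-1]
    return int(last.group(1)), int(last.group(2))


def is_reply(subject):
    # Assumption: tolerate leading whitespace and "RE:" / "re:".
    return subject.lstrip().lower().startswith("re:")
```

For a subject like "holiday pics (1/3) - beach.jpg (2/5)", parse_part_info
picks the trailing marker, so part is 2 and parts is 5.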

guessing
--------

  1. start with is_binary = false.

  2. if part > 0,
     and parts > 0,
     and parts >= part,
     set is_binary to true.

  3. if is_binary is true,
     and we're not in a likely binary group,
     and we don't have a likely binary subject,
     and parts > 1,
     then it's probably a set of text posts like John mentioned above.
     set is_binary to false.

  4. if is_binary is false,
     and we're in a likely binary group,
     and either lines > 500 or we've got a likely binary subject,
     and both part and parts are 0,
     then it's likely a single-part binary where the user omitted the "(1/1)".
     set is_binary to true.

  5. if is_binary is true,
     and is_reply is true,
     and the part is 0 or 1,
     then it's probably a follow-up to a multipart (I've never seen a
     followup to a part > 1).
     set is_binary to false.
     UNLESS: once in a blue moon people will post binaries as follow-ups,
     so hedge our bets: leave is_binary as true if lines > 500.

  6. if is_binary is true,
     and the subject contains any of: "Frequently Asked Questions", "FAQ",
     "Weekly", "Monthly",
     then it's a FAQ or periodic posting being posted in pieces.
     set is_binary to false.
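
To make the order of the rules easier to shoot holes in, here's the guessing
procedure as a small Python sketch.  Again, this is only an illustration and
not Pan's implementation: ArticleInfo and guess_is_binary are hypothetical
names, the inputs are assumed to have been computed already by the tools
described earlier, and the FAQ/periodic check is assumed to be a plain
case-sensitive substring match.

```python
from dataclasses import dataclass

BIG_ARTICLE_LINES = 500  # the "lines > 500" threshold from the draft
PERIODIC_MARKERS = ("Frequently Asked Questions", "FAQ", "Weekly", "Monthly")


@dataclass
class ArticleInfo:
    """Hypothetical container for the per-article inputs."""
    subject: str
    part: int
    parts: int
    lines: int
    is_reply: bool
    likely_binary_group: bool
    likely_binary_subject: bool


def guess_is_binary(a):
    # Start out assuming the article is not a binary.
    is_binary = False

    # A sane (x/y) marker suggests a multipart binary.
    if a.part > 0 and a.parts > 0 and a.parts >= a.part:
        is_binary = True

    # Numbered posts with no other binary hints: probably text in pieces.
    if (is_binary and not a.likely_binary_group
            and not a.likely_binary_subject and a.parts > 1):
        is_binary = False

    # No marker, but a binary group plus a big article or a binary-looking
    # subject: probably a single-part binary missing its "(1/1)".
    if (not is_binary and a.likely_binary_group
            and (a.lines > BIG_ARTICLE_LINES or a.likely_binary_subject)
            and a.part == 0 and a.parts == 0):
        is_binary = True

    # Replies to part 0 or 1 are probably follow-ups, unless they are big
    # enough to be a binary posted as a follow-up.
    if (is_binary and a.is_reply and a.part in (0, 1)
            and a.lines <= BIG_ARTICLE_LINES):
        is_binary = False

    # FAQs and periodic postings are text, even when posted in pieces.
    if is_binary and any(m in a.subject for m in PERIODIC_MARKERS):
        is_binary = False

    return is_binary
```

Note that each rule only sees the result of the rules before it, so the
order matters: the follow-up and FAQ checks can veto a guess made by either
of the rules that set is_binary to true.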

-- 
cheers,
Charles




