Re: Recognizing repeats in RSS feeds

info-gnus-english

From:	Desmond Rivet
Subject:	Re: Recognizing repeats in RSS feeds
Date:	Tue, 20 Jan 2009 20:22:16 -0500
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

Ted Zlatanov <tzz@lifelogs.com> writes:

> On Fri, 16 Jan 2009 13:12:37 -0500 Desmond Rivet
> <desmond_news@videotron.ca> wrote:
>
> DR> In addition to reading news and email, I use Gnus to keep track of
> DR> various RSS feeds.
>
> DR> For some of these feeds, certain articles will, over time, show up
> DR> repeatedly in my summary list.  I'm not sure why, but I assume it has
> DR> something to do with updates to the article itself.  Or maybe it happens
> DR> when someone posts a new comment on the article.  I don't know.
> ...
> DR> Is there any way to score a repeated (updated) article down, so that
> DR> they wouldn't show up in my group unless I asked?  I have no idea where
> DR> to even start with this; a simple push in the right direction would be
> DR> appreciated.
>
> You want to ignore updates which only affect irrelevant fields.  Here's
> how I do it:
>
> (setq nnrss-ignore-article-fields
>       '(description slash:comments slash:hit_parade))
>
> This works for me to eliminate duplicates completely; "description"
> changes very frequently on some sites for instance.  nnrss finds unique
> articles by taking all their fields that are not ignored and hashing the
> content.
>
> To find out exactly what's happening, set gnus-verbose to 10 and refresh
> a nnrss group.  You have to have a recent CVS Gnus to use this.  I added
> it fairly recently.  In *Messages* you'll see a full dump of the RSS
> segment that describes each article, and from that you can easily figure
> out what's causing duplicates.
>
> For example, here's one entry from the Dilbert Blog:
>
> nnrss: Making hash index of (item nil "
> " (title nil "From Blog to Reality: Three Interesting Things") "
> " (link nil 
> "http://dilbert.com/blog/entry/from_blog_to_reality_three_things/") "
> " (description nil "...cut because it's too much text...") "
> " (pubDate nil "Fri, 16 Jan 2009 01:00:01 PST") "
> " (guid ((isPermaLink . "false")) "http://dilbert.com/blog/entry/203/") "
> ")
>
> So the fields here are guid, pubDate, title, link, and description.
>
> If you need more help, tell us what feeds specifically are causing the
> problem and I can take a look.

Thanks for the reply. However, I'm somewhat confused (not by your
directions, but rather by what I'm seeing).

So, I've started examining my RSS feeds. I'll use Slashdot as an example
since a lot of people read it.

What I did was the following:

1. made a backup of the directory that stores my downloaded rss feeds.

2. waited until my Slashdot group was updated and I got a repeated item.

3. compared a selected item from the saved backup Slashdot rss file to a
selected item from the current Slashdot rss file.  If I understand how
this works, there should be some sort of textual difference between the
old item and the new, yes?

(this is all very low tech, bear with me)

So, I picked an item at random from the current rss file, pasted the xml
fragment into a buffer, did the same with the saved rss file, and did a
diff.  I get the following:

11,12c11,12
< <slash:comments>770</slash:comments>
< <slash:hit_parade>770,762,595,490,138,86,71</slash:hit_parade>
---
> <slash:comments>757</slash:comments>
> <slash:hit_parade>757,749,587,482,133,83,69</slash:hit_parade>

So far, so good.  This tells me that the slash:comments and
slash:hit_parade fields are the culprits, right? So I do this in my
.gnus.el:

(setq nnrss-ignore-article-fields '(slash:comments slash:hit_parade))

And restart Emacs.

However, I *still* get spurious updates of the same article in Slashdot.
So I take your advice and do this:

(setq gnus-verbose 10)

And hit M-g in Slashdot.  Picking another article at random, I see this:

nnrss: Making hash index of (item ((rdf:about . 
"http://it.slashdot.org/article.pl?sid=09/01/20/1930252&from=rss")) "
" (title nil "Largest Data Breach Disclosed During Inauguration") "
" (link nil 
"http://rss.slashdot.org/~r/Slashdot/slashdot/~3/iHBmFGKE504/article.pl") "
" (description nil "rmogull writes \"Brian Krebs over at <snip>") "
" (dc:creator nil "kdawson") "
" (dc:date nil "2009-01-20T19:44:00+00:00") "
" (dc:subject nil "security") "
" (slash:department nil "debit-cards-at-risk") "
" (slash:section nil "it") "
" (slash:comments nil "121") "
" (slash:hit_parade nil "121,117,99,80,24,16,13") "
" (feedburner:origLink nil 
"http://it.slashdot.org/article.pl?sid=09%2F01%2F20%2F1930252&from=rss"))

Note the presence of slash:comments and slash:hit_parade.  Am I to
understand that the slash:comments and slash:hit_parade fields are still
contributing to the hash?
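
To make the question concrete, here is a minimal standalone sketch of the
scheme as Ted describes it: drop the ignored fields from the parsed item,
then hash what is left.  This is not the actual nnrss code; every my-rss-*
name is hypothetical, the item shape is assumed from the dump above, and
the title is a placeholder.  The comment counts are the ones from my diff.

```elisp
;; Sketch only, NOT the nnrss implementation.  All my-rss-* names are
;; hypothetical and exist purely for illustration.
(require 'cl-lib)

(defvar my-rss-ignored-fields '(slash:comments slash:hit_parade)
  "Field names (symbols) to leave out of the hash.")

(defun my-rss-strip-fields (item ignored)
  "Return a copy of parsed ITEM without child nodes named in IGNORED.
ITEM is assumed to look like (item ATTRIBUTES CHILD...)."
  (cl-remove-if (lambda (child)
                  (and (consp child) (memq (car child) ignored)))
                item))

(defun my-rss-item-hash (item ignored)
  "Return the MD5 of the printed form of ITEM minus the IGNORED fields."
  (md5 (prin1-to-string (my-rss-strip-fields item ignored))))

;; Two versions of one item that differ only in the Slashdot counters.
(let ((old '(item nil
                  (title nil "Placeholder title")
                  (slash:comments nil "757")
                  (slash:hit_parade nil "757,749,587,482,133,83,69")))
      (new '(item nil
                  (title nil "Placeholder title")
                  (slash:comments nil "770")
                  (slash:hit_parade nil "770,762,595,490,138,86,71"))))
  (list
   ;; Nothing ignored: the hashes differ, so the item looks new.
   (string= (my-rss-item-hash old nil)
            (my-rss-item-hash new nil))
   ;; Counters ignored: the hashes match, so the item is a repeat.
   (string= (my-rss-item-hash old my-rss-ignored-fields)
            (my-rss-item-hash new my-rss-ignored-fields))))
;; evaluates to (nil t)
```

If nnrss behaves like this sketch, the two counters should not be able to
change the hash once they are ignored.  The sketch cannot tell whether the
logged item is printed before or after the ignored fields are removed.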

I should mention I'm using GNU Emacs 23.0.60.1.

Thanks in advance for any insight!

-- 
Desmond Rivet

Pain is weakness leaving the body.
