[Ifile-discuss] Re: html tag stripping
From: clemens fischer
Subject: [Ifile-discuss] Re: html tag stripping
Date: 26 Jun 2003 10:12:12 +0200
User-agent: Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3 (berkeley-unix)
* David Bushong:
> Well, even if people are filtering non-email through it, it doesn't
> handle tagged input gracefully. An option to do a simple, naïve
> tag-strip seems like a win to me.
have you thought about piping emails through "sed -E 's/<.+>//g'" to
check if a naive approach suffices? i just tried it: it fails when a
tag is opened on one line and closed on another. also, sed(1)
unfortunately doesn't have non-greedy versions of RE closures, so on
a line having `<' somewhere and `>' later on, everything in between
gets stripped regardless of the balancing of tags: this loses
perfectly readable text.
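to make the two failure modes concrete, here is a minimal sketch in python (not part of ifile, and not a real HTML parser). the function names and the sample string are made up for illustration. it compares the greedy per-line pattern from the sed command above with a negated character class, `<[^>]*>`, which stops at the first `>' and may span line breaks when applied to the whole message at once. the same character class also works in sed(1) for tags that open and close on one line.

```python
import re

# Greedy pattern, equivalent to sed -E 's/<.+>//g' applied line by line.
GREEDY = re.compile(r"<.+>")

# Negated character class: match from '<' up to the FIRST '>'.
# [^>] also matches newlines, so a tag split across lines is covered
# as long as the whole text is searched at once.
NAIVE_TAG = re.compile(r"<[^>]*>")


def strip_greedy_per_line(text):
    """Mimic the sed one-liner: greedy match, one line at a time."""
    return "\n".join(GREEDY.sub("", line) for line in text.split("\n"))


def strip_tags(text):
    """Naive tag strip over the whole text (hypothetical helper)."""
    return NAIVE_TAG.sub("", text)


# Illustrative input: nested tags on line 1, a tag split over lines 2-3.
sample = '<p>keep <b>this</b> text</p>\n<a\nhref="x">link</a>'

if __name__ == "__main__":
    # The greedy version wipes out line 1 entirely and leaves the
    # split tag's debris behind.
    print(repr(strip_greedy_per_line(sample)))
    # The character-class version keeps the readable text.
    print(repr(strip_tags(sample)))
```

this is still naive: a `>' inside an attribute value, comments, and script or style bodies are not handled, so it only answers the question of whether a simple strip is good enough for classification.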
you could try with another simple tool: sgrep(1) "Structured Grep":
http://www.cs.helsinki.fi/~jjaakkol/sgrep.html
ftp://ftp.cs.helsinki.fi/pub/Software/Local
Sgrep was created by Jani Jaakkola (address@hidden) and
Pekka Kilpeläinen (address@hidden).
it is meant to find balanced, SGML-like markup, and you can customize
the output format. it has HTML examples included. if you get it
working with sgrep(1), please drop a few lines to this list.
clemens