[Orgmode] org-feed XML entities and character encoding


From: Michael Brand
Subject: [Orgmode] org-feed XML entities and character encoding
Date: Tue, 10 Aug 2010 21:59:26 +0200
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4

Hi all,

org-feed is becoming very useful to me, so far for managing podcast
episodes. I now have a patch and a request for help.

1. patch for an issue with XML entities
=======================================

I found that some XML entities in my feeds are not substituted. The
comments of two recent org-feed.el commits by David Maus
http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
and
http://repo.or.cz/w/org-mode.git/commitdiff/6875716e76acfbe1084a47e59d18a30a933d92b6
led me to the thread
http://thread.gmane.org/gmane.emacs.orgmode/26352
and prompted me to replace org-feed-unescape with xml-substitute-special,
which converts more XML entities. The resulting patch below works for
me, but of course I would like it to be reviewed by an experienced elisp
programmer and org-feed user before it is applied.
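
To illustrate the difference, here is a small sketch that is not part of
the patch. It calls xml-substitute-special from xml.el on a made-up title
that mixes a numeric character reference with a named entity. The variable
name and the sample string are hypothetical. The removed org-feed-unescape
only matches the entity names listed in xml-entity-alist, so it would
leave a numeric reference such as &#233; untouched.

```elisp
(require 'xml)

;; Hypothetical sample text, used only for this illustration:
;; one numeric character reference (&#233;) and one named entity (&amp;).
(defvar my-feed-sample-title "Caf&#233; &amp; Bar"
  "Made-up feed title with a numeric and a named XML reference.")

;; xml-substitute-special resolves both kinds of reference in one call.
;; The result is expected to be the string "Café & Bar".
(xml-substitute-special my-feed-sample-title)
```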

2. request for help about an issue with multibyte character encoding
====================================================================

There is an issue with multibyte characters that appear in the input
as unescaped, multibyte-encoded characters (not as XML entities; written
as XML entities, multibyte characters are substituted correctly). I
looked for an example with a character encoding specified in the first
line of the XML feed like
<?xml version="1.0" encoding="utf-8"?>
and found one here:
http://www.openscreencast.de/blog/rss.xml

The W3C validator
http://validator.w3.org
seems to be happy with this feed, but when it is fed into a feeds.org
file the unescaped, multibyte-encoded characters, e.g. in the title
`Screencast 076 [...]', get garbled, even with `coding: utf-8-unix' in
the first line of feeds.org. Can someone please help to get this issue
resolved? If it is easily possible, as I expect it to be, then generally
for all character encodings supported by Emacs? I would even like it if
UTF-8 feeds like
http://pod.drs.ch/world_music_special_mpx.xml
that do not specify the character encoding worked too.
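
One possible direction, as a rough sketch only: read the encoding named
in the XML declaration from the raw bytes of the feed, decode with the
matching Emacs coding system, and fall back to UTF-8 when nothing usable
is declared. The function names my-feed-declared-coding-system and
my-feed-decode are hypothetical and do not exist in org-feed.el. The
sketch also assumes that the feed is available as a unibyte string, which
is not necessarily how org-feed holds it internally.

```elisp
(defun my-feed-declared-coding-system (raw)
  "Return the coding system named in the XML declaration of RAW.
RAW is the undecoded feed text.  Return `utf-8' when no encoding is
declared or when Emacs does not know the declared name."
  (let ((name (and (string-match
                    "\\`[ \t\r\n]*<\\?xml[^>]*encoding=[\"']\\([^\"']+\\)[\"']"
                    raw)
                   ;; Coding system names in Emacs are lowercase symbols.
                   (intern (downcase (match-string 1 raw))))))
    (if (and name (coding-system-p name))
        name
      'utf-8)))

(defun my-feed-decode (raw)
  "Decode the raw feed bytes RAW according to its XML declaration."
  (decode-coding-string raw (my-feed-declared-coding-system raw)))

;; Usage with raw UTF-8 bytes for the letter e with acute accent:
(my-feed-decode
 "<?xml version=\"1.0\" encoding=\"utf-8\"?><title>Caf\303\251</title>")
```

Falling back to UTF-8 is a design choice made here so that feeds without
a declared encoding are still decoded. It matches the XML default, but a
feed that is really in another encoding and does not say so would still
come out wrong.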

Thanks

- Michael

------------------------------------------------------------
--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -99,6 +99,7 @@
 (declare-function xml-get-children "xml" (node child-name))
 (declare-function xml-get-attribute "xml" (node attribute))
 (declare-function xml-get-attribute-or-nil "xml" (node attribute))
+(declare-function xml-substitute-special "xml" (string))
 (defvar xml-entity-alist)

 (defgroup org-feed  nil
@@ -269,17 +270,6 @@
 (defvar org-feed-buffer "*Org feed*"
   "The buffer used to retrieve a feed.")

-(defun org-feed-unescape (s)
-  "Unescape protected entities in S."
-  (require 'xml)
-  (let ((re (concat "&\\("
-                   (mapconcat 'car xml-entity-alist "\\|")
-                   "\\);")))
-    (while (string-match re s)
-      (setq s (replace-match
-              (cdr (assoc (match-string 1 s) xml-entity-alist)) nil nil s)))
-    s))
-
 ;;;###autoload
 (defun org-feed-update-all ()
   "Get inbox items from all feeds in `org-feed-alist'."
@@ -613,6 +603,7 @@

 (defun org-feed-parse-rss-entry (entry)
   "Parse the `:item-full-text' field for xml tags and create new properties."
+  (require 'xml)
   (with-temp-buffer
     (insert (plist-get entry :item-full-text))
     (goto-char (point-min))
@@ -620,7 +611,7 @@
                              nil t)
       (setq entry (plist-put entry
                             (intern (concat ":" (match-string 1)))
-                            (org-feed-unescape (match-string 2)))))
+                            (xml-substitute-special (match-string 2)))))
     (goto-char (point-min))
     (unless (re-search-forward "isPermaLink[ \t]*=[ \t]*\"false\"" nil t)
       (setq entry (plist-put entry :guid-permalink t))))
@@ -633,7 +624,6 @@

 The `:item-full-text' property actually contains the sexp
 formatted as a string, not the original XML data."
-  (require 'xml)
   (with-current-buffer buffer
     (widen)
     (let ((feed (car (xml-parse-region (point-min) (point-max)))))
@@ -654,7 +644,7 @@
                            'href)))
     ;; Add <title/> as :title.
     (setq entry (plist-put entry :title
-                          (org-feed-unescape
+                          (xml-substitute-special
                            (car (xml-node-children
                                  (car (xml-get-children xml 'title)))))))
     (let* ((content (car (xml-get-children xml 'content)))
@@ -664,12 +654,12 @@
         ((string= type "text")
          ;; We like plain text.
          (setq entry (plist-put entry :description
-                                (org-feed-unescape
+                                (xml-substitute-special
                                  (car (xml-node-children content))))))
         ((string= type "html")
          ;; TODO: convert HTML to Org markup.
          (setq entry (plist-put entry :description
-                                (org-feed-unescape
+                                (xml-substitute-special
                                  (car (xml-node-children content))))))
         ((string= type "xhtml")
          ;; TODO: convert XHTML to Org markup.
------------------------------------------------------------