[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: survey: string external representation

From: Thien-Thi Nguyen
Subject: Re: survey: string external representation
Date: Fri, 27 Jan 2012 11:27:30 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Thanks to everyone who responded.  Based on the collected
information, i've cobbled together a runtime check for
‘sql-quote’.  It and some tests are in the attached program.
To play:

 guile -s normalize.scm
 guile -s normalize.scm stupid

The code assumes Guile 2 DTRT, but if you have doubts, you can

 sed -i 's/guile-2/&-not-really/' normalize.scm

to disable that assumption.  In any case, the program should exit
successfully, indicating smooth ‘write’ / ‘read’ round-tripping.
This is so (both w/ and w/o "stupid") for Guile and 1.8.7.

;; -*- mode: scheme; coding: utf-8 -*-

(define EXIT-VALUE #t)                  ; optimism

(define STUPID? (false-if-exception (string=? "stupid" (cadr (command-line)))))

;; PostgreSQL groks ‘\xXX’ as an octet w/ hex value XX.
;; It also groks raw octets.  This is all fine and good.
;; The problem arises when there is a mix of contiguous
;; raw and \x representations, intended to represent a
;; UTF-8 (say) encoded character.
;; It seems Guile
;; - 1.4 DTRT by doing nothing;
;; - 1.6 ???;
;; - 1.8 fails by \x-escaping inconsistently;
;; - 2.0 doesn't have this problem.

  (define normalize identity))
   (srfi srfi-13)
   (srfi srfi-14))
  (define normalize
    (or (let* ((ego (char-set
                     ;; These are not strictly necessary for
                     ;; PostgreSQL, but we include them for
                     ;; (Scheme-only) round-trip testing.
                     ;; Doubtlessly, what doubtful ego!
                     #\" #\\))
               (ugh (ucs-range->char-set #o177 #o400 #t ego)))
          (and (not (char-set-every
                     (lambda (ch)
                       ;; Does the octet xrep unmolested?
                       (char=? ch (string-ref (object->string (string ch)) 1)))
                     (char-set-difference ugh ego)))
               (or (not STUPID?)
                   (begin (set! ugh ego)
               ;; Lame.
               (lambda (s)
                 (define backslash-x
                   (let ((v (make-vector 256)))
                      (lambda (ch)
                        (let ((i (char->integer ch)))
                           v i (string-append
                                "\\x" (number->string i 16)))))
                     ;; backslash-x
                     (lambda (ch)
                       (vector-ref v (char->integer ch)))))
                 (let loop ((start 0) (acc '()))
                   (cond ((string-index s ugh start)
                          => (lambda (idx)
                               (loop (1+ idx)
                                     (cons* (backslash-x (string-ref s idx))
                                            (substring/shared s start idx)
                         ((zero? start)
                           acc (substring/shared s start))))))))
        ;; Cool.

(define (try s)
   #t "ORIG:\t~S~%NORM:\t~S~%=>\t~A~%~%"
   s (normalize s)
   (let ((round (with-input-from-string
                      (lambda ()
                        (if (eq? identity normalize)
                            (write s)
                            (begin (display #\")
                                   (display (normalize s))
                                   (display #\")))))
     (cond ((equal? s round) 'SAME)
            (set! EXIT-VALUE #f)        ;-O
             "DIFF: [" (number->string (string-length round))
             "]|" round "|"))))))

(simple-format #t "Guile ~A~% LANG: ~S~% normalize: ~S~A~%~%"
               (version) (getenv "LANG") (procedure-name normalize)
               (if (and STUPID? (not (eq? normalize identity)))
                   " (but we stupidly revert to degeneracy)"

(try "")
(try (list->string (map integer->char (iota 256))))
(try "U+2002: | | (utf-8: E2 80 82)")
(try "U+232C: |⌬| (utf-8: E2 80 82)")
(try "U+1D7FF: |𝟿| (utf-8: F0 9D 9F BF)")
(try "U+2F9B2: |䕫| (utf-8: F0 AF A6 B2)")
(try "U+2F9BC: |蜨| (utf-8: F0 AF A6 BC)")


reply via email to

[Prev in Thread] Current Thread [Next in Thread]