pspp-dev

Re: Filename Encoding


From: John Darrington
Subject: Re: Filename Encoding
Date: Wed, 11 Dec 2013 20:23:20 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Dec 11, 2013 at 07:38:46AM -0800, Ben Pfaff wrote:
     On Wed, Dec 11, 2013 at 09:05:16AM +0100, John Darrington wrote:
     > On Tue, Dec 10, 2013 at 12:38:04PM -0800, Ben Pfaff wrote:
     > 
     >      I understand now.  However, in other places in PSPP, and in
     >      particular in syntax and the output engine, we tend to convert
     >      everything we receive externally into UTF-8 for internal
     >      processing, and then convert back to other encodings as
     >      necessary.  It would be convenient for some purposes to do this
     >      for filenames also (e.g. to include file names in output), and
     >      it would avoid needing to keep around two pieces of information
     >      (file name plus encoding) when one (UTF-8 file name) would do.
     > 
     >      Do you think that storing file name plus encoding is superior?
     > 
     > Both solutions have advantages and disadvantages.
     > 
     > The converting-all-filenames-to-utf8 solution has two disadvantages
     > that I can see:
     > 
     > *.  Unnecessary recoding - often it will be necessary to convert
     >     from "filename encoding" to utf8 and then, back to "filename
     >     encoding".
     
     Is the concern here about performance, or something else?  I doubt that
     there is a real performance problem with doing one or two conversions of
     a file name, once per file open.  Also, on GNU/Linux the filename
     encoding is UTF-8 anyway, so there is no actual conversion.

Performance wouldn't be an issue.  I was more concerned about clean code
and programming effort, the possibility of memory leaks, and general
elegance.
     
     > *.  The bigger disadvantage, is that it will be very easy simply to
     > forget to do the necessary conversion.  If the programmer forgets -
     > the compiler won't complain - it is just a char *   - Passing a
     > struct file_handle * one cannot forget - there'll be a compiler
     > error.
     
     That's true.  In data, we use uint8_t instead of char to remind
     ourselves that the data is in the dictionary encoding.  We could use
     int8_t for UTF-8 data, but that doesn't match either libunistring or
     glib practice so it would probably cause a lot of friction at
     interfaces.


Like you say, I don't think we can do that trick here because of what the
libraries expect.
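
For illustration only, here is a minimal sketch of the kind of type-level
reminder discussed above: wrapping a UTF-8 file name in its own struct, so
that a bare char * cannot be passed where a converted name is expected.  The
names struct utf8_filename, utf8_filename_from_utf8, utf8_filename_destroy
and utf8_filename_length are hypothetical and are not part of PSPP; the
sketch also assumes the caller has already done any conversion to UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper (not a PSPP type): a file name that is known to
   be encoded in UTF-8. */
struct utf8_filename
  {
    char *s;                    /* Null-terminated UTF-8, owned here. */
  };

/* Returns a wrapper holding a copy of S.  The caller asserts that S is
   already UTF-8; no conversion is done in this sketch.  On allocation
   failure the wrapper's string is NULL. */
static struct utf8_filename
utf8_filename_from_utf8 (const char *s)
{
  struct utf8_filename fn;
  size_t n = strlen (s) + 1;

  fn.s = malloc (n);
  if (fn.s != NULL)
    memcpy (fn.s, s, n);
  return fn;
}

/* Frees the string owned by FN and clears the pointer. */
static void
utf8_filename_destroy (struct utf8_filename *fn)
{
  free (fn->s);
  fn->s = NULL;
}

/* Accepts only the wrapper type.  Passing a plain char * here is an
   incompatible-pointer-type violation that the compiler diagnoses. */
static size_t
utf8_filename_length (const struct utf8_filename *fn)
{
  return fn->s != NULL ? strlen (fn->s) : 0;
}
```

A call such as utf8_filename_length ("data.sav") would draw a diagnostic
for the incompatible pointer type, whereas a function taking char * accepts
a name in any encoding silently.  The cost is the one noted above: the
string has to be unwrapped again at every library interface that expects a
plain char *.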

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.
