Re: [Swftools-common] swfstrings and japanese text
From: Con Kolivas
Subject: Re: [Swftools-common] swfstrings and japanese text
Date: Thu, 29 Jun 2006 22:37:42 +1000
User-agent: KMail/1.9.3
On Tuesday 27 June 2006 23:17, Matthias Kramm wrote:
> On Tue, Jun 27, 2006 at 07:11:36PM +1000, Con Kolivas wrote:
> > One query; I can't seem to extract japanese text (kanji) with swfstrings
> > apart from the font name which is correctly displayed in kanji. Most of
> > the static text is ignored and nothing follows.
>
> That's an interesting feature request :)
> Well, so far swfstrings only extracts text in the standard codepage
> (iso8859-1). There's no UTF-8 output yet.
>
> I guess I'll add it to the TODO list.
>
> Do you happen to have any simple Kanji encoded sample-SWFs?
(Sample sent off-list.)
I've been looking at your code myself to see if I could help, and tracked the
output line (in v0.7.0) down to
swfextc.c:466
printf("%c", code);
That writes a single byte per character, so for UTF-8 output it can only work
for ASCII codes up to 127. UTF-8 is variable length, so this probably needs
something like %lc passed a wchar_t instead. All of this is new to me, and
I've never really hacked on this sort of code before, so I'm not sure whether
it's already obvious to others who might also find it interesting.
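For anyone who wants to try the %lc idea, here is a rough, untested-against-swftools
sketch. The helper names print_code_lc and use_environment_locale are made up
for illustration and are not part of swftools. It assumes that "code" holds a
Unicode code point, that wchar_t is wide enough to hold it (not true on every
platform), and that the user's locale uses UTF-8.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not in swftools): adopt the locale named by the
 * environment (LANG, LC_ALL, ...). Returns 1 on success, 0 on failure.
 * Without this call the program stays in the "C" locale, where %lc
 * typically cannot convert anything beyond ASCII. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper (not in swftools): print one character code via
 * the C library's wide-to-multibyte conversion. Returns the number of
 * bytes written, or a negative value if the code cannot be represented
 * in the current locale. */
int print_code_lc(FILE *fp, unsigned int code)
{
    return fprintf(fp, "%lc", (wint_t)code);
}
```

The catch with this approach is that the output encoding follows whatever
locale the user happens to run under, so it only produces UTF-8 when the
locale says so.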
I thought you might find this information helpful for UTF-8 output:
UTF-8 encoding is variable-length, and characters are encoded with one, two,
three, or four bytes. The first 128 characters of Unicode (BMP), U+0000
through U+007F, are encoded with a single byte, and are equivalent to ASCII.
U+0080 through U+07FF (BMP) are encoded with two bytes, and U+0800 through
U+FFFF (still BMP) are encoded with three bytes. The 1,048,576 characters of
the 16 Supplementary Planes are encoded with four bytes.
(from http://www-128.ibm.com/developerworks/java/library/j-u-encode.html)
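The byte ranges quoted above can be turned into an encoder that does not
depend on the locale at all. This is only a sketch: utf8_encode and put_utf8
are hypothetical names, not existing swftools functions, and it assumes the
value being printed is already a Unicode code point.

```c
#include <stdio.h>

/* Hypothetical helper (not in swftools): encode one Unicode code point
 * as UTF-8. "out" must have room for 4 bytes. Returns the number of
 * bytes written, or 0 if cp is not a valid code point. */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp <= 0x7F) {                       /* one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* three bytes, rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes, supplementary planes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond U+10FFFF */
}

/* Hypothetical stand-in for printf("%c", code): write cp to fp as UTF-8.
 * Returns bytes written, 0 for an invalid code point, -1 on I/O error. */
int put_utf8(FILE *fp, unsigned long cp)
{
    unsigned char buf[4];
    int n = utf8_encode(cp, buf);
    if (n > 0 && fwrite(buf, 1, (size_t)n, fp) != (size_t)n)
        return -1;
    return n;
}
```

As a check, the kanji U+6F22 falls in the three-byte range and comes out as
the bytes E6 BC A2.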
Thanks!
--
-ck