Re: [Fastcgipp-users] UTF-8 POST value

The wide character template of fastcgi++ only works properly if the
input data is UTF-8. The default for web transmission is actually
ISO-8859-1, a fixed 8-bit character set, so there is no need to use
wide characters with it. The Ñ character and the other Latin
characters can be represented without variable-width characters.

I haven't looked at any code, so I am just guessing, but I bet the
urlencoded data is actually ISO-8859-1 encoded. There is no problem
converting that to wide character Unicode until you start using
non-ASCII ISO-8859-1 characters. This is because UTF-8 is ASCII
compatible: any ASCII text is already valid UTF-8, but ISO-8859-1 text
is not. The ASCII part of ISO-8859-1 is fine; the special characters
(bytes 0x80 and above) are not. The charToString function gets messed
up if it runs into non-UTF-8 bytes when it is called with wchar_t,
because it does code conversion from UTF-8 to some sort of wide
character Unicode such as UTF-32 or UTF-16.
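
To make the difference concrete, here is a minimal sketch of a
structural UTF-8 check. The function name looksLikeUtf8 is made up for
this illustration and is not part of fastcgi++; the check is also
simplified (it does not reject overlong forms or surrogates). It only
shows why an ISO-8859-1 "Ñ" (the single byte 0xD1) cannot pass through
a UTF-8 decoder, while the UTF-8 form (0xC3 0x91) can.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of fastcgi++: returns true if `bytes`
// has the structure of UTF-8 (lead byte followed by the right number of
// continuation bytes). Simplified: overlong forms are not rejected.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuation = 0;
        if (lead < 0x80)                continuation = 0; // plain ASCII
        else if ((lead & 0xE0) == 0xC0) continuation = 1; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) continuation = 2; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) continuation = 3; // 11110xxx
        else return false;                                // stray byte

        // The sequence must not run past the end of the input.
        if (bytes.size() - i <= continuation) return false;

        // Every continuation byte must look like 10xxxxxx.
        for (std::size_t k = 1; k <= continuation; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += continuation + 1;
    }
    return true;
}

// Small demonstration of the cases discussed in this thread.
void demoLooksLikeUtf8()
{
    assert(looksLikeUtf8("plain ascii"));        // ASCII is valid UTF-8
    assert(looksLikeUtf8("\xC3\x91"));           // UTF-8 encoding of Ñ
    assert(!looksLikeUtf8("\xD1"));              // ISO-8859-1 Ñ alone
    assert(!looksLikeUtf8("Espa\xD1" "a"));      // ISO-8859-1 Ñ mid-word
}
```

A check like this could be run on the raw post body before any
conversion, to confirm whether the client really sent UTF-8.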

> The problem in this case is occurring in fillPostsUrlEncoded, but may
> point to somewhere else. fillPostsUrlEncoded is short and basically
> copies the post data into a string as follows:
>
> std::basic_string<charT> queryString;
> boost::scoped_array<char> buffer(new char[size]);
> memcpy (buffer.get(), data, size);
> charToString (buffer.get(), size, queryString);
> doFillPostsUrlEncoded(queryString);
>
> I think the problem might be in charToString (or my use of it) as
> that's where the data is corrupted. Eddie, any thoughts WRT this? Will
> do further testing.

As I said above, this is very likely due to charToString trying to code
convert non-utf8 data.

So if the incoming data is UTF-8 encoded, charToString should do the right thing? I tried setting the content type to "application/x-www-form-urlencoded; charset=utf-8" and still got the same results.

So, I've discovered a problem of sorts. Currently, when parsing url-encoded request data, I do not call percentEscapedToRealBytes before charToString (which I gather is the right way to do it), because I first tokenize the request string on "=" and "&". I call percentEscapedToRealBytes on each key and value only after tokenizing, in case the characters "=" or "&" appear in either the key or the value. So the problem is this: either I break special characters when using wchar_t (which is not acceptable), or I strictly disallow "=" and "&" characters in url-encoded request data, except as delimiters (which seems reasonable).
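
One possible way around this, sketched below under my own assumptions, is to do the tokenizing and the percent-decoding on raw bytes and leave the UTF-8 to wide character conversion for last, applied to each decoded key and value. The names hexValue, percentDecode and parseUrlEncoded are made up for this sketch; they are not the fastcgi++ functions mentioned above, and the charset conversion step itself is left out. Because a literal "=" or "&" inside a value arrives as %3D or %26, splitting before decoding does not lose anything.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: value of one hex digit, or -1 if it is not one.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: turns %XX escapes into raw bytes and '+' into a
// space. The result is still a byte string; no charset conversion yet.
std::string percentDecode(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size() + 0 && i + 2 <= in.size() - 1) {
            const int hi = hexValue(in[i + 1]);
            const int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2; // skip the two hex digits
                continue;
            }
        }
        out.push_back(in[i] == '+' ? ' ' : in[i]);
    }
    return out;
}

// Splits on '&' and '=' first, then decodes each key and value.
std::vector<std::pair<std::string, std::string>>
parseUrlEncoded(const std::string& body)
{
    std::vector<std::pair<std::string, std::string>> result;
    std::size_t start = 0;
    while (start <= body.size()) {
        std::size_t end = body.find('&', start);
        if (end == std::string::npos) end = body.size();
        const std::string field = body.substr(start, end - start);
        if (!field.empty()) {
            const std::size_t eq = field.find('=');
            const std::string key = percentDecode(field.substr(0, eq));
            const std::string value = (eq == std::string::npos)
                ? std::string()
                : percentDecode(field.substr(eq + 1));
            result.emplace_back(key, value);
        }
        start = end + 1;
    }
    return result;
}

// Small demonstration: an escaped '=' and '&' survive inside a value,
// and the UTF-8 bytes of "ñ" come out intact for later conversion.
void demoParseUrlEncoded()
{
    const auto posts = parseUrlEncoded("name=Espa%C3%B1a&expr=a%3Db%26c");
    assert(posts.size() == 2);
    assert(posts[0].first == "name");
    assert(posts[0].second == "Espa\xC3\xB1" "a");
    assert(posts[1].first == "expr");
    assert(posts[1].second == "a=b&c");
}
```

With this ordering, the code conversion only ever sees fully decoded bytes, so it would work as long as the client actually sent UTF-8.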

From:	Axel von Bertoldi
Subject:	Re: [Fastcgipp-users] UTF-8 POST value
Date:	Wed, 17 Mar 2010 14:43:47 -0600