Questions about how to implement pickles/utf8.pk
From: Mohammad-Reza Nabipoor
Subject: Questions about how to implement pickles/utf8.pk
Date: Wed, 16 Sep 2020 23:40:55 +0430
Hi,
I tried to write a pickle to poke UTF8 (`utf8.pk`). I came up with two different
types:
```poke
/*
 * len  from     to        byte[0]   byte[1]   byte[2]   byte[3]
 * 1    U+0000   U+007F    0xxxxxxx
 * 2    U+0080   U+07FF    110xxxxx  10xxxxxx
 * 3    U+0800   U+FFFF    1110xxxx  10xxxxxx  10xxxxxx
 * 4    U+10000  U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
 *
 * ref: https://en.wikipedia.org/wiki/UTF-8
 */
deftype UTF8_CodePoint = uint<21>;
deftype UTF8_1 =
  union
  {
    byte[1] d1 : (d1[0] & 0x80) == 0;
    byte[2] d2 : (d2[0] & 0xe0) == 0xc0 && (d2[1] & 0xc0) == 0x80;
    byte[3] d3 : (d3[0] & 0xf0) == 0xe0 &&
                 (d3[1] & 0xc0) == 0x80 &&
                 (d3[2] & 0xc0) == 0x80;
    byte[4] d4 : (d4[0] & 0xf8) == 0xf0 &&
                 (d4[1] & 0xc0) == 0x80 &&
                 (d4[2] & 0xc0) == 0x80 &&
                 (d4[3] & 0xc0) == 0x80;
  };
deftype UTF8_2 =
  union
  {
    struct
    {
      byte[1] d : (d[0] & 0x80) == 0;
    };
    struct
    {
      byte[2] d : (d[0] & 0xe0) == 0xc0 && (d[1] & 0xc0) == 0x80;
    };
    struct
    {
      byte[3] d : (d[0] & 0xf0) == 0xe0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80;
    };
    struct
    {
      byte[4] d : (d[0] & 0xf8) == 0xf0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80 &&
                  (d[3] & 0xc0) == 0x80;
    };
  };
```
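For reference, the field constraints above can be modelled outside poke. The following is a minimal Python sketch (not poke code; `utf8_seq_len` is an illustrative name, not part of `utf8.pk`) that tries the four bit patterns in the same order as the union alternatives. Like the types above, it only checks the marker bits, so it does not reject overlong encodings or surrogates.

```python
def utf8_seq_len(data: bytes) -> int:
    """Return the length (1-4) of the UTF-8 sequence at the start of
    `data`, or 0 if no alternative matches.

    Each entry is (length, lead-byte mask, expected lead-byte value),
    mirroring the constraints on d1..d4 in the union.
    """
    patterns = [
        (1, 0x80, 0x00),  # 0xxxxxxx
        (2, 0xE0, 0xC0),  # 110xxxxx
        (3, 0xF0, 0xE0),  # 1110xxxx
        (4, 0xF8, 0xF0),  # 11110xxx
    ]
    for length, mask, value in patterns:
        if len(data) < length:
            continue  # not enough bytes for this alternative
        lead_ok = (data[0] & mask) == value
        # Every continuation byte must look like 10xxxxxx.
        cont_ok = all((b & 0xC0) == 0x80 for b in data[1:length])
        if lead_ok and cont_ok:
            return length
    return 0
```

For example, `utf8_seq_len("€".encode("utf-8"))` picks the three-byte alternative, while a lone continuation byte such as `b"\x80"` matches none of them.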
## Question 1
How can I figure out the active field in a union?
For these types (`UTF8_*`) I can find it by the `size` attribute of the instance.
But I think there should be a general mechanism.
An example:
```poke
defun utf8_decode = (UTF8_1 x) UTF8_CodePoint:
{
  if (x'size == 1#B)
    return x.d1[0];
  if (x'size == 2#B)
    return (x.d2[0] & 0x1f) <<. 6 | (x.d2[1] & 0x3f);
  if (x'size == 3#B)
    return (x.d3[0] & 0x0f) <<. 12 | (x.d3[1] & 0x3f) <<. 6 |
           (x.d3[2] & 0x3f);
  if (x'size == 4#B)
    return (x.d4[0] & 0x07) <<. 18 | (x.d4[1] & 0x3f) <<. 12 |
           (x.d4[2] & 0x3f) <<. 6 | (x.d4[3] & 0x3f);
}
```
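The same bit arithmetic can be cross-checked against a known-good codec. This is a minimal Python sketch (not poke; the Python function `utf8_decode` here is only an illustrative model that dispatches on the sequence length, as the poke function does with `x'size`):

```python
def utf8_decode(seq: bytes) -> int:
    """Decode one already-validated UTF-8 sequence (1-4 bytes) to a code point."""
    n = len(seq)
    if n == 1:
        return seq[0]
    if n == 2:
        return (seq[0] & 0x1F) << 6 | (seq[1] & 0x3F)
    if n == 3:
        return (seq[0] & 0x0F) << 12 | (seq[1] & 0x3F) << 6 | (seq[2] & 0x3F)
    if n == 4:
        return ((seq[0] & 0x07) << 18 | (seq[1] & 0x3F) << 12 |
                (seq[2] & 0x3F) << 6 | (seq[3] & 0x3F))
    raise ValueError("a UTF-8 sequence is 1 to 4 bytes long")


# Compare against Python's built-in encoder for one character of each length.
for ch in ["A", "é", "€", "😀"]:
    assert utf8_decode(ch.encode("utf-8")) == ord(ch)
```

The masks and shift amounts are the ones used in the poke function, so agreement with the built-in codec suggests the arithmetic itself is sound, independent of how the active union field is selected.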
## Question 2
If I want to define a `decode` method for `UTF8_1` instead of the `utf8_decode`
function, how can I access the `size` attribute?
## Question 3
I prefer `UTF8_2` over `UTF8_1`, because I always have to deal with only
one field. From the user's point of view, it's a variable-length array (1-4 bytes).
How can I access the `d` field?
Or if you think that my question is insane, could you please explain why?
Currently this doesn't work:
```poke
// Inside `UTF8_2` type
method decode = UTF8_CodePoint:
{
  if (d'size == 1#B)
    return d[0];
  if (d'size == 2#B)
    return (d[0] & 0x1f) <<. 6 | (d[1] & 0x3f);
  if (d'size == 3#B)
    return (d[0] & 0x0f) <<. 12 | (d[1] & 0x3f) <<. 6 | (d[2] & 0x3f);
  if (d'size == 4#B)
    return (d[0] & 0x07) <<. 18 | (d[1] & 0x3f) <<. 12 |
           (d[2] & 0x3f) <<. 6 | (d[3] & 0x3f);
}
```
## Question 4
I cannot write `utf8_encode` function for `UTF8_1`, because union construction
does not work (like the problem for pinned structs [Bug 26527][2]).
How do you write an encode function?
This does not work:
```poke
defun utf8_encode = (UTF8_CodePoint cp) UTF8_1:
{
  /* The upper bounds are inclusive: U+007F, U+07FF and U+FFFF still
     fit in 1, 2 and 3 bytes respectively.  */
  if (cp <= 0x7f)
    return UTF8_1 {d1 = [cp as byte]};
  if (cp <= 0x7ff)
    return UTF8_1 {
      d2 = [
        (0xc0 | (cp .>> 6 & 0x1f)) as byte,
        (0x80 | (cp & 0x3f)) as byte,
      ]
    };
  if (cp <= 0xffff)
    return UTF8_1 {
      d3 = [
        (0xe0 | (cp .>> 12 & 0x0f)) as byte,
        (0x80 | (cp .>> 6 & 0x3f)) as byte,
        (0x80 | (cp & 0x3f)) as byte,
      ]
    };
  return UTF8_1 {
    d4 = [
      (0xf0 | (cp .>> 18 & 0x07)) as byte,
      (0x80 | (cp .>> 12 & 0x3f)) as byte,
      (0x80 | (cp .>> 6 & 0x3f)) as byte,
      (0x80 | (cp & 0x3f)) as byte,
    ]
  };
}
```
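Independently of the union-construction problem, the range checks and bit packing can be verified in isolation. This is a minimal Python sketch (not poke; the Python `utf8_encode` is an illustrative model) that follows the table at the top of this message, where U+007F, U+07FF and U+FFFF are the last code points of the 1-, 2- and 3-byte ranges:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using the ranges from the table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | (cp >> 6 & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | (cp >> 12 & 0x3F),
                  0x80 | (cp >> 6 & 0x3F),
                  0x80 | (cp & 0x3F)])


# Check both sides of every range boundary against the built-in encoder.
for cp in [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]:
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The boundary values are the interesting cases: a code point placed in the wrong range would be encoded with too many bytes (an overlong form), which strict decoders reject.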
## Question 5
Is there any other approach to poke UTF8?
BTW you can download `utf8.pk` at [1] (it's plain-text, you can `wget` it).
(I don't attach it because I've changed some names)
Regards,
Mohammad-Reza
[1]: https://gitlab.com/mnabipoor/poke-journey/-/raw/master/pickles/utf8.pk
[2]: https://sourceware.org/bugzilla/show_bug.cgi?id=26527