poke-devel

Questions about how to implement pickles/utf8.pk


From: Mohammad-Reza Nabipoor
Subject: Questions about how to implement pickles/utf8.pk
Date: Wed, 16 Sep 2020 23:40:55 +0430

Hi,

I tried to write a pickle to poke UTF-8 (`utf8.pk`) and came up with two
different types:


```poke
/*
 * len from    to       byte[0]   byte[1]   byte[2]   byte[3]
 * 1   U+0000  U+007F   0xxxxxxx
 * 2   U+0080  U+07FF   110xxxxx  10xxxxxx
 * 3   U+0800  U+FFFF   1110xxxx  10xxxxxx  10xxxxxx
 * 4   U+10000 U+10FFFF 11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
 *
 * ref: https://en.wikipedia.org/wiki/UTF-8
 */

deftype UTF8_CodePoint = uint<21>;

deftype UTF8_1 =
  union
  {
    byte[1] d1 : (d1[0] & 0x80) == 0;

    byte[2] d2 : (d2[0] & 0xe0) == 0xc0 && (d2[1] & 0xc0) == 0x80;

    byte[3] d3 : (d3[0] & 0xf0) == 0xe0 &&
                 (d3[1] & 0xc0) == 0x80 &&
                 (d3[2] & 0xc0) == 0x80;

    byte[4] d4 : (d4[0] & 0xf8) == 0xf0 &&
                 (d4[1] & 0xc0) == 0x80 &&
                 (d4[2] & 0xc0) == 0x80 &&
                 (d4[3] & 0xc0) == 0x80;
  };

deftype UTF8_2 =
  union
  {
    struct
    {
      byte[1] d: (d[0] & 0x80) == 0;
    };

    struct
    {
      byte[2] d : (d[0] & 0xe0) == 0xc0 && (d[1] & 0xc0) == 0x80;
    };

    struct
    {
      byte[3] d : (d[0] & 0xf0) == 0xe0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80;
    };

    struct
    {
      byte[4] d : (d[0] & 0xf8) == 0xf0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80 &&
                  (d[3] & 0xc0) == 0x80;
    };
  };
```
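
The table in the comment above implies that the lead byte alone fixes the length of the sequence. A minimal sketch of that rule, assuming a hypothetical helper named `utf8_seqlen` that is not part of `utf8.pk`, could look like this:

```poke
/* Hypothetical helper (not in utf8.pk): number of bytes in the UTF-8
   sequence that starts with lead byte B, or 0 if B cannot start a
   sequence (for example a 10xxxxxx continuation byte).  */
defun utf8_seqlen = (byte b) int:
  {
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((b & 0xe0) == 0xc0) return 2;  /* 110xxxxx */
    if ((b & 0xf0) == 0xe0) return 3;  /* 1110xxxx */
    if ((b & 0xf8) == 0xf0) return 4;  /* 11110xxx */
    return 0;
  }
```

The masks are the same ones used in the first-byte constraints of `UTF8_1` and `UTF8_2`.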


## Question 1

How can I figure out the active field in a union?

For these types (`UTF8_*`) I can determine it from the `size` attribute of the
instance, but I think there should be a general mechanism.

An example:

```poke
defun utf8_decode = (UTF8_1 x) UTF8_CodePoint:
  {
    if (x'size == 1#B)
      return x.d1[0];

    if (x'size == 2#B)
      return (x.d2[0] & 0x1f) <<. 6 | (x.d2[1] & 0x3f);

    if (x'size == 3#B)
      return (x.d3[0] & 0x0f) <<. 12 | (x.d3[1] & 0x3f) <<. 6 |
             (x.d3[2] & 0x3f);

    if (x'size == 4#B)
      return (x.d4[0] & 0x07) <<. 18 | (x.d4[1] & 0x3f) <<. 12 |
             (x.d4[2] & 0x3f) <<. 6  | (x.d4[3] & 0x3f);
  }

```
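
As a sanity check on the 3-byte branch, the arithmetic can be worked through by hand for the bytes `E2 82 AC`, which encode U+20AC. This is only a sketch; the variable names `b` and `cp` are illustrative:

```poke
/* Worked check of the 3-byte formula with the bytes E2 82 AC.  */
defvar b = [0xe2UB, 0x82UB, 0xacUB];

/* (0xe2 & 0x0f) <<. 12  ==  0x2 <<. 12   ==  0x2000
   (0x82 & 0x3f) <<. 6   ==  0x02 <<. 6   ==  0x0080
   (0xac & 0x3f)         ==  0x2c
   0x2000 | 0x0080 | 0x2c == 0x20ac  */
defvar cp = (b[0] & 0x0f) <<. 12 | (b[1] & 0x3f) <<. 6 | (b[2] & 0x3f);
```

The value of `cp` here is `0x20ac`, matching the code point.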


## Question 2

If I want to define a `decode` method for `UTF8_1` instead of the `utf8_decode`
function, how can I access the `size` attribute?


## Question 3

I prefer `UTF8_2` over `UTF8_1`, because I always have to deal with only one
field. From the user's POV, it's an array of variable length (1-4).

How can I access the `d` field?
Or if you think that my question is insane, could you please explain why?

Currently this doesn't work:

```poke
    // Inside `UTF8_2` type
    method decode = UTF8_CodePoint:
      {
        if (d'size == 1#B)
          return d[0];

        if (d'size == 2#B)
          return (d[0] & 0x1f) <<. 6 | (d[1] & 0x3f);

        if (d'size == 3#B)
          return (d[0] & 0x0f) <<. 12  | (d[1] & 0x3f) <<. 6 | (d[2] & 0x3f);

        if (d'size == 4#B)
          return (d[0] & 0x07) <<. 18 | (d[1] & 0x3f) <<. 12 |
                 (d[2] & 0x3f) <<. 6  | (d[3] & 0x3f);
      }
```


## Question 4

I cannot write a `utf8_encode` function for `UTF8_1`, because union construction
does not work (similar to the problem with pinned structs, [Bug 26527][2]).

How do you write an encode function?

This does not work:

```poke
defun utf8_encode = (UTF8_CodePoint cp) UTF8_1:
  {
    if (cp <= 0x7f)
      return UTF8_1 {d1 = [cp as byte]};

    if (cp <= 0x7ff)
      return UTF8_1 {
        d2 = [
          (0xc0 | (cp .>> 6 & 0x1f)) as byte,
          (0x80 | (cp       & 0x3f)) as byte,
        ]
      };

    if (cp <= 0xffff)
      return UTF8_1 {
        d3 = [
          (0xe0 | (cp .>> 12 & 0x0f)) as byte,
          (0x80 | (cp .>> 6  & 0x3f)) as byte,
          (0x80 | (cp        & 0x3f)) as byte,
        ]
      };

    return UTF8_1 {
      d4 = [
          (0xf0 | (cp .>> 18 & 0x07)) as byte,
          (0x80 | (cp .>> 12 & 0x3f)) as byte,
          (0x80 | (cp .>> 6  & 0x3f)) as byte,
          (0x80 | (cp        & 0x3f)) as byte,
      ]
    };
  }
```
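
One possible way to sidestep the union constructor, shown here only as an untested sketch, is to return the encoded bytes as a plain array. The name `utf8_encode_bytes` is hypothetical, and the sketch assumes that an array literal can be returned from a function declared as returning `byte[]`:

```poke
/* Hypothetical variant (not in utf8.pk): encode CP into 1-4 bytes and
   return them as a plain byte array instead of a UTF8_1 value.
   Range limits follow the table at the top of this message.  */
defun utf8_encode_bytes = (UTF8_CodePoint cp) byte[]:
  {
    if (cp <= 0x7f)
      return [cp as byte];

    if (cp <= 0x7ff)
      return [(0xc0 | (cp .>> 6 & 0x1f)) as byte,
              (0x80 | (cp       & 0x3f)) as byte];

    if (cp <= 0xffff)
      return [(0xe0 | (cp .>> 12 & 0x0f)) as byte,
              (0x80 | (cp .>> 6  & 0x3f)) as byte,
              (0x80 | (cp        & 0x3f)) as byte];

    return [(0xf0 | (cp .>> 18 & 0x07)) as byte,
            (0x80 | (cp .>> 12 & 0x3f)) as byte,
            (0x80 | (cp .>> 6  & 0x3f)) as byte,
            (0x80 | (cp        & 0x3f)) as byte];
  }
```

With this sketch, U+20AC would be expected to come out as the three bytes `E2 82 AC`. It does not answer the question about constructing the union itself.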

## Question 5

Is there any other approach to poke UTF8?


BTW, you can download `utf8.pk` from [1] (it's plain text, so you can `wget` it).
(I haven't attached it because I've changed some names.)


Regards,
Mohammad-Reza


[1]: https://gitlab.com/mnabipoor/poke-journey/-/raw/master/pickles/utf8.pk
[2]: https://sourceware.org/bugzilla/show_bug.cgi?id=26527

