bug-binutils

Request: utf-8 support in `strings`


From: hackerb9
Subject: Request: utf-8 support in `strings`
Date: Fri, 23 Aug 2019 04:45:13 -0700

"Strings" is a fabulous and under-appreciated program. However, it only works on text which has a fixed number of bytes per character, such as Latin-1 or UTF-16. Since almost all Unix systems now default to UTF-8, in which a character takes from 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 caps it at 4), it would be wonderful if strings were updated to handle it.

This may sound difficult or even impossible given the flexibility of UTF-8. Perhaps so, but it is certainly possible for strings to have better UTF-8 support than it does now. (None.)

Some people on the web have used `strings -eS` as a kludge for UTF-8, but it doesn't work for me. I'm not surprised, as the man page says that flag is for *single* 8-bit byte characters and does not mention UTF-8.

At the moment, I have a multi-gigabyte coredump I need to search for text strings and I've resorted to using Emacs because it works, albeit slowly. I do not know how Emacs manages it, but using C-x RET r to set the coding system for the visited file to be UTF-8 correctly parses all the strings in the corefile.

Other projects, such as Wireshark, have implemented UTF-8 character detection in binary streams by quickly ruling out invalid sequences with a few simple rules:

> an octet sequence that begins with an octet with the uppermost bit set and the bit below it clear is invalid and doesn't correspond to a code point in Unicode;
>
> an octet sequence that begins with an octet with the uppermost two bits set, and where the 1 bits below it indicate that the sequence is N bytes long, but that has fewer than N-1 octets-with-10-at-the-top following it (either because it's terminated by an octet that doesn't have 10 at the top or it's terminated by the end of the string), is invalid and doesn't correspond to a code point in Unicode;
>
> an octet sequence which doesn't have the two problems above but that produces a value that's not a valid Unicode code point is invalid and (by definition) doesn't correspond to a code point in Unicode;

(From https://wiki.wireshark.org/Development/StringHandling)
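To make the three rules above concrete, here is a rough sketch in Python of how a UTF-8-aware scan might look. It is only an illustration, not how `strings` or Wireshark actually do it. The function names `utf8_sequence_length` and `utf8_strings` and the `min_chars` parameter are made up for this example, and I am assuming "printable" means whatever Python's `str.isprintable()` accepts, plus tab. It also assumes the 4-byte limit of RFC 3629.

```python
def utf8_sequence_length(buf, i):
    """Return the byte length of a valid UTF-8 sequence at buf[i], or 0 if invalid."""
    b0 = buf[i]
    if b0 < 0x80:
        return 1                      # plain ASCII octet
    if b0 < 0xC0:
        return 0                      # rule 1: 10xxxxxx cannot start a sequence
    if b0 < 0xE0:
        n, cp, min_cp = 2, b0 & 0x1F, 0x80
    elif b0 < 0xF0:
        n, cp, min_cp = 3, b0 & 0x0F, 0x800
    elif b0 < 0xF8:
        n, cp, min_cp = 4, b0 & 0x07, 0x10000
    else:
        return 0                      # 5- and 6-byte lead octets (assumed invalid, RFC 3629)
    if i + n > len(buf):
        return 0                      # rule 2: cut short by the end of the buffer
    for b in buf[i + 1:i + n]:
        if b & 0xC0 != 0x80:
            return 0                  # rule 2: a continuation octet is missing
        cp = (cp << 6) | (b & 0x3F)
    # rule 3: overlong encodings, surrogates, and values past U+10FFFF
    if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return 0
    return n


def utf8_strings(buf, min_chars=4):
    """Yield (offset, text) for each run of at least min_chars printable characters."""
    i = start = 0
    chars = []
    while i < len(buf):
        n = utf8_sequence_length(buf, i)
        ch = buf[i:i + n].decode("utf-8") if n else ""
        if n and (ch.isprintable() or ch == "\t"):
            if not chars:
                start = i             # remember where this run began
            chars.append(ch)
            i += n
        else:
            if len(chars) >= min_chars:
                yield start, "".join(chars)
            chars = []
            i += n or 1               # skip the bad octet or the non-printable character
    if len(chars) >= min_chars:
        yield start, "".join(chars)


if __name__ == "__main__":
    sample = b"\x00\x01caf\xc3\xa9 au lait\x00\xff\xfe\x80abc\x00"
    for offset, text in utf8_strings(sample):
        print(offset, text)
```

Note that the length is counted in characters rather than bytes, so a short run of CJK text is not penalized for taking three bytes per character. For a multi-gigabyte core file the whole buffer would not fit comfortably in memory; passing an `mmap` object instead of `bytes` should work with this sketch, though I have not measured how it performs.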

Please consider adding UTF-8 support to `strings`. If there is some obstacle, let me know and I will see what I can do to help.

Thank you,

—b9
