bug-make

Re: [PATCH] Use UTF-8 active code page for Windows host.


From: Costas Argyris
Subject: Re: [PATCH] Use UTF-8 active code page for Windows host.
Date: Sun, 19 Mar 2023 21:25:30 +0000

> That's not a good experiment, IMO: the only non-ASCII character here
> is U+274E, which has no case variants.  And the characters whose
> letter-case you tried to change are all ASCII, so their case
> conversions are unaffected by the locale.


OK, I think this is a better one: it uses U+03B2 and U+0392, which
are the lower and upper case of the same letter (β and Β).

I create a file src.β first:

touch src.β

and then run the following UTF-8 encoded Makefile:

hello :
	@gcc ©\src.c -o ©\src.exe

ifneq ("$(wildcard src.β)","")
	@echo src.β exists
else
	@echo src.β does NOT exist
endif

ifneq ("$(wildcard src.Β)","")
	@echo src.Β exists
else
	@echo src.Β does NOT exist
endif

ifneq ("$(wildcard src.βΒ)","")
	@echo src.βΒ exists
else
	@echo src.βΒ does NOT exist
endif

and the output of Make is:

C:\Users\cargyris\temp>make -f utf8.mk
src.β exists
src.Β exists
src.βΒ does NOT exist

which shows that Make also finds the file when the extension is written in
upper case, even though it exists in the file system with a lower case extension.
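
To make it clear that the match above cannot come from a plain byte
comparison, here is a minimal sketch showing that β (U+03B2) and Β (U+0392)
have different UTF-8 encodings.  The helper names utf8_encode and
same_utf8_bytes are hypothetical, made up for this illustration; they are
not taken from the Make sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into OUT
   (room for 4 bytes plus a trailing NUL).  Returns the number of bytes
   written, not counting the NUL, or 0 if CP is not a valid scalar value.  */
size_t
utf8_encode (uint32_t cp, char out[5])
{
  size_t n;

  if (cp < 0x80)
    {
      out[0] = (char) cp;
      n = 1;
    }
  else if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      n = 2;
    }
  else if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not encodable */
      out[0] = (char) (0xE0 | (cp >> 12));
      out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (char) (0x80 | (cp & 0x3F));
      n = 3;
    }
  else if (cp <= 0x10FFFF)
    {
      out[0] = (char) (0xF0 | (cp >> 18));
      out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (char) (0x80 | (cp & 0x3F));
      n = 4;
    }
  else
    return 0;

  out[n] = '\0';
  return n;
}

/* Hypothetical helper: nonzero if code points A and B encode to the
   same UTF-8 bytes, i.e. if a byte-wise comparison would treat them
   as equal.  */
int
same_utf8_bytes (uint32_t a, uint32_t b)
{
  char ea[5], eb[5];
  size_t na = utf8_encode (a, ea);
  size_t nb = utf8_encode (b, eb);

  return na != 0 && na == nb && memcmp (ea, eb, na) == 0;
}
```

Calling same_utf8_bytes (0x03B2, 0x0392) returns 0, since the encodings are
CE B2 and CE 92 respectively.  So whatever makes src.Β match src.β has to be
a case-insensitive step somewhere else, not the bytes in the Makefile.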

> My guess would be that only characters within the locale, defined by
> the ANSI codepage, are supported by locale-aware functions in the C
> runtime.  That's because this is what happens even if you use "wide"
> Unicode APIs and/or functions like _wcsicmp that accept wchar_t
> characters: they all support only the characters of the current locale
> set by 'setlocale'.  I don't expect that to change just because UTF-8
> is used on the outside: internally, everything is converted to UTF-16,
> i.e. to the Windows flavor of wchar_t.

When the manifest is used to set the active code page of the process
to UTF-8, the current ANSI code page does become UTF-8, so that
might explain why the above example is working.

As mentioned in:

https://learn.microsoft.com/en-us/cpp/text/locales-and-code-pages?view=msvc-170

"Also, the run-time library might obtain and use the value of the operating system code page, which is constant for the duration of the program's execution."

This seems to be offering some kind of confirmation.

But this one looks most relevant to your point:

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170#utf-8-support

"Starting in Windows 10 version 1803 (10.0.17134.0), the Universal C Runtime supports using a UTF-8 code page. The change means that char strings passed to C runtime functions can expect strings in the UTF-8 encoding. To enable UTF-8 mode, use ".UTF8" as the code page when using setlocale. For example, setlocale(LC_ALL, ".UTF8") will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page."

src/main.c:1245 has:

setlocale (LC_ALL, "");

so this could be changed to:

setlocale (LC_ALL, ".UTF8");

conditionally on the Windows version above, but I'm not sure if that is even
necessary, given the UTF-8 manifest change.
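
If it did turn out to be necessary, one possible shape for it is sketched
below.  This is only an illustration, not a proposed patch: the function
name init_locale_prefer_utf8 is hypothetical, and instead of checking the
Windows version explicitly it simply tries ".UTF8" and falls back, on the
assumption that a runtime without UTF-8 support makes setlocale return NULL
for that string.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of GNU Make.  Tries the UTF-8 locale
   on Windows first, then the environment's default locale, then "C".
   Returns the locale string that setlocale reported.  */
const char *
init_locale_prefer_utf8 (void)
{
  const char *loc = NULL;

#ifdef _WIN32
  /* Documented for the Universal C Runtime on Windows 10 version 1803
     and later.  Assumption: other runtimes reject the string and
     return NULL, so we fall through to the old behavior.  */
  loc = setlocale (LC_ALL, ".UTF8");
#endif

  if (loc == NULL)
    loc = setlocale (LC_ALL, "");   /* what src/main.c does today */

  if (loc == NULL)
    loc = setlocale (LC_ALL, "C");  /* always available */

  return loc;
}
```

The try-and-fall-back approach avoids hard-coding a version number, at the
cost of relying on setlocale's failure behavior.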

From reading the above doc, my understanding is that embedding the UTF-8
manifest has an effect that covers the C runtime as well.  For example:

"UTF-8 mode is also enabled for functions that have historically translated char strings using the default Windows ANSI code page (ACP). For example, calling _mkdir("😊") while using a UTF-8 code page will correctly produce a directory with that emoji as the folder name, instead of requiring the ACP to be changed to UTF-8 before running your program. Likewise, calling _getcwd() in that folder will return a UTF-8 encoded string. For compatibility, the ACP is still used if the C locale code page isn't set to UTF-8."

I have highlighted the important parts in bold.

My point is, with the manifest embedded at build time, ACP will be UTF-8
already when the program (Make) runs, so there is no need to do anything more.

The setlocale advice in the doc is about how to use UTF-8 in the C runtime
if you don't have ACP == UTF-8.
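
For what it's worth, whether the manifest took effect can be checked at run
time by asking Windows for the active code page.  A minimal sketch, assuming
a hypothetical helper name acp_is_utf8 (GetACP and CP_UTF8 are the real
Win32 names, the rest is made up for illustration):

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper, not part of GNU Make.
   Returns 1 if the process's active ANSI code page is UTF-8 (65001),
   0 if it is some other code page, and -1 when not built for Windows.  */
int
acp_is_utf8 (void)
{
#ifdef _WIN32
  return GetACP () == CP_UTF8 ? 1 : 0;
#else
  return -1;
#endif
}
```

With the UTF-8 manifest embedded I would expect this to return 1 on a
Windows version that honors the manifest, and 0 otherwise.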

The Unicode -W APIs differ from the -A APIs in that they don't even
look at the current ANSI code page; they just use UTF-16.


On Sun, 19 Mar 2023 at 17:01, Eli Zaretskii <eliz@gnu.org> wrote:
> From: Costas Argyris <costas.argyris@gmail.com>
> Date: Sun, 19 Mar 2023 16:34:54 +0000
> Cc: bug-make@gnu.org, psmith@gnu.org
>
> > OK, but how is the make.exe you produced built?
>
> I actually did what you suggested but was somewhat confused with the
> result.  Usually I do this with 'ldd', but both msvcrt.dll and ucrtbase.dll
> show up in 'ldd make.exe' output, and I wasn't sure what to think of it.
>
> However, your approach with objdump gives fewer results and only
> lists msvcrt.dll, not ucrtbase.dll:
>
> C:\Users\cargyris\temp>objdump -p make.exe | grep "DLL Name:"
>         DLL Name: ADVAPI32.dll
>         DLL Name: KERNEL32.dll
>         DLL Name: msvcrt.dll
>         DLL Name: USER32.dll
>
> So I guess MSVCRT is enough, i.e. no need for UCRT.

Yes, thanks.

> > If you try using in a Makefile file names with non-ASCII
> > characters outside of the current ANSI codepage, does Make succeed to
> > recognize files mentioned in the Makefile whose letter-case is
> > different from what is seen in the file system?
>
> I think it does, here is the experiment:
>
> C:\Users\cargyris\temp>ls ❎
>  src.c
>
> There is only src.c in that folder.
>
> Makefile utf8.mk is UTF-8 encoded and has this content that
> checks for the existence of:
>
> ❎\src.c
> ❎\src.C
> ❎\src.cs
>
> where ❎ is outside the ANSI codepage (1252).

That's not a good experiment, IMO: the only non-ASCII character here
is U+274E, which has no case variants.  And the characters whose
letter-case you tried to change are all ASCII, so their case
conversions are unaffected by the locale.

> If I understand this correctly, both src.c and src.C should be found,
> but not src.cs (just to show a negative case as well).

In addition, I'm not sure Make actually compares file names somewhere,
I think it just calls 'stat', and that is of course case-insensitive
(because the filesystem is on the base level).

My guess would be that only characters within the locale, defined by
the ANSI codepage, are supported by locale-aware functions in the C
runtime.  That's because this is what happens even if you use "wide"
Unicode APIs and/or functions like _wcsicmp that accept wchar_t
characters: they all support only the characters of the current locale
set by 'setlocale'.  I don't expect that to change just because UTF-8
is used on the outside: internally, everything is converted to UTF-16,
i.e. to the Windows flavor of wchar_t.

> > Btw, there's one aspect where Make on MS-Windows will probably fall
> > short of modern Posix systems: the display of non-ASCII characters on
> > the screen.
>
> Indeed, some thoughts on that:
>
> 1) As you know, this is only affecting the visual aspect of the logs, not the
> inner workings of Make.  This could confuse users because they would
> be seeing "errors" on the screen, without there being any real errors.
> Perhaps a mention in the doc or release notes could remedy that.
>
> 2) To some extent (maybe even completely, I don't know) this can be
> mitigated with using PowerShell instead of the classic Command Prompt.
> This seems to be working in this case at least:

This could be just sheer luck: PowerShell uses a font that supports
that particular character.  The basic problem here is that "Command
Prompt" windows don't allow to configure more than one font for
displaying characters, and a single font can never support more than a
few scripts.  If PowerShell doesn't allow more than a single font in
its windows, it will suffer from the same problem.

> If anything, it could be worth a mention in the doc.

Yes, of course.
