[lldb][windows] force the console to use a UTF-8 codepage #149493

Status: Open (2 commits to be merged into base: main)

charles-zablit
Contributor

This patch sets the code page of the parent Windows console to UTF-8 and restores the original code page once lldb exits.

This fixes a rendering issue where the characters defined in DiagnosticsRendering.cpp ("╰" for instance) are not rendered properly on Windows out of the box, because the default code page is not UTF-8.

This solution is based on this SO thread and this patch downstream.

rdar://156064500

@llvmbot
Member

llvmbot commented Jul 18, 2025

@llvm/pr-subscribers-lldb

Author: Charles Zablit (charles-zablit)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/149493.diff

2 Files Affected:

  • (modified) lldb/source/Plugins/Platform/Windows/PlatformWindows.cpp (+20)
  • (modified) lldb/source/Plugins/Platform/Windows/PlatformWindows.h (+8)
diff --git a/lldb/source/Plugins/Platform/Windows/PlatformWindows.cpp b/lldb/source/Plugins/Platform/Windows/PlatformWindows.cpp
index c0c26cc5f1954..d3e981de81313 100644
--- a/lldb/source/Plugins/Platform/Windows/PlatformWindows.cpp
+++ b/lldb/source/Plugins/Platform/Windows/PlatformWindows.cpp
@@ -41,6 +41,10 @@ LLDB_PLUGIN_DEFINE(PlatformWindows)
 
 static uint32_t g_initialize_count = 0;
 
+#if defined(_WIN32)
+std::optional<UINT> g_prev_console_cp = std::nullopt;
+#endif
+
 PlatformSP PlatformWindows::CreateInstance(bool force,
                                            const lldb_private::ArchSpec *arch) {
   // The only time we create an instance is when we are creating a remote
@@ -98,6 +102,7 @@ void PlatformWindows::Initialize() {
     default_platform_sp->SetSystemArchitecture(HostInfo::GetArchitecture());
     Platform::SetHostPlatform(default_platform_sp);
 #endif
+    SetConsoleCodePage();
     PluginManager::RegisterPlugin(
         PlatformWindows::GetPluginNameStatic(false),
         PlatformWindows::GetPluginDescriptionStatic(false),
@@ -108,6 +113,7 @@ void PlatformWindows::Initialize() {
 void PlatformWindows::Terminate() {
   if (g_initialize_count > 0) {
     if (--g_initialize_count == 0) {
+      ResetConsoleCodePage();
       PluginManager::UnregisterPlugin(PlatformWindows::CreateInstance);
     }
   }
@@ -808,3 +814,17 @@ extern "C" {
 
   return Status();
 }
+
+void PlatformWindows::SetConsoleCodePage() {
+  #if defined(_WIN32)
+    g_prev_console_cp = GetConsoleOutputCP();
+    SetConsoleOutputCP(CP_UTF8);
+  #endif
+}
+
+void PlatformWindows::ResetConsoleCodePage() {
+  #if defined(_WIN32)
+  if (g_prev_console_cp)
+    SetConsoleOutputCP(*g_prev_console_cp);
+  #endif
+}
diff --git a/lldb/source/Plugins/Platform/Windows/PlatformWindows.h b/lldb/source/Plugins/Platform/Windows/PlatformWindows.h
index 771133f341e90..d14aa52e5e1c8 100644
--- a/lldb/source/Plugins/Platform/Windows/PlatformWindows.h
+++ b/lldb/source/Plugins/Platform/Windows/PlatformWindows.h
@@ -80,6 +80,14 @@ class PlatformWindows : public RemoteAwarePlatform {
   size_t GetSoftwareBreakpointTrapOpcode(Target &target,
                                          BreakpointSite *bp_site) override;
 
+  /// Set the current console's code page to UTF-8 and store the previous
+  /// codepage in \a g_prev_console_cp.
+  static void SetConsoleCodePage();
+
+  /// Reset the current console's code page to the value stored
+  /// in \a g_prev_console_cp if any.
+  static void ResetConsoleCodePage();
+
   std::vector<ArchSpec> m_supported_architectures;
 
 private:

@charles-zablit
Contributor Author

Before: [Screenshot 2025-07-18 at 12 24 31]

After: [Screenshot 2025-07-18 at 12 24 13]

github-actions bot commented Jul 18, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@DavidSpickett
Collaborator

I opened an issue for this: #142568.

There, @Nerixyz mentions that SetConsoleOutputCP might have problems in cmd.exe (which probably means conhost, the original Windows console host, as opposed to Windows Terminal, the new one).

@DavidSpickett changed the title from "[windows][lldb] force the console to use a UTF-8 codepage" to "[lldb][windows] force the console to use a UTF-8 codepage" on Jul 18, 2025
@DavidSpickett
Collaborator

Windows Terminal is the default on Windows 10 at least. I think buildbot launches things in conhost, but if UTF-8 were a problem for tests there, we would have seen it before now.

@charles-zablit
Contributor Author

charles-zablit commented Jul 18, 2025

> I opened an issue for this: #142568.
>
> There, @Nerixyz mentions that SetConsoleOutputCP might have problems in cmd.exe (which probably means conhost, the original Windows console host, as opposed to Windows Terminal, the new one).

From my understanding, the original issue in jq is that they did not reset the codepage after the program had exited. My patch does reset it if lldb gracefully exits.
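The "set on start, reset on graceful exit" behaviour described here can also be expressed as a scope guard. The following is only a minimal sketch, not the code in this patch: `ScopedConsoleUtf8` and `WillRestore` are hypothetical names, and the class compiles to a no-op outside Windows. It additionally checks the return values of `GetConsoleOutputCP` and `SetConsoleOutputCP`, which the patch as posted does not do.

```cpp
#include <optional>

#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper (not part of the patch): switches the console output
// code page to UTF-8 and restores the previous one when destroyed.
class ScopedConsoleUtf8 {
public:
  ScopedConsoleUtf8() {
#if defined(_WIN32)
    // GetConsoleOutputCP returns 0 on failure, e.g. when no console is
    // attached; in that case nothing is changed and nothing is restored.
    UINT prev = GetConsoleOutputCP();
    if (prev != 0 && prev != CP_UTF8 && SetConsoleOutputCP(CP_UTF8))
      m_prev = prev;
#endif
  }

  ~ScopedConsoleUtf8() {
#if defined(_WIN32)
    if (m_prev)
      SetConsoleOutputCP(*m_prev);
#endif
  }

  ScopedConsoleUtf8(const ScopedConsoleUtf8 &) = delete;
  ScopedConsoleUtf8 &operator=(const ScopedConsoleUtf8 &) = delete;

  // True only if the code page was actually changed and will be restored.
  bool WillRestore() const { return m_prev.has_value(); }

private:
  std::optional<unsigned> m_prev;
};
```

As with the patch, the restore only runs on a graceful exit: if the process is terminated abruptly, the destructor never runs and the console keeps the UTF-8 code page.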

The /utf-8 approach seems promising as well, but I find it suspicious that no other project uses it.

Another temporary fix, while we figure out a long-term solution, could be to force the use of ASCII characters on Windows.

@charles-zablit
Contributor Author

charles-zablit commented Jul 18, 2025

> Windows Terminal is the default on Windows 10 at least. I think buildbot launches things in conhost, but if UTF-8 were a problem for tests there, we would have seen it before now.

Maybe I misunderstood your comment, but I was able to reproduce this with the latest release of lldb in the Windows 11 terminal.

[Screenshot 2025-07-18 at 13 06 14]

@DavidSpickett
Collaborator

Yes, that makes sense. Windows Terminal doesn't default to UTF-8 either. I was thinking of something else.

My concern was: what if we added this new code, it tried to set UTF-8, and a test relied on that? However, this is not a problem because we already test this annotation feature via conhost and it has no problems. So even if the calls do nothing, it doesn't matter.

In other words: tests on Windows aren't scraping the output of the terminal, they'll be reading strings internally and not care about the code page.

Which is a good thing.

@charles-zablit
Contributor Author

charles-zablit commented Jul 18, 2025

From my understanding of the thread you linked, there are 3 ways to approach this, plus a bonus one:

  1. Switch to ASCII characters on Windows instead of the "╰" character. This is by far the easiest way to fix this specific rendering issue, but it does not address the root issue: debugging a program that contains non-ASCII characters will still break.
  2. Set the code page when lldb starts and reset it when lldb exits. This used to be a no-go because it would cause some resizing in CMD.exe, but that was over 6 years ago. Terminal is the default console in Windows 11 as of 2022.
  3. Use /execution-charset:utf-8 as @Nerixyz suggested. I will start a build of lldb with this change. If this does not have the same problems as manually setting the code page, this sounds like the most appealing solution.
  4. (bonus) Go the Python way and build a wrapper using WriteConsoleW, which would properly address the issue. This would however require a lot of engineering (this resolved #35615, "Less than ideal handling of variable names in Cyrillic alphabet").
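To illustrate option 1, here is a rough sketch of picking the glyph at runtime depending on whether the console will decode output as UTF-8. Both function names are hypothetical, the backtick fallback is an arbitrary placeholder rather than what DiagnosticsRendering.cpp would use, and treating every non-Windows terminal as UTF-8 is an assumption made only to keep the sketch portable.

```cpp
#include <string_view>

#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper: reports whether bytes written to the console will be
// decoded as UTF-8.
inline bool ConsoleDecodesUtf8() {
#if defined(_WIN32)
  return GetConsoleOutputCP() == CP_UTF8;
#else
  return true; // Assumption: non-Windows terminals are treated as UTF-8.
#endif
}

// Hypothetical helper: returns "╰" (U+2570) as UTF-8 bytes when the console
// can render it, otherwise a plain ASCII stand-in.
constexpr std::string_view JoinerGlyph(bool utf8) {
  return utf8 ? std::string_view("\xE2\x95\xB0") : std::string_view("`");
}
```

A caller would write `JoinerGlyph(ConsoleDecodesUtf8())` to the output stream. As noted in the list, this only covers lldb's own decorations; non-ASCII text coming from the debugged program is unaffected.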

@charles-zablit
Contributor Author

I added add_compile_options(/execution-charset:utf-8) to llvm-project\lldb\CMakeLists.txt; however, that did not fix the issue.
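For context, /execution-charset governs how the compiler encodes string literals in the binary, whereas the console code page governs how the console decodes the bytes it receives, so the two settings are independent. That may be why the flag alone made no visible difference, although the thread does not confirm the cause. The sketch below shows the bytes involved; `kArcUpRight` and `HexBytes` are hypothetical names, and the literal is spelled with explicit escapes so it does not depend on any compiler charset flag.

```cpp
#include <cstdio>
#include <string>

// UTF-8 encoding of U+2570 ("╰"), written as escapes so the bytes in the
// binary are the same regardless of source or execution charset settings.
constexpr char kArcUpRight[] = "\xE2\x95\xB0";

// Formats each byte of `s` as two uppercase hex digits separated by spaces.
inline std::string HexBytes(const std::string &s) {
  std::string out;
  char buf[4];
  for (unsigned char c : s) {
    std::snprintf(buf, sizeof buf, "%02X", c);
    if (!out.empty())
      out += ' ';
    out += buf;
  }
  return out;
}
```

`HexBytes(kArcUpRight)` yields `E2 95 B0`. A console that decodes output with a single-byte code page shows those three bytes as three separate characters; under code page 437, for instance, they would appear as "Γò░" instead of "╰".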

@Nerixyz
Contributor

Nerixyz commented Jul 18, 2025

> I added add_compile_options(/execution-charset:utf-8) to llvm-project\lldb\CMakeLists.txt; however, that did not fix the issue.

Then setting the code page is probably the best idea, requiring the least amount of effort.

As far as I know, WriteConsoleW is the proper way to get Unicode on the console (this resolved #35615). However, that would require a new output path, because currently, everything goes through the stdout FD and the C API.

@charles-zablit
Contributor Author

> As far as I know, WriteConsoleW is the proper way to get Unicode on the console (this resolved #35615). However, that would require a new output path, because currently, everything goes through the stdout FD and the C API.

Thanks for clarifying, I corrected the 4th option.

2b8c692 does not look like such a big change. Is lldb's mechanism for printing so different from clang's? I can't find where a Stream actually gets "printed" to stdout.

@Nerixyz
Contributor

Nerixyz commented Jul 18, 2025

> I can't find where a Stream actually gets "printed" to stdout.

A NativeFile stream is used, which is created here.

> 2b8c692 does not look like such a big change. Is lldb's mechanism for printing so different from clang's?

I agree, that would probably be of a similar size here (i.e. check for the file handle and call the windows impl if needed).
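The "check for the file handle and call the Windows implementation if needed" idea could look roughly like the sketch below. This is not lldb's NativeFile code nor the clang change referenced above: `WriteUtf8` is a hypothetical free function, and it assumes the input is valid UTF-8. On non-Windows platforms, or when the stream is not a real console, it falls back to a plain `fwrite`.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

#if defined(_WIN32)
#include <io.h>
#include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to `out`. Returns true on success.
inline bool WriteUtf8(std::FILE *out, std::string_view text) {
  if (text.empty())
    return true;
#if defined(_WIN32)
  HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(out)));
  DWORD mode = 0;
  // GetConsoleMode only succeeds for real console handles, not for pipes or
  // files, so redirected output takes the fwrite path below.
  if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
    std::fflush(out); // Keep ordering with earlier buffered C-API output.
    const int len = static_cast<int>(text.size());
    int wide_len =
        MultiByteToWideChar(CP_UTF8, 0, text.data(), len, nullptr, 0);
    if (wide_len <= 0)
      return false;
    std::wstring wide(static_cast<size_t>(wide_len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, text.data(), len, wide.data(), wide_len);
    DWORD written = 0;
    // WriteConsoleW takes UTF-16 directly, independent of the code page.
    return WriteConsoleW(h, wide.data(), static_cast<DWORD>(wide_len),
                         &written, nullptr) != 0;
  }
#endif
  return std::fwrite(text.data(), 1, text.size(), out) == text.size();
}
```

Because the console branch bypasses the code page entirely, this approach would not need to change or restore any console state, at the cost of adding a separate output path next to the existing stdout FD one.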
