In 2021 I found a huge memory leak in VS Code, totalling around 64 GB when I first saw it, but with no actual limit on how high it could go. I found this leak despite two obstacles that should have made the discovery impossible:
- The memory leak didn’t show up in Task Manager – there was no process whose memory consumption was increasing.
- I had never used VS Code. In fact, I have still never used it.
So how did this work? How did I find an invisible memory leak in a tool that I have never used?
This was during lockdown and my whole team was working from home. In order to maintain connection between teammates and in order to continue transferring knowledge from senior developers to junior developers we were doing regular pair-programming sessions. I was watching a coworker use VS Code for… I don’t remember what… and I noticed something strange.
So many of my blog posts start this way. “This doesn’t look right”, or “huh – that’s weird”, or some variation on that theme. In this case I noticed that the process IDs on her system had seven digits.
That was it. And as soon as I saw that I knew that there was a process-handle leak on her system and I was pretty sure that I would find it. Honestly, the rest of this story is pretty boring because it was so easy.
You see, Windows process IDs are just numbers. For obscure technical reasons they are always multiples of four. When a process goes away its ID is eligible for reuse immediately. Even if there is a delay before the process ID (PID) is reused there is no reason for the highest PID to be much more than four times the maximum number of processes that were running at one time. If we assume a system with 2,000 processes running (according to pslist my system currently has 261) then PIDs should be four decimal digits. Five decimal digits would be peculiar. But seven decimal digits? That implies at least a quarter-million processes. The PIDs I was seeing on her system were mostly around four million, which implies a million processes. Nope. I do not believe that there were that many processes.
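The arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope estimate under the assumptions already stated (PIDs are multiples of four and get reused once released); EstimatePidSlotsInUse is a hypothetical helper name, not a Windows API.

```cpp
#include <cstdint>

// Hypothetical helper: given the highest PID seen on a system, estimate
// how many PID slots must have been in use at the same time.
// Assumes PIDs are handed out in steps of four, as described above.
constexpr std::uint64_t EstimatePidSlotsInUse(std::uint64_t highest_pid) {
  return highest_pid / 4;
}

// A busy but healthy machine: a four-digit PID covers 2,000 processes.
static_assert(EstimatePidSlotsInUse(8000) == 2000,
              "four digits is enough for 2,000 processes");
// The smallest seven-digit PID already implies a quarter-million slots.
static_assert(EstimatePidSlotsInUse(1000000) == 250000,
              "seven digits implies at least 250,000 slots");
// PIDs around four million imply about a million slots.
static_assert(EstimatePidSlotsInUse(4000000) == 1000000,
              "four million implies about a million slots");
```

The estimate is a lower bound on slots, not a count of live processes: a slot can be held by a process that has exited but still has an open handle pointing at it.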
It turns out that “when a process goes away its ID is eligible for reuse” is not quite right. If somebody still has a handle to that process then its PID will be retained by the OS. Forever. So it was quite obvious what was happening. Somebody was getting a handle to processes and then wasn’t closing them. It was a handle leak.
I have written before about the techniques to investigate handle leaks of all kinds. Therefore this time I just followed my own recipe. Task Manager showed me which process was leaking handles:
And an ETW trace gave me a call stack for the leaking code within the hour (this image stolen from the GitHub issue):
The bug was pretty straightforward. A call to OpenProcess was made, and there was no corresponding call to CloseHandle. And because of this a boundless amount of memory – roughly 64 KiB for each missing CloseHandle call – was leaked. A tiny mistake, with consequences that could easily consume all of the memory on a high-end machine.
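To see how quickly that adds up, here is a rough calculation. It is a sketch, assuming the approximate 64 KiB figure above and assuming that every leaked handle keeps a different exited process alive; LeakedBytes and kBytesPerZombie are names made up for illustration.

```cpp
#include <cstdint>

// Approximate memory retained per leaked process handle (figure from the
// text above; the real cost varies).
constexpr std::uint64_t kBytesPerZombie = 64 * 1024;  // 64 KiB

// Hypothetical helper: memory pinned by `zombie_count` leaked handles,
// assuming each one keeps a distinct exited process alive.
constexpr std::uint64_t LeakedBytes(std::uint64_t zombie_count) {
  return zombie_count * kBytesPerZombie;
}

// A single missed close is harmless on its own.
static_assert(LeakedBytes(1) == 65536, "one zombie costs 64 KiB");
// A million of them retain about 65.5 billion bytes (roughly 61 GiB).
static_assert(LeakedBytes(1000000) == 65536000000ULL,
              "a million zombies cost about 65.5 GB");
```

A million zombies matches the PIDs of around four million mentioned earlier, and lands close to the 64 GB total.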
This is the buggy code (yay open source!):
void GetProcessMemoryUsage(ProcessInfo process_info[1024], uint32_t* process_count) {
  DWORD pid = process_info[*process_count].pid;
  HANDLE hProcess;
  PROCESS_MEMORY_COUNTERS pmc;
  hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, false, pid);
  if (hProcess == NULL) {
    return;
  }
  if (GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc))) {
    process_info[*process_count].memory = (DWORD)pmc.WorkingSetSize;
  }
}
And this is the code with the fix – the CloseHandle call at the end of the function was added to fix the leak:
void GetProcessMemoryUsage(ProcessInfo& process_info) {
  DWORD pid = process_info.pid;
  HANDLE hProcess;
  PROCESS_MEMORY_COUNTERS pmc;
  hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, false, pid);
  if (hProcess == NULL) {
    return;
  }
  if (GetProcessMemoryInfo(hProcess, &pmc, sizeof(pmc))) {
    process_info.memory = (DWORD)pmc.WorkingSetSize;
  }
  CloseHandle(hProcess);  // The fix: release the process handle.
}
That’s it. One missing line of code is all that it takes to waste tens of GB of memory.
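One way to make this class of bug harder to write is to tie the handle's lifetime to a scope (RAII). What follows is a minimal sketch of that idea, not the fix that was actually applied. ScopedResource, FakeOpen, FakeClose, and QueryWithGuard are hypothetical names, and the fake functions stand in for OpenProcess and CloseHandle so that the example compiles on any platform.

```cpp
#include <utility>

// Generic scope guard: owns a resource of type T and runs Closer on it
// exactly once when the guard goes out of scope, unless the resource
// equals the `invalid` sentinel.
template <typename T, typename Closer>
class ScopedResource {
 public:
  ScopedResource(T resource, T invalid, Closer closer)
      : resource_(resource), invalid_(invalid), closer_(std::move(closer)) {}
  ~ScopedResource() {
    if (resource_ != invalid_) closer_(resource_);
  }
  // Non-copyable so the resource can never be closed twice.
  ScopedResource(const ScopedResource&) = delete;
  ScopedResource& operator=(const ScopedResource&) = delete;

  T get() const { return resource_; }
  bool valid() const { return resource_ != invalid_; }

 private:
  T resource_;
  T invalid_;
  Closer closer_;
};

// Stand-ins for OpenProcess/CloseHandle (assumptions for illustration).
// On Windows the closer would call CloseHandle and the sentinel is NULL.
int g_open_handles = 0;
int FakeOpen(bool succeed) {
  if (!succeed) return 0;  // 0 plays the role of a NULL handle.
  ++g_open_handles;
  return 42;
}
void FakeClose(int) { --g_open_handles; }

// Same shape as GetProcessMemoryUsage: every return path releases the
// handle without an explicit close call.
bool QueryWithGuard(bool open_succeeds, bool query_succeeds) {
  ScopedResource<int, void (*)(int)> handle(FakeOpen(open_succeeds), 0,
                                            &FakeClose);
  if (!handle.valid()) return false;  // Nothing was opened.
  if (!query_succeeds) return false;  // Early return, handle still closed.
  return true;
}
```

With a guard like this, adding a new early return later cannot reintroduce the leak, because the close happens in the destructor rather than on one particular code path.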
The bug was found back when I still used Twitter, so I reported it there and somebody filed a GitHub issue based on my report. I stopped using Twitter a couple of years later and then my account got banned (due to not being used?) and then deleted, so now that bug report along with everything else I ever posted is gone. That’s pretty sad actually. Yet another reason for me to dislike the owner of Twitter.
The bug was fixed within a few days of the report. Maybe The Great Software Quality Collapse hadn’t quite started then. Or maybe I got lucky.
Anyway, if you don’t want me posting embarrassing stories about your software on my blog or on bsky then be sure to leave the Handles column open in Task Manager and pay attention if you ever see it getting too high in a process that you are responsible for.
Sometimes I think it would be nice to have limits on resources in order to more automatically find mistakes like this. If processes were automatically crashed (with crash dumps) whenever memory or handles exceeded some limit then bugs like this would be found during testing. The limits could be set higher for software that needs it, but 10,000 handles and 4 GiB RAM would be more than enough for most software when operating correctly. The tradeoff would be more crashes in the short term but fewer leaks in the long term. I doubt it will ever happen, but if this mode existed as a per-machine opt-in then I would enable it.
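No such OS mode exists as far as I know, but a process can approximate it by checking itself. The sketch below is an in-process self-check under that assumption, not an operating system feature. ResourceBudget, ExceedsBudget, and EnforceBudget are hypothetical names, and the caller is assumed to supply the current counts (on Windows these could come from GetProcessHandleCount and GetProcessMemoryInfo).

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical per-process limits.
struct ResourceBudget {
  std::uint64_t max_handles;
  std::uint64_t max_bytes;
};

// The limits suggested above: 10,000 handles and 4 GiB of memory.
constexpr ResourceBudget kDefaultBudget = {10000, 4ULL * 1024 * 1024 * 1024};

// True if either count is over its limit.
constexpr bool ExceedsBudget(std::uint64_t handles, std::uint64_t bytes,
                             const ResourceBudget& budget) {
  return handles > budget.max_handles || bytes > budget.max_bytes;
}

// Terminates the process loudly when the budget is exceeded. Whether a
// crash dump is written depends on how the system is configured.
inline void EnforceBudget(std::uint64_t handles, std::uint64_t bytes,
                          const ResourceBudget& budget) {
  if (ExceedsBudget(handles, bytes, budget)) {
    std::fprintf(stderr, "resource budget exceeded: %llu handles, %llu bytes\n",
                 static_cast<unsigned long long>(handles),
                 static_cast<unsigned long long>(bytes));
    std::abort();
  }
}
```

Calling EnforceBudget periodically in test builds would turn a slow leak into an immediate, attributable failure, which is the tradeoff described above.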
This is identical to a problem I ran into: a process had opened a handle to a subprocess to collect some information and had one code path that failed to close the handle. (Which is why we should all start using RAII objects in C++.) This went out in a commercial product! Instead of developing a tool like you :-), I used Sysinternals Process Explorer to find the dangling handles. While the tool doesn’t pinpoint where the handle leaks from, knowing the code, it was pretty straightforward to home in on it.
Oh, and a one-line fix solved the problem too 😉
+1 to RAII, that naked OpenProcess() call is like a naked “new”. I’d hope that static analysis could’ve picked this up.
Now we have AI, I asked ChatGPT to review the code:
Process handle not closed
You call OpenProcess, but never call CloseHandle(hProcess). That leaks a handle every time the function is called.
(It didn’t like the *process_count dereference without bounds checking either)
Good to see another blog post! FWIW, one of the Web archive versions has the Twitter thread, even some pictures are there: https://web.archive.org/web/20220506224017/https://twitter.com/BruceDawson0xB/status/1447668569626476548
Thanks Bruce, I love your posts! And you made me curious. I rebooted my machine and went straight into proc exp. Highest PID is 2242 and 260 processes are running.
So I suppose something is leaking handles, a lot. And I haven’t even started VS Code yet. 😜
Sorry Bruce, there was a typo and because I could not post the entire proc exp image here, I just mistyped the number. The highest PID was 22424, which indicated about 5600 processes, right after the boot.
Will run FindZombieHandles right now.
I did run it. Nothing explains this case:
23 total zombie processes.
14 total zombie threads.
11 zombies held by HPPrintScanDoctorService.exe(5080)
11 zombies of HPSUPD-Win32Exe.exe – process handle count: 11 – thread handle count: 11
2 zombies held by WMIRegistrationService.exe(5636)
2 zombies of mofcomp.exe – process handle count: 2 – thread handle count: 0
1 zombie held by com.docker.backend.exe(21500)
1 zombie of wsl.exe – process handle count: 1 – thread handle count: 0
1 zombie held by devenv.exe(8620)
1 zombie of PerfWatson2.exe – process handle count: 1 – thread handle count: 1
1 zombie held by vmcompute.exe(4592)
1 zombie of vmwp.exe – process handle count: 1 – thread handle count: 0
1 zombie held by NVDisplay.Container.exe(2728)
1 zombie of dbInstaller.exe – process handle count: 1 – thread handle count: 1
1 zombie held by svchost.exe(2580)
1 zombie of userinit.exe – process handle count: 1 – thread handle count: 0
@Bruce do you have a consolidated approach to identify the abundance of PIDs during boot?
Is it running Windows Performance Recorder through a boot cycle and then use some Randomascii view in WPA? I clicked through the links in your post but was not sure what the latest “how I’ve done it and it worked” actually was.
By chance I was looking at the handle count of the System process and it is leaking handles. After about an hour I am already at 6715 handles and it keeps increasing. It does drop sometimes but I will keep an eye on it and see if it is slowly rising.
I followed your advice and added the Handles column to Task Manager and the biggest offender on my machine right now is OUTLOOK.EXE at 14,537 handles!
I wish there was an online course that taught this stuff and WinDbg without having extensive knowledge of assembly and Windows internals as a prerequisite.
To be clear, I want to also learn x64 assembly and Windows internals. I just want it all in one big course or sequence of courses.
Another +1 to RAII here. Not only does it help release resources, it also communicates intent.
For example here it sort of looks like *buffer is never released: https://github.com/microsoft/vscode-windows-process-tree/blob/bc0ee891ca3df19dad46b023e3bb1266dfd1a205/src/process_commandline.cc#L51C11-L70C12
Is that true? I don’t know and I don’t really have time to investigate. It was just the first line of the first file I looked at after clicking your link. I’ve spent more time on this comment. With RAII it would be obvious and no more thought needed.