Making VirtualAlloc Arbitrarily Slower

As part of my continuing exploration of things that are slower than they should be, I recently came across a problem with VirtualAlloc. If you use the wrong flag with this function then, depending on how much memory you allocate, you may find it running 1,000 times slower, or perhaps even far worse.

This is the zeroth entry in an ongoing series about how to make Windows slower. The next part is Making Windows Slower Part 1: File Access.

The gory details

VirtualAlloc is a Windows function used for low-level memory allocation. It lets you reserve regions of virtual address space and commit pages within those regions.
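
For readers who haven't used it, here is the basic reserve-then-commit pattern (a minimal sketch of my own, not code from the application discussed below):

#include <Windows.h>
#include <string.h>

int main()
{
    // Reserve 1 MB of address space; no physical memory or page file is charged yet.
    const size_t regionSize = 1024 * 1024;
    void* region = VirtualAlloc(NULL, regionSize, MEM_RESERVE, PAGE_NOACCESS);
    if (!region)
        return 1;

    // Commit the first 64 KB of the reservation so it can actually be touched.
    void* pages = VirtualAlloc(region, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
    if (pages)
        memset(pages, 0, 64 * 1024);

    // Release the whole reservation (this also decommits any committed pages).
    VirtualFree(region, 0, MEM_RELEASE);
    return 0;
}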

I recently received a report that an application running on one of our servers was starting up significantly more slowly on machines with more memory. By ‘more slowly’ I mean ridiculously slowly. When we upgraded to 288 GB (yep, that’s a metric boatload of memory) the startup time went from 3 minutes to 25 minutes.

Ouch.

This application reserves most of the machine's memory for a cache, and it does so in 1 MB chunks, so on a 288 GB machine that meant roughly 250,000 calls to VirtualAlloc. It was these memory allocations that were taking 25 minutes. And honestly, even 3 minutes is far too long.
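
(288 GB is 288 × 1024 = 294,912 MB, so reserving 'most' of the machine's memory in 1 MB chunks works out to roughly 250,000 VirtualAlloc calls.)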

A bit of work with ETW tracing together with an inspired guess quickly led to the solution:

Don’t use the VirtualAlloc MEM_TOP_DOWN flag.

The MEM_TOP_DOWN flag tells VirtualAlloc to allocate from the top of the address space. This is helpful when looking for bugs in an application that has just been ported to 64-bit, because it ensures that allocations land at addresses that don't fit in 32 bits, so pointer-truncation bugs show up immediately.
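
As a quick illustration (my own hypothetical check, not the test application's code), a 64-bit process can verify that top-down allocations really do land above 4 GB:

#include <Windows.h>
#include <stdint.h>
#include <stdio.h>

int main()
{
    // Allocate a single page from the top of the address space.
    void* p = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE | MEM_TOP_DOWN,
                           PAGE_READWRITE);
    if (p)
    {
        // In a 64-bit process this address will not fit in 32 bits, so any code
        // that truncates pointers to 32 bits will fail quickly and visibly.
        printf("Allocated at %p, above 4 GB: %s\n", p,
               (uintptr_t)p > 0xFFFFFFFFull ? "yes" : "no");
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}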

Nowhere is there any warning that VirtualAlloc with this flag does not scale well. It appears to use an O(n^2) algorithm, where 'n' is related to the number of allocations you have made with this flag. From profiling it looks like each allocation slowly and methodically scans the list of reserved memory regions to make sure that it finds the hole with the highest address.
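
I haven't seen the implementation, so the following is just a toy model of the behavior that the profile suggests: if every top-down allocation re-walks the list of existing reservations from the top, then n allocations do roughly 1 + 2 + ... + n ≈ n²/2 units of work in total.

#include <stdio.h>
#include <vector>

// Toy simulation (not Windows code): addresses are integers, every block is one
// unit, and each new allocation scans all existing blocks from the top down to
// find the highest free slot. Total work grows quadratically with the count.
int main()
{
    const int top = 1000000;      // pretend this is the top of the address space
    const int numAllocs = 20000;
    std::vector<int> reserved;    // allocated addresses, highest first
    long long visits = 0;

    for (int n = 0; n < numAllocs; ++n)
    {
        int candidate = top;
        for (int addr : reserved) // walk existing reservations from the top
        {
            ++visits;
            if (addr == candidate)
                --candidate;      // slot taken, try the next address down
            else
                break;            // found the highest free slot
        }
        reserved.push_back(candidate);   // blocks land at top, top-1, top-2, ...
    }
    printf("%d allocations required %lld scans of existing blocks (~n^2/2).\n",
           numAllocs, visits);
    return 0;
}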

When this flag was removed the time went from 25 minutes to instantaneous. Okay, it probably wasn't instantaneous, but it was so fast that it wasn't worth measuring.

Testing

I decided to run some tests. I don't have a 288 GB server, so I ran the tests on my underpowered 8 GB laptop. After some experimentation I ended up with code that would VirtualAlloc a bunch of 64 KB address ranges, then free them, and then do it again. All allocations were done with the flags MEM_COMMIT | MEM_RESERVE | MEM_TOP_DOWN. Each pass allocated more blocks, going from 1,000 up to 64,000. I made my test program a 64-bit app so that it could easily handle the ~4 GB of address space this required.

I dropped the data into Excel and it made the most perfect O(n^2) graph I have ever seen:

I then tried it without the MEM_TOP_DOWN flag and the results were strikingly different:

Not only has the graph gone from quadratic to linear, but check out the change in the scale. With MEM_TOP_DOWN VirtualAlloc is, for large numbers of allocations, over a thousand times slower, and getting slower all the time!

It’s amusing to put both results on one chart just to see the blue graph (VirtualAlloc without MEM_TOP_DOWN) being indistinguishable from a horizontal line at zero.

Just to be thorough I grabbed an ETW trace. ETW is great for this kind of stuff because it can grab full call stacks all the way from the implementation of VirtualAlloc down to my test code. Here’s the relevant excerpt of the CPU Usage (Sampling) Table:

Over the 8.23 seconds that I looked at, the samples show that approximately 8,155 milliseconds (8.155 seconds) were spent trying to find the next address, rather than actually adjusting page tables and allocating memory.
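
If you want to capture a similar trace yourself, one way to do it (using xperf from the Windows Performance Toolkit; the trace file name is just an example) looks roughly like this:

    xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile
    (run the test program)
    xperf -d VirtualAllocTrace.etl

Then load the resulting .etl file and look at the CPU Usage (Sampling) summary table with the stack column enabled to see where the time goes.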

Conclusion

Don’t use MEM_TOP_DOWN, except for testing. That’s pretty obvious.

Here’s the test code. No warranty. Don’t remove the copyright notice. Feel free to otherwise use and modify. Compile as 64-bit.

// Copyright Bruce Dawson

#include <Windows.h>
#include <assert.h>
#include <stdio.h>
#include <tchar.h>

const size_t MaxAllocs = 64000;

__int64 GetQPC()
{
    LARGE_INTEGER time;
    QueryPerformanceCounter( &time );
    return time.QuadPart;
}

double GetQPF()
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency( &frequency );
    return (double)frequency.QuadPart;
}

void* pMem[MaxAllocs];

void TimeAllocTopDown(bool topDown, const size_t pageSizeBytes, size_t count)
{
    DWORD flags = MEM_COMMIT | MEM_RESERVE;
    if (topDown)
        flags |= MEM_TOP_DOWN;
    assert(count <= _countof(pMem));

    // Time the allocations.
    __int64 start = GetQPC();
    for (size_t i = 0; i < count; ++i)
    {
        pMem[i] = VirtualAlloc( NULL, pageSizeBytes, flags, PAGE_READWRITE );
        assert(pMem[i]);
    }
    __int64 elapsed = GetQPC() - start;

    if (topDown)
        printf("Topdown");
    else
        printf("Normal ");
    printf(" allocs took %7.4f s for %5lld allocs of %lld KB blocks (total is %lld MB).\n",
           elapsed / GetQPF(), (long long)count, (long long)pageSizeBytes / 1024,
           (long long)(count * pageSizeBytes) / (1024 * 1024));

    // Free the memory.
    for (size_t i = 0; i < _countof(pMem); ++i)
    {
        if (pMem[i])
            VirtualFree( pMem[i], 0, MEM_RELEASE );
        pMem[i] = 0;
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    bool topDown = true;
    // Currently only tests 64 KB blocks, but the loop makes it easy to add more sizes.
    for (size_t blockSize = 64 * 1024; blockSize <= 64 * 1024; blockSize += blockSize)
    {
        for (int i = 0; i <= 64; ++i)
        {
            TimeAllocTopDown(topDown, blockSize, i * 1000);
        }
    }
    return 0;
}


20 Responses to Making VirtualAlloc Arbitrarily Slower

  1. Karl says:

    I think the purpose of the MEM_TOP_DOWN flag is for when you allocate a region of space that is going to be used for a long time, not for situations where it needs to be reallocated lots of times. It allows you to allocate a region of address space that is located far away from where more frequent allocation and deallocation happen.

    For example: In a program I'm currently writing I have a facility to edit a bunch of text documents. These documents are backed by memory mapped files (when saved) but the edit history is backed by pages in a virtual memory region. This virtual memory region is supposed to exist for as long as the application is running, so it is useful to put it at the far end of the process user-mode address space, away from all the main action of the application. So I use the MEM_TOP_DOWN flag to put a region (large enough to support an extremely heavy use of my editing facility) at the far end of the address space. I don't use this flag when committing pages from within this region. So far I use this flag once in my application's lifetime and I can't see any bad impact on my application.

    So I think in your situation you are probably right in avoiding this flag. But it is a useful flag when creating long lived regions.

    Regards
    Karl
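
    A minimal sketch of the pattern Karl describes (illustrative only, with made-up sizes; not his code): reserve the long-lived region once with MEM_TOP_DOWN, then commit within it without the flag.

    #include <Windows.h>

    int main()
    {
        // Reserve one large, long-lived region at the top of the address space.
        // MEM_TOP_DOWN is used exactly once here, so its cost is irrelevant.
        const size_t regionSize = 1ULL << 30;   // 1 GB of address space, for example
        char* region = (char*)VirtualAlloc(NULL, regionSize,
                                           MEM_RESERVE | MEM_TOP_DOWN, PAGE_NOACCESS);
        if (!region)
            return 1;

        // Later, commit pages inside the region as needed; MEM_TOP_DOWN is not used here.
        void* committed = VirtualAlloc(region, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
        if (committed)
        {
            // ... use the memory ...
        }

        VirtualFree(region, 0, MEM_RELEASE);
        return 0;
    }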

  2. Karl says:

    It is true that these addresses won't fit in 32 bits. The purpose in my case is to avoid external fragmentation of the address space. This is usually more of an issue on 32-bit systems. And although the 64-bit virtual address space is absolutely huge, the user-mode partition is limited to about 8192 GB on x64 Windows. If you use a lot of your address space then reserving a long-lived region in the middle of it can cause external fragmentation problems: things would have to be allocated/reserved around this region, and some things might not fit “under” the region, which may waste that address space. It is generally not a big problem for 64-bit applications as the address space is so big, but if you can live with 64-bit pointers, using this flag is not a bad thing in my opinion.

    Note that on 64-bit Windows you are not guaranteed to get ’32-bit sized pointers’ by not using this flag; it entirely depends on how much of your address space is already in use and the size of the region that you are reserving.

    • Karl says:

      I think your point is still very valid for the situation that you describe, and for 64-bit Windows this flag is not really necessary unless you are using extreme amounts of address space.

  3. L.C. says:

    Thanks very much for bringing this to your readers’ attention!

    FYI dlmalloc (which is a commonly used memory allocator replacement) calls VirtualAlloc with MEM_TOP_DOWN for any large allocations (by default anything >= 256k). Having such a large threshold puts a very reasonable upper bound on the maximum number of these allocations you can have in a 32-bit process, and that thankfully limits the penalty of doing N+1 large allocs. In my testing with 1 GB worth of 256k top-down allocations, the penalty for doing another was less than 0.2 ms.

    However, if you’re in 64-bit you may want to increase dlmalloc’s DEFAULT_MMAP_THRESHOLD to compensate for the likely increase in the number of simultaneous large allocations due to the increased address space.

  4. dithermaster says:

    Two things:
    1. I think you mean “does it in 1 MB chunks” instead of “does it in 1 GB chunks” because the latter would only make 288 calls instead of 250,000.
    2. For 64-bit pointer truncation testing we use the "AllocationPreference" registry setting (described here).

    • dithermaster says:

      Worked for me; it’s what we used for 64-bit development to detect pointer truncation.
      Note: Changing the AllocationPreference requires a restart to take effect.

      • dithermaster says:

        Thanks for testing this. I’m also sad to see that it is also O(N^2)! Your technique in the linked post ensures high-bits-set pointers but as you’ve said, it sure would be nice if there was a non-performance-robbing simple flag that did it. Cheers.
