Hidden Costs of Memory Allocation

It’s important to understand the cost of memory allocations, but this cost can be surprisingly tricky to measure. It seems reasonable to measure this cost by wrapping calls to new[] and delete[] with timers. However, for large buffers these timers may miss over 99% of the true cost of these operations, and these hidden costs are larger than I had expected.

Further complicating these measurements, it turns out that some of the cost may be charged to another process and will therefore not show up in any timings that you might plausibly make.

In this (Windows specific) post I’m going to explain what these hidden costs are, how they hide, how to measure them, and what you should do about it.

Quick test

Which of these do you need to measure to fully record the cost of freeing and reallocating memory rather than reusing it?

  1. Time spent allocating and freeing the memory
  2. Time spent using the memory
  3. CPU time consumed in the system process
  4. All of the above

Spoiler alert, the answer is #4. For sufficiently large blocks of memory the time to allocate and free memory is only part of the CPU cost associated with the allocation. Details to follow.

I like big blocks and I cannot lie

In this post I am exclusively focusing on allocating large blocks of memory – those that are large enough that the heap delegates the allocation to the operating system. I am ignoring the low-fragmentation heap, heap contention, and heap fragmentation in order to focus on the operating system issues of allocating large blocks of memory. This is a real issue that I have encountered multiple times, and fixing it can give an easy performance win, but I don’t want to try to account for all memory allocation issues in one post.

Measuring allocating and freeing

Measuring the cost of allocating and freeing memory seems easy. Just call new[]/delete[] in a loop for various block sizes and measure how long they take. In my tests, for sizes ranging from 8 MB to 32 MB, the cost for a new[]/delete[] pair averaged about 7.5 μs (microseconds), split into ~5.0 μs for the allocation and ~2.5 μs for the free. For the large allocations I was testing, the size of the allocation did not seem to significantly affect the results.
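
Here is a minimal sketch of that naive measurement – my own harness, not the post’s actual test code. The 32 MB buffer size, the iteration count, and the QueryPerformanceCounter-based timing are arbitrary choices:

    // Naive measurement: time new[]/delete[] pairs for a large buffer.
    #include <windows.h>
    #include <stdio.h>

    int main() {
        const size_t kSize = 32 * 1024 * 1024;  // 32 MB
        const int kIterations = 1000;
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        for (int i = 0; i < kIterations; ++i) {
            char* p = new char[kSize];
            // Store the pointer to a volatile so the optimizer can't just
            // discard the unused allocation, without touching the memory.
            volatile char* sink = p;
            (void)sink;
            delete[] p;
        }
        QueryPerformanceCounter(&stop);
        double totalUS = (stop.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart;
        printf("%.2f us per new[]/delete[] pair\n", totalUS / kIterations);
    }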

Freeing is about to get slower

Allocating memory without using it is a bit artificial, so the next thing to do is to write to the entire block of memory before freeing it. Pause for a moment to think about what effect this should have on the cost of allocating and freeing memory. I’ll wait.

The cost of allocating the memory goes up slightly in this scenario, presumably because writing to the memory flushes all sorts of useful data from the CPU’s caches, making subsequent allocations slightly slower. This effect adds about 8.0 μs to the allocation cost.

Freeing the memory gets a lot more expensive. When the memory is written before freeing the cost to free the memory is about 75 μs per MB – a huge jump from the ~2.5 μs to free unused memory. This means that it takes about 2,400 μs (2.4 ms) to free 32 MB of memory that has been used!
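
The change to the sketch above is just the loop body (plus an #include <string.h> for memset) – write to the whole block so that every page is faulted in before the free:

    // Same timing loop as before, but write to the whole block before freeing
    // it, so the pages are faulted in and must be unmapped again by delete[].
    for (int i = 0; i < kIterations; ++i) {
        char* p = new char[kSize];
        memset(p, 1, kSize);  // touch every page - this is what makes the free expensive
        delete[] p;
    }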

Memory, on demand

The reason for this seemingly peculiar behavior has to do with the laziness of the Windows operating system. For large allocations (a MB or larger I believe) the Windows heap (new/new[], malloc/HeapAlloc) will use the system VirtualAlloc function (with MEM_COMMIT | MEM_RESERVE) to request memory. This reserves address space and charges the allocation against the system commit limit, but doesn’t actually allocate any physical pages. The pointer returned by VirtualAlloc is basically just a promissory note – the operating system promises that when pages are touched there will be memory there but, as the documentation says, “Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.” This is often a good thing because applications that allocate more memory than they need may end up being less wasteful than intended, and Linux behaves similarly.
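
As a rough illustration – this is not the heap’s actual code, just the flavor of the system call it ends up making for a large block:

    // Reserve and commit 32 MB of address space in one VirtualAlloc call.
    // No physical pages are allocated until the memory is first written.
    #include <windows.h>

    int main() {
        void* p = VirtualAlloc(nullptr, 32 * 1024 * 1024,
                               MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        // The first write to each page triggers a soft page fault that maps in
        // a zeroed physical page.
        static_cast<char*>(p)[0] = 1;
        VirtualFree(p, 0, MEM_RELEASE);
    }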

So, when memory is allocated and freed without being touched, the free operation has very little to do. However, if the memory has been touched then the pages have been faulted in, and when the memory is freed they actually have to be removed from the process’s address space. That removal is what takes 75 μs per MB.

Faulty towers

If removing the pages from the process’s address space takes 75 μs per MB then presumably it also takes some time to bring the pages into the address space in the first place. This is more difficult to measure because it happens on demand, whenever a page is first accessed. The easiest way to measure it is to time a write to a freshly allocated block of memory and compare that to the time taken to write to a previously written block of memory.
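
Here is a rough sketch of that comparison – again my own harness, and subject to all of the caveats in the next paragraph:

    // Time a write pass over a freshly allocated 32 MB buffer (which includes
    // the soft page faults), then time a second pass over the same memory.
    #include <windows.h>
    #include <string.h>
    #include <stdio.h>

    static double TimedFill(char* p, size_t size) {
        LARGE_INTEGER freq, start, stop;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
        memset(p, 1, size);  // touch every page
        QueryPerformanceCounter(&stop);
        return (stop.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart;  // microseconds
    }

    int main() {
        const size_t kSize = 32 * 1024 * 1024;
        char* p = new char[kSize];
        double firstUse = TimedFill(p, kSize);   // pages are faulted in here
        double secondUse = TimedFill(p, kSize);  // pages are already mapped
        printf("First write: %.0f us, second write: %.0f us\n", firstUse, secondUse);
        delete[] p;
    }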

This is a perilous measurement. There are many different effects that you can end up measuring. If the memory blocks are too small then you may mostly measure cache effects. Beyond that you may be measuring TLB effects. If you don’t have enough memory then pages may be removed from your working set and then need to be paged back in, and other processes can easily interfere with the results. CPU frequency changes can distort the results.

But, I did my best. My CPU’s L3 cache is 6 MB so I tested with buffers from 8 MB to 32 MB to avoid cache effects. I have lots of memory, and I shut down as many processes as possible to avoid interference. I ran the tests multiple times. I used the high-performance power profile and ran a background busy thread to keep the CPU frequencies consistently high. I also recorded ETW (xperf) profiles and examined them to identify sources of error.

The result was pretty clear. Faulting in the pages for first-time use costs a minimum of ~175 μs per MB. In some situations the cost is a lot more, for reasons that I don’t understand, but for allocations of 8+ MB you should assume a minimum cost of ~175 μs per MB when memory is first used.

Page faults, soft versus hard

There is a difference between a soft/minor page fault and a hard/major page fault. A soft page fault is one that can be satisfied from memory, without going to disk. This happens when a freshly allocated page is first referenced, and it also happens when a page is faulted in from the standby list – perhaps a page that was trimmed from the working set, or perhaps a memory-mapped file that was in the system disk cache. Soft page faults are expensive enough to matter in this context, but they are still far cheaper than hard faults.

A hard page fault is a page fault that requires going to disk, perhaps to page in a memory-mapped file or to retrieve data from the swap file. These are more expensive because, you know, disks are slow. Hard faults can easily cost over a second per MB. But those are hard faults and this post is entirely about soft faults, which are much cheaper.

(for more details see http://en.wikipedia.org/wiki/Page_fault)

Zero this

But wait, there is yet another cost of allocating memory that we have not accounted for.

For security reasons it is critical that freshly mapped pages be zeroed, since otherwise information would leak between processes – all modern operating systems do this. Zeroing pages isn’t super expensive, but it isn’t free.

On Windows the OS tries to keep a pool of zeroed pages available. These pages can then be quickly faulted in to the processes that need them. But this pool must be replenished by taking free pages and zeroing them, and that is the job of a low-priority thread in the System process that zeroes pages after they are freed. If it is able to keep up then all of the zeroing of memory after freeing is done outside of your process, and the true cost is thoroughly hidden.

(see http://blogs.msdn.com/b/tims/archive/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two.aspx for more information on the life-cycle of a memory page)

However nothing can hide from the all-seeing eye of ETW (xperf) traces. The CPU Usage (Precise) graph shows my test process in red and the system process in green. At the end of each test loop, when my test process frees memory, you can see the system process spring to life.

This means that the only way to measure the true cost of your seemingly inexpensive memory allocations is to record a trace with WPT and look both in your process (for KiPageFault) and in the system process (for time spent in the zero-page thread in KeZeroMemory – KeZeroPages on Windows 8.1).

If the zero-page thread can’t keep up then the pages may be zeroed on demand in the context of your process, in which case KeZeroMemory/KeZeroPages may show up in your own process’s call stacks instead. This is more likely to happen on machines with fewer CPUs.

The CPU Usage (Sampled) data tells you what the zero-page thread is doing. Then, once you know that thread 8 is always the zero-page thread, you can use the CPU Usage (Precise) data to measure exactly how long that thread runs each time it wakes up to do work. This works out to ~150 μs per MB of memory that is touched and then freed.

Total costs

In conclusion…

A naive measurement of the cost of allocating and freeing large blocks of memory would conclude that it costs about 7.5 μs for each alloc/free pair. However there are three separate per-MB costs for large allocations: ~175 μs per MB to fault the pages in on first use, ~150 μs per MB for the zero-page thread to re-zero the pages after they are freed, and ~75 μs per MB to remove the pages from the process’s address space when they are freed. These add up to approximately 400 μs per MB. So, allocating an 8 MB buffer every frame (perhaps to hold a 1080p RGBA image) can easily waste 3.2 ms (3,200 μs) of CPU time per frame. Some of this will be hidden in the system process where it won’t hurt performance on many-core developer machines, but it will harm frame rate on dual-core customer machines.

If you don’t like paying the cost of page faults when you use memory then you can force that cost to be paid at allocation time – for instance by touching every page immediately after the block is allocated. The cost doesn’t go away, it just moves to a more convenient and predictable place. PrefetchVirtualMemory is the equivalent for memory-mapped files, and similar cautions apply.
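
For heap allocations the simplest way to do this is to touch every page right after allocating. A hypothetical helper (the 4 KB page size is an assumption):

    #include <cstddef>

    // Touch every page immediately after allocating so the soft page faults
    // (and any on-demand zeroing) happen in one predictable burst instead of
    // being scattered through the first use of the buffer.
    char* AllocateAndPrefault(size_t size) {
        char* p = new char[size];
        for (size_t offset = 0; offset < size; offset += 4096)
            p[offset] = 0;  // one write per (assumed) 4 KB page faults it in
        return p;
    }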

It’s worth mentioning that most of these numbers are minimum costs. In particular the time to fault in pages is sometimes much higher, for reasons that I don’t understand.

Real-time zero-page monitoring

Recording an ETW trace is the most accurate and reliable way to find out what is going on, performance-wise, in your process. I record a lot of traces and I learn a lot. Lately I’ve made a habit of looking for KiPageFault in my processes, and looking for heavy activity in the zero-page thread.

But this is tedious and time-consuming. It can be very helpful to monitor the zero-page thread’s activity in real-time. Activity in this thread is a proxy for heavy usage of freshly allocated memory and makes it easy to correlate this with the actions I am taking.

It turns out that this monitoring is easy.

The information I am about to supply makes use of undocumented Windows details that may change in future, present, or past versions of Windows. This information should only be used for diagnostic purposes. It works for me, for now, but if you ship a product that uses this information then I will ask Raymond Chen to form a vigilante group with me to hunt you down and corrupt random bytes in your compiler.

Note that Windows 10 does, indeed, change the assumption below. You were warned.

I have observed (see disclaimer above) that in my tests the zero-page thread always has thread ID 8. Therefore we can easily monitor the activity level of the zero-page thread by:

  1. Using OpenThread to get a handle to thread ID 8
  2. Using QueryThreadCycleTime to get the cycles used by this thread
  3. Sitting in a loop calling Sleep(1000) and printing the zero-page cycles consumed in the last second.
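
Here is a minimal sketch of that loop. It hard-codes the assumption that thread ID 8 is the zero-page thread – exactly the undocumented detail the disclaimer above warns about:

    // Monitor the zero-page thread's CPU usage once per second.
    // Must be run as administrator; assumes thread ID 8 is the zero-page thread.
    #include <windows.h>
    #include <stdio.h>

    int main() {
        HANDLE hThread = OpenThread(THREAD_QUERY_LIMITED_INFORMATION, FALSE, 8);
        if (!hThread) {
            printf("Couldn't open thread 8 - are you running elevated?\n");
            return 1;
        }
        ULONG64 lastCycles = 0;
        QueryThreadCycleTime(hThread, &lastCycles);
        for (;;) {
            Sleep(1000);
            ULONG64 cycles = 0;
            QueryThreadCycleTime(hThread, &cycles);
            printf("Zero-page thread used %llu cycles in the last second\n",
                   cycles - lastCycles);
            lastCycles = cycles;
        }
    }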

That’s it. The program has to run as administrator and it is clearly a horrible hack, but it has helped me identify code that is reallocating memory far too frequently. If the zero-page thread gets significantly busier whenever your code is running then it might be worth recording a trace and looking for KiPageFault.

Test setup

All tests were done on my four-core eight-thread Intel(R) Core(TM) i7-2720QM CPU @ 2.20 GHz (Turboboost up to 3.3 GHz) with 8 GB RAM, running Windows 7 SP1. Spot tests were run on Windows 8.1 and the results were similar.

Crudely designed test code available here.

Linux

I did some spot tests on Linux but I was unable to cleanly measure the cost of mapping pages in and out of memory. I did see the kernel zeroing memory as pages were mapped in on first write – watch for clear_page_c_e or similar functions when you run perf top. A bit of searching shows that developers have noticed the cost of clear_page_c_e before. Linux seems to clear pages on demand instead of in a dedicated thread, which has pros and cons.

Related posts

Making VirtualAlloc Arbitrarily Slower

64-Bit Made Easy

If you want to know more about how to use ETW to do this type of deep investigation of Windows performance then check out ETW Central.

Arseny Kapoulkine wrote an awesome blog post that explains other issues with soft page faults, such as apparent serialization in the kernel which can make them even more expensive!

 


34 Responses to Hidden Costs of Memory Allocation

  1. Marek says:

    Nice article. Time unit rant: My eyes hurt when I see us as unit of time, the right prefix is µ – Greek small letter Mu and time unit is µs. We live in 21st century and computers are able to understand and render UNICODE quite well.

    • Dimi says:

      Hmmm, hurting eyes, you are either a nitpicking whiner, or you need your eyes examined by a doctor. This is not an academic paper or something, we live in the 21st century, where people should be able to get the obvious.

  2. Xi Yang says:

    You may like to read this paper “why nothing matters, the impact of zeroing” http://users.cecs.anu.edu.au/~steveb/downloads/pdf/zero-oopsla-2011.pdf

    • Marc Sherman says:

      Very informative, thanks!

    • Georges Khalil says:

      Great article Bruce!

    • kominek says:

      It seems that would make bugs which allow attackers to read past the end of buffers (heartbleed?) more dangerous.

      If programs are willing to reuse their old unzeroed pages, and they’ll actually receive a performance benefit from doing so, they can always manage some of their own memory.

      • Julien Oster says:

        Correct me if I’m wrong, but I was under the impression that malloc() and friends did exactly that: they reuse the pages of the process without zeroing them, unless done explicitly (commonly by using calloc(), which is the only one mentioning zeroing pages in my man page here). It’s only new pages that are zeroed unconditionally, since they come from outside.

      • dariomanesku says:

        One thing keeps puzzling me, your findings say that freeing 32MB of memory costs about 2.4ms if that memory was used. That seems to me like A LOT. If we pretend that the cost is the same for bigger sizes, in the context of gamedev, that means that when you load 7 textures of 32MB, freeing them right away would cost around 16ms which is the time of one entire frame!

        The question would be: what exactly happens, or should I better say, what exactly needs to happen when the memory is freed? Where is all that time being spent? It seems to me like removing memory pages from process’s address space should be something simple/fast? What are your comments on this?

        • kami says:

          When I was still spoiled by Java and wrote my first C++ code, I wondered why it was so ridiculously slow. The Java VM grabs a large chunk of memory and holds on to it. All the garbage collection etc. is an internal affair.

          So I just commented out all delete/delete[] calls and it improved performance by orders of magnitude. The lessons I learned:
          – Reuse memory (placement new is your friend)
          – Make buffers as long living as you can, e.g. reserve the worst case amount of memory and run with it, the real cost comes with use, so allocating more memory than needed is cheap
          – If you have dynamic data structures like trees, you may want to overload new/new[] and delete/delete[] to hold on to your memory and reuse freed chunks, since the OS already (sometimes) does that transparently you might want to do some profiling before investing the time
          – Keep things on the stack whenever possible

          • Aurélien says:

            In modern C++ we tend to use higher approaches like memory pools (like the one proposed by Boost). The std::allocator approach was a failure, although some members at the committee are still working hard to fix them.
            GC are indeed very useful / impressive, but have limitations as well. More specifically, the GC thread can become a bottleneck on computers having a high number of CPUs. And to be fair, you need to take into account the hidden cost of moving your objects in memory (while your app is paused), which can become a serious issue with large applications.

            Something I would like to try one day is to replace the (global) delete operators in order to defer the request to a separate (dedicated) thread so it speeds up memory deallocation on other threads (while still benefiting from cache locality to run the destructors).
