This article attempts to summarize some findings about heap management under 32-bit versions of Windows that I had to learn the hard way while trying to optimize the memory usage of a fairly complex algorithm. The algorithm was implemented as a portable Python/C hybrid (that is, Python code calling C libraries for the core algorithms), which occasionally complicated matters, but did not change most of the fundamental issues discussed here.
One interesting observation about this project: the particular out-of-memory issue I tried to solve did not seem to occur under Linux at all, and looked rather specific to Windows XP. After ruling out obvious causes such as memory leaks and Python garbage collection, it eventually turned out that the culprits were aspects of Windows memory management that don't appear to be documented explicitly in most places that discuss the Windows heap.
In this context, "very large" means "using a significant portion of the 4GB address space", in other words several hundred MB of data in a single array. Of course, there were always a couple of "easy" answers out of this dilemma, such as:
- Use double indirection to split up a "huge" array into several "small" ones of more manageable size (see the sketch after this list)
- Maximize usable address space by booting Windows with the /3GB switch and setting the LARGEADDRESSAWARE flag on the executable.
 Note: Take any warnings about reduced overall Windows performance seriously: I experienced a slowdown of at least a factor of 10 (!) when rebuilding a large project after enabling the "/3GB" switch, and have also seen reports of a drop in some graphics cards' performance (probably due to the loss of memory apertures).
- Move to a 64-bit version of Windows (and recompile your code to make use of it!)
- Use Linux :-)
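To illustrate the first option, here is a minimal sketch of what double indirection can look like in C. The chunk size and the ChunkedBuffer/chunked_alloc/chunked_at names are mine, chosen for illustration, not code from the original project:

#include <windows.h>

/* 16MB chunks stay well below the problematic block sizes. */
#define CHUNK_SIZE (16*1024*1024)

typedef struct {
    size_t nchunks;
    char **chunks;              /* first level of indirection */
} ChunkedBuffer;

ChunkedBuffer *chunked_alloc(size_t total)
{
    size_t i, n = (total + CHUNK_SIZE - 1) / CHUNK_SIZE;
    ChunkedBuffer *b = (ChunkedBuffer*)GlobalAlloc(0, sizeof(ChunkedBuffer));
    if(!b) return NULL;
    b->nchunks = n;
    b->chunks = (char**)GlobalAlloc(0, n * sizeof(char*));
    if(!b->chunks) { GlobalFree(b); return NULL; }
    for(i = 0; i < n; i++)
    {
        b->chunks[i] = (char*)GlobalAlloc(0, CHUNK_SIZE);
        if(!b->chunks[i])
        {
            /* roll back the chunks allocated so far */
            while(i--) GlobalFree(b->chunks[i]);
            GlobalFree(b->chunks);
            GlobalFree(b);
            return NULL;
        }
    }
    return b;
}

/* Element access costs one extra indirection compared to a flat array. */
char *chunked_at(ChunkedBuffer *b, size_t index)
{
    return &b->chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
}

The price is that pointer arithmetic across the whole buffer no longer works, which is exactly why flat-array algorithms such as qsort() are harder to adapt to this layout.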
Huge "monolithic" memory blocks are bad
Try to avoid memory allocations of more than about 100MB in a single chunk like the plague.
This is perhaps the single most important lesson: if you keep it in mind and write your code accordingly, you can stop reading here and ignore the remaining advice, which mainly deals with getting by when huge blocks cannot be avoided completely, e.g. because some algorithms such as qsort() are much easier to implement on a flat array.
Please note that the 100MB limit is not a hard and fast number - it may be 50MB or 200MB, depending on what you do with your blocks, and how many there are. But this is the order of magnitude where it usually gets tricky. I'd say anything in excess of 200MB is begging for trouble.
Address-space fragmentation can be a problem, and VMMap is your friend
The following program can be used to exercise most of the issues described in this text (note that it uses GlobalAlloc rather than HeapAlloc for simplicity; in recent versions of Windows, the only difference is that GlobalAlloc works with the default heap that gets created for a process anyway):
#include <windows.h>
#include <stdio.h>

int main()
{
    int i;
    void *p, *p2;
    static void* ptr[100000];

    printf("Attempting to allocate 1000MB: ");
    p = (void*)GlobalAlloc(0, 1000*1024*1024);
    if(p) puts("success"); else puts("failure");

    printf("Press [Enter] to continue");
    getchar();

    /* Growing by even 1MB needs room for the old and new block at once. */
    printf("Attempting to reallocate to 1001MB: ");
    p2 = (void*)GlobalReAlloc(p, 1001*1024*1024, 0);
    if(p2) puts("success"); else { p2 = p; puts("failure"); }

    printf("Press [Enter] to continue");
    getchar();

    printf("Attempting to reallocate to 900MB: ");
    p = (void*)GlobalReAlloc(p2, 900*1024*1024, 0);
    if(p) puts("success"); else { p = p2; puts("failure"); }

    printf("Press [Enter] to continue");
    getchar();

    GlobalFree(p);
    puts("Freed memory");

    printf("Attempting to allocate/free 500MB: ");
    p = (void*)GlobalAlloc(0, 500*1024*1024);
    if(p) puts("success"); else puts("failure");
    GlobalFree(p);

    printf("Press [Enter] to continue");
    getchar();

    /* 63k blocks are small enough to be sub-allocated in heap regions. */
    puts("Allocating/freeing 23,552 blocks of 63k");
    for(i=0; i<23*1024; i++)
    {
        ptr[i] = (void*)GlobalAlloc(0, 63*1024);
        if(ptr[i]==NULL) { puts("Unexpected failure"); return 1; }
    }
    for(i=0; i<23*1024; i++)
        GlobalFree(ptr[i]);

    printf("Press [Enter] to continue");
    getchar();

    printf("Attempting to allocate/free 500MB: ");
    p = (void*)GlobalAlloc(0, 500*1024*1024);
    if(p) puts("success"); else puts("failure");
    GlobalFree(p);

    printf("Press [Enter] to continue");
    getchar();

    return 0;
}
      Source code and compiled version are available here.
    
This example is best run together with the VMMap utility by Microsoft's Sysinternals team, which displays the memory layout of a process and is an invaluable tool for resolving address-space fragmentation issues. While doing so, it is useful to enable the "Show free regions" option of VMMap, which demonstrates very nicely how randomly placed EXEs and DLLs punch holes into the otherwise contiguous address space of a Windows process, further limiting the ability to allocate large arrays.
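If you prefer a programmatic view, free regions can also be enumerated directly with VirtualQuery(); the following sketch of mine walks the address space and prints every free region, a crude stand-in for VMMap's free-region view:

#include <windows.h>
#include <stdio.h>

int main()
{
    MEMORY_BASIC_INFORMATION mbi;
    char *addr = NULL;

    /* VirtualQuery fails once we run past the user address space. */
    while(VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi))
    {
        if(mbi.State == MEM_FREE)
            printf("free region at %p, %8u KB\n",
                   mbi.BaseAddress, (unsigned)(mbi.RegionSize / 1024));
        addr = (char*)mbi.BaseAddress + mbi.RegionSize;
        if(addr < (char*)mbi.BaseAddress) break;  /* address wrapped: done */
    }
    return 0;
}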
    
The sample application above stops after each major re-allocation event, allowing you to inspect memory with VMMap, and continues after pressing [Enter]. The expected output on a 32-bit Windows system looks as follows:
Attempting to allocate 1000MB: success
Press [Enter] to continue
Attempting to reallocate to 1001MB: failure
Press [Enter] to continue
Attempting to reallocate to 900MB: success
Press [Enter] to continue
Freed memory
Attempting to allocate/free 500MB: success
Press [Enter] to continue
Allocating/freeing 23,552 blocks of 63k
Press [Enter] to continue
Attempting to allocate/free 500MB: failure
Press [Enter] to continue
There is no VirtualReAlloc
One very simple fact causes the first failure: there simply is no VirtualReAlloc() call in Windows to complement VirtualAlloc() and VirtualFree(). This has important implications for the way the HeapReAlloc() function is implemented internally: because HeapAlloc() directly calls VirtualAlloc() for the allocation of large blocks, Windows can only resize such blocks by allocating a new block of the desired size, then copying the old data and finally freeing the old block.
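The following sketch shows that strategy in miniature; virtual_realloc() is a hypothetical helper of my own, not a Windows API, and the real heap internals are certainly more involved:

#include <windows.h>
#include <string.h>

void *virtual_realloc(void *old, size_t old_size, size_t new_size)
{
    /* Both blocks must fit into the address space at the same time. */
    void *p = VirtualAlloc(NULL, new_size, MEM_COMMIT | MEM_RESERVE,
                           PAGE_READWRITE);
    if(!p) return NULL;
    memcpy(p, old, old_size < new_size ? old_size : new_size);
    VirtualFree(old, 0, MEM_RELEASE);
    return p;
}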
This strategy of course works only if the two blocks can exist side by side in memory for a short time - if there is only enough address space for one of them, resizing fails (as demonstrated by the first failure in the example above). On the other hand, it turns out that in the case of shrinking a block, Windows "hides" this limitation by simply not reducing the block's size while still reporting success: if you inspect memory at the third stopping point, you will notice that the block was not actually shrunk back to 900MB in the address space, but remains at its original 1000MB allocation size (however, the unneeded memory gets decommitted from virtual memory).
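This decommit-but-keep-reserved behaviour can be reproduced directly with VirtualAlloc()/VirtualFree(); the sketch below mimics what appears to happen under the hood (the sizes mirror the example program; this is an illustration, not what the heap literally executes):

#include <windows.h>

int main()
{
    char *p = (char*)VirtualAlloc(NULL, 1000*1024*1024,
                                  MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if(!p) return 1;

    /* "Shrink" to 900MB: decommit the last 100MB but keep the full
       1000MB reservation, so no address space is given back. */
    VirtualFree(p + 900*1024*1024, 100*1024*1024, MEM_DECOMMIT);

    /* Only a full release returns the address space. */
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}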
    
Heap regions allocated for holding "small" blocks do not shrink back
After allocating a large number of "small" items, the address region they occupied remains reserved for holding only such small items in the future, even after all of them have been freed again. The virtual memory held by these regions is "decommitted", so it doesn't show up in the memory usage of the process any more, but its address space is still unavailable for the allocation of new blocks of virtual memory that are too big to be stored in those sub-allocated regions.
    
This means that large allocations following the freeing of a lot of smaller items may fail, even though the memory usage of the process appears to be low. In effect, the more "small" items an application allocates, the more of its address space gets permanently dedicated to small-only allocations, and this space cannot be reclaimed other than by destroying the heap.
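That last point suggests a workaround: give the "small" items a private heap of their own, which can be torn down as a whole. Below is a minimal sketch using HeapCreate()/HeapDestroy(); the block counts mirror the example program, and whether a separate heap is practical depends on how your allocations are organized:

#include <windows.h>
#include <stdio.h>

int main()
{
    int i;
    HGLOBAL g;
    HANDLE small_heap = HeapCreate(0, 0, 0);   /* growable private heap */
    if(!small_heap) return 1;

    for(i = 0; i < 23*1024; i++)
        if(!HeapAlloc(small_heap, 0, 63*1024)) break;

    /* Destroying the heap releases all its regions at once, really
       returning the address space - unlike freeing the blocks one by
       one on the default heap. */
    HeapDestroy(small_heap);

    g = GlobalAlloc(0, 500*1024*1024);
    printf("500MB block after HeapDestroy: %s\n", g ? "success" : "failure");
    if(g) GlobalFree(g);
    return 0;
}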
    
In the example above, this is the reason for the second failure to allocate a 500MB block (which used to work fine before): having allocated about 1.5GB worth of 63k blocks, most of the address space is now occupied by heap regions for holding "small" items, and has thus become unavailable to large blocks.
    
DLLs and EXEs can fragment memory in unexpected ways
It is also important to look at how your EXEs and DLLs are spread across the address space. If you look at them with VMMap, they appear to be placed in seemingly random fashion, reducing the maximum contiguous amount of memory that can be allocated in a single buffer. For example, when I run the example above on my Windows 7 Home system, I see the following "Image" locations cutting through the upper end of the biggest available chunk of free memory:
    
- KernelBase.dll at 0x75AC0000
- ...followed by 28,120K of free memory
- ntdll.dll at 0x77680000
- ...followed by 80K of free memory
- kernel32.dll at 0x777D0000
- ...followed by 112K of free memory
- apisetschema.dll at 0x778C0000
- ...followed by 129,212K of free memory
Of course, these locations are not actually random, and may differ depending on your version of Windows - in fact, each EXE and DLL has a "preferred" base address at which it can be loaded quickly, without having to relocate any absolute addresses in the executable file, and by default this is just where Windows will place it, in order to load quickly and reduce virtual memory usage. Each DLL loaded by your application, either directly or indirectly (such as an MSVC runtime library), will further fragment the address space.
    
However, when tuning an application for optimal usage of address space, this behaviour may be counter-productive, and in some cases it may be useful to cluster the load addresses of application-specific DLLs more closely in the address space. This is best done at compile time, but in cases where rebuilding is not possible, the /REBASE option of the EDITBIN utility can also be used to move a DLL to a different base address.
    
One limitation I noticed when trying to rebase an executable (in my case python.exe, which uses a base address of 0x1d000000 by default, separating about 450MB from the bottom of the contiguous address space) is that EDITBIN will not rebase an EXE file, and may simply leave it unchanged without producing an error message. It is not clear to me why this happens, but in such cases only recompiling the EXE (or using a hex editor, for the foolishly brave :-)) will help. In any case, VMMap should be used to verify the intended effect of the address-space reshuffling, as Windows may always take the liberty of resolving address-space collisions by moving a DLL to a different place.
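For reference, the rebasing command takes a form like EDITBIN /REBASE:BASE=0x60000000 mylib.dll (the address is just an example). Besides VMMap, a quick programmatic check of where modules actually ended up is to print their module handles, which on Win32 are their base addresses; the module names below are merely examples:

#include <windows.h>
#include <stdio.h>

void show_base(const char *name)
{
    /* GetModuleHandle returns the load address; NULL means the EXE. */
    HMODULE h = GetModuleHandleA(name);
    printf("%-16s at %p\n", name ? name : "(exe)", (void*)h);
}

int main()
{
    show_base(NULL);
    show_base("kernel32.dll");
    show_base("ntdll.dll");
    return 0;
}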