<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="http://code.msdn.microsoft.com/rss.xsl"?><rss version="2.0"><channel><title>New NUMA Support with Windows Server 2008 R2 and Windows 7</title><link>http://code.msdn.microsoft.com/64plusLP/Project/ProjectRss.aspx</link><description>The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New commodity systems are now appearing that leverage non-uniform mem...</description><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=57</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New commodity systems are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;{*}Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node whose number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node whose number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
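        // Query the group count here; the variable in the previous snippet is scoped to its own block.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)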
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
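        	// In the released SDK headers PROCESSOR_RELATIONSHIP.GroupMask is an array; a core belongs to one group, so entry 0 describes it.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);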
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
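        	// A package can span more than one group (see Processor.GroupCount); only the first entry is shown here.
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);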
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
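&lt;/pre&gt; &lt;br /&gt;The single-group functions named in the Abstract can be combined into a short node-layout query.  The following is a minimal sketch and is not part of the downloadable samples.  It assumes a plain console program, uses printf rather than the samples' Display helper, and reports only processors in the calling thread's processor group, because these older functions return a single 64-bit mask.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
 
int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&amp;amp;highestNode))
        return (int)GetLastError();
 
    // Node numbers are not guaranteed to be contiguous, so probe every number up to the highest one.
    for (ULONG node = 0; node &amp;lt;= highestNode; node++)
    {
        ULONGLONG mask = 0;
        ULONGLONG freeBytes = 0;
 
        // Skip node numbers that are not present or that have no processors configured.
        if (!GetNumaNodeProcessorMask((UCHAR)node, &amp;amp;mask) || mask == 0)
            continue;
 
        // Treat a failed memory query as "unknown" and report zero.
        if (!GetNumaAvailableMemoryNode((UCHAR)node, &amp;amp;freeBytes))
            freeBytes = 0;
 
        printf(&amp;quot;Node %lu: ProcessorMask=0x%016llX  AvailableMemory=%llu MB\n&amp;quot;,
               node, mask, freeBytes / (1024 * 1024));
    }
    return 0;
}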
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resultant thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    0,                 		// use default creation flags 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    // Note: the thread was created with default flags (not CREATE_SUSPENDED), so it may already be running while we check and adjust NUMA affinity.
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // ****  NOTE:  The SetThreadGroupAffinity API takes 3 parameters.   This example was first written against SDK headers from beta1, where it took 2.   After RC1 (May, 2009), the API
        // ****  is SetThreadGroupAffinity(__in HANDLE hThread, __in CONST GROUP_AFFINITY *GroupAffinity, __out_opt PGROUP_AFFINITY PreviousGroupAffinity).
        // ****  Be sure to use the latest Windows SDK and the released Windows Server 2008 R2 and Windows 7.   See SDK header file &amp;quot;winbase.h&amp;quot; for verification of
        // ****  the API signature.
        SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL);  
    }
    return 1;
}
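 
// ILLUSTRATIVE SKETCH ONLY - DemoThreadFunction is referenced above but its body is not shown in this article.
//   The version below is an assumption, not the code from the downloadable samples.  It reports the
//   GROUP, processor and NUMA node the worker is running on, which is one way to check the mapping.
DWORD WINAPI DemoThreadFunction(LPVOID lpParam)
{
    // The callers above pass the address of a local variable, which is only valid while the
    // creating function is still running, so this sketch deliberately does not dereference it.
    UNREFERENCED_PARAMETER(lpParam);
 
    PROCESSOR_NUMBER procNumber;
    USHORT node = 0;
 
    // Snapshot of where the thread is executing right now; the scheduler may move it later.
    GetCurrentProcessorNumberEx(&amp;amp;procNumber);
    if (GetNumaProcessorNodeEx(&amp;amp;procNumber, &amp;amp;node) == 0)
        return GetLastError();
 
    Display(L&amp;quot;\tWorker running on GROUP=0x%04X, Processor=0x%02X, NUMAnode=0x%04X\n&amp;quot;,
        procNumber.Group, procNumber.Number, node);
    return 0;
}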
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resultant thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; (++processorNumber.Number &amp;lt; numActiveProcs));   // valid processor numbers are 0 .. numActiveProcs-1
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
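&lt;/pre&gt; &lt;br /&gt;Memory can be localized in the same way as threads.  The following is a minimal sketch of the VirtualAllocExNuma behavior described in the Abstract and is not part of the downloadable samples.  The buffer size and the choice of node 0 are assumptions for illustration; in the disk examples above, the preferred node would be the numaNode value returned by GetNumaNodeNumberFromHandle.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
 
int main(void)
{
    const SIZE_T size = 1024 * 1024;    // 1 MB, an arbitrary size for the demo
    const DWORD preferredNode = 0;      // assumption: node 0 exists on this system
 
    // Reserve and commit address space with a preferred NUMA node.
    // No physical pages are assigned yet, so this can succeed even if the node is short of free pages.
    void* p = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, preferredNode);
    if (p == NULL)
    {
        printf(&amp;quot;VirtualAllocExNuma failed, error=%lu\n&amp;quot;, GetLastError());
        return 1;
    }
 
    // Physical pages are allocated on demand when the memory is first touched.
    // The memory manager falls back to other nodes if the preferred node has no free pages.
    ZeroMemory(p, size);
    printf(&amp;quot;Allocated and touched %lu bytes with preferred node %lu\n&amp;quot;,
           (unsigned long)size, preferredNode);
 
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}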
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Wed, 12 Aug 2009 14:18:24 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090812P</guid></item><item><title>UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008)</title><link>http://code.msdn.microsoft.com/64plusLP/Release/ProjectReleases.aspx?ReleaseId=1979</link><description></description><author></author><pubDate>Wed, 17 Jun 2009 00:27:32 GMT</pubDate><guid isPermaLink="false">UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008) 20090617A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=56</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. You can determine the node number for each processor by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. 
Alternatively, to retrieve a list of all processors within a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;&amp;quot;5 Minute Concept&amp;quot; Webcasts&lt;/b&gt; 
&lt;/h2&gt;&lt;a href="http://channel9.msdn.com/tags/NUMA" class="externalLink"&gt;http://channel9.msdn.com/tags/NUMA&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
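        // Re-query the group count here; the variable in the previous snippet is scoped to its own block.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)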
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
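 
   // Illustrative addition (not part of the original sample): a minimal sketch of
   // querying system-wide totals rather than per-group counts.  It assumes the
   // ALL_PROCESSOR_GROUPS constant from the Windows SDK headers is passed in place
   // of a group number, and that Display is the sample's own printf-style helper.
   { 
        DWORD dwMaximumProcessors = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
        DWORD dwActiveProcessors = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        Display(L&amp;quot;ALL GROUPS \tMaximumProcessorCount=%u \tActiveProcessorCount=%u\n&amp;quot;, dwMaximumProcessors, dwActiveProcessors);
   }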
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
	           ptr-&amp;gt;Processor.GroupMask.Group,
	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    // Note: numaNode is a local variable; production code should pass storage that outlives this function.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended so the affinity can be adjusted before the thread runs
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    // CreateThread returns NULL on failure, not INVALID_HANDLE_VALUE.
    if (hThread == NULL)
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous affinity; NULL ignores it.
        SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL);  
    }
    ResumeThread(hThread);   // let the thread run now that its affinity is set
    return 1;
}
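 
// Illustrative addition (not part of the original sample): a minimal sketch of also
// giving an existing thread an ideal processor inside the disk's NUMA node, using
// SetThreadIdealProcessorEx from the function list above.  The helper name
// SetIdealProcessorFromAffinity is hypothetical, and picking the lowest-numbered
// processor in the mask is only one possible policy.
BOOL SetIdealProcessorFromAffinity(HANDLE hThread, const GROUP_AFFINITY* pAffinity)
{
    PROCESSOR_NUMBER idealProcessor = {0};   // also zeroes the Reserved field
    KAFFINITY mask = pAffinity-&amp;gt;Mask;
 
    if (mask == 0)
        return FALSE;                        // no processors to choose from
 
    // Find the index of the lowest set bit in the affinity mask.
    idealProcessor.Group = pAffinity-&amp;gt;Group;
    while ((mask &amp;amp; 1) == 0)
    {
        mask &amp;gt;&amp;gt;= 1;
        idealProcessor.Number++;
    }
 
    // NULL: the previous ideal processor is not needed here.
    return SetThreadIdealProcessorEx(hThread, &amp;amp;idealProcessor, NULL);
}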
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;   // do not leave the reserved field uninitialized
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; ((++processorNumber.Number) &amp;lt; numActiveProcs));   // stop at the last valid processor index
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   Virtual memory is allocated for each processor within a NUMA node.  The VirtualAllocExNuma API names a preferred node, so that physical pages are taken from memory &amp;quot;near&amp;quot; the specified processor whenever that node has pages available, thus gaining efficient &amp;quot;access&amp;quot; (as in non-uniform memory &amp;quot;access&amp;quot;).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
  PVOID* Buffers = NULL;   // declared up front so that no &amp;quot;goto Exit&amp;quot; skips its initialization
 
  // AllocationSize and PageSize are not declared in this excerpt; they are assumed
  // to be file-scope variables in the full sample.
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (!GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      goto Exit;
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
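 
// Illustrative addition (not part of the original SDK sample): a minimal sketch of
// checking how much free memory each node has before choosing a preferred node,
// using GetNumaAvailableMemoryNode as described in the abstract.  The function name
// DumpNumaAvailableMemory is hypothetical; Display is the sample's own helper.
void DumpNumaAvailableMemory()
{
  ULONG HighestNodeNumber;
 
  if (!GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      return;
  }
 
  for (ULONG Node = 0; Node &amp;lt;= HighestNodeNumber; Node++)
  {
      ULONGLONG AvailableBytes = 0;
 
      //
      // The highest node number is not guaranteed to equal the node count,
      // so a node that cannot be queried is simply skipped.
      //
      if (!GetNumaAvailableMemoryNode((UCHAR)Node, &amp;amp;AvailableBytes))
          continue;
 
      Display(L&amp;quot;Node %u: %I64u MB available\r\n&amp;quot;, Node, AvailableBytes / (1024*1024));
  }
}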
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Fri, 13 Mar 2009 16:33:05 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090313P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=55</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. You can determine the node number for each processor by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. 
Alternatively, to retrieve a list of all processors within a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related &amp;quot;5 Minute Concept&amp;quot; Webcasts&lt;/b&gt; 
&lt;/h2&gt;&lt;a href="http://channel9.msdn.com/tags/NUMA" class="externalLink"&gt;http://channel9.msdn.com/tags/NUMA&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
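   // How many processors per GROUP?
   { 
        // Query the group count here; the variable in the previous snippet is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }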
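&lt;/pre&gt; &lt;br /&gt;The per-group counts can also be totalled for the whole system.  The following is a minimal sketch rather than part of the original sample; it assumes a released Windows 7 / Windows Server 2008 R2 SDK, in which WinNT.h defines ALL_PROCESSOR_GROUPS, and it prints with wprintf instead of the sample's Display helper.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#define _WIN32_WINNT 0x0601   // target Windows 7 / Windows Server 2008 R2
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

int wmain()
{
    // ALL_PROCESSOR_GROUPS requests a total across every processor group.
    DWORD activeTotal  = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    DWORD maximumTotal = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
    wprintf(L&amp;quot;ALL GROUPS \tMaximumProcessorCount=%lu \tActiveProcessorCount=%lu\n&amp;quot;, maximumTotal, activeTotal);
    return 0;
}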
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    BOOL done = FALSE;
 
    while (!done)
    {
        // The first call fails with ERROR_INSUFFICIENT_BUFFER and reports the required size in returnLength.
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_OUTOFMEMORY);   // malloc does not set the Win32 last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
            // In the released SDK, Processor.GroupMask is an array of GroupCount entries;
            // a processor core always belongs to exactly one group, so entry 0 is used.
            Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016IX\n&amp;quot;, 
                       ptr-&amp;gt;Processor.GroupMask[0].Group, 
                       ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
 
          case RelationNumaNode:
            Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016IX\n&amp;quot;,
                       ptr-&amp;gt;NumaNode.NodeNumber,
                       ptr-&amp;gt;NumaNode.GroupMask.Group,
                       ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
            Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016IX\n&amp;quot;,
                       ptr-&amp;gt;Cache.Level,
                       ptr-&amp;gt;Cache.Associativity,
                       ptr-&amp;gt;Cache.LineSize,
                       ptr-&amp;gt;Cache.CacheSize,
                       GetCacheType(ptr-&amp;gt;Cache.Type),
                       ptr-&amp;gt;Cache.GroupMask.Group,
                       ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
            // A package can span processor groups, so print every GroupMask entry.
            for (WORD g = 0; g &amp;lt; ptr-&amp;gt;Processor.GroupCount; g++)
                Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%016IX\n&amp;quot;,
                           ptr-&amp;gt;Processor.GroupMask[g].Group,
                           ptr-&amp;gt;Processor.GroupMask[g].Mask);
          break;
 
          case RelationGroup:
            Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
                       ptr-&amp;gt;Group.MaximumGroupCount,
                       ptr-&amp;gt;Group.ActiveGroupCount);
            for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
                 Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%016IX\n&amp;quot;,
                    ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
                    ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
                    ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
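&lt;/pre&gt; &lt;br /&gt;The node layout discovery described in the Abstract (highest node number, then the processors of each node) can be sketched more directly with the group-aware functions.  This is an illustrative sketch, not the article's sample: the CountProcessors helper is hypothetical, output goes through wprintf rather than Display, and the released Windows 7 SDK signatures are assumed.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#define _WIN32_WINNT 0x0601   // target Windows 7 / Windows Server 2008 R2
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

// Hypothetical helper: counts the bits set in an affinity mask (one bit per logical processor).
static unsigned CountProcessors(KAFFINITY mask)
{
    unsigned count = 0;
    for (; mask != 0; mask &amp;amp;= (mask - 1))   // clears the lowest set bit on each pass
        count++;
    return count;
}

int wmain()
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&amp;amp;highestNode))
    {
        wprintf(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%lx\n&amp;quot;, GetLastError());
        return 1;
    }

    // The highest node number is not the node count, so probe every number up to it.
    for (ULONG n = 0; n &amp;lt;= highestNode; n++)
    {
        GROUP_AFFINITY affinity;
        if (!GetNumaNodeProcessorMaskEx((USHORT)n, &amp;amp;affinity) || affinity.Mask == 0)
            continue;   // the call failed, or the node has no processors configured

        wprintf(L&amp;quot;Node %lu: GROUP=%u, %u logical processors, KAFFINITYmask=0x%016IX\n&amp;quot;,
                n, (unsigned)affinity.Group, CountProcessors(affinity.Mask), affinity.Mask);
    }
    return 0;
}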
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resultant thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016IX\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // Note: numaNode is a local variable, so DemoThreadFunction must not dereference
    // its argument after this function has returned.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended; resumed below 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%016IX\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The released API takes a third, optional parameter that receives the previous affinity.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    ResumeThread(hThread);
    CloseHandle(hThread);   // the thread keeps running; only this handle is released
    return 1;
}
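&lt;/pre&gt; &lt;br /&gt;The same idea can be applied to the calling thread, combining a group affinity change with an ideal processor.  The sketch below is an assumption-laden illustration rather than the article's code: MoveCurrentThreadToNode is a hypothetical helper, node 0 is an arbitrary choice, and the released Windows 7 SDK signatures of SetThreadGroupAffinity and SetThreadIdealProcessorEx are assumed.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#define _WIN32_WINNT 0x0601   // target Windows 7 / Windows Server 2008 R2
#include &amp;lt;windows.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

// Hypothetical helper: restricts the calling thread to one NUMA node and makes the
// node's lowest-numbered logical processor the thread's ideal processor.
BOOL MoveCurrentThreadToNode(USHORT node)
{
    GROUP_AFFINITY nodeAffinity;
    if (!GetNumaNodeProcessorMaskEx(node, &amp;amp;nodeAffinity) || nodeAffinity.Mask == 0)
        return FALSE;   // unknown node, or a node without processors

    // Zero the reserved members defensively before passing the structure back in.
    ZeroMemory(nodeAffinity.Reserved, sizeof(nodeAffinity.Reserved));

    GROUP_AFFINITY previousAffinity;
    if (!SetThreadGroupAffinity(GetCurrentThread(), &amp;amp;nodeAffinity, &amp;amp;previousAffinity))
        return FALSE;

    // Find the index of the lowest set bit in the node's mask (the mask is non-zero here).
    BYTE number = 0;
    while (((nodeAffinity.Mask &amp;gt;&amp;gt; number) &amp;amp; 1) == 0)
        number++;

    PROCESSOR_NUMBER ideal;
    ZeroMemory(&amp;amp;ideal, sizeof(ideal));
    ideal.Group = nodeAffinity.Group;
    ideal.Number = number;
    return SetThreadIdealProcessorEx(GetCurrentThread(), &amp;amp;ideal, NULL);
}

int wmain()
{
    // Node 0 is only an example; any node reported by the system may be used.
    if (!MoveCurrentThreadToNode(0))
    {
        wprintf(L&amp;quot;MoveCurrentThreadToNode failed: 0x%lx\n&amp;quot;, GetLastError());
        return 1;
    }

    PROCESSOR_NUMBER current;
    GetCurrentProcessorNumberEx(&amp;amp;current);
    wprintf(L&amp;quot;Now running on GROUP=%u, processor %u\n&amp;quot;, (unsigned)current.Group, (unsigned)current.Number);
    return 0;
}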
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resultant thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016IX\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    BOOL found = FALSE;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    ZeroMemory(&amp;amp;processorNumber, sizeof(processorNumber));   // also clears the Reserved member
    processorNumber.Group = groupAffinityDisk.Group;
    for (DWORD n = 0; (n &amp;lt; numActiveProcs) &amp;amp;&amp;amp; !found; n++)
    {
        processorNumber.Number = (BYTE)n;
        found = GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode);
    }
    if (!found)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found in the disk's NUMA node.\n&amp;quot;);
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%016IX, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, virtual memory is allocated with that processor's NUMA node as the preferred node.  The VirtualAllocExNuma API asks the memory manager to back the allocation with physical pages &amp;quot;near&amp;quot; the specified processor, falling back to other nodes only when the preferred node is out of pages, thus gaining efficient &amp;quot;access&amp;quot; (as in non-uniform memory &amp;quot;access&amp;quot;).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
  PVOID* Buffers = NULL;   // declared before the first &amp;quot;goto Exit&amp;quot; so that no jump skips its initialization
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  // AllocationSize (SIZE_T) and PageSize (DWORD) are not declared in this function;
  // they are assumed to be declared at file scope.
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      goto Exit;
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
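&lt;/pre&gt; &lt;br /&gt;The NumPages arithmetic in the sample above can be checked in isolation. The following is a minimal sketch rather than part of the SDK sample: PagesSpanned is a hypothetical helper name, plain integers stand in for the buffer address and page size, and the size is assumed to be nonzero.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
```cpp
// Hypothetical helper (not part of the SDK sample): number of pages touched by
// a buffer that starts at address `start` and is `size` bytes long.
// Assumes size and pageSize are both nonzero.
constexpr unsigned long long PagesSpanned(unsigned long long start,
                                          unsigned long long size,
                                          unsigned long long pageSize)
{
    // Index of the page holding the last byte, minus the index of the page
    // holding the first byte, plus one.
    return (start + size - 1) / pageSize - start / pageSize + 1;
}

// A page-aligned 16 MB buffer with 4 KB pages spans exactly 4096 pages.
static_assert(PagesSpanned(0x10000, 16ULL * 1024 * 1024, 4096) == 4096, "aligned buffer");
// A 2-byte buffer that straddles a page boundary spans 2 pages.
static_assert(PagesSpanned(4095, 2, 4096) == 2, "straddling buffer");
```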
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Fri, 13 Mar 2009 16:32:32 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090313P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=54</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor cores per socket, and Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled on each core will reach 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. You can determine the node number for each processor by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. 
Alternatively, to retrieve a list of all processors within a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node whose number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask for a NUMA node whose number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
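&lt;/pre&gt; &lt;br /&gt;The group counts reported above are bounded by the 64-processor limit of a group. As a rough sketch (MinimumGroupCount and kMaxProcessorsPerGroup are hypothetical names, not Windows APIs), the smallest number of groups that can hold a given number of logical processors is a ceiling division. This is only a lower bound: because entire NUMA nodes are kept within one group, the operating system may report more groups than this.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
```cpp
// A processor group holds at most 64 logical processors.
constexpr unsigned kMaxProcessorsPerGroup = 64;

// Hypothetical helper: lower bound on the number of processor groups needed
// for the given number of logical processors (ceiling division by 64).
constexpr unsigned MinimumGroupCount(unsigned logicalProcessors)
{
    return (logicalProcessors + kMaxProcessorsPerGroup - 1) / kMaxProcessorsPerGroup;
}

// 128 logical processors fit in two groups, matching the example in the text.
static_assert(MinimumGroupCount(128) == 2, "128 processors need two groups");
// Up to 64 logical processors fit in a single group (Group 0).
static_assert(MinimumGroupCount(64) == 1, "64 processors fit in one group");
// One processor beyond 64 forces a second group.
static_assert(MinimumGroupCount(65) == 2, "65 processors need two groups");
```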
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
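        // The count is queried again here because the variable in the previous block is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)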
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_OUTOFMEMORY);   // malloc does not set the Win32 last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
	           ptr-&amp;gt;Processor.GroupMask.Group,
	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    0,                 		// use default creation flags 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    // The thread is already running (it was not created suspended); check and adjust its NUMA affinity.
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk);  
    }
    return 1;
}
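&lt;/pre&gt; &lt;br /&gt;The test that decides whether the thread must be moved is a subset check on affinity masks. The following is a minimal sketch under stated assumptions: GroupAffinity and AffinityMask are simplified stand-ins for GROUP_AFFINITY and KAFFINITY, and IsWithin is a hypothetical helper. The check is written with a bitwise OR, which is equivalent to the bitwise AND form used in the sample.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
```cpp
// Simplified stand-ins for the Windows KAFFINITY and GROUP_AFFINITY types
// (assumptions for illustration; the real definitions live in WinNT.h).
typedef unsigned long long AffinityMask;

struct GroupAffinity
{
    AffinityMask   Mask;   // one bit per logical processor in the group
    unsigned short Group;  // processor group number
};

// Returns true when every processor allowed by `thread` is also allowed by
// `node`: same group, and the thread mask adds no bits to the node mask.
constexpr bool IsWithin(GroupAffinity thread, GroupAffinity node)
{
    if (thread.Group != node.Group)
        return false;
    return (thread.Mask | node.Mask) == node.Mask;
}

// Processors 0-3 lie inside a node that covers processors 0-7.
static_assert(IsWithin(GroupAffinity{0x0F, 0}, GroupAffinity{0xFF, 0}), "subset of the node");
// Processor 8 is outside the node, so the thread would need to be moved.
static_assert(!IsWithin(GroupAffinity{0x1FF, 0}, GroupAffinity{0xFF, 0}), "extra processor");
// The same mask in a different group never counts as local.
static_assert(!IsWithin(GroupAffinity{0x0F, 1}, GroupAffinity{0xFF, 0}), "different group");
```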
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the disk's NUMA node as the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    // Assuming processor numbers in the group run from 0 to numActiveProcs-1,
    // advance first and then test, so the loop never queries a number past the end.
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; ((++processorNumber.Number) &amp;lt; numActiveProcs));
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)   // SIZE_T is unsigned, so it can never be negative
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, the sample allocates virtual memory with that processor's NUMA node as the preferred node.  The VirtualAllocExNuma API requests that physical pages come from memory &amp;quot;near&amp;quot; the specified node, falling back to other nodes only when the preferred node is out of pages, which keeps &amp;quot;access&amp;quot; efficient (as in non-uniform memory &amp;quot;access&amp;quot;).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      return;   // nothing is allocated yet, and Exit reads Buffers, which is only declared further down
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
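 
//
// Sketch, not part of the SDK sample: the page-count arithmetic from the loop
// above as a standalone helper. PagesSpanned is a hypothetical name, and plain
// C++ types stand in for SIZE_T. The caller must pass a length greater than zero.
//
unsigned long long PagesSpanned(unsigned long long start,
                                unsigned long long length,
                                unsigned long long pageSize)
{
  unsigned long long last = start + length - 1;      // address of the final byte
  return last / pageSize - start / pageSize + 1;     // pages touched, partial pages included
}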
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=W2K8R2&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Sun, 01 Mar 2009 00:01:02 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090301A</guid></item><item><title>UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008)</title><link>http://code.msdn.microsoft.com/64plusLP/Release/ProjectReleases.aspx?ReleaseId=1979</link><description></description><author></author><pubDate>Sat, 28 Feb 2009 23:59:30 GMT</pubDate><guid isPermaLink="false">UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008) 20090228P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=53</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor cores per socket, and two-way Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled on each core will reach 64 Logical Processors &amp;#40;4 x 8 x 2&amp;#41;.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. You can determine the node number for each processor by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. 
Alternatively, to retrieve a list of all processors within a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups of 64 logical processors each, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
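 
   // Sketch, not part of the SDK sample: the smallest number of groups that can
   // hold a given number of logical processors, assuming the 64-processor group
   // limit described under Processor Groups. MinimumGroupCount is a hypothetical
   // helper written with plain C++ types. The real count can be higher, because
   // Windows keeps each NUMA node inside a single group.
   unsigned MinimumGroupCount(unsigned logicalProcessors)
   {
         const unsigned maxProcessorsPerGroup = 64;
         // Round up: 128 processors need 2 groups, 129 need 3.
         return (logicalProcessors + maxProcessorsPerGroup - 1) / maxProcessorsPerGroup;
   }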
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
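   { 
        // Queried again here: the variable declared in the previous block is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }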
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
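        	// The released SDK headers declare Processor.GroupMask as an array; a core has exactly one entry.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);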
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
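        	// The released SDK headers declare Processor.GroupMask as an array of GroupCount entries;
        	// only the first entry is shown here.
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);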
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
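 
    // Sketch, not part of the SDK sample: counts the logical processors named by
    // one of the KAFFINITY masks printed above. CountProcessorsInMask is a
    // hypothetical standalone helper; a plain C++ type stands in for KAFFINITY.
    unsigned CountProcessorsInMask(unsigned long long mask)
    {
        unsigned count = 0;
        while (mask != 0)
        {
            count += (unsigned)(mask % 2);   // add 1 when the low bit is set
            mask /= 2;                       // bring the next processor into the low bit
        }
        return count;
    }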
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread, then move it onto the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function (must stay valid while the thread uses it)
	    CREATE_SUSPENDED,  		// start suspended so the affinity can be set before the thread runs
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)   // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous affinity; it is not needed here.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
 
    // The affinity is in place; let the thread run.
    if (ResumeThread(hThread) == (DWORD)-1)
        throw(GetLastError());
    return 1;
}
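 
// Sketch, not part of the SDK sample: the group and mask comparison made above,
// as a standalone predicate. AffinityIsInsideNode is a hypothetical name, and
// plain C++ types stand in for WORD and KAFFINITY. The thread needs to be moved
// only when this returns false.
bool AffinityIsInsideNode(unsigned short threadGroup, unsigned long long threadMask,
                          unsigned short nodeGroup, unsigned long long nodeMask)
{
    if (threadGroup != nodeGroup)
        return false;
    // Subset test: adding the thread's processors must not change the node mask.
    return (threadMask | nodeMask) == nodeMask;
}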
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
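    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    bool found = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;
    for (DWORD n = 0; (n &amp;lt; numActiveProcs) &amp;amp;&amp;amp; !found; n++)
    {
        processorNumber.Number = (BYTE)n;   // stays at the matching processor once found is set
        found = GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode);
    }
    if (!found)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found in the disk's NUMA node.\n&amp;quot;);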
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)   // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
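&lt;/pre&gt; &lt;br /&gt;The probing loop in the sample above is one way to pick the ideal processor. Because GetNumaNodeProcessorMaskEx has already returned the node's affinity mask, the lowest set bit of that mask also names a processor in the node. The following is a minimal, portable sketch of that bit scan only, assuming a 64-bit mask such as KAFFINITY on 64-bit Windows. LowestProcessorInMask and MaskContains are hypothetical helper names, not Windows APIs.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstdint>

// Returns the index (0-63) of the lowest set bit in a 64-bit affinity mask,
// or -1 when the mask is empty. Hypothetical helper, not part of the Windows SDK.
int LowestProcessorInMask(std::uint64_t mask)
{
    if (mask == 0)
        return -1;
    int index = 0;
    while ((mask & 1u) == 0) {   // shift right until bit 0 is set
        mask >>= 1;
        ++index;
    }
    return index;
}

// True when processor `index` (0-63) is present in the mask.
bool MaskContains(std::uint64_t mask, unsigned index)
{
    assert(index < 64);
    return ((mask >> index) & 1u) != 0;
}
```
]]>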
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.  For each processor, the sample looks up the processor's NUMA node and allocates virtual memory with that node as the preferred node.  The VirtualAllocExNuma API asks the memory manager to back the allocation with physical pages from the preferred node, that is, from memory &amp;quot;near&amp;quot; the processor, which gives the most efficient &amp;quot;access&amp;quot; (as in non-uniform memory &amp;quot;access&amp;quot;).  If the preferred node is out of pages, pages from other nodes are used instead.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      return;   // Buffers is not declared yet, so there is nothing to clean up
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
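&lt;/pre&gt; &lt;br /&gt;The NumPages expression in the sample counts the pages touched by a buffer from the addresses of its first and last bytes, so it stays correct even if the buffer does not start on a page boundary. The following portable sketch isolates that arithmetic. PagesSpanned is a hypothetical helper name, and the 4096-byte page size used when trying it out is an assumption; the sample reads the real value from GetSystemInfo.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of pages touched by the byte range [start, start + size).
// Hypothetical helper mirroring the NumPages expression in the sample.
std::size_t PagesSpanned(std::uintptr_t start, std::size_t size, std::size_t pageSize)
{
    assert(pageSize != 0);
    if (size == 0)
        return 0;                                   // an empty range touches no page
    const std::uintptr_t last = start + size - 1;   // address of the final byte
    return static_cast<std::size_t>(last / pageSize - start / pageSize + 1);
}
```
]]>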
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Thu, 29 Jan 2009 17:49:00 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090129P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=52</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor cores per socket, and Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled on each core will reach 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. You can determine the node number for each processor by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. 
Alternatively, to retrieve a list of all processors within a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exported by &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified logical processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
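        // query the count here; the variable in the previous snippet is scoped to its own block.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)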
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
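&lt;/pre&gt; &lt;br /&gt;The counts printed above can be sanity-checked against the rule stated in the Processor Groups section: a group holds up to 64 logical processors, and Windows keeps the number of groups as small as possible. The following is a rough sketch under that assumption only. Actual group assignment also keeps whole NUMA nodes together, so a real system may report more groups than this lower bound. MinimumGroupCount and kMaxProcessorsPerGroup are hypothetical names, not Windows APIs.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
<![CDATA[
```cpp
#include <cassert>

// Upper limit on logical processors per group, as stated in the article.
const unsigned kMaxProcessorsPerGroup = 64;

// Smallest number of processor groups that can hold the given number of
// logical processors (ceiling division). Hypothetical helper.
unsigned MinimumGroupCount(unsigned logicalProcessors)
{
    assert(logicalProcessors > 0);
    return (logicalProcessors + kMaxProcessorsPerGroup - 1) / kMaxProcessorsPerGroup;
}
```
]]>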
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,   // GroupMask is an array in the released SDK headers
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,   // GroupMask is an array in the released SDK headers
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
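&lt;/pre&gt; &lt;br /&gt;The loop above walks a packed buffer of variable-size records by adding each record's Size field to a byte offset. The following portable sketch shows the same traversal with explicit bounds checks. RecordHeader is a hypothetical stand-in for the leading Relationship and Size fields of SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, and CountRecords is an illustrative helper, not a Windows API.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the fixed header of a variable-size record.
struct RecordHeader {
    std::uint32_t Relationship;   // record kind
    std::uint32_t Size;           // total record size in bytes, header included
};

// Counts the records of one kind in a packed buffer.
// Returns -1 if a record is truncated or reports an impossible size.
int CountRecords(const unsigned char* buffer, std::size_t length, std::uint32_t relationship)
{
    int count = 0;
    std::size_t offset = 0;
    while (offset < length) {
        if (length - offset < sizeof(RecordHeader))
            return -1;                                   // header does not fit
        RecordHeader header;
        std::memcpy(&header, buffer + offset, sizeof(header));   // avoids unaligned reads
        if (header.Size < sizeof(RecordHeader) || header.Size > length - offset)
            return -1;                                   // would loop forever or overrun
        if (header.Relationship == relationship)
            ++count;
        offset += header.Size;                           // step to the next record
    }
    return count;
}
```
]]>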
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,   	// start suspended so affinity is set before the thread runs
	    &amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)   // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is paused while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // the third argument optionally receives the previous affinity; it is not needed here.
        if (SetThreadGroupAffinity(hThread, &amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    // let the worker run now that it is bound to the disk's NUMA node.
    if (ResumeThread(hThread) == (DWORD)-1)
        throw(GetLastError());
    return 1;
}
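&lt;/pre&gt; &lt;br /&gt;The test that guards SetThreadGroupAffinity in the sample above asks whether the thread's current affinity already lies entirely inside the disk's NUMA node: same group, and no processor bit outside the node's mask. The following portable sketch isolates that check. GroupMask is a hypothetical struct that mirrors only the Mask and Group fields of GROUP_AFFINITY, and IsWithinNode is an illustrative name, not a Windows API.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
<![CDATA[
```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two GROUP_AFFINITY fields the check uses.
struct GroupMask {
    std::uint64_t Mask;    // one bit per logical processor in the group
    std::uint16_t Group;   // processor group number
};

// True when every processor the thread may run on belongs to the node,
// in which case no affinity change is needed.
bool IsWithinNode(const GroupMask& thread, const GroupMask& node)
{
    assert(node.Mask != 0);   // a node with no processors is not a usable target
    return thread.Group == node.Group &&
           (thread.Mask & node.Mask) == thread.Mask;
}
```
]]>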
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization when creating a new I/O worker thread for a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
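    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    bool foundProcessor = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;
    // visit only valid processor numbers (0 .. numActiveProcs-1) and stop at the first match.
    for (DWORD n = 0; (n &amp;lt; numActiveProcs) &amp;amp;&amp;amp; !foundProcessor; n++)
    {
        processorNumber.Number = (BYTE)n;
        foundProcessor = (GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) != 0) &amp;amp;&amp;amp; (node == numaNode);
    }
    if (!foundProcessor)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No active processor found on the disk's NUMA node.\n&amp;quot;);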
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    // HeapAlloc does not set the last-error code, so report the failure explicitly.
    if (!pAttributeList)
        throw((DWORD)ERROR_OUTOFMEMORY);
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    // CreateRemoteThreadEx returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.  For each processor in the system, the sample looks up the processor's NUMA node and allocates virtual memory with that node as the preferred node.  The VirtualAllocExNuma API asks the memory manager to back the allocation with physical pages from the specified node when they are available, so that threads running on that node get fast, local memory access (the &amp;quot;access&amp;quot; in non-uniform memory access).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
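  if (!GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      // Nothing is allocated yet; return directly, because &amp;quot;goto Exit&amp;quot; here would skip the initialization of Buffers below.
      return;
  }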
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Fri, 23 Jan 2009 00:57:06 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090123A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=51</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread of execution.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
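   // How many processors per GROUP?
   { 
        // Query the group count again here: the variable in the previous snippet is scoped to its own block.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }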
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
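          case RelationProcessorCore:
        	// In the released SDK, GroupMask is an array of GroupCount entries; a core always reports exactly one.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;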
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
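          case RelationProcessorPackage:
        	// A package can span processor groups, so report every GroupMask entry.
        	for (WORD g = 0; g &amp;lt; ptr-&amp;gt;Processor.GroupCount; g++)
        	     Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[g].Group,
        	           ptr-&amp;gt;Processor.GroupMask[g].Mask);
          break;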
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    0,                 		// use default creation flags 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
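    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());	
			
    // The thread was created with default flags, so it is already running
    // while we check and adjust its NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    // Move the thread only if it is in another group or may run outside the disk's node.
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The released SDK signature takes a third, optional parameter that receives the previous affinity.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    return 1;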
}
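
// ILLUSTRATIVE SKETCH (not part of the original sample): one way to make sure the
// worker never runs before its affinity is set is to create it suspended, bind it
// to the disk's NUMA node, and only then resume it.
// StartThreadOnNode is a hypothetical helper name; the third argument passed to
// SetThreadGroupAffinity assumes the released Windows 7 SDK signature.
HANDLE StartThreadOnNode(LPTHREAD_START_ROUTINE pfnThread, LPVOID pContext, const GROUP_AFFINITY* pNodeAffinity)
{
    DWORD dwThreadID = 0;
    HANDLE hThread = CreateThread(NULL, 0, pfnThread, pContext, CREATE_SUSPENDED, &amp;amp;dwThreadID);
    if (hThread == NULL)
        return NULL;

    // Bind the still-suspended thread to the node's processors, then let it run.
    if (SetThreadGroupAffinity(hThread, pNodeAffinity, NULL) == 0 ||
        ResumeThread(hThread) == (DWORD)-1)
    {
        DWORD dwError = GetLastError();
        TerminateThread(hThread, dwError);   // the thread has not run any user code yet
        CloseHandle(hThread);
        SetLastError(dwError);
        return NULL;
    }
    return hThread;   // the caller owns the handle and must call CloseHandle
}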
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the anticipation is that the resultant thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;	// reserved field must be zero
    // Pre-increment with a strict comparison keeps the probe within the group's active processors.
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; ((++processorNumber.Number) &amp;lt; numActiveProcs));
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)   // SIZE_T is unsigned, so only zero signals a failed size query
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)   // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
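
// ILLUSTRATIVE SKETCH (not part of the original sample): the ideal processor can also be
// taken directly from the node's affinity mask, because each set bit in GROUP_AFFINITY.Mask
// is a processor number within that group. FirstProcessorInMask is a hypothetical helper name.
BOOL FirstProcessorInMask(const GROUP_AFFINITY* pAffinity, PROCESSOR_NUMBER* pProcessor)
{
    for (BYTE number = 0; number &amp;lt; sizeof(KAFFINITY) * 8; number++)
    {
        if (pAffinity-&amp;gt;Mask &amp;amp; ((KAFFINITY)1 &amp;lt;&amp;lt; number))
        {
            pProcessor-&amp;gt;Group = pAffinity-&amp;gt;Group;
            pProcessor-&amp;gt;Number = number;
            pProcessor-&amp;gt;Reserved = 0;
            return TRUE;
        }
    }
    return FALSE;   // empty mask: the node has no processors in this group
}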
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, virtual memory is allocated with that processor's NUMA node as the preferred node.  VirtualAllocExNuma asks the memory manager to back the allocation with physical pages from the preferred node whenever they are available, so that the processor works on memory that is &amp;quot;near&amp;quot; to it (the &amp;quot;access&amp;quot; in non-uniform memory &amp;quot;access&amp;quot;).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
  PVOID* Buffers = NULL;    // declared before the first goto Exit so that no jump skips its initialization
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  // AllocationSize and PageSize are assumed to be globals declared elsewhere in the sample.
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      goto Exit;
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
 
  //
  // Allocate array of pointers to memory blocks.
  //
  Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);		// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
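
// ILLUSTRATIVE SKETCH (not part of the original sample): before choosing a preferred node,
// an application can ask how much memory each node currently has available.
// DumpNumaAvailableMemory is a hypothetical helper name; Display is the sample's own output routine.
void DumpNumaAvailableMemory()
{
    ULONG HighestNodeNumber = 0;
    if (!GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
        return;

    for (USHORT node = 0; node &amp;lt;= HighestNodeNumber; node++)
    {
        ULONGLONG AvailableBytes = 0;
        // The highest node number need not equal the node count, so a failed query is simply skipped.
        if (GetNumaAvailableMemoryNodeEx(node, &amp;amp;AvailableBytes))
            Display(L&amp;quot;Node %u: %I64u MB available\r\n&amp;quot;, (ULONG)node, AvailableBytes / (1024 * 1024));
    }
}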
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Mon, 19 Jan 2009 14:32:47 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090119P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=50</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exported by &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor in which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node, where the node is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
        // Queried again here because the variable in the previous block is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
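 
   // ILLUSTRATIVE SKETCH (not part of the original sample): how many logical processors in the whole system?
   // Passing ALL_PROCESSOR_GROUPS asks for the total across every group rather than for a single group.
   { 
        DWORD dwTotalMaximum = GetMaximumProcessorCount(ALL_PROCESSOR_GROUPS);
        DWORD dwTotalActive = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        Display(L&amp;quot;System-wide \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, dwTotalMaximum, dwTotalActive);
   }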
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask.Group,
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
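 
    // ILLUSTRATIVE SKETCH (not part of the original sample): GetCacheType, used in the RelationCache
    // case above, is not shown in this listing. One possible implementation, defined at file scope
    // before its first use, maps the PROCESSOR_CACHE_TYPE enumeration to a display string.
    const wchar_t* GetCacheType(PROCESSOR_CACHE_TYPE type)
    {
        switch (type)
        {
          case CacheUnified:     return L&amp;quot;Unified&amp;quot;;
          case CacheInstruction: return L&amp;quot;Instruction&amp;quot;;
          case CacheData:        return L&amp;quot;Data&amp;quot;;
          case CacheTrace:       return L&amp;quot;Trace&amp;quot;;
          default:               return L&amp;quot;Unknown&amp;quot;;
        }
    }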
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
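    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // NOTE: numaNode is a local variable; production code should pass data that outlives this function.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended; resumed below
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    // Move the thread only if it is in another group or may run outside the disk's node.
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous affinity.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    ResumeThread(hThread);
    return 1;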
}
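 
// ILLUSTRATIVE SKETCH ONLY - not part of the original sample.
// MaskIsSubset is a hypothetical helper that restates the affinity test used above:
// it returns true when every processor in 'inner' (for example, the thread's mask)
// is also in 'outer' (for example, the disk node's mask). It assumes both masks
// belong to the same processor group.
static bool MaskIsSubset(unsigned long long inner, unsigned long long outer)
{
    // Adding the bits of 'inner' to 'outer' changes nothing only if they were all present.
    return (inner | outer) == outer;
}
// With such a helper, the thread is left alone only when the groups match and
// MaskIsSubset(groupAffinityThread.Mask, groupAffinityDisk.Mask) is true.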
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
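    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    ZeroMemory(&amp;amp;processorNumber, sizeof(processorNumber));   // also clears the Reserved field
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    // Processor numbers in a group run from 0 to numActiveProcs - 1; stop at the first one on the disk's node.
    while (processorNumber.Number &amp;lt; numActiveProcs)
    {
        if (GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode))
            break;
        processorNumber.Number++;
    }
    if (processorNumber.Number &amp;gt;= numActiveProcs)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found on the disk's NUMA node.\n&amp;quot;);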
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)   // CreateRemoteThreadEx returns NULL (not INVALID_HANDLE_VALUE) on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
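 
// ILLUSTRATIVE SKETCH ONLY - not part of the original sample.
// LowestProcessorInMask is a hypothetical helper offered as an alternative to the
// processor search loop above, under the assumption that any processor in the disk's
// node is an acceptable ideal processor. It returns the index of the lowest processor
// present in an affinity mask (for example, groupAffinityDisk.Mask), or -1 if the
// mask is empty.
static int LowestProcessorInMask(unsigned long long mask)
{
    int index = 0;
    if (mask == 0)
        return -1;               // no processor in the mask
    while ((mask % 2) == 0)      // lowest bit clear: processor 'index' is not in the mask
    {
        mask /= 2;               // drop the lowest bit and look at the next processor
        index++;
    }
    return index;
}
// Possible use: processorNumber.Number = (BYTE)LowestProcessorInMask(groupAffinityDisk.Mask);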
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, virtual memory is allocated with that processor's NUMA node as the preferred node.  The VirtualAllocExNuma API asks the memory manager to back the allocation with physical pages from the preferred node, that is, from memory &amp;quot;near&amp;quot; the processor, which gives efficient &amp;quot;access&amp;quot; (as in non-uniform memory &amp;quot;access&amp;quot;).  If the preferred node is out of pages, pages from other nodes are used.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      return;   // nothing is allocated yet; a goto to Exit here would skip the initialization of Buffers
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
  //
  // Allocate array of pointers to memory blocks.
  //
 
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);					// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
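 
// ILLUSTRATIVE SKETCH ONLY - not part of the SDK sample.
// PagesSpanned is a hypothetical helper that restates the NumPages computation above:
// the number of pages touched by a buffer of 'size' bytes (size must be nonzero)
// that starts at address 'start', for a page size of 'pageSize' bytes.
static unsigned long long PagesSpanned(unsigned long long start,
                                       unsigned long long size,
                                       unsigned long long pageSize)
{
    unsigned long long last = start + size - 1;       // address of the last byte in the buffer
    return last / pageSize - start / pageSize + 1;    // last page index - first page index + 1
}
// For example, a page-aligned 16 MB buffer with 4 KB pages spans 4096 pages,
// while a 2-byte buffer that straddles a page boundary spans 2 pages.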
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Sun, 18 Jan 2009 16:46:32 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090118P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=49</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions may need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems.  Parallel Computing and High Performance Computing solution developers may also find NUMA awareness essential for performance scalability.
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
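 
   // ILLUSTRATIVE SKETCH ONLY - not part of the original sample.
   // MinimumGroupCount is a hypothetical helper giving a lower bound on the number of
   // processor groups for a given number of logical processors, assuming each group
   // holds at most 64 of them. The count reported by GetActiveProcessorGroupCount
   // can be higher, because entire NUMA nodes are assigned to the same group.
   static unsigned MinimumGroupCount(unsigned logicalProcessors)
   {
         return (logicalProcessors + 63) / 64;   // round up to whole groups of 64
   }
   // For example, 128 logical processors need at least 2 groups, and 65 also need 2.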
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
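   // How many processors per GROUP?
   { 
        // Query the group count here; the variable in the previous snippet is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }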
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
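          case RelationProcessorCore:
        	// In the released SDK, GroupMask is an array; a core belongs to one group, so entry 0 is used.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;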
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
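          case RelationProcessorPackage:
        	// In the released SDK, GroupMask is an array of GroupCount entries; only the first is shown here.
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;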
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = NULL;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
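    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // NOTE: numaNode is a local variable, so the pointer passed below is only valid
    // until this function returns; production code should pass longer-lived data.
    hThread = CreateThread(
        NULL,                   // default security attributes
        0,                      // use default stack size
        DemoThreadFunction,     // thread function name
        &amp;amp;numaNode,              // argument to thread function
        CREATE_SUSPENDED,       // do not run until ResumeThread is called
        &amp;amp;dwThreadID);           // returns the thread identifier

    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());

    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
        dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);

    // Move the thread only if it is in another group or not confined to the disk's node.
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous affinity.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    ResumeThread(hThread);      // let the worker run with its final affinity
    CloseHandle(hThread);       // the thread keeps running; only this handle is released
    return 1;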
}
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = NULL;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
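    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    DWORD procIndex = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    ZeroMemory(&amp;amp;processorNumber, sizeof(processorNumber));   // also clears the Reserved field
    processorNumber.Group = groupAffinityDisk.Group;
    for (procIndex = 0; procIndex &amp;lt; numActiveProcs; procIndex++)
    {
        processorNumber.Number = (BYTE)procIndex;
        if (GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode))
            break;
    }
    if (procIndex == numActiveProcs)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found on the disk's NUMA node.\n&amp;quot;);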
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    // CreateRemoteThreadEx returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());

    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());

    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
        dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    CloseHandle(hThread);       // the thread keeps running; only this handle is released
    return 1;
}
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, virtual memory is allocated with the NUMA node of that processor as the preferred node.  The VirtualAllocExNuma API requests that physical pages come from memory &amp;quot;near&amp;quot; the specified node, falling back to other nodes only if the preferred node runs out of pages, thus gaining efficient &amp;quot;access&amp;quot; (as in non-uniform memory &amp;quot;access&amp;quot;).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
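  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      // Nothing has been allocated yet. Returning here also avoids a goto that
      // would jump past the initialization of Buffers, which C++ does not allow.
      return;
  }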
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
  //
  // Allocate array of pointers to memory blocks.
  //
 
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);					// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7" class="externalLink"&gt;http://channel9.msdn.com/posts/philpenn/New-NUMA-Support-with-Windows-Server-2008-R2-and-Windows-7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Wed, 14 Jan 2009 08:15:16 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090114A</guid></item><item><title>UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008)</title><link>http://code.msdn.microsoft.com/64plusLP/Release/ProjectReleases.aspx?ReleaseId=1979</link><description></description><author></author><pubDate>Sat, 03 Jan 2009 03:04:15 GMT</pubDate><guid isPermaLink="false">UPDATED RELEASE: Win7NumaSamples (Dec 26, 2008) 20090103A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=48</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
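   { 
        // Query the active group count here; the variable in the previous snippet is out of scope.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }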
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
    //        RelationGroup
    //               The specified logical processors share a single processor group.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
            // GroupMask is an array in the released SDK headers; this demo prints only its first entry.
            Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
                       ptr-&amp;gt;Processor.GroupMask[0].Group, 
                       ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
 
          case RelationNumaNode:
            Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
            // GroupMask is an array in the released SDK headers; this demo prints only its first entry.
            Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
                       ptr-&amp;gt;Processor.GroupMask[0].Group,
                       ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
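 
// ILLUSTRATIVE SKETCH (not part of the original sample): CountProcessorsInMask is a
// hypothetical helper, meant to live at file scope, that counts the logical processors
// named by one of the KAFFINITY masks printed above. It assumes the mask fits in an
// unsigned 64-bit integer, as on 64-bit Windows. For a RelationProcessorCore entry on a
// core with SMT enabled, the count would typically be 2.
static int CountProcessorsInMask(unsigned long long mask)
{
    int count = 0;
    while (mask != 0)
    {
        count += (int)(mask % 2);    // add the lowest bit
        mask /= 2;                   // then drop it
    }
    return count;
}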
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended so the affinity can be set before the thread runs
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is paused while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous affinity; NULL ignores it.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    ResumeThread(hThread);    // let the worker run now that its affinity is set
    return 1;
}
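 
// ILLUSTRATIVE SKETCH (not part of the original sample): MaskIsSubset is a
// hypothetical helper that restates the mask half of the affinity test above.
// It assumes KAFFINITY fits in an unsigned 64-bit integer, as on 64-bit Windows.
// Every bit of inner is also set in outer exactly when OR-ing inner into outer
// adds no new bits.
static bool MaskIsSubset(unsigned long long inner, unsigned long long outer)
{
    return (inner | outer) == outer;
}
// Possible use: the thread needs no adjustment when the groups match and
// MaskIsSubset(groupAffinityThread.Mask, groupAffinityDisk.Mask) is true.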
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
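    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0xFFFF;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;
    for (DWORD n = 0; n &amp;lt; numActiveProcs; n++)    // valid processor numbers are 0 .. numActiveProcs-1
    {
        processorNumber.Number = (BYTE)n;
        if (GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode))
            break;
    }
    if (node != numaNode)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found on the disk's NUMA node.\n&amp;quot;);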
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
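 
// ILLUSTRATIVE SKETCH (not part of the original sample): LowestProcessorInMask
// is a hypothetical helper. Because GetNumaNodeProcessorMaskEx already returns
// the mask of the node's processors, the ideal processor number could also be
// read directly from groupAffinityDisk.Mask, where bit n stands for processor n
// of the group. Assumes the mask fits in an unsigned 64-bit integer.
// Returns -1 for an empty mask.
static int LowestProcessorInMask(unsigned long long mask)
{
    if (mask == 0)
        return -1;
    int index = 0;
    while (mask % 2 == 0)    // lowest bit clear: move on to the next processor
    {
        mask /= 2;
        index++;
    }
    return index;
}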
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   For each processor, virtual memory is allocated with that processor's NUMA node as the preferred node.  The VirtualAllocExNuma API requests that physical pages come from memory &amp;quot;near&amp;quot; the specified node, which makes access faster (the &amp;quot;access&amp;quot; in non-uniform memory access); if the preferred node runs out of pages, the memory manager uses pages from other nodes.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
  ULONG NumberOfProcessors;
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Declare the buffer array before the first goto so that no jump skips its initialization.
  //
  PVOID* Buffers = NULL;
 
  //
  // Get the highest node number.
  //
  if (!GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      goto Exit;
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
  //
  // Allocate array of pointers to memory blocks.
  //
 
  Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (!GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);					// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
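 
// ILLUSTRATIVE SKETCH (not part of the original sample): PagesSpanned is a
// hypothetical helper that isolates the NumPages arithmetic used above.
// A buffer touches every page from the one holding its first byte to the one
// holding its last byte, so an unaligned buffer can span one page more than
// size / pageSize. Assumes size and pageSize are both nonzero.
static unsigned long long PagesSpanned(unsigned long long start,
                                       unsigned long long size,
                                       unsigned long long pageSize)
{
    unsigned long long last = start + size - 1;    // address of the final byte
    return last / pageSize - start / pageSize + 1;
}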
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Sat, 03 Jan 2009 02:53:27 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090103A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=47</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many high-end server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on such systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;{*}Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified logical processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
	           ptr-&amp;gt;Processor.GroupMask.Group,
	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
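&lt;/pre&gt; &lt;br /&gt;The loop above advances by each entry's own Size field rather than by sizeof, because the entries are variable length.  The following is a minimal, portable sketch of that idiom and is not part of the SDK sample: RecordHeader, CountRecords and AppendRecord are hypothetical names standing in for SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX and the code that walks it, so that the traversal can be tried without the Windows API.  It also adds two bounds checks that such a walk may want before trusting an entry.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstring&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical record header: each record starts with its type and its total size in bytes.
struct RecordHeader {
    std::uint32_t Relationship;
    std::uint32_t Size;   // header plus payload
};

// Counts the records of type 'wanted' in the first 'length' bytes of 'buffer'.
// The offset advances by each record's own Size, never by sizeof(RecordHeader).
std::size_t CountRecords(const unsigned char* buffer, std::size_t length, std::uint32_t wanted)
{
    std::size_t count = 0;
    std::size_t offset = 0;
    while (offset + sizeof(RecordHeader) &amp;lt;= length) {
        RecordHeader header;
        std::memcpy(&amp;amp;header, buffer + offset, sizeof(header));   // copy out to avoid unaligned reads
        if (header.Size &amp;lt; sizeof(RecordHeader) || header.Size &amp;gt; length - offset)
            break;                                                // malformed or truncated record
        if (header.Relationship == wanted)
            ++count;
        offset += header.Size;
    }
    return count;
}

// Appends one record of the given type with 'payloadBytes' zeroed payload bytes (helper for experiments).
void AppendRecord(std::vector&amp;lt;unsigned char&amp;gt;&amp;amp; buffer, std::uint32_t type, std::uint32_t payloadBytes)
{
    RecordHeader header = { type, static_cast&amp;lt;std::uint32_t&amp;gt;(sizeof(RecordHeader)) + payloadBytes };
    const unsigned char* raw = reinterpret_cast&amp;lt;const unsigned char*&amp;gt;(&amp;amp;header);
    buffer.insert(buffer.end(), raw, raw + sizeof(header));
    buffer.insert(buffer.end(), static_cast&amp;lt;std::size_t&amp;gt;(payloadBytes), static_cast&amp;lt;unsigned char&amp;gt;(0));
}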
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
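	    &amp;amp;numaNode,          	// argument to thread function (a local: the thread must not use it after this function returns)
	    CREATE_SUSPENDED,  		// start suspended so the affinity can be set before the thread runs
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)                    // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third argument optionally receives the previous affinity; it is not needed here.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    // Let the thread run now that its affinity is set.
    ResumeThread(hThread);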
    return 1;
}
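&lt;/pre&gt; &lt;br /&gt;The comparison in the sample above asks whether the thread's processor mask is a subset of the disk node's mask within the same group.  As a minimal sketch of that check in isolation, the following uses a hypothetical GroupMask struct in place of GROUP_AFFINITY and a hypothetical helper name, ThreadIsConfinedToNode.  The form (thread.Mask &amp;amp; ~node.Mask) == 0 used here is equivalent to the sample's (thread.Mask &amp;amp; node.Mask) == thread.Mask.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

// Hypothetical stand-in for GROUP_AFFINITY: a group number plus a 64-bit processor mask.
struct GroupMask {
    std::uint16_t Group;
    std::uint64_t Mask;
};

// True when every processor the thread may run on belongs to the node:
// same group, and no thread bit falls outside the node mask.
bool ThreadIsConfinedToNode(const GroupMask&amp;amp; thread, const GroupMask&amp;amp; node)
{
    return thread.Group == node.Group &amp;amp;&amp;amp; (thread.Mask &amp;amp; ~node.Mask) == 0;
}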
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;
    // Probe processor numbers 0 .. numActiveProcs-1 and stop at the first one on the disk's node.
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; ((++processorNumber.Number) &amp;lt; numActiveProcs));
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)                    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
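&lt;/pre&gt; &lt;br /&gt;GetNumaNodeProcessorMaskEx has already returned the node's processors as a bit mask (groupAffinityDisk.Mask), so one alternative to probing processor numbers one at a time is to take the lowest set bit of that mask.  This is a substitute technique rather than what the sample above does.  LowestProcessorInMask is a hypothetical helper, and the sketch assumes that the mask fits in 64 bits and that bit N corresponds to processor number N within the group.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;cstdint&amp;gt;

// Returns the index of the lowest set bit in 'mask' (the lowest-numbered processor
// in the node), or -1 when the mask is empty.
int LowestProcessorInMask(std::uint64_t mask)
{
    if (mask == 0)
        return -1;
    int index = 0;
    while ((mask &amp;amp; 1) == 0) {
        mask &amp;gt;&amp;gt;= 1;
        ++index;
    }
    return index;
}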
&lt;/pre&gt; &lt;br /&gt;The following preliminary SDK sample illustrates NUMA memory allocation.   Virtual memory is allocated for each processor from that processor's NUMA node.  The VirtualAllocExNuma API takes a preferred node, so physical pages are drawn from memory &amp;quot;near&amp;quot; that node whenever it has pages available, giving the processor efficient &amp;quot;access&amp;quot; (as in non-uniform memory access).&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
void AllocMemNumaNode(SIZE_T nAllocationSize=0)
{
  ULONG HighestNodeNumber;
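  ULONG NumberOfProcessors;
  // Note: AllocationSize and PageSize are not declared in this excerpt; they are assumed to be declared elsewhere in the sample.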
 
  Display(L&amp;quot;\nAllocMemNumaNode results:\n&amp;quot;);
 
  if (nAllocationSize != 0)
    AllocationSize = nAllocationSize;
  else
    AllocationSize = 16*1024*1024;
 
  //
  // Get the number of processors and system page size.
  //
  SYSTEM_INFO SystemInfo;
  GetSystemInfo (&amp;amp;SystemInfo);
  NumberOfProcessors = SystemInfo.dwNumberOfProcessors;
  PageSize = SystemInfo.dwPageSize;
 
  //
  // Get the highest node number.
  //
  if (TRUE != GetNumaHighestNodeNumber(&amp;amp;HighestNodeNumber))
  {
      Display(L&amp;quot;GetNumaHighestNodeNumber failed: 0x%x\r\n&amp;quot;, GetLastError());
      return;   // nothing is allocated yet, and a goto here would jump past the initialization of Buffers
  }
 
  if (HighestNodeNumber == 0)
  {
      Display(L&amp;quot;\nThis is not a NUMA system - but let's continue anyway...\n&amp;quot;);
  }
  //
  // Allocate array of pointers to memory blocks.
  //
 
  PVOID* Buffers = (PVOID*) malloc (sizeof(PVOID)*NumberOfProcessors);
  if (Buffers == NULL)
  {
      Display(L&amp;quot;Allocating array of buffers failed&amp;quot;);
      goto Exit;
  }
 
  ZeroMemory (Buffers, sizeof(PVOID)*NumberOfProcessors);
 
  //
  // For each processor, get its associated NUMA node and allocate some memory from it.
  //
  for (UCHAR i = 0; i &amp;lt; NumberOfProcessors; i++)
  {
      UCHAR NodeNumber;
 
      if (TRUE != GetNumaProcessorNode (i, &amp;amp;NodeNumber))
      {
          Display(L&amp;quot;GetNumaProcessorNode failed: 0x%x\r\n&amp;quot;, GetLastError());
          goto Exit;
      }
 
      Display(L&amp;quot;CPU %u: node %u\r\n&amp;quot;, (ULONG)i, NodeNumber);
 
      PCHAR Buffer = (PCHAR)VirtualAllocExNuma(
          GetCurrentProcess(),
          NULL,
          AllocationSize,
          MEM_RESERVE | MEM_COMMIT,
          PAGE_READWRITE,
          NodeNumber);					// The NUMA node where memory should reside.
 
      if (Buffer == NULL)
      {
          Display(L&amp;quot;VirtualAllocExNuma failed: 0x%x, node %u\r\n&amp;quot;, GetLastError(), NodeNumber);
          goto Exit;
      }
 
      PCHAR BufferEnd = Buffer + AllocationSize - 1;
      SIZE_T NumPages = ((SIZE_T)BufferEnd)/PageSize - ((SIZE_T)Buffer)/PageSize + 1;
 
      Display(L&amp;quot;Allocated virtual memory:&amp;quot;);
      Display(L&amp;quot;%p - %p (%6Iu pages), preferred node %u\r\n&amp;quot;, Buffer, BufferEnd, NumPages, NodeNumber);
 
      Buffers[i] = Buffer;
 
      //
      // At this point, virtual pages are allocated but no valid physical
      // pages are associated with them yet.
      //
      // The FillMemory call below will touch every page in the buffer, faulting
      // them into our working set. When this happens physical pages will be allocated
      // from the preferred node we specified in VirtualAllocExNuma, or any node
      // if the preferred one is out of pages.
      //
      FillMemory(Buffer, AllocationSize, 'x');
 
      //
      // Check the actual node number for the physical pages that are still valid
      // (if system is low on physical memory, some pages could have been trimmed already).
      //
      DumpNumaNodeInfo(Buffer, AllocationSize);
 
      Display(L&amp;quot;&amp;quot;);
  }
 
Exit:
  if (Buffers != NULL)
  {
      for (UINT i = 0; i &amp;lt; NumberOfProcessors; i++)
      {
          if (Buffers[i] != NULL)
          {
              VirtualFree (Buffers[i], 0, MEM_RELEASE);
          }
      }
      free (Buffers);
  }
}
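&lt;/pre&gt; &lt;br /&gt;The NumPages computation in the sample above counts the pages a buffer touches from the page indexes of its first and last bytes.  As a small sketch of that arithmetic on its own, PagesSpanned below is a hypothetical helper that is not part of the SDK sample; the guard for an empty range is an addition, and pageSize is assumed to be nonzero.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Number of pages touched by the byte range [address, address + size).
// Returns 0 for an empty range; pageSize must be nonzero.
std::size_t PagesSpanned(std::uintptr_t address, std::size_t size, std::size_t pageSize)
{
    if (size == 0)
        return 0;
    std::uintptr_t last = address + size - 1;   // address of the final byte in the range
    return last / pageSize - address / pageSize + 1;
}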
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Sat, 03 Jan 2009 02:51:51 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20090103A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=46</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on 64&amp;#43; LP systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor on which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        BOOL rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_OUTOFMEMORY);    // malloc does not set the Win32 last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
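        	// GroupMask is an array of GroupCount entries; a processor core always belongs to exactly one group.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);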
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
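	// A package can span several groups (GroupCount entries in GroupMask); only the first entry is shown here.
	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
	           ptr-&amp;gt;Processor.GroupMask[0].Group,
	           ptr-&amp;gt;Processor.GroupMask[0].Mask);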
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows performance expert &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function (must remain valid while the thread uses it)
	    CREATE_SUSPENDED,  		// start suspended so the affinity can be adjusted first
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // the third parameter optionally receives the previous affinity; NULL ignores it.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    // let the thread run now that it is localized to the disk's NUMA node.
    ResumeThread(hThread);
    return 1;
}
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;
    // stop at the first processor in the disk's node, without stepping past the last active processor.
    while ((GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) == 0 || node != numaNode) &amp;amp;&amp;amp;
           (processorNumber.Number + 1u &amp;lt; numActiveProcs))
        processorNumber.Number++;
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Tue, 30 Dec 2008 03:41:49 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20081230A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=45</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on 64&amp;#43; LP systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor in which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_NOT_ENOUGH_MEMORY);  // malloc does not set the Win32 last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	// GroupMask is an array of GroupCount entries; this demo prints the first entry only.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
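        	// A package can span groups; GroupMask has GroupCount entries and this demo prints the first.
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask[0].Group,
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);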
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows Performance Engineer &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Mark Friedman &lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
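    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // Note: in production code the thread argument must outlive this function's stack frame.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended; resumed below 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)    // CreateThread returns NULL on failure
        throw(GetLastError());	
			
    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    // Move the thread only if it is in another group or may run outside the disk's node.
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    if (ResumeThread(hThread) == (DWORD)-1)
        throw(GetLastError());
    return 1;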
}
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;
    // Probe processors 0 .. numActiveProcs-1 and stop at the first one on the disk's node.
    do {
        GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node); 
    } while ((node != numaNode) &amp;amp;&amp;amp; ((++processorNumber.Number) &amp;lt; numActiveProcs));
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)    // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Tue, 30 Dec 2008 03:34:25 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20081230A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=44</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   Within the near future, a system with 4 CPU sockets, 8 processor-cores per socket and with Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core, will achieve 64 Logical Processors.   Many server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on 64&amp;#43; LP systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;{*}Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor in which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
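        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();  // re-queried here: the earlier variable is scoped to its own block
        for (WORD groupnum = 0; groupnum &amp;lt; wActiveProcessorGroupCount; groupnum++)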
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_NOT_ENOUGH_MEMORY);  // malloc does not set the Win32 last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	// GroupMask is an array of GroupCount entries; this demo prints the first entry only.
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Group, 
        	           ptr-&amp;gt;Processor.GroupMask[0].Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask.Group,
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows Perf Engineer &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Rick Vicik&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
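    // The thread is created suspended so that its affinity can be adjusted before it runs.
    // NOTE: numaNode is a local variable; production code must pass an argument that outlives this function.
    hThread = CreateThread(
	    NULL,    		// default security attributes
	    0,         			// use default stack size  
	    DemoThreadFunction,  	// thread function name
	    &amp;amp;numaNode,          	// argument to thread function 
	    CREATE_SUSPENDED,  		// start suspended; resumed below 
	    &amp;amp;dwThreadID);   	                // returns the thread identifier 
	
    if (hThread == NULL)   // CreateThread returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    // Thread is paused while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);
					
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // the third argument optionally receives the previous affinity; it is not needed here.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    ResumeThread(hThread);
    return 1;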
}
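 
// A hedged sketch, not part of the original sample: an existing thread can also be given an
// ideal processor after creation by calling SetThreadIdealProcessorEx. The helper name
// SetIdealProcessorOnNode is hypothetical; it simply picks the lowest-numbered processor in
// the node's mask, and it assumes the thread's group affinity already matches the node's group.
BOOL SetIdealProcessorOnNode(HANDLE hThread, USHORT numaNode)
{
    GROUP_AFFINITY nodeAffinity;
    PROCESSOR_NUMBER ideal = {0};
    BYTE number = 0;
 
    if (!GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;nodeAffinity) || (nodeAffinity.Mask == 0))
        return FALSE;
 
    // find the lowest set bit in the node's processor mask.
    while (((nodeAffinity.Mask &amp;gt;&amp;gt; number) &amp;amp; 1) == 0)
        number++;
 
    ideal.Group = nodeAffinity.Group;
    ideal.Number = number;
    return SetThreadIdealProcessorEx(hThread, &amp;amp;ideal, NULL);
}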
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
    // choose one processor within the disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    bool found = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;
    for (DWORD n = 0; (n &amp;lt; numActiveProcs) &amp;amp;&amp;amp; !found; n++) {
        processorNumber.Number = (BYTE)n;
        found = GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode);
    }
    if (!found)
        throw((DWORD)ERROR_NOT_FOUND);   // no active processor in this group maps to the disk's node
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if(sizeToAlloc &amp;lt;= 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)   // CreateRemoteThreadEx returns NULL on failure, not INVALID_HANDLE_VALUE
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7" class="externalLink"&gt;http://code.msdn.microsoft.com/Project/ProjectDirectory.aspx?TagName=Windows%2b7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Mon, 29 Dec 2008 22:59:57 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20081229P</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=43</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor cores per socket, and Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled on each core will reach 64 Logical Processors.   Many server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on 64&amp;#43; LP systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a hardware thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor in which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node, where the node number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node whose number is specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
        // queried again here because the variable declared in the previous snippet is out of scope.
        WORD wGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &amp;lt; wGroupCount; groupnum++)
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
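 
   // A hedged sketch, not part of the original sample: which GROUP, logical processor, and
   // NUMA node is the calling thread running on? The answer is only a snapshot, because the
   // scheduler may migrate the thread at any time.
   { 
        PROCESSOR_NUMBER procNumber;
        USHORT nodeNumber = 0;
        GetCurrentProcessorNumberEx(&amp;amp;procNumber);
        if (GetNumaProcessorNodeEx(&amp;amp;procNumber, &amp;amp;nodeNumber))
            Display(L&amp;quot;GROUP=0x%04X \tProcessor=0x%02X \tNUMAnode=0x%04X\n&amp;quot;, procNumber.Group, procNumber.Number, nodeNumber);
   }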
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
    //        RelationGroup
    //               The specified logical processors share a single processor group.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw((DWORD)ERROR_OUTOFMEMORY);   // malloc does not set the last-error code
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
        	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Processor.GroupMask.Group,
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
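 
    // A hedged sketch, not part of the original sample: passing RelationNumaNode instead of
    // RelationAll limits the result to NUMA node records, which makes counting nodes simple.
    // For brevity it assumes the topology does not change between the two calls.
    {
        DWORD length = 0;
        DWORD nodeCount = 0;
        GetLogicalProcessorInformationEx(RelationNumaNode, NULL, &amp;amp;length);   // first call only reports the required size
        PUCHAR nodeBuffer = (PUCHAR)malloc(length);
        if (nodeBuffer &amp;amp;&amp;amp; GetLogicalProcessorInformationEx(RelationNumaNode, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)nodeBuffer, &amp;amp;length))
        {
            // each record carries its own Size, so step through the buffer record by record.
            for (DWORD offset = 0; offset &amp;lt; length; nodeCount++)
                offset += ((PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(nodeBuffer + offset))-&amp;gt;Size;
            Display(L&amp;quot;NUMA node count=%d\n&amp;quot;, nodeCount);
        }
        free(nodeBuffer);
    }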
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows Perf Engineer &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Rick Vicik&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
			
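    // Create the thread suspended so that it cannot run before its affinity is set.
    // NOTE: numaNode is a stack local; production code must keep the thread
    // argument alive for the lifetime of the thread.
    hThread = CreateThread(
        NULL,                // default security attributes
        0,                   // use default stack size
        DemoThreadFunction,  // thread function name
        &amp;numaNode,           // argument to thread function
        CREATE_SUSPENDED,    // thread stays paused until ResumeThread
        &amp;dwThreadID);        // returns the thread identifier

    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());

    // Thread is paused while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%08X\n\n&amp;quot;, 
        dwThreadID, groupAffinityThread.Group, groupAffinityThread.Mask);

    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter optionally receives the previous group affinity.
        if (SetThreadGroupAffinity(hThread, &amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }
    ResumeThread(hThread);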
    return 1;
}
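&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;An existing thread can also be given an ideal processor, which complements the affinity change shown above.  The following is a minimal sketch rather than part of the sample download: the helper name SetIdealProcessorInNode is hypothetical, and it assumes the caller passes a valid thread handle and a GROUP_AFFINITY obtained from GetNumaNodeProcessorMaskEx.  It simply picks the lowest-numbered logical processor in the node's mask.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
// Hypothetical helper (illustration only, not part of the sample download).
BOOL SetIdealProcessorInNode(HANDLE hThread, const GROUP_AFFINITY* pNodeAffinity)
{
    if (!hThread || !pNodeAffinity || pNodeAffinity-&gt;Mask == 0)
        return FALSE;

    // Find the lowest-numbered logical processor present in the node's mask.
    // The mask is non-zero here, so the loop always terminates.
    BYTE number = 0;
    KAFFINITY mask = pNodeAffinity-&gt;Mask;
    while ((mask &amp; 1) == 0)
    {
        mask &gt;&gt;= 1;
        number++;
    }

    PROCESSOR_NUMBER ideal;
    ideal.Group = pNodeAffinity-&gt;Group;
    ideal.Number = number;
    ideal.Reserved = 0;

    // The third parameter optionally receives the previous ideal processor.
    return SetThreadIdealProcessorEx(hThread, &amp;ideal, NULL);
}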
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%08X\n&amp;quot;, 
	(const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, groupAffinityDisk.Mask);
	
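    // choose one processor within the Disk's NUMA node for the ideal processor number.
    USHORT node = 0;
    bool found = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Number = 0;
    processorNumber.Reserved = 0;
    // Scan processor numbers 0 .. numActiveProcs-1 and stop at the first one in the disk's node.
    for (DWORD i = 0; (i &lt; numActiveProcs) &amp;&amp; !found; i++)
    {
        processorNumber.Number = (BYTE)i;
        found = GetNumaProcessorNodeEx(&amp;processorNumber, &amp;node) &amp;&amp; (node == numaNode);
    }
    if (!found)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found in the disk's NUMA node.\n&amp;quot;);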
	
    // first call returns the size required for 2 attributes.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);  
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)   // SIZE_T is unsigned, so only zero indicates failure
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread on the specified ideal processor or same Numa node as ideal processor.	
    hThread = CreateRemoteThreadEx(
  		GetCurrentProcess(),	                // target process handle
  		NULL,    			// default security attributes
  		0,         			// use default stack size  
  		DemoThreadFunction,  	// thread function name
  		&amp;amp;numaNode,          		// argument to thread function 
  		0,                 		// use default creation flags
  		pAttributeList,		// additional parameters for the new thread. 
  		&amp;amp;dwThreadID);   		// returns the thread identifier 
	
    if (hThread == NULL)   // CreateRemoteThreadEx returns NULL on failure
        throw(GetLastError());	
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread);   
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%08X, IdealProcessor=0x%02X\n\n&amp;quot;, 
	dwThreadID, groupAffinityThread.Group, numaNode, groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/project/projectdirectory.aspx?TagName=Windows+7" class="externalLink"&gt;http://code.msdn.microsoft.com/project/projectdirectory.aspx?TagName=Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://Channel9.msdn.com/tags/Windows+7" class="externalLink"&gt;http://Channel9.msdn.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://Edge.TechNet.com/tags/Windows+7" class="externalLink"&gt;http://Edge.TechNet.com/tags/Windows+7&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;  &lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;</description><author>philpenn</author><pubDate>Mon, 29 Dec 2008 21:05:21 GMT</pubDate><guid isPermaLink="false">UPDATED WIKI: Home 20081229P</guid></item><item><title>CREATED RELEASE: Win7NumaSamples (Dec 26, 2008)</title><link>http://code.msdn.microsoft.com/64plusLP/Release/ProjectReleases.aspx?ReleaseId=1979</link><description></description><author></author><pubDate>Sat, 27 Dec 2008 02:58:39 GMT</pubDate><guid isPermaLink="false">CREATED RELEASE: Win7NumaSamples (Dec 26, 2008) 20081227A</guid></item><item><title>UPDATED WIKI: Home</title><link>http://code.msdn.microsoft.com/64plusLP/Wiki/View.aspx?title=Home&amp;version=42</link><description>&lt;div class="wikidoc"&gt;
&lt;h1&gt;
New NUMA Support with Windows Server 2008 R2 and Windows 7
&lt;/h1&gt;The 64-bit versions of Windows 7 and Windows Server 2008 R2 support more than 64 Logical Processors &amp;#40;LP&amp;#41; on a single computer.  New processors are now appearing that leverage non-uniform memory access &amp;#40;NUMA&amp;#41; architectures.   In the near future, a system with 4 CPU sockets, 8 processor-cores per socket and Simultaneous Multi-Threading &amp;#40;SMT&amp;#41; enabled per core will achieve 64 Logical Processors.   Many server-class solutions will need to be architected with NUMA awareness in order to achieve linear performance scaling on 64&amp;#43; LP systems. 
&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Abstract*&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;The traditional model for multiprocessor support is Symmetric Multi-Processor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.&lt;br /&gt; &lt;br /&gt;System designers are now using non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.&lt;br /&gt; &lt;br /&gt;In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.&lt;br /&gt; &lt;br /&gt;The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.&lt;br /&gt; &lt;br /&gt;First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the &lt;b&gt;GetNumaHighestNodeNumber&lt;/b&gt; function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the &lt;b&gt;GetProcessAffinityMask&lt;/b&gt; function. 
You can determine the node for each processor in the list by using the &lt;b&gt;GetNumaProcessorNode&lt;/b&gt; function. Alternatively, to retrieve a list of all processors in a node, use the &lt;b&gt;GetNumaNodeProcessorMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the &lt;b&gt;SetProcessAffinityMask&lt;/b&gt; function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the &lt;b&gt;SetThreadAffinityMask&lt;/b&gt; function.&lt;br /&gt; &lt;br /&gt;Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the &lt;b&gt;GetNumaAvailableMemoryNode&lt;/b&gt; function. The &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; function enables the application to specify a preferred node for the memory allocation. &lt;b&gt;VirtualAllocExNuma&lt;/b&gt; does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same process is used when it is brought back in.&lt;br /&gt; &lt;br /&gt;*Note: This article is in part a reprint of pre-release Windows SDK documentation.  Technical details are subject to change.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Processor Groups&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;Systems with multiple processors or systems with processors that have multiple cores furnish the operating system with multiple logical processors. A logical processor is one logical computing engine from the perspective of the operating system, application or driver. In effect, a logical processor is a thread.&lt;br /&gt; &lt;br /&gt;Support for systems that have more than 64 logical processors is based on the concept of a processor group. A processor group is a static set of up to 64 logical processors that is treated as a single scheduling entity. &lt;br /&gt; &lt;br /&gt;When the system starts, the operating system creates processor groups and assigns logical processors to the groups. A system can have up to four groups, numbered 0 to 3. Systems with fewer than 64 logical processors always have a single group, Group 0. The operating system minimizes the number of groups in a system. For example, a system with 128 logical processors would have two processor groups, not four groups with 32 logical processors in each group. &lt;br /&gt; &lt;br /&gt;The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group. Entire NUMA nodes are assigned to the same group, so that a node is a subset of a group. 
If multiple nodes are assigned to a single group, the operating system chooses nodes that are physically close to one another.&lt;br /&gt; &lt;br /&gt;For a discussion of operating system architecture changes to support more than 64 processors and the modifications needed for applications and kernel-mode drivers to take advantage of them, see the whitepaper &lt;i&gt;Supporting Systems That Have More Than 64 Processors&lt;/i&gt; at &lt;a href="http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx" class="externalLink"&gt;http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;&lt;img src="http://code.msdn.microsoft.com/Project/Download/FileDownload.aspx?ProjectName=64plusLP&amp;amp;DownloadId=4222" alt="GROUP.jpg" /&gt;&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Functions&lt;/b&gt;
&lt;/h2&gt;The following new functions are used with processors and processor groups.   See the &lt;b&gt;Windows SDK&lt;/b&gt; header files &lt;b&gt;winbase.h&lt;/b&gt; and &lt;b&gt;WinNT.h&lt;/b&gt;.   These APIs are exposed via &amp;quot;kernel32.dll&amp;quot; and documented within the Windows SDK (which will be available at beta release).   See example usage scenarios within the &lt;i&gt;downloads&lt;/i&gt; section of this Code Gallery resource page.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;&lt;b&gt;CreateRemoteThreadEx&lt;/b&gt; &lt;br /&gt;Creates a thread that runs in the virtual address space of another process and optionally specifies extended attributes such as processor group affinity.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processors in a processor group or in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetActiveProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the number of active processor groups in the system.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetCurrentProcessorNumberEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor group and number of the logical processor in which the calling thread is running.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetLogicalProcessorInformationEx&lt;/b&gt; &lt;br /&gt;Retrieves information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of logical processors that a processor group or the system can support.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetMaximumProcessorGroupCount&lt;/b&gt; &lt;br /&gt;Returns the maximum number of processor groups that the system supports. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaAvailableMemoryNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the amount of memory that is available in a node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeNumberFromHandle&lt;/b&gt; &lt;br /&gt;Retrieves the NUMA node associated with the underlying device for a file handle.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaNodeProcessorMaskEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor mask, as a GROUP_AFFINITY structure, for a NUMA node specified as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProcessorNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number of the specified logical processor as a USHORT value.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetNumaProximityNodeEx&lt;/b&gt; &lt;br /&gt;Retrieves the node number as a USHORT value for the specified proximity identifier.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified process.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetProcessorSystemCycleTime&lt;/b&gt; &lt;br /&gt;Retrieves the cycle time each processor in the specified group spent executing deferred procedure calls (DPCs) and interrupt service routines (ISRs).&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Retrieves the processor group affinity of the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Retrieves the processor number of the ideal processor for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryIdleProcessorCycleTimeEx&lt;/b&gt; &lt;br /&gt;Retrieves the accumulated cycle time for the idle thread on each logical processor in the specified processor group. 
&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadGroupAffinity&lt;/b&gt; &lt;br /&gt;Sets the processor group affinity for the specified thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadIdealProcessorEx&lt;/b&gt; &lt;br /&gt;Sets the ideal processor for the specified thread and optionally retrieves the previous ideal processor.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;&lt;i&gt;The following new functions are used with thread pools.&lt;/i&gt;&lt;/b&gt;&lt;br /&gt; &lt;br /&gt;&lt;b&gt;QueryThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Retrieves the stack reserve and commit sizes for threads in the specified thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPersistent&lt;/b&gt; &lt;br /&gt;Specifies that the callback should run on a persistent thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolCallbackPriority&lt;/b&gt; &lt;br /&gt;Specifies the priority of a callback function relative to other work items in the same thread pool.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SetThreadpoolStackInformation&lt;/b&gt; &lt;br /&gt;Sets the stack reserve and commit sizes for new threads in the specified thread pool. &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;New Structures&lt;/b&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;b&gt;CACHE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Describes cache attributes. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_AFFINITY&lt;/b&gt; &lt;br /&gt;Contains a processor group-specific affinity, such as the affinity of a thread.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;GROUP_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about processor groups. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;NUMA_NODE_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about a NUMA node in a processor group. &lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_GROUP_INFO&lt;/b&gt; &lt;br /&gt;Contains the number and affinity of processors in a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;PROCESSOR_RELATIONSHIP&lt;/b&gt; &lt;br /&gt;Contains information about affinity within a processor group.&lt;br /&gt; &lt;br /&gt;&lt;b&gt;SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX&lt;/b&gt; &lt;br /&gt;Contains information about the relationships of logical processors and related hardware.&lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Usage Scenarios&lt;/b&gt;  &lt;i&gt;(See the sample code via the &amp;quot;downloads&amp;quot; tab on this page.)&lt;/i&gt;
&lt;/h2&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processor GROUPs?  Note that some processors may be parked (i.e. &amp;quot;Core Parking&amp;quot;).
   { 
         WORD wMaximumProcessorGroupCount = GetMaximumProcessorGroupCount();
         WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
         Display(L&amp;quot;MaximumProcessorGroupCount=%d \tActiveProcessorGroupCount=%d\n&amp;quot;,  wMaximumProcessorGroupCount, wActiveProcessorGroupCount);
   }
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
 
   // How many processors per GROUP?
   { 
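        // Query the count here; the variable in the previous snippet is scoped to its own block.
        WORD wActiveProcessorGroupCount = GetActiveProcessorGroupCount();
        for (WORD groupnum = 0; groupnum &lt; wActiveProcessorGroupCount; groupnum++)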
            Display(L&amp;quot;GROUP=0x%02X \tMaximumProcessorCount=%d \tActiveProcessorCount=%d\n&amp;quot;, groupnum, GetMaximumProcessorCount(groupnum), GetActiveProcessorCount(groupnum));  
   }
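&lt;/pre&gt; &lt;br /&gt;The node-layout discovery described in the Abstract can be sketched in the same style.  This is an illustrative fragment, not part of the sample download: it assumes the Display helper used throughout these snippets accepts printf-style format strings with the Microsoft I64 size prefix, and it skips node numbers that are not populated, since node numbering is not guaranteed to be contiguous.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
 
   // Which NUMA nodes exist, and how much memory is free on each?  (illustrative sketch)
   { 
        ULONG highestNode = 0;
        if (GetNumaHighestNodeNumber(&amp;highestNode) == 0)
            throw(GetLastError());
 
        for (ULONG n = 0; n &lt;= highestNode; n++)
        {
            USHORT node = (USHORT)n;
            GROUP_AFFINITY nodeAffinity;
            ULONGLONG availableBytes = 0;
 
            // Skip node numbers that fail the query or have no processors configured.
            if (GetNumaNodeProcessorMaskEx(node, &amp;nodeAffinity) == 0 || nodeAffinity.Mask == 0)
                continue;
            if (GetNumaAvailableMemoryNodeEx(node, &amp;availableBytes) == 0)
                continue;
 
            Display(L&amp;quot;NUMAnode=0x%04X \tGROUP=0x%04X \tKAFFINITYmask=0x%016I64X \tAvailableMB=%I64u\n&amp;quot;,
                node, nodeAffinity.Group, (ULONGLONG)nodeAffinity.Mask, availableBytes / (1024 * 1024));
        }
   }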
&lt;/pre&gt; &lt;br /&gt;&lt;pre&gt;
    // Get system logical processor information containing information about NUMA nodes and GROUP_AFFINITY relationships.
    // Each entry in the returned struct array describes a collection of processors denoted by the affinity mask and the type of 
    // relation this collection holds to each other.  The following outlines the type of possible relations:
    //        RelationProcessorCore
    //               The specified logical processors share a single processor core.
    //        RelationNumaNode
    //               The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
    //        RelationCache
    //               The specified logical processors share a cache.
    //        RelationProcessorPackage 
    //               The specified logical processors share a physical package, for example multi-core processors share the same package.
 
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
    DWORD returnLength = 0;
    DWORD byteOffset = 0;
    bool done = FALSE;
 
    while (!done)
    {
        DWORD rc = GetLogicalProcessorInformationEx(RelationAll, buffer, &amp;amp;returnLength);
        if (FALSE == rc) 
        {
            if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) 
            {
                if (buffer) 
                    free(buffer);
                buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(returnLength);
                if (NULL == buffer) 
                    throw(GetLastError());
            } 
            else 
                throw(GetLastError());
        } 
        else
            done = TRUE;
    }
    ASSERT(buffer);
    TRACE(L&amp;quot;Call_GetLogicalProcessorInformationEx : returnLength=0x%08X\n&amp;quot;, returnLength);
		
    ptr = buffer;
    while (byteOffset &amp;lt; returnLength) 
    {
        TRACE(L&amp;quot;\tbyteOffset=0x%08X : ptr-&amp;gt;Size=0x%08X\n&amp;quot;, byteOffset, ptr-&amp;gt;Size);
    		
        switch (ptr-&amp;gt;Relationship) 
        {
          case RelationProcessorCore:
        	Display(L&amp;quot;\n  Processor \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;, 
        	           ptr-&amp;gt;Processor.GroupMask.Group, 
        	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
 
          case RelationNumaNode:
        	Display(L&amp;quot;\n  NumaNode \n\t NodeNumber=0x%08X \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;NumaNode.NodeNumber,
        	           ptr-&amp;gt;NumaNode.GroupMask.Group,
        	           ptr-&amp;gt;NumaNode.GroupMask.Mask); 
          break;
 
          case RelationCache:
        	Display(L&amp;quot;\n  Cache \n\t Level=0x%02X \n\t Associativity=0x%02X \n\t LineSize=0x%04X \n\t CacheSize=0x%08X \n\t Type=%ws \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
        	           ptr-&amp;gt;Cache.Level,
        	           ptr-&amp;gt;Cache.Associativity,
        	           ptr-&amp;gt;Cache.LineSize,
        	           ptr-&amp;gt;Cache.CacheSize,
        	           GetCacheType(ptr-&amp;gt;Cache.Type),
        	           ptr-&amp;gt;Cache.GroupMask.Group,
        	           ptr-&amp;gt;Cache.GroupMask.Mask);
          break;
 
          case RelationProcessorPackage:
	Display(L&amp;quot;\n  Socket \n\t GROUP=0x%04X \n\t KAFFINITYmask=0x%08X\n&amp;quot;,
	           ptr-&amp;gt;Processor.GroupMask.Group,
	           ptr-&amp;gt;Processor.GroupMask.Mask);
          break;
						
          case RelationGroup:
        	Display(L&amp;quot;\n  Group \n\t MaximumGroupCount=0x%04X \n\t ActiveGroupCount=0x%04X\n&amp;quot;,
        	           ptr-&amp;gt;Group.MaximumGroupCount,
        	           ptr-&amp;gt;Group.ActiveGroupCount);
        	for (int c = 0; c &amp;lt; ptr-&amp;gt;Group.ActiveGroupCount; c++)
        	     Display(L&amp;quot;\t\t MaximumProcessorCount=0x%02X \n\t\t ActiveProcessorCount=0x%02X \n\t\t ActiveProcessorMask=0x%08X\n&amp;quot;,
        		ptr-&amp;gt;Group.GroupInfo[c].MaximumProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorCount,
        		ptr-&amp;gt;Group.GroupInfo[c].ActiveProcessorMask);
          break;
        		
          default:
            Display(L&amp;quot;\n  Error: Unsupported LOGICAL_PROCESSOR_RELATIONSHIP value.  0x%02X\n&amp;quot;, ptr-&amp;gt;Relationship);
          break;
        }
        byteOffset += ptr-&amp;gt;Size;
        ptr = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(((PUCHAR)buffer) + byteOffset);
    }		
    free(buffer); 
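&lt;/pre&gt; &lt;br /&gt;Memory can be localized as well.  The following is a minimal sketch, not part of the sample download, of requesting memory with a preferred NUMA node through VirtualAllocExNuma, as described in the Abstract.  The helper name AllocateOnNode is hypothetical, and the node number is assumed to come from a call such as GetNumaNodeNumberFromHandle or GetNumaProcessorNodeEx.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
// Hypothetical helper (illustration only).
LPVOID AllocateOnNode(SIZE_T size, USHORT numaNode)
{
    // The node is only a preference: physical pages are assigned on demand,
    // and the memory manager uses other nodes if the preferred one has no free pages.
    LPVOID p = VirtualAllocExNuma(
        GetCurrentProcess(),        // allocate in the calling process
        NULL,                       // let the system choose the address
        size,                       // size of the region in bytes
        MEM_RESERVE | MEM_COMMIT,   // reserve and commit in one step
        PAGE_READWRITE,             // page protection
        numaNode);                  // preferred NUMA node
 
    if (p == NULL)
        throw(GetLastError());
 
    return p;   // release with VirtualFree(p, 0, MEM_RELEASE)
}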
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Application Awareness of NUMA Locality&lt;/b&gt;
&lt;/h2&gt;Scalable application design requires NUMA awareness from several perspectives.  Herb Sutter describes this process as &lt;a href="http://www.ddj.com/architect/208200273" class="externalLink"&gt;&amp;quot;Maximize Locality, Minimize Contention&amp;quot;&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.  Imagine the processor load required to service interrupts from modern 10 Gb/sec network cards, for example.   Ideally, the interrupt processing and any Deferred Procedure Calls (DPC) occur local to the network device.  Read a detailed analysis by Windows Perf Engineer &lt;a href="http://blogs.msdn.com/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx" class="externalLink"&gt;Rick Vicik&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;.   NUMA locality may be applied to processes, threads, devices, interrupts, and memory.   &lt;br /&gt; &lt;br /&gt;Threads can run only on the logical processors in a single group. By default, the thread affinity is all logical processors in the parent thread’s group. Windows assigns threads across logical processors within the thread’s affinity mask according to thread priority. At thread creation, an application can change the default thread affinity and can specify an ideal processor for a thread by calling the new CreateRemoteThreadEx function.&lt;br /&gt;The ideal processor is the logical processor on which the Windows scheduler tries to run the thread whenever possible. The scheduler searches for a processor in the following order:&lt;br /&gt;    1.  The thread’s ideal processor.&lt;br /&gt;    2.  A processor in the thread’s preferred NUMA node.&lt;br /&gt;    3.  Other processors in the thread affinity mask.&lt;br /&gt; &lt;br /&gt;To specify the group affinity for a thread at creation:&lt;br /&gt;    A. 
Call CreateRemoteThreadEx and pass the PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY extended attribute together with a GROUP_AFFINITY structure.&lt;br /&gt; &lt;br /&gt;To change the affinity of an existing thread:&lt;br /&gt;    B. Call either the existing SetThreadAffinityMask function or the new SetThreadGroupAffinity function.&lt;br /&gt; &lt;br /&gt;To specify the ideal processor at thread creation:&lt;br /&gt;    C. Pass the PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR extended attribute to CreateRemoteThreadEx together with a PROCESSOR_NUMBER structure.&lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization of an existing I/O worker thread with a disk device (option &amp;quot;B&amp;quot; above).  The expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode1(pCDiskDrive pDisk)
{
    // FOR ILLUSTRATION ONLY - DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on the same NUMA node.
	
    // This demo illustrates NUMA localization of an existing thread.
	
    USHORT numaNode;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
 
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    // KAFFINITY is 64 bits wide on 64-bit Windows, so print the mask with a 64-bit format.
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016I64X\n&amp;quot;,
        (const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, (ULONGLONG)groupAffinityDisk.Mask);
			
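    // Create the thread suspended so that its affinity can be adjusted before it runs.
    // NOTE: numaNode is a local variable; DemoThreadFunction must not use this
    // pointer after MapIoThreadWithDiskNumaNode1 returns.
    hThread = CreateThread(
        NULL,                   // default security attributes
        0,                      // use default stack size
        DemoThreadFunction,     // thread function name
        &amp;amp;numaNode,          // argument to thread function
        CREATE_SUSPENDED,       // start suspended; resumed below
        &amp;amp;dwThreadID);       // returns the thread identifier

    // CreateThread returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());

    // Thread is suspended while we check and adjust NUMA affinity.
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on original GROUP=0x%04X with KAFFINITYmask=0x%016I64X\n\n&amp;quot;,
        dwThreadID, groupAffinityThread.Group, (ULONGLONG)groupAffinityThread.Mask);

    // Move the thread if it is in another group or may run outside the disk's NUMA node.
    if ((groupAffinityThread.Group != groupAffinityDisk.Group) ||
        ((groupAffinityThread.Mask &amp;amp; groupAffinityDisk.Mask) != groupAffinityThread.Mask))
    {
        // The third parameter (previous affinity) is optional and may be NULL.
        if (SetThreadGroupAffinity(hThread, &amp;amp;groupAffinityDisk, NULL) == 0)
            throw(GetLastError());
    }

    // Let the thread run now that its affinity is set.
    ResumeThread(hThread);
    return 1;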
}
&lt;/pre&gt; &lt;br /&gt; &lt;br /&gt;The following example illustrates NUMA node localization upon creating a new I/O worker thread with a disk device (options &amp;quot;A&amp;quot; and &amp;quot;C&amp;quot; above).  Again, the expectation is that the resulting thread-node-disk affinity will improve storage I/O performance.&lt;br /&gt; &lt;br /&gt;&lt;pre&gt;
DWORD MapIoThreadWithDiskNumaNode2(pCDiskDrive pDisk)
{
    // DEMO NUMA-NODE THREAD/DEVICE MAPPING
    //   1. Discover which NUMA node the disk device object is assigned.
    //   2. Create a worker thread on an ideal processor on the same NUMA node.
 
    // This demo illustrates NUMA localization upon creating a thread.
	
    USHORT numaNode = 0;
    DWORD dwThreadID = 0;
    HANDLE hThread = INVALID_HANDLE_VALUE;
    GROUP_AFFINITY groupAffinityDisk;
    GROUP_AFFINITY groupAffinityThread;
    LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList = NULL;
    SIZE_T sizeToAlloc = 0;
    SIZE_T sizeOfBuffer = 0;
    DWORD numActiveProcs = 0;
    PROCESSOR_NUMBER processorNumber;
	
    if (!pDisk || !pDisk-&amp;gt;HandleIsValid())
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : Invalid input parameters.\n&amp;quot;);
		
    // get the NUMA node associated with the disk device object.
    if (GetNumaNodeNumberFromHandle(pDisk-&amp;gt;Handle(), &amp;amp;numaNode) == 0)
        throw(GetLastError());
		
    // get the ProcessorMask of the NUMA node associated with the disk device object.
    if (GetNumaNodeProcessorMaskEx(numaNode, &amp;amp;groupAffinityDisk) == 0)
        throw(GetLastError());
		
    // KAFFINITY is 64 bits wide on 64-bit Windows, so print the mask with a 64-bit format.
    Display(L&amp;quot;Device \&amp;quot;%ws\&amp;quot; is assigned GROUP=0x%04X, NUMAnode=0x%04X with KAFFINITYmask=0x%016I64X\n&amp;quot;,
        (const wchar_t*)pDisk-&amp;gt;Name(), groupAffinityDisk.Group, numaNode, (ULONGLONG)groupAffinityDisk.Mask);
	
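    // choose one processor within the disk's NUMA node as the ideal processor number.
    USHORT node = 0;
    bool found = false;
    numActiveProcs = GetActiveProcessorCount(groupAffinityDisk.Group);
    processorNumber.Group = groupAffinityDisk.Group;
    processorNumber.Reserved = 0;   // reserved field must be zero
    // scan processors 0 .. numActiveProcs-1 of the group; stop at the first one on the disk's node.
    for (DWORD i = 0; i &amp;lt; numActiveProcs; i++) {
        processorNumber.Number = (BYTE)i;
        if (GetNumaProcessorNodeEx(&amp;amp;processorNumber, &amp;amp;node) &amp;amp;&amp;amp; (node == numaNode)) {
            found = true;
            break;
        }
    }
    if (!found)
        throw(L&amp;quot;\nMapIoThreadWithDiskNumaNode : No processor found on the disk's NUMA node.\n&amp;quot;);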
	
    // first call returns the size required for 2 attributes.
    // It fails by design (ERROR_INSUFFICIENT_BUFFER) while setting sizeToAlloc.
    InitializeProcThreadAttributeList(NULL, 2, 0, &amp;amp;sizeToAlloc);
    ASSERT(sizeToAlloc &amp;gt; 0);
    if (sizeToAlloc == 0)
        throw(GetLastError());
		
    pAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeToAlloc);
    ASSERT(pAttributeList != NULL);
    if (!pAttributeList)
        throw(GetLastError());
	
    sizeOfBuffer = sizeToAlloc;
	
    // second call creates the attribute list.
    if (InitializeProcThreadAttributeList(pAttributeList, 2, 0, &amp;amp;sizeOfBuffer) == 0)
        throw(GetLastError());	
    ASSERT(sizeOfBuffer == sizeToAlloc);
	
    // add GROUP_AFFINITY attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_GROUP_AFFINITY,
			&amp;amp;groupAffinityDisk,
			sizeof(GROUP_AFFINITY),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
 
    // add IDEAL_PROCESSOR attribute to the list.
    if (UpdateProcThreadAttribute(
			pAttributeList, 
			0,
			PROC_THREAD_ATTRIBUTE_IDEAL_PROCESSOR,
			&amp;amp;processorNumber,
			sizeof(PROCESSOR_NUMBER),
			NULL,
			NULL) == 0)
        throw(GetLastError()); 
		
    // Create the thread with the group affinity and ideal processor given in the attribute list.
    // NOTE: numaNode is a local variable; DemoThreadFunction must not use this
    // pointer after MapIoThreadWithDiskNumaNode2 returns.
    hThread = CreateRemoteThreadEx(
        GetCurrentProcess(),    // target process handle
        NULL,                   // default security attributes
        0,                      // use default stack size
        DemoThreadFunction,     // thread function name
        &amp;amp;numaNode,          // argument to thread function
        0,                      // use default creation flags
        pAttributeList,         // additional parameters for the new thread
        &amp;amp;dwThreadID);       // returns the thread identifier

    // CreateRemoteThreadEx returns NULL (not INVALID_HANDLE_VALUE) on failure.
    if (hThread == NULL)
        throw(GetLastError());
			
    DeleteProcThreadAttributeList(pAttributeList);
    if (HeapFree(GetProcessHeap(), 0, pAttributeList) == 0)
        throw(GetLastError());
	
    if (GetThreadGroupAffinity(hThread, &amp;amp;groupAffinityThread) == 0)
        throw(GetLastError());
    Display(L&amp;quot;\tThread 0x%08X created on GROUP=0x%04X, NUMAnode=0x%04X, KAFFINITYmask=0x%016I64X, IdealProcessor=0x%02X\n\n&amp;quot;,
        dwThreadID, groupAffinityThread.Group, numaNode, (ULONGLONG)groupAffinityThread.Mask, processorNumber.Number);
    return 1;
}
 
&lt;/pre&gt; &lt;br /&gt;&lt;h2&gt;
&lt;b&gt;Related Community Resources&lt;/b&gt; 
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href="http://blogs.technet.com/winserverperformance" class="externalLink"&gt;http://blogs.technet.com/winserverperformance&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://blogs.technet.com/windowsserver" class="externalLink"&gt;http://blogs.technet.com/windowsserver&lt;span class="externalLinkIcon"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.msdn.microsoft.com/project/projectdirectory.aspx?T