Windows 10 Combined VRAM Performance Update
More about HT (from the Intel Developer Zone)
What is the relation between a "hardware thread" and a "hyperthread"?
Dear Forum,
One of the Intel TBB webpages states that "a typical Xeon Phi coprocessor has 60 cores, and 4 hyperthreads/core". But this blog from Intel emphasizes that "The Xeon Phi co-processor utilizes multi-threading on each core as a key to masking the latencies inherent in an in-order micro-architecture. This should not be confused with hyper-threading on Xeon processors that exists primarily to more fully feed a dynamic execution engine."
I'm confused by these two seemingly conflicting statements. Could anyone explain the difference/similarity between a hyperthread and a hardware thread?
Also, the software developer's guide says MIC implements hardware multithreading by replicating the complete architectural state 4 times (is this the same technique used in Xeon's hyperthreading, where one physical core is seen as two logical cores?), and that MIC further implements a "smart" round-robin multithreading. Could you explain the relation between these two multithreading techniques?
Thanks a lot!
James Reinders (Intel) Thu, 05/15/2014 - 16:07
BEST REPLY
When we split words, it can be confusing, can't it? The TBB documentation is wrong; thank you for pointing that out.
We choose to NOT call the hardware threads on the current Intel Xeon Phi Coprocessor (previously known by the code name Knights Corner) by the name "hyper-threads." The most important thing to know is that you'll usually need more Knights Corner threads per core to hit your best performance than you would with hyperthreading. That's consistent with the "highly parallel" optimized nature of an Intel Xeon Phi Coprocessor. The difference in these hardware threading techniques is instructive, so I'll try to give an explanation of it that makes sense.
Regardless of what Intel device we talk about, a processing core will have one or more "hardware threads" per core. We use "hardware threads" as a very generic term that refers to multithreading achieved mostly by duplicating thread state and sharing most everything else in a processing core. Multithreading achieved by duplicating most everything, the whole "core," is what multicore and many-core designs are all about. Processors and coprocessors can have both "hardware threads" and lots of cores. "Hyperthreading" is a very specific form of implementing a "hardware thread" that is only found on dynamic (a.k.a. out-of-order) execution engines.
This highlights a difference between the Knights Corner microarchitecture and an Intel Xeon processor microarchitecture. The Knights Corner microarchitecture uses "in order" execution, so the hardware threads do a relatively simple round robin to feed the dual execution pipes of the microarchitecture. In this design, you can execute two vector (SIMD) instructions in parallel, but they need to come from different threads. This is why we advise programmers to use at least two threads per core on Intel Xeon Phi Coprocessors. If you do not, the floating point (FP) performance will peak at about half of what is possible. For most programmers, this is simply a matter of making sure OpenMP or TBB use at least 122 threads on a 61-core device.

Many of us are in the habit of limiting FP-intensive code to threads=cores on hyperthreaded machines. This is because on a hyperthreaded machine we find a microarchitecture with an out-of-order execution engine. In those designs, the full FP potential may be realized with a single thread. Additional threads on any device will put more pressure on caches and ask for more memory bandwidth. If your algorithm is already hitting peak FP usage, additional threads are not helpful unless they help with latency hiding. For the most part, out-of-order execution engines take care of latency hiding, which an in-order design cannot. Therefore, with hyperthreading on an Intel Xeon processor you may hit peak performance with threads=cores. With the in-order execution design in the Knights Corner microarchitecture, at least two threads are needed to hit peak, and latency hiding is often enhanced with even more threads. Many algorithms find three threads per core is their sweet spot, while others prefer two or four.
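The effect of round-robin issue on an in-order, dual-pipe core can be illustrated with a toy simulation. This is my own sketch, not Intel's actual scheduler; the rule that a single hardware thread can fill at most one issue pipe per cycle is an assumption made for illustration:

```python
# Toy model of round-robin issue into two execution pipes.
# Assumption (illustrative only): each cycle the core can issue one
# instruction per pipe, and one hardware thread cannot issue on both
# pipes in the same cycle.
def pipe_utilization(n_threads, n_pipes=2, cycles=1000):
    """Fraction of issue slots filled when n_threads rotate round-robin."""
    issued = 0
    for cycle in range(cycles):
        # Round-robin: the next n_pipes thread slots in rotation; only
        # DISTINCT threads can fill distinct pipes this cycle.
        ready = {(cycle * n_pipes + p) % n_threads for p in range(n_pipes)}
        issued += len(ready)
    return issued / (cycles * n_pipes)

print(pipe_utilization(1))  # one thread: half the issue slots go empty
print(pipe_utilization(2))  # two threads: both pipes stay fed
```

Under these toy assumptions, one thread tops out at 50% of the issue slots, matching the "about half of peak FP" observation above, while two or more threads keep both pipes busy.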
In teaching programming for the Intel Xeon Phi coprocessor, we found that it was helpful to speak of this distinction mostly to encourage us all to experiment with how many threads per core serve our applications best. Using OpenMP or TBB, this is as simple as setting a different parameter or environment variable and running several times to compare. No changes to the program are needed.
If we are used to always running threads=cores on a hyperthreaded machine, then it is useful to know that Knights Corner is not using hyperthreads and we should (almost always) be using at least two threads per core to get the best performance.
That said, today's hyperthreading is much more advanced than it was a decade ago. If we've not ventured to test performance of our applications with hyperthreads recently, we should consider running some performance tests. If you are surprised how much better it is with hyperthreading than it used to be, please don't tell our marketing people... or they'll want to call them "hyper-thread PRO" or something else I'll have to explain in a future blog.
I hope this clears everything up.
Thank you for pointing out our error in the TBB documents - I'll look to correct them.
Last edited by AKK_K; 28 Sep 2015, 08:05:17.
In a normal (non-HT) CPU, the number of cores is the number of processing units. Each of these contains registers, a program counter (register), a stack pointer (register), (usually) its own cache, and a complete execution unit. So if a normal CPU has 4 cores, it can run 4 threads simultaneously. When a thread is done (or the OS decides it has taken too much time and must wait its turn), the CPU needs to follow those four steps to unload the old thread and load in the new one before execution of the new one can begin.
In a HyperThreading CPU, on the other hand, the above holds true, but in addition each core has a duplicated set of registers, program counters, stack pointers, and (sometimes) cache. What this means is that a 4-core CPU can still only have 4 threads running simultaneously, but the CPU can have "preloaded" threads in the duplicated registers. So 4 threads are running, but 8 threads are loaded onto the CPU: 4 active, 4 inactive. Then, when it's time for the CPU to switch threads, instead of having to perform the loading/unloading at the moment the threads need to switch out, it simply "toggles" which thread is active and performs the unloading/loading in the background on the newly "inactive" registers. Remember the two steps I suffixed with "these steps take time"? In a hyperthreaded system, steps 2 and 4 are the only ones that need to be performed in real time, whereas steps 1 and 3 are performed in the background in the hardware (divorced from any concept of threads, processes, or CPU cores).
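The "duplicated register sets, toggle which one is active" idea above can be sketched as follows. All class and field names here are invented for illustration, and real hyper-threading actually interleaves instructions from both threads cycle by cycle rather than toggling wholesale:

```python
# Sketch of one HT core with two architectural register sets.
# Illustrative only: real SMT hardware runs both threads concurrently.
class HTCore:
    def __init__(self):
        # Two register sets: one feeds execution, one can be preloaded.
        self.register_sets = [{"pc": None}, {"pc": None}]
        self.active = 0

    def load(self, slot, thread_pc):
        # Background load into a register set (hidden from execution).
        self.register_sets[slot]["pc"] = thread_pc

    def switch(self):
        # The fast part of a context switch: just flip the active set.
        self.active ^= 1
        return self.register_sets[self.active]["pc"]

core = HTCore()
core.load(0, "thread_A")   # active thread
core.load(1, "thread_B")   # preloaded, waiting in the spare registers
```

Calling `core.switch()` now makes `thread_B` active instantly, while `thread_A`'s state simply waits in the other register set; no load/unload happens on the critical path.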
The hardware components of one physical core are shared between several threads. Each thread has at least its own set of registers. Most resources of the core (arithmetic and logic unit, floating point unit, cache) are shared between the threads. Naturally those threads compete for processing resources and stall if the desired units are already busy.
Intel's Hyper-Threading duplicates some of the resources in each core that are responsible for distributing instructions to the execution units.
In summary: HT adds extra hardware to each core.
Last edited by AKK_K; 28 Sep 2015, 09:21:18.
Windows 10: The new memory manager for video cards!
08/09/2015, Paul ?antr??ek, reviews

Windows 10 brings not only the new DirectX 12 for gamers, but also a new version of the WDDM (Windows Display Driver Model). And while we still have to wait a while for games that use DX12, existing games can benefit from WDDM 2.0 right now.
WDDM 2.0 Memory Management
The whole of WDDM 2.0 has been reworked so that rendering work, especially scheduling and resource allocation, is organized by the graphics card itself (in cooperation with its driver). For this purpose many things had to be revised: the division of work between the UMD and KMD was redone, support for virtual memory was introduced, and rules and tools for effective memory management (the paging engine, page faulting, ...) were laid down. All these WDDM 2.0 changes must also be supported by the graphics card itself (and its driver). This is not really a problem, though, because graphics cards have known how to do all this for a long time; there was simply no environment in which to use it. With WDDM 2.0, that environment has finally arrived.

For a graphics card to schedule its own work, it must of course know exactly where the resources required for rendering (surfaces) are placed in memory. In the earlier WDDM 1.x model these resources were allocated contiguously in physical memory. However, since this memory was accessed not only by the application itself (such as a game) but could also be used by other applications, the surfaces could be moved to another location (address) in physical memory. Such an unexpected, unplanned move of allocated resources could of course be fatal for the GPU scheduler. For this reason, WDDM 2.0 introduces virtual memory that is isolated and unique to the application, so that no other application can touch it. Resources allocated in this virtual memory can therefore never be moved unexpectedly, and all allocated resources remain at the same location (address inside the VM) for their whole "lifetime". Because it is virtual memory, it need not be allocated only in local GPU memory (VRAM) but may also extend into system memory (RAM). The virtual memory is then physically divided and managed in blocks (pages) of 4 KB (or, in VRAM, also 64 KB).
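A minimal sketch of this page-based virtual memory: a simple bump allocator hands out stable virtual addresses and backs pages with VRAM until it runs out, then spills to system RAM. Names, sizes, and structure are illustrative, not the actual WDDM 2.0 interfaces:

```python
# Illustrative model of per-application GPU virtual memory.
PAGE = 4 * 1024  # 4 KB pages (VRAM may also use 64 KB pages)

class GpuVirtualMemory:
    def __init__(self, vram_pages):
        self.vram_free = vram_pages
        self.page_table = {}   # virtual page number -> "VRAM" or "RAM"
        self.next_vpn = 0

    def alloc(self, nbytes):
        """Allocate at a stable virtual address; spill to RAM when VRAM is full."""
        npages = -(-nbytes // PAGE)            # ceil division: whole pages
        base = self.next_vpn
        for vpn in range(base, base + npages):
            if self.vram_free > 0:
                self.page_table[vpn] = "VRAM"
                self.vram_free -= 1
            else:
                self.page_table[vpn] = "RAM"   # backed by system memory
        self.next_vpn += npages
        return base * PAGE                     # virtual address never moves

vm = GpuVirtualMemory(vram_pages=2)
addr = vm.alloc(3 * PAGE)   # 3 pages: 2 land in VRAM, 1 spills to RAM
```

The key property mirrored here is that `addr` is fixed for the allocation's lifetime regardless of which physical memory backs each page, which is exactly what the GPU scheduler needs.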
For example, when a texture is allocated in virtual memory, it is divided and aligned to these pages so that every additional texture starts again on a fresh page. Rendering then no longer needs to use the texture as a whole, but only the part defined by particular pages.
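The page-alignment rule above is easy to quantify (illustrative arithmetic only; the texture sizes are my own examples):

```python
# How many 4 KB pages a surface occupies when each surface starts
# on a fresh page: round the byte size up to whole pages.
PAGE = 4 * 1024

def pages_for(nbytes):
    return -(-nbytes // PAGE)   # ceiling division

# A 1024x1024 RGBA8 texture is exactly 4 MiB -> 1024 pages,
# while even a 100-byte surface still occupies one whole page.
print(pages_for(1024 * 1024 * 4))  # 1024
print(pages_for(100))              # 1
```

The per-page granularity is what lets the paging engine later move only the needed slices of a huge texture instead of the whole thing.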
Virtual memory thus not only provides a permanent, immutable allocation of resources (surfaces); because its allocation extends into system memory (RAM), its total size can be much larger than what the VRAM of a particular graphics card physically offers. Of course, the graphics card cannot render from resources (surfaces) placed in slow system memory (which also holds part of the VM). All resources needed for rendering (the working set) must be physically located in the graphics card's VRAM, so something must take care of transferring these resources from RAM to VRAM. That transfer is physically handled by the Dedicated Paging Engine. The engine performs so-called paging (copying pages between RAM and VRAM), but it does not transfer whole resources, only their individual parts - the pages that are absolutely necessary for the GPU's current work. The Dedicated Paging Engine works via DMA in parallel with the 3D engine, which means that while the 3D engine renders the content of one frame, the Paging Engine can simultaneously perform paging operations for the content of another (e.g. copying into VRAM the resource pages needed for rendering the following frames).

To put it much more simply, the whole memory management system under WDDM 2.0 works so that VRAM becomes a kind of reservoir (buffer) holding only the resources needed for rendering the current frame (the working set), or a few of the following frames. This buffer is continually updated: on one side, resources needed for the GPU's further work are brought in from RAM; on the other, resources no longer needed are removed. The graphics card thus always has all the necessary resources in VRAM and can render frames without interruption.
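The "VRAM as a buffer for the working set" behavior can be sketched like this. It is my own simplification: the real paging engine copies pages via DMA in parallel with rendering and is far more sophisticated than a per-frame set update:

```python
# Toy per-frame residency update: page in what the next frame needs,
# keep what still fits, drop the rest. Pages are just integer IDs here.
def update_working_set(vram, needed, capacity):
    """Return (new resident pages, pages copied RAM -> VRAM)."""
    paged_in = needed - vram          # only missing pages are transferred
    resident = set(needed)            # the working set always stays resident
    # Keep previously resident pages while capacity allows (free to keep).
    for page in vram:
        if len(resident) >= capacity:
            break                     # anything left over is evicted
        resident.add(page)
    return resident, paged_in

# Frame boundary: pages {1,2,3} resident, next frame needs {2,3,4}.
resident, paged_in = update_working_set({1, 2, 3}, {2, 3, 4}, capacity=3)
```

Here only page 4 is copied in and page 1 is dropped, mirroring the article's example of map pages for Ukraine being paged in while Germany's are removed.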
In fact this is similar to the virtual memory solution that Windows operating systems have had for some time, where part of the contents of RAM is kept on the HDD. A minor drawback of the Paging Engine's parallel work is that it consumes a portion of the VRAM bandwidth: during paging operations the Paging Engine accesses VRAM concurrently with the 3D engine, which is rendering a frame at the same time, and so "robs" it of part of the VRAM bandwidth it needs for its work.

If we apply the behavior of the new memory management to our game "Kill the bad guy", it could look something like this. All textures (surfaces), including the huge texture map of the world, are allocated in virtual memory. But since the whole world texture does not fit into VRAM, the virtual memory is in large part allocated from RAM. Only what is needed for the current rendering (the working set) is pulled into VRAM, i.e. only those pages of this huge texture that contain the map of the Czech Republic plus the neighboring states. During play the Paging Engine tracks resource utilization, and at the Czech-Slovak border it starts (in parallel with the 3D engine) to copy into VRAM those pages of the world texture that contain the map of Ukraine. Pages containing the map of Germany are in turn removed from VRAM. As you can see, even though not all resources (surfaces) fit into VRAM, the working set does, and rendering runs smoothly - without stuttering. If the game's demands on resources were smaller and the whole surface fit into VRAM, everything would proceed just as before and no paging would occur. By analogy, a graphics card with less video memory can then render just as smoothly and without stalls as a graphics card with far more VRAM. If you own a graphics card with more memory, it is conversely more likely that the entire surface will fit into your VRAM and paging will be avoided.
To conclude our description of WDDM 2.0, it must be said that the whole topic of WDDM 2.0 memory management is much more complex and extensive than what is described in this article. We deliberately avoided the new behavior of the UMD (User Mode Driver) and the reduced role of the KMD (Kernel Mode Driver), resource allocation itself, the handling of accesses to non-resident memory (page faults), and the improved context switching (Context Switch). Perhaps we will get to all of this some other time; today we could only explain the basic principles of the new memory management in WDDM 2.0. But rather than staying purely in theory, let's take a look at the new WDDM 2.0 memory management in practice and test some graphics cards on real games under Windows 10.