When Hyper-Threading Hurts

Preface

This article is part of a collection of information available at Bitsum Technologies as part of our Process Lasso software.

Get Process Lasso to experience the benefits of automated process priority adjustment and more.

CPU vs. Core

These two words were once synonyms, though their usage has shifted as the technology evolved. CPU is now used to mean a single physical unit, whereas core means one of many CPUs on a single die. This can be confusing. Just remember: 4 cores really means 4 CPUs all on one chip.

Thread vs. Process

A Process is simply a collection of threads. The CPU Scheduler (the part of the OS that doles out tasks to the CPU) of Windows only sees threads. A process is just a higher level abstraction to group these threads together into a common unit.

Reminder about what CPU % use is

I want to remind people that CPU utilization occurs in micro-bursts, and the % use per second is not a perfect representation of how fast or slow a CPU is. That is to say, just because that metric shows only 75% of a CPU consumed, that doesn’t mean that you had an ‘extra’ 25% lying around. The speed at which that 75% was executed matters too. At best, this metric gives you some idea of how CPU intensive your operations are.

UPDATE May 2015: Newer generation Intel and AMD processors pair their logical cores such that each pair shares certain physical computational resources. The key point is to be aware that these pairs of logical cores exist, and that if one of the two cores in a pair is utilized, it can dramatically impact the performance of the other core. These pairs are adjacent: logical cores 0 and 1 form a pair, 2 and 3 form a pair, and so on.

To demonstrate this in a benchmark, a workload running on logical cores 0 and 1 will get much less performance than the same workload on logical cores 0 and 2, because logical cores 0 and 2 are on two distinct physical CPUs, whereas logical cores 0 and 1 share certain computational units and caches.
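The adjacent pairing described above can be expressed in a few lines. This is only a minimal sketch: the function names are hypothetical, and it assumes the adjacent numbering stated in this article (0 and 1, 2 and 3, and so on). Other operating systems or firmware may enumerate sibling logical cores differently, so check the real topology before relying on it.

```python
def physical_core_of(logical_cpu: int) -> int:
    """Return the index of the pair (physical core) hosting a logical CPU.

    Assumes adjacent pairing: logical CPUs 2k and 2k+1 share physical core k.
    """
    if logical_cpu < 0:
        raise ValueError("logical CPU index must be non-negative")
    return logical_cpu // 2


def share_physical_core(a: int, b: int) -> bool:
    """True if two logical CPUs belong to the same pair and so share resources."""
    return physical_core_of(a) == physical_core_of(b)


def sibling_pairs(n_logical: int) -> list[tuple[int, int]]:
    """List every pair of sibling logical CPUs for an even logical CPU count."""
    if n_logical < 0 or n_logical % 2 != 0:
        raise ValueError("expected a non-negative, even number of logical CPUs")
    return [(cpu, cpu + 1) for cpu in range(0, n_logical, 2)]
```

Under this assumption, `share_physical_core(0, 1)` is true while `share_physical_core(0, 2)` is false, which matches the benchmark scenario described above.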

Legacy Article Describing Hyper-Threading on older Intel CPUs

Hyper-Threading is a feature introduced by Intel, and is exclusive to Intel processors. It splits a real CPU (a core) into two. One is the real core, called the physical core. The other is a secondary core, called the logical core. This logical core can’t do much, but it does provide a little increased parallelism. It is far from being a real core: it offers approximately 30% of the performance of a real physical core. Its purpose was simply to increase parallelism in a world dominated by I/O bound (non-CPU intensive) processes (actually threads, but we won’t split hairs here).

When a CPU intensive (CPU bound) thread is switched to one of these logical cores, its performance will substantially degrade. Therefore, in some situations, it is appropriate to use Process Lasso’s Hyper-Threaded Core Avoidance, or to disable Hyper-Threading altogether. Although the Windows Scheduler has become increasingly aware of Hyper-Threading, this is still a factor since the Scheduler is no AI, and it is especially important in XP and below, where the Scheduler is even less aware of Hyper-Threading.

This is NOT to say Hyper-Threading is useless or detrimental for most users. In most cases Windows threads are I/O bound, meaning they spend most of their time waiting for I/O and are not intensively using the CPU. In these cases, delegating them to a logical core is appropriate and can free a real core for doing CPU intensive work [1].

Hyper-Threading can still hurt in some workloads, because certain self-tuning algorithms can become confused when presented with a misrepresented number of cores, not realizing some are not real physical cores. In addition, the Scheduler may sometimes send a CPU bound thread to a Hyper-Threaded core, incurring a substantial performance penalty.

Instead of completely disabling Hyper-Threading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to logical cores. We call this feature Hyper-Threaded Core Avoidance. It is better than completely disabling Hyper-Threading because it leaves the rest of the system free to take advantage of this otherwise useful feature.

Few people really know just how well the Windows CPU Scheduler handles logical cores, but we think it is safe to say that XP became somewhat aware of them, and Microsoft has gradually improved the Scheduler since then. Again, though, the Scheduler is no AI and cannot make perfect decisions. It will NEVER be perfect because the OS doesn’t have any knowledge of which threads are best to put on these slower logical cores.

It is therefore recommended that in some environments the default CPU affinity of select processes be set to avoid these Hyper-Threaded (logical) cores. Software such as Process Lasso allows the default CPU affinity to be set for individual applications (meaning that the CPU affinity is applied each time the process starts). Don’t worry, you need not pick out the cores that are logical from the ones that are physical. The Hyper-Threaded Core Avoidance feature of Process Lasso v5 will do this for you.
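To make the idea of an affinity that avoids logical cores concrete, here is a minimal sketch of how such a CPU affinity bitmask could be computed. It is not Process Lasso’s implementation. The function names are hypothetical, and it assumes that the second logical core of each adjacent pair (1, 3, 5, ...) is the one to avoid, and that bit i of the mask stands for logical processor i, which is the convention Windows affinity masks use.

```python
def avoidance_mask(n_logical: int) -> int:
    """Build an affinity bitmask that keeps only the first logical CPU of each pair.

    Assumption: logical CPUs 2k and 2k+1 are siblings, and 2k+1 is the one
    to avoid. Bit i of the result corresponds to logical CPU i.
    """
    if n_logical <= 0:
        raise ValueError("need at least one logical CPU")
    mask = 0
    for cpu in range(0, n_logical, 2):
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask: int) -> list[int]:
    """Decode an affinity bitmask back into a list of logical CPU indices."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]
```

On a machine with 8 logical cores, this produces a mask that allows logical cores 0, 2, 4 and 6 only. Applying the mask to a process is a separate, OS-specific step that is deliberately left out of the sketch.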

For reference, see this Microsoft article on optimizing server performance: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx or this posting at Agner Fog’s CPU Blog.

AMD’s Paired Logical Core Design

(UPDATE – 01-12-2012)

Now AMD joins the Hyper-Threading party, as a side effect of a new design where fully capable physical cores share some computational units (e.g. FPU and L2 cache). Their cores are all real physical cores, not fake ones, BUT they are paired together into Bulldozer (or PileDriver, SteamRoller, etc.) modules with shared computational units. This means that the load should be balanced so that no single pair carries too much of it, or performance is hurt. The scheduling effects are very similar to Hyper-Threading: you want to avoid every other core until it must be used.

As of Jan 2012 Microsoft has released an update for the Windows 7 and 2008 R2 scheduler that views the new AMD Bulldozer+ platform as having half real CPUs and half ‘fake’ hyper-threaded CPUs. This helps to put the workload onto every other processor, which is important for two reasons. One, TurboCore will be more likely to kick in. Two, the design of AMD Bulldozer+ is such that the two cores of each module share an L2 cache, FPU, and other items. Thus, they are not fully independent, and you want to back-fill the second core of each module only after the first core of each module is saturated. Don’t worry, if full computing power is needed, all cores get used. This just helps them perform better in ‘lightly threaded’ situations.
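The back-fill idea can be illustrated with a small placement sketch. This is not the actual Windows scheduler logic, only a simplified model. The function names are hypothetical, and it assumes that logical cores 2k and 2k+1 belong to the same module and that each busy thread occupies one core.

```python
def backfill_order(n_logical: int) -> list[int]:
    """Order in which cores are filled: first core of every module, then the rest.

    Assumption: cores 2k and 2k+1 share a module (L2 cache, FPU, etc.).
    """
    if n_logical < 0:
        raise ValueError("core count must be non-negative")
    first_of_each_module = list(range(0, n_logical, 2))
    second_of_each_module = list(range(1, n_logical, 2))
    return first_of_each_module + second_of_each_module


def place_threads(n_threads: int, n_logical: int) -> list[int]:
    """Return the sorted cores used by n_threads busy threads under back-fill."""
    if n_threads < 0:
        raise ValueError("thread count must be non-negative")
    order = backfill_order(n_logical)
    return sorted(order[: min(n_threads, n_logical)])
```

With 8 cores, three busy threads land on cores 0, 2 and 4, so no module is shared. Only from the fifth thread onward does a module have both of its cores in use.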

When I say Bulldozer+, I refer to PileDriver, SteamRoller, and all later incarnations we’ve seen, or are rumored to be in the works.

Patches for Windows 7 / 2008 R2:
http://support.microsoft.com/kb/2645594
http://support.microsoft.com/kb/2646060

[Figure: Bulldozer architecture] The future: AMD Bulldozer. Notice the paired cores with shared dependencies, similar to Hyper-Threading, but different in that each logical core offers substantially more computing power.
