26.06.2007
So, How Smart is It?
Any investigation of frontside bus performance on Intel Core architectures will turn up references to how smart the prefetch process is. Prefetching is essential to the function of both the L1 and L2 caches. It is not a new idea, but the Core architecture takes it to a new level. Each core has a dedicated L1 cache, while the L2 cache is shared by the processor cores. Each core also has two hardware-based prefetchers to speed up data access from its L1 cache. The first, the Data Cache Unit (DCU) prefetcher, is also known as the streaming prefetcher, and it is particularly beneficial to streaming algorithms. It detects when the core is repeatedly accessing very recently loaded data in ascending order, and automatically fetches the next cache line to keep the DCU full and the algorithm running at top speed.
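To make the streaming idea concrete, here is a minimal sketch of a next-line prefetcher. It is a toy model, not Intel's implementation: the 64-byte line size, the `threshold` parameter, and the class and method names are all assumptions chosen for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


class StreamingPrefetcher:
    """Toy next-line prefetcher: triggers after a run of ascending, adjacent lines."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # hypothetical: ascending steps needed to trigger
        self.last_line = None       # cache line touched by the previous load
        self.run_length = 0         # consecutive ascending steps seen so far

    def access(self, address):
        """Record one load; return the line number to prefetch, or None."""
        line = address // LINE_SIZE
        if self.last_line is not None and line == self.last_line + 1:
            self.run_length += 1    # moved up to the adjacent line
        elif line != self.last_line:
            self.run_length = 0     # jumped elsewhere: the stream is broken
        # A repeated access to the same line leaves the run unchanged.
        self.last_line = line
        if self.run_length >= self.threshold:
            return line + 1         # fetch the next line ahead of the core
        return None


def simulate(addresses, threshold=2):
    """Feed a list of byte addresses through a fresh prefetcher."""
    prefetcher = StreamingPrefetcher(threshold)
    return [prefetcher.access(a) for a in addresses]
```

In this model, a loop that walks an array from low to high addresses triggers a prefetch on every new line once the run is established, while a scattered access pattern never does.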
The second L1 prefetcher keeps track of the instruction pointer and looks at individual load instructions. If it determines that the addresses touched by a given load are evenly spaced, that is, they have a regular “stride,” it prefetches the address one stride ahead. This strided prefetcher can operate forwards or backwards. These prefetch algorithms reward the programmer who is aware of their operation and arranges data structures accordingly.
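The stride detection described here can be sketched in a few lines. This is an illustrative model only: the table layout, the rule of confirming a stride after seeing it twice, and the names `StridePrefetcher` and `load` are assumptions, not details taken from Intel documentation.

```python
class StridePrefetcher:
    """Toy stride prefetcher keyed by the instruction pointer (IP) of each load."""

    def __init__(self):
        # Hypothetical history table: load IP -> (last address, last stride)
        self.table = {}

    def load(self, ip, address):
        """Record a load issued at `ip`; return the address to prefetch, or None."""
        entry = self.table.get(ip)
        if entry is None:
            self.table[ip] = (address, 0)  # first sighting: no stride yet
            return None
        last_address, last_stride = entry
        stride = address - last_address    # negative when walking backwards
        self.table[ip] = (address, stride)
        if stride != 0 and stride == last_stride:
            return address + stride        # same stride twice in a row: run ahead
        return None
```

Because the history is kept per load instruction, two loops with different strides do not disturb each other in this model, and a negative stride is handled the same way as a positive one.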
Unambiguous
One of the challenges of maintaining throughput in a highly parallel microarchitecture, where multiple instructions are being executed at the same time, is that a load operation may be dependent on a store that precedes it in the program code. The actual execution may be out of order, however, and in a less sophisticated implementation, the processor would have to block loads until all preceding store addresses are known. The Intel Core architecture implements memory disambiguation, which uses past experience to predict which loads will not depend on previous stores, even before the stores’ addresses are known. This allows the processor to continue loading data from the L1 cache. As the actual stores are executed, the memory disambiguator verifies its predictions. Real conflicts do occasionally occur; when one does, the disambiguator causes the load and its dependent instructions to be re-executed.
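The predict, verify, and re-execute cycle can be illustrated with a small model. This is a sketch under stated assumptions: the per-load saturating counter, the `saturation` value, and the names `DisambiguationPredictor`, `may_speculate`, and `resolve` are hypothetical and stand in for whatever the hardware actually does.

```python
class DisambiguationPredictor:
    """Toy predictor: a load may bypass unresolved stores once it has a clean record."""

    def __init__(self, saturation=3):
        self.saturation = saturation  # hypothetical: clean runs needed to earn trust
        self.counters = {}            # load IP -> consecutive conflict-free executions

    def may_speculate(self, load_ip):
        """Predict 'no dependency' only when the counter is saturated."""
        return self.counters.get(load_ip, 0) >= self.saturation

    def resolve(self, load_ip, load_address, store_addresses):
        """Verify the prediction once the earlier stores' addresses are known."""
        speculated = self.may_speculate(load_ip)
        conflict = load_address in store_addresses
        if conflict:
            self.counters[load_ip] = 0  # lose trust after a real dependency
        else:
            count = self.counters.get(load_ip, 0) + 1
            self.counters[load_ip] = min(self.saturation, count)
        if not speculated:
            return "waited"             # load was held back, so nothing to undo
        return "re-executed" if conflict else "speculated"
```

The model shows why the scheme pays off when conflicts are rare: a load with a clean history runs ahead every time, and the cost of a wrong guess is limited to repeating that load and falling back to the conservative path until trust is rebuilt.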
The L2 cache is also dependent on prefetching, but has to be aware of what the L1 prefetchers for each core are doing in order to maximize their performance. The data prefetch logic (DPL) for the L2 cache looks for patterns in the past requests of the L1 DCU and stores them in two separate arrays. It monitors DCU reads for stream activity and fills the cache accordingly. Although the DPL was first introduced in the Pentium M series, the version in the current Core architecture is considerably more sophisticated. It detects when the stream skips cache lines, and it adjusts dynamically to available bus bandwidth and the number of outstanding requests: it prefetches far ahead if the bus is not busy, and less far if it is.
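The bandwidth-aware behavior can be pictured as a prefetch distance that shrinks as the bus fills up. The sketch below is purely illustrative: the linear scaling, the limits of one and eight lines, and the function names are assumptions, since the article does not describe the actual policy.

```python
def prefetch_distance(bus_utilization, max_lines_ahead=8, min_lines_ahead=1):
    """Return how many lines to run ahead, given bus utilization in [0.0, 1.0]."""
    if not 0.0 <= bus_utilization <= 1.0:
        raise ValueError("bus_utilization must be between 0.0 and 1.0")
    idle_fraction = 1.0 - bus_utilization
    # Hypothetical policy: scale linearly between the two limits, rounding down.
    return min_lines_ahead + int(idle_fraction * (max_lines_ahead - min_lines_ahead))


def lines_to_prefetch(current_line, line_step, bus_utilization):
    """List the cache lines to request for a stream that advances by `line_step`.

    A `line_step` greater than 1 models a stream that skips cache lines.
    """
    distance = prefetch_distance(bus_utilization)
    return [current_line + line_step * i for i in range(1, distance + 1)]
```

With an idle bus this model requests eight lines ahead of the stream; with a saturated bus it falls back to a single line, so prefetch traffic never crowds out demand requests.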
Cache coherency has received a lot of attention lately, too. When multiple processor cores are executing threads, there is always the possibility that they are operating on the same data or modifying data that will affect the results of the other processor’s work. But is the issue relevant to most desktop multi-core designs? Zohar says, “In a single socket processor, the L2 cache is shared. Cache discrepancies are resolved within the chip itself; there is no coherency traffic on the memory bus.”
There is coherency activity in the cache, of course, but it is internal overhead, with no real effect on memory throughput. Cache coherency is a concern in multi-socket systems, as in workstations and servers, but each processor has its own FSB in such systems, and mechanisms in the chipset are specifically designed to reduce coherency traffic. So while this coherency traffic adds a bit to the overall utilization, it will still be well under 10%. “It’s a non-issue for desktop platforms,” says Zohar.
Closer Data
From the above, a clear goal emerges: bring the data closer to the processor, and do it in a way that maximizes throughput, minimizes bus traffic, and optimizes processor efficiency. Technologies such as streaming and strided prefetching, coordinated L1/L2 fetching, disambiguation, translation lookaside, and streamlined handling of page misses and memory ordering work together to raise performance to a new level. In benchmark after benchmark, application after application, the Intel Core architecture and Advanced Smart Cache have proven their leadership with record-breaking performance.
Most importantly, today’s FSB is a system, one with more than enough bandwidth to handle dual, quad, and even eight-core processors.