We extend the state-of-the-art Entangling prefetcher for instructions to avoid training during wrong-path execution. Our wrong-path-aware Entangling prefetcher is equipped with microarchitectural techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrong-path-aware solution. We also propose two microarchitectural optimizations that further increase performance by 1.8%, on average. All this is achieved with a storage overhead of just 304 bytes.
Reference: Alberto Ros, Alexandra Jimborean, "Wrong-Path-Aware Entangling Instruction Prefetcher". IEEE Transactions on Computers (TC), 2024. [PDF] [BibTeX entry]
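As an illustration of the idea, the C++ sketch below (hypothetical names, not the mechanism proposed in the paper) defers prefetcher training until fetched blocks are confirmed to be on the correct path and discards training events belonging to squashed, wrong-path blocks.

    #include <cstdint>
    #include <deque>

    // Hypothetical interface of an instruction prefetcher trained with the
    // addresses of fetched cache blocks.
    struct IPrefetcher {
      virtual void train(uint64_t block_addr, bool was_miss) = 0;
      virtual ~IPrefetcher() = default;
    };

    // Wrong-path-aware training filter: fetch events are queued and only
    // forwarded to the prefetcher once the core confirms they were on the
    // correct path; squashed events never reach the predictor tables.
    class WrongPathTrainingFilter {
      struct PendingEvent { uint64_t seq; uint64_t block_addr; bool was_miss; };
      std::deque<PendingEvent> pending_;   // in fetch order
      IPrefetcher &pf_;

    public:
      explicit WrongPathTrainingFilter(IPrefetcher &pf) : pf_(pf) {}

      // At fetch time: record the event, but do not train yet.
      void on_fetch(uint64_t seq, uint64_t block_addr, bool was_miss) {
        pending_.push_back({seq, block_addr, was_miss});
      }

      // When instructions up to 'seq' are known to be on the correct path,
      // release their training events to the prefetcher.
      void on_commit(uint64_t seq) {
        while (!pending_.empty() && pending_.front().seq <= seq) {
          pf_.train(pending_.front().block_addr, pending_.front().was_miss);
          pending_.pop_front();
        }
      }

      // On a pipeline squash, drop every event younger than 'seq'.
      void on_squash(uint64_t seq) {
        while (!pending_.empty() && pending_.back().seq > seq) pending_.pop_back();
      }
    };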
Microarchitecture research relies on performance models with various degrees of accuracy and speed. In the past few years, one such model, ChampSim, has started to gain significant traction by coupling ease of use with a reasonable level of detail and simulation speed. At the same time, datacenter-class workloads, which are not trivial to set up and benchmark, have become easier to study via the release of hundreds of industry traces following the first Championship Value Prediction (CVP-1) in 2018. A tool was quickly created to port the CVP-1 traces to the ChampSim format, and the converted traces have since been used in many recent works. We revisit this conversion tool and find that several key aspects of the CVP-1 traces are not preserved by the conversion. We therefore propose an improved converter that addresses most conversion issues and also patches known limitations of the CVP-1 traces themselves. We evaluate the impact of our changes on two commits of ChampSim, one of which was used for the first Instruction Prefetching Championship (IPC-1) in 2020. We find that the performance variation stemming from higher-accuracy conversion is significant.
Reference: Josué Feliu, Arthur Perais, Daniel Jimenez, Alberto Ros, "Rebasing Microarchitectural Research with Industry Traces". 2023 IEEE International Symposium on Workload Characterization (IISWC), pages 100--114, Ghent, Belgium, October 2023. [PDF] [BibTeX entry]
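To illustrate where fidelity can be lost in such a conversion, the C++ sketch below maps a simplified CVP-1-style record into a simplified ChampSim-style record with fixed-size operand arrays; both struct layouts are illustrative assumptions, not the exact trace formats.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Simplified sketch of a CVP-1-style decoded instruction record;
    // field names are illustrative, not the exact trace layout.
    struct CvpRecord {
      uint64_t pc;
      uint8_t  insn_class;            // load, store, branch, ALU, ...
      std::vector<uint8_t> src_regs;  // variable number of source registers
      std::vector<uint8_t> dst_regs;
      uint64_t effective_addr;        // valid for loads and stores
      uint8_t  access_size;           // bytes; lost in the conversion below
      bool     branch_taken;
    };

    // Simplified ChampSim-style record with fixed-size operand arrays.
    struct ChampSimRecord {
      uint64_t ip;
      uint8_t  is_branch;
      uint8_t  branch_taken;
      uint8_t  dst_regs[2];
      uint8_t  src_regs[4];
      uint64_t dst_mem[2];
      uint64_t src_mem[4];
    };

    // Operands beyond the fixed capacity are silently dropped and the memory
    // access size disappears entirely: two examples of the kind of fidelity
    // issue a converter has to work around.
    ChampSimRecord convert(const CvpRecord &in, bool is_load, bool is_store,
                           uint8_t branch_class) {
      ChampSimRecord out{};
      out.ip = in.pc;
      out.is_branch = (in.insn_class == branch_class);
      out.branch_taken = in.branch_taken;
      for (std::size_t i = 0; i < in.dst_regs.size() && i < 2; ++i)
        out.dst_regs[i] = in.dst_regs[i];
      for (std::size_t i = 0; i < in.src_regs.size() && i < 4; ++i)
        out.src_regs[i] = in.src_regs[i];
      if (is_load)  out.src_mem[0] = in.effective_addr;
      if (is_store) out.dst_mem[0] = in.effective_addr;
      return out;
    }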
The Modular Branch Prediction Library (MBPlib) is an open-source C++ framework for branch prediction simulation. MBPlib runs over 18.4x faster than the current fastest framework, and its trace format uses 6.5x less disk space. MBPlib also eases development by providing utilities that are typically used as subcomponents in most branch prediction designs. Moreover, the library features one of the largest collections of example implementations, including traditional as well as state-of-the-art predictors. MBPlib allows researchers to significantly reduce the time needed for evaluation. Furthermore, by delivering results within seconds and by means of its broad collection of examples, written in a modern and uniform code style, MBPlib can significantly lower the barrier to entry into the field. Thus, we believe that MBPlib is also a great tool for computer architecture classes.
Reference: Emilio Dominguez-Sanchez, Alberto Ros, "MBPlib: Modular Branch Prediction Library". International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 71--80, Raleigh, NC (USA), April 2023. [PDF] [BibTeX entry]
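As an example of the kind of reusable subcomponent such a library offers, the C++ sketch below builds a gshare predictor on top of a table of 2-bit saturating counters; the interface is illustrative and is not MBPlib's actual API.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Table of 2-bit saturating counters, a building block of most
    // branch predictors.
    class SaturatingCounterTable {
      std::vector<uint8_t> table_;
    public:
      explicit SaturatingCounterTable(std::size_t log2_entries)
          : table_(std::size_t{1} << log2_entries, 2) {}   // init: weakly taken
      bool predict(std::size_t idx) const {
        return table_[idx & (table_.size() - 1)] >= 2;
      }
      void update(std::size_t idx, bool taken) {
        uint8_t &c = table_[idx & (table_.size() - 1)];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
      }
    };

    // A gshare predictor composed from the counter table above.
    class Gshare {
      SaturatingCounterTable table_;
      uint64_t ghist_ = 0;                       // global branch history
    public:
      explicit Gshare(std::size_t log2_entries) : table_(log2_entries) {}
      bool predict(uint64_t pc) const { return table_.predict((pc >> 2) ^ ghist_); }
      void update(uint64_t pc, bool taken) {
        table_.update((pc >> 2) ^ ghist_, taken);
        ghist_ = (ghist_ << 1) | (taken ? 1 : 0);
      }
    };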
Over the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, the Splash-3 benchmarks eliminated the performance bugs, data races, and improper synchronization that plagued the Splash-2 benchmarks after the definition of the C memory model. This benchmark suite revisits the Splash-3 benchmarks and adapts them to contemporary architectures by leveraging atomic operations and lock-free constructs. With our changes, scalability improves for most benchmarks at 32 and 64 cores, with improvements of up to 9x on real machines and up to 5x in simulation over the unmodified Splash-3 benchmarks. To denote the substantive nature of these improvements and to re-introduce the benchmarks into contemporary research, we refer to the new collection as Splash-4.
References:
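The flavor of transformation applied here can be illustrated with the generic C++ sketch below (not code taken from the suite): a shared counter updated inside a critical section becomes a lock-free atomic fetch-and-add, removing the lock and its contention.

    #include <atomic>
    #include <mutex>

    // Before: every update serializes on a lock.
    struct CounterLocked {
      long value = 0;
      std::mutex m;
      void add(long x) {
        std::lock_guard<std::mutex> guard(m);
        value += x;
      }
    };

    // After: a lock-free atomic fetch-and-add with no critical section.
    struct CounterAtomic {
      std::atomic<long> value{0};
      void add(long x) { value.fetch_add(x, std::memory_order_relaxed); }
    };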
Berti is a first-level data cache prefetcher that selects the best local deltas, i.e., deltas computed considering only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects timely local deltas with high coverage, Berti generates accurate prefetch requests. It then orchestrates the prefetch requests to the memory hierarchy using the selected deltas. Our empirical results with ChampSim and the SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride prefetcher and by 3.5% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy in the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
Reference: Agustín Navarro-Torres, Biswabandan Panda, Jesús Alastruey-Benedé, Pablo Ibáñez, V. Viñals-Yúfera, Alberto Ros, "Berti: An Accurate Local-Delta Data Prefetcher". 55th International Symposium on Microarchitecture (MICRO), pages 975--991, Chicago, IL (USA), October 2022. [PDF] [BibTeX entry]
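The local-delta idea can be illustrated with the C++ sketch below (illustrative structures, not Berti's exact tables): each load IP keeps a short history of recent demand accesses, counts which deltas would have produced timely prefetches given the observed fill latency, and exposes only high-confidence deltas for prefetching.

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <vector>

    class LocalDeltaTable {
      struct History { uint64_t line; uint64_t cycle; };
      struct Entry {
        std::deque<History> recent;              // recent demands of this IP
        std::unordered_map<int64_t, int> hits;   // delta -> timely occurrences
        int accesses = 0;
      };
      std::unordered_map<uint64_t, Entry> table_;   // keyed by load IP
      static constexpr std::size_t kHistory = 8;

    public:
      // Called on each demand access; 'latency' is the observed fill latency.
      void train(uint64_t ip, uint64_t line, uint64_t cycle, uint64_t latency) {
        Entry &e = table_[ip];
        e.accesses++;
        for (const History &h : e.recent)
          if (cycle - h.cycle >= latency)   // this delta would have been timely
            e.hits[int64_t(line) - int64_t(h.line)]++;
        e.recent.push_back({line, cycle});
        if (e.recent.size() > kHistory) e.recent.pop_front();
      }

      // Deltas whose timely coverage exceeds a confidence threshold.
      std::vector<int64_t> best_deltas(uint64_t ip, double threshold = 0.65) const {
        std::vector<int64_t> out;
        auto it = table_.find(ip);
        if (it == table_.end() || it->second.accesses == 0) return out;
        for (const auto &kv : it->second.hits)
          if (kv.first != 0 &&
              double(kv.second) / it->second.accesses >= threshold)
            out.push_back(kv.first);
        return out;
      }
    };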
Prefetching instructions into the instruction cache is a fundamental technique for designing high-performance computers. We propose the Entangling Prefetcher for Instructions, which entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch of a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully tuned to balance both coverage and accuracy. This prefetcher won the first Instruction Prefetching Championship and was later published at ISCA.
References:
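The entangling idea can be illustrated with the C++ sketch below (illustrative structures, not the championship code): when an instruction-cache line misses, it is entangled with a source line fetched early enough to cover the observed miss latency, so that fetching the source line later triggers a timely prefetch of the entangled destination.

    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <vector>

    class EntanglingTable {
      struct Fetched { uint64_t line; uint64_t cycle; };
      std::deque<Fetched> recent_fetches_;                  // fetch history
      std::unordered_map<uint64_t, std::vector<uint64_t>> entangled_;  // src -> dsts

    public:
      void on_fetch(uint64_t line, uint64_t cycle) {
        recent_fetches_.push_back({line, cycle});
        if (recent_fetches_.size() > 64) recent_fetches_.pop_front();
      }

      // On a miss, entangle the missing line with the youngest source that
      // was still fetched early enough to hide the miss latency.
      void on_miss(uint64_t missing_line, uint64_t cycle, uint64_t latency) {
        for (auto it = recent_fetches_.rbegin(); it != recent_fetches_.rend(); ++it) {
          if (cycle - it->cycle >= latency && it->line != missing_line) {
            entangled_[it->line].push_back(missing_line);
            return;
          }
        }
      }

      // Destination lines to prefetch when 'line' is fetched.
      const std::vector<uint64_t> *prefetch_candidates(uint64_t line) const {
        auto it = entangled_.find(line);
        return it == entangled_.end() ? nullptr : &it->second;
      }
    };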
A well-known benchmark suite of parallel applications is the Splash-2 suite. Since its creation in the context of the DASH project, the Splash-2 benchmarks have been widely used in research. However, Splash-2 was released over two decades ago and does not adhere to the recent C memory consistency model. This leads to unexpected and often incorrect behavior when some Splash-2 benchmarks are used in conjunction with contemporary compilers and hardware (simulated or real). Most importantly, we discovered critical performance bugs. With the Splash-3 benchmark suite we rectify the problematic benchmarks and contribute to the community a new, sanitized version of the Splash-2 benchmarks.
Reference: Christos Sakalis, Carl Leonardsson, Stefanos Kaxiras, Alberto Ros, "Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research". International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 101--111, April 2016. [PDF] [BibTeX entry]
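A typical fix of the kind described above can be sketched in C++ as follows (generic example, not code from the suite): a plain shared flag used for thread-to-thread signalling, which constitutes a data race under the C/C++ memory model and may legally be hoisted out of the spin loop by the compiler, is replaced with an atomic flag using release/acquire ordering.

    #include <atomic>

    std::atomic<int> done{0};
    int shared_data;

    void producer() {
      shared_data = 42;                           // payload
      done.store(1, std::memory_order_release);   // publish the payload
    }

    void consumer() {
      while (!done.load(std::memory_order_acquire)) { /* spin */ }
      int v = shared_data;                        // guaranteed to observe 42
      (void)v;
    }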
Existing multi-threaded applications perform synchronization either explicitly, e.g., using the functionality provided by synchronization libraries, or implicitly, e.g., through shared variables. Unfortunately, implicit synchronization constructs are error-prone and difficult to detect. We developed a tool that detects implicit synchronization in multi-threaded applications. The detection works by checking that, during the execution of an application under a memory model that provides sequential consistency for data-race-free applications (SC for DRF), every read returns the same value as if running under sequential consistency. If this condition is not fulfilled by the execution, the application has data races, which may be intended to perform implicit synchronization.
Reference: Alberto Ros, Stefanos Kaxiras, "Fast&Furious: A Tool for Detecting Covert Racing". 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures (PARMA) and 4th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (DITAM), pages 1--6, January 2015. [PDF] [BibTeX entry]
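The detection condition can be illustrated with the C++ sketch below (illustrative data structures, not the tool's implementation): for each dynamic load, the value observed in the relaxed SC-for-DRF execution is compared with the value the same load returns under sequential consistency; any mismatch points to a racy, possibly intentionally synchronizing, access.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct LoadObservation {
      uint64_t pc;      // instruction address of the load
      uint64_t addr;    // effective address
      uint64_t value;   // value returned by the load
    };

    // Compare two aligned per-load traces: one from the SC-for-DRF execution
    // and one from a sequentially consistent reference execution.
    std::vector<uint64_t> find_racy_loads(const std::vector<LoadObservation> &relaxed,
                                          const std::vector<LoadObservation> &sc) {
      std::vector<uint64_t> racy_pcs;
      std::size_t n = relaxed.size() < sc.size() ? relaxed.size() : sc.size();
      for (std::size_t i = 0; i < n; ++i) {
        if (relaxed[i].addr == sc[i].addr && relaxed[i].value != sc[i].value) {
          std::printf("possible race: load pc=%#llx addr=%#llx\n",
                      (unsigned long long)relaxed[i].pc,
                      (unsigned long long)relaxed[i].addr);
          racy_pcs.push_back(relaxed[i].pc);
        }
      }
      return racy_pcs;
    }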