We extend the state-of-the-art Entangling prefetcher for instructions to avoid training during wrong-path execution. Our Entangling wrong-path-aware prefetcher is equipped with microarchitectural techniques that eliminate more than 99% of wrong-path pollution, thus reaching 98.9% of the performance of an ideal wrong-path-aware solution. We also propose two microarchitectural optimizations that further increase the performance benefits by 1.8%, on average. All this is achieved with just 304 bytes.
Reference: Alberto Ros, Alexandra Jimborean, "Wrong-Path-Aware Entangling Instruction Prefetcher". IEEE Transactions on Computers (TC), 2024. [PDF] [BibTeX entry]
Microarchitecture research relies on performance models with various degrees of accuracy and speed. In the past few years, one such model, ChampSim, has started to gain significant traction by coupling ease of use with a reasonable level of detail and simulation speed. At the same time, datacenter-class workloads, which are not trivial to set up and benchmark, have become easier to study via the release of hundreds of industry traces following the first Championship Value Prediction (CVP-1) in 2018. A tool was quickly created to port the CVP-1 traces to the ChampSim format, and the converted traces have been used in many recent works as a result. We revisit this conversion tool and find that several key aspects of the CVP-1 traces are not preserved by the conversion. We therefore propose an improved converter that addresses most conversion issues and also patches known limitations of the CVP-1 traces themselves. We evaluate the impact of our changes on two commits of ChampSim, one of which was used for the first Instruction Prefetching Championship (IPC-1) in 2020. We find that the performance variation stemming from higher-accuracy conversion is significant.
Reference: Josué Feliu, Arthur Perais, Daniel Jimenez, Alberto Ros, "Rebasing Microarchitectural Research with Industry Traces". 2023 IEEE International Symposium on Workload Characterization (IISWC), pp. 100--114, Ghent, Belgium, October 2023. [PDF] [BibTeX entry]
The Modular Branch Prediction Library (MBPlib) is an open-source C++ library. MBPlib runs over 18.4x faster than the current fastest framework, and its trace format uses 6.5x less disk space. MBPlib also makes development easier by providing utilities that are typically used as subcomponents in most branch prediction designs. Moreover, the library features one of the largest collections of example implementations, including traditional as well as state-of-the-art predictors. MBPlib will allow researchers to significantly reduce the time needed for evaluation. Furthermore, by giving the option of obtaining results within seconds, as well as by means of the broad collection of examples, written in a modern and uniform code style, MBPlib can significantly decrease the barrier to entry into the field. Thus, we believe that MBPlib is also a great tool for computer architecture classes.
Reference: Emilio Dominguez-Sanchez, Alberto Ros, "MBPlib: Modular Branch Prediction Library". International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 71--80, Raleigh, NC (USA), April 2023. [PDF] [BibTeX entry]
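To give a flavor of the subcomponents that most branch predictor designs share, the following is a minimal sketch of a saturating counter and a bimodal predictor built on top of it. It is generic Python written for illustration only and does not reproduce MBPlib's C++ interface; the names `SaturatingCounter`, `BimodalPredictor`, and `misprediction_count`, as well as the table size, are hypothetical.

```python
class SaturatingCounter:
    """n-bit up/down counter that saturates at 0 and 2**bits - 1."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1
        # Start just below the midpoint (weakly not-taken).
        self.value = self.max_value // 2

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)

    def predict_taken(self):
        # The upper half of the range means "taken".
        return self.value > self.max_value // 2


class BimodalPredictor:
    """Table of saturating counters indexed by the low bits of the branch address."""

    def __init__(self, log_table_size=10):
        self.mask = (1 << log_table_size) - 1
        self.table = [SaturatingCounter() for _ in range(1 << log_table_size)]

    def predict(self, pc):
        return self.table[pc & self.mask].predict_taken()

    def train(self, pc, taken):
        self.table[pc & self.mask].update(taken)


def misprediction_count(predictor, trace):
    """Replay (pc, taken) pairs and count the wrong predictions."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.train(pc, taken)
    return misses
```

For a loop branch that is taken nine times and then falls through once, this sketch mispredicts only the first and the last outcome, which is the classic behavior of a 2-bit counter.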
For the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, following the definition of the C memory model, the Splash-3 suite eliminated the performance bugs, data races, and improper synchronization present in the Splash-2 suite. The suite presented here revisits the Splash-3 applications and adapts them for contemporary architectures with atomic operations and lock-free constructs. With our changes, scalability improves for most applications from 32 up to 64 cores, showing an improvement of up to 9x on real machines and up to 5x in simulation compared to the unmodified Splash-3 applications. To denote the substantive nature of the improvements to the Splash-3 benchmarks and to reintroduce them into contemporary research, we refer to the new collection as Splash-4.
References:
Instruction cache prefetching is a fundamental technique for designing high-performance computers. We present the Entangling instruction prefetcher, which entangles instructions to maximize timeliness. The mechanism works by finding which instruction should trigger the prefetch for a later instruction, taking into account the latency of each cache miss. The mechanism is carefully tuned to account for both coverage and accuracy. This prefetcher won the first Instruction Prefetching Championship and was later published at the ISCA conference.
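The core idea, choosing a trigger early enough to hide the miss latency, can be sketched as follows. This is a deliberately simplified illustration and not the published design: it assumes a small history of (cache line, access cycle) pairs and omits the tuning for coverage and accuracy. The class name `EntanglingSketch`, its methods, and the history size are hypothetical.

```python
from collections import deque


class EntanglingSketch:
    """Toy model: pair each missing line with an earlier line that can trigger its prefetch in time."""

    def __init__(self, history_size=16):
        # Recent accesses as (line, access_cycle), oldest first.
        self.history = deque(maxlen=history_size)
        # Source line -> set of destination lines to prefetch when the source is accessed.
        self.entangled = {}

    def on_access(self, line, cycle):
        """Record the access and return the lines that should be prefetched now."""
        self.history.append((line, cycle))
        return sorted(self.entangled.get(line, ()))

    def on_miss_resolved(self, line, miss_cycle, latency):
        """Entangle `line` with the latest access made at least `latency` cycles before the miss."""
        deadline = miss_cycle - latency
        for src, cycle in reversed(self.history):
            if cycle <= deadline and src != line:
                self.entangled.setdefault(src, set()).add(line)
                return src
        return None  # No access in the history is old enough to hide the latency.
```

For example, if lines A, B, and C are accessed at cycles 0, 30, and 90, and line D misses at cycle 100 with a latency of 50 cycles, the sketch entangles B with D: B is the most recent access that happened at least 50 cycles before the miss, so a prefetch issued on the next access to B has time to complete before D is needed.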
References:
A well-known benchmark suite of parallel applications is the Splash-2 suite. Since its creation in the context of the DASH project, Splash-2 benchmarks have been widely used in research. However, Splash-2 was released over two decades ago and does not adhere to the recent C memory consistency model. This leads to unexpected and often incorrect behavior when some Splash-2 benchmarks are used in conjunction with contemporary compilers and hardware (simulated or real). Most importantly, we discovered critical performance bugs. In the Splash-3 benchmark suite we rectify the problematic benchmarks and contribute to the community a new sanitized version of the Splash-2 benchmarks.
Reference: Christos Sakalis, Carl Leonardsson, Stefanos Kaxiras, Alberto Ros, "Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research". International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 101--111, April 2016. [PDF] [BibTeX entry]
Existing multi-threaded applications perform synchronization either explicitly, e.g., by using the functionality provided by synchronization libraries, or implicitly, e.g., by using shared variables. Unfortunately, implicit synchronization constructs are error-prone and difficult to detect. We developed a tool that is able to detect implicit synchronization in multi-threaded applications. The detection is performed by checking that, during the execution of an application under a memory model that provides sequential consistency for data-race-free applications (SC for DRF), every read returns the same value as if running under sequential consistency. If an execution does not fulfill this condition, the application has data races, which may be intended to perform implicit synchronization.
Reference: Alberto Ros, Stefanos Kaxiras, "Fast&Furious: A Tool for Detecting Covert Racing". 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures (PARMA) and 4th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (DITAM), pp. 1--6, January 2015. [PDF] [BibTeX entry]
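The detection condition can be illustrated with a toy model. The sketch below is not the tool's implementation. It assumes, purely for illustration, that the SC-for-DRF memory model can be stood in for by per-thread private copies that are refreshed only at explicit synchronization points, and it flags every read whose private copy differs from the globally visible value that sequential consistency would return. The class `RaceDetectorSketch` and its methods are hypothetical.

```python
class RaceDetectorSketch:
    """Toy check: does a read see the same value it would see under sequential consistency?"""

    def __init__(self):
        self.memory = {}  # Globally visible values (what SC would return).
        self.caches = {}  # Thread id -> private copies (assumed stand-in for SC for DRF).
        self.races = []   # (thread id, variable) pairs where the two views disagreed.

    def write(self, tid, var, value):
        self.memory[var] = value
        self.caches.setdefault(tid, {})[var] = value

    def read(self, tid, var):
        cache = self.caches.setdefault(tid, {})
        sc_value = self.memory.get(var)
        if var in cache and cache[var] != sc_value:
            # The private copy is stale: another thread wrote the variable
            # with no synchronization in between, so this read races.
            self.races.append((tid, var))
        cache[var] = sc_value
        return sc_value

    def sync(self, tid):
        # Explicit synchronization point: discard all private copies.
        self.caches[tid] = {}
```

In this model, a thread that polls a shared flag written by another thread, with no `sync` call in between, is reported as racing on that flag, which is the pattern that implicit synchronization typically produces. If the reader calls `sync` before re-reading, nothing is reported.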