### CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

Sawan Singh<sup>1</sup>, Josue Feliu<sup>2</sup>, Manuel E. Acacio<sup>1</sup>, Alexandra Jimborean<sup>1</sup>, Alberto Ros<sup>1</sup>

<sup>1</sup> Universidad de Murcia Murcia, Spain



<sup>2</sup>Universitat Politècnica de València Valencia, Spain




Monday, October 23, 2023

32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)



CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions @ PACT'23



- The load queue (LQ) is one of the most critical structures in a processor
- The LQ keeps all in-flight loads in order and supports priority searches
- LQ size has been increasing
- Energy consumption of the LQ is also growing
- Simultaneous multithreading (SMT) intensifies the pressure on the LQ, as it requires additional LQ searches








#### We propose CELLO

- A software-hardware co-design for SMT processors with the TSO consistency model
- The compiler detects memory operations in data-race-free (DRF) regions
- The hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model
- CELLO reduces LQ searches by roughly half (47% in the evaluation)

#### Outline



- Overview
- Background
- CELLO
- Evaluation
- Conclusion













- When loads execute, the target addresses of older stores may be unknown
- Loads executing in the presence of older unresolved stores are dependency-speculative (D-speculative)
- When a store resolves its address:
  - The store searches the LQ to make its presence known to younger loads that might have executed D-speculatively
  - If such a load is found, the load and the subsequent instructions are squashed and re-executed
  - LQ searches by stores account for 51% of total LQ searches

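The store-side check for D-speculative loads can be sketched as a small C model. This is only an illustration: the `LqEntry` layout, its field names, and the `store_resolve_search` function are assumptions, and the LQ is modelled as a plain array ordered from oldest to youngest load.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LQ entry; names and fields are assumptions for illustration. */
typedef struct {
    uint64_t seq;       /* program-order sequence number (larger = younger) */
    uint64_t addr;      /* address the load read from */
    bool     executed;  /* load has already obtained its value */
} LqEntry;

/*
 * Called when a store with sequence number store_seq resolves its address.
 * Scans the LQ (ordered oldest to youngest) for the oldest younger load
 * that already executed to the same address, i.e. a load that executed
 * D-speculatively and may have read a stale value.
 * Returns the index of the first entry to squash (that load and every
 * younger instruction must be re-executed), or -1 if no such load exists.
 */
int store_resolve_search(const LqEntry *lq, size_t n,
                         uint64_t store_seq, uint64_t store_addr)
{
    for (size_t i = 0; i < n; i++) {
        if (lq[i].seq > store_seq && lq[i].executed &&
            lq[i].addr == store_addr) {
            return (int)i;
        }
    }
    return -1;
}
```

Every resolved store pays for this scan whether or not a violation exists, which is why these searches make up such a large share of LQ activity.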





- TSO respects load-load ordering
- A younger load that executes before an older load has performed becomes speculative
- These speculative loads are called memory-speculative (M-speculative)
- Cache invalidations triggered by another core can expose these speculative loads
- Cache evictions are also treated as invalidations, because a line that has been evicted from the cache can no longer receive an invalidation
- The LQ is searched on cache invalidations and evictions, which account for about 3% of LQ searches in the evaluated benchmarks

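The invalidation-side check for M-speculative loads can be illustrated with a similar toy model. The 64-byte line size, the `LqEntry` fields, and the `invalidation_search` function are assumptions made for this sketch. A load counts as M-speculative here when it has performed while some older load in the LQ has not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Hypothetical LQ entry; the LQ array is ordered oldest to youngest. */
typedef struct {
    uint64_t addr;       /* address the load reads */
    bool     performed;  /* load has obtained its value */
} LqEntry;

/*
 * Called when a cache line is invalidated or evicted.
 * Returns the index of the oldest M-speculative load that read the line
 * (it and all younger instructions must be squashed), or -1 if none.
 */
int invalidation_search(const LqEntry *lq, size_t n, uint64_t line_addr)
{
    bool older_unperformed = false;
    for (size_t i = 0; i < n; i++) {
        if (!lq[i].performed) {
            older_unperformed = true;  /* younger performed loads are M-speculative */
            continue;
        }
        if (older_unperformed &&
            lq[i].addr / LINE_BYTES == line_addr / LINE_BYTES) {
            return (int)i;
        }
    }
    return -1;
}
```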





- Multiple SMT threads can run in a single SMT core
- Structures in blue are partitioned between SMT threads
- There are no invalidations to check load-load ordering between these threads, as they now execute in a single SMT core
- Instead, stores search the LQs of the other threads when writing to the cache
- This additional search, required in SMT processors to maintain load-load ordering, contributes 46% of total LQ searches






In the SMT processor, the LQ is searched:

1. When a store resolves its address at the execute stage (51%)
2. On cache invalidations and cache evictions (3%)
3. When stores write to the cache (46%)
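As a quick arithmetic check, the three sources can be grouped by what they protect. Searches at store address resolution guard store-load dependences (D-speculation), while the other two guard load-load ordering (M-speculation). The constant and function names below are illustrative only.

```c
/* Shares of total LQ searches, in percent, as reported in the talk. */
enum {
    STORE_RESOLVE_PCT = 51,  /* store resolves its address */
    INVAL_EVICT_PCT   = 3,   /* cache invalidations and evictions */
    STORE_WRITE_PCT   = 46   /* stores writing to the cache (SMT) */
};

/* Searches that check for D-speculative loads. */
int d_spec_search_pct(void)
{
    return STORE_RESOLVE_PCT;
}

/* Searches that check for M-speculative loads (load-load ordering). */
int m_spec_search_pct(void)
{
    return INVAL_EVICT_PCT + STORE_WRITE_PCT;
}
```

Under this grouping, about 49% of all LQ searches exist only to preserve load-load ordering.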



#### CELLO [Design Overview]

- A software-hardware co-designed approach
- Leverages the SC-for-DRF consistency model
- The CELLO compiler classifies each memory access as belonging to a sync or a DRF region
- Compiler information is transmitted to the hardware by a dedicated instruction
- Based on the DRF information, CELLO:
  - Filters the LQ searches in DRF regions
  - Facilitates early load exit from the LQ
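One way to picture the search filtering is the sketch below. It is not the authors' micro-architecture: it assumes each LQ entry carries a hypothetical `drf` bit derived from the compiler information, and that each thread's LQ tracks how many in-flight loads come from sync regions. Because DRF loads are never M-speculative, a load-load ordering search can be skipped whenever that count is zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LQ entry; the drf bit is an assumed encoding. */
typedef struct {
    uint64_t line;       /* cache-line address read by the load */
    bool     performed;  /* load has obtained its value */
    bool     drf;        /* load was issued inside a DRF region */
} LqEntry;

/* Hypothetical per-thread LQ partition, ordered oldest to youngest. */
typedef struct {
    const LqEntry *entries;
    size_t         count;
    size_t         sync_loads;  /* in-flight loads with drf == false */
} ThreadLq;

/*
 * Load-load ordering search triggered by a store write or an invalidation.
 * Returns false when the search is filtered (only DRF loads in flight).
 * Otherwise scans the LQ and stores in *squash_index the oldest sync load
 * that performed ahead of an older unperformed load and read the line,
 * or -1 if there is none.
 */
bool m_spec_search(const ThreadLq *lq, uint64_t line, int *squash_index)
{
    *squash_index = -1;
    if (lq->sync_loads == 0)
        return false;  /* filtered: no load here can be M-speculative */

    bool older_unperformed = false;
    for (size_t i = 0; i < lq->count; i++) {
        const LqEntry *e = &lq->entries[i];
        if (!e->performed) {
            older_unperformed = true;
            continue;
        }
        if (!e->drf && older_unperformed && e->line == line) {
            *squash_index = (int)i;
            break;
        }
    }
    return true;
}
```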



    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = a[i] + 10;
        lock(mtx);
        counter++;
        b += a[i];
        unlock(mtx);
        c[i] = c[i] + 5;
    }

DRF (runs sequentially): no conflicts are possible, as they run sequentially.





- In DRF regions, no two threads/cores can concurrently access the same memory location if at least one of the accesses is a write
   1. Loads in DRF regions can perform out of order (OoO) without breaking TSO guarantees; they are non-M-speculative
   2. No LQ search is required to maintain load-load ordering in a DRF region
- All non-DRF regions are sync regions, where TSO load-load ordering must be respected
- CELLO delineates DRF and sync regions with the setDRF instruction:
  - setDRF 1: start of DRF region
  - setDRF 0: end of DRF region
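To make the region markers concrete, the sketch below instruments the loop body from the earlier example in a single-threaded toy model. Everything here is an assumption for illustration: `set_drf` is a software stand-in that merely records mode changes, `toy_lock`/`toy_unlock` are placeholder primitives, and the marker placement reflects one reading of the slides in which only the lock acquire and release are sync regions, while the critical-section body counts as DRF because it runs sequentially.

```c
#include <stddef.h>

#define MAX_EVENTS 64

/* Log of setDRF operands, standing in for the hardware mode changes. */
int drf_events[MAX_EVENTS];
size_t drf_event_count = 0;

/* Hypothetical stand-in for the setDRF instruction. */
void set_drf(int on)
{
    if (drf_event_count < MAX_EVENTS)
        drf_events[drf_event_count++] = on;
}

/* Placeholder lock primitives (single-threaded model, no real locking). */
int mtx;
void toy_lock(int *m)   { *m = 1; }
void toy_unlock(int *m) { *m = 0; }

int counter;
int b;

void instrumented_loop(int *a, int *c, int n)
{
    set_drf(1);                /* setDRF 1: start of DRF region */
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + 10;      /* DRF: each iteration owns its a[i] */
        set_drf(0);            /* lock acquisition is a sync region */
        toy_lock(&mtx);
        set_drf(1);            /* critical-section body runs sequentially */
        counter++;
        b += a[i];
        set_drf(0);            /* lock release is a sync region */
        toy_unlock(&mtx);
        set_drf(1);
        c[i] = c[i] + 5;       /* DRF: each iteration owns its c[i] */
    }
    set_drf(0);                /* setDRF 0: end of DRF region */
}
```

In a real toolchain the compiler would emit the markers; this model only shows where the boundaries would fall under the stated assumption.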





























































#### CELLO [u-Arch design, early removal of loads]

The LQ head is safe to remove when:

- The LQ head becomes non-M-speculative (DRF loads are non-M-speculative by default)
- The LQ head becomes non-D-speculative





- CELLO provides a simple design to filter M-speculative LQ searches in SMT processors
- CELLO allows a DRF load to be removed early from the LQ head once all older stores have resolved their addresses and searched the LQ
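The early-removal condition can be written as a small predicate. The `LqHead` fields below are hypothetical bookkeeping chosen for this sketch, and requiring the load to have performed is an added assumption rather than something the slides state.

```c
#include <stdbool.h>

/* Hypothetical state tracked for the load at the LQ head. */
typedef struct {
    bool     performed;            /* load has obtained its value (assumed precondition) */
    bool     in_drf_region;        /* load belongs to a DRF region */
    bool     load_order_safe;      /* sync loads only: load-load ordering can no longer be violated */
    unsigned pending_older_stores; /* older stores that have not yet resolved and searched the LQ */
} LqHead;

/* DRF loads are never M-speculative; sync loads must wait until ordering is safe. */
bool is_m_speculative(const LqHead *h)
{
    return !h->in_drf_region && !h->load_order_safe;
}

/* A load stays D-speculative while any older store is still unresolved. */
bool is_d_speculative(const LqHead *h)
{
    return h->pending_older_stores > 0;
}

/* The head may leave the LQ once it is neither M- nor D-speculative. */
bool can_remove_head(const LqHead *h)
{
    return h->performed && !is_m_speculative(h) && !is_d_speculative(h);
}
```

For a DRF load, the first condition holds by construction, so only the older stores decide when it can leave.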




- Detailed in-house out-of-order SMT processor model
- Uses Sniper as the front end and GEMS for the memory model
- Standard invalidation-based directory protocol using GARNET
- TSO-like consistency
- Intel Alder Lake micro-architecture
- CACTI-P is used to model energy consumption
- Splash-3, PARSEC 3.0, and six fine-grain synchronization-intensive applications are used as benchmarks





















- M-speculative LQ searches are almost eliminated
- Overall, 47% of LQ searches are filtered by CELLO


#### Evaluation [Execution time]

- LQ search filtering helps reduce LQ search port contention
- Removing loads early helps in some applications
- CELLO provides a speedup of 2.8% on average

#### Evaluation [LQ energy]

- Searches account for 65% of LQ energy consumption
- As CELLO filters most of the M-speculative searches, LQ energy expenditure is reduced by about 33%
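A first-order sanity check of the energy figure can be sketched as follows. It assumes, purely for illustration, that every LQ search costs the same energy and that only search energy changes; the function name is hypothetical.

```c
/*
 * First-order estimate of the LQ energy saved by filtering searches.
 * search_energy_share: fraction of LQ energy spent on searches (0.65 in the talk)
 * searches_filtered:   fraction of LQ searches removed (0.47 in the talk)
 */
double estimated_lq_energy_saving(double search_energy_share,
                                  double searches_filtered)
{
    return search_energy_share * searches_filtered;
}
```

With the reported inputs, the estimate is 0.65 × 0.47 ≈ 0.31, in the same range as the roughly 33% reduction reported; the simple model does not account for the remaining difference.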

#### Evaluation [Sensitivity analysis]

[Figure: CELLO vs. Baseline; x-axis: LQ size]

Key observations:

- A smaller LQ consumes less energy
- CELLO opens a design space with a smaller LQ that does not compromise performance relative to the baseline without CELLO and a 192-entry LQ
- CELLO reduces the LQ size from 192 to 80 entries while providing the same performance




- The compiler can help optimize hardware
- SMT suffers from extensive LQ searches
- CELLO can:
  1. Avoid 47% of LQ searches
  2. Provide a speedup of 2.8% on average (up to 18.6%)
  3. Reduce LQ energy consumption by 33%
- CELLO opens an interesting design space by allowing the LQ size to be reduced from 192 to 80 entries without any performance loss.

## CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

Sawan Singh, Josue Feliu, Manuel E. Acacio, Alexandra Jimborean, Alberto Ros

singh.sawan@um.es

Thank you for your attention!



Funded by the European Union



Established by the European Commission



MINISTERIO DE CIENCIA E INNOVACIÓN



Funded by the European Union (NextGenerationEU)





ECHO, ERC Consolidator Grant (No 819134)

This presentation belongs to the authors. No distribution is allowed without the authors' permission.


#### **Backup Slides**

