





### ALTERNATE PATH $\mu$ -OP CACHE PREFETCHING

Sawan Singh<sup>1</sup> Arthur Perais<sup>2</sup> Alexandra Jimborean<sup>1</sup> Alberto Ros<sup>1</sup>

<sup>1</sup>Computer Engineering Department University of Murcia, Spain <sup>2</sup>TIMA, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France









ISCA'51, Session 10A, July 3, 2024





 $\stackrel{\text{COMPUTER OX PARALLEL ARCHITECTURE OX SYSTEMS}}{\rightarrow} \mu\text{-op Cache}$ 





Holds recently decoded μ-ops





### $\stackrel{\square}{ o}$ $\mu$ -op Cache

- ullet Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder

Solomon et al. Micro-operation cache: a power aware frontend for variable instruction length ISA





- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads

Solomon et al. Micro-operation cache: a power aware frontend for variable instruction length ISA

Alternate Path µ-op Cache Prefetching @ISCA'51





- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache





- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement





- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement
- $\rightarrow$  We propose UCP (Alternate Path  $\mu$ -op Cache Prefetching)





- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement
- $\rightarrow$  We propose UCP (Alternate Path  $\mu$ -op Cache Prefetching)
  - Identify hard-to-predict branches







- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement
- $\rightarrow$  We propose UCP (Alternate Path  $\mu$ -op Cache Prefetching)
  - Identify hard-to-predict branches
  - Prefetch  $\mu$ -ops from alternate path of the hard-to-predict branch



Solomon et al. Micro-operation cache: a power aware frontend for variable instruction length ISA

Alternate Path u-op Cache Prefetching @ISCA'51





#### $\stackrel{\longrightarrow}{\rightarrow} \mu$ -op Cache

- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement
- $\rightarrow$  We propose UCP (Alternate Path  $\mu$ -op Cache Prefetching)
  - Identify hard-to-predict branches
  - Prefetch  $\mu$ -ops from alternate path of the hard-to-predict branch



<sup>&</sup>lt;sup>1</sup> Solomon et al. Micro-operation cache: a power aware frontend for variable instruction length ISA

Alternate Path u-op Cache Prefetching @ISCA'51





### $\mu$ -op Cache

- Holds recently decoded  $\mu$ -ops
- First introduced for energy savings<sup>1</sup> in x86 which requires complex decoder
- Current μ-op caches are severely overwhelmed by server workloads
  - ullet Only provide 0.87% improvement over no  $\mu$ -op cache
  - Ideal  $\mu$ -op cache can provide 10.82% improvement
- $\rightarrow$  We propose UCP (Alternate Path  $\mu$ -op Cache Prefetching)
  - Identify hard-to-predict branches
  - Prefetch μ-ops from alternate path of the hard-to-predict branch

    Prefetch to μ-op cache



Solomon et al. Micro-operation cache: a power aware frontend for variable instruction length ISA

Alternate Path u-op Cache Prefetching @ISCA'51



### **OUTLINE**



- Overview
- Background & Motivation
- UCP
- Methodology & Results
- Conclusions































































predictor

Taken / not taken

### BACKGROUND & MOTIVATION



PROCESSOR FRONT-END

ightarrowDecode latency





### BACKGROUND & MOTIVATION



PROCESSOR FRONT-END

→Decode latency

From  $\mu$ -op cache: $\rightarrow$ Decode energy





### **BACKGROUND & MOTIVATION**







### BACKGROUND & MOTIVATION







### BACKGROUND & MOTIVATION



- $\rightarrow$  4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown



### BACKGROUND & MOTIVATION



PERFORMANCE OF  $\mu$ -OPS CACHE WITH SERVER WORKLOADS



- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help

IPC Improvement (w.r.t 4Kops  $\mu$ -op cache)



### BACKGROUND & MOTIVATION



- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help





### BACKGROUND & MOTIVATION



- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help





### BACKGROUND & MOTIVATION



- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help





### BACKGROUND & MOTIVATION





- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help





### **BACKGROUND & MOTIVATION**



- $\rightarrow$  4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- $\rightarrow$  19.3% of traces show a slowdown
- Increasing size of  $\mu$ -op cache does not help





### BACKGROUND & MOTIVATION

PERFORMANCE OF  $\mu$ -OPS CACHE WITH SERVER WORKLOADS



- ightarrow 4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- → 19.3% of traces show a slowdown
- ightarrow Increasing size of  $\mu$ -op cache does not help
- ightarrow Ideal  $\mu$ -op cache can provide 10.82% improvement



IPC Improvement (w.r.t 4Kops  $\mu$ -op cache)



### **BACKGROUND & MOTIVATION**

Performance of  $\mu$ -ops cache with server workloads



- $\rightarrow$  4Kops  $\mu$ -op provides only 0.87% improvement over no  $\mu$ -op cache
- $\rightarrow$  19.3% of traces show a slowdown
- $\rightarrow$  Increasing size of  $\mu$ -op cache does not help
  - Ideal  $\mu$ -op cache can provide 10.82% improvement



**Key Insight:** *Increasing the size* of  $\mu$ -op cache by 16x is still not close to Ideal μ-op cache





→ FTQ is unable to hide the L1I miss latency on branch misprediction





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction







- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction







- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



IPC Improvement (w.r.t 4Kops  $\mu$ -op cache)





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



IPC Improvement (w.r.t 4Kops  $\mu$ -op cache)





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



IPC Improvement (w.r.t 4Kops  $\mu$ -op cache)





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



### Key Insight:





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



## Key Insight:

**1.** FTQ is unable to hide the fetch latency on branch misprediction





- → FTQ is unable to hide the L1I miss latency on branch misprediction
  - FTQ is cleared on a branch misprediction
- $\rightarrow$  What if the correct path was always in the  $\mu$ -op cache after a pipeline flush due to branch misprediction?



### **Key Insight:**

- **1.** FTQ is unable to hide the fetch latency on branch misprediction
- **2.** Focusing on few but critical instructions can provide significant performance benefits



### **OUTLINE**



- Overview
- Background & Motivation
- UCP
- Methodology & Results
- Conclusions



#### UCP UCP: OVERVIEW



Identifies a hard-to-predict conditional branch (H2P)



### UCP UCP: OVERVIEW



- ① Identifies a hard-to-predict conditional branch (H2P)
- On a H2P begin generating addresses on alternate path (alternate path)





## UCP: OVERVIEW



- ① Identifies a hard-to-predict conditional branch (H2P)
- On a H2P begin generating addresses on alternate path (alternate path)
- 3 Generated addresses on alternate path are prefetched in parallel to predicted path fetch





#### UCP UCP: OVERVIEW



- ① Identifies a hard-to-predict conditional branch (H2P)
- On a H2P begin generating addresses on alternate path (alternate path)
- 3 Generated addresses on alternate path are prefetched in parallel to predicted path fetch
- $\Phi$  Prefetched instructions are decoded and inserted in the  $\mu$ -op cache





### UCP UCP: OVERVIEW



- Identifies a hard-to-predict conditional branch (H2P)
- On a H2P begin generating addresses on alternate path (alternate path)
- Generated addresses on alternate path are prefetched in parallel to predicted path fetch
- $\Phi$  Prefetched instructions are decoded and inserted in the  $\mu$ -op cache



**Key Idea:** Keep the alternate path in the  $\mu$ -op cache for H2P branches



## UCP: H2P Branch Detection



→ H2P Branch: a branch which has high chance of being mispredicted



### UCP UCP: H2P Branch Detection



- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>







- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>
    - Not saturated predictions from AltBank, HitBank & BiModal







- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>
    - Not saturated predictions from AltBank, HitBank & BiModal
    - Does not consider SC and LP





- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>
    - Not saturated predictions from AltBank, HitBank & BiModal
    - Does not consider SC and LP
  - → UCP-Conf





- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>
    - Not saturated predictions from AltBank, HitBank & BiModal
    - Does not consider SC and LP
  - → UCP-Conf
    - All predictions from AltBanks shows high miss rate







- → H2P Branch: a branch which has high chance of being mispredicted
  - → TAGE-Conf<sup>2</sup>
    - Not saturated predictions from AltBank, HitBank & BiModal
    - Does not consider SC and LP
  - → UCP-Conf
    - All predictions from AltBanks shows high miss rate
    - SC shows high miss rate













### UCP UCP: H2P Branch Detection





→ TAGE-Conf provide 48.5% coverage and 12% accuracy



### UCP UCP: H2P Branch Detection





- → TAGE-Conf provide 48.5% coverage and 12% accuracy
- → UCP-Conf improve coverage to 70% and accuracy to 14.66%



## UCP





Alt. RAS Alt. IND













































## UCP: ALTERNATE PATH STOPPING CONDITIONS



ightarrow Stopping alternate path is crucial to prevent  $\mu ext{-op}$  cache pollution



#### UCP



#### UCP: ALTERNATE PATH STOPPING CONDITIONS

- ightarrow Stopping alternate path is crucial to prevent  $\mu$ -op cache pollution
- → When a new H2P branch is detected



### UCP UCP: ALTERNATE PATH STOPPING CONDITIONS



- $\rightarrow$  Stopping alternate path is crucial to prevent  $\mu$ -op cache pollution
- → When a new H2P branch is detected
- → When the alternate path is unlikely to become the correct path



### UCP UCP: ALTERNATE PATH STOPPING CONDITIONS



- $\rightarrow$  Stopping alternate path is crucial to prevent  $\mu$ -op cache pollution
- → When a new H2P branch is detected
- → When the alternate path is unlikely to become the correct path
  - BTB miss on predicted taken branch on the alternate path



#### **UCP**



#### UCP: ALTERNATE PATH STOPPING CONDITIONS

- $\rightarrow$  Stopping alternate path is crucial to prevent  $\mu$ -op cache pollution
- → When a new H2P branch is detected
- → When the alternate path is unlikely to become the correct path
  - BTB miss on predicted taken branch on the alternate path
  - Weight of each branch on the alternate path is accumulated, high confidence branches have lower weight.





### **UCP**



#### UCP: ALTERNATE PATH STOPPING CONDITIONS

- $\rightarrow$  Stopping alternate path is crucial to prevent  $\mu$ -op cache pollution
- → When a new H2P branch is detected
- → When the alternate path is unlikely to become the correct path
  - BTB miss on predicted taken branch on the alternate path
  - Weight of each branch on the alternate path is accumulated, high confidence branches have lower weight.
  - Non-branch instructions after a branch are counted. Once the count reaches 64 alternate paths stops in our work





#### **OUTLINE**



- Overview
- Background & Motivation
- UCP
- Methodology & Results
- Conclusions





→ ChampSim + subset (traces showing  $\geq$  5% improvement with ideal  $\mu$ -op cache) of CVP traces (2 FP, 97 INT, 73 Crypto and 134 datacenter trace)





- → ChampSim + subset (traces showing  $\geq$  5% improvement with ideal  $\mu$ -op cache) of CVP traces (2 FP, 97 INT, 73 Crypto and 134 datacenter trace)
- → Intel Alder Lake like microarchitecture





- → ChampSim + subset (traces showing  $\geq$  5% improvement with ideal  $\mu$ -op cache) of CVP traces (2 FP, 97 INT, 73 Crypto and 134 datacenter trace)
- → Intel Alder Lake like microarchitecture
- → We execute 100M instructions, 50M warmup and 50M to collect stats





- → ChampSim + subset (traces showing  $\geq$  5% improvement with ideal  $\mu$ -op cache) of CVP traces (2 FP, 97 INT, 73 Crypto and 134 datacenter trace)
- → Intel Alder Lake like microarchitecture
- → We execute 100M instructions, 50M warmup and 50M to collect stats
- ightarrow 1 cycle penalty for switching from  $\mu$ -op cache to L1I cache



























#### METHODOLOGY & RESULTS













### METHODOLOGY & RESULTS













### METHODOLOGY & RESULTS

























### METHODOLOGY & RESULTS













### METHODOLOGY & RESULTS







#### **OUTLINE**



- Overview
- Background & Motivation
- UCP
- Methodology & Results
- Conclusions





 $\rightarrow\,$  FTQ fails to hide L1I miss latency on branch miss





- → FTQ fails to hide L1I miss latency on branch miss
- → Focusing only a few but critical instructions can provide better performance





- ightarrow FTQ fails to hide L1I miss latency on branch miss
- → Focusing only a few but critical instructions can provide better performance
- → UCP focus on critical instructions after a H2P branch





- → FTQ fails to hide L1I miss latency on branch miss
- → Focusing only a few but critical instructions can provide better performance
- → UCP focus on critical instructions after a H2P branch
- ightarrow Still space for improvement in optimizing  $\mu$ -op cache

### ALTERNATE PATH $\mu$ -OP CACHE PREFETCHING

Sawan Singh<sup>1</sup> Arthur Perais<sup>2</sup> Alexandra Jimborean<sup>1</sup>
Alberto Ros<sup>1</sup>



singh.sawan@um.es











ECHO, ERC Consolidator Grant (No 819134)

This presentation and recording belong to the authors. No distribution is allowed without the authors' permission.







- → UCP reuses the BTB by doubling the number of BTB banks (from 16 to 32)
- → Each cycle we determine the banks to be accessed
- → By default, demand requests are given priority to access the conflict banks
- → UCP keeps a 3-bit saturated counter which is incremented every time the alternate path is delayed
- → When the counter saturates, the alternate path is given priority for the conflict banks in that cycle
- → The counter resets next cycle















