Cost Model¶

Single-GPU Cost Modeling¶

Latency / Runtime Cost Estimation

Roofline Model
- Idea: \(\max(\text{compute time}, \text{memory access time})\)
- Use linear regression model for NVIDIA GPUs
- Compute time model (\(\text{latency} + \text{bandwidth} \times \text{#flops}\))
- Memory access model (\(\text{latency} + \text{bandwidth} \times \text{#bytes}\))

Memory Cost Estimation

Memory cost estimation for each operator
- FX graph enables precise peak memory analysis
- Memory liveness analysis
  - Track the time intervals during which tensors are alive

Communication Modeling

Challenge * Communication cost varies for multi-level hierarchical communication * Congestion may occur in certain scenarios

Device Mesh Abstraction

Mapping the GPU network hierarchy into a N-dimension device mesh
Devices in the same mesh dimension have the same communication capability (latency, bandwidth)
How to obtain the device mesh?
- Current Approach: Predefined device mesh shape
  - Profile all-reduce among the devices in the same mesh dim
  - Linear regression along each mesh dim \(i\) (\(\text{time}_i = \text{latency}_i + \text{bandwidth}_i \times \text{#bytes}\))
- In-Progress: Auto Mesh Discovery
  - Motivation
    - More robust to complex communication architecture
    - More accurate since each dimension can be considered respectively
  - Profile the communication latency between each mesh dimension
  - Search the mesh shape where the communication time along the same dim with max similarity
  - Linear regression along each mesh dim \(i\) (\(\text{time}_i = \text{latency}_i + \text{bandwidth}_i \times \text{#bytes}\))