How do OpenAI or DeepMind calculate the cost of training a transformer-based model?
The basic equation giving the cost to train a transformer model is:

C ≈ τT

where:
- C is the compute required to train the transformer model, in total floating-point operations
- τ is the aggregate throughput of your hardware setup: τ = (Number of GPUs) × (Actual FLOP/s per GPU), in FLOPs per second
- T is the time spent training the model, in seconds
These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
This formula can be further simplified to:

C ≈ 6PD

where:
- P is the number of parameters in the transformer model
- D is the dataset size, in tokens
It's worth noting the units for C. C can be expressed as:
- FLOP-seconds, which is in units of [FLOPs/second] × [seconds]
- GPU-hours, which is in units of [No. of GPUs] × [hours]
- Scaling-laws papers tend to report values in PetaFLOP-days, i.e. 10¹⁵ FLOPs/second sustained for one day (8.64 × 10¹⁹ total floating-point operations)
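As a quick sketch, the C ≈ 6PD estimate and these unit conversions can be scripted in a few lines of Python. The parameter count, token count, and per-GPU throughput below are illustrative assumptions, not values from any particular model:

```python
def training_compute_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * P * D, in total FLOPs."""
    return 6 * n_params * n_tokens

# Illustrative (assumed) numbers: a 7B-parameter model trained on 1T tokens.
C = training_compute_flops(7e9, 1e12)

# 1 PetaFLOP-day = 1e15 FLOP/s sustained for 86,400 s = 8.64e19 FLOPs.
petaflop_days = C / 8.64e19

# GPU-hours depend on the *actual* sustained throughput per GPU
# (assumed here: 150 TFLOP/s, purely for illustration).
gpu_hours = C / 150e12 / 3_600

print(f"C = {C:.2e} FLOPs ≈ {petaflop_days:.0f} PetaFLOP-days ≈ {gpu_hours:,.0f} GPU-hours")
```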
Besides compute, the other key cost of training is GPU memory, which can be broken down as:

Total Training Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Let's dive into each component in detail.
Model Memory
- FP32 (32-bit floating point): standard precision → 4 bytes of memory
- FP16 (16-bit floating point): half the precision of FP32 → 2 bytes of memory
- Mixed-Precision: Mixed-precision training combines FP16 and FP32 to speed up computation and reduce memory usage while maintaining accuracy.
For FP32: Model memory = (4 bytes/param) × (No. of parameters)
For FP16: Model memory = (2 bytes/param) × (No. of parameters)
For mixed precision (FP16/BF16 + FP32): Model memory = (2 bytes/param) × (No. of parameters) for the half-precision weights used in the forward and backward passes, plus an additional FP32 copy of the parameters at (4 bytes/param) × (No. of parameters).
- In mixed-precision training, this extra FP32 master copy of the model is conventionally counted as part of the optimizer memory, covered below.
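A minimal sketch of these rules of thumb in Python (the function name and precision labels are my own choices):

```python
def model_memory_bytes(n_params: float, precision: str = "fp32") -> float:
    """Memory needed to hold the model weights, per the rules of thumb above."""
    if precision == "mixed":
        # Half-precision working weights only; the FP32 master copy
        # is counted under optimizer memory (next section).
        return 2 * n_params
    return {"fp32": 4, "fp16": 2, "bf16": 2}[precision] * n_params

print(model_memory_bytes(7e9, "fp32") / 1e9)  # 7B params in FP32 -> 28.0 GB
```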
Want to find out correct and accurate answers? Look for our LLM Interview Course
- 100+ Questions spanning 14 categories
- Curated 100+ assessments for each category
- Well-researched real-world interview questions based on FAANG & Fortune 500 companies
- Focus on Visual learning
- Real Case Studies & Certification
50% off Coupon Code — LLM50
Link for the course —
Optimizer Memory
Adam works like magic, but it is highly memory-inefficient.

Vanilla AdamW

memory_optimizer = (12 bytes/param) × (No. of parameters)
- FP32 copy of parameters: 4 bytes per parameter
- Momentum: 4 bytes per parameter
- Variance: 4 bytes per parameter
8-bit Optimizers (e.g., bitsandbytes)
memory_optimizer = (6 bytes/param) × (No. of parameters)
- FP32 copy of parameters: 4 bytes per parameter
- Momentum: 1 byte per parameter
- Variance: 1 byte per parameter
SGD-like Optimizers with Momentum
memory_optimizer = (8 bytes/param) × (No. of parameters)
- FP32 copy of parameters: 4 bytes per parameter
- Momentum: 4 bytes per parameter
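These three cases can be captured in a small lookup table. This is a sketch with names of my own choosing, not a reference to any particular library's API:

```python
# Bytes of optimizer state per parameter, per the breakdowns above.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 12,        # FP32 copy (4) + momentum (4) + variance (4)
    "adamw_8bit": 6,    # FP32 copy (4) + momentum (1) + variance (1)
    "sgd_momentum": 8,  # FP32 copy (4) + momentum (4)
}

def optimizer_memory_bytes(n_params: float, optimizer: str = "adamw") -> float:
    """Estimate optimizer-state memory in bytes."""
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * n_params

print(optimizer_memory_bytes(7e9, "adamw") / 1e9)       # -> 84.0 GB
print(optimizer_memory_bytes(7e9, "adamw_8bit") / 1e9)  # -> 42.0 GB
```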
Activation Memory
- Modern GPUs are typically bottlenecked by memory, not FLOPs, for LLM training.
- Activation recomputation (also called activation checkpointing) is an extremely popular technique that works by recomputing the activations of certain layers during the backward pass instead of storing them in GPU memory.
- [Figure: results of Megatron's selective activation recomputation. The dashed red line marks the memory capacity of an A100-80GB GPU, and "present work" shows the memory requirements after applying selective activation recomputation.]
Memory without Recomputation

Without any optimizations, storing activations can consume a large amount of memory, particularly for deep models with many layers:

memory_activations (No Recomputation) = s × b × h × L × (10 + 24/t + 5 × a × s / (h × t)) bytes

where:
- s is the sequence length, in tokens
- b is the batch size per GPU
- h is the dimension of the hidden size within each transformer layer
- L is the number of layers in the transformer model
- t is the degree of tensor parallelism being used (1 if not)
- a is the number of attention heads in the transformer model
Memory with Recomputation

In the rare case where we recompute every activation (full recomputation), only the input to each transformer layer needs to be kept, giving:

memory_activations (Full Recomputation) = 2 × s × b × h × L bytes
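A small helper that implements the two activation-memory formulas above. This is a sketch: the function name is mine, and the example values are chosen to match the case study later in the post:

```python
def activation_memory_bytes(s, b, h, L, a, t=1, recompute="none"):
    """Activation memory per the formulas above.

    s: sequence length (tokens), b: per-GPU batch size, h: hidden size,
    L: number of layers, a: attention heads, t: tensor-parallel degree.
    """
    if recompute == "full":
        # Full recomputation: keep only each layer's input.
        return 2 * s * b * h * L
    # No recomputation: store everything.
    return s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))

# Illustrative values matching the case study below.
print(activation_memory_bytes(s=128, b=512, h=4096, L=24, a=16, t=1) / 1e9)  # ≈ 235.1 GB
```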
Gradient Memory
Gradient Memory in FP32: When gradients are stored in FP32 (32-bit floating point), each parameter's gradient requires 4 bytes of memory.

memory_gradients = (4 bytes/param) × (No. of parameters)

Gradient Memory in FP16: When gradients are stored in FP16 (16-bit floating point), which is common in mixed-precision training, each parameter's gradient requires 2 bytes of memory.

memory_gradients = (2 bytes/param) × (No. of parameters)
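The same rule of thumb as a tiny helper (again a sketch with an assumed function name):

```python
def gradient_memory_bytes(n_params: float, dtype: str = "fp32") -> float:
    """Gradient memory: 4 bytes/param in FP32, 2 bytes/param in FP16/BF16."""
    return {"fp32": 4, "fp16": 2, "bf16": 2}[dtype] * n_params

print(gradient_memory_bytes(7e9, "fp32") / 1e9)  # -> 28.0 GB
```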
Example Case Study
Let's calculate the memory requirements for training a 7-billion-parameter (7B) model in FP32 precision.
Memory for Model Parameters
Number of Parameters (P) = 7 billion
Memory per Parameter in FP32: 4 bytes
Model Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 × 10⁹ bytes = 28 GB
Memory for Optimizer States
Using the AdamW optimizer, which requires:
- 12 bytes per parameter in FP32
Optimizer Memory = (12 bytes/param) × (7 × 10⁹ params) = 84 × 10⁹ bytes = 84 GB
Memory for Activations
Assuming:
- s = Sequence length, in tokens (128)
- b = Batch size per GPU (512)
- h = Hidden size dimension (4096)
- L = Number of layers (24)
- a = Number of attention heads (16)
- t = Degree of tensor parallelism (1)
memory_activations (No Recomputation) = 128 × 512 × 4096 × 24 × (10 + 24/1 + 5 × (16 × 128)/(4096 × 1)) bytes ≈ 235.15 × 10⁹ bytes ≈ 235.15 GB
Memory for Gradients
- Memory per Gradient in FP32: 4 bytes
Gradient Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 × 10⁹ bytes = 28 GB
AgenticRAG with LlamaIndex Course
Look into our AgenticRAG with LlamaIndex Course with 5 real-time case studies.
Total Memory Calculation
Total Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Total Memory (No Recomputation) = 28 GB + 84 GB + 235.15 GB + 28 GB = 375.15 GB

You would need roughly 375 GB of GPU memory to train a 7B-parameter transformer model in FP32 with AdamW and no recomputation of activations.
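Putting the four components together, here is a self-contained sketch that reproduces the case-study numbers (decimal GB; FP32 weights and gradients, AdamW optimizer states, no activation recomputation). The function name is an assumption of mine:

```python
def training_memory_gb(n_params, s, b, h, L, a, t=1):
    """Total training memory (decimal GB) assuming FP32 weights and gradients,
    AdamW optimizer states, and no activation recomputation."""
    model       = 4 * n_params                      # FP32 weights: 4 bytes/param
    optimizer   = 12 * n_params                     # AdamW: FP32 copy + momentum + variance
    gradients   = 4 * n_params                      # FP32 gradients: 4 bytes/param
    activations = s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))
    components = {"model": model, "optimizer": optimizer,
                  "gradients": gradients, "activations": activations}
    components["total"] = sum(components.values())
    return {name: n_bytes / 1e9 for name, n_bytes in components.items()}

print(training_memory_gb(n_params=7e9, s=128, b=512, h=4096, L=24, a=16))
# -> model 28.0, optimizer 84.0, gradients 28.0, activations ~235.1, total ~375.1 GB
```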
Summary:
- C ≈ 6PD: the amount of compute required to train a transformer-based model, as per OpenAI's and DeepMind's scaling-laws papers
- Total Memory Required = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory
- For Model Memory:
  - FP32 gives higher precision but uses more memory.
  - FP16 uses less memory but may sacrifice some precision.
  - Mixed precision strikes a balance by combining FP16 for efficiency and FP32 for critical parts, saving memory while maintaining performance.
- For Optimizer Memory:
  - AdamW is the most memory-demanding, requiring 12 bytes per parameter.
  - 8-bit optimizers are more efficient, using only 6 bytes per parameter.
  - SGD with momentum strikes a middle ground, needing 8 bytes per parameter.
- For Activation Memory:
  - Activations consume a significant amount of memory during training, especially for large models with many layers.
  - Batch size directly impacts memory usage, with larger batch sizes requiring more memory.
  - Activation recomputation is an effective technique to reduce memory usage by recomputing activations during the backward pass rather than storing them all, trading memory for additional computation.
- For Gradient Memory:
  - FP32 gradients require 4 bytes per parameter.
  - FP16 gradients require 2 bytes per parameter.
All previous coffee break concepts
Look for all of our volumes of coffee break concepts:
Follow us here; your feedback in the form of comments and claps encourages us to create better content for the community.