How do OpenAI or DeepMind calculate the cost of training a transformer-based model?

--

How do you calculate the cost of training a transformer-based model?

The basic equation giving the cost to train a transformer model is:

C ≈ τT

where,

C is the compute required to train the transformer model, in total floating-point operations (FLOPs)

τ is the aggregate throughput of your hardware setup, in FLOPs per second:

τ = (Number of GPUs) × (Actual FLOPs per GPU)

T is the time spent training the model, in seconds

These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
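As a quick sketch of how these definitions fit together, the snippet below estimates training time from a given compute budget and hardware setup. The GPU count, per-GPU throughput, and compute budget are illustrative assumptions, not values from either paper.

```python
# Rough sketch: estimate training time T from total compute C and aggregate throughput tau.
# All numbers below are hypothetical, for illustration only.

C = 3.15e23             # total training compute, in FLOPs (example value)
num_gpus = 1024         # hypothetical cluster size
flops_per_gpu = 150e12  # achieved (not peak) throughput per GPU, ~150 TFLOPs/s

tau = num_gpus * flops_per_gpu  # aggregate throughput, FLOPs per second
T = C / tau                     # training time in seconds, from C ≈ tau * T

print(f"Estimated training time: {T / 86400:.1f} days")
```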

This formula can be further simplified to:

C ≈ 6PD

where,

P is the number of parameters in the transformer model

D is the dataset size, in tokens

It's worth noting the units of C.

C can be represented as:

  • FLOP-seconds, which is in units of [FLOPs/second] × [seconds]
  • GPU-hours, which is in units of [Number of GPUs] × [hours]
  • Scaling laws papers tend to report values in PetaFLOP-days (1 PetaFLOP-day = 1e15 FLOPs/second sustained for one day, about 8.64e19 FLOPs).
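Here is a small sketch that plugs hypothetical values of P and D into C ≈ 6PD and converts the result into the units above. The model size, dataset size, and per-GPU throughput are assumptions for the example, not values from the papers.

```python
# Sketch: compute C = 6 * P * D and express it in common units.
P = 7e9    # hypothetical parameter count (7B)
D = 2e12   # hypothetical dataset size, in tokens (2T)

C = 6 * P * D                       # total compute, in FLOPs
petaflop_days = C / (1e15 * 86400)  # 1 PetaFLOP-day = 1e15 FLOPs/s for 86,400 s
# GPU-hours depend on the achieved throughput per GPU; assume ~150 TFLOPs/s here.
gpu_hours = C / (150e12 * 3600)

print(f"C = {C:.3e} FLOPs ≈ {petaflop_days:.0f} PetaFLOP-days ≈ {gpu_hours:.0f} GPU-hours")
```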

Beyond compute, we also need to estimate the GPU memory required for training, which breaks down as:

Total Training Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Let's dive into each component in detail.

Model Memory

  • FP32 (32-bit floating point): standard precision → 4 bytes of memory
  • FP16 (16-bit floating point): half the precision of FP32 → 2 bytes of memory
  • Mixed-Precision: Mixed-precision training combines FP16 and FP32 to speed up computation and reduce memory usage while maintaining accuracy.

For FP32: Model memory = (4 bytes/param) × (No. of parameters)

For FP16: Model memory = (2 bytes/param) × (No. of parameters)

For mixed precision (fp16/bf16 + fp32): Model memory = (2 bytes/param) × (No. of parameters) for the working copy of the weights; the additional FP32 master copy ((4 bytes/param) × (No. of parameters)) is usually counted as part of the optimizer memory.

  • In mixed-precision training, you therefore also need to account for the extra memory used by the optimizer, which keeps an additional FP32 copy of the model.
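A minimal sketch of the model-memory arithmetic above; the parameter count is an arbitrary example value.

```python
# Sketch: model (weight) memory for different precisions.
def model_memory_bytes(num_params: int, precision: str = "fp32") -> int:
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2}[precision]
    return bytes_per_param * num_params

P = 7_000_000_000  # example: 7B parameters
print(model_memory_bytes(P, "fp32") / 1e9, "GB in FP32")  # 28.0 GB
print(model_memory_bytes(P, "fp16") / 1e9, "GB in FP16")  # 14.0 GB
# Mixed precision keeps fp16/bf16 working weights; the extra fp32 master copy
# is usually accounted for under optimizer memory (see note above).
```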


Optimizer Memory

Adam-style optimizers work very well in practice, but they are highly memory inefficient.

Vanilla AdamW

memory optimizer = (12 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 4 bytes per parameter
  • Variance: 4 bytes per parameter

8-bit Optimizers (e.g., bitsandbytes)

memory optimizer = (6 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 1 byte per parameter
  • Variance: 1 byte per parameter

SGD-like Optimizers with Momentum

memory optimizer = (8 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 4 bytes per parameter
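The per-optimizer byte counts above fit naturally into a small lookup table, assuming the conventions listed (12, 6, and 8 bytes per parameter); the dictionary keys below are illustrative names, not library identifiers.

```python
# Sketch: optimizer-state memory for the optimizers discussed above.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 12,        # fp32 copy (4) + momentum (4) + variance (4)
    "adamw_8bit": 6,    # fp32 copy (4) + 8-bit momentum (1) + 8-bit variance (1)
    "sgd_momentum": 8,  # fp32 copy (4) + momentum (4)
}

def optimizer_memory_bytes(num_params: int, optimizer: str = "adamw") -> int:
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * num_params

P = 7_000_000_000  # example: 7B parameters
for name in OPTIMIZER_BYTES_PER_PARAM:
    print(f"{name}: {optimizer_memory_bytes(P, name) / 1e9:.0f} GB")
```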

Activation Memory

  • Modern GPUs are typically bottlenecked by memory, not FLOPs, for LLM training
  • Activation recomputation/checkpointing is an extremely popular method that works by recomputing the activations of certain layers during the backward pass instead of storing them in GPU memory.
  • Below is a result from the Megatron paper's selective recomputation (figure: "Megatron's selective recomputation")
  • The dashed red line marks the memory capacity of an A100 80 GB GPU
  • "present work" indicates the memory requirements after applying selective activation recomputation

Memory without Recomputations

Without any optimizations, storing activations can consume a large amount of memory, particularly for deep models with many layers. Following the Megatron-style analysis, the total activation memory is approximately:

memory activations (no recomputation) ≈ s × b × h × L × (10 + 24/t + 5 × a × s / (h × t)) bytes

where,

  • s is the sequence length, in tokens
  • b is the batch size per GPU
  • h is the dimension of the hidden size within each transformer layer
  • L is the number of layers in the transformer model
  • t is the degree of tensor parallelism being used (1 if not)
  • a is the number of attention heads in the transformer model

Memory with Recomputations

With selective activation recomputation (as used in Megatron), the attention activations that scale with a × s / h are recomputed during the backward pass rather than stored, giving approximately:

memory activations (selective recomputation) ≈ s × b × h × L × (10 + 24/t) bytes

In the rare case where we want to recompute every activation (full recomputation), only the layer inputs are kept:

memory activations (full recomputation) ≈ 2 × s × b × h × L bytes
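A sketch of the three activation-memory estimates above, using the symbols defined earlier; the function name and example numbers are illustrative.

```python
# Sketch: whole-model activation memory (bytes) under the three regimes above.
def activation_memory_bytes(s, b, h, L, a, t=1, mode="none"):
    """s: sequence length, b: batch size per GPU, h: hidden size,
    L: number of layers, a: attention heads, t: tensor-parallel degree."""
    if mode == "none":         # store every activation
        per_layer = s * b * h * (10 + 24 / t + 5 * a * s / (h * t))
    elif mode == "selective":  # Megatron-style selective recomputation
        per_layer = s * b * h * (10 + 24 / t)
    elif mode == "full":       # recompute every activation
        return 2 * s * b * h * L
    else:
        raise ValueError(mode)
    return per_layer * L

# Example using the case-study settings later in this post:
print(activation_memory_bytes(128, 512, 4096, 24, 16, t=1, mode="none") / 1e9, "GB")
```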

Gradient Memory

Gradient Memory in FP32: When gradients are stored in FP32 (32-bit floating point), each parameter’s gradient requires 4 bytes of memory.

memory gradients = (4 bytes/param) × (No. of params)

Gradient Memory in FP16 : When gradients are stored in FP16 (16-bit floating point), which is common in mixed-precision training, each parameter’s gradient requires 2 bytes of memory.

memory gradients = (2 bytes/param) × (No. of params)

Example Case Study

Let's calculate the memory requirements for training a 7-billion-parameter (7B) model using FP32 precision.

Memory for Model Parameters

Number of Parameters (P) = 7 billion

Memory per Parameter in FP32: 4 bytes

Model Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 × 10⁹ bytes = 28 GB

Memory for Optimizer States

AdamW optimizer, which requires:

  • 12 bytes per parameter in FP32

Optimizer Memory = (12 bytes/param) × (7 × 10⁹ params) = 84 × 10⁹ bytes = 84 GB

Memory for Activations

Assuming:

  • s = Sequence length, in tokens (128)
  • b = Batch size per GPU (512)
  • h = Hidden size dimension (4096)
  • L = Number of layers (24)
  • a = Number of attention heads (16)
  • t = Degree of tensor parallelism (1)

memory activations (no recomputation) = 128 × 512 × 4096 × 24 × (10 + 24/1 + 5 × (16 × 128)/(4096 × 1)) bytes = 128 × 512 × 4096 × 24 × 36.5 bytes ≈ 235.15 GB

Memory for Gradients

  • Memory per Gradient in FP32: 4 bytes

Gradient Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 GB

Total Memory Calculation

Total Memory=Model Memory+ Optimizer Memory+ Activation Memory+ Gradient Memory

Total Memory (No Recomputation) = 28 GB + 84 GB + 235.15 GB + 28 GB = 375.15 GB

You would require roughly 375 GB of memory to train a 7B-parameter transformer model with FP32 and no recomputation of activations.
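Putting the four components together, the case study above can be reproduced with a short script; it reuses the same assumptions (FP32 weights and gradients, AdamW at 12 bytes/param, no activation recomputation).

```python
# Sketch: total training-memory estimate for the 7B FP32 case study above.
P = 7_000_000_000                        # parameters
s, b, h, L, a, t = 128, 512, 4096, 24, 16, 1

model_mem      = 4 * P                   # FP32 weights: 4 bytes/param
optimizer_mem  = 12 * P                  # AdamW: 12 bytes/param
gradient_mem   = 4 * P                   # FP32 gradients: 4 bytes/param
activation_mem = s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))  # no recomputation

total = model_mem + optimizer_mem + gradient_mem + activation_mem
print(f"Total ≈ {total / 1e9:.1f} GB")   # ≈ 375.1 GB
```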

Summary:

  • C = 6PD is the approximate compute required to train a transformer-based model, as per OpenAI's and DeepMind's scaling-laws papers
  • Total Memory Required = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory
  • For Model memory
    - FP32 gives higher precision but uses more memory.
    - FP16 uses less memory but may sacrifice some precision.
    - Mixed-Precision strikes a balance by combining FP16 for efficiency and FP32 for critical parts, saving memory while maintaining performance.
  • For Optimizer Memory
    - AdamW is the most memory-demanding, requiring 12 bytes per parameter.
    - 8-bit Optimizers are more efficient, using only 6 bytes per parameter.
    - SGD with Momentum strikes a middle ground, needing 8 bytes per parameter.
  • For Activation Memory
    - Activations consume a significant amount of memory during training, especially for large models with many layers.
    - Batch Size directly impacts memory usage, with larger batch sizes requiring more memory.
    - Activation Recomputation is an effective technique to reduce memory usage by recomputing activations during the backward pass rather than storing them all, trading off memory for additional computation.
  • For Gradient Memory
    - FP32 Gradients: Require 4 bytes per parameter.
    - FP16 Gradients: Require 2 bytes per parameter.

Follow us here; your feedback as comments and claps encourages us to create better content for the community.

--

Mastering LLM (Large Language Model)

MasteringLLM is an AI-first EdTech company making learning LLMs simple with its visual content. Look out for our LLM Interview Prep & AgenticRAG courses.