# How Do OpenAI and DeepMind Calculate the Cost of Training Transformer-Based Models?

The basic equation giving the compute cost to train a transformer model is:

C = τ × T

where:

- **C** is the compute required to train the transformer model, in total floating-point operations
- **τ** is the **aggregate throughput** of your hardware setup, in FLOP/s: τ = (number of GPUs) × (actual FLOP/s per GPU)
- **T** is the **time spent training the model**, in **seconds**

These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.

This formula can be further simplified by estimating C directly from the model and dataset size:

C ≈ 6 × P × D

where:

- **P** is the number of parameters in the transformer model
- **D** is the dataset size, in tokens

It is worth noting the possible **units for C**. C can be reported as:

- **FLOP-seconds**, in units of [FLOPs per second] × [seconds]
- **GPU-hours**, in units of [number of GPUs] × [hours]
- Scaling-laws papers tend to report values in **PetaFLOP-days**
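
To make the arithmetic concrete, here is a minimal Python sketch of the two formulas above. The model size, token count, GPU count, and per-GPU throughput are made-up example values, not figures from any particular training run; the conversion uses 1 PetaFLOP-day = 10¹⁵ FLOP/s × 86,400 s.

```python
# Sketch of the compute-cost formulas above (all inputs are illustrative).
def training_compute_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6 * P * D, in total floating-point operations."""
    return 6 * n_params * n_tokens

def training_time_seconds(compute_flops: float, n_gpus: int, flops_per_gpu: float) -> float:
    """T = C / tau, where tau = (number of GPUs) * (actual FLOP/s per GPU)."""
    tau = n_gpus * flops_per_gpu
    return compute_flops / tau

# Hypothetical example: a 7B-parameter model trained on 1T tokens using
# 256 GPUs that each sustain ~150 TFLOP/s of actual throughput.
C = training_compute_flops(n_params=7e9, n_tokens=1e12)
T = training_time_seconds(C, n_gpus=256, flops_per_gpu=150e12)

gpu_hours = 256 * T / 3600            # [number of GPUs] x [hours]
petaflop_days = C / (1e15 * 86_400)   # 1 PetaFLOP-day = 1e15 FLOP/s * 86,400 s

print(f"C = {C:.2e} FLOPs")
print(f"T = {T / 86_400:.1f} days of wall-clock training")
print(f"  = {gpu_hours:,.0f} GPU-hours, or {petaflop_days:,.0f} PetaFLOP-days")
```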

Compute is only half of the cost picture; the other half is the GPU memory needed during training:

# Total Training Memory

Total Training Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Let's look at each component in detail.

# Model Memory

- **FP32 (32-bit floating point):** standard precision → 4 bytes per parameter
- **FP16 (16-bit floating point):** half the precision of FP32 → 2 bytes per parameter
- **Mixed precision:** combines FP16 and FP32 to speed up computation and reduce memory usage while maintaining accuracy

**For FP32:** Model memory = (4 bytes/param) × (no. of parameters)

**For FP16:** Model memory = (2 bytes/param) × (no. of parameters)

**For mixed precision (fp16/bf16 and fp32):** Model memory = (2 bytes/param) × (no. of parameters) + (4 bytes/param) × (no. of parameters)

- In mixed-precision training, also account for the extra memory used by the optimizer, which typically holds the additional FP32 copy of the model (counted under optimizer memory below).
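
As a quick sanity check of these byte counts, here is a small sketch; the 7 × 10⁹ parameter count is just an example, and in the mixed case the fp32 master copy is often booked under optimizer memory instead.

```python
# Model-weight memory only (optimizer, gradients, activations counted separately).
def model_memory_bytes(n_params: float, precision: str = "fp32") -> float:
    bytes_per_param = {
        "fp32": 4,        # full precision
        "fp16": 2,        # half precision (bf16 is the same size)
        "mixed": 2 + 4,   # fp16/bf16 working weights + fp32 master copy
    }
    return bytes_per_param[precision] * n_params

n_params = 7e9  # example: a 7B-parameter model
for precision in ("fp32", "fp16", "mixed"):
    print(f"{precision:>5}: {model_memory_bytes(n_params, precision) / 1e9:.0f} GB")
```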


# Optimizer Memory

Adam-style optimizers work like magic, but they are highly memory-inefficient: they keep additional state for every parameter.

# Vanilla AdamW

memory_optimizer = (12 bytes/param) × (no. of parameters)

- **FP32 copy of parameters:** 4 bytes per parameter
- **Momentum:** 4 bytes per parameter
- **Variance:** 4 bytes per parameter

# 8-bit Optimizers (e.g., bitsandbytes)

memory_optimizer = (6 bytes/param) × (no. of parameters)

- **FP32 copy of parameters:** 4 bytes per parameter
- **Momentum:** 1 byte per parameter
- **Variance:** 1 byte per parameter

# SGD-like Optimizers with Momentum

memory_optimizer = (8 bytes/param) × (no. of parameters)

- **FP32 copy of parameters:** 4 bytes per parameter
- **Momentum:** 4 bytes per parameter
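
The three cases above boil down to a per-parameter byte count. A minimal sketch, again using an example 7B parameter count:

```python
# Optimizer-state memory, using the per-parameter byte counts listed above.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 12,         # fp32 copy (4) + momentum (4) + variance (4)
    "adamw_8bit": 6,     # fp32 copy (4) + 8-bit momentum (1) + 8-bit variance (1)
    "sgd_momentum": 8,   # fp32 copy (4) + momentum (4)
}

def optimizer_memory_bytes(n_params: float, optimizer: str) -> float:
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * n_params

n_params = 7e9  # example: a 7B-parameter model
for name in OPTIMIZER_BYTES_PER_PARAM:
    print(f"{name:>13}: {optimizer_memory_bytes(n_params, name) / 1e9:.0f} GB")
```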

# Activation Memory

- Modern GPUs are typically **bottlenecked by memory, not FLOPs**, for LLM training.
- **Activation recomputation (checkpointing) is an extremely popular method** that recomputes the activations of certain layers during the backward pass instead of storing them in GPU memory.
- Megatron's selective-recomputation results illustrate the savings (figure not reproduced here):
  - The dashed red line marks the memory capacity of an A100 80 GB GPU.
  - "Present work" indicates the memory requirements after applying selective activation recomputation.

# Memory without Recomputations

Without any optimizations, storing activations can consume a large amount of memory, particularly for deep models with many layers. For a standard transformer, the activation memory is approximately:

memory_activations = s × b × h × L × (10 + 24/t + 5 × a × s / (h × t)) bytes

where:

- **s** is the sequence length, in tokens
- **b** is the batch size per GPU
- **h** is the dimension of the hidden size within each transformer layer
- **L** is the number of layers in the transformer model
- **t** is the degree of tensor parallelism being used (1 if not used)
- **a** is the number of attention heads in the transformer model
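
Here is the same formula as a small Python function, evaluated with the values used in the case study later in the post:

```python
def activation_memory_no_recompute(s, b, h, L, a, t=1):
    """Bytes of stored activations: s*b*h*L * (10 + 24/t + 5*a*s/(h*t))."""
    return s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))

# Values from the case study below: s=128, b=512, h=4096, L=24, a=16, t=1
print(f"{activation_memory_no_recompute(128, 512, 4096, 24, 16) / 1e9:.2f} GB")
```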

# Memory with Recomputations

In the rare case where we recompute every activation (full recomputation), only the input to each transformer layer needs to be stored, so approximately:

memory_activations ≈ 2 × s × b × h × L bytes

(assuming the stored layer inputs are kept in 2-byte fp16/bf16).
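
And the corresponding sketch for full recomputation, under the fp16/bf16 assumption noted above:

```python
def activation_memory_full_recompute(s, b, h, L):
    """Only each layer's input is kept, assumed stored in 2-byte fp16/bf16."""
    return 2 * s * b * h * L

print(f"{activation_memory_full_recompute(128, 512, 4096, 24) / 1e9:.2f} GB")  # ~12.88 GB
```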

# Gradient Memory

**Gradient Memory in FP32:** When gradients are stored in FP32 (32-bit floating point), each parameter's gradient requires **4 bytes** of memory.

memory_gradients = (4 bytes/param) × (no. of parameters)

**Gradient Memory in FP16:** When gradients are stored in FP16 (16-bit floating point), which is common in mixed-precision training, each parameter's gradient requires **2 bytes** of memory.

memory_gradients = (2 bytes/param) × (no. of parameters)
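
The gradient term is a one-liner; a quick sketch with an example 7B parameter count:

```python
def gradient_memory_bytes(n_params: float, bytes_per_grad: int = 4) -> float:
    """4 bytes/param for FP32 gradients, 2 bytes/param for FP16 gradients."""
    return bytes_per_grad * n_params

print(f"{gradient_memory_bytes(7e9, 4) / 1e9:.0f} GB")  # FP32 gradients, 7B model -> 28 GB
```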

# Example Case Study

Let's calculate the memory requirements for training a 7-billion-parameter (7B) model using FP32 precision.

# Memory for Model Parameters

**Number of Parameters (P)** = 7 billion

**Memory per Parameter in FP32:** 4 bytes

**Model Memory** = 4 bytes/param × 7 × 10⁹ params = 28 × 10⁹ bytes = **28 GB**

# Memory for Optimizer States

We use the AdamW optimizer, which requires:

**12 bytes per parameter** in FP32

**Optimizer Memory** = 12 bytes/param × 7 × 10⁹ params = 84 × 10⁹ bytes = **84 GB**

# Memory for Activations

Assuming:

- **s** = sequence length, in tokens (**128**)
- **b** = batch size per GPU (**512**)
- **h** = hidden size dimension (**4096**)
- **L** = number of layers (**24**)
- **a** = number of attention heads (**16**)
- **t** = degree of tensor parallelism (**1**)

**memory_activations (No Recomputation)** = 128 × 512 × 4096 × 24 × (10 + 24/1 + 5 × (16 × 128)/(4096 × 1)) bytes ≈ **235.15 GB**

# Memory for Gradients

**Memory per Gradient in FP32:** 4 bytes

**Gradient Memory** = 4 bytes/param × 7 × 10⁹ params = **28 GB**


# Total Memory Calculation

Total Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

**Total Memory (No Recomputation)** = 28 GB + 84 GB + 235.15 GB + 28 GB = **375.15 GB**

You would require **~375 GB** of GPU memory to train a **7B**-parameter transformer model in **FP32** with **no recomputation of activations**.
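
Finally, the whole case study fits in a short script. This is just the arithmetic above re-run end to end, assuming the same FP32 weights, FP32 gradients, AdamW optimizer states, and no activation recomputation:

```python
# 7B-parameter example: FP32 weights, FP32 gradients, AdamW, no recomputation.
P = 7e9
s, b, h, L, a, t = 128, 512, 4096, 24, 16, 1

model       = 4 * P                                            # 4 bytes/param
optimizer   = 12 * P                                           # AdamW: 12 bytes/param
activations = s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))
gradients   = 4 * P                                            # 4 bytes/param

total = model + optimizer + activations + gradients
for name, value in [("model", model), ("optimizer", optimizer),
                    ("activations", activations), ("gradients", gradients),
                    ("total", total)]:
    print(f"{name:>11}: {value / 1e9:7.2f} GB")
```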

# Summary:

- **C = 6PD** is the amount of compute required to train a transformer-based model, as per OpenAI's and DeepMind's scaling-laws papers.
- **Total Memory Required** = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory
- For model memory:
  - **FP32** gives higher precision but uses more memory.
  - **FP16** uses less memory but may sacrifice some precision.
  - **Mixed precision** strikes a balance by combining FP16 for efficiency and FP32 for critical parts, saving memory while maintaining performance.
- For optimizer memory:
  - **AdamW** is the most memory-demanding, requiring **12 bytes per parameter**.
  - **8-bit optimizers** are more efficient, using only **6 bytes per parameter**.
  - **SGD with momentum** strikes a middle ground, needing **8 bytes per parameter**.
- For activation memory:
  - **Activations** consume a significant amount of memory during training, especially for large models with many layers.
  - **Batch size** directly impacts memory usage, with larger batch sizes requiring more memory.
  - **Activation recomputation** is an effective technique to reduce memory usage by recomputing activations during the backward pass rather than storing them all, trading memory for additional computation.
- For gradient memory:
  - **FP32 gradients** require **4 bytes per parameter**.
  - **FP16 gradients** require **2 bytes per parameter**.
