How do OpenAI or DeepMind calculate the cost of training a transformer-based model?

--

How do you calculate the cost of training a transformer-based model?

The basic equation giving the cost to train a transformer model is:

C ≈ τT

where,

C is the compute required to train the transformer model, in total floating-point operations (FLOPs)

τ is the aggregate throughput of your hardware setup, in FLOPs per second:

τ = (Number of GPUs) × (Actual FLOPs per GPU)

T is the time spent training the model, in seconds

These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
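As a quick sketch of how these definitions fit together, the snippet below estimates training time from a given compute budget and hardware setup. The GPU count, per-GPU throughput, and compute budget are illustrative assumptions, not values from either paper.

```python
# Rough sketch: estimate training time T from total compute C and aggregate throughput tau.
# All numbers below are hypothetical, for illustration only.

C = 3.15e23             # total training compute, in FLOPs (example value)
num_gpus = 1024         # hypothetical cluster size
flops_per_gpu = 150e12  # achieved (not peak) throughput per GPU, ~150 TFLOPs/s

tau = num_gpus * flops_per_gpu  # aggregate throughput, FLOPs per second
T = C / tau                     # training time in seconds, from C ≈ tau * T

print(f"Estimated training time: {T / 86400:.1f} days")
```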

This formula can be further simplified to:

C ≈ 6PD

where,

P is the number of parameters in the transformer model

D is the dataset size, in tokens

It's worth noting the units of C.

C can be represented as:

  • FLOP-seconds, which is in units of [FLOPs/second] × [seconds]
  • GPU-hours, which is in units of [Number of GPUs] × [hours]
  • Scaling laws papers tend to report values in PetaFLOP-days (1 PetaFLOP-day = 1e15 FLOPs/second sustained for one day, about 8.64e19 FLOPs).
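Here is a small sketch that plugs hypothetical values of P and D into C ≈ 6PD and converts the result into the units above. The model size, dataset size, and per-GPU throughput are assumptions for the example, not values from the papers.

```python
# Sketch: compute C = 6 * P * D and express it in common units.
P = 7e9    # hypothetical parameter count (7B)
D = 2e12   # hypothetical dataset size, in tokens (2T)

C = 6 * P * D                       # total compute, in FLOPs
petaflop_days = C / (1e15 * 86400)  # 1 PetaFLOP-day = 1e15 FLOPs/s for 86,400 s
# GPU-hours depend on the achieved throughput per GPU; assume ~150 TFLOPs/s here.
gpu_hours = C / (150e12 * 3600)

print(f"C = {C:.3e} FLOPs ≈ {petaflop_days:.0f} PetaFLOP-days ≈ {gpu_hours:.0f} GPU-hours")
```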

Beyond compute, we also need to estimate the GPU memory required for training, which breaks down as:

Total Training Memory = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory

Let's dive into each component in detail.

Model Memory

  • FP32 (32-bit floating point): standard precision → 4 bytes of memory
  • FP16 (16-bit floating point): half the precision of FP32 → 2 bytes of memory
  • Mixed-Precision: Mixed-precision training combines FP16 and FP32 to speed up computation and reduce memory usage while maintaining accuracy.

For FP32: Model memory = (4 bytes/param) × (No. of parameters)

For FP16: Model memory = (2 bytes/param) × (No. of parameters)

For mixed precision (fp16/bf16 + fp32): Model memory = (2 bytes/param) × (No. of parameters) for the working copy of the weights; the additional FP32 master copy ((4 bytes/param) × (No. of parameters)) is usually counted as part of the optimizer memory.

  • In mixed-precision training, you therefore also need to account for the extra memory used by the optimizer, which keeps an additional FP32 copy of the model.
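A minimal sketch of the model-memory arithmetic above; the parameter count is an arbitrary example value.

```python
# Sketch: model (weight) memory for different precisions.
def model_memory_bytes(num_params: int, precision: str = "fp32") -> int:
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2}[precision]
    return bytes_per_param * num_params

P = 7_000_000_000  # example: 7B parameters
print(model_memory_bytes(P, "fp32") / 1e9, "GB in FP32")  # 28.0 GB
print(model_memory_bytes(P, "fp16") / 1e9, "GB in FP16")  # 14.0 GB
# Mixed precision keeps fp16/bf16 working weights; the extra fp32 master copy
# is usually accounted for under optimizer memory (see note above).
```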


Optimizer Memory

Adam-style optimizers work very well in practice, but they are highly memory inefficient.

Vanilla AdamW

memory optimizer = (12 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 4 bytes per parameter
  • Variance: 4 bytes per parameter

8-bit Optimizers (e.g., bitsandbytes)

memory optimizer = (6 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 1 byte per parameter
  • Variance: 1 byte per parameter

SGD-like Optimizers with Momentum

memory optimizer = (8 bytes per parameter) × (Number of Parameters)

  • FP32 copy of parameters: 4 bytes per parameter
  • Momentum: 4 bytes per parameter
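The per-optimizer byte counts above fit naturally into a small lookup table, assuming the conventions listed (12, 6, and 8 bytes per parameter); the dictionary keys below are illustrative names, not library identifiers.

```python
# Sketch: optimizer-state memory for the optimizers discussed above.
OPTIMIZER_BYTES_PER_PARAM = {
    "adamw": 12,        # fp32 copy (4) + momentum (4) + variance (4)
    "adamw_8bit": 6,    # fp32 copy (4) + 8-bit momentum (1) + 8-bit variance (1)
    "sgd_momentum": 8,  # fp32 copy (4) + momentum (4)
}

def optimizer_memory_bytes(num_params: int, optimizer: str = "adamw") -> int:
    return OPTIMIZER_BYTES_PER_PARAM[optimizer] * num_params

P = 7_000_000_000  # example: 7B parameters
for name in OPTIMIZER_BYTES_PER_PARAM:
    print(f"{name}: {optimizer_memory_bytes(P, name) / 1e9:.0f} GB")
```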

Activation Memory

  • Modern GPUs are typically bottlenecked by memory, not FLOPs, for LLM training
  • Activation recomputation/checkpointing is an extremely popular method that works by recomputing the activations of certain layers during the backward pass instead of storing them in GPU memory.
  • Below is a result from the Megatron paper's selective recomputation (figure: "Megatron's selective recomputation")
  • The dashed red line marks the memory capacity of an A100 80 GB GPU
  • "present work" indicates the memory requirements after applying selective activation recomputation

Memory without Recomputations

Without any optimizations, storing activations can consume a large amount of memory, particularly for deep models with many layers. Following the Megatron-style analysis, the total activation memory is approximately:

memory activations (no recomputation) ≈ s × b × h × L × (10 + 24/t + 5 × a × s / (h × t)) bytes

where,

  • s is the sequence length, in tokens
  • b is the batch size per GPU
  • h is the dimension of the hidden size within each transformer layer
  • L is the number of layers in the transformer model
  • t is the degree of tensor parallelism being used (1 if not)
  • a is the number of attention heads in the transformer model

Memory with Recomputations

With selective activation recomputation (as used in Megatron), the attention activations that scale with a × s / h are recomputed during the backward pass rather than stored, giving approximately:

memory activations (selective recomputation) ≈ s × b × h × L × (10 + 24/t) bytes

In the rare case where we want to recompute every activation (full recomputation), only the layer inputs are kept:

memory activations (full recomputation) ≈ 2 × s × b × h × L bytes
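A sketch of the three activation-memory estimates above, using the symbols defined earlier; the function name and example numbers are illustrative.

```python
# Sketch: whole-model activation memory (bytes) under the three regimes above.
def activation_memory_bytes(s, b, h, L, a, t=1, mode="none"):
    """s: sequence length, b: batch size per GPU, h: hidden size,
    L: number of layers, a: attention heads, t: tensor-parallel degree."""
    if mode == "none":         # store every activation
        per_layer = s * b * h * (10 + 24 / t + 5 * a * s / (h * t))
    elif mode == "selective":  # Megatron-style selective recomputation
        per_layer = s * b * h * (10 + 24 / t)
    elif mode == "full":       # recompute every activation
        return 2 * s * b * h * L
    else:
        raise ValueError(mode)
    return per_layer * L

# Example using the case-study settings later in this post:
print(activation_memory_bytes(128, 512, 4096, 24, 16, t=1, mode="none") / 1e9, "GB")
```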

Gradient Memory

Gradient Memory in FP32: When gradients are stored in FP32 (32-bit floating point), each parameter’s gradient requires 4 bytes of memory.

memory gradients = (4 bytes/param) × (No. of params)

Gradient Memory in FP16 : When gradients are stored in FP16 (16-bit floating point), which is common in mixed-precision training, each parameter’s gradient requires 2 bytes of memory.

memory gradients = (2 bytes/param) × (No. of params)

Example Case Study

Let's calculate the memory requirements for training a 7-billion-parameter (7B) model using FP32 precision.

Memory for Model Parameters

Number of Parameters (P) = 7 billion

Memory per Parameter in FP32: 4 bytes

Model Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 × 10⁹ bytes = 28 GB

Memory for Optimizer States

AdamW optimizer, which requires:

  • 12 bytes per parameter in FP32

Optimizer Memory = (12 bytes/param) × (7 × 10⁹ params) = 84 × 10⁹ bytes = 84 GB

Memory for Activations

Assuming:

  • s = Sequence length, in tokens (128)
  • b = Batch size per GPU (512)
  • h = Hidden size dimension (4096)
  • L = Number of layers (24)
  • a = Number of attention heads (16)
  • t = Degree of tensor parallelism (1)

memory activations (no recomputation) = 128 × 512 × 4096 × 24 × (10 + 24/1 + 5 × (16 × 128)/(4096 × 1)) bytes = 128 × 512 × 4096 × 24 × 36.5 bytes ≈ 235.15 GB

Memory for Gradients

  • Memory per Gradient in FP32: 4 bytes

Gradient Memory = (4 bytes/param) × (7 × 10⁹ params) = 28 GB

Total Memory Calculation

Total Memory=Model Memory+ Optimizer Memory+ Activation Memory+ Gradient Memory

Total Memory (No Recomputation) = 28 GB + 84 GB + 235.15 GB + 28 GB = 375.15 GB

You would require roughly 375 GB of memory to train a 7B-parameter transformer model with FP32 and no recomputation of activations.
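Putting the four components together, the case study above can be reproduced with a short script; it reuses the same assumptions (FP32 weights and gradients, AdamW at 12 bytes/param, no activation recomputation).

```python
# Sketch: total training-memory estimate for the 7B FP32 case study above.
P = 7_000_000_000                        # parameters
s, b, h, L, a, t = 128, 512, 4096, 24, 16, 1

model_mem      = 4 * P                   # FP32 weights: 4 bytes/param
optimizer_mem  = 12 * P                  # AdamW: 12 bytes/param
gradient_mem   = 4 * P                   # FP32 gradients: 4 bytes/param
activation_mem = s * b * h * L * (10 + 24 / t + 5 * a * s / (h * t))  # no recomputation

total = model_mem + optimizer_mem + gradient_mem + activation_mem
print(f"Total ≈ {total / 1e9:.1f} GB")   # ≈ 375.1 GB
```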

Summary:

  • C = 6PD is the approximate compute required to train a transformer-based model, as per OpenAI's and DeepMind's scaling-laws papers
  • Total Memory Required = Model Memory + Optimizer Memory + Activation Memory + Gradient Memory
  • For Model memory
    - FP32 gives higher precision but uses more memory.
    - FP16 uses less memory but may sacrifice some precision.
    - Mixed-Precision strikes a balance by combining FP16 for efficiency and FP32 for critical parts, saving memory while maintaining performance.
  • For Optimizer Memory
    - AdamW is the most memory-demanding, requiring 12 bytes per parameter.
    - 8-bit Optimizers are more efficient, using only 6 bytes per parameter.
    - SGD with Momentum strikes a middle ground, needing 8 bytes per parameter.
  • For Activation Memory
    - Activations consume a significant amount of memory during training, especially for large models with many layers.
    - Batch Size directly impacts memory usage, with larger batch sizes requiring more memory.
    - Activation Recomputation is an effective technique to reduce memory usage by recomputing activations during the backward pass rather than storing them all, trading off memory for additional computation.
  • For Gradient Memory
    - FP32 Gradients: Require 4 bytes per parameter.
    - FP16 Gradients: Require 2 bytes per parameter.

Follow us here; your feedback as comments and claps encourages us to create better content for the community.

--

Mastering LLM (Large Language Model)

MasteringLLM is an AI-first EdTech company making learning LLMs simple with its visual content. Look out for our LLM Interview Prep & AgenticRAG courses.