Title: Let's build GPT: from scratch, in code, spelled out.
Category: Tutorial
URL:
Authors: Andrej Karpathy
Published: 7 August 2018
Review: Dennis Kuriakose
Review Date: 1 August 2024
Summary
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.
Review & Notes:
Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions
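A few minimal code sketches of the key ideas from the chapters above follow. The bigram baseline (00:22:11–00:34:53) reduces to a single embedding table whose rows are next-token logits. This sketch follows the spirit of the video's BigramLanguageModel but is reconstructed from memory, not quoted; the 65-character vocabulary is the tiny-Shakespeare character set used in the video.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token maps directly to a row of logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)              # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)       # distribution over the next token
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

model = BigramLanguageModel(vocab_size=65)                    # 65 characters in tiny Shakespeare
idx = torch.zeros((1, 1), dtype=torch.long)                   # start from a single token
print(model.generate(idx, max_new_tokens=20).shape)           # torch.Size([1, 21])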
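The mathematical trick at 00:47:11–00:54:42 is that causal averaging over past tokens can be written as a matrix multiply with a lower-triangular weight matrix, and equivalently as a softmax over a masked score matrix. A minimal sketch (variable names like `B`, `T`, `C`, `wei` follow the video's conventions, but the code is reconstructed here):

import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2           # batch, time (context length), channels
x = torch.randn(B, T, C)

# version 2: lower-triangular matrix with rows normalized to sum to 1,
# so each position receives the average of itself and all previous positions
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x             # (T, T) @ (B, T, C) -> (B, T, C)

# version 3: the same weights obtained via softmax over a masked score matrix;
# once the zeros become data-dependent affinities, this is self-attention
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))   # future positions cannot be attended to
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

print(torch.allclose(xbow2, xbow3))               # True: both compute the same causal average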
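The crux at 01:02:00 replaces the fixed averaging weights with data-dependent affinities computed from queries and keys, scaled by 1/sqrt(head_size) as discussed in note 6 (01:16:56). A sketch of a single causal self-attention head along those lines (a simplified reconstruction; dropout and training details are omitted):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal (decoder-style) self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask: token t may only attend to tokens <= t
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        # scaled dot-product affinities; dividing by sqrt(head_size) keeps the
        # variance of the scores near 1 so the softmax does not saturate
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)    # (B, T, head_size)
        return wei @ v       # (B, T, head_size)

# tiny usage example
x = torch.randn(4, 8, 32)                     # (B, T, n_embd)
head = Head(n_embd=32, head_size=16, block_size=8)
print(head(x).shape)                          # torch.Size([4, 8, 16])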
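The chapters from 01:19:11 to 01:32:51 assemble the pieces into a Transformer block: multi-head self-attention, a position-wise feed-forward layer with a 4x expansion, residual connections, and pre-norm LayerNorm. A sketch assuming the `Head` class from the previous snippet; the hyperparameters (`n_embd=32`, `n_head=4`) are illustrative, not the video's final settings:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]  # Head from previous sketch
        )
        self.proj = nn.Linear(n_embd, n_embd)          # project concatenated heads back to n_embd

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (feed-forward)."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual connection around attention
        x = x + self.ffwd(self.ln2(x))    # residual connection around feed-forward
        return x

x = torch.randn(4, 8, 32)
block = Block(n_embd=32, n_head=4, block_size=8)
print(block(x).shape)                     # torch.Size([4, 8, 32])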