{
  "voice": "en-US-AndrewMultilingualNeural",
  "rate": "-2%",
  "lessons": [
    {
      "id": 0,
      "slug": "welcome",
      "title": "Welcome",
      "narration": "Hello, and welcome to this course on generative AI. If you have ever used ChatGPT, or Midjourney, or GitHub Copilot, you have already met generative AI. You have asked it questions, watched it draw, watched it write code. But here is the thing... most people who use these tools have absolutely no idea how they work. And that is what we are going to fix. Together. In the next two hours, we will open the hood on the most exciting technology of our generation. We will start at the very beginning. With tokens. The little pieces of text that a model actually sees. Then we will build up. Embeddings. Attention. Transformers. Training. We will see how an image is born from pure noise. How a model learns to write code. How retrieval gives a model long-term memory. How agents let a model act, not just answer. There is no math in this course. None. Just clear analogies, beautiful visuals, and lots of curiosity. So grab a coffee. Settle in. And let me show you, step by step, how this entire field works. Let us begin.",
      "beats": [
        {
          "after": "welcome",
          "show": 1
        },
        {
          "after": "ChatGPT",
          "show": 2
        },
        {
          "after": "Midjourney",
          "show": 3
        },
        {
          "after": "Copilot",
          "show": 4
        },
        {
          "after": "tokens",
          "show": 5
        },
        {
          "after": "Embeddings",
          "show": 6
        },
        {
          "after": "Attention",
          "show": 7
        },
        {
          "after": "Transformers",
          "show": 8
        },
        {
          "after": "noise",
          "show": 9
        },
        {
          "after": "memory",
          "show": 10
        },
        {
          "after": "agents",
          "show": 11
        },
        {
          "after": "begin",
          "show": 12
        }
      ]
    },
    {
      "id": 1,
      "slug": "what-is-genai",
      "title": "What is Generative AI, really?",
      "narration": "Welcome back. Today we open the most important door in this entire course. What IS generative AI? Really. You hear the term every single day. In the news. On podcasts. In every product launch. So let me give you a definition that actually sticks. Generative AI is any model that produces NEW content. New text. New images. New audio. New video. New code. Things that, technically, did not exist before you asked for them. Take ChatGPT. You ask a question, it writes an essay made just for you. Take Midjourney. You type a prompt, it paints a picture nobody has ever painted before. Take Suno. You describe a vibe, it composes a brand new song. Take Sora. You describe a scene, it produces a brand new video. Take Copilot. You name a function, it writes the code. All five of these are wildly different products. But under the hood, they share the same beating heart. Generation. Now, contrast this with the AI you have been using for years. Your spam filter. That is also AI. But it does not WRITE you a new email. It just sorts existing ones into folders. Your phone camera that recognizes faces? Same idea. It tags. It does not paint. So here is the core mental shift you need to make. Old AI looked at the world and answered a single question. What is this? Cat or dog? Spam or not? Five-star or one-star? Generative AI looks at the world and answers a completely different question. Make me a new one. We call the older type discriminative. Because it discriminates. It tells things apart. The new type, that creates, we call generative. It builds. It invents. The leap between these two ideas is the leap between a critic and an artist. Now, there are three core ideas that power every generative model on Earth. Three pillars. Memorize them. Pillar one. Probability. A generative model never says, \"this is THE answer.\" It says, \"given everything I have seen, this is the MOST LIKELY answer.\" Every output is a roll of weighted dice. Pillar two. Learning. 
Those weights did not come from a human writing rules. They came from training on enormous piles of data. The internet. Books. Code. Conversations. Pillar three. Building. The model strings probabilities together. Token after token. Pixel after pixel. Note after note. Three legs of one stool. Let me give you the analogy that makes this click. Imagine a master chef. Over a forty-year career, this chef has tasted millions of dishes. From a hundred countries. A thousand cuisines. Slowly, the chef has absorbed patterns. Salt brightens. Acid lifts. Heat transforms. The chef does not memorize recipes. The chef has internalized the underlying grammar of food. Now, you walk into the chef's kitchen. You say, \"make me something Mexican, but a little Japanese, and please use citrus.\" The chef does not pull out a recipe book. The chef simply invents. That is exactly how a generative model thinks about data. Now, what kinds of things can these models actually create? The list, today, is staggering. Text. Long essays, short tweets, technical documentation, customer support, novels. Images. Photoreal portraits, oil paintings, anime, schematics, logos. Audio. Speech in any voice, music in any genre, sound effects on demand. Video. Short clips, animations, even feature-quality scenes. Code. Whole functions, entire apps, bug fixes, refactors. Three-dimensional assets. Game environments, architectural models, product visualizations. Even molecules. Yes, even chemistry. AlphaFold predicts the three-dimensional shape of a protein from its sequence, and newer generative models go a step further and design proteins that have never been seen before. Anywhere a pattern can be found in data, a generative model can be trained to invent new examples of it. That is the beautiful, universal nature of this idea. Now, you might be wondering. How is this different from the AI that came before? Let us walk through the history quickly. In the nineteen fifties, AI was rule-based. Engineers wrote thousands of IF-statements. Brittle. Unscalable. In the nineteen nineties, AI shifted to statistical machine learning. 
Spam filters. Recommendation engines. Useful, but narrow. In the two thousand and tens, deep learning broke open. Image classification. Translation. Speech-to-text. Computers could finally see and hear. But they still mostly classified. They labeled. In the two thousand and twenties, finally, GenAI arrived. We took the same kind of neural networks that learned to recognize cats, and taught them to PAINT cats. The pivotal architecture was something called the transformer, which we will visit shortly. The whole field shifted from understanding the world to creating it. So let me crystallize the takeaway. A discriminative model learns the boundary between things. Cat or not cat. Spam or not spam. A generative model learns the entire shape of the data. Every angle a cat can take. Every fur color. Every lighting condition. Once it has learned that shape, it can sample brand new examples that fit perfectly. That is the essence. Three things to remember from this lesson. First. GenAI generates. It does not retrieve. Second. Every output is sampled from learned probabilities, so it is assembled fresh each time rather than looked up. Third. The same recipe works for text, images, audio, code, and beyond. And here is the question that should be in your mind by now. If this idea is so simple, why did it only start working in twenty twenty-two? Why not the nineteen seventies? Hold on to that question. We will answer it soon, when three exponential curves cross at exactly the same moment in history. But first, the next chapter of our story sharpens the line between the critic and the artist. Let us go.",
      "beats": [
        {
          "after": "really",
          "show": 1
        },
        {
          "after": "Generative",
          "show": 2
        },
        {
          "after": "ChatGPT",
          "show": 3
        },
        {
          "after": "Midjourney",
          "show": 4
        },
        {
          "after": "Suno",
          "show": 5
        },
        {
          "after": "Sora",
          "show": 6
        },
        {
          "after": "Copilot",
          "show": 7
        },
        {
          "after": "Generation",
          "show": 8
        },
        {
          "after": "spam",
          "show": 9
        },
        {
          "after": "faces",
          "show": 10
        },
        {
          "after": "shift",
          "show": 11
        },
        {
          "after": "Cat",
          "show": 12
        },
        {
          "after": "Make",
          "show": 13
        },
        {
          "after": "discriminative",
          "show": 14
        },
        {
          "after": "generative",
          "show": 15
        },
        {
          "after": "artist",
          "show": 16
        },
        {
          "after": "three",
          "show": 17
        },
        {
          "after": "Probability",
          "show": 18
        },
        {
          "after": "dice",
          "show": 19
        },
        {
          "after": "Learning",
          "show": 20
        },
        {
          "after": "data",
          "show": 21
        },
        {
          "after": "Building",
          "show": 22
        },
        {
          "after": "stool",
          "show": 23
        },
        {
          "after": "chef",
          "show": 24
        },
        {
          "after": "millions",
          "show": 25
        },
        {
          "after": "absorbed",
          "show": 26
        },
        {
          "after": "grammar",
          "show": 27
        },
        {
          "after": "kitchen",
          "show": 28
        },
        {
          "after": "invents",
          "show": 29
        },
        {
          "after": "Text",
          "show": 30
        },
        {
          "after": "Images",
          "show": 31
        },
        {
          "after": "Audio",
          "show": 32
        },
        {
          "after": "Video",
          "show": 33
        },
        {
          "after": "Code",
          "show": 34
        },
        {
          "after": "assets",
          "show": 35
        },
        {
          "after": "molecules",
          "show": 36
        },
        {
          "after": "fifties",
          "show": 37
        },
        {
          "after": "nineties",
          "show": 38
        },
        {
          "after": "tens",
          "show": 39
        },
        {
          "after": "twenties",
          "show": 40
        },
        {
          "after": "architecture",
          "show": 41
        },
        {
          "after": "creating",
          "show": 42
        },
        {
          "after": "boundary",
          "show": 43
        },
        {
          "after": "shape",
          "show": 44
        },
        {
          "after": "First",
          "show": 45
        },
        {
          "after": "Second",
          "show": 46
        },
        {
          "after": "Third",
          "show": 47
        },
        {
          "after": "story",
          "show": 48
        },
        {
          "after": "go",
          "show": 49
        }
      ]
    },
    {
      "id": 2,
      "slug": "discriminative-vs-generative",
      "title": "Discriminative vs Generative — the leap",
      "narration": "Welcome back. In this lesson, we are going to draw the most important line in all of modern AI. The line between two completely different kinds of jobs. Imagine two people. The first is an art critic. She walks into a gallery. She looks at a painting and says, \"this one is a Monet. This one is a Picasso. And this one... is a fake.\" Her job is to tell things apart. To recognize. To classify. We call this... discriminate. The second person is an artist. He walks up to a blank canvas. He picks up a brush. And he paints a brand new Monet. A scene that did not exist this morning. A combination of light and pigment that never lived anywhere on Earth before. His job is to create. To generate. Now, both of them need deep knowledge of art. But notice... they are solving completely different problems. The critic only needs to know what makes a Monet a Monet. The artist needs to know how a Monet is BUILT. Brushstroke by brushstroke. That, in one image, is the leap from discriminative AI to generative AI. Let us go deeper. Discriminative AI is the kind of AI you have been using for years, even if you did not know it. Your spam filter? Discriminative. It looks at every email and asks, \"spam or not spam?\" Two boxes. It just sorts. Your phone camera that recognizes your face? Discriminative. It looks at pixels and asks, \"is this Abhijeet, or someone else?\" Recommendation engines on Netflix? Discriminative. Five stars, or one star? Predicted thumbs up, or thumbs down? In all of these cases, the model never invents anything. It never writes you a new email. It never paints you a new face. It just sorts. Mathematically, we say it learns to estimate the probability of a label, given an input. It learns where the boundary sits. The boundary between cat and dog. Between fraud and not fraud. That is all it needs. So discriminative models can be relatively small. They can learn from less data. They run fast. 
And for fifty years, almost everything in AI lived on this side of the line. Generative AI plays a completely different game. A generative model does not ask, \"is this a cat?\" It asks, \"what does a cat look like?\" That is a much, much bigger question. Because to answer it, the model has to understand every possible angle of a cat. Every fur color. Every breed. Every pose. Every lighting condition. Every photograph that was ever taken of a cat, and every photograph that could have been taken. All of that has to live inside the model. Mathematically, we say a generative model learns the probability of the input itself. Just the entire distribution of what cats look like. And once you have that distribution... here is the magic moment. You can sample from it. You can ask the model to roll a die inside its learned cat-space. And out comes a brand new cat. A cat that has never existed. That is generation. So generative models are dramatically harder to train. They need more data. They need more compute. They need cleverer architectures. But what they give back... is nothing less than the ability to create. Let me give you a picture that locks this in. Imagine your data, scattered as dots in space. Cat photos here. Dog photos there. A discriminative model has one job. Draw a line between the cat dots and the dog dots. Just a boundary. Once that boundary is drawn, any new dot can be classified by which side of the line it falls on. Simple. Clean. Limited. A generative model has a much bigger job. Forget the line. Instead, draw the entire shape of the cat cloud. The whole blob. The boundary. The interior. The density. Every region where cat-like things live. Once you have that shape, you can pick any point inside it... and decode that point back into an actual cat picture. That, right there, is sampling. That is generation. Discriminative models are about lines. Generative models are about shapes. Same data, completely different question, completely different answer. 
And building the shape is far, far harder than drawing the line. Which is exactly why this revolution took so long. For about fifty years, almost all of AI was discriminative. The reason was simple. Generation was just too hard. The math was unstable. The compute was not there. The data was not there. Most attempts at generation in the eighties, in the nineties, in the early two thousands, gave you something that looked... a little wrong. Blurry faces. Garbled text. Music that drifted. Researchers knew the goal. They could not reach it. And then, slowly... three things changed at the same time. The data exploded. The compute exploded. And one architectural breakthrough finally cracked the problem. We will spend the next lesson on exactly what those three forces were. So here is the takeaway. The one mental model that will earn you back the time you spent on this lesson many times over. Discriminative AI learns the boundary between things. Cat or not cat. Spam or not spam. Generative AI learns the entire shape of the data. Every angle, every variation, every example. Once you see this distinction, every news headline makes sense. Every product launch, every research paper, every breakthrough you read about... fits cleanly on one side or the other. So... what finally pushed generative across the finish line? What aligned in twenty twenty-two to make all this explode? Three forces. Three exponential curves crossing at the same moment. That is the next chapter. Let us go.",
      "beats": [
        {
          "after": "Welcome",
          "show": 1
        },
        {
          "after": "two",
          "show": 2
        },
        {
          "after": "critic",
          "show": 3
        },
        {
          "after": "Monet",
          "show": 4
        },
        {
          "after": "Picasso",
          "show": 5
        },
        {
          "after": "fake",
          "show": 6
        },
        {
          "after": "discriminate",
          "show": 7
        },
        {
          "after": "artist",
          "show": 8
        },
        {
          "after": "canvas",
          "show": 9
        },
        {
          "after": "brush",
          "show": 10
        },
        {
          "after": "morning",
          "show": 11
        },
        {
          "after": "generate",
          "show": 12
        },
        {
          "after": "leap",
          "show": 13
        },
        {
          "after": "deeper",
          "show": 14
        },
        {
          "after": "spam",
          "show": 15
        },
        {
          "after": "Abhijeet",
          "show": 16
        },
        {
          "after": "Netflix",
          "show": 17
        },
        {
          "after": "sorts",
          "show": 18
        },
        {
          "after": "boundary",
          "show": 19
        },
        {
          "after": "fast",
          "show": 20
        },
        {
          "after": "fifty",
          "show": 21
        },
        {
          "after": "look",
          "show": 22
        },
        {
          "after": "angle",
          "show": 23
        },
        {
          "after": "breed",
          "show": 24
        },
        {
          "after": "pose",
          "show": 25
        },
        {
          "after": "lighting",
          "show": 26
        },
        {
          "after": "distribution",
          "show": 27
        },
        {
          "after": "sample",
          "show": 28
        },
        {
          "after": "scattered",
          "show": 29
        },
        {
          "after": "line",
          "show": 30
        },
        {
          "after": "side",
          "show": 31
        },
        {
          "after": "shape",
          "show": 32
        },
        {
          "after": "interior",
          "show": 33
        },
        {
          "after": "sampling",
          "show": 34
        },
        {
          "after": "harder",
          "show": 35
        },
        {
          "after": "eighties",
          "show": 36
        },
        {
          "after": "Blurry",
          "show": 37
        },
        {
          "after": "exploded",
          "show": 38
        },
        {
          "after": "breakthrough",
          "show": 39
        },
        {
          "after": "headline",
          "show": 40
        },
        {
          "after": "chapter",
          "show": 41
        },
        {
          "after": "go",
          "show": 42
        }
      ]
    },
    {
      "id": 3,
      "slug": "why-genai-exploded",
      "title": "Why GenAI exploded in 2022",
      "narration": "November, two thousand twenty-two. ChatGPT launches. The world changes overnight. Or so it seems. But the truth is... the seeds for that moment were planted decades earlier. And to understand the explosion, we need to understand the three forces that finally aligned. Force number one. Data. Specifically, the entire internet. Trillions of words. Billions of images. Code. Books. Reddit threads. Papers. Forums. By two thousand twenty, we finally had enough text to teach a machine the patterns of human language. Force number two. Compute. Specifically, GPUs. That is graphics processing units. Originally designed to render video games. It turns out, the math behind a flying dragon is the same math behind a neural network. Companies like NVIDIA spent twenty years making GPUs faster, cheaper, and more parallel. Without them... none of this would be possible. Force number three. Architecture. In two thousand seventeen, a small team of researchers at Google published a paper called \"Attention is All You Need.\" It introduced the transformer. We will spend a whole lesson on it later. For now, just know this. The transformer was the key that unlocked the door. Stack three forces together. Data, compute, architecture. Multiply them. And what you get is a phenomenon we call SCALING LAWS. Bigger models, more data, more compute... predictably get smarter. Not a little smarter. Dramatically smarter. So when ChatGPT launched, it was not a miracle. It was the moment three exponential curves crossed at the same point on the graph. The big idea here is that GenAI is not one breakthrough. It is the convergence of three. Now... let us actually open the hood. Starting with the smallest unit a model ever sees. Not a word. Not a letter. A token. Let us go.",
      "beats": [
        {
          "after": "ChatGPT",
          "show": 1
        },
        {
          "after": "Data",
          "show": 2
        },
        {
          "after": "internet",
          "show": 3
        },
        {
          "after": "Compute",
          "show": 4
        },
        {
          "after": "GPUs",
          "show": 5
        },
        {
          "after": "NVIDIA",
          "show": 6
        },
        {
          "after": "Architecture",
          "show": 7
        },
        {
          "after": "transformer",
          "show": 8
        },
        {
          "after": "scaling",
          "show": 9
        },
        {
          "after": "smarter",
          "show": 10
        },
        {
          "after": "convergence",
          "show": 11
        },
        {
          "after": "token",
          "show": 12
        }
      ]
    },
    {
      "id": 4,
      "slug": "tokens",
      "title": "Tokens — the atoms of language",
      "narration": "Here is something that surprises almost every newcomer to this field. When you type a sentence into a language model, the model does not actually see your words. Not the way you do. To you, a sentence is a string of meaningful words. To the model, that very same sentence is something else entirely. So what does it see? It sees tokens. Tokens are the true atoms of every language model. Every prompt, every answer, every essay it writes, is built from these little pieces. So let us slow down, and really understand them. Because once tokens click, a dozen confusing things about A.I. suddenly make sense. First, the obvious question. Why not just use whole words? It seems so natural. One word, one unit. Well, there are a few deep problems with that. The first problem is vocabulary. The English language alone has hundreds of thousands of words. Add names, slang, typos, and other languages, and the list explodes into the millions. A model would need a giant lookup table for every single one. The second problem is worse. What happens when the model meets a word it has never seen before? A brand new product name. A made-up word in a poem. A rare medical term. If the model only knows whole words, a new word is a total blank. It cannot even represent it. And language invents new words every single day. So instead of words, we use something cleverer. We chop text into chunks. A chunk might be a whole common word, like the, or and. It might be a piece of a longer word. It might be a single character. Or a piece of punctuation. These chunks are our tokens. Let me show you with a classic example. Take the word, unbelievable. To you, that is one word, with a clear meaning. But a typical model does not store unbelievable as one unit. Instead, it breaks it into three familiar pieces. Un. Believ. And able. Three tokens. And here is the beautiful part. The model has seen un at the start of thousands of words. Unhappy. Unfair. Unkind. 
It has seen able at the end of thousands more. Comfortable. Reliable. Readable. So even though it may never have seen unbelievable before, it can still understand it, by combining pieces it already knows well. Now, each of these tokens is mapped to a number. A unique identifier. Un might be token one thousand eight hundred. Believ might be token six thousand. Able might be token seven hundred. So the word becomes a small list of numbers. And this is the key mental shift. By the time your sentence reaches the heart of the model, it is no longer text at all. It is just a stream of integers. A river of digits. Take a simple sentence. I love cats. To the model, that might become four or five numbers, one after another. The grammar, the meaning, the spelling, all of it is now encoded as plain numbers. The model only ever does math. It never sees a letter. So how does the model decide where to chop? What is the rule? The technique is called byte pair encoding. Or B P E, for short. And the idea behind it is genuinely elegant. It starts with the simplest possible vocabulary. Just individual letters. Every word, at first, is spelled out one character at a time. Then it looks at a massive pile of training text, and it asks a simple question. Which pair of symbols appears next to each other most often? Maybe it is t followed by h. Those two letters show up together constantly. So the algorithm glues them into a single new token. T h becomes th. Then it repeats. Now which pair is most common? Perhaps th followed by e. So it merges again. T h e becomes the. And it keeps going. Merge, after merge, after merge. After a few thousand of these merges, something wonderful has happened. The most common words, like the, and of, and to, have become single tokens. While rare words are still built from smaller pieces. Common things get short. Rare things stay flexible. The model gets the best of both worlds. The final list of all these chunks is called the vocabulary. 
And its size is a real design choice, with a real trade-off. If you pick a small vocabulary, say a few thousand tokens, then every sentence gets chopped into many tiny pieces. Long sequences. Slower to process. If you pick a huge vocabulary, say two hundred thousand tokens, then sentences become short, but the model needs far more memory to hold them all. Most modern models land somewhere in the middle. Often around fifty thousand, to one hundred thousand tokens. A careful balance. There are also a few special tokens, that are not really words at all. There is usually a marker for the start of a text. One for the end. One for padding, to make sequences line up. And often, a token that simply means, unknown. These quiet little markers help the model keep everything organized. Now, tokenization has some quirks that trip people up, so let us face them directly. First, spaces matter. In most systems, the space before a word is actually part of the token. So the word cat, at the very start of a sentence, and the same word in the middle, with a space in front, can be two different tokens. Capitalization matters too. Cat, with a capital letter, and cat, in lower case, are often different tokens. Digits are especially strange. A long number, like a phone number, often gets split into several odd little pieces, in ways that have nothing to do with math. This is one reason language models, historically, were so clumsy with arithmetic. Here is a consequence that costs real money. Tokenization is not equally efficient across languages. English, because so much training text is English, tends to be very efficient. A common English word is often just one token. But the same idea, written in another language, can take three, four, even five times as many tokens. So a sentence in Hindi, or Arabic, or Chinese, can quietly cost several times more than the same sentence in English. Same meaning. Very different token count. 
And that brings us to two ideas that every builder must understand. The first is the context window. A model can only look at a limited number of tokens at once. This is its working memory. Its attention span. Maybe a few thousand tokens. Maybe a few hundred thousand, in the largest models. But it is always measured in tokens, not in words, and it is never unlimited. Feed it a document longer than its window, and the earliest parts simply fall away. Forgotten. The second idea is pricing. When you use a model through an A.P.I., you pay per token. Both for the tokens you send in, and the tokens it generates back. Let me make that concrete. Imagine a long prompt of two thousand tokens, and an answer of one thousand tokens. That is three thousand tokens, for a single question. Now multiply that by ten thousand users a day, and suddenly tokens are your entire bill. Which is exactly why one tiny habit saves so much money. Be concise. A shorter prompt is a cheaper prompt. Trimming needless words, removing repetition, summarizing long context, these are not just matters of style. At scale, they are the difference between a cheap product, and an expensive one. So let me leave you with one simple picture. Think of tokens as LEGO bricks. A small, fixed set of pieces. On their own, each brick is almost meaningless. But snap them together, in the right order, and you can build absolutely anything. A tweet. A novel. A computer program. A whole conversation. The model does not work with ideas, or paragraphs. It works with these humble little bricks. One token at a time. Let us lock in the three things to remember. First. A model never sees words. It sees tokens, chunks of text, each mapped to a number. Second. Byte pair encoding builds those chunks, by merging the most common pairs, over and over. Third. Tokens are the unit of everything that matters in practice. Your context window, your speed, and your bill, are all measured in tokens. But this raises a puzzle, doesn't it? 
If a token is just an integer, a bare number with no meaning of its own, then how on Earth does the model know that the word king, and the word queen, are deeply related? How does a plain number come to carry meaning? That is one of the most beautiful ideas in all of A.I. And it is exactly where we go next. To embeddings.",
      "beats": [
        {
          "after": "sentence",
          "show": 1
        },
        {
          "after": "tokens",
          "show": 2
        },
        {
          "after": "words",
          "show": 3
        },
        {
          "after": "unbelievable",
          "show": 4
        },
        {
          "after": "number",
          "show": 5
        },
        {
          "after": "integers",
          "show": 6
        },
        {
          "after": "encoding",
          "show": 7
        },
        {
          "after": "letters",
          "show": 8
        },
        {
          "after": "glues",
          "show": 9
        },
        {
          "after": "vocabulary",
          "show": 10
        },
        {
          "after": "special",
          "show": 11
        },
        {
          "after": "spaces",
          "show": 12
        },
        {
          "after": "Digits",
          "show": 13
        },
        {
          "after": "languages",
          "show": 14
        },
        {
          "after": "window",
          "show": 15
        },
        {
          "after": "pricing",
          "show": 16
        },
        {
          "after": "concise",
          "show": 17
        },
        {
          "after": "LEGO",
          "show": 18
        },
        {
          "after": "remember",
          "show": 19
        },
        {
          "after": "embeddings",
          "show": 20
        }
      ]
    },
    {
      "id": 5,
      "slug": "embeddings",
      "title": "Embeddings — meaning as geometry",
      "narration": "Let us solve a little puzzle together. How does a language model know that the word king, and the word queen, are deeply related? After all, by now you know the uncomfortable truth. Once we tokenize everything, all the model really has, are numbers. Token forty-two. Token nine hundred. Token twelve thousand. Bare integers. There is nothing inside the number forty-two that whispers, royalty. So where does meaning come from? The answer is one of the most beautiful ideas in this entire field. It is called an embedding. An embedding is simply a vector. And a vector is just a list of numbers. Maybe seven hundred and sixty-eight numbers. Maybe four thousand. And here is the crucial part. Every single token gets its very own vector. So the token for king is not just the number forty-two. It is a long list of numbers. And the token for queen, is a different long list. The magic is in how those lists relate to each other. Let me give you a picture that makes this click instantly. Imagine a giant city. A real one. New York, let us say. In a real city, similar things cluster together. Restaurants gather in one district. Banks, in the financial quarter. Theatres, in another. Now, imagine that instead of a real city, we build a giant city of meaning. A vast space, where every token lives at an address. In this city, words with similar meanings live in the same neighborhood. The neighborhood for animals is over here. Inside it, the word dog sits on one block. The word cat, just next door. Down the road is the neighborhood for verbs. Run, jump, swim, all clustered together. Far across town is the neighborhood for emotions. That is what an embedding space truly is. A map of meaning, where location is everything. Now, you might picture this city in two dimensions. Like a paper map. But the real thing is far stranger. An embedding space has hundreds of dimensions. Sometimes thousands. We cannot visualize that. Our brains give up after three. 
But the mathematics does not care. It happily measures distance and direction in seven hundred dimensions, just as easily as in two. And here is the key idea. Distance means similarity. If two tokens sit close together in this space, their meanings are close. King and queen are neighbors. Cat and dog are neighbors. King and bicycle are on opposite sides of town. To measure how close two vectors are, we usually use something called cosine similarity. You do not need the formula. Just the intuition. It measures whether two vectors point in the same direction. Same direction, similar meaning. Now we arrive at the most astonishing part. In this space of meaning, directions carry meaning too. There is a famous result. Take the vector for king. Subtract the vector for man. Then add the vector for woman. And where do you land? Very close to the vector for queen. King, minus man, plus woman, lands next to queen. Let that sink in. The model was never told what gender is. Yet a clean, consistent direction for gender simply emerged in the space. There is a direction for plural. A direction for past tense. A direction for capital cities. Meaning became geometry. Relationships became arrows you can do arithmetic on. So, where do these magical vectors come from? Nobody sits down and hand writes seven hundred numbers for the word cat. That would be impossible. Instead, the embeddings are learned. During training, they start as random numbers. Pure noise. But every time the model makes a prediction, and gets nudged toward the right answer, the vectors shift a tiny bit. Words that appear in similar contexts, slowly drift together. Over billions of examples, the random cloud organizes itself into the beautiful city of meaning we just described. The structure is not designed. It is discovered. There is one more subtlety, and it matters enormously. In the earliest systems, every word had exactly one fixed vector. 
The word bank always got the same numbers, whether you meant a river bank, or a money bank. We call those static embeddings. But modern transformers do something cleverer. They build contextual embeddings. The vector for bank is computed fresh, every time, based on the words around it. In the sentence, I sat by the river bank, the vector leans toward nature. In the sentence, I deposited cash at the bank, the very same word leans toward finance. The meaning bends to fit the context. And that single upgrade, from static to contextual, is a huge part of why modern models feel so fluent. So why should you, as a builder, care about any of this? Because embeddings are everywhere in real systems. When you search a knowledge base by meaning, instead of keywords, you are comparing embeddings. When a store recommends a product similar to one you liked, embeddings. When we cluster thousands of customer reviews into themes, embeddings. And in the next few lessons, when we build retrieval systems that give a model long term memory, embeddings will be the beating heart of it all. So let us lock in the takeaway. A token, on its own, is just a number, with no meaning. An embedding turns that token into a rich vector, a point in a vast space of meaning. Nearby points share meaning. Directions encode relationships. And the whole map is learned, not designed. Now, let me deepen your intuition with the principle that makes all of this possible. Linguists call it the distributional hypothesis, and it has a wonderful slogan. You shall know a word, by the company it keeps. The idea is that words which appear in similar surroundings, tend to have similar meanings. The words doctor and nurse show up around the same kinds of words. Hospital. Patient. Care. So the model, simply by watching context, learns to place them near each other. It never needs a dictionary. The meaning leaks in, through the company each word keeps. 
Let me give you another taste of the arithmetic, because it really is remarkable. Take the vector for Paris. Subtract France. Add Italy. And you land, almost perfectly, on Rome. The model has discovered a clean direction that means, the capital of. It was never taught geography. It read enough text that the pattern carved itself into the space. There are directions for tense, for plural, for comparative adjectives, big to bigger, warm to warmer. Geometry, quietly encoding grammar. Now, a fair question. If this space has hundreds of dimensions, how do researchers ever look at it? We use clever tools that squash those hundreds of dimensions down to two, just for our eyes. And when we do, the picture is breathtaking. All the animals cluster in one blob. All the countries in another. Numbers form a gentle line. Months curve into a loop. The hidden structure of language, suddenly made visible. Here is something else that surprises people. Embeddings are not only for single words. We can embed an entire sentence, or a whole document, into one vector. A single point that captures the gist of a paragraph. And once a paragraph is a point in meaning space, magical things become easy. To find documents about a topic, you do not match keywords. You drop your question into the same space, and simply look for the nearest neighbors. This is called semantic search, and it is a quiet revolution. Imagine a support system. A customer types, my payment will not go through. A keyword search hunts for those exact words. But an embedding search understands the meaning, and happily surfaces an article titled, troubleshooting failed transactions, even though not a single word matches. Same meaning, different words, nearby vectors. That is the power. The same trick drives recommendation engines, that suggest a song like the one you love. It drives clustering, that groups thousands of reviews into themes nobody labeled. 
And it is the foundation of the retrieval systems we will build later in this course, the ones that give a model a vast, searchable, long term memory. But here is the question that should be forming in your mind. We now have these beautiful vectors. But a sentence is more than a bag of words. Order matters. Context matters. How does a token actually look at the other tokens around it, and update its meaning on the fly? How does river teach bank what it really means? That mechanism has a name. It is the single most important idea in modern A.I. It is called attention. And it is where we go next.",
      "beats": [
        {
          "after": "king",
          "show": 1
        },
        {
          "after": "numbers",
          "show": 2
        },
        {
          "after": "vector",
          "show": 3
        },
        {
          "after": "city",
          "show": 4
        },
        {
          "after": "neighborhood",
          "show": 5
        },
        {
          "after": "dimensions",
          "show": 6
        },
        {
          "after": "distance",
          "show": 7
        },
        {
          "after": "directions",
          "show": 8
        },
        {
          "after": "queen",
          "show": 9
        },
        {
          "after": "learned",
          "show": 10
        },
        {
          "after": "static",
          "show": 11
        },
        {
          "after": "bank",
          "show": 12
        },
        {
          "after": "company",
          "show": 13
        },
        {
          "after": "Rome",
          "show": 14
        },
        {
          "after": "sentence",
          "show": 15
        },
        {
          "after": "search",
          "show": 16
        },
        {
          "after": "recommendation",
          "show": 17
        },
        {
          "after": "attention",
          "show": 18
        }
      ]
    },
    {
      "id": 6,
      "slug": "attention",
      "title": "Attention — the secret sauce",
      "narration": "Let me ask you a simple question, and I want you to notice how fast your brain answers it. Read this sentence. The bank closed early today. Now, what is a bank, in that sentence? Is it a river bank? Or a financial bank? You knew, instantly. A money bank. But pause, and ask yourself. How did you know? You knew because of one single word. Closed. That word reached across the sentence, and told you exactly which meaning was intended. That, right there, is attention. Attention is the mechanism that lets every token in a sentence, look at every other token, and decide which ones actually matter, for understanding it. Right now. In this exact context. Before attention, models read text one word at a time, with a short and leaky memory. By the end of a long sentence, the beginning had faded. Attention threw that limitation away. It lets a token at the end of a paragraph, reach all the way back to the very first word, in a single step. So how does it actually work? The mechanism is surprisingly elegant, and it rests on three simple questions that every token asks. We call them the query, the key, and the value. Let me make them concrete. The query is what a token is looking for. Think of it as the token raising its hand and asking, who here is relevant to me? The key is what a token offers. It is like a label each token wears, advertising, here is what I am about. And the value is what a token actually contains. The real information it will hand over, if chosen. So picture a room full of tokens. Each token broadcasts a key, describing itself. Each token also sends out a query, describing what it needs. The model then compares every query, against every key. Mathematically, this comparison is a dot product. You do not need the math. Just the idea. When a query and a key point in a similar direction, they match strongly. When they do not, the match is weak. So the word bank sends out a query that means, roughly, what kind of bank am I? 
The word closed has a key that strongly matches that query. The word the, has a key that matches almost nothing. Now, all those match scores get squeezed through a step called softmax. That simply turns the raw scores into clean percentages, that add up to one hundred percent. These are the attention weights. They decide how much each token gets to influence every other. Finally, here is the payoff. Each token builds its new, updated meaning, by taking a weighted blend of all the values in the room. Mostly the values it matched strongly with. So bank pulls in a large helping of meaning from closed, a little from today, and almost nothing from the. The word literally rewrites its own meaning, by paying attention to its most relevant neighbors. Let me give you an analogy that locks it in. Imagine a meeting room. Everyone has a question they want answered, that is their query. Everyone also wears a name tag describing their expertise, that is their key. You scan the tags, find the people most relevant to your question, and listen mostly to them. That is attention. A room where everyone, simultaneously, finds and listens to whoever matters most to them. Now, two upgrades make this truly powerful. The first is called self attention. It simply means the tokens are all looking at each other, within the same sentence. Not at some separate input. The sentence interrogates itself, and every word refines its meaning in the light of all the others. The second upgrade is even more clever. It is called multi head attention. Instead of running this whole process once, the model runs it many times in parallel. Each parallel copy is called a head. And here is the beautiful part. Different heads learn to focus on different kinds of relationships. One head might track grammar, linking verbs to their subjects. Another might follow long range references, connecting a pronoun to the name it stands for. Another might watch for tone. The model blends all these perspectives together. 
It is like having a dozen specialists read the same sentence, each catching something the others missed. And here is why this design conquered the field. All of these comparisons happen at the same time. In parallel. Unlike the old approach, which crawled through text one word after another, attention looks at the whole sentence at once. That makes it a perfect fit for modern hardware, which loves doing many things simultaneously. There is a cost, and you should know it. Because every token compares itself with every other token, the work grows with the square of the length. Double the text, and you roughly quadruple the effort. That is exactly why very long inputs are expensive, and why so much current research is about making attention cheaper. So let us crystallize the takeaway. Attention lets every token ask a question, the query, advertise itself with a key, and offer its content as a value. Strong matches mean strong influence. Each token then rebuilds its meaning, as a weighted blend of the tokens that matter most to it. Do this with many heads, in parallel, stacked over and over, and you get a machine with an almost uncanny grasp of context. Let me slow down and walk through a tiny, concrete example, so the three roles really stick. Consider the sentence, the animal did not cross the street because it was too tired. Now, what does it refer to? The animal, or the street? You know it means the animal, because a street does not get tired. Watch how attention solves this. The token it sends out a query that, in effect, asks, which earlier noun do I stand for? The token animal carries a key that, combined with the word tired, matches that query strongly. The token street matches weakly. So when it rebuilds its meaning, it pulls heavily from animal. The pronoun has been resolved, not by a grammar rule someone wrote, but by a soft, learned comparison of queries and keys. Now change one word. The animal did not cross the street because it was too wide. 
Suddenly, it means the street. And attention, reading the word wide instead of tired, quietly shifts its weights, and points at street instead. The very same machinery, landing on the opposite answer, purely from context. This is the flexibility that older models simply could not match. Let me also make multi head attention more vivid. Picture reading a contract. One part of your mind tracks the legal definitions. Another follows the dates and deadlines. Another watches for anything that sounds risky. You read the document once, but several specialists inside you are each looking for something different, and then you combine their findings. That is exactly what the multiple heads do. In a real model, researchers have actually peeked inside, and found heads that specialize. One reliably links verbs to their subjects. Another connects pronouns to names. Another tracks quotation marks and brackets. The model was never told to divide the labor this way. It discovered, on its own, that splitting attention into specialists, was the most useful thing to do. One more practical point worth carrying with you. Because attention compares every token with every other token, it is the part of the model that strains hardest as text gets long. A short prompt is cheap. A very long document is expensive, and the cost grows faster than the length. This is why context windows have limits, why long inputs cost more, and why a huge slice of current research is devoted to clever tricks, that approximate attention more cheaply, without losing its magic. We now have the single most important building block in modern A.I. But a building block is not a building. How do we stack attention into a full, working model, the kind that powers ChatGPT and Claude? For that, we need to assemble these pieces into a complete architecture. The one that changed everything. The transformer. Let us go.",
      "beats": [
        {
          "after": "bank",
          "show": 1
        },
        {
          "after": "closed",
          "show": 2
        },
        {
          "after": "attention",
          "show": 3
        },
        {
          "after": "query",
          "show": 4
        },
        {
          "after": "key",
          "show": 5
        },
        {
          "after": "value",
          "show": 6
        },
        {
          "after": "dot",
          "show": 7
        },
        {
          "after": "softmax",
          "show": 8
        },
        {
          "after": "blend",
          "show": 9
        },
        {
          "after": "meeting",
          "show": 10
        },
        {
          "after": "self",
          "show": 11
        },
        {
          "after": "heads",
          "show": 12
        },
        {
          "after": "parallel",
          "show": 13
        },
        {
          "after": "square",
          "show": 14
        },
        {
          "after": "animal",
          "show": 15
        },
        {
          "after": "wide",
          "show": 16
        },
        {
          "after": "contract",
          "show": 17
        },
        {
          "after": "transformer",
          "show": 18
        }
      ]
    },
    {
      "id": 7,
      "slug": "transformer",
      "title": "The Transformer, demystified",
      "narration": "In the year two thousand and seventeen, a small team of researchers at Google published a paper, with one of the boldest titles in the history of artificial intelligence. They called it, Attention Is All You Need. It was a confident claim. And honestly, they turned out to be right. That single paper launched the architecture behind almost every modern A.I. system you have heard of. The transformer. To appreciate why it mattered so much, we have to look at what came before. Before transformers, the best language models used something called recurrent networks. The idea was intuitive. Read the text one word at a time, like a person reading aloud, keeping a running summary in memory. It worked, but it had two crippling problems. First, it was slow. Because each word depended on the one before it, you could not process them in parallel. You were stuck reading strictly left to right. Second, it had a leaky memory. By the time it reached the end of a long paragraph, the beginning had faded to a blur. The transformer threw all of that out. It made one radical bet. Forget reading word by word. Just use attention, applied many times, in parallel, stacked deep. And that bet paid off spectacularly. So what does a transformer actually look like, on the inside? Picture an assembly line, in a factory. At one end, you drop in your raw materials. The tokens. At the other end, out comes a finished product. A prediction of the next token. In between, the materials pass through a series of identical stations. We call those stations transformer blocks. A small model might have a dozen. A large one might have ninety six, stacked one on top of another. The input does not jump straight to the blocks, though. First, each token is turned into its embedding, the vector of meaning we discussed earlier. But there is a catch. Attention, by itself, has no sense of order. To attention, a sentence is just a bag of words. So we add something called a positional encoding. 
A little signal stitched into each embedding, that says, you are word number one, you are word number two, and so on. Now the model knows not just what the words are, but where they sit. With that prepared, the tokens flow into the stack of blocks. And every single block does the same four things, in the same order. Let me walk you through one block. Step one. Self attention. Every token looks at every other token, and updates its understanding, exactly as we learned in the last lesson. This is where context flows between words. Step two. A feedforward network. After mixing information across tokens, each token is passed, individually, through a small neural network. This is where the model does its private thinking, refining each token on its own. Step three. Residual connections. This one sounds technical, but the idea is simple and vital. Instead of replacing a token entirely at each step, the block adds its update on top of the original. It keeps a copy of the input, and adds the new insight to it. This little trick is what allows us to stack blocks dozens deep, without the signal getting lost or scrambled along the way. Step four. Layer normalization. A gentle housekeeping step, that keeps all the numbers in a healthy range, so training stays stable. And that is it. Self attention, feedforward, residual, normalize. Four steps. Then the token, now a little wiser, flows into the next block, and the whole thing repeats. Block after block after block. Here is the part that still amazes researchers. As information climbs through this stack, the model builds understanding in layers. The earliest blocks tend to capture simple things. Grammar. Spelling. Which word follows which. The middle blocks start tracking meaning, and the relationships between ideas. And the deepest blocks begin to handle genuinely abstract reasoning. Nobody programmed this hierarchy. It emerged, simply from stacking the same block, over and over, and training at enormous scale. 
At the very top of the stack, after the final block, sits the output head. It takes the richly processed vector for the last position, and turns it back into a prediction. A probability, for every possible next token. And that is how a transformer, at its core, does just one thing. It predicts the next token, astonishingly well. Now, you will sometimes hear the words encoder and decoder. A quick note. Some transformers read an input and produce an output, like translation, and use both halves. But the famous chat models, the ones that generate text, are mostly decoder only. They just predict the next token, again and again, feeding their own output back in. Simple, and powerful. So let us pull it all together. A transformer is an assembly line of identical blocks. Each block runs self attention, then a feedforward network, wrapped in residual connections and normalization. Stack these blocks deep, add positional information at the start, and a prediction head at the end, and you have the engine behind the entire generative revolution. Let me linger on why the residual connections matter so much, because they are the quiet hero of this whole design. Imagine a message, whispered down a line of a hundred people. Without any help, by the end, the message is hopelessly garbled. That is what happens to a signal passing through a hundred transformations. It drifts and decays. The residual connection is like giving every person in the line, the original written note, alongside whatever they heard whispered. At each step, the model keeps the original, and merely adds a small correction. So even a very deep stack stays grounded, and the signal survives all the way to the top. Without this one trick, deep transformers simply would not train. Now let me make the scale concrete, because the numbers are staggering. A modern model might stack dozens, even close to a hundred of these blocks. 
Each block holds millions of adjustable numbers, called parameters, inside its attention and its feedforward network. Add them all up, and you reach billions of parameters. Every one of them, tuned automatically during training. When people say a model has seventy billion parameters, this is what they mean. Seventy billion little dials, inside this assembly line, all set by learning, not by hand. Here is a subtle thing that trips up newcomers. When a chat model writes you a paragraph, it is not planning the whole thing in advance. It runs the entire transformer, just to predict one next token. Then it appends that token to the text, and runs the entire stack again, to predict the one after. Token by token, looping, each step informed by everything written so far. The fluency you see, the sense of a coherent thought, emerges from thousands of these tiny, local predictions, chained together. It feels like planning. It is actually a very, very good guess, repeated. And one more idea to file away. The same block, repeated, is why transformers scale so gracefully. Want a more capable model? Make the blocks wider, or stack more of them, and train on more data. That simple recipe, more depth, more width, more data, keeps producing smarter models, with almost eerie reliability. That predictable improvement, has a name. The scaling laws. And they are a big part of why the whole industry suddenly started pouring billions into ever larger models. There is one last question hanging in the air. We now have this magnificent machine. But a freshly built transformer knows absolutely nothing. Its millions, or billions, of internal numbers, are pure random noise. So how do we take this empty, powerful engine, and fill it with knowledge of grammar, facts, code, and reasoning? How do we actually teach it? That is the next chapter of our story. The three stages of training. Let us go.",
      "beats": [
        {
          "after": "Google",
          "show": 1
        },
        {
          "after": "recurrent",
          "show": 2
        },
        {
          "after": "transformer",
          "show": 3
        },
        {
          "after": "assembly",
          "show": 4
        },
        {
          "after": "blocks",
          "show": 5
        },
        {
          "after": "positional",
          "show": 6
        },
        {
          "after": "self",
          "show": 7
        },
        {
          "after": "feedforward",
          "show": 8
        },
        {
          "after": "residual",
          "show": 9
        },
        {
          "after": "normalization",
          "show": 10
        },
        {
          "after": "grammar",
          "show": 11
        },
        {
          "after": "head",
          "show": 12
        },
        {
          "after": "decoder",
          "show": 13
        },
        {
          "after": "parameters",
          "show": 14
        },
        {
          "after": "token",
          "show": 15
        },
        {
          "after": "scaling",
          "show": 16
        },
        {
          "after": "stages",
          "show": 17
        }
      ]
    },
    {
      "id": 8,
      "slug": "training-stages",
      "title": "Pretraining, fine-tuning, RLHF",
      "narration": "A newborn transformer knows nothing. Its weights are random. If you ask it a question, it will produce gibberish. So how do we teach it? In three stages. Stage one is called pretraining. We give the model a simple, almost insultingly simple, task. Predict the next token. We feed it half a sentence and ask, \"what comes next?\" If it is wrong, we nudge the weights. If it is right, we leave them be. Now, do this... a few trillion times. Across the entire internet. Books. Code. Forums. Encyclopedias. After a few months on a giant cluster of GPUs, something magical happens. The model has not just learned grammar. It has learned facts. Logic. Style. Even a little bit of reasoning. Just from predicting the next token, over and over and over. The medical analogy is perfect. Pretraining is medical school. The student reads everything. Knows a little about everything. But cannot yet treat a patient. Stage two is fine-tuning. We take that giant generalist model... and we show it a smaller, curated dataset. High-quality question-and-answer pairs. Polite conversation. Step-by-step reasoning. The model learns the SHAPE of a good response. The medical equivalent is residency. The student practices on real cases. Builds bedside manner. Stage three is the part nobody saw coming. RLHF. That stands for reinforcement learning from human feedback. Here, we let the model produce two answers to the same question. Then a human says, \"I prefer this one.\" We train a second model to predict that preference. And then we use that preference model to nudge the original. Over millions of comparisons. The model learns... not just to be correct. But to be HELPFUL. To be honest. To say \"I do not know,\" instead of making things up. RLHF is what turned a raw language model into ChatGPT. The big idea is this. Three stages. Pretraining gives the model knowledge. Fine-tuning gives it shape. RLHF gives it judgment. Each stage fixes problems the previous one left behind. 
So now we have a trained model. When you type a question... how does it actually generate the answer? That is more interesting than you think. Let us go.",
      "beats": [
        {
          "after": "stages",
          "show": 1
        },
        {
          "after": "pretraining",
          "show": 2
        },
        {
          "after": "predict",
          "show": 3
        },
        {
          "after": "internet",
          "show": 4
        },
        {
          "after": "school",
          "show": 5
        },
        {
          "after": "fine",
          "show": 6
        },
        {
          "after": "residency",
          "show": 7
        },
        {
          "after": "RLHF",
          "show": 8
        },
        {
          "after": "preference",
          "show": 9
        },
        {
          "after": "helpful",
          "show": 10
        },
        {
          "after": "ChatGPT",
          "show": 11
        },
        {
          "after": "judgment",
          "show": 12
        }
      ]
    },
    {
      "id": 9,
      "slug": "sampling",
      "title": "Sampling — temperature, top-p, top-k",
      "narration": "Here is something nobody tells you. When a trained model gets your prompt, it does not give you ONE answer. It gives you a list of probabilities. Across the entire vocabulary. Maybe fifty thousand tokens. Each with a probability of being the next word. So somebody has to choose. And that somebody... is the sampling strategy. The simplest strategy is greedy. Always pick the most likely token. Sounds smart, right? It is also incredibly boring. Greedy decoding produces text that is technically correct, but lifeless. Repetitive. Predictable. So we add some randomness. Enter temperature. Temperature is a knob. At zero, you get pure greedy. The most likely token, every time. As you turn the knob up, the distribution flattens. Less likely tokens get a fair shot. At one, you sample roughly true to the model's beliefs. At two, things get wild. Surprising. Sometimes nonsense. The cooking analogy is great. Low temperature is a chef who only follows the recipe exactly. High temperature is a chef who improvises. There is also top-k sampling. Only consider the top K most likely tokens. Throw the rest away. And top-p, also called nucleus sampling. Keep only the tokens that, together, account for the top P of the probability mass. Top-p is smarter than top-k, because it adapts. When the model is confident, top-p shrinks. When the model is unsure, top-p expands to include more options. In production, you usually use top-p around point nine, with a temperature around point seven. That gives you variety, without chaos. Now... here is the beautiful part. Same model. Same prompt. Different sampling. Wildly different outputs. A factual question wants low temperature. A poem wants high temperature. A code suggestion wants top-p tightened. The big idea is this. The model gives you a probability cloud. Sampling is how you walk through it. Different walks, different stories. But sometimes... the model walks confidently into a wall. 
It tells you something that sounds perfect, but is completely false. We call this hallucination. And it deserves its own lesson. Up next.",
      "beats": [
        {
          "after": "probabilities",
          "show": 1
        },
        {
          "after": "greedy",
          "show": 2
        },
        {
          "after": "temperature",
          "show": 3
        },
        {
          "after": "knob",
          "show": 4
        },
        {
          "after": "wild",
          "show": 5
        },
        {
          "after": "chef",
          "show": 6
        },
        {
          "after": "top",
          "show": 7
        },
        {
          "after": "nucleus",
          "show": 8
        },
        {
          "after": "production",
          "show": 9
        },
        {
          "after": "outputs",
          "show": 10
        },
        {
          "after": "hallucination",
          "show": 11
        }
      ]
    },
    {
      "id": 10,
      "slug": "hallucinations",
      "title": "Why models hallucinate",
      "narration": "A real story. In two thousand twenty-three, a lawyer in New York filed a brief in court. It cited six previous cases. The judge looked them up. Not one of them existed. The lawyer had used ChatGPT. ChatGPT had INVENTED them. Confidently. Plausibly. Completely. We call this a hallucination. The model produces something that sounds right, looks right, reads right... but is just wrong. Why does this happen? Because of how we trained it. Remember lesson eight. Pretraining tells the model, \"predict the next token.\" That is the entire goal. Be PLAUSIBLE. Sound right. Notice what is NOT in that goal. Truth. Verification. Honesty. Truth was never in the loss function. So the model learns to be a confident generator of plausible text. Not a librarian. Not a fact-checker. A confident generator. The medical analogy is the brilliant intern. Eight years of school. Knows everything. Walks into a patient's room. Sees something unfamiliar. Now... a great doctor would say, \"I do not know. Let me look it up.\" But our intern... was rewarded their entire life for sounding confident. So they invent. They fill the gap with something that SOUNDS like the right answer. That is hallucination. So how do we fight it? Three tools. One. Retrieval. Give the model a search engine. Force it to ground answers in actual documents. We will spend a whole lesson on this. It is called RAG. Two. Citations. Make the model cite its sources. And, crucially, verify those citations exist. Three. Verification. For high-stakes answers, route through a second, slower system that fact-checks the first. None of these eliminate hallucinations. They reduce them. The big idea is this. A language model is a probability machine. Truth was never in its loss. So as builders, we have to add truth from outside. Never trust. Always verify. Great. We have now seen how text models work, end to end. But text is not the only thing models generate. They paint. They speak. They sing. They code. Up next... 
how an image is born from pure noise.",
      "beats": [
        {
          "after": "lawyer",
          "show": 1
        },
        {
          "after": "court",
          "show": 2
        },
        {
          "after": "invented",
          "show": 3
        },
        {
          "after": "hallucination",
          "show": 4
        },
        {
          "after": "predict",
          "show": 5
        },
        {
          "after": "loss",
          "show": 6
        },
        {
          "after": "intern",
          "show": 7
        },
        {
          "after": "Retrieval",
          "show": 8
        },
        {
          "after": "Citations",
          "show": 9
        },
        {
          "after": "Verification",
          "show": 10
        },
        {
          "after": "verify",
          "show": 11
        },
        {
          "after": "noise",
          "show": 12
        }
      ]
    },
    {
      "id": 11,
      "slug": "diffusion",
      "title": "Diffusion — images from noise",
      "narration": "Michelangelo had a famous quote. \"Every block of stone has a statue inside it. The sculptor's job is to set it free.\" That is exactly how a diffusion model thinks about images. Diffusion is the algorithm behind Stable Diffusion. Behind DALL-E. Behind Midjourney. And it is wonderfully simple. It works in two phases. The forward phase. Take a real image. Add a little noise. A little more. A little more. Repeat a thousand times. By the end... pure static. Like an old TV with no signal. Now, train a model to UN-do that. Step by step. Given a slightly noisy image, predict the noise. Subtract it. Repeat. We call this the reverse process. After training... here is the magic. Start with pure random noise. Run the reverse process. The model starts to see... patterns. Then shapes. Then objects. Then a scene. After fifty steps, you have a brand new image that has never existed. Like developing a Polaroid in reverse. The image was always there. The model just chips away the noise. Now, how do you steer it? How do you say, \"I want a cat in a spacesuit\"? You add a condition. The text \"a cat in a spacesuit\" is converted into an embedding. Remember embeddings, lesson five? That embedding is plugged into the denoising network at every step. So the model is not just removing noise. It is removing noise IN A DIRECTION. The direction toward \"a cat in a spacesuit.\" Beautiful. Now, why is diffusion so much better than the older approaches? Three reasons. It is stable to train. It scales gracefully. And it produces extremely sharp, coherent images. Not blurry. Not glitchy. Stunningly real. The big idea here is this. Generation equals controlled denoising. Start with chaos. Apply the right vector field. End with a masterpiece. So now we have models that write text and models that paint pictures. What if we want one model that does both? At the same time? That is the multimodal revolution. Up next.",
      "beats": [
        {
          "after": "Michelangelo",
          "show": 1
        },
        {
          "after": "diffusion",
          "show": 2
        },
        {
          "after": "forward",
          "show": 3
        },
        {
          "after": "noise",
          "show": 4
        },
        {
          "after": "static",
          "show": 5
        },
        {
          "after": "reverse",
          "show": 6
        },
        {
          "after": "scene",
          "show": 7
        },
        {
          "after": "Polaroid",
          "show": 8
        },
        {
          "after": "embedding",
          "show": 9
        },
        {
          "after": "direction",
          "show": 10
        },
        {
          "after": "denoising",
          "show": 11
        },
        {
          "after": "multimodal",
          "show": 12
        }
      ]
    },
    {
      "id": 12,
      "slug": "multimodal",
      "title": "Multimodal — eyes, ears, voice",
      "narration": "Imagine you take a photo of your fridge. You send it to ChatGPT. And the model says, \"Looks like you have eggs, half an onion, and some cheese. How about a frittata?\" That is multimodal AI. One model. Multiple senses. How does it work? The trick is beautifully simple. Remember embeddings? Vectors that live in a space of meaning? Well... what if we trained an image encoder that mapped pictures into THE SAME embedding space as text? So the picture of a fridge, and the words \"a fridge,\" land at almost the same address? Now your transformer does not have to know which input is text and which is image. It just sees vectors. It reasons. It writes. Audio works the same way. Spectrograms, the visual fingerprint of sound, get encoded into vectors. Same space. Same transformer. Video adds time, but the principle holds. The earliest model to crack this was called CLIP. C-L-I-P. Trained on four hundred million image-and-caption pairs from the internet. CLIP taught us that you can teach a computer to understand pictures, simply by reading captions. Today, every frontier model. GPT four. Claude three. Gemini. They are all multimodal. They see PDFs. They watch videos. They hear voice notes. They produce charts. The brain analogy is perfect. Your visual cortex and your language cortex are not separate brains. They are wired together. Through neurons that pass signals across both. A multimodal model does the same thing, in vector space. Now, what does this unlock for builders? A LOT. You can summarize a meeting from its recording. You can ask questions about a chart. You can describe a video. You can read a handwritten note. Whole categories of software just got rewritten. The big idea is this. The world is multimodal. So our models should be too. Same vector space. Different sensors. One reasoning engine. Up next, we will tour the OTHER modalities. Code. Music. Speech. Video. Same recipe, fascinating ingredients.",
      "beats": [
        {
          "after": "fridge",
          "show": 1
        },
        {
          "after": "frittata",
          "show": 2
        },
        {
          "after": "multimodal",
          "show": 3
        },
        {
          "after": "encoder",
          "show": 4
        },
        {
          "after": "address",
          "show": 5
        },
        {
          "after": "Spectrograms",
          "show": 6
        },
        {
          "after": "CLIP",
          "show": 7
        },
        {
          "after": "captions",
          "show": 8
        },
        {
          "after": "frontier",
          "show": 9
        },
        {
          "after": "brain",
          "show": 10
        },
        {
          "after": "sensors",
          "show": 11
        }
      ]
    },
    {
      "id": 13,
      "slug": "other-modalities",
      "title": "Code, speech, music, video",
      "narration": "Let us take a quick tour. Same recipe... wildly different applications. First. Code. It turns out, a programming language is a language. Tokens. Grammar. Patterns. Train a transformer on a few hundred million GitHub repositories and you get a model that writes code as fluently as it writes English. GitHub Copilot. Cursor. Claude Code. They are all transformers, fine-tuned on code. With one twist. They are great at code because code is more REGULAR than natural language. The grammar is strict. Errors get punished by compilers. There is more signal. Less ambiguity. So code models often outperform their natural-language siblings. Second. Speech. Text-to-speech, TTS, is the technology behind the voice in this very lesson. Modern TTS uses diffusion or transformers, but in audio space. Audio gets chopped into tokens too. Just shorter chunks. Twenty milliseconds at a time. Train on thousands of hours of audio with text transcripts... and you get a model that turns text into sound that is, frankly, indistinguishable from a human. Third. Music. Suno. Udio. ElevenLabs. They take the same recipe, and apply it to musical audio. Hum a melody and they will arrange it. Type a vibe and they will compose it. Fourth. Video. The newest, hardest frontier. Models like Sora. Runway. Veo. They extend diffusion across time. Each frame must be coherent with the next. Costs are enormous. Quality is rising fast. We are about a year out from production-grade video generation being everyday. The unifying observation is this. Once you have the recipe... attention plus scaling plus diffusion... you can apply it to ANY signal that can be tokenized. And it turns out, almost everything can. Pixels. Audio. Code. DNA. Protein structures. Even motion. The big idea here is universality. The transformer is not a language tool. It is a SEQUENCE tool. Anything you can serialize... you can generate. So now we know how the magic works. From tokens to multimodal. 
From training to sampling. Time for the most practical question of all. How do we actually USE these models to build things? That is the rest of this course. Let us go.",
      "beats": [
        {
          "after": "Code",
          "show": 1
        },
        {
          "after": "GitHub",
          "show": 2
        },
        {
          "after": "compilers",
          "show": 3
        },
        {
          "after": "Speech",
          "show": 4
        },
        {
          "after": "TTS",
          "show": 5
        },
        {
          "after": "Music",
          "show": 6
        },
        {
          "after": "Suno",
          "show": 7
        },
        {
          "after": "Video",
          "show": 8
        },
        {
          "after": "Sora",
          "show": 9
        },
        {
          "after": "tokenized",
          "show": 10
        },
        {
          "after": "universality",
          "show": 11
        }
      ]
    },
    {
      "id": 14,
      "slug": "prompting",
      "title": "Prompt engineering that actually works",
      "narration": "Same model. Same question. Two different prompts. Wildly different answers. That is the entire premise of prompt engineering. And let me tell you... it is the most underrated skill of this decade. Let us go through the patterns that actually work. Pattern one. Set a role. Instead of, \"summarize this paper\"... try, \"you are a senior research analyst. Summarize this paper for an executive audience in three bullet points.\" Roles anchor tone. Anchor depth. Anchor format. Pattern two. Provide context. The model has no idea who you are or what you are working on. Tell it. \"I am building a fitness app for retirees. Suggest five feature ideas.\" That extra sentence transforms the output. Pattern three. Use few-shot examples. Instead of describing the format you want, SHOW the format. Two or three examples in the prompt. The model picks up the pattern instantly. We call this in-context learning. It is one of the most surprising abilities of large language models. Pattern four. Constrain the output. \"Respond in JSON with exactly these three keys.\" \"Use no more than fifty words.\" Constraints turn the model from a wandering generalist into a sharp specialist. Pattern five. Chain of thought. Add the magic phrase. \"Let us think step by step.\" Or, \"explain your reasoning before giving the final answer.\" This dramatically improves accuracy on math, logic, and multi-step tasks. The model literally reasons better, just because you asked it to. The freelancer analogy is perfect. Imagine hiring a brilliant freelancer. Sending them a one-line brief versus a one-page brief. Same talent. Different product. The brief matters. Now... what is prompt engineering NOT? It is not magic words. It is not secret incantations. It is just clear writing. With structure. With examples. With constraints. The big idea is this. Prompts are the new programming language of the GenAI era. The better you express what you want, the better you get it. 
But prompts have a hard limit. The context window. What if you need the model to know things it was not trained on? Like your company's documents? Your codebase? Your knowledge base? For that... we use a technique called retrieval augmented generation. Or RAG. Up next.",
      "beats": [
        {
          "after": "prompts",
          "show": 1
        },
        {
          "after": "role",
          "show": 2
        },
        {
          "after": "context",
          "show": 3
        },
        {
          "after": "few",
          "show": 4
        },
        {
          "after": "examples",
          "show": 5
        },
        {
          "after": "Constrain",
          "show": 6
        },
        {
          "after": "JSON",
          "show": 7
        },
        {
          "after": "chain",
          "show": 8
        },
        {
          "after": "step",
          "show": 9
        },
        {
          "after": "freelancer",
          "show": 10
        },
        {
          "after": "RAG",
          "show": 11
        }
      ]
    },
    {
      "id": 15,
      "slug": "rag",
      "title": "RAG — giving models a memory",
      "narration": "Here is the problem. Your model was trained six months ago. Maybe two years ago. It does not know what happened yesterday. It does not know your company's internal docs. It does not know your customer's order history. So if you build a customer support bot using a vanilla model... it will hallucinate. Confidently. The fix is called RAG. Retrieval augmented generation. Read those words slowly. We RETRIEVE relevant information. We AUGMENT the prompt with that information. Then the model GENERATES the answer. Three steps. Beautiful. Let us walk through it. Step one. Take all your documents. Help articles. Customer records. Product specs. Chop them into chunks. Maybe five hundred tokens each. Step two. For every chunk, compute an embedding. Remember embeddings? Vectors of meaning? Store all those vectors in a special database. We call it a vector database. Examples are Pinecone. Weaviate. Chroma. Qdrant. Step three. When a user asks a question, embed THAT question too. Now do a similarity search. Find the chunks whose embeddings live closest to the question's embedding. Maybe top five. Step four. Stuff those chunks into the prompt. \"Here are five relevant passages from our knowledge base. Use them to answer the user's question.\" Step five. The model generates. Grounded. Cited. Accurate. The exam analogy is great. A closed-book exam tests memory. An open-book exam tests reasoning. RAG turns every query into an open-book exam. With the right pages already on the desk. Why is this such a big deal? Because it separates KNOWING from GENERATING. The model does not need to memorize your company's docs. It just needs to read them, when needed. Updates are easy. Just re-index the documents. Cost is lower. Hallucinations drop dramatically. And citations become possible. Today, RAG is the most-deployed pattern in production GenAI. Every chatbot you have used at a bank, an airline, a SaaS dashboard... probably runs RAG under the hood. The big idea is this. 
Models do not need long memory. They need long REACH. RAG gives them that reach. But sometimes... the model needs to do more than just answer. It needs to actually act. Book a flight. Run a query. Send an email. For that, we need agents. Up next.",
      "beats": [
        {
          "after": "trained",
          "show": 1
        },
        {
          "after": "RAG",
          "show": 2
        },
        {
          "after": "documents",
          "show": 3
        },
        {
          "after": "chunks",
          "show": 4
        },
        {
          "after": "embedding",
          "show": 5
        },
        {
          "after": "vector",
          "show": 6
        },
        {
          "after": "similarity",
          "show": 7
        },
        {
          "after": "passages",
          "show": 8
        },
        {
          "after": "exam",
          "show": 9
        },
        {
          "after": "open",
          "show": 10
        },
        {
          "after": "reach",
          "show": 11
        },
        {
          "after": "agents",
          "show": 12
        }
      ]
    },
    {
      "id": 16,
      "slug": "agents",
      "title": "Tools, function calling, agents",
      "narration": "So far, we have treated the model like a thinker. You ask. It answers. End of story. But what if the model could ACT? Pick up the phone. Run a database query. Send an email. Book a calendar slot. That is the world of tools, function calling, and agents. Let us start with function calling. Modern APIs let you describe a set of tools, in plain JSON. Each tool has a name. A description. A list of parameters. You include those tool descriptions in the prompt. The model can choose to invoke one. When it does, it does not run the tool itself. It outputs a structured JSON object. Like, \"call get-weather, with city equals Bangalore.\" Your code then runs the tool. Returns the result. The model continues. Beautiful. Clean. Predictable. The chef analogy. A chef given a recipe book is one thing. A chef given the kitchen, the ingredients, and the right to call any supplier... is a different beast entirely. Function calling gives the model the kitchen. Now stack this in a loop, and you get an AGENT. The model takes a user goal. Plans. Calls a tool. Reads the result. Plans again. Calls another tool. Eventually, returns a final answer. We call this pattern ReAct. Reason and Act. It is the spine of every modern AI agent. From auto-coders to research assistants to customer-service bots that can issue refunds. But here is the catch. Agents are powerful. And dangerous. They can loop forever. They can call expensive APIs. They can take actions that are hard to undo. So in production, you wrap them with guardrails. Maximum tool calls per session. Cost ceilings. Confirmation prompts before destructive actions. Audit logs. The promise of agents is real. They turn models from oracles into operators. They unlock automation we have dreamed about for decades. The big idea is this. Function calling extends a model from \"answer\" to \"act.\" Agents extend it from one act to many. With both come capability. And, with capability, responsibility. 
So we have a powerful prototype. Now... how do we ship it? That is the final lesson. The realities of production. Cost. Latency. Evaluation. Safety. Up next.",
      "beats": [
        {
          "after": "act",
          "show": 1
        },
        {
          "after": "tools",
          "show": 2
        },
        {
          "after": "JSON",
          "show": 3
        },
        {
          "after": "weather",
          "show": 4
        },
        {
          "after": "chef",
          "show": 5
        },
        {
          "after": "agent",
          "show": 6
        },
        {
          "after": "ReAct",
          "show": 7
        },
        {
          "after": "refunds",
          "show": 8
        },
        {
          "after": "guardrails",
          "show": 9
        },
        {
          "after": "operators",
          "show": 10
        },
        {
          "after": "responsibility",
          "show": 11
        }
      ]
    },
    {
      "id": 17,
      "slug": "production",
      "title": "Cost, latency, evaluation, safety",
      "narration": "You have a working prototype. The demo wows the team. Time to ship. And now... reality kicks in. Production GenAI lives in a triangle of three pressures. Cost. Latency. Quality. You can usually have two. Rarely all three. Let us go through them. Cost. Tokens add up. Fast. A long-context query at premium pricing can hit dollars per call. The fixes are well-known. Cache aggressively. Trim prompts. Use smaller models for simpler tasks. Compress context. Batch when possible. If your bill scares you, you have not optimized yet. Latency. A two-second response feels slow. A six-second response feels broken. The fixes. Stream tokens to the user as they arrive. Pre-compute embeddings. Use distilled, smaller models for the easy tier. Move expensive reasoning off the critical path. Evaluation. Here is the hard truth. You cannot manage what you cannot measure. So before you ship, build a golden set. A few hundred curated inputs with known good outputs. Run them automatically on every model change. Use LLM as judge for fuzzy quality. Use human review for the highest-stakes flows. Without evals, you are flying blind. Safety. Models can be jailbroken. Made to leak data. Tricked into producing biased or harmful content. The fixes. Filter inputs. Filter outputs. Monitor for prompt injection. Strip personally identifiable information. Red-team before launch. Have an incident playbook. The road ahead is exciting. Models are getting smaller, faster, and cheaper. Distilled models like seven-billion-parameter open weights now match what required hundreds of billions a year ago. Inference is moving to the edge. To phones. To laptops. The future of GenAI is not just bigger frontier models. It is also leaner, more focused, more reliable production systems. The big idea is this. GenAI is moving from labs to engineering. The novelty is over. The discipline is just beginning. So that is your two-hour tour. 
From tokens, to transformers, to training, to multimodal, to agents, to production. You now know more about how this technology actually works than ninety-nine percent of the people who use it every day. Use that knowledge wisely. Build something delightful. And remember... the model is the easy part. The hard part... is everything around it. Thank you for joining me. And good luck out there.",
      "beats": [
        {
          "after": "prototype",
          "show": 1
        },
        {
          "after": "triangle",
          "show": 2
        },
        {
          "after": "Cost",
          "show": 3
        },
        {
          "after": "Latency",
          "show": 4
        },
        {
          "after": "stream",
          "show": 5
        },
        {
          "after": "Evaluation",
          "show": 6
        },
        {
          "after": "golden",
          "show": 7
        },
        {
          "after": "Safety",
          "show": 8
        },
        {
          "after": "jailbroken",
          "show": 9
        },
        {
          "after": "edge",
          "show": 10
        },
        {
          "after": "engineering",
          "show": 11
        },
        {
          "after": "delightful",
          "show": 12
        },
        {
          "after": "luck",
          "show": 13
        }
      ]
    }
  ]
}