How I Used AI to Build YouTube Quizr - Part One

With 20+ years of dev experience, I'm an online educator, mentor, and consultant--teaching practical, advanced topics to developers like you on how to write: ✅ new AI/LLM features ✅ maintainable JS code ✅ scalable web apps ...yup that's about it! 🎤

If you've ever watched a great YouTube video and thought, "I wish I could test myself on this later," you already get the product idea. I turned that into a small app called YouTube Quizr.

The pitch is simple: paste a YouTube URL, wait a bit, get a quiz. Not random trivia. Not a summary dressed up as education. A real multiple-choice quiz grounded in the transcript, with timestamps so you can jump straight to the part of the video that backs each answer. That last requirement is where the engineering starts: "sounds smart" is not the same as "you can verify it."

The Version That Ships in Your Head

The first version of this app sounds embarrassingly easy:

  • Download transcript

  • Send it to an LLM

  • Ask for five questions

  • Ship it

That version would "work." It would also be flaky, expensive, hard to reason about at scale, and weak on longer videos. The real story isn't "I used AI to generate quiz questions." It's that I ended up building a system around the AI so the output would actually be worth using.

A typical one-shot prompt looks like this:

Here is a YouTube transcript. Generate a summary, then a quiz with five multiple-choice questions.

That asks the model to understand the whole transcript, pick topics, write questions and distractors, stay grounded, and return something parseable, all in one pass. That is too many responsibilities for a single LLM step to handle reliably.

Why Quizzes, and Why Timestamps

Quizzes turn passive watching into active learning: you commit to an answer, then check whether you understood the material. Timestamps matter because if you cannot jump back to the source when a question is unclear or you disagree with the answer key, you cannot really verify the quiz's accuracy against the video. From day one, I wanted questions that were:

  • Based on actual transcript content

  • Anchored to a real time range in the video

  • Structured enough that the rest of the app could validate and trust them

That sounds obvious. But it changed the architecture a lot.

One Prompt Was Never Going to Be Enough

Many AI demos lean on one large prompt. That is fragile: when it fails, the whole result fails with it. Long transcripts make it worse:

  • More tokens and cost

  • More room for the model to drift or repeat themes

  • Harder to tie every question to a specific slice of the source

So the transcript stops being a single blob and becomes something you process in stages. Instead of "prompt in, questions out," I built a pipeline. At a high level it will:

  • Fetch and normalize the transcript

  • Optionally clean up grammar without breaking timing / line alignment

  • Break the transcript into meaningful *segments ("text semantics")

  • Generate draft questions per segment (often in parallel)

  • Normalize topic labels and key terms across drafts

  • Rank, trim, validate, and assemble the final quiz

*Segmentation here means splitting the transcript into topical chunks, not CV-style image masks.
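The stages above can be sketched as a chain of small functions. This is a minimal sketch under stated assumptions, not the actual YouTube Quizr code: every type and function name here (`TranscriptLine`, `segment`, `draftQuestion`, `assemble`, `runPipeline`) is hypothetical, the segmenter is a naive fixed-size chunker standing in for real topical segmentation, and the LLM call is replaced by a stub.

```typescript
// Hypothetical shapes for illustration; not the app's real schemas.
interface TranscriptLine {
  index: number;    // position in the normalized transcript
  startSec: number; // start time of the line, in seconds
  text: string;
}

interface Segment {
  id: number;
  lines: TranscriptLine[];
}

interface DraftQuestion {
  segmentId: number;
  prompt: string;
  lineRefs: number[]; // transcript line indexes the question relies on
}

// Stage 1: collapse whitespace and drop empty lines, keeping timing intact.
function normalize(raw: { startSec: number; text: string }[]): TranscriptLine[] {
  return raw
    .map((l) => ({ startSec: l.startSec, text: l.text.replace(/\s+/g, " ").trim() }))
    .filter((l) => l.text.length > 0)
    .map((l, index) => ({ index, startSec: l.startSec, text: l.text }));
}

// Stage 2: stand-in segmenter. Fixed-size chunks, NOT topical segmentation.
function segment(lines: TranscriptLine[], size: number): Segment[] {
  const segments: Segment[] = [];
  for (let i = 0; i < lines.length; i += size) {
    segments.push({ id: segments.length, lines: lines.slice(i, i + size) });
  }
  return segments;
}

// Stage 3: stub for the per-segment LLM call.
function draftQuestion(seg: Segment): DraftQuestion {
  return {
    segmentId: seg.id,
    prompt: `What is discussed around "${seg.lines[0].text}"?`,
    lineRefs: seg.lines.map((l) => l.index),
  };
}

// Stage 4: rank and trim. "Rank" here is simply "more supporting lines first".
function assemble(drafts: DraftQuestion[], limit: number): DraftQuestion[] {
  return drafts
    .slice()
    .sort((a, b) => b.lineRefs.length - a.lineRefs.length)
    .slice(0, limit);
}

// The whole pipeline: each stage only sees the output of the previous one.
function runPipeline(raw: { startSec: number; text: string }[], limit: number): DraftQuestion[] {
  const segments = segment(normalize(raw), 2);
  return assemble(segments.map(draftQuestion), limit);
}
```

Because each stage has one job and a typed input and output, any of them can be swapped (for example, replacing the chunker with a smarter segmenter) without touching the others.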

The LLM isn't the app. It's one component inside a system that decides what context to give it, what shape it is allowed to return, what happens when it fails, and how much the UI should trust the result.

"Grounded" Beats "Open-Ended" Here

In many consumer AI products, the appeal (and danger) is flexibility. Here the important idea is constraint: each question should come exclusively from transcript evidence--not even from the model's own training data. The pipeline tracks several details for each question, such as:

  • Which transcript lines the question used

  • Whether line references stay inside the assigned segment

  • Whether timestamps can be derived from those lines

  • Whether the final payload matches the expected schema

That is closer to treating the LLM as a bounded worker than as a general chat surface.
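To make those checks concrete, here is a minimal sketch of a grounding validator. The shapes (`TimedLine`, `SegmentRange`, `GroundedDraft`) and the function `checkGrounding` are assumptions made up for this example, not the app's real schema; the point is only that each item in the list above can be an explicit, testable check.

```typescript
// Hypothetical shapes for illustration only.
interface TimedLine {
  index: number;
  startSec: number;
  endSec: number;
}

interface SegmentRange {
  firstLine: number; // inclusive
  lastLine: number;  // inclusive
}

interface GroundedDraft {
  question: string;
  options: string[];
  answerIndex: number;
  lineRefs: number[]; // transcript lines the model claims as evidence
}

type GroundingResult =
  | { ok: true; startSec: number; endSec: number }
  | { ok: false; reason: string };

function checkGrounding(
  draft: GroundedDraft,
  segment: SegmentRange,
  transcript: TimedLine[],
): GroundingResult {
  // Schema-level checks: is this a usable multiple-choice question?
  if (draft.options.length < 2) return { ok: false, reason: "too few options" };
  const validAnswer =
    Number.isInteger(draft.answerIndex) &&
    draft.answerIndex >= 0 &&
    draft.answerIndex < draft.options.length;
  if (!validAnswer) return { ok: false, reason: "answer index out of range" };

  // Evidence checks: the question must cite lines, all inside its segment.
  if (draft.lineRefs.length === 0) return { ok: false, reason: "no transcript evidence" };
  const outside = draft.lineRefs.some((i) => i < segment.firstLine || i > segment.lastLine);
  if (outside) return { ok: false, reason: "line reference outside assigned segment" };

  // Timestamp derivation: the time range comes from the cited lines,
  // never from a timestamp the model wrote itself.
  let startSec = Infinity;
  let endSec = -Infinity;
  for (const ref of draft.lineRefs) {
    const line = transcript.find((l) => l.index === ref);
    if (!line) return { ok: false, reason: `line ${ref} not found in transcript` };
    startSec = Math.min(startSec, line.startSec);
    endSec = Math.max(endSec, line.endSec);
  }
  return { ok: true, startSec, endSec };
}
```

A draft that fails any check can be dropped or retried without affecting questions from other segments.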

The Monorepo Paid Off

The repo is split into packages so the same generation logic powers the site and the CLI:

  • @yt-quizr/shared: schemas, types, YouTube helpers

  • @yt-quizr/service: transcript fetch, prompts, providers, generateQuiz

  • @yt-quizr/web: Next.js UI and API routes

  • @yt-quizr/cli: command-line generation

That let me treat quiz generation as a service instead of something welded into one app. The web layer is mostly orchestration and UX; the CLI reuses the same pipeline; shared types keep contracts honest. Now the messy middle of the implementation can change without rewriting every caller, and a new client (e.g. a mobile app) can be added on top of the same service.
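The shape of that split can be sketched in a single file. This is an illustration under assumptions, not the real package contents: the post only names a `generateQuiz` function, so its signature here, the `Quiz` and `QuizProvider` types, and the two callers (`handleApiRequest`, `runCli`) are all hypothetical.

```typescript
// --- would live in a shared package (hypothetical shapes) ---
interface QuizQuestion {
  prompt: string;
  startSec: number;
}

interface Quiz {
  videoId: string;
  questions: QuizQuestion[];
}

// Abstracts the LLM vendor so callers never depend on it directly.
interface QuizProvider {
  draft(videoId: string): Promise<QuizQuestion[]>;
}

// --- would live in the service package ---
// Hypothetical signature: the real generateQuiz may take different arguments.
async function generateQuiz(videoId: string, provider: QuizProvider): Promise<Quiz> {
  const questions = await provider.draft(videoId);
  return { videoId, questions };
}

// --- thin callers: each one only adapts input and output ---

// Web: request body in, JSON string out.
async function handleApiRequest(body: { videoId: string }, provider: QuizProvider): Promise<string> {
  return JSON.stringify(await generateQuiz(body.videoId, provider));
}

// CLI: argument list in, printable text out.
async function runCli(argv: string[], provider: QuizProvider): Promise<string> {
  const quiz = await generateQuiz(argv[0], provider);
  return quiz.questions.map((q, i) => `${i + 1}. ${q.prompt}`).join("\n");
}
```

Both callers compile against the same `Quiz` type, so a change to the contract surfaces as a type error in every client instead of a runtime surprise in one of them.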

AI Quality Is Mostly a Systems Problem

People often assume model choice is the hard part. It matters, but the higher-leverage work was:

  • Choosing what the model sees (and what it does not)

  • Splitting work into stages with clear responsibilities

  • Running independent work in parallel where it makes sense

  • Validating structured output instead of scraping free-form text

  • Handling partial failures without poisoning the whole run

Quality came less from hunting a "perfect" model and prompt, and more from building rails around adequate ones. Same pattern as hardening any other external dependency.
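Two of those rails, parallel per-segment work and structured-output validation, can be sketched together. This is a hedged example, not the app's implementation: `runPerSegment` and `parseDraft` are hypothetical names, and the draft shape (`question`, `options`) is an assumption. It uses `Promise.all` with a per-promise error handler so that one bad segment becomes a failure record instead of rejecting the whole batch; `Promise.allSettled` would be another way to get the same effect.

```typescript
interface WorkerSuccess<T> {
  ok: true;
  segmentId: number;
  value: T;
}

interface WorkerFailure {
  ok: false;
  segmentId: number;
  error: string;
}

type WorkerResult<T> = WorkerSuccess<T> | WorkerFailure;

// Runs one worker per segment in parallel. Results keep the input order.
async function runPerSegment<T>(
  segmentIds: number[],
  worker: (segmentId: number) => Promise<T>,
): Promise<WorkerResult<T>[]> {
  return Promise.all(
    segmentIds.map((segmentId) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      Promise.resolve()
        .then(() => worker(segmentId))
        .then(
          (value): WorkerResult<T> => ({ ok: true, segmentId, value }),
          (err: unknown): WorkerResult<T> => ({
            ok: false,
            segmentId,
            error: err instanceof Error ? err.message : String(err),
          }),
        ),
    ),
  );
}

// Structured-output check: parse JSON and verify the shape explicitly,
// rather than scraping a question out of free-form text.
function parseDraft(raw: string): { question: string; options: string[] } {
  const data: unknown = JSON.parse(raw); // throws on non-JSON output
  if (typeof data !== "object" || data === null) throw new Error("not an object");
  const { question, options } = data as { question?: unknown; options?: unknown };
  if (typeof question !== "string") throw new Error("missing question");
  if (!Array.isArray(options) || !options.every((o) => typeof o === "string")) {
    throw new Error("options must be an array of strings");
  }
  return { question, options: options as string[] };
}
```

A caller can then keep the successful drafts, log or retry the failures, and still assemble a quiz from whatever survived.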

Why This Became Long-Running Work

Pipeline thinking has a side effect. Once you run multiple structured steps, possibly in parallel, with retries, you are not building a short request-response feature. You are building a long-running task. Browsers and long-lived HTTP requests do not mix well. Serverless backends enforce timeouts. API and transcript calls can fail. Users should not stare at a blank page for several minutes while work finishes in the background.

So a "good AI quiz generator" quickly became a good async job system wrapped around the transcript pipeline. That lines up with something I have written about before: not everything belongs in one blocking call. (For more on the mindset of parallelization, see Async by default: JavaScript++.)
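The core of that async job idea is small: start the work, hand back an id right away, and let the client poll for status. Here is a minimal sketch, assuming a hypothetical `JobStore` class and `JobState` type invented for this example. The in-memory `Map` is for illustration only; on a serverless backend the state would need to live somewhere durable such as a database or queue.

```typescript
type JobState<T> =
  | { status: "running"; completedSteps: number; totalSteps: number }
  | { status: "done"; result: T }
  | { status: "failed"; error: string };

class JobStore<T> {
  private jobs = new Map<string, JobState<T>>();
  private nextId = 1;

  // Returns immediately with an id; the steps keep running in the background.
  start(steps: Array<() => Promise<void>>, finish: () => T): string {
    const id = `job-${this.nextId++}`;
    this.jobs.set(id, { status: "running", completedSteps: 0, totalSteps: steps.length });
    void this.run(id, steps, finish);
    return id;
  }

  // What a polling endpoint would return for a given job id.
  get(id: string): JobState<T> | undefined {
    return this.jobs.get(id);
  }

  private async run(
    id: string,
    steps: Array<() => Promise<void>>,
    finish: () => T,
  ): Promise<void> {
    try {
      for (let i = 0; i < steps.length; i++) {
        await steps[i]();
        // Progress is recorded after each step so the UI can show it.
        this.jobs.set(id, { status: "running", completedSteps: i + 1, totalSteps: steps.length });
      }
      this.jobs.set(id, { status: "done", result: finish() });
    } catch (err) {
      // A failed step marks the job as failed instead of crashing the caller.
      const error = err instanceof Error ? err.message : String(err);
      this.jobs.set(id, { status: "failed", error });
    }
  }
}
```

With this shape, the request that starts a quiz returns in milliseconds, and the page can show progress ("2 of 5 steps") while polling instead of holding one HTTP request open.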

Where the Real Engineering Starts

Part one is the perspective shift: the project stopped being "generate me a quiz with AI" and became "design a reliable pipeline that turns messy real-world transcripts into structured, verifiable questions."

In part two, I dig deep into the pipeline itself: text semantic segmentation, per-segment workers, structured outputs, and why splitting the work is what really made the quality jump.