How I Used AI to Build YouTube Quizr - Part One

If you've ever watched a great YouTube video and thought, "I wish I could test myself on this later," you already get the product idea. I turned that into a small app called YouTube Quizr.
The pitch is simple: paste a YouTube URL, wait a bit, get a quiz. Not random trivia. Not a summary dressed up as education. A real multiple-choice quiz grounded in the transcript, with timestamps so you can jump straight to the part of the video that backs each answer. That last requirement is where the engineering starts: "sounds smart" is not the same as "you can verify it."
The Version That Ships in Your Head
The first version of this app sounds embarrassingly easy:
Download transcript
Send it to an LLM
Ask for five questions
Ship it
That version would "work." It would also be flaky, expensive, hard to reason about at scale, and weak on longer videos. The real story isn't "I used AI to generate quiz questions." It's that I ended up building a system around the AI so the output would actually be worth using.
A typical one-shot prompt looks like this:
Here is a YouTube transcript. Generate a summary, then a quiz with five multiple-choice questions.
That asks the model to understand the whole transcript, pick topics, write questions and distractors, stay grounded, and return something parseable, all in one pass. That is too many responsibilities for a single LLM step to handle reliably.
Why Quizzes, and Why Timestamps
Quizzes turn passive watching into active learning: you commit to an answer, then check whether you understood the material. Timestamps matter because verification depends on them: if you cannot jump back to the source when a question is unclear or you disagree with the answer key, you cannot check the quiz against the video. From day one, I wanted questions that were:
Based on actual transcript content
Anchored to a real time range in the video
Structured enough that the rest of the app could validate and trust them
That sounds obvious. But it changed the architecture a lot.
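One way to make those three requirements concrete is to put them into the question type itself. The sketch below is an assumption about what such a shape could look like, not the app's actual schema; every name in it (`TranscriptLine`, `QuizQuestion`, `sourceLineIds`, `timeRangeFor`) is hypothetical.

```typescript
// Hypothetical shapes for illustration; the real schema may differ.
interface TranscriptLine {
  id: number;       // position of the line in the transcript
  startSec: number; // when the line starts in the video
  endSec: number;   // when the line ends
  text: string;
}

interface QuizQuestion {
  prompt: string;
  choices: string[];
  correctIndex: number;    // index into `choices`
  sourceLineIds: number[]; // transcript lines that back the answer
}

// Derive the time range from the cited lines, so the timestamp is
// computed from the evidence instead of being supplied by the model.
function timeRangeFor(
  question: QuizQuestion,
  lines: TranscriptLine[],
): { startSec: number; endSec: number } | null {
  const cited = lines.filter((l) => question.sourceLineIds.includes(l.id));
  if (cited.length === 0) return null; // nothing to anchor to
  return {
    startSec: Math.min(...cited.map((l) => l.startSec)),
    endSec: Math.max(...cited.map((l) => l.endSec)),
  };
}
```

The design choice worth noting is that the question stores line references, not timestamps. If the references are wrong, the range cannot be computed and the question can be rejected.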
One Prompt Was Never Going to Be Enough
Many AI demos lean on one large prompt. That is fragile: when the single call fails, the whole result fails with it. Long transcripts make it worse:
More tokens and cost
More room for the model to drift or repeat themes
Harder to tie every question to a specific slice of the source
So the transcript stops being a single blob and becomes something you process in stages. Instead of "prompt in, questions out," I built a pipeline. At a high level it will:
Fetch and normalize the transcript
Optionally clean up grammar without breaking timing / line alignment
Break the transcript into meaningful segments* (semantic text segmentation)
Generate draft questions per segment (often in parallel)
Normalize topic labels and key terms across drafts
Rank, trim, validate, and assemble the final quiz
*Segmentation here means splitting the transcript into topical chunks, not computer-vision-style image masks.
The LLM isn't the app. It's one component inside a system that decides what context to give it, what shape it is allowed to return, what happens when it fails, and how much the UI should trust the result.
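The stages listed above can be sketched as a chain of small functions. This is a simplified illustration under stated assumptions, not the project's code: all names are hypothetical, a fixed-size split stands in for topical segmentation, the LLM call is replaced by a stub, and the grammar cleanup and label normalization stages are left out.

```typescript
// Hypothetical stage names; each stub stands in for real work.
type Line = { id: number; text: string };
type Segment = { lines: Line[] };
type Draft = { segmentIndex: number; prompt: string };

// Normalize: trim whitespace and drop empty lines.
function normalize(raw: string[]): Line[] {
  return raw
    .map((text) => text.trim())
    .filter((text) => text.length > 0)
    .map((text, id) => ({ id, text }));
}

// Segment: a fixed-size split stands in for topical segmentation.
function segment(lines: Line[], size: number): Segment[] {
  const out: Segment[] = [];
  for (let i = 0; i < lines.length; i += size) {
    out.push({ lines: lines.slice(i, i + size) });
  }
  return out;
}

// Draft: one question per segment. A real version would call an LLM here.
async function draftFor(seg: Segment, segmentIndex: number): Promise<Draft> {
  return { segmentIndex, prompt: `What is said about: "${seg.lines[0].text}"?` };
}

// Assemble: keep at most `limit` drafts, in video order.
function assemble(drafts: Draft[], limit: number): Draft[] {
  return [...drafts]
    .sort((a, b) => a.segmentIndex - b.segmentIndex)
    .slice(0, limit);
}

async function runPipeline(raw: string[], limit: number): Promise<Draft[]> {
  const lines = normalize(raw);
  const segments = segment(lines, 2);
  // Segments are independent, so their drafts can be produced in parallel.
  const drafts = await Promise.all(segments.map(draftFor));
  return assemble(drafts, limit);
}
```

Because each stage has a typed input and output, any one of them can be replaced (for example, swapping the fixed-size split for a real segmenter) without touching the others.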
"Grounded" Beats "Open-Ended" Here
In many consumer AI products, the appeal (and danger) is flexibility. Here the important idea is constraint: each question should come exclusively from transcript evidence, not from the model's own training data. For each question, the pipeline tracks details such as:
Which transcript lines the question used
Whether line references stay inside the assigned segment
Whether timestamps can be derived from those lines
Whether the final payload matches the schema
That is closer to treating the LLM as a bounded worker than as a general chat surface.
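A minimal sketch of what those checks might look like, assuming the model returns untyped JSON and each worker is assigned an inclusive range of line ids. The names (`Segment`, `DraftQuestion`, `validateDraft`) are hypothetical, and the hand-written shape check stands in for whatever schema validation the app really uses.

```typescript
// Hypothetical names throughout; a sketch of the checks, not the app's code.
interface Segment {
  firstLineId: number; // inclusive
  lastLineId: number;  // inclusive
}

interface DraftQuestion {
  prompt: string;
  choices: string[];
  correctIndex: number;
  sourceLineIds: number[];
}

// Runtime shape check: model output arrives as untyped JSON.
function isDraftQuestion(value: unknown): value is DraftQuestion {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.prompt === "string" &&
    Array.isArray(v.choices) &&
    v.choices.every((c) => typeof c === "string") &&
    typeof v.correctIndex === "number" &&
    Array.isArray(v.sourceLineIds) &&
    v.sourceLineIds.every((id) => Number.isInteger(id))
  );
}

// Returns a list of problems; an empty list means the draft passes.
function validateDraft(value: unknown, segment: Segment): string[] {
  if (!isDraftQuestion(value)) return ["payload does not match the schema"];
  const problems: string[] = [];
  if (value.sourceLineIds.length === 0) {
    problems.push("no transcript lines cited, so no timestamp can be derived");
  }
  const outside = value.sourceLineIds.filter(
    (id) => id < segment.firstLineId || id > segment.lastLineId,
  );
  if (outside.length > 0) {
    problems.push(`line references outside the segment: ${outside.join(", ")}`);
  }
  const idx = value.correctIndex;
  if (!Number.isInteger(idx) || idx < 0 || idx >= value.choices.length) {
    problems.push("correctIndex does not point at a choice");
  }
  return problems;
}
```

Returning a list of problems instead of throwing makes it possible to log why a draft was dropped and move on to the next one.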
The Monorepo Paid Off
The repo is split into packages so the same generation logic powers the site and the CLI:
@yt-quizr/shared: schemas, types, YouTube helpers
@yt-quizr/service: transcript fetch, prompts, providers, generateQuiz
@yt-quizr/web: Next.js UI and API routes
@yt-quizr/cli: command-line generation
That let me treat quiz generation as a service instead of something welded into one app. The web layer is mostly orchestration and UX; the CLI reuses the same pipeline; shared types keep contracts honest. Now the messy middle of the implementation can change without rewriting every caller, and I can add a new client (e.g., a mobile app) without duplicating the pipeline.
AI Quality Is Mostly a Systems Problem
People often assume model choice is the hard part. It matters, but the higher-leverage work was:
Choosing what the model sees (and what it does not)
Splitting work into stages with clear responsibilities
Running independent work in parallel where it makes sense
Validating structured output instead of scraping free-form text
Handling partial failures without poisoning the whole run
Quality came less from hunting a "perfect" model and prompt, and more from building rails around adequate ones. Same pattern as hardening any other external dependency.
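Parallel work with contained failures is the easiest of these rails to show in code. The sketch below assumes one async worker per segment; `runAll`, `Outcome`, and `fakeWorker` are hypothetical names, and the worker is a stand-in for a real model call.

```typescript
// Hypothetical helper: run one worker per segment in parallel and keep
// the successes even when some segments fail.
type Outcome<T> =
  | { ok: true; index: number; value: T }
  | { ok: false; index: number; error: string };

async function runAll<S, T>(
  segments: S[],
  worker: (segment: S) => Promise<T>,
): Promise<Outcome<T>[]> {
  return Promise.all(
    segments.map(async (segment, index): Promise<Outcome<T>> => {
      try {
        return { ok: true, index, value: await worker(segment) };
      } catch (err) {
        // A failed segment is recorded, not rethrown, so it cannot
        // reject the whole batch.
        const error = err instanceof Error ? err.message : String(err);
        return { ok: false, index, error };
      }
    }),
  );
}

// Stand-in worker for the example: fails on empty segments.
async function fakeWorker(text: string): Promise<string> {
  if (text.length === 0) throw new Error("empty segment");
  return `Question about ${text}`;
}
```

Calling `runAll(["intro", "", "outro"], fakeWorker)` resolves with three outcomes: two successes and one recorded failure for the empty segment. The caller can then decide whether two good segments are enough to assemble a quiz or whether to retry the failed one.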
Why This Became Long-Running Work
Pipeline thinking has a side effect. Once you run multiple structured steps, possibly in parallel, with retries, you are not building a short request-response feature. You are building a long-running task. Browsers and long-lived HTTP requests do not mix well. Serverless backends enforce timeouts. APIs and transcript calls throw errors. Users should not stare at a blank page for several minutes while work finishes in the background.
So a "good AI quiz generator" quickly became a good async job system wrapped around the transcript pipeline. That lines up with something I have written about before: not everything belongs in one blocking call. (For more on the mindset of parallelization, see Async by default: JavaScript++.)
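The core of that shift is small: start the work, hand back an id immediately, and let the client poll for status. The sketch below is an in-memory illustration only, with hypothetical names (`JobStore`, `Job`). It assumes a single long-lived process, which a serverless backend does not provide; a real deployment would need durable storage and a worker that outlives the request.

```typescript
// Hypothetical in-memory job store, for illustration only.
type Job<T> =
  | { status: "running" }
  | { status: "done"; result: T }
  | { status: "failed"; error: string };

class JobStore<T> {
  private jobs = new Map<string, Job<T>>();
  private nextId = 1;

  // Start the work and return an id right away; the task is not awaited.
  start(task: () => Promise<T>): string {
    const id = `job-${this.nextId++}`;
    this.jobs.set(id, { status: "running" });
    task().then(
      (result) => this.jobs.set(id, { status: "done", result }),
      (err) => this.jobs.set(id, { status: "failed", error: String(err) }),
    );
    return id;
  }

  // What a status endpoint would return when the client polls.
  get(id: string): Job<T> | undefined {
    return this.jobs.get(id);
  }
}
```

With this shape, the request that starts a quiz returns as soon as the id exists, and the page can show progress by polling `get(id)` until the status leaves `"running"`.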
Where the Real Engineering Starts
Part one is the perspective shift: the project stopped being "generate me a quiz with AI" and became "design a reliable pipeline that turns messy real-world transcripts into structured, verifiable questions."
In part two, I dig deep into the pipeline itself: semantic text segmentation, per-segment workers, structured outputs, and why splitting the work is what really made the quality jump.



