How I Used AI to Build YouTube Quizr

In part one, I talked about the high-level shift that made this project work: stop treating the LLM like the whole product, and start treating it like one component inside a carefully designed system. Now I want to get into the part that actually made the output quality spike.

The core trick wasn't "use AI." It was: don't ask the AI to do too much in one shot. (If you've ever watched an LLM confidently do multiple jobs badly at once, that's what I'm talking about.)

The Naive Prompt (and Why It Falls Apart)

Here's the obvious first version:

Here is a YouTube transcript. Generate a summary, then a quiz with five multiple-choice questions.

Simple. Reasonable. Also not very good. The problem isn't the model. It's that the job is massively underspecified, and you're asking the model to do everything at once:

Understand the full transcript
Find topic boundaries on its own
Decide what's worth asking about
Write plausible distractors
Keep answers grounded in the source
Return something your app can actually parse

That's not one task. That's a workflow. So I built a workflow instead.

The Pipeline

The quiz generation path in YouTube Quizr moves through seven stages:

Fetch transcript and video metadata (with a fallback relay for hostile datacenter environments)
Optional grammar cleanup to normalize messy auto-generated captions
Segmentation to split the transcript into topical ranges
Parallel segment workers to generate draft questions per segment
Meta-normalize to align topic labels and key terms across all drafts
Rank and trim the candidate pool
Validate and assemble the final quiz object

Each stage has one job. It can fail independently. The model never has to hold the whole problem in its head at once. Think of it as a refactor where you stop passing a "god object" around, and start passing small, typed arguments. Same energy.
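To make "it can fail independently" concrete, here's a minimal sketch of how an optional stage might be wrapped. This is an illustration, not the actual YouTube Quizr code: `run_optional_stage` and `flaky_cleanup` are hypothetical names, and the real pipeline may handle failures differently.

```python
from typing import Callable

Lines = list[str]


def run_optional_stage(
    stage: Callable[[Lines], Lines], lines: Lines, errors: list[str]
) -> Lines:
    """Run an optional stage; if it raises, keep the input and record why."""
    try:
        return stage(lines)
    except Exception as exc:
        errors.append(f"{stage.__name__}: {exc}")
        return lines


def flaky_cleanup(lines: Lines) -> Lines:
    # Hypothetical stand-in for a model call that fails.
    raise TimeoutError("model call timed out")


errors: list[str] = []
result = run_optional_stage(flaky_cleanup, ["so um today we talk about auth"], errors)
```

The stage takes and returns a plain list of lines, so a failure costs you the cleanup and nothing else. The next stage still gets usable input.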

Getting the Transcript (Harder Than It Sounds)

Transcript ingestion sounds straightforward: call a library, get captions, move on. But real deployments are rude. Some datacenter environments just don't talk to YouTube as cleanly as your laptop does, and sometimes fetches fail or return partial metadata. A production environment has a way of turning "two lines of code" into "why does this only work when I'm on coffee shop Wi-Fi?"

So the service tries a direct fetch first, then falls back to a small relay service when one is configured. Not glamorous, but it's the kind of thing that separates "works in a demo" from "works in production." (You know which one pays the bills.)
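The direct-then-relay logic can be sketched roughly like this. The fetchers are passed in as plain functions so the example stays self-contained; `fetch_transcript`, `blocked_direct`, and `fake_relay` are made-up names for illustration, not a real transcript library's API.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], list[str]]


def fetch_transcript(
    video_id: str, direct: Fetcher, relay: Optional[Fetcher] = None
) -> list[str]:
    """Try the direct fetch first; use the relay only if one is configured."""
    try:
        return direct(video_id)
    except Exception:
        if relay is None:
            raise  # no relay configured: surface the original error
        return relay(video_id)


def blocked_direct(video_id: str) -> list[str]:
    # Hypothetical: simulates a datacenter IP being refused.
    raise ConnectionError("request blocked")


def fake_relay(video_id: str) -> list[str]:
    # Hypothetical: simulates the relay service returning captions.
    return [f"[relay] line 1 of {video_id}"]


lines = fetch_transcript("abc123", blocked_direct, fake_relay)
```

With no relay configured, the original error propagates unchanged, which keeps local debugging honest.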

Grammar Cleanup (With a Constraint)

Auto-generated captions can be a mess. Weird or no punctuation, run-on sentences, ASR artifacts that are technically readable but hard to turn into clean questions. The grammar-fix step lets the model clean all of that up... but with one firm rule: it isn't allowed to change the timing structure or line count.

That constraint is important. Every downstream stage depends on being able to map a question back to a specific transcript line and derive a timestamp from it. Give the model permission to freely rewrite and you break that alignment. So the prompt makes the constraint explicit, and the pipeline enforces it. Basically: you can tidy the room, you can't rearrange the furniture.
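One way to enforce that rule in code, as a sketch: only ever send text to the model, keep timestamps on your side, and reject any response whose line count changed. The `Line` type and `apply_cleanup` function are assumptions for illustration, not the project's actual data model.

```python
from typing import NamedTuple


class Line(NamedTuple):
    start: float  # seconds from the start of the video
    text: str


def apply_cleanup(original: list[Line], cleaned_texts: list[str]) -> list[Line]:
    """Accept cleaned text only if the line count is unchanged.

    Timestamps always come from the original lines, never from the model.
    """
    if len(cleaned_texts) != len(original):
        return original  # alignment broken: fall back to the raw captions
    return [
        Line(old.start, new.strip() or old.text)  # keep old text if a line came back empty
        for old, new in zip(original, cleaned_texts)
    ]


raw = [Line(0.0, "so um today"), Line(2.5, "we talk about auth")]
tidy = apply_cleanup(raw, ["So, today", "we talk about auth."])
rejected = apply_cleanup(raw, ["So, today we talk about auth."])
```

Here `tidy` carries the new text with the old timestamps, while `rejected` is just `raw` again because the model merged two lines into one.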

Segmentation: Where It Starts Feeling Smart

Once the transcript is clean, the pipeline segments it into contiguous topical blocks. A 15-minute video might have 6-8 natural topic shifts, and the best question buried around minute 11 simply isn't visible if you're thinking at the full-transcript level.

Semantic segmentation gives the rest of the pipeline:

Smaller working contexts so each worker call stays focused
Better topic coverage since each segment represents a distinct area
Natural parallelism because independent segments can run concurrently

That last point matters a lot for performance. Question generation becomes a map operation instead of one big serialized call. If you've ever parallelized a slow for-loop and felt a little smug about it, yup - same dopamine hit.
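Here's a small sketch of that map shape using the standard library's thread pool. It assumes the segmentation step has already produced segment start indices; `to_ranges`, `draft_for_segment`, and `draft_all` are hypothetical names, and the worker body is a placeholder where a model call would go.

```python
from concurrent.futures import ThreadPoolExecutor


def to_ranges(starts: list[int], total_lines: int) -> list[tuple[int, int]]:
    """Turn sorted segment start indices into contiguous, inclusive (first, last) ranges."""
    ends = [s - 1 for s in starts[1:]] + [total_lines - 1]
    return list(zip(starts, ends))


def draft_for_segment(segment_lines: list[str]) -> list[str]:
    # Placeholder for the per-segment model call.
    return [f"Q about: {segment_lines[0]}"]


def draft_all(
    lines: list[str], ranges: list[tuple[int, int]], max_workers: int = 4
) -> list[list[str]]:
    """Run one worker per segment concurrently; results keep segment order."""
    chunks = [lines[first : last + 1] for first, last in ranges]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(draft_for_segment, chunks))


lines = [f"line {i}" for i in range(8)]
ranges = to_ranges([0, 3, 5], len(lines))
drafts = draft_all(lines, ranges)
```

Threads are a reasonable fit here because the workers would spend most of their time waiting on network calls. An async client would work just as well.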

Segment Workers: The Real Drafting Stage

This is my favorite part. Each segment gets its own worker call, and that worker sees only its assigned transcript range. The prompt is deliberately narrow:

You are a teacher creating quiz questions from one contiguous segment of a video transcript.

Segment 3 of 6 (lines 48-71, time 04:12-06:05).

Rules:
- Use only the supplied transcript segment as evidence.
- Don't infer unstated facts or outside knowledge.
- Each question's line references must fall within lines 48-71.
- Return 0 to 2 questions. Returning zero is fine if the segment lacks testable content.

The worker returns structured JSON validated against a schema, not loose prose. If one worker produces bad output, it doesn't poison the whole run. If a segment has nothing interesting worth asking, returning zero questions is fine! That resilience is something you simply can't get from a single monolithic prompt. One bad apple shouldn't spoil the whole JSON blob. 🍎✨
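As a rough sketch of that validation step, here's a hand-rolled check using only the standard library. The field names (`questions`, `options`, `answer_index`, `line_refs`) are assumptions, not the real schema, and a schema library would do this more thoroughly.

```python
import json


def parse_worker_output(
    raw: str, first_line: int, last_line: int, max_questions: int = 2
) -> list[dict]:
    """Return only well-formed questions whose line refs stay inside the segment.

    Any malformed payload yields an empty list instead of raising, so one bad
    worker cannot take down the whole run.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list):
        return []

    valid = []
    for q in questions:
        if not isinstance(q, dict):
            continue
        options, refs, answer = q.get("options"), q.get("line_refs"), q.get("answer_index")
        if (
            isinstance(q.get("question"), str)
            and isinstance(options, list)
            and len(options) >= 2
            and isinstance(answer, int)
            and 0 <= answer < len(options)
            and isinstance(refs, list)
            and refs
            and all(isinstance(r, int) and first_line <= r <= last_line for r in refs)
        ):
            valid.append(q)
    return valid[:max_questions]


good = {"question": "What does the token prove?", "options": ["Identity", "Latency"],
        "answer_index": 0, "line_refs": [50, 52]}
out_of_range = {**good, "line_refs": [80]}
kept = parse_worker_output(json.dumps({"questions": [good, out_of_range]}), 48, 71)
```

In this example only `good` survives, because the second question cites line 80, which is outside lines 48-71.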

Meta-Normalize: Cleaning Up the Seams

When multiple workers run independently, you get a subtle problem. Three workers might all handle the "authentication" topic correctly, but label it differently:

Worker 1 calls it "Authentication"
Worker 2 calls it "User auth"
Worker 3 calls it "Login flow"

None of those labels are wrong, but a quiz with all three looks confusing and sloppy. It's the naming-in-a-code-review problem, except now it's your quiz UX. The meta-normalize pass fixes this: it receives compact metadata for every draft question and proposes label renames for topics and key terms. Crucially, it doesn't see question bodies or answers. It only aligns labels, nothing else. That narrow scope keeps the risk low and the benefit real.
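Here's a sketch of both halves of that pass: building the compact metadata the model gets to see, and applying the renames it proposes. The draft shape (`id`, `topic`, `key_terms`, `question`, `answer`) and the function names are assumptions, and the rename map is hard-coded where a model response would normally go.

```python
def compact_metadata(drafts: list[dict]) -> list[dict]:
    """Keep labels only; question bodies and answers never leave this process."""
    return [
        {"id": d["id"], "topic": d["topic"], "key_terms": d["key_terms"]}
        for d in drafts
    ]


def apply_renames(drafts: list[dict], renames: dict[str, str]) -> list[dict]:
    """Rewrite topic labels and leave every other field untouched."""
    return [{**d, "topic": renames.get(d["topic"], d["topic"])} for d in drafts]


drafts = [
    {"id": 1, "topic": "Authentication", "key_terms": ["token"], "question": "Q1", "answer": "A"},
    {"id": 2, "topic": "User auth", "key_terms": ["session"], "question": "Q2", "answer": "B"},
    {"id": 3, "topic": "Login flow", "key_terms": ["password"], "question": "Q3", "answer": "C"},
]
payload = compact_metadata(drafts)
# Hypothetical rename map, standing in for what the normalize call proposes.
renames = {"User auth": "Authentication", "Login flow": "Authentication"}
normalized = apply_renames(drafts, renames)
```

Because the rename step can only touch `topic`, the worst a bad normalize response can do is produce an odd label.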

Ranking: Generation and Curation Are Different Jobs

Even after normalization, not every valid draft is worth keeping. Some questions are redundant, some cover the same topic from a slightly different angle, some are technically correct but not very educational.

Rather than asking the generation step to also handle curation, the pipeline runs a separate ranking pass over the full candidate pool and selects a balanced final set. Once I stopped thinking "give me five questions" and started thinking "generate candidates, then select the best five," the output quality jumped noticeably. Drafts are cheap; shipping the wrong question batch to users is not.
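"Select a balanced final set" can take many forms. One simple version, assuming each candidate already carries a numeric `score` from the ranking pass (that field, and the round-robin strategy itself, are my illustration rather than a description of the real ranker), is to take the best question from each topic in turn:

```python
from collections import defaultdict


def select_balanced(candidates: list[dict], limit: int = 5) -> list[dict]:
    """Pick up to `limit` questions, cycling through topics, best score first."""
    by_topic: dict[str, list[dict]] = defaultdict(list)
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        by_topic[c["topic"]].append(c)

    picked: list[dict] = []
    while len(picked) < limit and any(by_topic.values()):
        for topic in list(by_topic):
            if by_topic[topic] and len(picked) < limit:
                picked.append(by_topic[topic].pop(0))
    return picked


pool = [
    {"id": "a1", "topic": "Authentication", "score": 0.9},
    {"id": "a2", "topic": "Authentication", "score": 0.8},
    {"id": "b1", "topic": "Rate limiting", "score": 0.7},
    {"id": "c1", "topic": "Caching", "score": 0.4},
]
final = select_balanced(pool, limit=3)
```

Note that `a2` loses to `c1` despite its higher score. Coverage wins over raw score, which is the whole point of a balanced set.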

The Takeaway

By the end of the pipeline, the system has layered on quite a few constraints: line numbers, timestamps, segment boundaries, structured output, schema validation, label normalization, and ranking. That might sound like over-engineering. I'd argue it's just quality engineering: as simple as possible, but no simpler.

If you want reliable output from a probabilistic system, you need stages with opinions. You need checkpoints. You need a way to recover when one component is noisy. The LLM isn't the app; it's one worker inside a system that knows how to use it well.

The final user experience still looks like: paste URL, wait a bit, get quiz. That's exactly how simple it should feel. All that pipeline complexity is just what makes the simplicity possible.

In part three, I'm going to cover the other half of the system: why generating a quiz became a long-running background job, how the async architecture works, and what had to happen so the product stayed responsive even when generation took several minutes.

How I Used AI to Build YouTube Quizr - Part Two