How I Used AI to Build YouTube Quizr

By the time I finished the core quiz pipeline (part 2), I had a new problem. Actually, I had a very old problem wearing a new hat: long-running asynchronous work.

Once you stop doing one prompt and start doing transcript fetches, grammar cleanup, segmentation, parallel worker calls, normalization, ranking, and retries, you're no longer building a cute request-response feature. You're building infrastructure. And infrastructure doesn't care how elegant your prompt is. It will happily time you out mid-thought.

The Browser Is the Wrong Place to Wait

There's an early version of this product where the browser calls one API route, waits patiently, and eventually gets back a quiz. That version is fine... right until it's not. On serverless infrastructure with real-world timeouts, a pipeline that touches multiple LLM calls and a third-party transcript fetch simply can't live inside a single HTTP request. At some point the platform taps you on the shoulder and says, "Cool story, but I'm still closing the connection."

So the web app doesn't try to cram the whole show into one HTTP round trip. It enqueues a job instead. That one shift changed the shape of the product in a very good way.

The Async Flow Is Simple on Purpose

At a high level, the flow looks like this:

Browser sends POST /api/quiz/jobs
Server validates the request, writes a pending record to KV, sends a queue message, and returns 202 with a jobId
A background worker picks up the message, transitions the job to processing, and calls generateQuiz()
On success or failure the worker writes a final completed or failed record to KV
The browser polls GET /api/quiz/jobs/:id until it has an answer
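The accept-then-enqueue half of that flow can be sketched roughly like this. It's a minimal illustration, not the project's actual code: the in-memory `kv` map and `queue` array stand in for the real KV store and queue, the validation is a placeholder, and `createJob`, `getJob`, and the record shape are hypothetical names.

```typescript
// Hypothetical record shape; the real schema isn't shown in this article.
interface PendingJob {
  jobId: string;
  status: 'pending';
  videoUrl: string;
}

// In-memory stand-ins for the KV store and the queue (assumptions).
const kv = new Map<string, PendingJob>();
const queue: { jobId: string; videoUrl: string }[] = [];
let nextId = 0;

// Roughly what POST /api/quiz/jobs does: validate, record, enqueue, return 202.
function createJob(videoUrl: string): { status: number; body: { jobId?: string; error?: string } } {
  // Placeholder validation; the real checks are not described here.
  if (videoUrl.trim() === '') {
    return { status: 400, body: { error: 'missing video URL' } };
  }
  const jobId = `job-${++nextId}`;
  kv.set(jobId, { jobId, status: 'pending', videoUrl });
  queue.push({ jobId, videoUrl });
  return { status: 202, body: { jobId } };
}

// Roughly what GET /api/quiz/jobs/:id does: read the record back for the poller.
function getJob(jobId: string): { status: number; body: PendingJob | { error: string } } {
  const record = kv.get(jobId);
  return record
    ? { status: 200, body: record }
    : { status: 404, body: { error: 'job not found' } };
}
```

The important property is that the POST handler does no generation work at all. It only records intent and hands off.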

This design isn't particularly exotic, and that's exactly the point. The frontend gets a clean async contract. The backend gets time to do real work. The user sees progress instead of a frozen spinner. And the queue consumer becomes the single place where all the long-running complexity lives, isolated from the HTTP layer entirely. Boring architecture is underrated. Boring is also what keeps 3 a.m. pager calls from happening.

Long-Running AI Features Are Mostly About State

When developers talk about AI architecture, the conversation almost always jumps straight to models and prompts. But once background jobs enter the picture, state management becomes just as important. For each quiz request, the system needs reliable answers to:

Was the job accepted and is it in the queue?
Is it actively running or still waiting for a worker?
Did it finish, and if not, why?
What should the UI show at each stage?

That's why the KV-backed job record matters so much. It's the shared source of truth between the web client and the worker. Without it, the queue is just fire-and-forget. With it, the queue becomes part of a coherent product experience.
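One way to make those questions answerable is to treat the job record as a small state machine. The sketch below assumes the four statuses mentioned in this article; the transition table and the UI labels are my own illustration rather than the app's real ones.

```typescript
type JobStatus = 'pending' | 'processing' | 'completed' | 'failed';

// Which moves are legal from each state; terminal states allow none.
const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  pending: ['processing', 'failed'],
  processing: ['completed', 'failed'],
  completed: [],
  failed: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].includes(to);
}

// What the UI might show at each stage (labels are illustrative).
function uiLabel(status: JobStatus): string {
  switch (status) {
    case 'pending':
      return 'Queued';
    case 'processing':
      return 'Generating quiz';
    case 'completed':
      return 'Ready';
    case 'failed':
      return 'Something went wrong';
  }
}
```

Because completed and failed have no outgoing transitions, a stale worker can't accidentally drag a finished job back to processing.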

Two Things Worth Getting Right: Idempotency and Failure Classification

Idempotency is one of those details that users never notice when it works. Queue deliveries aren't guaranteed to happen exactly once, so the worker checks the current job state before doing anything. If it's already completed or failed, it returns early. If the payload is invalid, it marks the job as failed rather than leaving it stranded forever. Same delivery arriving twice? No second quiz generated. This is boring to implement and really important to have. Idempotency won't win you likes on social media, but it will win you more sound sleep.
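Here's a rough sketch of that check-before-work pattern. The `jobs` map stands in for the KV store, `handleDelivery` is a hypothetical name, and the `generateQuiz` below is a stub that only counts calls so the effect of a duplicate delivery is visible.

```typescript
type JobStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface JobRecord {
  status: JobStatus;
  quiz?: string;
  error?: string;
}

const jobs = new Map<string, JobRecord>(); // stand-in for the KV store
let generations = 0; // counts how often the expensive work actually runs

// Stub for the real generateQuiz(); assumed to be expensive.
async function generateQuiz(videoId: string): Promise<string> {
  generations++;
  return `quiz for ${videoId}`;
}

async function handleDelivery(jobId: string, payload: { videoId?: string }): Promise<void> {
  const current = jobs.get(jobId);

  // Duplicate delivery of a finished job: return early, do nothing.
  if (current && (current.status === 'completed' || current.status === 'failed')) {
    return;
  }

  // Invalid payload: mark the job failed rather than leaving it stranded.
  if (!payload.videoId) {
    jobs.set(jobId, { status: 'failed', error: 'invalid payload' });
    return;
  }

  jobs.set(jobId, { status: 'processing' });
  const quiz = await generateQuiz(payload.videoId);
  jobs.set(jobId, { status: 'completed', quiz });
}
```

Delivering the same message twice leaves `generations` at 1, which is the whole point.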

Failure classification is the other one. Not every error deserves a retry, and pretending otherwise just burns time and money:

Retryable: rate limiting, temporary provider failures, transient network issues
Non-retryable: invalid video ID, missing transcript, parse errors that indicate the data is fundamentally unusable

The queue consumer distinguishes between those two worlds. Here's what that actually looks like in the worker:

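// Excerpt from the worker's catch block: `e` is the caught error, and
// `jobId`, `metadata`, `message`, and `code` are assumed to be in scope.

// Permanent failures: retrying can't fix bad input or unusable data.
const nonRetryable =
  e instanceof QuizGenerationError &&
  (e.code === 'VALIDATION_ERROR' || e.code === 'TRANSCRIPT_NOT_FOUND' || e.code === 'PARSE_ERROR');
// Give up once the queue has delivered this message ten times.
const exhausted = metadata.deliveryCount >= 10;

if (nonRetryable || exhausted) {
  // Write a terminal record so the UI gets a real answer and stops polling.
  await setQuizJobRecord(jobId, {
    status: 'failed',
    error: { message, code },
    // ...
  });
  return; // don't re-throw; queue won't retry
}

throw e; // re-throw so the queue retries this delivery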

If the error code is one we know is permanent, or the job has already been attempted ten times, we write a terminal failed record and return cleanly. Otherwise we re-throw, which signals the queue to try again. Non-retryable errors get a real answer immediately. Retryable ones get another shot, up to the delivery limit. Nothing damages user trust faster than a spinner that never stops because the backend refused to admit defeat. We've all stared at that broken spinner before. Let's not build it on purpose, shall we?

Long Videos: A Different Strategy, Not Just More Tokens

Long transcripts aren't just short transcripts with more text. They change the economics of the whole run in ways that a bigger token budget alone doesn't fix. More text means more candidate topics, more chance of the model over-focusing on a popular section, and more redundancy in the output.

So there's a long-mode pipeline path for larger videos. Instead of segmenting the full transcript the same way as shorter content, the system builds overlapping time windows across the video, scores each candidate window with a lightweight analyzer call, selects the most useful ones, and only then runs segment workers on the chosen areas.

That's a very different strategy from "increase context window, hope for the best." The key insight is that the model doesn't need to read everything. It needs to read the right things. Designing a confined search space is often more valuable than expanding it. Throwing tokens at the problem is the distributed systems version of turning it off and on again.
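To make the windowing idea concrete, here's a small sketch. The window size, overlap, and top-k selection rule are assumptions for illustration, and the scores are passed in as plain numbers where the real pipeline would get them from the analyzer call. `buildWindows` and `selectTopWindows` are hypothetical names.

```typescript
interface TimeWindow {
  start: number; // seconds
  end: number; // seconds
}

// Build fixed-size windows across the video, each overlapping the previous one.
function buildWindows(duration: number, size: number, overlap: number): TimeWindow[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error('size must be positive and overlap must be in [0, size)');
  }
  const step = size - overlap;
  const windows: TimeWindow[] = [];
  for (let start = 0; start < duration; start += step) {
    const end = Math.min(start + size, duration);
    windows.push({ start, end });
    if (end === duration) break; // the last window reached the end of the video
  }
  return windows;
}

// Keep the k best-scoring windows, returned in chronological order.
function selectTopWindows(windows: TimeWindow[], scores: number[], k: number): TimeWindow[] {
  return windows
    .map((w, i) => ({ w, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.w)
    .sort((a, b) => a.start - b.start);
}
```

For a 100-second video with 40-second windows and 10 seconds of overlap, `buildWindows(100, 40, 10)` yields windows starting at 0, 30, and 60. Only the selected windows would go on to the segment workers.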

Bounded Concurrency: Fast Without Being Reckless

Several pipeline stages are naturally parallel: segmenting candidate regions, analyzing long-mode windows, running per-segment question workers. Running them concurrently is a big win for latency. But "parallel" doesn't mean "launch unlimited calls and see what happens."

The system uses bounded concurrency for each parallel stage, configurable via environment variables. (See my article about the Leaky Bucket parallelization strategy for more info.) This keeps generation fast without turning provider rate limits into a self-inflicted outage. It's one of the least flashy and most consistently valuable parts of the implementation. Your API bill will thank you. Eventually.
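For readers who want the general shape, here's a generic worker-pool sketch of bounded concurrency. To be clear, this is a common pattern and not necessarily the Leaky Bucket approach from my other article; `mapWithLimit` and `parseLimit` are hypothetical helper names.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  const runnerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results; // same order as `items`
}

// Parse a limit from a raw environment value, falling back when it's unusable.
function parseLimit(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? '', 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}
```

In a Node.js worker the raw value might come from something like `process.env.QUIZ_WORKER_CONCURRENCY` (a made-up variable name), passed through `parseLimit` before each parallel stage.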

Observability: Knowing Where Things Break

A multi-stage AI pipeline is only as debuggable as its logging. Every major step emits a structured log entry with a phase label, so when something goes wrong I can tell immediately whether it failed during transcript fetch, grammar cleanup, segmentation, a specific worker, meta-normalization, or ranking. Without that, every failure report becomes "AI did something weird," which isn't a diagnosis. It's basically a shrug.
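A phase-labeled log entry might look something like this. The phase names mirror the stages listed above, but the field names and the `buildPhaseLog` and `logPhase` helpers are assumptions, not the app's real logging code.

```typescript
type Phase =
  | 'transcript_fetch'
  | 'grammar_cleanup'
  | 'segmentation'
  | 'segment_worker'
  | 'meta_normalization'
  | 'ranking';

interface PhaseLog {
  ts: string;
  jobId: string;
  phase: Phase;
  level: 'info' | 'error';
  message: string;
  details?: Record<string, unknown>;
}

// Build one structured entry; `details` is only included when provided.
function buildPhaseLog(
  jobId: string,
  phase: Phase,
  level: 'info' | 'error',
  message: string,
  details?: Record<string, unknown>,
): PhaseLog {
  return {
    ts: new Date().toISOString(),
    jobId,
    phase,
    level,
    message,
    ...(details ? { details } : {}),
  };
}

// One JSON object per line keeps logs easy to filter by phase or job.
function logPhase(entry: PhaseLog): void {
  console.log(JSON.stringify(entry));
}
```

With entries like these, "which phase failed for job X" becomes a filter instead of an archaeology project.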

A Few UX Details That Paid Off

The frontend has a handful of behaviors that look small but make the app feel much more solid:

Polling over blocking: the UI polls job status on an interval rather than holding an open connection, which plays nicely with serverless and browser tab switching
Keep the previous quiz visible: starting a new generation doesn't wipe the current quiz; the old result stays on screen until the new one is successful and ready to show
Abort vs. failure are different events: if the user cancels or submits a new URL, that isn't the same as the pipeline failing, and the quota refund logic treats them differently

That last one is subtle but it matters. Treating a user-initiated cancel the same as a server error would quietly over-consume the daily generation quota, which is both wrong and the kind of thing that erodes trust in ways that are hard to trace. Users notice when the app punishes them for bailing out.
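Here's a sketch of a polling loop that keeps those outcomes distinct. It uses the standard `AbortController` API; the `JobView` and `PollResult` shapes and the `pollJob` name are hypothetical, and I've deliberately left the quota policy out since it depends on product rules not shown here.

```typescript
type JobView =
  | { status: 'pending' | 'processing' }
  | { status: 'completed'; quiz: unknown }
  | { status: 'failed'; error: string };

type PollResult =
  | { kind: 'completed'; quiz: unknown }
  | { kind: 'failed'; error: string }
  | { kind: 'aborted' };

// Poll until the job reaches a terminal state or the caller aborts.
async function pollJob(
  fetchStatus: () => Promise<JobView>,
  signal: AbortSignal,
  intervalMs = 1000,
): Promise<PollResult> {
  const aborted = (): boolean => signal.aborted;
  for (;;) {
    if (aborted()) return { kind: 'aborted' };
    const job = await fetchStatus();
    // A cancel that lands mid-request wins over whatever came back.
    if (aborted()) return { kind: 'aborted' };
    if (job.status === 'completed') return { kind: 'completed', quiz: job.quiz };
    if (job.status === 'failed') return { kind: 'failed', error: job.error };
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because `aborted` is its own result kind rather than a thrown error, the caller can't confuse a cancel with a pipeline failure when deciding what to do about quota.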

The Bigger Picture

At the start of this project I thought I was building a quiz generator. By the end I realized I was building three systems that happen to cooperate: a transcript-grounded content pipeline, a long-running async job architecture, and a UI that makes both of those feel simple.

The model is one part of that system. The real work is everything wrapped around it: designing for failure, separating concerns, building for long tasks instead of pretending they're short ones.

High-quality AI products aren't usually the result of one impressively engineered prompt. They're the result of building a system that makes it hard for the model to fail in boring ways.

How I Used AI to Build YouTube Quizr - Part Three