Building VoiceAnki, Part II

Real Decks, Bad Formatting, and the Small Matter of Talking to Your Phone

Last time I wrote about VoiceAnki as the project that started as “what if Anki had a mouth and some manners” and then kept escalating.

This post is the sequel where the app met real decks, real speech errors, and the ancient software engineering tradition of discovering that your clean architecture was, in fact, a suggestion.

The short version:

  • the speech loop got less gullible
  • the grader got more structural
  • the logs stopped being decorative
  • I built a local robot to do smoke tests because my own voice was starting to file HR complaints
  • and we are now close enough to the edge of deterministic grading that the next layer is visible, but still carefully fenced off

This is not an “AI solves education” post.

It is a post about building a voice-first Android study app that has to survive:

  • imported decks with formatting from the cursed earth
  • speech recognition that is usually helpful and occasionally drunk
  • grading policy that has to be fast, fair, and local
  • users who absolutely do not care that the regex looked elegant in your notebook

Demo Decks Lie

There is a phase every voice app gets to enjoy where the demo looks great.

You ask a clean question. You answer with a clean sentence. The recognizer hands you a clean transcript. The evaluator gives you a clean pass. Everyone nods like this was a serious plan all along.

Then you point the app at real material.

That is when you meet answers like:

  • 1. foo2. bar
  • Successful: ... Unsuccessful: ...
  • Pros: ... Cons: ...
  • 1877-78
  • Gen. Milyutin
  • one huge paragraph that starts with the useful bit and then wanders into side quests

Imported decks are not malicious. They are just old, messy, human, and full of local conventions. In other words: exactly the kind of input software tends to hate.

The first big lesson of this branch was that the grader needed to stop pretending every card was basically the same problem. A short person-name fact is not the same thing as a date range. A date range is not the same thing as a compact list. A compact list is not the same thing as a long explanatory answer that a human will naturally summarize instead of reciting bullet-by-bullet like a haunted audiobook.

That sounds obvious now. It was less obvious when the system was still getting away with a lot of fuzzy matching and a relatively small pile of hand-reviewed examples.

Card Shape Beats Raw String Length

The biggest architectural shift in this branch is simple to say and annoyingly non-trivial to implement:

grade by answer shape, not just by answer text

That means the evaluator now spends more effort upfront figuring out what sort of thing it is looking at:

  • short factual answer
  • person name
  • short numeric answer
  • definition
  • compact list
  • explanatory multi-point answer
  • command-like or control-like utterance

Once you have that, the rest of the pipeline gets saner. You stop asking one grading rule to play twelve different sports at once.
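To make that concrete, here is a toy version of shape classification in Python. Everything about it (the names, the thresholds, the ordering of the checks) is my illustration, not the actual classifier:

```python
import re

def classify_answer_shape(answer: str) -> str:
    """Toy heuristic classifier: map a stored answer to a coarse shape.

    Structure checks run first, then length-based fallbacks.
    All thresholds here are illustrative guesses, not VoiceAnki's.
    """
    text = answer.strip()
    words = text.split()
    # Two or more numbered markers ("1. ... 2. ...") suggest a compact list.
    if len(re.findall(r"\b\d+[.)]\s", text + " ")) >= 2:
        return "compact_list"
    # Two or more labels ("Pros: ... Cons: ...") suggest a labeled list.
    if len(re.findall(r"\b[A-Z][a-z]+:\s", text)) >= 2:
        return "labeled_list"
    # Digits, dashes, and range words only: a short numeric answer.
    if re.fullmatch(r"[\d\s\-–to]+", text, flags=re.IGNORECASE):
        return "short_numeric"
    # A few capitalized tokens: probably a person name.
    if (1 <= len(words) <= 4
            and any(w[0].isalpha() for w in words)
            and all(w[0].isupper() for w in words if w[0].isalpha())):
        return "person_name"
    if len(words) <= 8:
        return "short_factual"
    return "explanatory"
```

Even a crude classifier like this routes `1. foo 2. bar`, `1877-78`, and `Gen. Milyutin` down three different paths, which is the whole point.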

We are still keeping the main grading path deterministic and fast. That is not nostalgia; it is product design. If a spoken flashcard app feels like it pauses to hold a committee meeting before deciding whether 1877 to 1878 means 1877-78, the illusion is gone.

The user experience needs to feel immediate.

That means the hot path still has to be cheap:

  • classify once
  • prepare candidate structure once
  • compare against compact evidence
  • decide

If later we add something smarter for borderline cases, it has to sit behind that path, not inside it.
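That split can be sketched in a few lines. The names here are hypothetical, and the stubs stand in for the real classifier and normalizer; the point is only that classification and candidate preparation happen once per card, off the per-transcript path:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreparedCard:
    """Everything the hot path needs, computed once when the card loads."""
    shape: str
    candidates: frozenset  # normalized accepted-answer variants

def prepare_card(answer_text: str) -> PreparedCard:
    # Classify once, build candidate structure once. (Illustrative stubs.)
    normalized = answer_text.strip().lower()
    return PreparedCard(shape="short_factual",
                        candidates=frozenset({normalized}))

def grade(card: PreparedCard, transcript: str) -> bool:
    # Per-transcript work: compare against compact evidence and decide.
    # No I/O, no model calls, nothing that can add visible latency.
    return transcript.strip().lower() in card.candidates
```

The design choice is the boundary itself: `prepare_card` can afford to be thorough because it runs once, while `grade` runs on every transcript and has to stay cheap.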

Structure Beats Vibes

One of the most useful additions here is a new structured-answer parser. I am not going to dump the entire evaluator recipe into a public post, because some of that is still moving and some of it is the kind of thing you learn by burning hours in log review. But the broad move is worth talking about.

Instead of treating every stored answer as one opaque blob, VoiceAnki now tries to recognize when the answer is actually a structure:

  • a compact list
  • a numbered list
  • a labeled list
  • a longer explanatory list

That sounds modest. It is not modest. It changes the whole feel of grading.

Here is a trimmed version of the parser entry point:

fun parse(answerText: String): StructuredAnswerParse {
    val decoded = decodeAnswerText(answerText)

    // Numbered items ("1. ... 2. ...") win first: two or more means a real list.
    val numberedItems = extractNumberedItems(decoded)
    if (numberedItems.size >= 2) {
        val items = numberedItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // Otherwise look for labeled items ("Pros: ... Cons: ...").
    val labeledItems = extractLabeledItems(decoded)
    if (labeledItems.size >= 2) {
        val items = labeledItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // No structure detected: fall back to treating the answer as prose.
    return StructuredAnswerParse()
}

That is not magic. It is just the system finally admitting that:

  • formatting matters
  • import damage matters
  • labels matter
  • and if the stored answer is really a list, we should stop grading it like a paragraph that fell down the stairs

Another small but satisfying detail is handling glued list markers. This is the kind of bug that sounds fake until you meet it in the wild:

val source = answerText
    .replace('\n', ' ')
    .replace(Regex("(?<=[a-zA-Z])(?=[1-9][.)-])"), " ")
    .replace("\\s+".toRegex(), " ")
    .trim()

That regex exists because decks really do contain things like foo2. bar, and if you do not split that boundary correctly, you end up evaluating nonsense against nonsense and calling it rigor.
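If you want to see the boundary split in isolation, here is the same regex as a standalone Python repro (not the app's code):

```python
import re

def unglue_list_markers(answer_text: str) -> str:
    """Standalone Python repro of the Kotlin normalization above:
    split list markers glued to the previous word ("foo2. bar")."""
    source = answer_text.replace("\n", " ")
    # A letter immediately followed by a digit-plus-delimiter marker
    # gets a space inserted between them.
    source = re.sub(r"(?<=[a-zA-Z])(?=[1-9][.)-])", " ", source)
    return re.sub(r"\s+", " ", source).strip()

print(unglue_list_markers("1. foo2. bar"))  # → 1. foo 2. bar
```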

The public version of the lesson is:

real grading quality is often won or lost before you ever compare a transcript to anything

If candidate preparation is bad, downstream scoring does not matter much. You are just being wrong with more confidence.

Speech Software Is Mostly About Timing

There is another lie voice products tell when they are young: that speech recognition quality is the main problem.

It is a problem. It is not the only problem. A lot of the actual work is timing, turn-taking, partials, retries, and deciding when not to believe the recognizer’s last word on what just happened.

This branch did a bunch of work in the speech loop itself:

  • carrying multiple alternatives deeper into grading
  • preserving useful partials
  • separating answer listening from control language
  • treating very short answers differently from long ones
  • quietly retrying some short numeric misses instead of immediately punting to the UI

One of the safer excerpts here is the fallback path for partials:

private fun partialFallbackResult(
    error: Int,
    speechStarted: Boolean,
    strongPartialPhrases: List<String>,
    partialPhrases: List<String>,
): RecognitionResult.Transcript? {
    // If the recognizer never flagged speech, there is nothing to salvage.
    if (!speechStarted) {
        return null
    }

    val fallbackPhrases = mergePhrases(
        primary = strongPartialPhrases,
        secondary = partialPhrases,
    )

    if (fallbackPhrases.isEmpty()) {
        return null
    }

    // Only rescue ERROR_NO_MATCH; other errors still surface normally.
    return when (error) {
        SpeechRecognizer.ERROR_NO_MATCH -> RecognitionResult.Transcript(fallbackPhrases)
        else -> null
    }
}

This is one of those changes that sounds small until you look at user experience.

If the user said something real, the recognizer heard enough to produce useful partials, and the final result still collapsed into ERROR_NO_MATCH, the product should not act like the person never spoke. That is the kind of behavior that makes users think the app is being smug on purpose.

Arithmetic cards were especially good at exposing this. If the app cannot survive one-word answers like five, it does not matter how clever your long-answer scoring is. Nobody is impressed. They are just annoyed.

So a lot of recent work has been about making the short-answer path feel less brittle without turning the whole system into a thicket of deck-specific hacks.
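As one hedged illustration of what that can involve: normalizing spoken number words before comparison, so "five" and "5" stop being different answers. This sketch is mine, not VoiceAnki's normalizer, and its coverage is deliberately tiny:

```python
# Map common spoken number words to digits so "five" can match "5".
# A real normalizer needs far more coverage (compounds, ordinals, ...).
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12",
}

def normalize_short_numeric(transcript: str) -> str:
    """Replace known number words with digits, token by token."""
    tokens = transcript.lower().strip().split()
    return " ".join(_NUMBER_WORDS.get(t, t) for t in tokens)

print(normalize_short_numeric("five"))    # → 5
print(normalize_short_numeric("Twelve"))  # → 12
```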

Fast Matters More Than Fancy

One thing I want to be explicit about: there is a lot of temptation in this space to keep throwing more intelligence at grading until it feels “smart.”

That is not automatically a win.

For VoiceAnki, grading speed is part of the product. The user just spoke. The app needs to respond like it was listening, not like it has submitted a ticket.

That constraint shapes the whole design:

  • keep the deterministic path local
  • keep candidate preparation reusable
  • keep transcript-time scoring bounded
  • do not add a visible “thinking…” pause to the normal loop

There is secret sauce in the exact rubric and decision policy, and I am not going to dump that out here line-by-line. But the public-facing principle is straightforward:

the fast path has to stay boring

If the user notices grading latency, they stop trusting the rhythm of the interaction.

And voice UX is rhythm.

Logs Graduated From Debug Tool to Product Infrastructure

I used to think of logs as something you improve once the interesting engineering is done.

That was cute.

On a speech app, logs are part of the interesting engineering.

A bad miss can come from:

  • speech recognition
  • transcript selection
  • answer-shape classification
  • lexical comparison
  • summary-vs-list policy
  • command/control routing
  • deck formatting

That means “it got this wrong” is not one bug category. It is a small crime scene.

So this branch put a lot more effort into making the logs answer questions like:

  • what transcript did we actually choose?
  • what kind of answer did we think this card wanted?
  • what decision path fired?
  • what evidence made the evaluator accept or reject?

That turns review from:

  • “huh, weird”

into:

  • “the answer was parsed as a structured list, but the wrong branch still ran”
  • “the recognizer had a good partial and then dropped the final”
  • “the card was really a summary-shaped answer, but the evaluator treated it like a raw string match”

That is a much more productive kind of pain.
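For flavor: a log record that can answer all four of those questions does not need to be fancy. One structured line per grading decision is enough. The field names below are mine, not the app's actual schema:

```python
import json

def decision_log_line(transcript: str, shape: str, path: str,
                      accepted: bool, evidence: str) -> str:
    """One JSON record per grading decision, so log review can filter
    by decision path instead of grepping free-form text.
    (Illustrative schema, not VoiceAnki's.)"""
    return json.dumps({
        "transcript": transcript,   # what transcript we actually chose
        "shape": shape,             # what kind of answer we expected
        "path": path,               # which decision path fired
        "accepted": accepted,       # the verdict
        "evidence": evidence,       # why the evaluator said yes or no
    }, sort_keys=True)
```

Once every decision emits a line like this, "what branch ran" becomes a filter expression instead of an archaeology project.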

We Built a Tiny Robot Because Manual Smoke Testing Is a Scam

One of my favorite additions around this branch is a local Pipecat smoke-test agent.

This is not some grand autonomous tutoring system. It is a very specific little goblin.

Its job is:

  1. listen to VoiceAnki through the laptop mic
  2. wait for the phone to stop talking
  3. answer through the laptop speakers
  4. keep doing that long enough to flush out session-loop bugs

That sounds silly. It is also incredibly useful.

The helper has a VoiceAnki-specific prompt, local audio transport, transcript logging, and a blunt little repeat-limit rule so it does not get stuck asking for the question forever:

repeat_limit_rule = f"""
Temporary smoke-test rule:

- Track how many times you have said exactly "can you repeat the question" for the current card.
- If you have already asked {max_repeat_requests} times for the same card, do not ask again.
- Instead, say exactly: I don't know
- Use that forced failure to let VoiceAnki mark the card wrong and move to the next question.
""".strip()

That rule exists because, left to their own devices, voice systems will absolutely form little conversational sinkholes and sit there repeating themselves like two Roombas politely arguing in a closet.
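The counting half of that rule is simple to sketch. This mirrors the described behavior rather than the agent's actual code:

```python
class RepeatLimiter:
    """Force a failure answer after too many repeat requests on one card,
    so the smoke-test loop cannot sink into 'can you repeat' forever."""

    def __init__(self, max_repeat_requests: int = 2):
        self.max_repeat_requests = max_repeat_requests
        self.repeats_this_card = 0

    def next_card(self) -> None:
        # A fresh card gets a fresh repeat budget.
        self.repeats_this_card = 0

    def choose_reply(self, wants_repeat: bool, answer: str) -> str:
        if not wants_repeat:
            return answer
        if self.repeats_this_card >= self.max_repeat_requests:
            # Forced failure: let the app mark the card wrong and move on.
            return "I don't know"
        self.repeats_this_card += 1
        return "can you repeat the question"
```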

I also finally wrote proper smoke-run capture scripts so the whole thing can run unattended and leave behind artifacts we can review later:

ANDROID_CAPTURE_PID="$(spawn_detached "$ROOT_DIR" "$ANDROID_LOG" \
  "$ADB_BIN" -s "$ADB_SERIAL" logcat -v time \
  VoiceAnkiSpeech:D VoiceAnkiEval:D VoiceAnkiSemantic:D AndroidRuntime:E '*:S')"

PIPECAT_CAPTURE_PID="$(spawn_detached "$PIPECAT_DIR" "$PIPECAT_LOG" \
  "$PIPECAT_PYTHON" agent.py --input-device "$PIPECAT_INPUT_DEVICE" \
  --output-device "$PIPECAT_OUTPUT_DEVICE")"

That gives each run:

  • filtered Android logs
  • Pipecat logs
  • run metadata
  • timestamped folders for later review

It turns out this matters a lot, because manual voice testing is expensive in a very dumb way. You can lose an hour just being the person who says Roosevelt into a phone over and over while watching adb logcat scroll by like the Matrix, except less profitable.

Once a little robot can do even part of that for you, bugs start showing up in clusters instead of as rumors.

The Branch Is About More Than Just One Deck

A lot of the pressure for these changes came from history decks, because history decks are very good at producing:

  • long answers
  • compressed spoken summaries
  • date ranges
  • names with ASR drift
  • multi-point answer blobs

But the goal is not “optimize for history.”

That would be a trap.

The real target is broader:

  • explanatory cards where users summarize instead of reciting
  • imported decks with broken structure
  • voice-native equivalence for dates and names
  • command/control phrases coexisting with answer content
  • better handling for cards where exact string equality is just the wrong abstraction

If the implementation only works because the source material happens to be one subject area, that is not a system. That is a souvenir.
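For the date case specifically (the "1877 to 1878" versus "1877-78" problem from earlier), equivalence can be checked by normalizing both sides to a canonical year pair. This is a sketch of the idea, not the shipped rule:

```python
import re

def canonical_year_range(text: str):
    """Normalize '1877 to 1878', '1877-78', '1877–1878' to (1877, 1878).

    Returns None if the text does not look like a year range.
    Illustrative only; real normalization needs more forms (BC, circa, ...).
    """
    m = re.fullmatch(r"\s*(\d{4})\s*(?:-|–|to)\s*(\d{4}|\d{2})\s*",
                     text.lower())
    if not m:
        return None
    start = int(m.group(1))
    end_raw = m.group(2)
    # Expand two-digit shorthand: '1877-78' means 1878, same century.
    end = int(end_raw) if len(end_raw) == 4 else (start // 100) * 100 + int(end_raw)
    return (start, end)

print(canonical_year_range("1877 to 1878"))  # → (1877, 1878)
print(canonical_year_range("1877-78"))       # → (1877, 1878)
```

Two answers are then "the same date range" exactly when their canonical pairs match, regardless of how the deck or the recognizer happened to spell them.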

What Landed, and What Is Still Moving

A fair amount of the branch is already real:

  • more answer-shape-aware evaluation
  • stronger short-answer handling
  • better transcript preservation
  • richer evaluator logs
  • local Pipecat smoke testing
  • unattended log capture for long runs

There is also important work underway, some of it not committed yet:

  • broader under-acceptance reduction for explanatory multi-point cards
  • cleaner parsing of ugly imported answer text
  • more voice-native normalization for dates and names
  • more explicit decision-source logging
  • more regression tests built from real reviewed misses, not just happy-path examples

That uncommitted work matters because this branch has been one of those very honest engineering branches where the review notes, the smoke-test notes, and the code all inform each other in tight loops.

Or, put less politely: the app keeps finding new ways to be wrong, and I keep taking notes.

That is good. It means the system is meeting reality.

Deterministic Grading Is Better Now, but It Is Not the Final Boss

This is the part where I want to be careful not to oversell the current system.

The deterministic grader is better than it was:

  • more structural
  • less naive
  • more debuggable
  • less likely to reject obviously good answers for ridiculous reasons

That is real progress.

But there is also a limit to how far you want to push deterministic grading before the whole thing turns into an overfitted museum of exceptions and folklore.

That does not mean the deterministic work was wasted.

It means it was the right layer to improve first:

  • command routing
  • control handling
  • structured parsing
  • short-answer resilience
  • person-name behavior
  • list-vs-summary handling
  • observability

Those are foundational. A later model-backed layer should inherit them, not bulldoze them.

That is why the on-device inference work I have been sketching is intentionally narrow and conservative. The likely next step is not “let a model grade everything.” It is closer to:

  • keep the cheap path cheap
  • keep the main loop immediate
  • use on-device adjudication only for a narrow band of borderline long-answer cases
  • keep abstention first-class
  • make it optional and Android-native

In other words: add one careful new tool, not a second religion.

The Main Lesson So Far

The main lesson from this phase of VoiceAnki is that speech products punish fake abstraction almost immediately.

If your system is too generic, it feels unfair. If it is too clever, it becomes slow. If it is too rigid, users hate it. If it is too permissive, grading stops meaning anything.

The job is to keep finding the narrow path where the app feels:

  • fast
  • fair
  • understandable
  • and boring in the best possible way

Not “maximally AI.” Not “academically pure.” Not “one more heroic regex.”

Just a study loop that feels natural enough that the user forgets how much machinery is underneath it.

And if, along the way, we end up with a better parser, a less gullible speech loop, a tiny local smoke-test goblin, and a cautious roadmap for on-device adjudication, that seems like a pretty decent trade.
