Merit, But Make It Legible

One of the more irritating features of modern life is that people love to say they reward merit when what they often reward is legibility.

Not raw capability.
Not force of will.
Not how much resistance someone had to push through to become good at something.

Legibility.

Did the achievement arrive in packaging the system already knows how to admire? Did it come with a famous school, recognizable institutions, polished references, family support, clean internships, the right tone, the right posture, the right little trail of approved breadcrumbs? If so, people relax. They call it excellence.

Meanwhile, if someone arrives at similar visible competence through a messier path — sparse resources, little formal support, public materials, self-direction, no safety net, and almost no room for error — the response is often weirdly diminished.

That person becomes scrappy.
Surprisingly strong.
Promising.
Impressive, considering.

Considering what, exactly?

What is being “considered” is usually the absence of prestige decoration. The person may have built nearly the same capability, or in some cases more durable capability, but because they did not emerge from a trusted institutional pipeline, people treat the result as somehow less real. Or more provisional. Or faintly suspicious. They get credit, but in the off-brand, slightly patronizing way society reserves for people who succeeded without first being pre-approved.

This is backwards in an important sense.

The person who had elite schooling, money, family support, institutional legitimacy, and low-friction access to opportunity may in fact be highly capable. None of this automatically disqualifies them. Plenty of advantaged people are genuinely excellent.

But there is still a difference between demonstrating excellence under supportive conditions and constructing yourself under weak ones.

The bootstrap path often demands a set of traits that institutions claim to admire but are not especially good at recognizing in the wild:

  • initiative
  • independence
  • persistence
  • improvisation
  • the ability to learn without structure
  • the ability to continue without validation
  • the ability to recover from mistakes that were actually costly

Those are not decorative virtues. Those are core builder traits.

And yet, because they do not come pre-certified by prestige systems, they are routinely under-read. Not merely under-resourced at the start — under-credited even after the fact.

That distinction matters.

Being under-resourced means you lacked inputs.
Being under-credited means the world misreads what you produced.

Those are different problems.

The first makes the climb harder.
The second makes the summit look smaller than it is.

A lot of evaluators will insist this is not bias, just pragmatism. They will say elite labels are useful proxies. And to be fair, they are. Institutions act as compression algorithms. They save busy people the trouble of asking inconvenient questions like:

  • How hard was this path, actually?
  • How much support was quietly embedded in the background?
  • How much independent force did this person have to generate on their own?
  • How many hidden cushions were mistaken for personal greatness?

These are not questions most systems are built to ask, because they are expensive to answer and mildly destabilizing to the mythology. It is much easier to see Harvard, billionaire parents, polished confidence, and familiar signals, then conclude: obviously exceptional.

Clean. Efficient. Safe.

It is much less comfortable to look at someone who assembled themselves from public materials, intermittent guidance, and sheer stubbornness, then admit that what you are seeing may represent a more violent act of self-construction.

The elite profile is often treated as natural greatness.
The bootstrap profile is often treated as an anomaly.

But anomalies are sometimes just reality showing through the branding.

This does not mean the bootstrap person is always better. That would just be reverse snobbery with better PR. The point is narrower and more important: achievement is frequently judged by how frictionless it looks, not by how much force was required to make it happen.

And force matters.

Especially in domains where the environment is unstable, where there is no syllabus, where support is partial, where nobody is coming to organize your progress for you. In those situations, the ability to move without structure, learn without permission, and continue without applause is not some charming side trait. It is often the thing itself.

That person may not sound as polished.
They may not tell the story as elegantly.
They may not have the right names on the résumé.
They may not know how to perform legitimacy in the dialect gatekeepers prefer.

But sometimes they built more real capability with less help and less slack.

And the world, being the world, often reads that as scrappy instead of formidable.

Which is convenient, because formidable would force people to rethink what they are actually rewarding.

Building VoiceAnki, Part II

Real Decks, Bad Formatting, and the Small Matter of Talking to Your Phone

Last time I wrote about VoiceAnki as the project that started as “what if Anki had a mouth and some manners” and then kept escalating.

This post is the sequel where the app met real decks, real speech errors, and the ancient software engineering tradition of discovering that your clean architecture was, in fact, a suggestion.

The short version:

  • the speech loop got less gullible
  • the grader got more structural
  • the logs stopped being decorative
  • I built a local robot to do smoke tests because my own voice was starting to file HR complaints
  • and we are now close enough to the edge of deterministic grading that the next layer is visible, but still carefully fenced off

This is not an “AI solves education” post.

It is a post about building a voice-first Android study app that has to survive:

  • imported decks with formatting from the cursed earth
  • speech recognition that is usually helpful and occasionally drunk
  • grading policy that has to be fast, fair, and local
  • users who absolutely do not care that the regex looked elegant in your notebook

Demo Decks Lie

There is a phase every voice app gets to enjoy where the demo looks great.

You ask a clean question. You answer with a clean sentence. The recognizer hands you a clean transcript. The evaluator gives you a clean pass. Everyone nods like this was a serious plan all along.

Then you point the app at real material.

That is when you meet answers like:

  • 1. foo2. bar
  • Successful: ... Unsuccessful: ...
  • Pros: ... Cons: ...
  • 1877-78
  • Gen. Milyutin
  • one huge paragraph that starts with the useful bit and then wanders into side quests

Imported decks are not malicious. They are just old, messy, human, and full of local conventions. In other words: exactly the kind of input software tends to hate.

The first big lesson of this branch was that the grader needed to stop pretending every card was basically the same problem. A short person-name fact is not the same thing as a date range. A date range is not the same thing as a compact list. A compact list is not the same thing as a long explanatory answer that a human will naturally summarize instead of reciting bullet-by-bullet like a haunted audiobook.

That sounds obvious now. It was less obvious when the system was still getting away with a lot of fuzzy matching and a relatively small pile of hand-reviewed examples.

Card Shape Beats Raw String Length

The biggest architectural shift in this branch is simple to say and annoyingly non-trivial to implement:

grade by answer shape, not just by answer text

That means the evaluator now spends more effort upfront figuring out what sort of thing it is looking at:

  • short factual answer
  • person name
  • short numeric answer
  • definition
  • compact list
  • explanatory multi-point answer
  • command-like or control-like utterance

Once you have that, the rest of the pipeline gets saner. You stop asking one grading rule to play twelve different sports at once.
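The classification step can be sketched roughly like this. Python for illustration only; the function name, labels, and rules here are hypothetical simplifications, not VoiceAnki's actual classifier:

```python
import re

def classify_answer_shape(answer: str) -> str:
    """Rough answer-shape classifier; names and rules are illustrative."""
    text = answer.strip()
    # Two or more numbered markers suggest a list-shaped answer.
    if len(re.findall(r"\b\d+[.)]\s*", text)) >= 2:
        return "compact_list"          # '1. foo 2. bar'
    # Two or more 'Label:' prefixes suggest a labeled list.
    if len(re.findall(r"\b\w+:\s", text)) >= 2:
        return "labeled_list"          # 'Pros: ... Cons: ...'
    # Only digits, separators, and 'to' looks like a numeric/date answer.
    if re.fullmatch(r"[\d\s\-to]+", text):
        return "short_numeric"         # '1877-78', '1877 to 1878'
    words = text.split()
    if 0 < len(words) <= 3 and all(w[:1].isupper() for w in words):
        return "person_name"           # 'Gen. Milyutin'
    if len(words) <= 6:
        return "short_fact"
    return "explanatory"
```

The point is not these exact rules; it is that everything downstream gets to assume a shape instead of re-deriving one per comparison.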

We are still keeping the main grading path deterministic and fast. That is not nostalgia; it is product design. If a spoken flashcard app feels like it pauses to hold a committee meeting before deciding whether 1877 to 1878 means 1877-78, the illusion is gone.

The user experience needs to feel immediate.

That means the hot path still has to be cheap:

  • classify once
  • prepare candidate structure once
  • compare against compact evidence
  • decide

If later we add something smarter for borderline cases, it has to sit behind that path, not inside it.
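A minimal sketch of that layering, assuming a hypothetical `grade` entry point with an optional `adjudicate` hook that only ever runs on borderline cases:

```python
def grade(transcript: str, accepted: list[str], adjudicate=None) -> str:
    """Cheap deterministic path first; optional slow-path adjudicator
    sits behind it, never inside it. Names are illustrative."""
    norm = transcript.strip().lower()
    candidates = [a.strip().lower() for a in accepted]
    if norm in candidates:
        return "correct"               # hot path: exact after normalization
    # Borderline: containment either way, but not exact.
    close = any(norm in c or c in norm for c in candidates)
    if close and adjudicate is not None:
        return adjudicate(transcript, accepted)   # escalate only here
    return "correct" if close else "incorrect"
```

The shape matters more than the details: the common case never pays for the expensive machinery.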

Structure Beats Vibes

One of the most useful additions here is a new structured-answer parser. I am not going to dump the entire evaluator recipe into a public post, because some of that is still moving and some of it is the kind of thing you learn by burning hours in log review. But the broad move is worth talking about.

Instead of treating every stored answer as one opaque blob, VoiceAnki now tries to recognize when the answer is actually a structure:

  • a compact list
  • a numbered list
  • a labeled list
  • a longer explanatory list

That sounds modest. It is not modest. It changes the whole feel of grading.

Here is a trimmed version of the parser entry point:

fun parse(answerText: String): StructuredAnswerParse {
    val decoded = decodeAnswerText(answerText)
    val numberedItems = extractNumberedItems(decoded)
    if (numberedItems.size >= 2) {
        val items = numberedItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    val labeledItems = extractLabeledItems(decoded)
    if (labeledItems.size >= 2) {
        val items = labeledItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    return StructuredAnswerParse()
}

That is not magic. It is just the system finally admitting that:

  • formatting matters
  • import damage matters
  • labels matter
  • and if the stored answer is really a list, we should stop grading it like a paragraph that fell down the stairs

Another small but satisfying detail is handling glued list markers. This is the kind of bug that sounds fake until you meet it in the wild:

val source = answerText
    .replace('\n', ' ')
    .replace(Regex("(?<=[a-zA-Z])(?=[1-9][.)-])"), " ")
    .replace("\\s+".toRegex(), " ")
    .trim()

That one regex exists because decks really do contain things like foo2. bar, and if you do not split that boundary correctly, you end up evaluating nonsense against nonsense and calling it rigor.
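For illustration, the same boundary split as a standalone Python sketch (hypothetical function name, same lookbehind/lookahead idea):

```python
import re

def split_glued_markers(raw: str) -> str:
    """Insert a space where a numbered list marker is glued to the
    previous word, then collapse whitespace."""
    source = raw.replace("\n", " ")
    # Zero-width boundary: letter behind, '2.' / '3)' / '4-' ahead.
    source = re.sub(r"(?<=[a-zA-Z])(?=[1-9][.)\-])", " ", source)
    return re.sub(r"\s+", " ", source).strip()
```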

The public version of the lesson is:

real grading quality is often won or lost before you ever compare a transcript to anything

If candidate preparation is bad, downstream scoring does not matter much. You are just being wrong with more confidence.

Speech Software Is Mostly About Timing

There is another lie voice products tell when they are young: that speech recognition quality is the main problem.

It is a problem. It is not the only problem. A lot of the actual work is timing, turn-taking, partials, retries, and deciding when not to believe the recognizer’s last word on what just happened.

This branch did a bunch of work in the speech loop itself:

  • carrying multiple alternatives deeper into grading
  • preserving useful partials
  • separating answer listening from control language
  • treating very short answers differently from long ones
  • quietly retrying some short numeric misses instead of immediately punting to the UI

One of the safer excerpts here is the fallback path for partials:

private fun partialFallbackResult(
    error: Int,
    speechStarted: Boolean,
    strongPartialPhrases: List<String>,
    partialPhrases: List<String>,
): RecognitionResult.Transcript? {
    if (!speechStarted) {
        return null
    }

    val fallbackPhrases = mergePhrases(
        primary = strongPartialPhrases,
        secondary = partialPhrases,
    )

    if (fallbackPhrases.isEmpty()) {
        return null
    }

    return when (error) {
        SpeechRecognizer.ERROR_NO_MATCH -> RecognitionResult.Transcript(fallbackPhrases)
        else -> null
    }
}

This is one of those changes that sounds small until you look at user experience.

If the user said something real, the recognizer heard enough to produce useful partials, and the final result still collapsed into ERROR_NO_MATCH, the product should not act like the person never spoke. That is the kind of behavior that makes users think the app is being smug on purpose.

Arithmetic cards were especially good at exposing this. If the app cannot survive one-word answers like five, it does not matter how clever your long-answer scoring is. Nobody is impressed. They are just annoyed.

So a lot of recent work has been about making the short-answer path feel less brittle without turning the whole system into a thicket of deck-specific hacks.

Fast Matters More Than Fancy

One thing I want to be explicit about: there is a lot of temptation in this space to keep throwing more intelligence at grading until it feels “smart.”

That is not automatically a win.

For VoiceAnki, grading speed is part of the product. The user just spoke. The app needs to respond like it was listening, not like it has submitted a ticket.

That constraint shapes the whole design:

  • keep the deterministic path local
  • keep candidate preparation reusable
  • keep transcript-time scoring bounded
  • do not add a visible “thinking…” pause to the normal loop

There is secret sauce in the exact rubric and decision policy, and I am not going to dump that out here line-by-line. But the public-facing principle is straightforward:

the fast path has to stay boring

If the user notices grading latency, they stop trusting the rhythm of the interaction.

And voice UX is rhythm.

Logs Graduated From Debug Tool to Product Infrastructure

I used to think of logs as something you improve once the interesting engineering is done.

That was cute.

On a speech app, logs are part of the interesting engineering.

A bad miss can come from:

  • speech recognition
  • transcript selection
  • answer-shape classification
  • lexical comparison
  • summary-vs-list policy
  • command/control routing
  • deck formatting

That means “it got this wrong” is not one bug category. It is a small crime scene.

So this branch put a lot more effort into making the logs answer questions like:

  • what transcript did we actually choose?
  • what kind of answer did we think this card wanted?
  • what decision path fired?
  • what evidence made the evaluator accept or reject?

That turns review from:

  • “huh, weird”

into:

  • “the answer was parsed as a structured list, but the wrong branch still ran”
  • “the recognizer had a good partial and then dropped the final”
  • “the card was really a summary-shaped answer, but the evaluator treated it like a raw string match”

That is a much more productive kind of pain.
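One cheap way to make those questions answerable is to record every grading decision as a small structured trace. This sketch uses hypothetical field names, not the app's actual log schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class DecisionTrace:
    """One grading decision, with enough context to debug a miss."""
    chosen_transcript: str   # which alternative we actually graded
    answer_shape: str        # what kind of answer we thought this was
    decision_path: str       # which branch of the evaluator fired
    evidence: str            # why it accepted or rejected
    accepted: bool

trace = DecisionTrace(
    chosen_transcript="eighteen seventy seven to seventy eight",
    answer_shape="short_numeric",
    decision_path="date_range_equivalence",
    evidence="normalized both sides to 1877-1878",
    accepted=True,
)
```

Once misses come with a trace like this attached, log review stops being archaeology.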

We Built a Tiny Robot Because Manual Smoke Testing Is a Scam

One of my favorite additions around this branch is a local Pipecat smoke-test agent.

This is not some grand autonomous tutoring system. It is a very specific little goblin.

Its job is:

  1. listen to VoiceAnki through the laptop mic
  2. wait for the phone to stop talking
  3. answer through the laptop speakers
  4. keep doing that long enough to flush out session-loop bugs

That sounds silly. It is also incredibly useful.

The helper has a VoiceAnki-specific prompt, local audio transport, transcript logging, and a blunt little repeat-limit rule so it does not get stuck asking for the question forever:

repeat_limit_rule = f"""
Temporary smoke-test rule:

- Track how many times you have said exactly "can you repeat the question" for the current card.
- If you have already asked {max_repeat_requests} times for the same card, do not ask again.
- Instead, say exactly: I don't know
- Use that forced failure to let VoiceAnki mark the card wrong and move to the next question.
""".strip()

That rule exists because, left to their own devices, voice systems will absolutely form little conversational sinkholes and sit there repeating themselves like two Roombas politely arguing in a closet.

I also finally wrote proper smoke-run capture scripts so the whole thing can run unattended and leave behind artifacts we can review later:

ANDROID_CAPTURE_PID="$(spawn_detached "$ROOT_DIR" "$ANDROID_LOG" \
  "$ADB_BIN" -s "$ADB_SERIAL" logcat -v time \
  VoiceAnkiSpeech:D VoiceAnkiEval:D VoiceAnkiSemantic:D AndroidRuntime:E '*:S')"

PIPECAT_CAPTURE_PID="$(spawn_detached "$PIPECAT_DIR" "$PIPECAT_LOG" \
  "$PIPECAT_PYTHON" agent.py --input-device "$PIPECAT_INPUT_DEVICE" \
  --output-device "$PIPECAT_OUTPUT_DEVICE")"

That gives each run:

  • filtered Android logs
  • Pipecat logs
  • run metadata
  • timestamped folders for later review

It turns out this matters a lot, because manual voice testing is expensive in a very dumb way. You can lose an hour just being the person who says Roosevelt into a phone over and over while watching adb logcat scroll by like the Matrix, except less profitable.

Once a little robot can do even part of that for you, bugs start showing up in clusters instead of as rumors.

The Branch Is About More Than Just One Deck

A lot of the pressure for these changes came from history decks, because history decks are very good at producing:

  • long answers
  • compressed spoken summaries
  • date ranges
  • names with ASR drift
  • multi-point answer blobs

But the goal is not “optimize for history.”

That would be a trap.

The real target is broader:

  • explanatory cards where users summarize instead of reciting
  • imported decks with broken structure
  • voice-native equivalence for dates and names
  • command/control phrases coexisting with answer content
  • better handling for cards where exact string equality is just the wrong abstraction
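As one concrete illustration of voice-native date equivalence, here is a sketch that maps 1877 to 1878, 1877-78, and 1877–1878 onto the same canonical pair. The helper name and rules are hypothetical simplifications:

```python
import re

def normalize_year_range(text: str):
    """Map '1877 to 1878', '1877-78', '1877–1878' to one canonical pair."""
    m = re.search(r"\b(\d{4})\s*(?:-|–|to)\s*(\d{2,4})\b", text)
    if not m:
        return None
    start, end = m.group(1), m.group(2)
    if len(end) == 2:
        # Expand '78' to '1878' using the start year's century.
        end = start[:2] + end
    return (int(start), int(end))
```

Once both the stored answer and the transcript pass through something like this, exact string equality stops mattering for date cards.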

If the implementation only works because the source material happens to be one subject area, that is not a system. That is a souvenir.

What Landed, and What Is Still Moving

A fair amount of the branch is already real:

  • more answer-shape-aware evaluation
  • stronger short-answer handling
  • better transcript preservation
  • richer evaluator logs
  • local Pipecat smoke testing
  • unattended log capture for long runs

There is also important work underway, some of it not committed yet:

  • broader under-acceptance reduction for explanatory multi-point cards
  • cleaner parsing of ugly imported answer text
  • more voice-native normalization for dates and names
  • more explicit decision-source logging
  • more regression tests built from real reviewed misses, not just happy-path examples

That uncommitted work matters because this branch has been one of those very honest engineering branches where the review notes, the smoke-test notes, and the code all inform each other in tight loops.

Or, put less politely: the app keeps finding new ways to be wrong, and I keep taking notes.

That is good. It means the system is meeting reality.

Deterministic Grading Is Better Now, but It Is Not the Final Boss

This is the part where I want to be careful not to oversell the current system.

The deterministic grader is better than it was:

  • more structural
  • less naive
  • more debuggable
  • less likely to reject obviously good answers for ridiculous reasons

That is real progress.

But there is also a limit to how far you want to push deterministic grading before the whole thing turns into an overfitted museum of exceptions and folklore.

That does not mean the deterministic work was wasted.

It means it was the right layer to improve first:

  • command routing
  • control handling
  • structured parsing
  • short-answer resilience
  • person-name behavior
  • list-vs-summary handling
  • observability

Those are foundational. A later model-backed layer should inherit them, not bulldoze them.

That is why the on-device inference work I have been sketching is intentionally narrow and conservative. The likely next step is not “let a model grade everything.” It is closer to:

  • keep the cheap path cheap
  • keep the main loop immediate
  • use on-device adjudication only for a narrow band of borderline long-answer cases
  • keep abstention first-class
  • make it optional and Android-native

In other words: add one careful new tool, not a second religion.

The Main Lesson So Far

The main lesson from this phase of VoiceAnki is that speech products punish fake abstraction almost immediately.

If your system is too generic, it feels unfair. If it is too clever, it becomes slow. If it is too rigid, users hate it. If it is too permissive, grading stops meaning anything.

The job is to keep finding the narrow path where the app feels:

  • fast
  • fair
  • understandable
  • and boring in the best possible way

Not “maximally AI.” Not “academically pure.” Not “one more heroic regex.”

Just a study loop that feels natural enough that the user forgets how much machinery is underneath it.

And if, along the way, we end up with a better parser, a less gullible speech loop, a tiny local smoke-test goblin, and a cautious roadmap for on-device adjudication, that seems like a pretty decent trade.

Building VoiceAnki: A Voice-First Study App That Kept Growing

What This Project Is

VoiceAnki started as a pretty simple idea: what if flashcard review felt more like a conversation and less like tapping through tiny buttons?

The core goal was to make studying possible in a more hands-free, audio-first way. Instead of treating voice as a gimmick layered on top of a normal flashcard app, the project pushed toward something more opinionated:

  • speak the prompt
  • listen for the answer
  • evaluate the response
  • keep the review loop moving without constant screen interaction

Over time, that turned into a much larger app than the original idea suggested. What exists now is not just a voice button on a flashcard screen. It is a full Android app with a session runtime, deck import pipeline, history, settings, AnkiWeb integration, and an increasingly serious answer-evaluation system.

This post is a look back at the work that went into it, what changed along the way, and what turned out to be harder than expected.

The Starting Point

At the beginning, the product shape was intentionally narrow:

  • Android only
  • local deck storage
  • spoken prompts
  • spoken answers
  • deterministic grading
  • lightweight study history

That focus mattered. It kept the project from immediately collapsing into a vague “AI tutor” idea. The first real work was not around machine learning at all. It was around building a dependable study loop:

  • a card queue
  • review scheduling
  • a reducer-driven session state machine
  • text-to-speech
  • Android speech recognition
  • foreground session behavior so the app could survive longer interactions

That part of the app is still the backbone of everything else. Even the newer AI and semantic work only makes sense because there is already a deterministic study engine underneath it.

Turning It Into a Real App

Once the core loop existed, the app started growing in the more familiar directions any real product eventually has to grow.

The project gained:

  • a home screen that lists decks
  • deck detail views
  • a settings screen for answer mode, speech rate, listening window, and grading behavior
  • session history
  • a persistent Room-backed database
  • DataStore-backed settings

That was the moment it stopped feeling like a prototype and started feeling like an app with real internal structure.

One theme that kept coming up was that nearly every “simple” feature touched more systems than expected. A new setting was never just a toggle. It usually had to travel through:

  • settings storage
  • view models
  • UI state
  • runtime configuration
  • sometimes the session reducer itself

That kind of wiring is not glamorous, but it is what makes later experimentation possible without the whole app turning into spaghetti.

Importing Decks Instead of Pretending

One of the biggest shifts in the project was deciding that the app should not live forever on a demo deck.

That meant building a real import path.

There are two different import stories in the app now:

  1. importing from files
  2. importing from AnkiWeb

The file import work led to a full import pipeline:

  • parse a deck file
  • turn it into an internal draft
  • preview the import
  • commit it into the local database

That draft step turned out to be especially useful. It created a clean boundary between “we successfully fetched or parsed something” and “we are ready to persist it as a real deck.” That became important later when the app started pulling content from the web rather than only from local files.

The .apkg path was also a turning point. Anki package import sounds straightforward until you actually have to do it on-device:

  • unzip the package
  • extract and read the SQLite content
  • resolve media references
  • map notes, cards, models, and templates into something your own app understands

That is the kind of work that is easy to underestimate from a distance. It is not especially flashy, but it is exactly the sort of feature that makes an app useful in the real world.

AnkiWeb: From Scraping to a Better Product Decision

AnkiWeb support was one of the most iterative parts of the project.

The first instinct was what many apps would try first: scrape the shared-deck pages and build a native search/detail flow on top of that. That approach looked promising at first, but it ran straight into the reality of the modern web:

  • JavaScript-heavy pages
  • Cloudflare-style challenge behavior
  • markup that is not stable enough to treat as a public API

The project went through several rounds of trying to make that scraper path more resilient, including:

  • improving network setup and headers
  • hardening HTML parsing
  • using a WebView to render pages instead of assuming static HTML

That work was valuable, but it also taught an important product lesson: sometimes the best engineering move is to change the shape of the feature.

The eventual direction became much better:

  • use a visible in-app browser activity for AnkiWeb
  • let the user browse the real site
  • intercept .apkg downloads in-app
  • store the download privately
  • create an import draft
  • jump straight into the existing preview/import flow

That was a much more honest solution. It stopped fighting the site and started using the app’s own strengths: import, preview, and local persistence.

Making Voice Feel Like the Main Interface

The heart of the app is still the study session runtime.

A lot of the work here was not about adding more UI, but about making the voice loop feel coherent:

  • when prompts are spoken
  • when the app starts listening
  • how long the listening window should last
  • when partial recognition should be trusted
  • when to stop early on a strong answer
  • when to reveal the answer
  • how self-grading and automatic grading fit together

On Android, speech is never just “call the speech API and you’re done.” There are always edge cases:

  • permissions
  • recognizer flavor differences
  • partial results versus final results
  • cancellation timing
  • audio focus
  • device quirks

A lot of this project became an exercise in being honest about those constraints and designing around them instead of pretending they do not exist.

That honesty also showed up in the app’s session state model. The runtime is not a pile of callbacks. It is built around explicit states and events, which makes it much easier to reason about what the app thinks is happening at any given moment.

That structure paid off again and again as more features got layered in.

Answer Evaluation: From Exact Matching to Something Smarter

The earliest evaluator was mostly deterministic:

  • normalize text
  • compare against accepted answers
  • allow fuzzy matching where appropriate
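Those three steps can be sketched in a few lines. Python for illustration; the names and the fuzzy threshold are hypothetical, not the app's actual tuning:

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def lexical_match(transcript: str, accepted: list[str],
                  fuzzy_threshold: float = 0.85) -> bool:
    """Exact match after normalization, then bounded fuzzy matching."""
    t = normalize(transcript)
    for answer in accepted:
        a = normalize(answer)
        if t == a:
            return True
        if SequenceMatcher(None, t, a).ratio() >= fuzzy_threshold:
            return True
    return False
```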

That still works well for many cards. In fact, it is still the right answer for:

  • arithmetic
  • spelling
  • short identifiers
  • cases where a near miss should absolutely not pass

But as soon as the app started touching longer answers and more natural language, the limits became obvious. A strict string-oriented evaluator can be technically consistent while still feeling wrong to a human being.

That led to the semantic grading work.

The first step was not “let AI handle grading.” It was a more conservative plan:

  • keep deterministic matching first
  • add a semantic fallback only when lexical matching is not enough
  • use on-device embeddings rather than a cloud-first model

That design choice mattered. It kept the project grounded. Semantic grading was not supposed to replace the rest of the evaluator. It was supposed to rescue reasonable answers that were being unfairly rejected.

Semantic Grading Turned Out to Be Harder Than the Idea

The semantic work brought some of the most interesting engineering problems in the whole project.

The app now includes:

  • a semantic evaluator
  • an embedding cache
  • a decision policy with accept / unsure / reject bands
  • a bundled sentence-embedding model
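The accept / unsure / reject banding can be sketched like this (illustrative thresholds only; the real policy is tuned against reviewed misses):

```python
def semantic_decision(score: float,
                      accept_at: float = 0.80,
                      reject_below: float = 0.55) -> str:
    """Map a similarity score into accept / unsure / reject bands.
    Thresholds here are placeholders, not the app's actual values."""
    if score >= accept_at:
        return "accept"
    if score < reject_below:
        return "reject"
    return "unsure"      # the band where a human, or a stricter check, decides
```

The middle band is the important part: it gives the system a first-class way to abstain instead of guessing.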

But the path there was not smooth.

One of the first real blockers was that the original MediaPipe dependency being used for text embeddings was simply too old. On-device initialization was crashing natively on the target phone. The fix was not a clever code workaround. The real fix was dependency modernization. Once the library was upgraded to a current version, the embedder could initialize successfully.

That was a good reminder that “AI bugs” are often just normal software engineering bugs wearing a more dramatic outfit.

The second challenge was more subtle: just because semantic scoring works does not mean it should be trusted blindly.

This showed up especially clearly on a command-heavy CS50-style deck. Some answers that felt obviously related were accepted. Some answers that felt obviously wrong were also accepted. Other short command answers that a human would probably allow were rejected.

That forced a more nuanced policy:

  • semantic scoring is useful
  • but command-like and syntax-heavy answers need lexical anchors
  • shorthand answers like tail for tail <file> should still be allowed
  • vague phrases like not sure should never pass just because an embedding score looks high

That is exactly the kind of product problem that makes this sort of project interesting. The challenge is not just “can the model produce a number?” The challenge is whether the resulting behavior matches what a real learner would expect.

AI Mode and the Difference Between “Plumbing” and “Experience”

Another large branch of work explored a fuller AI mode using Gemini live audio and tool-calling ideas.

This part of the project went through multiple milestones:

  • plumbing mode flags through settings, navigation, and runtime state
  • adding a live client shell
  • integrating bidirectional audio
  • wiring tool calls into the existing reducer-driven session logic
  • adding fallback behavior when live transport fails

This was useful work, but it also created a good internal standard for honesty. It became important to distinguish between:

  • a feature being “wired through the app”
  • a feature being “technically alive”
  • a feature being “good enough to present honestly as a user-facing experience”

A lot of AI product work gets fuzzy on that distinction. This project benefited from repeatedly pulling those apart.

The result is a codebase that now has real AI-related infrastructure and experiments, but still treats deterministic study behavior as the stable center of the app.

That turned out to be the right posture.

A Better Product Through Better Constraints

One of the more surprising themes in the project was that constraints improved the product.

Examples:

  • trying to scrape AnkiWeb forced a rethink that led to a better in-app browser + import handoff
  • a crashing on-device semantic path forced a proper dependency upgrade instead of magical thinking
  • overly broad semantic grading on command decks forced a more human grading policy
  • navigation crashes around import preview forced a more correct SavedStateHandle setup

None of those were “fun” problems in the moment, but they each moved the project toward something sturdier and more coherent.

The app is better because it had to survive those collisions with reality.

What Exists Now

At this point, the project includes a meaningful amount of real functionality:

  • voice-first study sessions
  • spoken prompts and spoken answers
  • persistent review scheduling
  • settings and history
  • deck import from local files
  • .apkg import support
  • AnkiWeb browsing and direct import handoff
  • bundled starter decks
  • semantic grading infrastructure
  • on-device text embeddings for semantic evaluation
  • experimental AI/live-session infrastructure

There is also a growing body of product and platform planning around where the app could go next:

  • Gemini-assisted study features
  • stronger semantic grading policies
  • Wear OS companion support
  • car-aware or Android Auto-adjacent ideas

Not all of those are finished products, but they represent something important: the project is no longer just a pile of features. It has a direction.

What I Learned From Building It

The biggest lesson is that “voice-first study app” sounds smaller than it really is.

You are not just building:

  • a UI
  • a speech recognizer
  • a deck importer

You are building the glue between all of them, and the glue is where most of the actual engineering lives.

Another lesson is that good product behavior often comes from restraint, not ambition.

The best parts of this project are not the ones where the app tries to be magical. They are the parts where it:

  • stays deterministic when it should
  • uses ML as support rather than theater
  • preserves clear state boundaries
  • avoids pretending unstable integrations are already polished product experiences

That kind of discipline is not always flashy, but it is what makes a project feel trustworthy.

What Comes Next

The next stage of work is less about piling on new surfaces and more about sharpening the judgment of the app.

The biggest open question is not “can we add more AI?” It is:

How do we make the app accept the right answers, reject the wrong ones, and feel fair to the learner?

That likely means:

  • better semantic policies
  • deck-sensitive grading behavior
  • clearer settings around evaluation style
  • more real-world testing across different kinds of decks

There is still plenty of room to grow, but the project is now at an interesting point: it already does a lot, and the challenge is no longer proving that the idea can exist. The challenge is making it consistently good.

That is a much better problem to have.

ROS 2 macOS Homebrew Formula


Getting ROS 2 Working on macOS, Then Packaging It for Homebrew

ROS 2 on macOS is one of those things that technically works, but often feels harder than it should. The official source-build path is real, but in practice it can turn into a long chain of dependency issues, middleware decisions, Python problems, Qt mismatches, and package combinations that work on one machine but not another.

I wanted a better answer than “it builds on my laptop.” The goal was to get ROS 2 running reliably on macOS, verify the tools people actually use in the beginner tutorials and early development workflows, and package the result so other developers could install it with Homebrew instead of rebuilding the whole stack from scratch.

That work is now complete. The result is a Homebrew-installable formula called ros2-kilted-core: a tested, curated ROS 2 Kilted environment for macOS.

The problem

ROS 2 is well supported on Linux. On macOS, the story is less polished.

The source-build path exists, but it is easy to end up in a state where the build partially succeeds, some tools launch, others fail, and the final setup is too fragile to recommend to anyone else. A successful compile is not the same thing as a usable development environment.

That was the real problem to solve: not just making ROS 2 build once, but making it practical.

That meant getting the core runtime working, verifying the tools used in the beginner tutorials, making sure the GUI tools actually launched, and confirming that the result could support real development instead of merely surviving a single build command.

What I built

This project started with a source checkout of ROS 2 Kilted on macOS and a curated build of the packages needed for a realistic developer workflow.

That included:

  • building the core ROS 2 runtime on macOS
  • standardizing on Fast DDS as the default supported middleware path
  • verifying demo talker/listener nodes
  • getting turtlesim working
  • validating the beginner CLI tools, including:
    • ros2 node
    • ros2 topic
    • ros2 service
    • ros2 action
    • ros2 param
    • ros2 interface
    • ros2 launch
    • ros2 doctor
    • ros2 bag
  • getting rqt_graph, rqt_console, and rqt_service_caller working on macOS
  • creating a separate tutorial workspace for beginner client-library examples
  • packaging the result into a Homebrew formula

The result is not a theoretical “this should probably work” setup. It is a tested ROS 2 environment for macOS, built from source and packaged for reuse.

The macOS-specific work

A large part of the effort was in solving the smaller platform-specific issues that tend to make ROS 2 on macOS feel unreliable.

That included:

  • choosing a package set broad enough to be useful but small enough to maintain realistically on macOS
  • narrowing the middleware path so runtime behavior stayed predictable
  • handling Python and Qt GUI dependencies cleanly
  • fixing a Qt5/Qt6 header clash affecting turtlesim
  • patching the rqt path so it used a working PyQt setup on macOS
  • dealing with vendor packages that would otherwise try to download sources during the build
  • bundling Python build and runtime tooling in a reproducible way
  • validating the final result outside the original development workspace

In other words, this was less about running one successful build command and more about taking a fragile source build and turning it into a repeatable installation.

What the Homebrew formula installs

The Homebrew formula is called ros2-kilted-core.

It installs a curated ROS 2 macOS build that includes:

  • the core ROS 2 runtime
  • Fast DDS as the supported default RMW path
  • the main ROS 2 CLI tools
  • turtlesim
  • rqt_graph
  • rqt_console
  • rqt_service_caller
  • ros2 bag
  • the rest of the validated beginner and developer toolchain

It is intentionally not a full “everything in ROS 2” desktop distribution. It is a curated macOS-focused build designed to be practical for tutorials and development.

The main benefit is that users do not need to manually clone the source workspace, run vcs import, assemble the Python build environment, or rediscover the same macOS-specific fixes. Homebrew downloads the packaged source bundle and builds from that.
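For contrast, the manual from-source path the formula replaces looks roughly like the following. These are the standard ROS 2 source-build steps; the exact workspace path and repos URL here are illustrative rather than copied from this project:

```shell
# Manual ROS 2 source build the formula is meant to replace (illustrative).
mkdir -p ~/ros2_kilted/src
cd ~/ros2_kilted

# Pull the full source tree for the release.
vcs import --input https://raw.githubusercontent.com/ros2/ros2/kilted/ros2.repos src

# Build everything with colcon (the slow, fragile part on macOS).
colcon build --symlink-install
```

With the tap, all of that collapses to a single brew install command, and the macOS-specific patches come along for free.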

Why I packaged this as a custom Homebrew tap

I packaged this as a custom Homebrew tap rather than submitting it to homebrew/core.

That was the right fit for a few reasons:

  • it is a curated ROS 2 distribution, not a tiny standalone utility
  • it is specifically tuned for macOS
  • it includes a practical set of development and tutorial tools
  • it is easier to maintain and iterate in a dedicated tap than in the main Homebrew formula collection

That means the package is installable through Homebrew, but maintained in its own GitHub repository.

Installation

The Homebrew tap is here:

nigeldaniels/homebrew-ros2-kilted

Install it with:

brew install nigeldaniels/ros2-kilted/ros2-kilted-core

The package uses ros2-kilted-prefixed commands instead of replacing the global ros2 command, which makes it safer to install alongside other ROS environments.

Why this matters

A lot of developers want to experiment with ROS 2 on macOS, work through the tutorials, or do real development without switching to Linux immediately. The source-build path exists, but it is still rough enough that many people give up before they get to the interesting part.

This project makes that path much more approachable.

Instead of “it should work if everything goes right,” the result is now:

  • a working ROS 2 source build on macOS
  • a verified set of beginner and development tools
  • a reusable Homebrew installation path for other developers

That makes ROS 2 on macOS far more practical than it was before.

Final thoughts

ROS 2 on macOS is still not the smoothest platform story in robotics, but it becomes much more usable once the setup is curated, tested, and packaged properly.

That was the point of this work: get ROS 2 working on macOS, make sure the important tooling actually runs, and package it so other developers can install it without repeating the same setup process by hand.

If this saves someone else from spending a weekend chasing build failures, Python issues, middleware confusion, and Qt breakage, then it was worth doing.