Thoughts on the Mistakes of the Social Web

The internet was social before the social web. That part gets forgotten. People talked on IRC, forums, mailing lists, Usenet, AIM, Discord-style rooms, and all kinds of weird little places. The difference was that those places had context. You were not just “an account.” You were a person in a room, with some history, some reputation, and some reason for being there.

If you linked to your own thing in one of those spaces, people usually did not lose their minds as long as it was relevant. The question was not, “Did you make this?” The question was, “Is this useful here?” That is a much healthier standard. Sometimes the best link in the conversation is your own link because you are the person who wrote the thing, built the thing, documented the thing, or found the thing.

The social web broke that in a strange way. It created a huge opening for credibility fraud. Suddenly everyone could perform expertise, manufacture popularity, juice engagement, buy followers, write in brand voice, growth-hack sincerity, and pretend to be a participant while actually acting like a little attention-extraction machine. The feed turned normal human sharing into a suspicious transaction.

So now we live in this dumb world where links are both the blood vessels of the web and somehow treated like contraband. A web with no links is barely the web. It is just a set of private malls with recommendation engines and security guards. “Do not link to your own stuff” sounds noble until you realize it mostly helps people who are already big enough that other people link to them automatically.

PageRank made sense in a world where someone else might find your weird little page and link to it from their weird little page. That was the old bargain: publish something good, and the graph of the web slowly discovers it. But a lot of that middle layer is gone or weakened. Personal sites, blogrolls, directories, small forums, and independent linking culture got paved over by platforms. Now the system still wants backlinks, but the places where people actually gather often punish the behavior needed to create them.

That is one of the great little ironies of the modern web. The machine wants evidence that the world cares, but the world has been trained to treat public linking as spam unless it comes from someone already blessed by the machine. Nice little closed loop there. Very elegant. Completely cursed.

There is an important exception here: I am much less hostile to things like Bluesky, Mastodon, ActivityPub, AT Protocol, and other systems that at least try to make the social layer protocol-shaped instead of purely platform-shaped. That matters. Federation is not magic fairy dust, and protocol people can still be annoying in the very special way protocol people are annoying, but the architecture is pointed in a better direction.

A federated or protocol-based social system is not the same animal as a giant closed platform casino. If identity, distribution, clients, moderation, and hosting can be separated, then users are not trapped in quite the same way. The conversation can move. The client can change. The server can change. Communities can set local norms. The graph is not just locked in some corporate basement next to the engagement-optimization goblin.

That does not make every federated system good. It does not mean Bsky or Mastodon or anything else automatically solves the human problems. People can bring status games, mobs, spam, and weird little dominance rituals anywhere. Give humans a protocol and we will eventually find a way to argue about the chairs. But protocol-based social is at least trying to preserve some of what made the internet good: links, portability, interoperability, local context, and the possibility that no single company gets to be the landlord of human conversation.

So the problem is not “people talking online.” That would be an insane take. The internet is one of the best machines humans ever made for finding each other. The problem is the platform-owned social web: permanent, indexed, engagement-maximized, reputation-scored, and monetized within an inch of its life.

I understand why everyone built social features. In the pre-AI era, if you wanted a big site, users were the cheapest way to get content. Users wrote the posts, uploaded the photos, made the comments, tagged the pages, reviewed the restaurants, liked the posts, ranked the content, argued with each other, moderated each other, and generated the graph. The whole thing rode on the backs of users because paying people to produce and organize all that stuff was expensive as hell.

But that bargain had a cost. The user did not just contribute to the product. The user became the product, the inventory, the moderation problem, the credibility signal, and eventually the unpaid little hamster powering the engagement wheel.

And then because everything was public, permanent, indexed, and monetized, normal social behavior got weird. A casual thought became content. A disagreement became a searchable artifact. A joke became evidence. A person became a profile. A community became a growth channel. Human interaction got shrink-wrapped, barcoded, and stacked on a pallet in the warehouse of the feed.

I do think the social web has value. I am not saying people should stop talking online. But social interaction is often temporary, contextual, and messy. The web is durable, searchable, and decontextualized. Those are not naturally the same thing.

That mismatch is where a lot of the damage came from. We took ephemeral human behavior and made it permanent infrastructure. We took conversations that should have lived in rooms and put them on billboards. Then we acted surprised when everyone got performative, defensive, spammy, paranoid, or insane.

Maybe the healthier split is simple: let the web be good at durable reference, and let social be good at human context. Links, pages, sources, guides, documents, indexes — those belong on the web. Jokes, arguments, half-formed thoughts, “you had to be there” moments, and random social chatter probably belong somewhere smaller, softer, more local, or at least more portable than the giant engagement platforms.

AI changes the economics here. It may now be possible to build useful information systems without forcing users to generate the entire content layer. Machines can parse public sources, organize messy information, summarize, classify, dedupe, and turn scattered material into something usable. Humans can review and steer instead of being mined for every post, like, comment, and scrap of attention.

That does not mean AI slop should replace human culture. Please, God, no. The last thing we need is the web turning into a haunted vending machine full of synthetic LinkedIn posts. But it does mean we may not need to make every useful site into a little social casino anymore.

The mistake of the social web was not that people talked to each other. Talking is good. The mistake was turning talk into permanent content, content into ranking fuel, ranking into status, status into credibility, and credibility into a fraud market.

The web should have links. People should be allowed to point at things. Making something useful and saying “here, I made this” should not automatically be treated like some moral failure. That is how the web breathes.

A link-hostile web is an anti-web. It is a graph afraid of its own edges.

But a protocol-shaped social web? A federated web? A web where people can talk without every conversation becoming feed chum for the same giant machines?

That might be worth saving.

LessVibes release

fuck the claw

lessvibes

lessvibes is an early JetBrains plugin project aimed at a pretty specific problem in the AI coding era: code can appear in your project faster than you can really read it.

The point of lessvibes is not to block AI tools or shame anyone for using them. The point is to make AI-assisted coding more visible and more hands-on. The plugin is meant to notice when a burst of code lands, track whether the affected files were actually opened, and help the developer step through a likely code path instead of just trusting the vibes and moving on.

In its current form, the project is focused on PyCharm first. The rough direction is:

  • track likely assisted or bulk-generated changes
  • show which files were touched
  • show which of those files were never opened
  • show which files were opened but never really edited
  • give the user a way to open a bounded left-to-right code-flow view

The project lives here:

https://github.com/nigeldaniels/lessvibes

Installing It In PyCharm

Right now, the simplest way to install lessvibes is from a locally built plugin zip.

  1. Clone the repo.
  2. From the project root, run:
./gradlew buildPlugin
  3. That produces a plugin archive at:
build/distributions/lessvibes-0.1.0.zip
  4. In PyCharm, open:
Settings / Preferences -> Plugins -> gear icon -> Install Plugin from Disk...
  5. Select the generated zip file.
  6. Restart the IDE if PyCharm asks.

After restart, the plugin should appear as the lessvibes tool window.

Installing It In Similar JetBrains IDEs

Because lessvibes is a JetBrains plugin, it may also load in similar IntelliJ-platform IDEs. That said, this project is being built with PyCharm in mind first, so anything outside PyCharm should be treated as experimental for now.

The install flow is basically the same:

  1. Build the plugin zip with ./gradlew buildPlugin
  2. Open the target JetBrains IDE
  3. Use Install Plugin from Disk...
  4. Pick the generated zip
  5. Restart the IDE

Important Warning

This project is still very much a work in progress.

It is mostly untested, the heuristics are still rough, and the code-flow logic is best-effort rather than guaranteed runtime truth. If you try it today, you should expect edges, gaps, and wrong guesses.

That said, the idea is real, the first plugin scaffold exists, and contributions are absolutely welcome.

If this problem sounds interesting to you, open an issue, send a PR, or just poke around the code here:

https://github.com/nigeldaniels/lessvibes

LessVibes: Because apparently I needed to vibe code a plugin to help me vibe code less

lessvibes: a plugin for making sure I do not become a fucking moron any faster than time already requires

I have been thinking a lot about AI coding tools lately.

Not in the fake moral-panic way. Not in the “real programmers type every character by hand in a dark room lit only by vim” way. I use the tools. The tools are useful. Anyone pretending otherwise is either lying or writing Java for fun.

The problem is not that AI coding tools are bad.

The problem is that they are too convenient in exactly the wrong place.

They make it dangerously easy to end up with code in your repo that you did not really read, did not really trace, and do not really understand. A giant slab of output appears, you skim it, rename two variables, maybe run the tests, and now congratulations: you are responsible for a system you technically approved but do not actually know.

This seems bad.

So I have been working on a plugin idea called lessvibes.

The basic goal is simple: I want something in the editor that pushes back, at least a little, against the smooth-brained workflow of sitting there like a fucking lemming waiting to click Accept.

Because let’s be honest, something has to.

If nothing pushes back on that loop, a lot of us are going to get dumber. Not overnight. Not in some dramatic sci-fi way. Just slowly, comfortably, one magical completion at a time, until our main skill is recognizing when the machine produced something that looks plausible.

That is not a great direction for software development, or for my own brain, which never got much exercise to begin with.

What lessvibes is supposed to do

The core idea behind lessvibes is that the IDE should notice when a coding session has turned into something passive.

Not “evil.” Not “fraudulent.” Just passive.

If a giant block of code lands in the editor faster than any human would normally type it, that should be treated as a moment worth paying attention to. Not because generated code is automatically wrong, but because that is the exact moment where understanding tends to quietly fall off a cliff.

So the plugin would try to estimate how a session is actually happening.

Not with fake certainty. Not with some creepy fantasy that it can prove who “really authored” every line. More like a transparent, best-effort read based on signals such as:

  • manual typing and editing
  • paste events and bulk insertions
  • accepted AI completions when the IDE exposes them
  • edits that arrive much faster than a human would normally type
  • file opens, tab focus, scrolling, dwell time, follow-up edits
  • whether the files that got changed were ever actually opened and manually touched afterward
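
One of those signals, edits that arrive much faster than a human would type, can be sketched as a plain rate check. This is a minimal illustration and not the plugin's actual heuristic: `EditEvent`, `isLikelyBulkInsert`, and both thresholds are hypothetical names and numbers picked for the example.

```kotlin
// Hypothetical event shape: how many characters an edit inserted and how long
// it took to arrive. The real plugin may track something quite different.
data class EditEvent(val insertedChars: Int, val elapsedMillis: Long)

// Assumed thresholds, for illustration only; real values would need tuning.
val MAX_HUMAN_CHARS_PER_SECOND = 15.0
val MIN_BURST_CHARS = 80

// Returns true when an edit is both large and faster than plausible typing.
fun isLikelyBulkInsert(event: EditEvent): Boolean {
    if (event.insertedChars < MIN_BURST_CHARS) return false
    // Treat zero or negative durations as instantaneous (for example a paste).
    if (event.elapsedMillis <= 0) return true
    val charsPerSecond = event.insertedChars * 1000.0 / event.elapsedMillis
    return charsPerSecond > MAX_HUMAN_CHARS_PER_SECOND
}
```

A check like this only says "this looked like a burst." It says nothing about who wrote the code, which fits the best-effort framing above.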

The point is not to solve philosophy.

The point is to notice when the workflow has quietly become:

accept output, skim it badly, and emotionally hope for the best

That is a real workflow now. A lot of people are doing it. Some of them are doing it successfully, which honestly makes it even more dangerous.

The part I actually care about

The feature I like most is the code-flow view.

If the plugin decides a change was probably AI-generated, heavily pasted, or otherwise suspiciously vibey, it should open a bounded left-to-right execution path through the code.

Up to five panes.

The leftmost pane starts at the main entry point. Then each pane to the right shows the next relevant file in the flow. Not the whole dependency graph. Not one of those giant architecture diagrams that looks like a serial killer’s wall. Just the important path, bounded on purpose, so a normal human can actually follow it.
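
The bounded part is easy to sketch. Assuming some earlier step has already guessed, for each file, which file execution flows into next (the `nextFile` map here is hypothetical, as is the `boundedFlow` name), the pane list is just a capped walk:

```kotlin
// Hypothetical input: for each file, the next file the plugin guesses
// execution flows into. How that guess is made is out of scope here.
fun boundedFlow(
    entryPoint: String,
    nextFile: Map<String, String>,
    maxPanes: Int = 5,
): List<String> {
    val path = mutableListOf<String>()
    val seen = mutableSetOf<String>()
    var current: String? = entryPoint
    // Stop at the pane limit, at the end of the chain, or on a cycle.
    while (current != null && path.size < maxPanes && seen.add(current)) {
        path.add(current)
        current = nextFile[current]
    }
    return path
}
```

The cycle guard matters more than it looks: two files that call each other should produce two panes, not an infinite loop.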

That boundedness matters.

A lot of developer tools confuse “more complete” with “more useful.” I do not want a giant spiderweb of boxes and arrows that makes me want to close my laptop and go lie on the floor. I want something that says:

hey idiot, it starts here, then goes here, then here, then here. maybe read these before you pretend you understand the feature

That is the product.

And if the project uses containers, the leftmost pane should be split horizontally. Top half: the main code entry point. Bottom half: the Dockerfile or Compose file. Because a surprising amount of confusion in modern software is not just “what does this code do?” It is “what the hell is the runtime context?” The app starts in one place, the environment is defined somewhere else, and both are easy to ignore when the code arrived in your editor in a burst of machine confidence 30 seconds ago.

So the plugin is not just about surfacing generated code. It is about surfacing execution and context together, in a way that is fast enough that I might actually use it instead of admiring it once and forgetting it exists.

The real workflow I want

What I want the plugin to encourage is an older and healthier loop:

open -> read -> edit

Not:

accept -> vibe -> move on

So lessvibes should care about things like:

  • which generated files were never opened
  • which generated files were opened but never manually edited
  • which inserted blocks later got revised by hand
  • whether the important files in a generated path ever saw real follow-up edits
  • whether I spent time reading and navigating, or just waved the code through customs
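
A rough sketch of how the first few of those buckets could be computed, assuming the plugin keeps a per-file record of opens and manual edits. `GeneratedFile`, `ReviewStatus`, and `summarize` are hypothetical names for illustration, not the plugin's real model.

```kotlin
enum class ReviewStatus { NEVER_OPENED, OPENED_NOT_EDITED, EDITED_BY_HAND }

// Hypothetical per-file record for one generated change.
data class GeneratedFile(
    val path: String,
    val wasOpened: Boolean,
    val manualEditCount: Int,
)

// A file that was never opened cannot have been read, so that check wins.
fun reviewStatus(file: GeneratedFile): ReviewStatus = when {
    !file.wasOpened -> ReviewStatus.NEVER_OPENED
    file.manualEditCount == 0 -> ReviewStatus.OPENED_NOT_EDITED
    else -> ReviewStatus.EDITED_BY_HAND
}

// Groups file paths by status so a tool window could list each bucket.
fun summarize(files: List<GeneratedFile>): Map<ReviewStatus, List<String>> =
    files.groupBy({ reviewStatus(it) }, { it.path })
```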

To me, those are more meaningful metrics than some bullshit chart about lines added.

I do not care how many lines “I wrote” if a glorified autocomplete snowblower dumped them into my repo and I never looked back. The meaningful question is whether I actually went into the code after it appeared.

That is what matters.

Not purity. Not virtue. Not pretending I am above the tools. Just evidence that my brain stayed in the loop.

What this is not

This is not an anti-AI project.

It is also not a surveillance tool, and if it turns into one the whole thing should be thrown in the trash.

It should not block AI. It should not shame people for using AI. It should not send sensitive code off-machine by default. It should not pretend it always knows exactly what happened. And it definitely should not become one of those dead-eyed enterprise dashboards where some manager decides Steve was only 63% hands-on this sprint and therefore needs a meeting.

Absolutely not.

If this idea works at all, it only works if users trust it.

So the defaults should be:

  • local-first
  • clear about what signals are being tracked
  • no hidden scoring
  • opt-in nudges
  • exportable or resettable session history

Basically: helpful, not narc shit.

Why I think this matters

AI coding tools are very good at helping code appear.

They are much worse at helping you metabolize what just happened.

That is the gap.

Right now the ecosystem is mostly optimized around speed, convenience, and the pleasant little dopamine hit of watching the diff get bigger. But “the diff got bigger” and “I understand the system better” are very much not the same thing. In fact, they may now be moving in opposite directions.

And I do not think the answer is fake purity where everyone pretends to reject the tools and return to some mythological golden age of hand-crafted software. The tools are here. They are useful. I am going to use them. Most people are.

What I want is some counter-pressure.

Something that makes it a little harder to drift into a state where I am technically present for the coding session but functionally just acting as a biological approval button.

Because that is the real failure mode here.

Not evil AI. Not the end of programming. Just a slow humiliating slide into becoming the guy who watches the machine cook and occasionally says, “yeah that looks right.”

I would rather avoid that outcome if possible.

So yeah

That is lessvibes.

A plugin for making sure I do not become a fucking moron any faster than time already requires.

If AI is going to stay in the editor, then the editor should do more than help me accept code. It should help me understand what just landed, where it flows, what I have actually looked at, and whether I touched the important parts with my own brain before I ship some haunted Kotlin side project into the world.

That seems like a worthwhile improvement.

Merit, But Make It Legible

One of the more irritating features of modern life is that people love to say they reward merit when what they often reward is legibility.

Not raw capability.
Not force of will.
Not how much resistance someone had to push through to become good at something.

Legibility.

Did the achievement arrive in packaging the system already knows how to admire? Did it come with a famous school, recognizable institutions, polished references, family support, clean internships, the right tone, the right posture, the right little trail of approved breadcrumbs? If so, people relax. They call it excellence.

Meanwhile, if someone arrives at similar visible competence through a messier path — sparse resources, little formal support, public materials, self-direction, no safety net, and almost no room for error — the response is often weirdly diminished.

That person becomes scrappy.
Surprisingly strong.
Promising.
Impressive, considering.

Considering what, exactly?

What is being “considered” is usually the absence of prestige decoration. The person may have built nearly the same capability, or in some cases more durable capability, but because they did not emerge from a trusted institutional pipeline, people treat the result as somehow less real. Or more provisional. Or faintly suspicious. They get credit, but in the off-brand, slightly patronizing way society reserves for people who succeeded without first being pre-approved.

This is backwards in an important sense.

The person who had elite schooling, money, family support, institutional legitimacy, and low-friction access to opportunity may in fact be highly capable. None of this automatically disqualifies them. Plenty of advantaged people are genuinely excellent.

But there is still a difference between demonstrating excellence under supportive conditions and constructing yourself under weak ones.

The bootstrap path often demands a set of traits that institutions claim to admire but are not especially good at recognizing in the wild:

  • initiative
  • independence
  • persistence
  • improvisation
  • the ability to learn without structure
  • the ability to continue without validation
  • the ability to recover from mistakes that were actually costly

Those are not decorative virtues. Those are core builder traits.

And yet, because they do not come pre-certified by prestige systems, they are routinely under-read. Not merely under-resourced at the start — under-credited even after the fact.

That distinction matters.

Being under-resourced means you lacked inputs.
Being under-credited means the world misreads what you produced.

Those are different problems.

The first makes the climb harder.
The second makes the summit look smaller than it is.

A lot of evaluators will insist this is not bias, just pragmatism. They will say elite labels are useful proxies. And to be fair, they are. Institutions act as compression algorithms. They save busy people the trouble of asking inconvenient questions like:

  • How hard was this path, actually?
  • How much support was quietly embedded in the background?
  • How much independent force did this person have to generate on their own?
  • How many hidden cushions were mistaken for personal greatness?

These are not questions most systems are built to ask, because they are expensive to answer and mildly destabilizing to the mythology. It is much easier to see Harvard, billionaire parents, polished confidence, and familiar signals, then conclude: obviously exceptional.

Clean. Efficient. Safe.

It is much less comfortable to look at someone who assembled themselves from public materials, intermittent guidance, and sheer stubbornness, then admit that what you are seeing may represent a more violent act of self-construction.

The elite profile is often treated as natural greatness.
The bootstrap profile is often treated as an anomaly.

But anomalies are sometimes just reality showing through the branding.

This does not mean the bootstrap person is always better. That would just be reverse snobbery with better PR. The point is narrower and more important: achievement is frequently judged by how frictionless it looks, not by how much force was required to make it happen.

And force matters.

Especially in domains where the environment is unstable, where there is no syllabus, where support is partial, where nobody is coming to organize your progress for you. In those situations, the ability to move without structure, learn without permission, and continue without applause is not some charming side trait. It is often the thing itself.

That person may not sound as polished.
They may not tell the story as elegantly.
They may not have the right names on the résumé.
They may not know how to perform legitimacy in the dialect gatekeepers prefer.

But sometimes they built more real capability with less help and less slack.

And the world, being the world, often reads that as scrappy instead of formidable.

Which is convenient, because formidable would force people to rethink what they are actually rewarding.

Building VoiceAnki, Part II

Real Decks, Bad Formatting, and the Small Matter of Talking to Your Phone

Last time I wrote about VoiceAnki as the project that started as “what if Anki had a mouth and some manners” and then kept escalating.

This post is the sequel where the app met real decks, real speech errors, and the ancient software engineering tradition of discovering that your clean architecture was, in fact, a suggestion.

The short version:

  • the speech loop got less gullible
  • the grader got more structural
  • the logs stopped being decorative
  • I built a local robot to do smoke tests because my own voice was starting to file HR complaints
  • and we are now close enough to the edge of deterministic grading that the next layer is visible, but still carefully fenced off

This is not an “AI solves education” post.

It is a post about building a voice-first Android study app that has to survive:

  • imported decks with formatting from the cursed earth
  • speech recognition that is usually helpful and occasionally drunk
  • grading policy that has to be fast, fair, and local
  • users who absolutely do not care that the regex looked elegant in your notebook

Demo Decks Lie

There is a phase every voice app gets to enjoy where the demo looks great.

You ask a clean question. You answer with a clean sentence. The recognizer hands you a clean transcript. The evaluator gives you a clean pass. Everyone nods like this was a serious plan all along.

Then you point the app at real material.

That is when you meet answers like:

  • 1. foo2. bar
  • Successful: ... Unsuccessful: ...
  • Pros: ... Cons: ...
  • 1877-78
  • Gen. Milyutin
  • one huge paragraph that starts with the useful bit and then wanders into side quests

Imported decks are not malicious. They are just old, messy, human, and full of local conventions. In other words: exactly the kind of input software tends to hate.

The first big lesson of this branch was that the grader needed to stop pretending every card was basically the same problem. A short person-name fact is not the same thing as a date range. A date range is not the same thing as a compact list. A compact list is not the same thing as a long explanatory answer that a human will naturally summarize instead of reciting bullet-by-bullet like a haunted audiobook.

That sounds obvious now. It was less obvious when the system was still getting away with a lot of fuzzy matching and a relatively small pile of hand-reviewed examples.

Card Shape Beats Raw String Length

The biggest architectural shift in this branch is simple to say and annoyingly non-trivial to implement:

grade by answer shape, not just by answer text

That means the evaluator now spends more effort upfront figuring out what sort of thing it is looking at:

  • short factual answer
  • person name
  • short numeric answer
  • definition
  • compact list
  • explanatory multi-point answer
  • command-like or control-like utterance

Once you have that, the rest of the pipeline gets saner. You stop asking one grading rule to play twelve different sports at once.

We are still keeping the main grading path deterministic and fast. That is not nostalgia; it is product design. If a spoken flashcard app feels like it pauses to hold a committee meeting before deciding whether 1877 to 1878 means 1877-78, the illusion is gone.
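
To make the 1877 to 1878 versus 1877-78 case concrete, here is a small sketch of a deterministic year-range comparison. It is not the VoiceAnki evaluator. `parseYearRange` and `sameYearRange` are hypothetical names, and the accepted separators are an assumption.

```kotlin
// Parses year ranges such as "1877 to 1878" or "1877-78" into (start, end).
// Returns null when the text is not a recognizable year range.
fun parseYearRange(text: String): Pair<Int, Int>? {
    val pattern = Regex("""\s*(\d{4})\s*(?:[-\u2013]|to|through)\s*(\d{2}|\d{4})\s*""")
    val match = pattern.matchEntire(text.lowercase()) ?: return null
    val start = match.groupValues[1].toInt()
    val endText = match.groupValues[2]
    // A two-digit end year borrows its century from the start year.
    val end = if (endText.length == 2) start / 100 * 100 + endText.toInt() else endText.toInt()
    // Ranges that run backwards (including century rollovers) are rejected.
    return if (end >= start) start to end else null
}

// True when both strings describe the same start and end year.
fun sameYearRange(expected: String, spoken: String): Boolean {
    val expectedRange = parseYearRange(expected) ?: return false
    return expectedRange == parseYearRange(spoken)
}
```

Both inputs reduce to the same pair of integers, so the comparison is a cheap equality check with no fuzzy matching involved.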

The user experience needs to feel immediate.

That means the hot path still has to be cheap:

  • classify once
  • prepare candidate structure once
  • compare against compact evidence
  • decide

If later we add something smarter for borderline cases, it has to sit behind that path, not inside it.
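
The shape of that hot path can be sketched in a few lines. Everything here is a toy under stated assumptions: `AnswerShape` has only two cases, the comma test stands in for real classification, and the pass thresholds are invented. None of it is the actual VoiceAnki rubric.

```kotlin
enum class AnswerShape { SHORT_FACT, COMPACT_LIST }

// Everything derived from the stored answer is computed once per card.
data class PreparedCard(val shape: AnswerShape, val expectedTokens: Set<String>)

fun tokens(text: String): Set<String> =
    text.lowercase().split(Regex("[^a-z0-9]+")).filter { it.isNotEmpty() }.toSet()

class HotPathGrader {
    private val cache = HashMap<String, PreparedCard>()

    // Classify and prepare once; later transcripts reuse the cached result.
    fun prepare(cardId: String, answer: String): PreparedCard =
        cache.getOrPut(cardId) {
            // Toy classifier: a comma means "list". Real shapes need more care.
            val shape = if (answer.contains(',')) AnswerShape.COMPACT_LIST else AnswerShape.SHORT_FACT
            PreparedCard(shape, tokens(answer))
        }

    // Compare against compact evidence and decide; no I/O, no model calls.
    fun grade(card: PreparedCard, transcript: String): Boolean {
        if (card.expectedTokens.isEmpty()) return false
        val heard = tokens(transcript)
        val matched = card.expectedTokens.count { it in heard }
        return when (card.shape) {
            AnswerShape.SHORT_FACT -> matched == card.expectedTokens.size
            // Toy policy: a list passes when at least half its tokens are heard.
            AnswerShape.COMPACT_LIST -> matched * 2 >= card.expectedTokens.size
        }
    }
}
```

The design point is the split between `prepare` and `grade`: the expensive, shape-dependent work happens before the user speaks, so the work done at transcript time stays small and predictable.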

Structure Beats Vibes

One of the most useful additions here is a new structured-answer parser. I am not going to dump the entire evaluator recipe into a public post, because some of that is still moving and some of it is the kind of thing you learn by burning hours in log review. But the broad move is worth talking about.

Instead of treating every stored answer as one opaque blob, VoiceAnki now tries to recognize when the answer is actually a structure:

  • a compact list
  • a numbered list
  • a labeled list
  • a longer explanatory list

That sounds modest. It is not modest. It changes the whole feel of grading.

Here is a trimmed version of the parser entry point:

fun parse(answerText: String): StructuredAnswerParse {
    val decoded = decodeAnswerText(answerText)

    // Numbered lists ("1. foo 2. bar") are tried first.
    val numberedItems = extractNumberedItems(decoded)
    if (numberedItems.size >= 2) {
        val items = numberedItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // Otherwise look for labeled items ("Pros: ... Cons: ...").
    val labeledItems = extractLabeledItems(decoded)
    if (labeledItems.size >= 2) {
        val items = labeledItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // Fewer than two items either way: return the default parse.
    return StructuredAnswerParse()
}

That is not magic. It is just the system finally admitting that:

  • formatting matters
  • import damage matters
  • labels matter
  • and if the stored answer is really a list, we should stop grading it like a paragraph that fell down the stairs

Another small but satisfying detail is handling glued list markers. This is the kind of bug that sounds fake until you meet it in the wild:

val source = answerText
    .replace('\n', ' ')
    .replace(Regex("(?<=[a-zA-Z])(?=[1-9][.)-])"), " ")
    .replace("\\s+".toRegex(), " ")
    .trim()

The regex replace in the middle exists because decks really do contain things like foo2. bar, and if you do not split that boundary correctly, you end up evaluating nonsense against nonsense and calling it rigor.
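
To see what that normalization does to a damaged answer, the same chain can be wrapped in a small function. `normalizeListMarkers` is a hypothetical name used only for this example; the body is the excerpt above.

```kotlin
fun normalizeListMarkers(answerText: String): String =
    answerText
        .replace('\n', ' ')
        // Insert a space where a letter runs straight into a marker like "2." or "2)".
        .replace(Regex("(?<=[a-zA-Z])(?=[1-9][.)-])"), " ")
        // Collapse any run of whitespace into a single space.
        .replace("\\s+".toRegex(), " ")
        .trim()
```

With this, `normalizeListMarkers("1. foo2. bar")` comes out as `1. foo 2. bar`, the same result a cleanly formatted two-line list produces. One tradeoff to keep in mind: the lookaround is purely positional, so a token like `v2.0` would also get a space inserted after the `v`.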

The public version of the lesson is:

real grading quality is often won or lost before you ever compare a transcript to anything

If candidate preparation is bad, downstream scoring does not matter much. You are just being wrong with more confidence.

Speech Software Is Mostly About Timing

There is another lie voice products tell when they are young: that speech recognition quality is the main problem.

It is a problem. It is not the only problem. A lot of the actual work is timing, turn-taking, partials, retries, and deciding when not to believe the recognizer’s last word on what just happened.

This branch did a bunch of work in the speech loop itself:

  • carrying multiple alternatives deeper into grading
  • preserving useful partials
  • separating answer listening from control language
  • treating very short answers differently from long ones
  • quietly retrying some short numeric misses instead of immediately punting to the UI

One of the safer excerpts here is the fallback path for partials:

private fun partialFallbackResult(
    error: Int,
    speechStarted: Boolean,
    // Element type assumed to be String, since these are transcript phrases.
    strongPartialPhrases: List<String>,
    partialPhrases: List<String>,
): RecognitionResult.Transcript? {
    if (!speechStarted) {
        return null
    }

    val fallbackPhrases = mergePhrases(
        primary = strongPartialPhrases,
        secondary = partialPhrases,
    )

    if (fallbackPhrases.isEmpty()) {
        return null
    }

    return when (error) {
        SpeechRecognizer.ERROR_NO_MATCH -> RecognitionResult.Transcript(fallbackPhrases)
        else -> null
    }
}

This is one of those changes that sounds small until you look at user experience.

If the user said something real, the recognizer heard enough to produce useful partials, and the final result still collapsed into ERROR_NO_MATCH, the product should not act like the person never spoke. That is the kind of behavior that makes users think the app is being smug on purpose.
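
The excerpt leans on a `mergePhrases` helper that is not shown. One plausible shape for it, purely as a guess and not the real implementation, is an ordered, de-duplicated merge that keeps the stronger partials first:

```kotlin
// Hypothetical stand-in for the mergePhrases helper referenced above:
// keeps primary phrases first, then secondary ones, dropping blanks and
// case-insensitive duplicates while preserving order.
fun mergePhrases(primary: List<String>, secondary: List<String>): List<String> {
    val seen = mutableSetOf<String>()
    return (primary + secondary)
        .map { it.trim() }
        .filter { it.isNotEmpty() && seen.add(it.lowercase()) }
}
```

Order matters here because anything downstream that tries alternatives in sequence will meet the higher-confidence phrases first.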

Arithmetic cards were especially good at exposing this. If the app cannot survive one-word answers like five, it does not matter how clever your long-answer scoring is. Nobody is impressed. They are just annoyed.

So a lot of recent work has been about making the short-answer path feel less brittle without turning the whole system into a thicket of deck-specific hacks.

Fast Matters More Than Fancy

One thing I want to be explicit about: there is a lot of temptation in this space to keep throwing more intelligence at grading until it feels “smart.”

That is not automatically a win.

For VoiceAnki, grading speed is part of the product. The user just spoke. The app needs to respond like it was listening, not like it has submitted a ticket.

That constraint shapes the whole design:

  • keep the deterministic path local
  • keep candidate preparation reusable
  • keep transcript-time scoring bounded
  • do not add a visible “thinking…” pause to the normal loop
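
A rough illustration of what "reusable" and "bounded" can mean in practice, in Python. This is not the actual rubric: the caps, the cache size, and the use of difflib as the comparison are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher
from functools import lru_cache

MAX_TRANSCRIPTS = 3   # assumed cap on recognizer alternatives scored per answer
MAX_CANDIDATES = 8    # assumed cap on accepted answers considered per card


@lru_cache(maxsize=512)
def prepare_candidates(accepted_answers):
    """Normalize a card's accepted answers once; takes a tuple so it can be cached."""
    return tuple(a.strip().lower() for a in accepted_answers[:MAX_CANDIDATES])


def best_score(transcripts, accepted_answers):
    """Compare at most MAX_TRANSCRIPTS x MAX_CANDIDATES pairs and return the best ratio."""
    candidates = prepare_candidates(tuple(accepted_answers))
    best = 0.0
    for transcript in transcripts[:MAX_TRANSCRIPTS]:
        heard = transcript.strip().lower()
        for candidate in candidates:
            if heard == candidate:
                return 1.0  # exact hit: stop early
            best = max(best, SequenceMatcher(None, heard, candidate).ratio())
    return best
```

The point of the hard caps is that the worst case is known ahead of time, so a card with a huge answer list cannot quietly turn into a pause the user can hear.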

There is secret sauce in the exact rubric and decision policy, and I am not going to dump that out here line-by-line. But the public-facing principle is straightforward:

the fast path has to stay boring

If the user notices grading latency, they stop trusting the rhythm of the interaction.

And voice UX is rhythm.

Logs Graduated From Debug Tool to Product Infrastructure

I used to think of logs as something you improve once the interesting engineering is done.

That was cute.

On a speech app, logs are part of the interesting engineering.

A bad miss can come from:

  • speech recognition
  • transcript selection
  • answer-shape classification
  • lexical comparison
  • summary-vs-list policy
  • command/control routing
  • deck formatting

That means “it got this wrong” is not one bug category. It is a small crime scene.

So this branch put a lot more effort into making the logs answer questions like:

  • what transcript did we actually choose?
  • what kind of answer did we think this card wanted?
  • what decision path fired?
  • what evidence made the evaluator accept or reject?

That turns review from:

  • “huh, weird”

into:

  • “the answer was parsed as a structured list, but the wrong branch still ran”
  • “the recognizer had a good partial and then dropped the final”
  • “the card was really a summary-shaped answer, but the evaluator treated it like a raw string match”

That is a much more productive kind of pain.
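
A log record that answers those four questions could be as simple as one JSON object per graded answer. The sketch below is in Python, and every field name in it is hypothetical rather than taken from the app:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GradeLog:
    """One evaluator decision. All field names here are hypothetical."""
    chosen_transcript: str     # what transcript did we actually choose?
    answer_shape: str          # what kind of answer did we think the card wanted?
    decision_path: str         # which branch of the evaluator fired?
    accepted: bool
    evidence: dict = field(default_factory=dict)  # why it accepted or rejected

    def to_line(self):
        # One JSON object per line keeps the log easy to grep and diff.
        return json.dumps(asdict(self), sort_keys=True)
```

For example, `GradeLog("roosevelt", "short", "exact_match", True, {"score": 1.0}).to_line()` produces a single line that can be filtered by `decision_path` during review.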

We Built a Tiny Robot Because Manual Smoke Testing Is a Scam

One of my favorite additions around this branch is a local Pipecat smoke-test agent.

This is not some grand autonomous tutoring system. It is a very specific little goblin.

Its job is:

  1. listen to VoiceAnki through the laptop mic
  2. wait for the phone to stop talking
  3. answer through the laptop speakers
  4. keep doing that long enough to flush out session-loop bugs

That sounds silly. It is also incredibly useful.

The helper has a VoiceAnki-specific prompt, local audio transport, transcript logging, and a blunt little repeat-limit rule so it does not get stuck asking for the question forever:

repeat_limit_rule = f"""
Temporary smoke-test rule:

- Track how many times you have said exactly "can you repeat the question" for the current card.
- If you have already asked {max_repeat_requests} times for the same card, do not ask again.
- Instead, say exactly: I don't know
- Use that forced failure to let VoiceAnki mark the card wrong and move to the next question.
""".strip()

That rule exists because, left to their own devices, voice systems will absolutely form little conversational sinkholes and sit there repeating themselves like two Roombas politely arguing in a closet.
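
The same limit can also be enforced outside the prompt, as a backstop for when the model miscounts. A minimal sketch, assuming the harness can tell cards apart (the `card_id` argument and the `RepeatLimiter` class are hypothetical, not part of the helper described above):

```python
from collections import defaultdict

REPEAT_REQUEST = "can you repeat the question"
GIVE_UP = "I don't know"


class RepeatLimiter:
    """Replace excess repeat requests with a forced failure, per card."""

    def __init__(self, max_repeat_requests=2):
        self.max_repeat_requests = max_repeat_requests
        self._counts = defaultdict(int)

    def filter_reply(self, card_id, reply):
        # Only the exact repeat request (ignoring case and end punctuation) counts.
        if reply.strip().lower().rstrip("?.!") != REPEAT_REQUEST:
            return reply
        if self._counts[card_id] >= self.max_repeat_requests:
            return GIVE_UP
        self._counts[card_id] += 1
        return reply
```

With a limit of 2, the third repeat request for the same card comes back as "I don't know", while a new card starts counting from zero.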

I also finally wrote proper smoke-run capture scripts so the whole thing can run unattended and leave behind artifacts we can review later:

ANDROID_CAPTURE_PID="$(spawn_detached "$ROOT_DIR" "$ANDROID_LOG" \
  "$ADB_BIN" -s "$ADB_SERIAL" logcat -v time \
  VoiceAnkiSpeech:D VoiceAnkiEval:D VoiceAnkiSemantic:D AndroidRuntime:E '*:S')"

PIPECAT_CAPTURE_PID="$(spawn_detached "$PIPECAT_DIR" "$PIPECAT_LOG" \
  "$PIPECAT_PYTHON" agent.py --input-device "$PIPECAT_INPUT_DEVICE" \
  --output-device "$PIPECAT_OUTPUT_DEVICE")"

That gives each run:

  • filtered Android logs
  • Pipecat logs
  • run metadata
  • timestamped folders for later review
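
The folder-plus-metadata part is simple enough to sketch. The real capture scripts are shell; this Python version only illustrates the layout, and the `smoke-` prefix, the `metadata.json` name, and the metadata keys are assumptions:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def create_run_dir(base_dir, metadata):
    """Create a timestamped folder for one smoke run and record its metadata."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = Path(base_dir) / f"smoke-{stamp}"
    # exist_ok=False: refuse to mix artifacts from two runs in one folder.
    run_dir.mkdir(parents=True, exist_ok=False)
    record = {"started_at": stamp, **metadata}
    (run_dir / "metadata.json").write_text(json.dumps(record, indent=2))
    return run_dir
```

The log captures can then write into the returned folder, so each run's Android log, Pipecat log, and metadata stay together.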

It turns out this matters a lot, because manual voice testing is expensive in a very dumb way. You can lose an hour just being the person who says “Roosevelt” into a phone over and over while watching adb logcat scroll by like the Matrix, except less profitable.

Once a little robot can do even part of that for you, bugs start showing up in clusters instead of as rumors.

The Branch Is About More Than Just One Deck

A lot of the pressure for these changes came from history decks, because history decks are very good at producing:

  • long answers
  • compressed spoken summaries
  • date ranges
  • names with ASR drift
  • multi-point answer blobs

But the goal is not “optimize for history.”

That would be a trap.

The real target is broader:

  • explanatory cards where users summarize instead of reciting
  • imported decks with broken structure
  • voice-native equivalence for dates and names
  • command/control phrases coexisting with answer content
  • better handling for cards where exact string equality is just the wrong abstraction

If the implementation only works because the source material happens to be one subject area, that is not a system. That is a souvenir.

What Landed, and What Is Still Moving

A fair amount of the branch is already real:

  • more answer-shape-aware evaluation
  • stronger short-answer handling
  • better transcript preservation
  • richer evaluator logs
  • local Pipecat smoke testing
  • unattended log capture for long runs

There is also important work underway, some of it not committed yet:

  • broader under-acceptance reduction for explanatory multi-point cards
  • cleaner parsing of ugly imported answer text
  • more voice-native normalization for dates and names
  • more explicit decision-source logging
  • more regression tests built from real reviewed misses, not just happy-path examples

That uncommitted work matters because this branch has been one of those very honest engineering branches where the review notes, the smoke-test notes, and the code all inform each other in tight loops.

Or, put less politely: the app keeps finding new ways to be wrong, and I keep taking notes.

That is good. It means the system is meeting reality.

Deterministic Grading Is Better Now, but It Is Not the Final Boss

This is the part where I want to be careful not to oversell the current system.

The deterministic grader is better than it was:

  • more structural
  • less naive
  • more debuggable
  • less likely to reject obviously good answers for ridiculous reasons

That is real progress.

But there is also a limit to how far you want to push deterministic grading before the whole thing turns into an overfitted museum of exceptions and folklore.

That does not mean the deterministic work was wasted.

It means it was the right layer to improve first:

  • command routing
  • control handling
  • structured parsing
  • short-answer resilience
  • person-name behavior
  • list-vs-summary handling
  • observability

Those are foundational. A later model-backed layer should inherit them, not bulldoze them.

That is why the on-device inference work I have been sketching is intentionally narrow and conservative. The likely next step is not “let a model grade everything.” It is closer to:

  • keep the cheap path cheap
  • keep the main loop immediate
  • use on-device adjudication only for a narrow band of borderline long-answer cases
  • keep abstention first-class
  • make it optional and Android-native

In other words: add one careful new tool, not a second religion.
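
As a sketch of what that routing could look like, in Python: the thresholds, the minimum word count, and the `adjudicate` callable are all assumptions, and what the app does with an abstention is left open here.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ABSTAIN = "abstain"


def grade(score, word_count, adjudicate=None, low=0.45, high=0.75, min_words=6):
    """Route one answer. `score` is the deterministic grader's confidence in [0, 1].

    `adjudicate` is a hypothetical callable wrapping an on-device model. It
    returns True, False, or None when it is not confident either way.
    """
    if score >= high:
        return Verdict.ACCEPT          # cheap path: clearly right
    if score < low:
        return Verdict.REJECT          # cheap path: clearly wrong
    if adjudicate is None or word_count < min_words:
        return Verdict.ABSTAIN         # feature off, or answer too short to adjudicate
    opinion = adjudicate()
    if opinion is None:
        return Verdict.ABSTAIN         # the model is allowed to decline
    return Verdict.ACCEPT if opinion else Verdict.REJECT
```

The model is only consulted inside the borderline band and only for longer answers, so the common case never pays for it.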

The Main Lesson So Far

The main lesson from this phase of VoiceAnki is that speech products punish fake abstraction almost immediately.

If your system is too generic, it feels unfair. If it is too clever, it becomes slow. If it is too rigid, users hate it. If it is too permissive, grading stops meaning anything.

The job is to keep finding the narrow path where the app feels:

  • fast
  • fair
  • understandable
  • and boring in the best possible way

Not “maximally AI.” Not “academically pure.” Not “one more heroic regex.”

Just a study loop that feels natural enough that the user forgets how much machinery is underneath it.

And if, along the way, we end up with a better parser, a less gullible speech loop, a tiny local smoke-test goblin, and a cautious roadmap for on-device adjudication, that seems like a pretty decent trade.

Building VoiceAnki: A Voice-First Study App That Kept Growing

What This Project Is

VoiceAnki started as a pretty simple idea: what if flashcard review felt more like a conversation and less like tapping through tiny buttons?

The core goal was to make studying possible in a more hands-free, audio-first way. Instead of treating voice as a gimmick layered on top of a normal flashcard app, the project pushed toward something more opinionated:

  • speak the prompt
  • listen for the answer
  • evaluate the response
  • keep the review loop moving without constant screen interaction

Over time, that turned into a much larger app than the original idea suggested. What exists now is not just a voice button on a flashcard screen. It is a full Android app with a session runtime, deck import pipeline, history, settings, AnkiWeb integration, and an increasingly serious answer-evaluation system.

This post is a look back at the work that went into it, what changed along the way, and what turned out to be harder than expected.

The Starting Point

At the beginning, the product shape was intentionally narrow:

  • Android only
  • local deck storage
  • spoken prompts
  • spoken answers
  • deterministic grading
  • lightweight study history

That focus mattered. It kept the project from immediately collapsing into a vague “AI tutor” idea. The first real work was not around machine learning at all. It was around building a dependable study loop:

  • a card queue
  • review scheduling
  • a reducer-driven session state machine
  • text-to-speech
  • Android speech recognition
  • foreground session behavior so the app could survive longer interactions

That part of the app is still the backbone of everything else. Even the newer AI and semantic work only makes sense because there is already a deterministic study engine underneath it.

Turning It Into a Real App

Once the core loop existed, the app started growing in the more familiar directions any real product eventually has to grow.

The project gained:

  • a home screen that lists decks
  • deck detail views
  • a settings screen for answer mode, speech rate, listening window, and grading behavior
  • session history
  • a persistent Room-backed database
  • DataStore-backed settings

That was the moment it stopped feeling like a prototype and started feeling like an app with real internal structure.

One theme that kept coming up was that nearly every “simple” feature touched more systems than expected. A new setting was never just a toggle. It usually had to travel through:

  • settings storage
  • view models
  • UI state
  • runtime configuration
  • sometimes the session reducer itself

That kind of wiring is not glamorous, but it is what makes later experimentation possible without the whole app turning into spaghetti.

Importing Decks Instead of Pretending

One of the biggest shifts in the project was deciding that the app should not live forever on a demo deck.

That meant building a real import path.

There are two different import stories in the app now:

  1. importing from files
  2. importing from AnkiWeb

The file import work led to a full import pipeline:

  • parse a deck file
  • turn it into an internal draft
  • preview the import
  • commit it into the local database

That draft step turned out to be especially useful. It created a clean boundary between “we successfully fetched or parsed something” and “we are ready to persist it as a real deck.” That became important later when the app started pulling content from the web rather than only from local files.
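
A small Python sketch of that boundary. The tab-separated input format, the class names, and the dict standing in for the local database are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftCard:
    front: str
    back: str


@dataclass(frozen=True)
class ImportDraft:
    deck_name: str
    cards: tuple
    warnings: tuple


def parse_tsv(deck_name, text):
    """Parse 'front<TAB>back' lines into a draft. Nothing is persisted here."""
    cards, warnings = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = [part.strip() for part in line.split("\t")]
        if len(parts) < 2 or not parts[0] or not parts[1]:
            warnings.append(f"line {number}: expected front<TAB>back")
            continue
        cards.append(DraftCard(parts[0], parts[1]))
    return ImportDraft(deck_name, tuple(cards), tuple(warnings))


def commit(draft, database):
    """Persist a previewed draft. `database` is a dict standing in for real storage."""
    if not draft.cards:
        raise ValueError("refusing to commit an empty deck")
    database[draft.deck_name] = list(draft.cards)
    return len(draft.cards)
```

Because parsing returns a draft instead of writing anything, a preview screen can show the card count and the warnings before the user agrees to commit.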

The .apkg path was also a turning point. Anki package import sounds straightforward until you actually have to do it on-device:

  • unzip the package
  • extract and read the SQLite content
  • resolve media references
  • map notes, cards, models, and templates into something your own app understands
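
The first three steps can be sketched in a few lines of Python. This assumes the legacy package layout, where the zip holds a SQLite file named `collection.anki2` and a JSON `media` map; newer Anki exports use different file names and encodings, which this sketch does not handle.

```python
import json
import sqlite3
import tempfile
import zipfile
from pathlib import Path

FIELD_SEPARATOR = "\x1f"  # Anki stores a note's fields in one column, joined by 0x1f


def read_apkg_notes(apkg_path):
    """Return (notes, media) where notes is a list of field lists."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(apkg_path) as package:
            names = set(package.namelist())
            if "collection.anki2" not in names:
                raise ValueError("no collection.anki2 in package")
            package.extract("collection.anki2", tmp)
            # The media file maps archive member names ("0", "1", ...) to real file names.
            media = json.loads(package.read("media")) if "media" in names else {}
        connection = sqlite3.connect(str(Path(tmp) / "collection.anki2"))
        try:
            rows = connection.execute("SELECT flds FROM notes").fetchall()
        finally:
            connection.close()
    notes = [row[0].split(FIELD_SEPARATOR) for row in rows]
    return notes, media
```

The fourth step, mapping notes, models, and templates onto the app's own card type, is where most of the real work lives and is not attempted here.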

That is the kind of work that is easy to underestimate from a distance. It is not especially flashy, but it is exactly the sort of feature that makes an app useful in the real world.

AnkiWeb: From Scraping to a Better Product Decision

AnkiWeb support was one of the most iterative parts of the project.

The first instinct was the one many apps would have: scrape the shared-deck pages and build a native search/detail flow on top of that. That approach looked promising, but it ran straight into the reality of the modern web:

  • JavaScript-heavy pages
  • Cloudflare-style challenge behavior
  • markup that is not stable enough to treat as a public API

The project went through several rounds of trying to make that scraper path more resilient, including:

  • improving network setup and headers
  • hardening HTML parsing
  • using a WebView to render pages instead of assuming static HTML

That work was valuable, but it also taught an important product lesson: sometimes the best engineering move is to change the shape of the feature.

The eventual direction was much better:

  • use a visible in-app browser activity for AnkiWeb
  • let the user browse the real site
  • intercept .apkg downloads in-app
  • store the download privately
  • create an import draft
  • jump straight into the existing preview/import flow

That was a much more honest solution. It stopped fighting the site and started using the app’s own strengths: import, preview, and local persistence.

Making Voice Feel Like the Main Interface

The heart of the app is still the study session runtime.

A lot of the work here was not about adding more UI, but about making the voice loop feel coherent:

  • when prompts are spoken
  • when the app starts listening
  • how long the listening window should last
  • when partial recognition should be trusted
  • when to stop early on a strong answer
  • when to reveal the answer
  • how self-grading and automatic grading fit together

On Android, speech is never just “call the speech API and you’re done.” There are always edge cases:

  • permissions
  • recognizer flavor differences
  • partial results versus final results
  • cancellation timing
  • audio focus
  • device quirks

A lot of this project became an exercise in being honest about those constraints and designing around them instead of pretending they do not exist.

That honesty also showed up in the app’s session state model. The runtime is not a pile of callbacks. It is built around explicit states and events, which makes it much easier to reason about what the app thinks is happening at any given moment.

That structure paid off again and again as more features got layered in.
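
For readers who have not seen the pattern, a reducer-style session model looks roughly like this. The sketch is in Python and the phase and event names are hypothetical; the app's real states and events are richer.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Optional


class Phase(Enum):
    IDLE = auto()
    SPEAKING_PROMPT = auto()
    LISTENING = auto()
    GRADING = auto()
    REVEALING = auto()


class Event(Enum):
    START = auto()
    PROMPT_FINISHED = auto()
    TRANSCRIPT = auto()
    GRADED = auto()
    NEXT_CARD = auto()


@dataclass(frozen=True)
class SessionState:
    phase: Phase = Phase.IDLE
    card_index: int = 0
    last_transcript: Optional[str] = None


def reduce_session(state, event, transcript=None):
    """Pure function: (state, event) -> new state. No side effects."""
    if state.phase is Phase.IDLE and event is Event.START:
        return replace(state, phase=Phase.SPEAKING_PROMPT)
    if state.phase is Phase.SPEAKING_PROMPT and event is Event.PROMPT_FINISHED:
        return replace(state, phase=Phase.LISTENING)
    if state.phase is Phase.LISTENING and event is Event.TRANSCRIPT:
        return replace(state, phase=Phase.GRADING, last_transcript=transcript)
    if state.phase is Phase.GRADING and event is Event.GRADED:
        return replace(state, phase=Phase.REVEALING)
    if state.phase is Phase.REVEALING and event is Event.NEXT_CARD:
        return SessionState(phase=Phase.SPEAKING_PROMPT, card_index=state.card_index + 1)
    return state  # events that do not apply in the current phase are ignored
```

The last line is the useful one for speech: a late recognizer callback that arrives in the wrong phase changes nothing instead of corrupting the session.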

Answer Evaluation: From Exact Matching to Something Smarter

The earliest evaluator was mostly deterministic:

  • normalize text
  • compare against accepted answers
  • allow fuzzy matching where appropriate

That still works well for many cards. In fact, it is still the right answer for:

  • arithmetic
  • spelling
  • short identifiers
  • cases where a near miss should absolutely not pass
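
The normalize, compare, and fuzzy-match steps of that early evaluator can be sketched in Python as follows. The normalization rules, the 0.85 threshold, and the use of difflib are assumptions, not the app's actual settings; the `allow_fuzzy` flag stands in for the cards where a near miss should not pass.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def matches(spoken, accepted_answers, allow_fuzzy=True, threshold=0.85):
    """True if the spoken answer matches any accepted answer."""
    heard = normalize(spoken)
    for answer in accepted_answers:
        target = normalize(answer)
        if heard == target:
            return True
        if allow_fuzzy and SequenceMatcher(None, heard, target).ratio() >= threshold:
            return True
    return False
```

With these settings, `matches("Rosevelt", ["Roosevelt"])` passes as a near miss, while the same call with `allow_fuzzy=False` does not.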

But as soon as the app started touching longer answers and more natural language, the limits became obvious. A strict string-oriented evaluator can be technically consistent while still feeling wrong to a human being.

That led to the semantic grading work.

The first step was not “let AI handle grading.” It was a more conservative plan:

  • keep deterministic matching first
  • add a semantic fallback only when lexical matching is not enough
  • use on-device embeddings rather than a cloud-first model

That design choice mattered. It kept the project grounded. Semantic grading was not supposed to replace the rest of the evaluator. It was supposed to rescue reasonable answers that were being unfairly rejected.

Semantic Grading Turned Out to Be Harder Than the Idea

The semantic work brought some of the most interesting engineering problems in the whole project.

The app now includes:

  • a semantic evaluator
  • an embedding cache
  • a decision policy with accept / unsure / reject bands
  • a bundled sentence-embedding model

But the path there was not smooth.

One of the first real blockers was that the original MediaPipe dependency being used for text embeddings was simply too old. On-device initialization was crashing natively on the target phone. The fix was not a clever code workaround. The real fix was dependency modernization. Once the library was upgraded to a current version, the embedder could initialize successfully.

That was a good reminder that “AI bugs” are often just normal software engineering bugs wearing a more dramatic outfit.

The second challenge was more subtle: just because semantic scoring works does not mean it should be trusted blindly.

This showed up especially clearly on a command-heavy CS50-style deck. Some answers that felt obviously related were accepted. Some answers that felt obviously wrong were also accepted. Other short command answers that a human would probably allow were rejected.

That forced a more nuanced policy:

  • semantic scoring is useful
  • but command-like and syntax-heavy answers need lexical anchors
  • shorthand answers like tail for tail should still be allowed
  • vague phrases like “not sure” should never pass just because an embedding score looks high
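
A policy along those lines might be sketched like this in Python. The thresholds, the vague-phrase list, and in particular the shorthand rule (accept when every spoken token appears in the expected answer) are assumptions chosen for illustration, not the app's real policy:

```python
VAGUE_PHRASES = {"not sure", "i don't know", "no idea"}  # assumed list


def decide(similarity, spoken, expected, command_like,
           accept_at=0.80, reject_below=0.55):
    """Return "accept", "unsure", or "reject" for one answer.

    `similarity` is an embedding similarity in [0, 1] computed elsewhere.
    """
    heard = spoken.lower().split()
    if " ".join(heard) in VAGUE_PHRASES:
        return "reject"  # never let a non-answer ride on a high score
    if command_like:
        expected_tokens = set(expected.lower().split())
        if not expected_tokens.intersection(heard):
            return "reject"  # no lexical anchor: the score alone is not trusted
        if set(heard) <= expected_tokens:
            return "accept"  # shorthand: everything said is part of the answer
    if similarity >= accept_at:
        return "accept"
    if similarity < reject_below:
        return "reject"
    return "unsure"
```

The ordering matters: the guards run before the score is consulted, so a high similarity can only help an answer that has already cleared them.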

That is exactly the kind of product problem that makes this sort of project interesting. The challenge is not just “can the model produce a number?” The challenge is whether the resulting behavior matches what a real learner would expect.

AI Mode and the Difference Between “Plumbing” and “Experience”

Another large branch of work explored a fuller AI mode using Gemini live audio and tool-calling ideas.

This part of the project went through multiple milestones:

  • plumbing mode flags through settings, navigation, and runtime state
  • adding a live client shell
  • integrating bidirectional audio
  • wiring tool calls into the existing reducer-driven session logic
  • adding fallback behavior when live transport fails

This was useful work, but it also created a good internal standard for honesty. It became important to distinguish between:

  • a feature being “wired through the app”
  • a feature being “technically alive”
  • a feature being “good enough to present honestly as a user-facing experience”

A lot of AI product work gets fuzzy on that distinction. This project benefited from repeatedly pulling those apart.

The result is a codebase that now has real AI-related infrastructure and experiments, but still treats deterministic study behavior as the stable center of the app.

That turned out to be the right posture.

A Better Product Through Better Constraints

One of the more surprising themes in the project was that constraints improved the product.

Examples:

  • trying to scrape AnkiWeb forced a rethink that led to a better in-app browser + import handoff
  • a crashing on-device semantic path forced a proper dependency upgrade instead of magical thinking
  • overly broad semantic grading on command decks forced a more human grading policy
  • navigation crashes around import preview forced a more correct SavedStateHandle setup

None of those were “fun” problems in the moment, but they each moved the project toward something sturdier and more coherent.

The app is better because it had to survive those collisions with reality.

What Exists Now

At this point, the project includes a meaningful amount of real functionality:

  • voice-first study sessions
  • spoken prompts and spoken answers
  • persistent review scheduling
  • settings and history
  • deck import from local files
  • .apkg import support
  • AnkiWeb browsing and direct import handoff
  • bundled starter decks
  • semantic grading infrastructure
  • on-device text embeddings for semantic evaluation
  • experimental AI/live-session infrastructure

There is also a growing body of product and platform planning around where the app could go next:

  • Gemini-assisted study features
  • stronger semantic grading policies
  • Wear OS companion support
  • car-aware or Android Auto-adjacent ideas

Not all of those are finished products, but they represent something important: the project is no longer just a pile of features. It has a direction.

What I Learned From Building It

The biggest lesson is that “voice-first study app” sounds smaller than it really is.

You are not just building:

  • a UI
  • a speech recognizer
  • a deck importer

You are building the glue between all of them, and the glue is where most of the actual engineering lives.

Another lesson is that good product behavior often comes from restraint, not ambition.

The best parts of this project are not the ones where the app tries to be magical. They are the parts where it:

  • stays deterministic when it should
  • uses ML as support rather than theater
  • preserves clear state boundaries
  • avoids pretending unstable integrations are already polished product experiences

That kind of discipline is not always flashy, but it is what makes a project feel trustworthy.

What Comes Next

The next stage of work is less about piling on new surfaces and more about sharpening the judgment of the app.

The biggest open question is not “can we add more AI?” It is:

how do we make the app accept the right answers, reject the wrong ones, and feel fair to the learner?

That likely means:

  • better semantic policies
  • deck-sensitive grading behavior
  • clearer settings around evaluation style
  • more real-world testing across different kinds of decks

There is still plenty of room to grow, but the project is now at an interesting point: it already does a lot, and the challenge is no longer proving that the idea can exist. The challenge is making it consistently good.

That is a much better problem to have.