Building VoiceAnki: A Voice-First Study App That Kept Growing
What This Project Is
VoiceAnki started as a pretty simple idea: what if flashcard review felt more like a conversation and less like tapping through tiny buttons?
The core goal was to make studying possible in a more hands-free, audio-first way. Instead of treating voice as a gimmick layered on top of a normal flashcard app, the project pushed toward something more opinionated:
- speak the prompt
- listen for the answer
- evaluate the response
- keep the review loop moving without constant screen interaction
Over time, that turned into a much larger app than the original idea suggested. What exists now is not just a voice button on a flashcard screen. It is a full Android app with a session runtime, deck import pipeline, history, settings, AnkiWeb integration, and an increasingly serious answer-evaluation system.
This post is a look back at the work that went into it, what changed along the way, and what turned out to be harder than expected.
The Starting Point
At the beginning, the product shape was intentionally narrow:
- Android only
- local deck storage
- spoken prompts
- spoken answers
- deterministic grading
- lightweight study history
That focus mattered. It kept the project from immediately collapsing into a vague “AI tutor” idea. The first real work was not around machine learning at all. It was around building a dependable study loop:
- a card queue
- review scheduling
- a reducer-driven session state machine
- text-to-speech
- Android speech recognition
- foreground session behavior so the app could survive longer interactions
That part of the app is still the backbone of everything else. Even the newer AI and semantic work only makes sense because there is already a deterministic study engine underneath it.
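As a rough illustration of what "reducer-driven" means here, the session loop can be sketched as a pure function over explicit states and events. The names below are hypothetical, not the app's actual types:

```kotlin
// Hypothetical sketch of a reducer-driven session state machine.
// State and event names are illustrative, not the app's real API.
sealed interface SessionState {
    data object Idle : SessionState
    data class Prompting(val cardId: Long) : SessionState
    data class Listening(val cardId: Long) : SessionState
    data class Revealed(val cardId: Long, val heard: String?) : SessionState
}

sealed interface SessionEvent {
    data class StartCard(val cardId: Long) : SessionEvent
    data object PromptFinished : SessionEvent
    data class AnswerHeard(val transcript: String) : SessionEvent
    data object ListenTimeout : SessionEvent
}

// A pure reducer: the only place transitions happen, which keeps
// speech callbacks from mutating session state directly.
fun reduce(state: SessionState, event: SessionEvent): SessionState =
    when (event) {
        is SessionEvent.StartCard -> SessionState.Prompting(event.cardId)
        SessionEvent.PromptFinished ->
            (state as? SessionState.Prompting)
                ?.let { SessionState.Listening(it.cardId) } ?: state
        is SessionEvent.AnswerHeard ->
            (state as? SessionState.Listening)
                ?.let { SessionState.Revealed(it.cardId, event.transcript) } ?: state
        SessionEvent.ListenTimeout ->
            (state as? SessionState.Listening)
                ?.let { SessionState.Revealed(it.cardId, null) } ?: state
    }
```

The payoff of this shape is that an out-of-order event (a timeout arriving after the answer was already heard, say) simply leaves the state unchanged instead of corrupting it.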
Turning It Into a Real App
Once the core loop existed, the app started growing in the more familiar directions any real product eventually has to grow.
The project gained:
- a home screen that lists decks
- deck detail views
- a settings screen for answer mode, speech rate, listening window, and grading behavior
- session history
- a persistent Room-backed database
- DataStore-backed settings
That was the moment it stopped feeling like a prototype and started feeling like an app with real internal structure.
One theme that kept coming up was that nearly every “simple” feature touched more systems than expected. A new setting was never just a toggle. It usually had to travel through:
- settings storage
- view models
- UI state
- runtime configuration
- sometimes the session reducer itself
That kind of wiring is not glamorous, but it is what makes later experimentation possible without the whole app turning into spaghetti.
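A stripped-down sketch of that travel path, with hypothetical names and none of the real DataStore/Flow machinery, might look like:

```kotlin
// Illustrative sketch of one setting travelling through layers.
// In the real app this would be DataStore-backed and Flow-driven;
// all names here are hypothetical.

// What the settings store persists.
data class StoredSettings(val listenWindowMs: Long = 8_000, val speechRate: Float = 1.0f)

// What the settings screen renders.
data class SettingsUiState(val listenWindowSeconds: Int, val speechRate: Float)

// What the session runtime actually consumes.
data class SessionConfig(val listenWindowMs: Long)

fun toUiState(s: StoredSettings) =
    SettingsUiState((s.listenWindowMs / 1000).toInt(), s.speechRate)

fun toSessionConfig(s: StoredSettings) = SessionConfig(s.listenWindowMs)
```

Three representations of one value is exactly the "travel" cost described above, and the mapping functions are where a new toggle usually has to be threaded through.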
Importing Decks Instead of Pretending
One of the biggest shifts in the project was deciding that the app should not live forever on a demo deck.
That meant building a real import path.
There are two different import stories in the app now:
- importing from files
- importing from AnkiWeb
The file import work led to a full import pipeline:
- parse a deck file
- turn it into an internal draft
- preview the import
- commit it into the local database
That draft step turned out to be especially useful. It created a clean boundary between “we successfully fetched or parsed something” and “we are ready to persist it as a real deck.” That became important later when the app started pulling content from the web rather than only from local files.
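The draft boundary can be sketched roughly like this; the types are hypothetical, and a trivial tab-separated parser stands in for the real one:

```kotlin
// Sketch of the parse -> draft -> commit boundary.
// Type and function names are illustrative, not the app's real API.
data class DeckDraft(val title: String, val cards: List<Pair<String, String>>)

// Parsing produces a draft but persists nothing yet.
fun parseTsv(text: String): DeckDraft {
    val cards = text.lines()
        .filter { '\t' in it }
        .map { line ->
            val (front, back) = line.split('\t', limit = 2)
            front to back
        }
    return DeckDraft(title = "Imported deck", cards = cards)
}

// Committing is a separate, explicit step: only here does the draft
// become a real deck in the local database.
interface DeckStore { fun insert(draft: DeckDraft): Long }

fun commit(store: DeckStore, draft: DeckDraft): Long = store.insert(draft)
```

Because the preview step sits between `parseTsv` and `commit`, a bad fetch or parse never touches the database.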
The .apkg path was also a turning point. Anki package import sounds straightforward until you actually have to do it on-device:
- unzip the package
- extract and read the SQLite content
- resolve media references
- map notes, cards, models, and templates into something your own app understands
That is the kind of work that is easy to underestimate from a distance. It is not especially flashy, but it is exactly the sort of feature that makes an app useful in the real world.
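As a hedged sketch of the first two steps: an .apkg is a zip archive containing a SQLite database (conventionally named collection.anki2 in older packages; newer ones use different entry names) plus a media manifest. Using only the JDK's zip support:

```kotlin
import java.util.zip.ZipFile

// Sketch of the first steps of on-device .apkg import. Reading the
// notes themselves would additionally need a SQLite connection; this
// only extracts the raw pieces. Checks the legacy entry name only.
class ApkgContents(
    val collectionBytes: ByteArray,   // collection.anki2 — the SQLite database
    val mediaManifest: String,        // JSON map of zip entry name -> real filename
)

fun readApkg(path: String): ApkgContents =
    ZipFile(path).use { zip ->
        val collection = zip.getEntry("collection.anki2")
            ?: error("not an Anki package: collection.anki2 missing")
        val media = zip.getEntry("media")
        ApkgContents(
            collectionBytes = zip.getInputStream(collection).readBytes(),
            mediaManifest = media?.let {
                zip.getInputStream(it).readBytes().decodeToString()
            } ?: "{}",
        )
    }
```

Resolving media references then means walking that manifest and renaming the numbered zip entries back to their real filenames.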
AnkiWeb: From Scraping to a Better Product Decision
AnkiWeb support was one of the most iterative parts of the project.
The first instinct was what many apps would try first: scrape the shared-deck pages and build a native search/detail flow on top of that. That approach looked promising at first, but it ran straight into the reality of the modern web:
- JavaScript-heavy pages
- Cloudflare-style challenge behavior
- markup that is not stable enough to treat as a public API
The project went through several rounds of trying to make that scraper path more resilient, including:
- improving network setup and headers
- hardening HTML parsing
- using a WebView to render pages instead of assuming static HTML
That work was valuable, but it also taught an important product lesson: sometimes the best engineering move is to change the shape of the feature.
The eventual direction became much better:
- use a visible in-app browser activity for AnkiWeb
- let the user browse the real site
- intercept .apkg downloads in-app
- store the download privately
- create an import draft
- jump straight into the existing preview/import flow
That was a much more honest solution. It stopped fighting the site and started using the app’s own strengths: import, preview, and local persistence.
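The interception decision itself is plain logic that, in the real app, would sit inside the browser activity's WebView download listener. A hypothetical version of that check:

```kotlin
// Hypothetical predicate deciding whether an in-app browser download
// should be routed into the import pipeline instead of the system
// downloader. In production this would run in a DownloadListener.
fun isApkgDownload(url: String, contentDisposition: String?): Boolean {
    // Filename hinted by the server, e.g. attachment; filename="Spanish.apkg"
    val hinted = contentDisposition
        ?.substringAfter("filename=", "")
        ?.trim('"', ';', ' ')
        .orEmpty()
    return url.substringBefore('?').endsWith(".apkg", ignoreCase = true) ||
        hinted.endsWith(".apkg", ignoreCase = true)
}
```

Everything after a positive check reuses machinery the app already had: private storage, an import draft, and the preview screen.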
Making Voice Feel Like the Main Interface
The heart of the app is still the study session runtime.
A lot of the work here was not about adding more UI, but about making the voice loop feel coherent:
- when prompts are spoken
- when the app starts listening
- how long the listening window should last
- when partial recognition should be trusted
- when to stop early on a strong answer
- when to reveal the answer
- how self-grading and automatic grading fit together
On Android, speech is never just “call the speech API and you’re done.” There are always edge cases:
- permissions
- recognizer flavor differences
- partial results versus final results
- cancellation timing
- audio focus
- device quirks
A lot of this project became an exercise in being honest about those constraints and designing around them instead of pretending they do not exist.
That honesty also showed up in the app’s session state model. The runtime is not a pile of callbacks. It is built around explicit states and events, which makes it much easier to reason about what the app thinks is happening at any given moment.
That structure paid off again and again as more features got layered in.
Answer Evaluation: From Exact Matching to Something Smarter
The earliest evaluator was mostly deterministic:
- normalize text
- compare against accepted answers
- allow fuzzy matching where appropriate
That still works well for many cards. In fact, it is still the right answer for:
- arithmetic
- spelling
- short identifiers
- cases where a near miss should absolutely not pass
But as soon as the app started touching longer answers and more natural language, the limits became obvious. A strict string-oriented evaluator can be technically consistent while still feeling wrong to a human being.
That led to the semantic grading work.
The first step was not “let AI handle grading.” It was a more conservative plan:
- keep deterministic matching first
- add a semantic fallback only when lexical matching is not enough
- use on-device embeddings rather than a cloud-first model
That design choice mattered. It kept the project grounded. Semantic grading was not supposed to replace the rest of the evaluator. It was supposed to rescue reasonable answers that were being unfairly rejected.
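The layering can be sketched as follows; the Embedder interface stands in for the on-device model, and the threshold is illustrative:

```kotlin
// Sketch of the layered design: lexical match first, semantic fallback
// only when it fails. The Embedder interface stands in for the
// on-device model; names and the threshold are hypothetical.
interface Embedder { fun embed(text: String): FloatArray }

fun cosine(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

fun grade(
    answer: String,
    accepted: List<String>,
    embedder: Embedder,
    threshold: Double = 0.8,
): Boolean {
    // Layer 1: cheap deterministic check.
    if (accepted.any { it.equals(answer.trim(), ignoreCase = true) }) return true
    // Layer 2: semantic fallback, reached only on lexical failure.
    val v = embedder.embed(answer)
    return accepted.any { cosine(v, embedder.embed(it)) >= threshold }
}
```

Note the asymmetry: the semantic layer can only rescue an answer, never overrule a lexical accept, which is what keeps it a fallback rather than a replacement.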
Semantic Grading Turned Out to Be Harder Than the Idea
The semantic work brought some of the most interesting engineering problems in the whole project.
The app now includes:
- a semantic evaluator
- an embedding cache
- a decision policy with accept / unsure / reject bands
- a bundled sentence-embedding model
But the path there was not smooth.
One of the first real blockers was that the MediaPipe dependency used for text embeddings was simply too old. On-device initialization was crashing natively on the target phone. The fix was not a clever code workaround; it was dependency modernization. Once the library was upgraded to a current version, the embedder could initialize successfully.
That was a good reminder that “AI bugs” are often just normal software engineering bugs wearing a more dramatic outfit.
The second challenge was more subtle: just because semantic scoring works does not mean it should be trusted blindly.
This showed up especially clearly on a command-heavy CS50-style deck. Some answers that felt obviously related were accepted. Some answers that felt obviously wrong were also accepted. Other short command answers that a human would probably allow were rejected.
That forced a more nuanced policy:
- semantic scoring is useful
- but command-like and syntax-heavy answers need lexical anchors
- shorthand answers like "tail" for "tail &lt;file&gt;" should still be allowed
- vague phrases like "not sure" should never pass just because an embedding score looks high
That is exactly the kind of product problem that makes this sort of project interesting. The challenge is not just “can the model produce a number?” The challenge is whether the resulting behavior matches what a real learner would expect.
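One way to express that policy in code, with hypothetical names and thresholds:

```kotlin
// Hypothetical guard for command-style cards: the answer needs a
// lexical anchor (the command word itself), and stock hedging phrases
// never pass on embedding score alone. Thresholds are illustrative.
val VAGUE = setOf("not sure", "i don't know", "maybe", "no idea")

fun passesCommandPolicy(answer: String, accepted: String, semanticScore: Double): Boolean {
    val a = answer.trim().lowercase()
    if (a in VAGUE) return false  // never rescued by embeddings
    // The anchor is the first token of the accepted answer, e.g. "tail".
    val anchor = accepted.trim().lowercase().substringBefore(' ')
    val hasAnchor = anchor in a.split(Regex("\\s+"))
    // Shorthand like "tail" for "tail <file>" passes via the anchor;
    // a high semantic score alone is not enough for command answers.
    return hasAnchor && semanticScore >= 0.5
}
```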
AI Mode and the Difference Between “Plumbing” and “Experience”
Another large branch of work explored a fuller AI mode using Gemini live audio and tool-calling ideas.
This part of the project went through multiple milestones:
- plumbing mode flags through settings, navigation, and runtime state
- adding a live client shell
- integrating bidirectional audio
- wiring tool calls into the existing reducer-driven session logic
- adding fallback behavior when live transport fails
This was useful work, but it also created a good internal standard for honesty. It became important to distinguish between:
- a feature being “wired through the app”
- a feature being “technically alive”
- a feature being “good enough to present honestly as a user-facing experience”
A lot of AI product work gets fuzzy on that distinction. This project benefited from repeatedly pulling those apart.
The result is a codebase that now has real AI-related infrastructure and experiments, but still treats deterministic study behavior as the stable center of the app.
That turned out to be the right posture.
A Better Product Through Better Constraints
One of the more surprising themes in the project was that constraints improved the product.
Examples:
- trying to scrape AnkiWeb forced a rethink that led to a better in-app browser + import handoff
- a crashing on-device semantic path forced a proper dependency upgrade instead of magical thinking
- overly broad semantic grading on command decks forced a more human grading policy
- navigation crashes around import preview forced a more correct SavedStateHandle setup
None of those were “fun” problems in the moment, but they each moved the project toward something sturdier and more coherent.
The app is better because it had to survive those collisions with reality.
What Exists Now
At this point, the project includes a meaningful amount of real functionality:
- voice-first study sessions
- spoken prompts and spoken answers
- persistent review scheduling
- settings and history
- deck import from local files
- .apkg import support
- AnkiWeb browsing and direct import handoff
- bundled starter decks
- semantic grading infrastructure
- on-device text embeddings for semantic evaluation
- experimental AI/live-session infrastructure
There is also a growing body of product and platform planning around where the app could go next:
- Gemini-assisted study features
- stronger semantic grading policies
- Wear OS companion support
- car-aware or Android Auto-adjacent ideas
Not all of those are finished products, but they represent something important: the project is no longer just a pile of features. It has a direction.
What I Learned From Building It
The biggest lesson is that “voice-first study app” sounds smaller than it really is.
You are not just building:
- a UI
- a speech recognizer
- a deck importer
You are building the glue between all of them, and the glue is where most of the actual engineering lives.
Another lesson is that good product behavior often comes from restraint, not ambition.
The best parts of this project are not the ones where the app tries to be magical. They are the parts where it:
- stays deterministic when it should
- uses ML as support rather than theater
- preserves clear state boundaries
- avoids pretending unstable integrations are already polished product experiences
That kind of discipline is not always flashy, but it is what makes a project feel trustworthy.
What Comes Next
The next stage of work is less about piling on new surfaces and more about sharpening the judgment of the app.
The biggest open question is not “can we add more AI?” It is:
how do we make the app accept the right answers, reject the wrong ones, and feel fair to the learner?
That likely means:
- better semantic policies
- deck-sensitive grading behavior
- clearer settings around evaluation style
- more real-world testing across different kinds of decks
There is still plenty of room to grow, but the project is now at an interesting point: it already does a lot, and the challenge is no longer proving that the idea can exist. The challenge is making it consistently good.
That is a much better problem to have.