The Company I Always Wanted

I think people misunderstand what I mean when I say I want to build a startup.

They hear startup and think billion-dollar empire, venture capital, hustle culture, some guy on LinkedIn saying “we’re changing the world” because he made a dashboard for invoices or whatever.

And sure, getting rich would be nice. I am not going to pretend money is bad. Money solves real problems. Money buys freedom. Money keeps the lights on.

But that was never really the dream.

The dream was way simpler than that.

I wanted enough money to pay the bills, rent an office, buy some food, keep the computers on, and hack with my friends.

That’s it.

A room. Some desks. Some snacks. Good internet. Maybe a whiteboard. Maybe too many monitors. People arguing about systems and programming languages and product ideas. Somebody building something weird in the corner. Somebody else reading tech news for fun. Somebody pushing a deploy. Somebody saying “wait, what if we just…” and then everyone loses two hours chasing the idea because it might actually work.

That’s the part I always wanted.

Not the empire. Not the press release. Not the fake founder mythology. Not the “we are disrupting X” nonsense.

I wanted the workshop.

The funny thing is my friends already do this. They already build things for fun. They already read about technology for fun. They already have opinions about databases and operating systems and AI and networks and whatever else. They already spend their free time doing the thing most companies have to pay people to pretend to care about.

So part of me is like: why couldn’t we build a company out of that?

Not a company where the joy gets crushed under meetings and process and status games. Not a company where snacks replace compensation or where “we’re a family” means “please work nights for free.”

I mean a real company.

One that makes enough money that people can show up and be paid like adults.

One that has customers, and bills, and boring operational stuff handled, but where the center of gravity is still making things.

A company where the work still feels alive.

I think when we were younger this was easier to understand. If somebody made something on a computer, that was cool. A website, a script, a game, a server doing something weird, whatever. The first reaction was not a policy analysis. It was “whoa, you made that?”

Now everything gets filtered through takes.

Is it good for society? Is it cringe? Is it a startup grift? Is it AI slop? Is it capitalism? Is it replacing someone? Is it going to become a monopoly someday?

Some of those are fair questions. But man, if that is always the first reaction, it kills something.

There used to be a basic joy in making the computer do a thing. I still have that. I do not think I ever lost it.

And I guess I am realizing that the company I always wanted was not really about winning capitalism. It was about making a place where that joy could survive adulthood.

Because adulthood changes things. People have partners, kids, mortgages, health problems, parents getting older, responsibilities. Sitting in a room with friends, free food, and computers does not sound as magical to everyone as it once did. To some people it sounds like one more obligation.

I get that.

So maybe the dream has to grow up too.

Not a hacker house. Not a grind cave. Not “sleep under your desk until we exit.”

More like a sane little lab.

A software shop with a soul.

A place where people can come in, do good work, make real things, get paid, eat lunch, laugh, argue about dumb technical details, and then go home without feeling like the company owns them.

That sounds modest compared to the billion-dollar startup dream, but to me it feels bigger in the ways that matter.

Because the point was never just money.

The point was: can we make our natural behavior economically sustainable?

Can the thing we already love doing become the thing that pays for the room?

Can we build something useful enough that it buys us more time to build more things?

That is the company I always wanted.

Not a unicorn.

A workshop with revenue.

A place where the fridge is full, the servers are running, the work is interesting, and the people in the room still think making stuff on computers is cool.

All your CDN are belong to US

Underminr Detection: DNS Says One Thing, TLS Says Another

The detection story for Underminr is actually pretty simple, which is why it is so annoying.

You are not looking for some magic evil packet. You are looking for the layers of the connection telling different stories.

Normal web traffic looks something like this:

DNS: I want goodsite.com
DNS answer: goodsite.com is at 104.x.x.x
TLS: I am connecting to 104.x.x.x
SNI: goodsite.com
HTTP Host: goodsite.com

Everything lines up. Boring. Fine.

Underminr-style traffic looks more like this:

DNS: I want goodsite.com
DNS answer: goodsite.com is at 104.x.x.x
TLS: I am connecting to 104.x.x.x
SNI: sketchy-domain.ai
HTTP Host: sketchy-domain.ai

That is the trick.

The machine uses an allowed domain to get a perfectly valid CDN IP, then reuses that IP to talk to some other tenant living behind the same CDN.

DNSSEC does not really save you here, because the DNS answer can be totally legitimate. DNSSEC can prove that the DNS answer was authentic. Great. Congratulations. The lie happens later, at the TLS / HTTP / CDN routing layer.

So the detector has to correlate things across layers:

What DNS name did this endpoint ask for?
What IP did DNS return?
What IP did the endpoint connect to?
What SNI did it present?
What HTTP Host header did it use?
Did that hostname ever get resolved normally?

In other words, you are looking for this mismatch:

Resolved: harmless-site.com -> CDN edge IP
Connected to: CDN edge IP
Claimed SNI: evil-site.com

That is the “oh come on” moment.

The really cursed version is when the attacker splits it into steps. First they make a clean-looking connection to the allowed domain, then they come back to the same CDN edge IP and swap the SNI / Host header to the real target.

So detection has to remember recent DNS answers and recent CDN connections per machine. It is not enough to look at one packet in isolation.

The direct-to-IP version is even more annoying because there may be no DNS lookup for the bad domain at all. Then the question becomes:

Why is this machine connecting directly to a shared CDN IP
while presenting an SNI / Host name that it never resolved?

That is suspicious as hell.

And with ECH, because of course we needed one more layer of pain, the real SNI can be encrypted. At that point you need endpoint visibility, controlled DNS, policy around HTTPS / SVCB records, or some way to block or strip ECH in managed environments.

So the whole thing boils down to this:

DNS says one thing. TLS / HTTP says another. The CDN accepts both because shared edge infrastructure is a haunted house.

Honestly, it feels like Dan Kaminsky may have faked his own death and come back on the dark side.

Obviously joking. Mostly.

But this is exactly the kind of DNS-adjacent, “everything technically works as designed and that is the problem” nonsense that makes the internet beautiful and horrible at the same time.

The practical detection rule is basically:

For each endpoint:
    remember recent DNS answers:
        allowed-domain.com -> CDN_IP
    watch outbound TLS / HTTP:
        endpoint -> CDN_IP:443
        SNI / Host = some-other-domain.com
    alert if:
        the CDN IP came from a recent allowed DNS answer
        but the SNI / Host does not match that DNS name
        and the SNI / Host name was not separately resolved
        in a normal allowed way
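Here is a minimal sketch of that rule in Python, mostly to show how little state it needs. Everything named here is an assumption for illustration: the class name, the event methods, the five-minute memory window, and the idea that some sensor hands you (endpoint, name, IP) for DNS answers and (endpoint, IP, SNI) for TLS connects. It also treats "separately resolved" as "this endpoint recently resolved that name to this same IP," which is a simplification.

```python
import time

DNS_MEMORY_SECONDS = 300  # assumed window for "recent"; tune per environment


class CrossLayerDetector:
    """Per-endpoint memory of DNS answers, checked against later TLS connects."""

    def __init__(self, allowed_domains, memory_seconds=DNS_MEMORY_SECONDS):
        self.allowed = {d.lower() for d in allowed_domains}
        self.memory_seconds = memory_seconds
        # (endpoint, ip) -> {resolved name: timestamp of the latest answer}
        self.answers = {}

    def on_dns_answer(self, endpoint, name, ip, now=None):
        now = time.time() if now is None else now
        self.answers.setdefault((endpoint, ip), {})[name.lower()] = now

    def on_tls_connect(self, endpoint, ip, sni, now=None):
        """Return an alert dict, or None when DNS and TLS tell the same story."""
        now = time.time() if now is None else now
        sni = sni.lower()
        seen = self.answers.get((endpoint, ip), {})
        recent = {n for n, t in seen.items() if now - t <= self.memory_seconds}
        if sni in recent:
            return None  # this endpoint really did resolve that name to this IP
        allowed_recent = sorted(recent & self.allowed)
        if allowed_recent:
            reason = "sni_mismatch"  # IP borrowed from an allowed lookup
        elif not recent:
            reason = "unresolved_sni"  # direct-to-IP: no recent lookup at all
        else:
            return None  # IP came from a non-allowed lookup; outside this rule
        return {"endpoint": endpoint, "ip": ip, "sni": sni,
                "resolved_as": allowed_recent, "reason": reason}
```

In practice the "unresolved_sni" branch would be noisy unless it is scoped to known shared CDN ranges, since plenty of clients connect using DNS answers older than any sane memory window. The same check applies to the HTTP Host header wherever you can see it.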

That is Underminr detection.

Not “block the IP,” because the IP is probably Cloudflare, Akamai, Fastly, or some other giant shared CDN edge.

Not “DNSSEC fixes it,” because DNSSEC only proves the DNS answer was real.

The actual problem is that the trust decision was made on one name, and the connection was later used for another name.

It is a cross-layer trust bug.

The internet has a lot of those.

Because apparently we enjoy pain.

Thoughts on the Mistakes of the Social Web

The internet was social before the social web. That part gets forgotten. People talked on IRC, forums, mailing lists, Usenet, AIM, Discord-style rooms, and all kinds of weird little places. The difference was that those places had context. You were not just “an account.” You were a person in a room, with some history, some reputation, and some reason for being there.

If you linked to your own thing in one of those spaces, people usually did not lose their minds as long as it was relevant. The question was not, “Did you make this?” The question was, “Is this useful here?” That is a much healthier standard. Sometimes the best link in the conversation is your own link because you are the person who wrote the thing, built the thing, documented the thing, or found the thing.

The social web broke that in a strange way. It created a huge opening for credibility fraud. Suddenly everyone could perform expertise, manufacture popularity, juice engagement, buy followers, write in brand voice, growth-hack sincerity, and pretend to be a participant while actually acting like a little attention-extraction machine. The feed turned normal human sharing into a suspicious transaction.

So now we live in this dumb world where links are both the blood vessels of the web and somehow treated like contraband. A web with no links is barely the web. It is just a set of private malls with recommendation engines and security guards. “Do not link to your own stuff” sounds noble until you realize it mostly helps people who are already big enough that other people link to them automatically.

PageRank made sense in a world where someone else might find your weird little page and link to it from their weird little page. That was the old bargain: publish something good, and the graph of the web slowly discovers it. But a lot of that middle layer is gone or weakened. Personal sites, blogrolls, directories, small forums, and independent linking culture got paved over by platforms. Now the system still wants backlinks, but the places where people actually gather often punish the behavior needed to create them.

That is one of the great little ironies of the modern web. The machine wants evidence that the world cares, but the world has been trained to treat public linking as spam unless it comes from someone already blessed by the machine. Nice little closed loop there. Very elegant. Completely cursed.

There is an important exception here: I am much less hostile to things like Bluesky, Mastodon, ActivityPub, AT Protocol, and other systems that at least try to make the social layer protocol-shaped instead of purely platform-shaped. That matters. Federation is not magic fairy dust, and protocol people can still be annoying in the very special way protocol people are annoying, but the architecture is pointed in a better direction.

A federated or protocol-based social system is not the same animal as a giant closed platform casino. If identity, distribution, clients, moderation, and hosting can be separated, then users are not trapped in quite the same way. The conversation can move. The client can change. The server can change. Communities can set local norms. The graph is not just locked in some corporate basement next to the engagement-optimization goblin.

That does not make every federated system good. It does not mean Bluesky or Mastodon or anything else automatically solves the human problems. People can bring status games, mobs, spam, and weird little dominance rituals anywhere. Give humans a protocol and we will eventually find a way to argue about the chairs. But protocol-based social is at least trying to preserve some of what made the internet good: links, portability, interoperability, local context, and the possibility that no single company gets to be the landlord of human conversation.

So the problem is not “people talking online.” That would be an insane take. The internet is one of the best machines humans ever made for finding each other. The problem is the platform-owned social web: permanent, indexed, engagement-maximized, reputation-scored, and monetized within an inch of its life.

I understand why everyone built social features. In the pre-AI era, if you wanted a big site, users were the cheapest way to get content. Users wrote the posts, uploaded the photos, made the comments, tagged the pages, reviewed the restaurants, liked the posts, ranked the content, argued with each other, moderated each other, and generated the graph. The whole thing rode on the backs of users because paying people to produce and organize all that stuff was expensive as hell.

But that bargain had a cost. The user did not just contribute to the product. The user became the product, the inventory, the moderation problem, the credibility signal, and eventually the unpaid little hamster powering the engagement wheel.

And then because everything was public, permanent, indexed, and monetized, normal social behavior got weird. A casual thought became content. A disagreement became a searchable artifact. A joke became evidence. A person became a profile. A community became a growth channel. Human interaction got shrink-wrapped, barcoded, and stacked on a pallet in the warehouse of the feed.

I do think the social web has value. I am not saying people should stop talking online. But social interaction is often temporary, contextual, and messy. The web is durable, searchable, and decontextualized. Those are not naturally the same thing.

That mismatch is where a lot of the damage came from. We took ephemeral human behavior and made it permanent infrastructure. We took conversations that should have lived in rooms and put them on billboards. Then we acted surprised when everyone got performative, defensive, spammy, paranoid, or insane.

Maybe the healthier split is simple: let the web be good at durable reference, and let social be good at human context. Links, pages, sources, guides, documents, indexes — those belong on the web. Jokes, arguments, half-formed thoughts, “you had to be there” moments, and random social chatter probably belong somewhere smaller, softer, more local, or at least more portable than the giant engagement platforms.

AI changes the economics here. It may now be possible to build useful information systems without forcing users to generate the entire content layer. Machines can parse public sources, organize messy information, summarize, classify, dedupe, and turn scattered material into something usable. Humans can review and steer instead of being mined for every post, like, comment, and scrap of attention.

That does not mean AI slop should replace human culture. Please, God, no. The last thing we need is the web turning into a haunted vending machine full of synthetic LinkedIn posts. But it does mean we may not need to make every useful site into a little social casino anymore.

The mistake of the social web was not that people talked to each other. Talking is good. The mistake was turning talk into permanent content, content into ranking fuel, ranking into status, status into credibility, and credibility into a fraud market.

The web should have links. People should be allowed to point at things. Making something useful and saying “here, I made this” should not automatically be treated like some moral failure. That is how the web breathes.

A link-hostile web is an anti-web. It is a graph afraid of its own edges.

But a protocol-shaped social web? A federated web? A web where people can talk without every conversation becoming feed chum for the same giant machines?

That might be worth saving.

A Love Letter to San Leandro

San Leandro is the kind of city that reveals itself slowly. https://sanleandrodaily.com

It does not usually announce itself with the same volume as some of its neighbors. It does not have to. Its character lives in smaller, steadier things: a public library that matters, neighborhoods that still feel like neighborhoods, local restaurants people genuinely return to, parks that get used, and a shoreline that can change the tone of an entire day.

That quiet substance is part of why I keep coming back to it.

San Leandro feels practical in the best sense of the word. It feels lived in. It feels useful. It is a place where civic infrastructure still matters, where community programming is not an afterthought, and where small businesses still help define the texture of daily life. There is a kind of dignity in that. A city does not need to perform for the outside world to be worth loving.

From a technical point of view, San Leandro is more interesting than people sometimes realize. It has real industrial history, a meaningful business base, and even its own unusual infrastructure story through Lit San Leandro and the city’s long-running interest in connectivity and modern economic development. That combination of public life, local identity, and technical ambition is rare. It is one of the reasons building software for this city feels so worthwhile.

That is also the spirit behind San Leandro Daily.

The goal is not to build a generic city app and stamp a local name on it. The goal is to build something that respects the actual rhythms of San Leandro: library events, public meetings, neighborhood happenings, family programs, local deals, and the many small signals that tell you a city is alive if you are paying attention. Under the hood, the work is fairly straightforward on purpose: a FastAPI backend, PostgreSQL for content, and a cross-platform mobile stack that keeps the product maintainable. The technology matters, but only because it helps the city show up more clearly.

That is the heart of it for me. San Leandro deserves clear attention. It deserves tools that make it easier to see what is already here.

Some cities demand to be noticed. San Leandro rewards noticing.

LessVibes release

fuck the claw

lessvibes

lessvibes is an early JetBrains plugin project aimed at a pretty specific problem in the AI coding era: code can appear in your project faster than you can really read it.

The point of lessvibes is not to block AI tools or shame anyone for using them. The point is to make AI-assisted coding more visible and more hands-on. The plugin is meant to notice when a burst of code lands, track whether the affected files were actually opened, and help the developer step through a likely code path instead of just trusting the vibes and moving on.

In its current form, the project is focused on PyCharm first. The rough direction is:

  • track likely assisted or bulk-generated changes
  • show which files were touched
  • show which of those files were never opened
  • show which files were opened but never really edited
  • give the user a way to open a bounded left-to-right code-flow view

The project lives here:

https://github.com/nigeldaniels/lessvibes

Installing It In PyCharm

Right now, the simplest way to install lessvibes is from a locally built plugin zip.

  1. Clone the repo.
  2. From the project root, run:
     ./gradlew buildPlugin
  3. That produces a plugin archive at:
     build/distributions/lessvibes-0.1.0.zip
  4. In PyCharm, open:
     Settings / Preferences -> Plugins -> gear icon -> Install Plugin from Disk...
  5. Select the generated zip file.
  6. Restart the IDE if PyCharm asks.

After restart, the plugin should appear as the lessvibes tool window.

Installing It In Similar JetBrains IDEs

Because lessvibes is a JetBrains plugin, it may also load in similar IntelliJ-platform IDEs. That said, this project is being built with PyCharm in mind first, so anything outside PyCharm should be treated as experimental for now.

The install flow is basically the same:

  1. Build the plugin zip with ./gradlew buildPlugin
  2. Open the target JetBrains IDE
  3. Use Install Plugin from Disk...
  4. Pick the generated zip
  5. Restart the IDE

Important Warning

This project is still very much a work in progress.

It is mostly untested, the heuristics are still rough, and the code-flow logic is best-effort rather than guaranteed runtime truth. If you try it today, you should expect sharp edges, gaps, and wrong guesses.

That said, the idea is real, the first plugin scaffold exists, and contributions are absolutely welcome.

If this problem sounds interesting to you, open an issue, send a PR, or just poke around the code here:

https://github.com/nigeldaniels/lessvibes

LessVibes: Because apparently I needed to vibe code a plugin to help me vibe code less

lessvibes: a plugin for making sure I do not become a fucking moron any faster than time already requires

I have been thinking a lot about AI coding tools lately.

Not in the fake moral-panic way. Not in the “real programmers type every character by hand in a dark room lit only by vim” way. I use the tools. The tools are useful. Anyone pretending otherwise is either lying or writing Java for fun.

The problem is not that AI coding tools are bad.

The problem is that they are too convenient in exactly the wrong place.

They make it dangerously easy to end up with code in your repo that you did not really read, did not really trace, and do not really understand. A giant slab of output appears, you skim it, rename two variables, maybe run the tests, and now congratulations: you are responsible for a system you technically approved but do not actually know.

This seems bad.

So I have been working on a plugin idea called lessvibes.

The basic goal is simple: I want something in the editor that pushes back, at least a little, against the smooth-brained workflow of sitting there like a fucking lemming waiting to click Accept.

Because let’s be honest, something has to.

If nothing pushes back on that loop, a lot of us are going to get dumber. Not overnight. Not in some dramatic sci-fi way. Just slowly, comfortably, one magical completion at a time, until our main skill is recognizing when the machine produced something that looks plausible.

That is not a great direction for software development, or for my own brain, which never got much exercise to begin with.

What lessvibes is supposed to do

The core idea behind lessvibes is that the IDE should notice when a coding session has turned into something passive.

Not “evil.” Not “fraudulent.” Just passive.

If a giant block of code lands in the editor faster than any human would normally type it, that should be treated as a moment worth paying attention to. Not because generated code is automatically wrong, but because that is the exact moment where understanding tends to quietly fall off a cliff.

So the plugin would try to estimate how a session is actually happening.

Not with fake certainty. Not with some creepy fantasy that it can prove who “really authored” every line. More like a transparent, best-effort read based on signals such as:

  • manual typing and editing
  • paste events and bulk insertions
  • accepted AI completions when the IDE exposes them
  • edits that arrive much faster than a human would normally type
  • file opens, tab focus, scrolling, dwell time, follow-up edits
  • whether the files that got changed were ever actually opened and manually touched afterward
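To make the "faster than a human would normally type" signal concrete, here is a rough Python sketch of how a single edit event might be labeled. This is not the plugin's code. The EditEvent fields, the thresholds, and the label names are all assumptions for illustration, and the typing-speed cutoff is a guess rather than a measured number.

```python
from dataclasses import dataclass

# Assumed thresholds, not measured ones.
MAX_HUMAN_CHARS_PER_SEC = 15.0  # anything sustained above this is treated as bulk
BULK_MIN_CHARS = 80             # ignore tiny insertions such as auto-closed brackets


@dataclass
class EditEvent:
    path: str
    chars_inserted: int
    seconds_since_prev_edit: float
    source: str = "unknown"  # "paste" or "ai_completion" when the IDE reports it


def classify_edit(event):
    """Best-effort label for one insertion: a hint, not proof of authorship."""
    if event.source in ("paste", "ai_completion"):
        return event.source  # trust explicit signals from the IDE first
    if event.chars_inserted < BULK_MIN_CHARS:
        return "typed"
    elapsed = max(event.seconds_since_prev_edit, 0.001)  # avoid dividing by zero
    rate = event.chars_inserted / elapsed
    return "bulk" if rate > MAX_HUMAN_CHARS_PER_SEC else "typed"
```

The ordering is the design choice worth noticing: explicit signals win, and the speed heuristic only fills in when the IDE says nothing.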

The point is not to solve philosophy.

The point is to notice when the workflow has quietly become:

accept output, skim it badly, and emotionally hope for the best

That is a real workflow now. A lot of people are doing it. Some of them are doing it successfully, which honestly makes it even more dangerous.

The part I actually care about

The feature I like most is the code-flow view.

If the plugin decides a change was probably AI-generated, heavily pasted, or otherwise suspiciously vibey, it should open a bounded left-to-right execution path through the code.

Up to five panes.

The leftmost pane starts at the main entry point. Then each pane to the right shows the next relevant file in the flow. Not the whole dependency graph. Not one of those giant architecture diagrams that looks like a serial killer’s wall. Just the important path, bounded on purpose, so a normal human can actually follow it.
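The bounded walk itself is the easy part. A minimal Python sketch, assuming something else has already produced a ranked "next relevant files" map for each file (that ranking is the hard part and is not shown); the function name and the map shape are hypothetical:

```python
MAX_PANES = 5  # the bound described above


def bounded_flow(entry, next_files, max_panes=MAX_PANES):
    """Follow the top-ranked unvisited successor from the entry point.

    next_files maps a file to its successors, best candidate first.
    Returns at most max_panes files, left to right.
    """
    path = [entry]
    seen = {entry}
    while len(path) < max_panes:
        candidates = [f for f in next_files.get(path[-1], []) if f not in seen]
        if not candidates:
            break  # dead end: stop early rather than pad the view
        path.append(candidates[0])
        seen.add(candidates[0])
    return path
```

For example, with a map where main.py leads to routes.py, then service.py, repo.py, db.py, and pool.py, the walk stops at db.py and never shows the sixth file. Cycles are cut by the seen set, so a file that calls back into main.py does not loop the view.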

That boundedness matters.

A lot of developer tools confuse “more complete” with “more useful.” I do not want a giant spiderweb of boxes and arrows that makes me want to close my laptop and go lie on the floor. I want something that says:

hey idiot, it starts here, then goes here, then here, then here. maybe read these before you pretend you understand the feature

That is the product.

And if the project uses containers, the leftmost pane should be split horizontally. Top half: the main code entry point. Bottom half: the Dockerfile or Compose file. Because a surprising amount of confusion in modern software is not just “what does this code do?” It is “what the hell is the runtime context?” The app starts in one place, the environment is defined somewhere else, and both are easy to ignore when the code arrived in your editor in a burst of machine confidence 30 seconds ago.

So the plugin is not just about surfacing generated code. It is about surfacing execution and context together, in a way that is fast enough that I might actually use it instead of admiring it once and forgetting it exists.

The real workflow I want

What I want the plugin to encourage is an older and healthier loop:

open -> read -> edit

Not:

accept -> vibe -> move on

So lessvibes should care about things like:

  • which generated files were never opened
  • which generated files were opened but never manually edited
  • which inserted blocks later got revised by hand
  • whether the important files in a generated path ever saw real follow-up edits
  • whether I spent time reading and navigating, or just waved the code through customs

To me, those are more meaningful metrics than some bullshit chart about lines added.
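As a sketch of how cheap those metrics are to compute once the signals exist, here is a small Python function over three sets of file paths. The function name, the argument names, and the output keys are made up for illustration; the plugin may track this differently.

```python
def review_report(generated, opened, hand_edited):
    """Split generated files by how much human attention they got."""
    generated = set(generated)
    hand_edited = set(hand_edited) & generated
    # A file that was edited by hand counts as opened, even if the open was missed.
    opened = (set(opened) & generated) | hand_edited
    return {
        "never_opened": sorted(generated - opened),
        "opened_not_edited": sorted(opened - hand_edited),
        "revised_by_hand": sorted(hand_edited),
    }
```

Files outside the generated set are dropped on purpose, so opening unrelated files does not make a session look more hands-on than it was.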

I do not care how many lines “I wrote” if a glorified autocomplete snowblower dumped them into my repo and I never looked back. The meaningful question is whether I actually went into the code after it appeared.

That is what matters.

Not purity. Not virtue. Not pretending I am above the tools. Just evidence that my brain stayed in the loop.

What this is not

This is not an anti-AI project.

It is also not a surveillance tool, and if it turns into one the whole thing should be thrown in the trash.

It should not block AI. It should not shame people for using AI. It should not send sensitive code off-machine by default. It should not pretend it always knows exactly what happened. And it definitely should not become one of those dead-eyed enterprise dashboards where some manager decides Steve was only 63% hands-on this sprint and therefore needs a meeting.

Absolutely not.

If this idea works at all, it only works if users trust it.

So the defaults should be:

  • local-first
  • clear about what signals are being tracked
  • no hidden scoring
  • opt-in nudges
  • exportable or resettable session history

Basically: helpful, not narc shit.

Why I think this matters

AI coding tools are very good at helping code appear.

They are much worse at helping you metabolize what just happened.

That is the gap.

Right now the ecosystem is mostly optimized around speed, convenience, and the pleasant little dopamine hit of watching the diff get bigger. But “the diff got bigger” and “I understand the system better” are very much not the same thing. In fact, they may now be moving in opposite directions.

And I do not think the answer is fake purity where everyone pretends to reject the tools and return to some mythological golden age of hand-crafted software. The tools are here. They are useful. I am going to use them. Most people are.

What I want is some counter-pressure.

Something that makes it a little harder to drift into a state where I am technically present for the coding session but functionally just acting as a biological approval button.

Because that is the real failure mode here.

Not evil AI. Not the end of programming. Just a slow humiliating slide into becoming the guy who watches the machine cook and occasionally says, “yeah that looks right.”

I would rather avoid that outcome if possible.

So yeah

That is lessvibes.

A plugin for making sure I do not become a fucking moron any faster than time already requires.

If AI is going to stay in the editor, then the editor should do more than help me accept code. It should help me understand what just landed, where it flows, what I have actually looked at, and whether I touched the important parts with my own brain before I ship some haunted Kotlin side project into the world.

That seems like a worthwhile improvement.

Building VoiceAnki, Part II

Real Decks, Bad Formatting, and the Small Matter of Talking to Your Phone

Last time I wrote about VoiceAnki as the project that started as “what if Anki had a mouth and some manners” and then kept escalating.

This post is the sequel where the app met real decks, real speech errors, and the ancient software engineering tradition of discovering that your clean architecture was, in fact, a suggestion.

The short version:

  • the speech loop got less gullible
  • the grader got more structural
  • the logs stopped being decorative
  • I built a local robot to do smoke tests because my own voice was starting to file HR complaints
  • and we are now close enough to the edge of deterministic grading that the next layer is visible, but still carefully fenced off

This is not an “AI solves education” post.

It is a post about building a voice-first Android study app that has to survive:

  • imported decks with formatting from the cursed earth
  • speech recognition that is usually helpful and occasionally drunk
  • grading policy that has to be fast, fair, and local
  • users who absolutely do not care that the regex looked elegant in your notebook

Demo Decks Lie

There is a phase every voice app gets to enjoy where the demo looks great.

You ask a clean question. You answer with a clean sentence. The recognizer hands you a clean transcript. The evaluator gives you a clean pass. Everyone nods like this was a serious plan all along.

Then you point the app at real material.

That is when you meet answers like:

  • 1. foo2. bar
  • Successful: ... Unsuccessful: ...
  • Pros: ... Cons: ...
  • 1877-78
  • Gen. Milyutin
  • one huge paragraph that starts with the useful bit and then wanders into side quests

Imported decks are not malicious. They are just old, messy, human, and full of local conventions. In other words: exactly the kind of input software tends to hate.

The first big lesson of this branch was that the grader needed to stop pretending every card was basically the same problem. A short person-name fact is not the same thing as a date range. A date range is not the same thing as a compact list. A compact list is not the same thing as a long explanatory answer that a human will naturally summarize instead of reciting bullet-by-bullet like a haunted audiobook.

That sounds obvious now. It was less obvious when the system was still getting away with a lot of fuzzy matching and a relatively small pile of hand-reviewed examples.

Card Shape Beats Raw String Length

The biggest architectural shift in this branch is simple to say and annoyingly non-trivial to implement:

grade by answer shape, not just by answer text

That means the evaluator now spends more effort upfront figuring out what sort of thing it is looking at:

  • short factual answer
  • person name
  • short numeric answer
  • definition
  • compact list
  • explanatory multi-point answer
  • command-like or control-like utterance

Once you have that, the rest of the pipeline gets saner. You stop asking one grading rule to play twelve different sports at once.
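To give a feel for "classify once," here is a deliberately crude Python sketch that sorts a stored answer into a few of those shapes. It is not the VoiceAnki classifier. The enum, the regex, and every threshold are assumptions for illustration, and it only covers four shapes because person names, definitions, and command-like utterances need more than surface patterns.

```python
import re
from enum import Enum


class Shape(Enum):
    SHORT_FACT = "short factual answer"
    SHORT_NUMERIC = "short numeric answer"
    COMPACT_LIST = "compact list"
    EXPLANATORY = "explanatory multi-point answer"


# Digits plus common numeric punctuation, e.g. "42", "1,000", "1877-78".
NUMERIC = re.compile(r"^\d[\d\s.,/-]*$")


def classify_shape(answer):
    """Hypothetical surface heuristics; thresholds are guesses."""
    text = answer.strip()
    if NUMERIC.match(text):
        return Shape.SHORT_NUMERIC
    items = [p.strip() for p in re.split(r"[,;]", text) if p.strip()]
    if len(items) >= 3 and all(len(i.split()) <= 3 for i in items):
        return Shape.COMPACT_LIST  # several short comma-separated items
    if len(text.split()) > 25 or text.count(". ") >= 2:
        return Shape.EXPLANATORY  # long, or clearly multi-sentence
    return Shape.SHORT_FACT
```

The useful property is that the result can pick a grading rule up front, so each rule only has to handle one kind of answer.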

We are still keeping the main grading path deterministic and fast. That is not nostalgia; it is product design. If a spoken flashcard app feels like it pauses to hold a committee meeting before deciding whether 1877 to 1878 means 1877-78, the illusion is gone.
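The 1877 example is a good candidate for a cheap deterministic check. A minimal Python sketch, assuming the transcript already contains digits (spelled-out years like "eighteen seventy-seven" are out of scope here); the function name and the accepted separators are my own choices, not the app's:

```python
import re

# "1877-78", "1877 to 1878", "1877 through 1878"; \u2013 is an en dash.
YEAR_RANGE = re.compile(
    r"^\s*(\d{4})\s*(?:-|\u2013|to|through|until)\s*(\d{4}|\d{2})\s*$",
    re.IGNORECASE,
)


def normalize_year_range(text):
    """Return (start_year, end_year), or None if text is not a year range."""
    m = YEAR_RANGE.match(text)
    if not m:
        return None
    start_year = int(m.group(1))
    end = m.group(2)
    if len(end) == 2:
        # Abbreviated end year borrows the century from the start year.
        end_year = (start_year // 100) * 100 + int(end)
        if end_year < start_year:
            end_year += 100  # e.g. "1899-01" crosses into the next century
    else:
        end_year = int(end)
    if end_year < start_year:
        return None  # a backwards range is not treated as a match
    return (start_year, end_year)
```

With that, the stored answer and the spoken answer can both be reduced to the same pair of integers, and the comparison is a single equality check with no fuzzy matching involved.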

The user experience needs to feel immediate.

That means the hot path still has to be cheap:

  • classify once
  • prepare candidate structure once
  • compare against compact evidence
  • decide

If later we add something smarter for borderline cases, it has to sit behind that path, not inside it.
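
One way to picture that hot path, as a rough Python sketch rather than the actual Kotlin: prepare each stored answer once, cache the result, and keep the per-transcript work down to a cheap set comparison. The names (`PreparedCard`, `prepare`, `grade`) and both thresholds are assumptions made up for this example.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class PreparedCard:
    shape: str
    tokens: frozenset


@lru_cache(maxsize=1024)
def prepare(answer: str) -> PreparedCard:
    """Classify and tokenize a stored answer once; repeat calls hit the cache."""
    tokens = frozenset(answer.lower().split())
    shape = "short" if len(tokens) <= 3 else "long"
    return PreparedCard(shape=shape, tokens=tokens)


def grade(answer: str, transcript: str) -> bool:
    """Compare a transcript against the prepared card and decide."""
    card = prepare(answer)
    spoken = set(transcript.lower().split())
    overlap = len(card.tokens & spoken) / max(len(card.tokens), 1)
    # Hypothetical thresholds: short answers must match fully, long ones mostly.
    required = 1.0 if card.shape == "short" else 0.6
    return overlap >= required
```

The point of the shape is that anything slower has an obvious place to go: after `grade` returns a borderline result, never inside it.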

Structure Beats Vibes

One of the most useful additions here is a new structured-answer parser. I am not going to dump the entire evaluator recipe into a public post, because some of that is still moving and some of it is the kind of thing you learn by burning hours in log review. But the broad move is worth talking about.

Instead of treating every stored answer as one opaque blob, VoiceAnki now tries to recognize when the answer is actually a structure:

  • a compact list
  • a numbered list
  • a labeled list
  • a longer explanatory list

That sounds modest. It is not modest. It changes the whole feel of grading.

Here is a trimmed version of the parser entry point:

fun parse(answerText: String): StructuredAnswerParse {
    val decoded = decodeAnswerText(answerText)

    // Numbered lists ("1. foo 2. bar") are tried first.
    val numberedItems = extractNumberedItems(decoded)
    if (numberedItems.size >= 2) {
        val items = numberedItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // Labeled lists ("Pros: ... Cons: ...") are the second structure to try.
    val labeledItems = extractLabeledItems(decoded)
    if (labeledItems.size >= 2) {
        val items = labeledItems.map(::buildItem)
        return StructuredAnswerParse(
            kind = classifyKind(items),
            items = items,
        )
    }

    // Fewer than two items either way: treat the answer as unstructured.
    return StructuredAnswerParse()
}

That is not magic. It is just the system finally admitting that:

  • formatting matters
  • import damage matters
  • labels matter
  • and if the stored answer is really a list, we should stop grading it like a paragraph that fell down the stairs
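
The labeled case is the less obvious of the two, so here is a hedged Python sketch of what a labeled-item extractor could look like. The real `extractLabeledItems` is not shown in this post; the regex and the `extract_labeled_items` name below are assumptions for illustration only.

```python
import re

# A label is a capitalized word followed by a colon, e.g. "Pros:".
LABEL = re.compile(r"\b([A-Z][A-Za-z]*):\s*")


def extract_labeled_items(text: str) -> list[tuple[str, str]]:
    """Split "Pros: a Cons: b" into [("Pros", "a"), ("Cons", "b")]."""
    matches = list(LABEL.finditer(text))
    items = []
    for i, match in enumerate(matches):
        # Each item's body runs until the next label, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        items.append((match.group(1), text[match.end():end].strip()))
    return items
```

Once the answer is a list of (label, body) pairs, a spoken response can be checked against each body separately instead of against one long string.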

Another small but satisfying detail is handling glued list markers. This is the kind of bug that sounds fake until you meet it in the wild:

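val source = answerText
    .replace('\n', ' ')
    // Insert a space where a letter runs straight into a list marker ("foo2." becomes "foo 2.").
    .replace(Regex("(?<=[a-zA-Z])(?=[1-9][.)-])"), " ")
    // Collapse any runs of whitespace left over from the steps above.
    .replace(Regex("\\s+"), " ")
    .trim()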

The lookaround regex in the middle exists because decks really do contain things like foo2. bar, and if you do not split that boundary correctly, you end up evaluating nonsense against nonsense and calling it rigor.
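
For a worked example, the same normalization can be reproduced in Python and followed by a simple item split. The `split_numbered` helper and its `ITEM` pattern are hypothetical additions for this illustration; only the glued-marker regex mirrors the Kotlin excerpt.

```python
import re

# Same lookaround as the Kotlin excerpt: a letter directly followed by "2." / "2)" / "2-".
GLUED_MARKER = re.compile(r"(?<=[a-zA-Z])(?=[1-9][.)-])")
# A list marker at the start of the text or after whitespace.
ITEM = re.compile(r"(?:^|\s)[1-9][.)-]\s*")


def split_numbered(answer_text: str) -> list[str]:
    """Normalize glued markers, then split the text into list items."""
    source = answer_text.replace("\n", " ")
    source = GLUED_MARKER.sub(" ", source)
    source = re.sub(r"\s+", " ", source).strip()
    return [part.strip() for part in ITEM.split(source) if part.strip()]
```

With the marker fix in place, `split_numbered("1. foo2. bar")` yields two items, foo and bar. Without it, the second marker never follows whitespace, so the whole string would come back as a single item.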

The public version of the lesson is:

real grading quality is often won or lost before you ever compare a transcript to anything

If candidate preparation is bad, downstream scoring does not matter much. You are just being wrong with more confidence.

Speech Software Is Mostly About Timing

There is another lie voice products tell when they are young: that speech recognition quality is the main problem.

It is a problem. It is not the only problem. A lot of the actual work is timing, turn-taking, partials, retries, and deciding when not to believe the recognizer’s last word on what just happened.

This branch did a bunch of work in the speech loop itself:

  • carrying multiple alternatives deeper into grading
  • preserving useful partials
  • separating answer listening from control language
  • treating very short answers differently from long ones
  • quietly retrying some short numeric misses instead of immediately punting to the UI

One of the safer excerpts here is the fallback path for partials:

private fun partialFallbackResult(
    error: Int,
    speechStarted: Boolean,
    strongPartialPhrases: List<String>,
    partialPhrases: List<String>,
): RecognitionResult.Transcript? {
    // No speech was detected, so there is nothing to rescue.
    if (!speechStarted) {
        return null
    }

    val fallbackPhrases = mergePhrases(
        primary = strongPartialPhrases,
        secondary = partialPhrases,
    )

    if (fallbackPhrases.isEmpty()) {
        return null
    }

    // Only ERROR_NO_MATCH is rescued; every other error still yields no fallback.
    return when (error) {
        SpeechRecognizer.ERROR_NO_MATCH -> RecognitionResult.Transcript(fallbackPhrases)
        else -> null
    }
}

This is one of those changes that sounds small until you look at user experience.

If the user said something real, the recognizer heard enough to produce useful partials, and the final result still collapsed into ERROR_NO_MATCH, the product should not act like the person never spoke. That is the kind of behavior that makes users think the app is being smug on purpose.
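
The excerpt leans on a `mergePhrases` helper that is not shown. A plausible shape for it, sketched in Python and entirely an assumption about its behavior, is an order-preserving merge that keeps the stronger partials first and drops blanks and duplicates:

```python
def merge_phrases(primary: list[str], secondary: list[str]) -> list[str]:
    """Merge two phrase lists, primary first, dropping blanks and duplicates."""
    seen = set()
    merged = []
    for phrase in [*primary, *secondary]:
        # Compare case-insensitively and ignore whitespace differences.
        key = " ".join(phrase.lower().split())
        if key and key not in seen:
            seen.add(key)
            merged.append(phrase.strip())
    return merged
```

Keeping the primary list first matters because later stages may treat the first alternative as the most trusted one.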

Arithmetic cards were especially good at exposing this. If the app cannot survive one-word answers like five, it does not matter how clever your long-answer scoring is. Nobody is impressed. They are just annoyed.

So a lot of recent work has been about making the short-answer path feel less brittle without turning the whole system into a thicket of deck-specific hacks.
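
As one small illustration of short-answer resilience, a recognizer may return five where the card stores 5. The Python sketch below maps number words from zero to ninety-nine onto integers so both sides can be compared as numbers. It is a hypothetical helper for this post, not the app's actual normalization.

```python
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def spoken_number(text: str):
    """Return an int for digit strings and number words up to ninety-nine, else None."""
    words = text.lower().replace("-", " ").split()
    if len(words) == 1:
        word = words[0]
        if word.isdecimal():
            return int(word)
        if word in UNITS:
            return UNITS.index(word)
        return TENS.get(word)
    # Compounds such as "twenty one" or "twenty-one".
    if len(words) == 2 and words[0] in TENS and words[1] in UNITS[1:10]:
        return TENS[words[0]] + UNITS.index(words[1])
    return None


def same_number(transcript: str, answer: str) -> bool:
    """True when both sides parse to the same integer."""
    spoken, expected = spoken_number(transcript), spoken_number(answer)
    return spoken is not None and spoken == expected
```

Note that a near miss like fife still fails here, which is the desired behavior for arithmetic: the rescue belongs in the retry path, not in looser matching.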

Fast Matters More Than Fancy

One thing I want to be explicit about: there is a lot of temptation in this space to keep throwing more intelligence at grading until it feels “smart.”

That is not automatically a win.

For VoiceAnki, grading speed is part of the product. The user just spoke. The app needs to respond like it was listening, not like it has submitted a ticket.

That constraint shapes the whole design:

  • keep the deterministic path local
  • keep candidate preparation reusable
  • keep transcript-time scoring bounded
  • do not add a visible “thinking…” pause to the normal loop

There is secret sauce in the exact rubric and decision policy, and I am not going to dump that out here line-by-line. But the public-facing principle is straightforward:

the fast path has to stay boring

If the user notices grading latency, they stop trusting the rhythm of the interaction.

And voice UX is rhythm.

Logs Graduated From Debug Tool to Product Infrastructure

I used to think of logs as something you improve once the interesting engineering is done.

That was cute.

On a speech app, logs are part of the interesting engineering.

A bad miss can come from:

  • speech recognition
  • transcript selection
  • answer-shape classification
  • lexical comparison
  • summary-vs-list policy
  • command/control routing
  • deck formatting

That means “it got this wrong” is not one bug category. It is a small crime scene.

So this branch put a lot more effort into making the logs answer questions like:

  • what transcript did we actually choose?
  • what kind of answer did we think this card wanted?
  • what decision path fired?
  • what evidence made the evaluator accept or reject?

That turns review from:

  • “huh, weird”

into:

  • “the answer was parsed as a structured list, but the wrong branch still ran”
  • “the recognizer had a good partial and then dropped the final”
  • “the card was really a summary-shaped answer, but the evaluator treated it like a raw string match”

That is a much more productive kind of pain.
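
A decision record that answers those questions does not need to be elaborate. Here is a minimal Python sketch of one log line per grading decision; the field names are assumptions chosen to mirror the questions above, not the app's real log schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GradeDecisionLog:
    chosen_transcript: str   # what transcript did we actually choose?
    answer_shape: str        # what kind of answer did we think the card wanted?
    decision_path: str       # what decision path fired?
    accepted: bool
    evidence: dict = field(default_factory=dict)  # what made us accept or reject?

    def to_line(self) -> str:
        """Serialize to one JSON line so runs can be searched and compared later."""
        return json.dumps(asdict(self), sort_keys=True)
```

One structured line per decision is enough to turn a vague miss into a specific branch to inspect.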

We Built a Tiny Robot Because Manual Smoke Testing Is a Scam

One of my favorite additions around this branch is a local Pipecat smoke-test agent.

This is not some grand autonomous tutoring system. It is a very specific little goblin.

Its job is:

  1. listen to VoiceAnki through the laptop mic
  2. wait for the phone to stop talking
  3. answer through the laptop speakers
  4. keep doing that long enough to flush out session-loop bugs

That sounds silly. It is also incredibly useful.

The helper has a VoiceAnki-specific prompt, local audio transport, transcript logging, and a blunt little repeat-limit rule so it does not get stuck asking for the question forever:

repeat_limit_rule = f"""
Temporary smoke-test rule:

- Track how many times you have said exactly "can you repeat the question" for the current card.
- If you have already asked {max_repeat_requests} times for the same card, do not ask again.
- Instead, say exactly: I don't know
- Use that forced failure to let VoiceAnki mark the card wrong and move to the next question.
""".strip()

That rule exists because, left to their own devices, voice systems will absolutely form little conversational sinkholes and sit there repeating themselves like two Roombas politely arguing in a closet.
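
The rule above lives in the prompt, which means it depends on the model counting correctly. If that ever proves flaky, the same policy could be enforced in plain code instead. The Python sketch below is a hypothetical alternative, not part of the current helper:

```python
from collections import defaultdict


class RepeatLimiter:
    """Allow a fixed number of repeat requests per card, then force a failure."""

    def __init__(self, max_repeat_requests: int):
        self.max_repeat_requests = max_repeat_requests
        self.counts = defaultdict(int)

    def next_utterance(self, card_id: str) -> str:
        if self.counts[card_id] >= self.max_repeat_requests:
            # Forced failure so the app can mark the card wrong and move on.
            return "I don't know"
        self.counts[card_id] += 1
        return "can you repeat the question"
```

Counting per card rather than per session keeps one stubborn card from using up the allowance for every card after it.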

I also finally wrote proper smoke-run capture scripts so the whole thing can run unattended and leave behind artifacts we can review later:

ANDROID_CAPTURE_PID="$(spawn_detached "$ROOT_DIR" "$ANDROID_LOG" \
  "$ADB_BIN" -s "$ADB_SERIAL" logcat -v time \
  VoiceAnkiSpeech:D VoiceAnkiEval:D VoiceAnkiSemantic:D AndroidRuntime:E '*:S')"

PIPECAT_CAPTURE_PID="$(spawn_detached "$PIPECAT_DIR" "$PIPECAT_LOG" \
  "$PIPECAT_PYTHON" agent.py --input-device "$PIPECAT_INPUT_DEVICE" \
  --output-device "$PIPECAT_OUTPUT_DEVICE")"

That gives each run:

  • filtered Android logs
  • Pipecat logs
  • run metadata
  • timestamped folders for later review

It turns out this matters a lot, because manual voice testing is expensive in a very dumb way. You can lose an hour just being the person who says Roosevelt into a phone over and over while watching adb logcat scroll by like the Matrix, except less profitable.

Once a little robot can do even part of that for you, bugs start showing up in clusters instead of as rumors.
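
The per-run artifact layout is simple enough to sketch. This Python version creates a timestamped folder and writes a metadata file next to where the logs would go. The file names and metadata fields are assumptions for illustration; the real capture scripts are shell.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def create_run_dir(base: Path, adb_serial: str, now: datetime = None) -> Path:
    """Create a timestamped run folder and record basic run metadata in it."""
    now = now or datetime.now(timezone.utc)
    run_dir = base / now.strftime("%Y%m%d-%H%M%S")
    # Refuse to reuse a folder so two runs never overwrite each other's logs.
    run_dir.mkdir(parents=True, exist_ok=False)
    metadata = {
        "started_at": now.isoformat(),
        "adb_serial": adb_serial,
        "android_log": "android.log",
        "pipecat_log": "pipecat.log",
    }
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return run_dir
```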

The Branch Is About More Than Just One Deck

A lot of the pressure for these changes came from history decks, because history decks are very good at producing:

  • long answers
  • compressed spoken summaries
  • date ranges
  • names with ASR drift
  • multi-point answer blobs

But the goal is not “optimize for history.”

That would be a trap.

The real target is broader:

  • explanatory cards where users summarize instead of reciting
  • imported decks with broken structure
  • voice-native equivalence for dates and names
  • command/control phrases coexisting with answer content
  • better handling for cards where exact string equality is just the wrong abstraction

If the implementation only works because the source material happens to be one subject area, that is not a system. That is a souvenir.
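
Date ranges are a good example of an equivalence that is not tied to any one subject. A minimal Python sketch, assuming the goal is only to make 1877-78 and 1877 to 1878 compare equal (the `expand_year_range` name and the century-rollover rule are illustrative choices, not the app's actual normalizer):

```python
import re

# A four-digit year, a range separator, then a two- or four-digit end year.
YEAR_RANGE = re.compile(r"\b(\d{4})\s*(?:-|to|through)\s*(\d{4}|\d{2})\b")


def expand_year_range(text: str) -> str:
    """Rewrite "1877-78" and "1877 to 1878" to the same form, "1877-1878"."""

    def expand(match: re.Match) -> str:
        start, end = match.group(1), match.group(2)
        if len(end) == 2:
            end_year = int(start[:2] + end)
            # "1899-02" most likely means 1899 to 1902, so roll the century over.
            if end_year < int(start):
                end_year += 100
        else:
            end_year = int(end)
        return f"{start}-{end_year}"

    return YEAR_RANGE.sub(expand, text.lower())
```

Both the stored answer and the transcript go through the same function, so the comparison happens on the expanded form and never on the raw strings.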

What Landed, and What Is Still Moving

A fair amount of the branch is already real:

  • more answer-shape-aware evaluation
  • stronger short-answer handling
  • better transcript preservation
  • richer evaluator logs
  • local Pipecat smoke testing
  • unattended log capture for long runs

There is also important work underway, some of it not committed yet:

  • broader under-acceptance reduction for explanatory multi-point cards
  • cleaner parsing of ugly imported answer text
  • more voice-native normalization for dates and names
  • more explicit decision-source logging
  • more regression tests built from real reviewed misses, not just happy-path examples

That uncommitted work matters because this branch has been one of those very honest engineering branches where the review notes, the smoke-test notes, and the code all inform each other in tight loops.

Or, put less politely: the app keeps finding new ways to be wrong, and I keep taking notes.

That is good. It means the system is meeting reality.

Deterministic Grading Is Better Now, but It Is Not the Final Boss

This is the part where I want to be careful not to oversell the current system.

The deterministic grader is better than it was:

  • more structural
  • less naive
  • more debuggable
  • less likely to reject obviously good answers for ridiculous reasons

That is real progress.

But there is also a limit to how far you want to push deterministic grading before the whole thing turns into an overfitted museum of exceptions and folklore.

That does not mean the deterministic work was wasted.

It means it was the right layer to improve first:

  • command routing
  • control handling
  • structured parsing
  • short-answer resilience
  • person-name behavior
  • list-vs-summary handling
  • observability

Those are foundational. A later model-backed layer should inherit them, not bulldoze them.

That is why the on-device inference work I have been sketching is intentionally narrow and conservative. The likely next step is not “let a model grade everything.” It is closer to:

  • keep the cheap path cheap
  • keep the main loop immediate
  • use on-device adjudication only for a narrow band of borderline long-answer cases
  • keep abstention first-class
  • make it optional and Android-native

In other words: add one careful new tool, not a second religion.
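
Sketched in Python, that routing could look something like the following. The band edges, the `Verdict` enum, and the `adjudicate` callback are all assumptions for illustration; nothing here is a committed design.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    ABSTAIN = "abstain"


def decide(
    score: float,
    is_long_answer: bool,
    adjudicate: Optional[Callable[[], Optional[bool]]] = None,
    accept_at: float = 0.85,
    reject_below: float = 0.55,
) -> Verdict:
    """Keep clear cases on the cheap path; adjudicate only borderline long answers."""
    if score >= accept_at:
        return Verdict.ACCEPT
    if score < reject_below:
        return Verdict.REJECT
    # Borderline band. The adjudicator is optional and only used for long answers.
    if not is_long_answer or adjudicate is None:
        return Verdict.ABSTAIN
    opinion = adjudicate()
    if opinion is None:
        return Verdict.ABSTAIN  # The adjudicator is allowed to decline.
    return Verdict.ACCEPT if opinion else Verdict.REJECT
```

The adjudicator is never called for clear accepts or clear rejects, so the common case pays nothing for its existence.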

The Main Lesson So Far

The main lesson from this phase of VoiceAnki is that speech products punish fake abstraction almost immediately.

If your system is too generic, it feels unfair. If it is too clever, it becomes slow. If it is too rigid, users hate it. If it is too permissive, grading stops meaning anything.

The job is to keep finding the narrow path where the app feels:

  • fast
  • fair
  • understandable
  • and boring in the best possible way

Not “maximally AI.” Not “academically pure.” Not “one more heroic regex.”

Just a study loop that feels natural enough that the user forgets how much machinery is underneath it.

And if, along the way, we end up with a better parser, a less gullible speech loop, a tiny local smoke-test goblin, and a cautious roadmap for on-device adjudication, that seems like a pretty decent trade.

VoiceAnki

Building VoiceAnki: A Voice-First Study App That Kept Growing

What This Project Is

VoiceAnki started as a pretty simple idea: what if flashcard review felt more like a conversation and less like tapping through tiny buttons?

The core goal was to make studying possible in a more hands-free, audio-first way. Instead of treating voice as a gimmick layered on top of a normal flashcard app, the project pushed toward something more opinionated:

  • speak the prompt
  • listen for the answer
  • evaluate the response
  • keep the review loop moving without constant screen interaction

Over time, that turned into a much larger app than the original idea suggested. What exists now is not just a voice button on a flashcard screen. It is a full Android app with a session runtime, deck import pipeline, history, settings, AnkiWeb integration, and an increasingly serious answer-evaluation system.

This post is a look back at the work that went into it, what changed along the way, and what turned out to be harder than expected.

The Starting Point

At the beginning, the product shape was intentionally narrow:

  • Android only
  • local deck storage
  • spoken prompts
  • spoken answers
  • deterministic grading
  • lightweight study history

That focus mattered. It kept the project from immediately collapsing into a vague “AI tutor” idea. The first real work was not around machine learning at all. It was around building a dependable study loop:

  • a card queue
  • review scheduling
  • a reducer-driven session state machine
  • text-to-speech
  • Android speech recognition
  • foreground session behavior so the app could survive longer interactions

That part of the app is still the backbone of everything else. Even the newer AI and semantic work only makes sense because there is already a deterministic study engine underneath it.

Turning It Into a Real App

Once the core loop existed, the app started growing in the more familiar directions any real product eventually has to grow.

The project gained:

  • a home screen that lists decks
  • deck detail views
  • a settings screen for answer mode, speech rate, listening window, and grading behavior
  • session history
  • a persistent Room-backed database
  • DataStore-backed settings

That was the moment it stopped feeling like a prototype and started feeling like an app with real internal structure.

One theme that kept coming up was that nearly every “simple” feature touched more systems than expected. A new setting was never just a toggle. It usually had to travel through:

  • settings storage
  • view models
  • UI state
  • runtime configuration
  • sometimes the session reducer itself

That kind of wiring is not glamorous, but it is what makes later experimentation possible without the whole app turning into spaghetti.

Importing Decks Instead of Pretending

One of the biggest shifts in the project was deciding that the app should not live forever on a demo deck.

That meant building a real import path.

There are two different import stories in the app now:

  1. importing from files
  2. importing from AnkiWeb

The file import work led to a full import pipeline:

  • parse a deck file
  • turn it into an internal draft
  • preview the import
  • commit it into the local database

That draft step turned out to be especially useful. It created a clean boundary between “we successfully fetched or parsed something” and “we are ready to persist it as a real deck.” That became important later when the app started pulling content from the web rather than only from local files.
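
The boundary is easier to see in code. This is a minimal Python sketch of the parse, preview, commit split, assuming a tab-separated input format and using a plain dict in place of the database. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftCard:
    front: str
    back: str


@dataclass(frozen=True)
class ImportDraft:
    deck_name: str
    cards: tuple


def parse_deck(deck_name: str, text: str) -> ImportDraft:
    """Parse tab-separated front/back lines; nothing is persisted yet."""
    cards = []
    for line in text.splitlines():
        if "\t" not in line:
            continue  # Skip blank or malformed lines instead of failing the import.
        front, back = line.split("\t", 1)
        cards.append(DraftCard(front.strip(), back.strip()))
    return ImportDraft(deck_name, tuple(cards))


def preview(draft: ImportDraft, limit: int = 3) -> str:
    """Summarize the draft so the user can confirm before committing."""
    sample = "; ".join(card.front for card in draft.cards[:limit])
    return f"{draft.deck_name}: {len(draft.cards)} cards ({sample})"


def commit(draft: ImportDraft, database: dict) -> int:
    """Persist the draft; a dict stands in for the real database here."""
    database[draft.deck_name] = list(draft.cards)
    return len(draft.cards)
```

Because the draft is immutable and source-agnostic, a file parser and a web download can both produce one and share everything from preview onward.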

The .apkg path was also a turning point. Anki package import sounds straightforward until you actually have to do it on-device:

  • unzip the package
  • extract and read the SQLite content
  • resolve media references
  • map notes, cards, models, and templates into something your own app understands

That is the kind of work that is easy to underestimate from a distance. It is not especially flashy, but it is exactly the sort of feature that makes an app useful in the real world.
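
For a sense of the first two steps, here is a hedged Python sketch that unzips a package and reads note fields out of the SQLite file. It assumes the legacy package layout, where the collection is stored as collection.anki2 and each note's fields sit in the `flds` column of the `notes` table, separated by the 0x1f character. Newer export formats store the collection differently, and media and template handling are left out entirely.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path

FIELD_SEPARATOR = "\x1f"  # Anki joins a note's fields with the unit separator.


def read_apkg_notes(apkg_path: str) -> list[list[str]]:
    """Return each note's fields from a legacy-format .apkg file."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(apkg_path) as package:
            package.extract("collection.anki2", tmp)
        connection = sqlite3.connect(str(Path(tmp) / "collection.anki2"))
        try:
            rows = connection.execute("SELECT flds FROM notes").fetchall()
        finally:
            connection.close()
    return [row[0].split(FIELD_SEPARATOR) for row in rows]
```

Even this much only gets you raw fields. Turning them into a front and a back still depends on the note model and card templates, which is where most of the mapping work lives.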

AnkiWeb: From Scraping to a Better Product Decision

AnkiWeb support was one of the most iterative parts of the project.

The first instinct was what many apps would try first: scrape the shared-deck pages and build a native search/detail flow on top of that. That approach looked promising at first, but it ran straight into the reality of the modern web:

  • JavaScript-heavy pages
  • Cloudflare-style challenge behavior
  • markup that is not stable enough to treat as a public API

The project went through several rounds of trying to make that scraper path more resilient, including:

  • improving network setup and headers
  • hardening HTML parsing
  • using a WebView to render pages instead of assuming static HTML

That work was valuable, but it also taught an important product lesson: sometimes the best engineering move is to change the shape of the feature.

The eventual direction became much better:

  • use a visible in-app browser activity for AnkiWeb
  • let the user browse the real site
  • intercept .apkg downloads in-app
  • store the download privately
  • create an import draft
  • jump straight into the existing preview/import flow

That was a much more honest solution. It stopped fighting the site and started using the app’s own strengths: import, preview, and local persistence.

Making Voice Feel Like the Main Interface

The heart of the app is still the study session runtime.

A lot of the work here was not about adding more UI, but about making the voice loop feel coherent:

  • when prompts are spoken
  • when the app starts listening
  • how long the listening window should last
  • when partial recognition should be trusted
  • when to stop early on a strong answer
  • when to reveal the answer
  • how self-grading and automatic grading fit together

On Android, speech is never just “call the speech API and you’re done.” There are always edge cases:

  • permissions
  • recognizer flavor differences
  • partial results versus final results
  • cancellation timing
  • audio focus
  • device quirks

A lot of this project became an exercise in being honest about those constraints and designing around them instead of pretending they do not exist.

That honesty also showed up in the app’s session state model. The runtime is not a pile of callbacks. It is built around explicit states and events, which makes it much easier to reason about what the app thinks is happening at any given moment.

That structure paid off again and again as more features got layered in.
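
A reducer in this sense is just a pure function from the current state and an event to the next state. The Python sketch below shows the pattern with a handful of made-up phases and events; the real session model has more of both, and these names are assumptions.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Phase(Enum):
    SPEAKING_PROMPT = auto()
    LISTENING = auto()
    GRADING = auto()
    REVEALING = auto()


class Event(Enum):
    PROMPT_FINISHED = auto()
    TRANSCRIPT_READY = auto()
    LISTEN_TIMEOUT = auto()
    GRADE_DONE = auto()
    NEXT_CARD = auto()


@dataclass(frozen=True)
class SessionState:
    phase: Phase = Phase.SPEAKING_PROMPT
    card_index: int = 0


TRANSITIONS = {
    (Phase.SPEAKING_PROMPT, Event.PROMPT_FINISHED): Phase.LISTENING,
    (Phase.LISTENING, Event.TRANSCRIPT_READY): Phase.GRADING,
    (Phase.LISTENING, Event.LISTEN_TIMEOUT): Phase.REVEALING,
    (Phase.GRADING, Event.GRADE_DONE): Phase.REVEALING,
    (Phase.REVEALING, Event.NEXT_CARD): Phase.SPEAKING_PROMPT,
}


def reduce_session(state: SessionState, event: Event) -> SessionState:
    """Pure function: the same state and event always give the same result."""
    next_phase = TRANSITIONS.get((state.phase, event))
    if next_phase is None:
        return state  # Events that do not apply to this phase are ignored.
    if event is Event.NEXT_CARD:
        return SessionState(phase=next_phase, card_index=state.card_index + 1)
    return replace(state, phase=next_phase)
```

A late transcript that arrives while the prompt is still being spoken simply falls through as a no-op, which is the kind of timing bug a pile of callbacks tends to turn into a crash.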

Answer Evaluation: From Exact Matching to Something Smarter

The earliest evaluator was mostly deterministic:

  • normalize text
  • compare against accepted answers
  • allow fuzzy matching where appropriate

That still works well for many cards. In fact, it is still the right answer for:

  • arithmetic
  • spelling
  • short identifiers
  • cases where a near miss should absolutely not pass
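
The three steps of that earliest evaluator can be sketched in a few lines of Python. This version uses the standard library's `difflib` for the fuzzy part, which is a stand-in rather than whatever the app actually uses, and the threshold is an arbitrary example value. Leaving the threshold unset gives the strict behavior that arithmetic and spelling cards want.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def matches(transcript: str, accepted: list[str], fuzzy_threshold: float = None) -> bool:
    """Exact match after normalization, with optional fuzzy matching."""
    spoken = normalize(transcript)
    for answer in accepted:
        target = normalize(answer)
        if spoken == target:
            return True
        if fuzzy_threshold is not None:
            if SequenceMatcher(None, spoken, target).ratio() >= fuzzy_threshold:
                return True
    return False
```

Passing a threshold such as 0.85 lets a slightly misheard rosevelt through for Roosevelt, while four against five stays a clear miss either way.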

But as soon as the app started touching longer answers and more natural language, the limits became obvious. A strict string-oriented evaluator can be technically consistent while still feeling wrong to a human being.

That led to the semantic grading work.

The first step was not “let AI handle grading.” It was a more conservative plan:

  • keep deterministic matching first
  • add a semantic fallback only when lexical matching is not enough
  • use on-device embeddings rather than a cloud-first model

That design choice mattered. It kept the project grounded. Semantic grading was not supposed to replace the rest of the evaluator. It was supposed to rescue reasonable answers that were being unfairly rejected.

Semantic Grading Turned Out to Be Harder Than the Idea

The semantic work brought some of the most interesting engineering problems in the whole project.

The app now includes:

  • a semantic evaluator
  • an embedding cache
  • a decision policy with accept / unsure / reject bands
  • a bundled sentence-embedding model

But the path there was not smooth.

One of the first real blockers was that the original MediaPipe dependency being used for text embeddings was simply too old. On-device initialization was crashing natively on the target phone. The fix was not a clever code workaround. The real fix was dependency modernization. Once the library was upgraded to a current version, the embedder could initialize successfully.

That was a good reminder that “AI bugs” are often just normal software engineering bugs wearing a more dramatic outfit.

The second challenge was more subtle: just because semantic scoring works does not mean it should be trusted blindly.

This showed up especially clearly on a command-heavy CS50-style deck. Some answers that felt obviously related were accepted. Some answers that felt obviously wrong were also accepted. Other short command answers that a human would probably allow were rejected.

That forced a more nuanced policy:

  • semantic scoring is useful
  • but command-like and syntax-heavy answers need lexical anchors
  • shorthand answers like tail for tail should still be allowed
  • vague phrases like not sure should never pass just because an embedding score looks high

That is exactly the kind of product problem that makes this sort of project interesting. The challenge is not just “can the model produce a number?” The challenge is whether the resulting behavior matches what a real learner would expect.
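
A policy like that is mostly guard rails around a similarity score. Here is a minimal Python sketch, assuming the embedding similarity has already been computed elsewhere and is passed in as a number. The vague-phrase list, the band edges, and the first-token anchor rule are all illustrative assumptions, not the shipped policy.

```python
VAGUE_PHRASES = {"not sure", "i don't know", "no idea"}


def semantic_decision(
    transcript: str,
    answer: str,
    similarity: float,
    command_like: bool,
    accept_at: float = 0.80,
    reject_below: float = 0.50,
) -> str:
    """Return "accept", "unsure", or "reject" for a semantic fallback check."""
    spoken = transcript.lower().strip()
    # Vague phrases never pass, however high the embedding score looks.
    if spoken in VAGUE_PHRASES:
        return "reject"
    if command_like:
        # Lexical anchor: the command name itself has to be spoken.
        answer_tokens = answer.lower().split()
        if answer_tokens and answer_tokens[0] not in spoken.split():
            return "reject"
    if similarity >= accept_at:
        return "accept"
    if similarity < reject_below:
        return "reject"
    return "unsure"
```

The ordering is the design choice: the lexical checks run before the score is consulted, so a high similarity can never override a missing anchor.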

AI Mode and the Difference Between “Plumbing” and “Experience”

Another large branch of work explored a fuller AI mode using Gemini live audio and tool-calling ideas.

This part of the project went through multiple milestones:

  • plumbing mode flags through settings, navigation, and runtime state
  • adding a live client shell
  • integrating bidirectional audio
  • wiring tool calls into the existing reducer-driven session logic
  • adding fallback behavior when live transport fails

This was useful work, but it also created a good internal standard for honesty. It became important to distinguish between:

  • a feature being “wired through the app”
  • a feature being “technically alive”
  • a feature being “good enough to present honestly as a user-facing experience”

A lot of AI product work gets fuzzy on that distinction. This project benefited from repeatedly pulling those apart.

The result is a codebase that now has real AI-related infrastructure and experiments, but still treats deterministic study behavior as the stable center of the app.

That turned out to be the right posture.

A Better Product Through Better Constraints

One of the more surprising themes in the project was that constraints improved the product.

Examples:

  • trying to scrape AnkiWeb forced a rethink that led to a better in-app browser + import handoff
  • a crashing on-device semantic path forced a proper dependency upgrade instead of magical thinking
  • overly broad semantic grading on command decks forced a more human grading policy
  • navigation crashes around import preview forced a more correct SavedStateHandle setup

None of those were “fun” problems in the moment, but they each moved the project toward something sturdier and more coherent.

The app is better because it had to survive those collisions with reality.

What Exists Now

At this point, the project includes a meaningful amount of real functionality:

  • voice-first study sessions
  • spoken prompts and spoken answers
  • persistent review scheduling
  • settings and history
  • deck import from local files
  • .apkg import support
  • AnkiWeb browsing and direct import handoff
  • bundled starter decks
  • semantic grading infrastructure
  • on-device text embeddings for semantic evaluation
  • experimental AI/live-session infrastructure

There is also a growing body of product and platform planning around where the app could go next:

  • Gemini-assisted study features
  • stronger semantic grading policies
  • Wear OS companion support
  • car-aware or Android Auto-adjacent ideas

Not all of those are finished products, but they represent something important: the project is no longer just a pile of features. It has a direction.

What I Learned From Building It

The biggest lesson is that “voice-first study app” sounds smaller than it really is.

You are not just building:

  • a UI
  • a speech recognizer
  • a deck importer

You are building the glue between all of them, and the glue is where most of the actual engineering lives.

Another lesson is that good product behavior often comes from restraint, not ambition.

The best parts of this project are not the ones where the app tries to be magical. They are the parts where it:

  • stays deterministic when it should
  • uses ML as support rather than theater
  • preserves clear state boundaries
  • avoids pretending unstable integrations are already polished product experiences

That kind of discipline is not always flashy, but it is what makes a project feel trustworthy.

What Comes Next

The next stage of work is less about piling on new surfaces and more about sharpening the judgment of the app.

The biggest open question is not “can we add more AI?” It is:

how do we make the app accept the right answers, reject the wrong ones, and feel fair to the learner?

That likely means:

  • better semantic policies
  • deck-sensitive grading behavior
  • clearer settings around evaluation style
  • more real-world testing across different kinds of decks

There is still plenty of room to grow, but the project is now at an interesting point: it already does a lot, and the challenge is no longer proving that the idea can exist. The challenge is making it consistently good.

That is a much better problem to have.

ROS2 OSX brew formula

Getting ROS 2 Working on macOS, Then Packaging It for Homebrew

ROS 2 on macOS is one of those things that technically works, but often feels harder than it should. The official source-build path is real, but in practice it can turn into a long chain of dependency issues, middleware decisions, Python problems, Qt mismatches, and package combinations that work on one machine but not another.

I wanted a better answer than “it builds on my laptop.” The goal was to get ROS 2 running reliably on macOS, verify the tools people actually use in the beginner tutorials and early development workflows, and package the result so other developers could install it with Homebrew instead of rebuilding the whole stack from scratch.

That work is now complete. The result is a Homebrew-installable formula called ros2-kilted-core: a tested, curated ROS 2 Kilted environment for macOS.

The problem

ROS 2 is well supported on Linux. On macOS, the story is less polished.

The source-build path exists, but it is easy to end up in a state where the build partially succeeds, some tools launch, others fail, and the final setup is too fragile to recommend to anyone else. A successful compile is not the same thing as a usable development environment.

That was the real problem to solve: not just making ROS 2 build once, but making it practical.

That meant getting the core runtime working, verifying the tools used in the beginner tutorials, making sure the GUI tools actually launched, and confirming that the result could support real development instead of merely surviving a single build command.

What I built

This project started with a source checkout of ROS 2 Kilted on macOS and a curated build of the packages needed for a realistic developer workflow.

That included:

  • building the core ROS 2 runtime on macOS
  • standardizing on Fast DDS as the default supported middleware path
  • verifying demo talker/listener nodes
  • getting turtlesim working
  • validating the beginner CLI tools, including:
    • ros2 node
    • ros2 topic
    • ros2 service
    • ros2 action
    • ros2 param
    • ros2 interface
    • ros2 launch
    • ros2 doctor
    • ros2 bag
  • getting rqt_graph, rqt_console, and rqt_service_caller working on macOS
  • creating a separate tutorial workspace for beginner client-library examples
  • packaging the result into a Homebrew formula

The result is not a theoretical “this should probably work” setup. It is a tested ROS 2 environment for macOS, built from source and packaged for reuse.

The macOS-specific work

A large part of the effort was in solving the smaller platform-specific issues that tend to make ROS 2 on macOS feel unreliable.

That included:

  • choosing a package set broad enough to be useful but small enough to maintain realistically on macOS
  • narrowing the middleware path so runtime behavior stayed predictable
  • handling Python and Qt GUI dependencies cleanly
  • fixing a Qt5/Qt6 header clash affecting turtlesim
  • patching the rqt path so it used a working PyQt setup on macOS
  • dealing with vendor packages that would otherwise try to download sources during the build
  • bundling Python build and runtime tooling in a reproducible way
  • validating the final result outside the original development workspace

In other words, this was less about running one successful build command and more about taking a fragile source build and turning it into a repeatable installation.

What the Homebrew formula installs

The Homebrew formula is called ros2-kilted-core.

It installs a curated ROS 2 macOS build that includes:

  • the core ROS 2 runtime
  • Fast DDS as the supported default RMW path
  • the main ROS 2 CLI tools
  • turtlesim
  • rqt_graph
  • rqt_console
  • rqt_service_caller
  • ros2 bag
  • the rest of the validated beginner and developer toolchain

It is intentionally not a full “everything in ROS 2” desktop distribution. It is a curated macOS-focused build designed to be practical for tutorials and development.

The main benefit is that users do not need to manually clone the source workspace, run vcs import, assemble the Python build environment, or rediscover the same macOS-specific fixes. Homebrew downloads the packaged source bundle and builds from that.

Why I packaged this as a custom Homebrew tap

I packaged this as a custom Homebrew tap rather than submitting it to homebrew/core.

That was the right fit for a few reasons:

  • it is a curated ROS 2 distribution, not a tiny standalone utility
  • it is specifically tuned for macOS
  • it includes a practical set of development and tutorial tools
  • it is easier to maintain and iterate in a dedicated tap than in the main Homebrew formula collection

That means the package is installable through Homebrew, but maintained in its own GitHub repository.

Installation

The Homebrew tap is here:

nigeldaniels/homebrew-ros2-kilted

Install it with:

brew install nigeldaniels/ros2-kilted/ros2-kilted-core

The package uses ros2-kilted-prefixed commands instead of replacing the global ros2 command, which makes it safer to install alongside other ROS environments.

Why this matters

A lot of developers want to experiment with ROS 2 on macOS, work through the tutorials, or do real development without switching to Linux immediately. The source-build path exists, but it is still rough enough that many people give up before they get to the interesting part.

This project makes that path much more approachable.

Instead of “it should work if everything goes right,” the result is now:

  • a working ROS 2 source build on macOS
  • a verified set of beginner and development tools
  • a reusable Homebrew installation path for other developers

That makes ROS 2 on macOS far more practical than it was before.

Final thoughts

ROS 2 on macOS is still not the smoothest platform story in robotics, but it becomes much more usable once the setup is curated, tested, and packaged properly.

That was the point of this work: get ROS 2 working on macOS, make sure the important tooling actually runs, and package it so other developers can install it without repeating the same setup process by hand.

If this saves someone else from spending a weekend chasing build failures, Python issues, middleware confusion, and Qt breakage, then it was worth doing.