NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...7 min read

My Journey to a reliable and enjoyable locally hosted voice assistant

Share
NOW LET US Article – My Journey to a reliable and enjoyable locally hosted voice assistant

A detailed account of switching from cloud-based services like Google Home to a fully local, private, and powerful voice assistant using Home Assistant, llama.cpp, and dedicated hardware.

I have been watching HomeAssistant’s progress with assist for some time. We previously used Google Home via Nest Minis, and have switched to using fully local assist backed by local first + llama.cpp (previously Ollama). In this post I will share the steps I took to get to where I am today, the decisions I made and why they were the best for my use case specifically.

Links to Additional Improvements

Here are links to additional improvements posted about in this thread.

New Features

Fixing Unwanted HA / LLM Behaviors

Optimizing Performance

Hardware Details

I have tested a wide variety of hardware from a 3050 to a 3090, most modern discrete GPUs can be used for local assist effectively, it just depends on your expectations of capability and speed for what hardware is required.

I am running HomeAssistant on my UnRaid NAS, specs are not really important as it has nothing to do with HA Voice.

Voice Hardware:

  • 1 HA Voice Preview Satellite
  • 2 Satellite1 Small Squircle Enclosures
  • 1 Pixel 7a used as a satellite/ hub with View Assist

Voice Server Hardware:

  • Beelink MiniPC with USB4 (the exact model isn’t important as long as it has USB4)
  • USB4 eGPU enclosure

GPUs

The below table shows GPUs that I have tested with this setup. Response time will vary based on the model that is used.

| GPU | Model Class | Response Time (after prompt caching) | Notes | | RTX 3090 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. | | RX 7900XTX 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. | | RTX 5060Ti 16GB | 20B MoE, 9B Dense | 1.5 - 3 seconds | Quick enough to run models that are optimal for this setup with responses < 3 seconds. | | RX 9060XT 16GB | 20B MoE, 9B Dense | 1.5 - 4 seconds | Quick enough to run models that are optimal for this setup with responses < 4 seconds. | | RTX 3050 8GB | 4B Dense | 3 seconds | Good for running small models with basic functionality. |

Models

The below table shows the models I have tested using this setup with various features and their performance.

All models below are good for basic tool calling. Advanced features are listed with the models quality at reliably reproducing the desired behavior.

| Model | Multi device tool calls (1) | Understands context cues (2) | Parses misheard commands (3) | Ignores unexpected text from false positives (4) | | GGML GPT-OSS:20B MXFP4 | | | | | | Unsloth Qwen3.5-35B-A3B MXFP4_MOE | | | | | | Unsloth Qwen3-VL:8B-Instruct Q6_K_XL | | | | | | Unsloth Qwen3-30B-A3B-Instruct Q4_K_XL | | | | | | Unsloth GLM 4.7 Flash (30B) Q4_K_XL | | | | | | Unsloth Qwen3:4b-Instruct 2507 Q6_K_XL | | | |

(1) Handles commands like “Turn on the fan and off the lights”

(2) Understands when it is in a particular area and does not ask “which light?” when there is only one light in the area, but does correctly ask when there are multiple of the device type in the given area.

(3) Is able to parse misheard commands (ex: “turn on the pan”) and reliably execute the intended command

(4) Is able to reliably ignore unwanted input without being negatively affected by misheard text that was an intended command.

Voice Server Software:

Model Runner:

llama.cpp is recommended for optimal performance, see my reply below for details.

Speech to Text (Voice In):

The following are Speech to Text options that I have tested:

| Software | Model | Notes | | Wyoming ONNX ASR | Nvidia Parakeet V2 | Specifically running via the OpenVINO branch which optimizes CPU inference time down to ~ 0.3 seconds | | Rhasspy Faster Whisper | Nvidia Parakeet V2 | Slower due to running directly via ONNX CPU which is slower than OpenVINO |

Text to Speech (Voice Out):

| Software | Notes | | Kokoro TTS | Provides ability to mix and match multiple voices / tones to get desired output. Handles all text well. | | Piper running on CPU (TTS) | Has multiple voices which can be picked from, works for general text but struggles with currency, phone numbers, and addresses. |

Home Assistant LLM Integrations

  • LLM Conversation Provides improvements to the base conversation to improve default experience talking with Assist
  • LLM Intents to provide additional tools for Assist (Web Search, Place Search, Weather Forecast)

The Journey

My point in posting this is not to suggest that what I have done is “the right way” or even something others should replicate. But I learned a lot throughout this process and I figured it would be worth sharing so others could get a better idea of what to expect, pitfalls, etc.

The Problem

Throughout the last year or two we have noticed that Google Assistant through these Nest Minis has gotten progressively dumber / worse while also not bringing any new features. This is generally fine as the WAF was still much higher than not having voice, but it became increasingly annoying as we were met with more and more “Sorry, I can’t help with that” or “I don’t know the answer to that, but according to XYZ source here is the answer”. It generally worked, but not reliably and was often a fuss to get answers to arbitrary questions.

Then there is the usual privacy concern of having online microphones throughout your home, and the annoyance that every time AWS or something else went down you couldn’t use voice to control lights in the house.

Starting Out

I started by playing with one of Ollama’s included models. Every few weeks I would connect Ollama to HA, spin up assist and try to use it. Every time I was disappointed and surprised by its lack of abilities and most of the time basic tool calls would not work. I do believe HA has made things better, but I think the biggest issue was my understanding.

Ollama models that you see on Ollama are not even close to exhaustive in terms of the models that can be run. And worse yet, the default :4b

models for example are often low quantization (Q4_K) which can cause a lot of problems. Once I learned about the ability to use HuggingFace to find GGUF models with higher quantizations, assist was immediately performing much better with no problems with tool calling.

Testing with Voice

After getting to the point where the fundamental basics were possible, I ordered a Voice Preview Edition to use for testing so I could get a better idea of the end-to-end experience. It took me some time to get things working well, originally I had WiFi reception issues where the ping was very inconsistent on the VPE (despite being next to the router) and this led to the speech output being stuttery and having a lot of mid-word pauses. After adjusting piper to use streaming and creating a new dedicated IoT network, the performance has been much better.

Making Assist Useful

Controlling device is great, and Ollama’s ability to adjust devices when the local processing missed a command was helpful. But to replace our speakers, Assist had to be capable of the following things:

  • Ability to give Day and Week Weather Forecasts
  • Ability to ask about a specific business to get opening / closing times
  • Ability to do general knowledge lookup to answer arbitrary questions
  • Ability to play music with search abilities entirely with voice

At first I was under the impression these would have to be built out separately, but I eventually found the brilliant llm-intents integration which provides a number of these services to Assist (and by extension, Ollama). Once setting these up, the results were mediocre.

The Importance of Your LLM Prompt

For those that want to see it, here is my prompt.

This is when I learned that the prompt will make or break your voice experience. The default HA prompt won’t get you very far, as LLMs need a lot of guidance to know what to do and when.

I generally improved my prompt by taking my current prompt and putting it into ChatGPT along with a description of the

© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.