Show HN: Classify mechanical faults using Contrastive Language-Audio Pretraining

cardiag is an end-to-end audio-ML pipeline that cleans noisy mechanical recordings and uses a frozen CLAP model with linear heads to triage car faults.
The pipeline scrapes fault-sound clips from YouTube/TikTok, cleans the audio (isolating the mechanical sound from speech, music, and noise), embeds it with a frozen CLAP model, and trains small linear heads to triage the fault. It is exposed as a CLI and a live web app.
This is a proof of concept, and honest about what that means. Diagnosing a car fault from a phone recording is genuinely hard, so cardiag is built as a calibrated triage aid rather than a diagnoser: it tells you whether something sounds wrong, roughly where in the car it is, and a ranked shortlist of likely parts. When the audio won't support a call, it says "uncertain" instead of bluffing.
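The "uncertain instead of bluffing" behavior can be pictured as an abstention band around a calibrated probability. The sketch below is illustrative only: the function name `triage_verdict` and the thresholds 0.35 and 0.65 are assumptions, not values taken from cardiag.

```python
def triage_verdict(p_fault: float, low: float = 0.35, high: float = 0.65) -> str:
    """Map a calibrated fault probability to a verdict.

    The band (low, high) is a hypothetical abstention zone: probabilities
    inside it are reported as UNCERTAIN rather than forced into a call.
    """
    if not 0.0 <= p_fault <= 1.0:
        raise ValueError("p_fault must be a probability in [0, 1]")
    if low >= high:
        raise ValueError("low must be strictly below high")
    if p_fault >= high:
        return "FAULT"
    if p_fault <= low:
        return "NORMAL"
    return "UNCERTAIN"


if __name__ == "__main__":
    for p in (0.05, 0.50, 0.92):
        print(f"p_fault={p:.2f} -> {triage_verdict(p)}")
```

A band like this only makes sense if the probabilities are calibrated, which is why the abstention and the calibration claim go together.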
The real contribution is the cleaning + honest-training recipe, which is reusable on other audio datasets. The modest accuracy here reflects how hard the problem is from crude phone audio (we hit the literature ceiling); the same method reaches 0.93 AUROC on clean engine audio. See docs/DEFENSE.md.
Two pages visualize the first two stages of the pipeline:
- Isolating the engine audio: an interactive look at the clean() cascade pulling a short mechanical span out of noisy YouTube audio (speech, music, road noise).
- CLAP, visualized: how the frozen CLAP model turns those spans into the 512-d embedding the linear heads classify.
Measured out-of-sample, leakage-safe (by-video grouped CV over 1,031 video groups; permutation p = 0.0005). These are honest numbers, not a leaderboard.
| Capability | Result | vs. chance |
|---|---|---|
| Is something wrong? (fault/normal) | AUROC 0.79 [0.76, 0.83] | 0.50 |
| Where in the car? (6 zones) | right zone in top-3 ≈ 75% | 2× |
| Which part? (12+ families) | right part in top-3 ≈ 45–65% | 3–4× |
| Knows when it doesn't know | calibrated (ECE ≈ 0.04), returns UNCERTAIN | n/a |
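For readers unfamiliar with the two metrics in the table, here is a minimal pure-Python sketch of AUROC (as a pairwise ranking statistic) and of a binned expected calibration error. It is not cardiag's evaluation code. The ECE shown is the common binary variant that compares the mean predicted probability with the observed positive rate in each of 10 equal-width bins, and the bin count is an assumption.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed positive rate per bin."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - frac_pos)
    return ece


if __name__ == "__main__":
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
    print(expected_calibration_error([0.9, 0.9], [0, 0]))
```

An AUROC of 0.50 is what random scores achieve, which is why the table lists 0.50 as the chance baseline for the fault/normal row.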
Full details, and the one head we demoted for failing out-of-sample (knock), are in docs/MODEL_CARD.md.
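"Leakage-safe" here means that clips cut from the same video never appear on both sides of a train/test split. The sketch below shows one way to build such grouped folds, plus a permutation p-value with the usual +1 correction. Both helpers (`grouped_folds`, `permutation_p_value`) are hypothetical illustrations in plain Python, not the project's code, and the fold count is an assumption.

```python
import random


def grouped_folds(group_ids, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) so that each group lands in exactly one test fold."""
    unique = sorted(set(group_ids))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Round-robin assignment keeps fold sizes (in groups) nearly equal.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(group_ids) if fold_of[g] == k]
        train = [i for i, g in enumerate(group_ids) if fold_of[g] != k]
        yield train, test


def permutation_p_value(observed, null_scores):
    """One-sided p-value (b + 1) / (n + 1), where b counts null scores >= observed."""
    b = sum(1 for s in null_scores if s >= observed)
    return (b + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    video_ids = ["v1", "v1", "v2", "v3", "v3", "v4", "v5"]  # one entry per clip
    for train, test in grouped_folds(video_ids, n_splits=3):
        print("train:", train, "test:", test)
```

With this correction the smallest reachable p-value is 1 / (n + 1). For instance, 1,999 label shuffles with none reaching the observed score would give 1/2000 = 0.0005. The source does not say how its own p-value was computed.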
A fresh clone is immediately usable. A small pre-trained model ships in models/, and a synthetic demo clip is bundled, so nothing needs to be downloaded or scraped.
git clone <this-repo> && cd car-diagnosis
uv venv && source .venv/bin/activate
uv pip install -e ".[scrape,web,dev,viz]" # Python 3.11
cardiag doctor # preflight: what's installed
cardiag train --fixtures # a working model offline in ~2s (no scrape, no 2 GB download)
cardiag diagnose <clip.wav> # verdict + where-in-the-car + ranked parts
cardiag serve --model models # live web app: drop a clip / paste a link, "explain why"
Verify the whole thing end-to-end in an isolated worktree: bash scripts/clone_verify.sh.
audio ──► clean() cascade ──► CLAP embedding ──► linear heads ──► Diagnosis
          (isolate spans)     (frozen, 512-d)    (fault/region/   (calibrated,
                                                  part/knock)      UNCERTAIN-aware)
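The last two stages of the diagram, a frozen embedding feeding a small linear head that returns a ranked shortlist, can be sketched in a few lines. This is a toy illustration under stated assumptions: the weights, the 2-d embedding (the real one is 512-d), and the zone labels are all made up, and the helper names are hypothetical.

```python
import math


def linear_head(embedding, weights, biases):
    """Compute one logit per class: logit_c = w_c . x + b_c."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(labels, probs, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical zone labels and toy parameters, for illustration only.
    zones = ["engine", "wheels", "exhaust"]
    weights = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    biases = [0.0, 0.0, 0.0]
    embedding = [1.0, 0.0]  # stands in for a frozen CLAP embedding
    probs = softmax(linear_head(embedding, weights, biases))
    for label, p in top_k(zones, probs, k=2):
        print(f"{label}: {p:.2f}")
```

Because the embedding model is frozen, only the small weight matrix and bias vector of each head are learned.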
There is one segmentation path. Scraped clips, your own recordings (cardiag ingest, any length), and uploads at inference all flow through the same clean() cascade that isolates short mechanical spans. Spans over ~10 s are split into windows so CLAP never silently truncates them. Training and serving share one embedding contract, so there is no train/serve skew.
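A minimal sketch of the windowing step described above, assuming a 10-second cap and non-overlapping windows by default. The function name `split_span`, the `hop` parameter, and the handling of the short trailing window are assumptions. The source only states that spans over ~10 s are split.

```python
def split_span(start, end, max_len=10.0, hop=None):
    """Split the span [start, end) in seconds into windows no longer than max_len.

    hop is the step between window starts; it defaults to max_len (no overlap).
    The final window is simply shorter when the span does not divide evenly.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    hop = max_len if hop is None else hop
    if hop <= 0 or hop > max_len:
        raise ValueError("hop must be in (0, max_len] so windows leave no gaps")
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + max_len, end)))
        if t + max_len >= end:
            break  # this window already reaches the end of the span
        t += hop
    return windows


if __name__ == "__main__":
    print(split_span(0.0, 25.0))          # three windows, the last one 5 s long
    print(split_span(0.0, 20.0, hop=5.0)) # overlapping windows
```

Applying the same function at training and inference time is one simple way to keep both sides on the same embedding contract.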
cardiag diagnose clip.wav # full model: verdict + region + ranked parts
cardiag triage clip.wav # calibrated engine-vs-running-gear
cardiag clean clip.wav # isolate the mechanical sound (no model needed)
cardiag inspect clip.wav -o r.html # SEE/HEAR the pipeline: spans, spectrograms, scores
cardiag ingest ./my_audio --kind fault --cause wheel_bearing # bring your own audio
cardiag scrape youtube|tiktok # build a corpus (Reddit is deprecated — too noisy)
cardiag train # train on your corpus
Add --json to any inference command for machine-readable output.
- docs/DEFENSE.md — the honest case that a deliberately crude method earns a real triage result.
- docs/MODEL_CARD.md — per-head metrics, intended use, limitations.
- docs/architecture.md — pipeline diagrams.
- docs/scraping-guide.md — start-to-finish corpus building.
Valid for social-style / targeted-upload audio (YouTube, TikTok, or a phone clip a user records deliberately). It is not a safety-critical or standalone diagnostic. It is a triage assistant that narrows where to look and is honest about its uncertainty. Model files are joblib artifacts: load only ones you trust.
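The trust warning matters because joblib artifacts are pickle-based, so loading one can execute arbitrary code. One common precaution is to compare a file's SHA-256 digest against a value obtained from a source you trust before loading it. The sketch below uses only the standard library. The helper names are hypothetical, and cardiag is not known to publish digests or to perform this check.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return the path if its digest matches expected_sha256, else raise ValueError."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"digest mismatch for {path}: got {actual}")
    return Path(path)


# Usage sketch (requires joblib; the file name and digest are placeholders):
#   import joblib
#   model = joblib.load(verify_artifact("models/some_head.joblib", "<trusted digest>"))
```

A matching digest only shows the file is the one the publisher intended. It does not make an untrusted publisher safe.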
License: see LICENSE.
Source: Hacker News












