Show HN: Classify mechanical faults using Contrastive Language-Audio Pretraining

cardiag is an end-to-end audio-ML pipeline that cleans noisy mechanical recordings and uses a frozen CLAP model with linear heads to triage car faults.

cardiag

is an end-to-end audio-ML pipeline. It scrapes fault-sound clips from YouTube/TikTok, cleans the audio (isolating the mechanical sound from speech, music, and noise), embeds it with a frozen CLAP model, and trains small linear heads to triage the fault. It is exposed as a CLI and a live web app.

cardiag-demo.mp4

This is a proof of concept, and honest about what that means. Diagnosing a car fault from a phone recording is genuinely hard, so cardiag

is built as a calibrated triage aid rather than a diagnoser: it tells you whether something sounds wrong, roughly where in the car it is, and a ranked shortlist of likely parts. When the audio won't support a call, it says "uncertain" instead of bluffing.

The real contribution is the cleaning + honest-training recipe, which is reusable on other audio datasets. The modest accuracy here reflects how hard the problem is from crude phone audio (we hit the literature ceiling); the

samemethod reaches 0.93 AUROC on clean engine audio. See docs/DEFENSE.md.

Two pages visualize the first two stages of the pipeline:

Isolating the engine audio — an interactive look at the clean()

cascade pulling a short mechanical span out of noisy YouTube audio (speech, music, road noise). - CLAP, visualized — how the frozen CLAP model turns those spans into the 512-d embedding the linear heads classify.

Measured out-of-sample, leakage-safe (by-video grouped CV over 1,031 video groups; permutation p = 0.0005). These are honest numbers, not a leaderboard.

| Capability | Result | vs. chance | |---|---|---| | Is something wrong? (fault/normal) | AUROC 0.79 [0.76, 0.83] | 0.50 | | Where in the car? (6 zones) | right zone in top-3 ≈ 75% | 2× | | Which part? (12+ families) | right part in top-3 ≈ 45–65% | 3–4× | | Knows when it doesn't know | calibrated (ECE ≈ 0.04), returns UNCERTAIN | — |

Full details, and the one head we demoted for failing out-of-sample (knock), are in docs/MODEL_CARD.md.

A fresh clone is immediately usable. A small pre-trained model ships in models/ , and a synthetic demo clip is bundled, so nothing needs to be downloaded or scraped.

git clone <this-repo> && cd car-diagnosis
uv venv && source .venv/bin/activate
uv pip install -e ".[scrape,web,dev,viz]" # Python 3.11
cardiag doctor # preflight: what's installed
cardiag train --fixtures # a working model offline in ~2s (no scrape, no 2 GB download)
cardiag diagnose <clip.wav> # verdict + where-in-the-car + ranked parts
cardiag serve --model models # live web app: drop a clip / paste a link, "explain why"

Verify the whole thing end-to-end in an isolated worktree: bash scripts/clone_verify.sh .

audio ──► clean() cascade ──► CLAP embedding ──► linear heads ──► Diagnosis
(isolate spans) (frozen, 512-d) (fault/region/ (calibrated,
part/knock) UNCERTAIN-aware)

There is one segmentation path. Scraped clips, your own recordings (cardiag ingest , any length), and uploads at inference all flow through the same clean()

cascade that isolates short mechanical spans. Spans over ~10 s are split into windows so CLAP never silently truncates them. Training and serving share one embedding contract, so there is no train/serve skew.

cardiag diagnose clip.wav # full model: verdict + region + ranked parts
cardiag triage clip.wav # calibrated engine-vs-running-gear
cardiag clean clip.wav # isolate the mechanical sound (no model needed)
cardiag inspect clip.wav -o r.html # SEE/HEAR the pipeline: spans, spectrograms, scores
cardiag ingest ./my_audio --kind fault --cause wheel_bearing # bring your own audio
cardiag scrape youtube|tiktok # build a corpus (Reddit is deprecated — too noisy)
cardiag train # train on your corpus

Add --json

to any inference command for machine-readable output.

docs/DEFENSE.md — the honest case that a deliberately crude method earns a real triage result.
docs/MODEL_CARD.md — per-head metrics, intended use, limitations.
docs/architecture.md — pipeline diagrams.
docs/scraping-guide.md — start-to-finish corpus building.

Valid for social-style / targeted-upload audio (YouTube, TikTok, or a phone clip a user records deliberately). It is not a safety-critical or standalone diagnostic. It is a triage assistant that narrows where to look and is honest about its uncertainty. Model files are joblib artifacts: load only ones you trust.

License: see LICENSE.

Source: Hacker News