Shipping huggingface_hub every week with AI, open tools, and a human in the loop

Hugging Face took the huggingface_hub release cycle from every 4 to 6 weeks to a weekly cadence using GitHub Actions and open-weights AI models. By combining LLMs with deterministic guardrails, they streamlined release notes generation while keeping a human in the loop for final approval.

`huggingface_hub` is the Python client at the base of the Hugging Face ecosystem. `transformers`, `datasets`, `diffusers`, `sentence-transformers` and dozens of other libraries depend on it to talk to the Hub. Every week we don't ship a new release is a week of fixes and features stuck on `main`.

For a long time we released every 4 to 6 weeks. We now release every week from a single GitHub Actions workflow. We built it using open-source tools and open-weights models and kept a human in the loop at the one place where judgment matters. Nothing in this post requires a vendor contract, a closed model, or infrastructure you can't run yourself. That was a design goal from the start since we wanted a workflow other maintainers could pick up and adapt.

By the end of this post, you'll have everything you need to build your own.

The old process was partly automated, mostly manual.

Already in CI:

- Publishing to PyPI once a tag was pushed.
- Opening test branches in downstream libraries with the release candidate pinned.

Still manual, every single time:

- Creating the release branch, bumping the version in `__init__.py`, committing, tagging, pushing.
- Watching the downstream CI runs and triaging failures.
- Reading through every PR merged since the last release and writing release notes by hand: grouped by theme, with context, in a voice that didn't read like a git log dump.
- Cutting the stable release after the RC period.
- Drafting an internal Slack announcement and social posts.
- Opening the post-release PR to bump `main` to the next `dev0`.

Writing good notes for a new version was the heavy part: it meant aggregating tens of PRs on different topics. Nothing technically hard, but it took a few hours of focused attention. Add the announcements on top and a minor release was easily a half-day of work spread over several days.

So we decided to streamline the whole thing. Looking at that list, the work splits in two.

Some steps are purely mechanical and can be automated: bumping the version, committing, tagging, pushing, opening downstream test branches, opening the post-release PR. Nobody needs to think about those. They just have to happen in the right order, every time, which is what a CI workflow is good at.

The rest is different. Writing release notes, deciding what to highlight, phrasing an announcement for a human audience: that's brain work. It's the kind of judgment that kept the release manual for years. This is where AI comes in, turning a blank page into a solid first draft in seconds. It's also where we have to be careful because a draft that looks confident and is subtly wrong is worse than no draft at all.

When we decided to fix this, we set one constraint up front: every moving part had to be something any maintainer could run themselves. No closed model behind an API we couldn't swap, no proprietary release platform, no secret sauce.

Here's the entire stack:

| Part | What it does |
|---|---|
| GitHub Actions | Orchestrates the whole release |
| OpenCode | Agent runtime that drives the model |
| An open-weights model (currently GLM-5.2 from Z.ai) | Drafts the release notes and Slack announcement |
| HF Inference Providers | Serves the model |
| PyPI Trusted Publishing | Publishes the package |

The second principle: the model drafts, a human decides. Language models are good at turning thirty terse PR titles into readable release notes. They are not good at being trusted blindly. So the workflow is human-supervised: the model does the first pass, a deterministic script checks its work, and a human reviews and edits before anything ships (more on that below).

The full workflow is a single file, `.github/workflows/release.yml`, triggered by hand from the Actions UI. It takes exactly one input:

    on:
      workflow_dispatch:
        inputs:
          release_type:
            type: choice
            options:
              - minor-prerelease  # cut an RC from main
              - minor-release     # promote the RC to final
              - patch-release     # bugfix on an existing release branch

From there, the jobs run roughly in this order:

- **Prepare.** Compute the next version, create or reuse the release branch, bump `__version__`, commit, tag, push.
- **Publish to PyPI.** Build and upload `huggingface_hub`. In parallel, build and upload the `hf` CLI as its own PyPI package.
- **Release notes.** Diff the commit range since the last tag, pull PR metadata from the GitHub API, and have the model draft a structured changelog (here's a recent one). Saved as a draft GitHub release.
- **Downstream test branches.** For RCs, open a branch in `transformers`, `datasets`, `diffusers`, `sentence-transformers` with the RC pinned, so their CI tells us fast if we broke something.
- **Slack announcement.** Read the notes and produce an internal announcement in our team voice.
- **Archive notes.** Upload both the raw AI draft and the human-edited version to a Hugging Face Bucket, side by side.
- **Post-release bump.** After a stable release, open a PR on `main` bumping to the next `dev0`.
- **Comment on shipped PRs.** Leave a "this shipped in vX.Y.Z" comment on every PR in the release.
- **Sync CLI docs.** Open a PR to our skills repo with the regenerated `hf` CLI skill docs.
- **Report to Slack.** Every step posts its status as a thread reply; a final job updates the root message with ✅ or ❌.
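As an illustration of the "compute the next version" part of the Prepare step, here is a minimal sketch. It is not the workflow's actual code: the tag format (`v1.5.0`, `v1.5.0rc0`), the function names, and the rule that a patch release always bumps the latest stable tag are assumptions made for the example. It also assumes at least one stable tag already exists.

```python
import re

# Assumed tag format: optional "v", then X.Y.Z, optionally followed by rcN.
VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")


def parse(tag):
    """Split a tag into (major, minor, patch, rc); rc is None for stable tags."""
    m = VERSION_RE.match(tag)
    if not m:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, patch, rc = m.groups()
    return int(major), int(minor), int(patch), (int(rc) if rc is not None else None)


def next_version(existing_tags, release_type):
    """Hypothetical helper: pick the version string for the requested release type."""
    versions = [parse(t) for t in existing_tags]
    # Latest stable release (raises ValueError if there is none).
    major, minor, patch = max(v[:3] for v in versions if v[3] is None)

    if release_type == "patch-release":
        return f"{major}.{minor}.{patch + 1}"

    # Both minor flows target the minor version after the latest stable one.
    target = (major, minor + 1, 0)
    rcs = [v[3] for v in versions if v[:3] == target and v[3] is not None]

    if release_type == "minor-prerelease":
        next_rc = max(rcs) + 1 if rcs else 0
        return f"{major}.{minor + 1}.0rc{next_rc}"
    if release_type == "minor-release":
        if not rcs:
            raise ValueError("no release candidate to promote")
        return f"{major}.{minor + 1}.0"
    raise ValueError(f"unknown release type: {release_type!r}")
```

With tags `v1.4.1` and `v1.5.0rc0`, a `minor-prerelease` run would yield `1.5.0rc1` and a `minor-release` run would yield `1.5.0`. Deriving the version from existing tags rather than from a typed-in string keeps the single `release_type` input sufficient.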

The remaining manual steps are reviewing and publishing the draft release notes, and reviewing and posting an internal Slack message. Those two steps are where we want a human in the loop.

Here's the failure mode everyone worries about with AI-generated release notes: the model quietly drops a PR or invents one that isn't in this release. A changelog that's almost right is worse than no changelog because nobody re-checks it.

We don't trust the generated release notes to be complete on the first try, so we verify them deterministically. Before the model runs, a Python script retrieves all PRs that belong to the release and stores them as ground truth.

    import re

    # Deterministic: extract PR numbers from squash-merge commits in the range.
    # A squash-merge title ends with the PR reference, e.g. "... (#1234)".
    PR_NUMBER_PATTERN = re.compile(r"\(#(\d+)\)$")
    pr_numbers = [
        int(m.group(1))
        for commit in commits_since_last_tag  # commits between the last tag and HEAD
        if (m := PR_NUMBER_PATTERN.search(commit.title))
    ]
    save_manifest(pr_numbers)  # the source of truth
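To make the matching rule concrete, here is a small self-contained sketch that applies the same pattern to plain commit subject lines. The helper name, the sample subjects, and the JSON serialization of the manifest are assumptions for illustration; the post does not show how the manifest is stored.

```python
import json
import re

PR_NUMBER_PATTERN = re.compile(r"\(#(\d+)\)$")


def pr_numbers_from_subjects(subjects):
    """Return the PR number found at the end of each commit subject, in order."""
    numbers = []
    for subject in subjects:
        match = PR_NUMBER_PATTERN.search(subject.strip())
        if match:
            numbers.append(int(match.group(1)))
    return numbers


# Made-up subject lines, for illustration only.
SAMPLE_SUBJECTS = [
    "Fix retry on 429 responses (#1234)",       # matches: trailing (#1234)
    'Revert "Add cache flag (#1200)" (#1240)',  # only the trailing (#1240) counts
    "Mention #1199 in the contributing guide",  # no trailing (#N): skipped
    "Bump version to the next dev release",     # no PR reference: skipped
]

manifest = pr_numbers_from_subjects(SAMPLE_SUBJECTS)  # [1234, 1240]
manifest_json = json.dumps(manifest)                  # '[1234, 1240]'
```

Because the pattern is anchored at the end of the title, a commit pushed directly to the branch without a trailing `(#N)` never enters the manifest. That is worth keeping in mind if a repository does not enforce squash merges.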

Then the model drafts the notes from that list of PRs. Once it is done, we check its output against the initial list:

    expected = set(load_manifest())    # what should be there
    found = extract_pr_refs(notes_md)  # what the model wrote (#1234 -> 1234)
    missing = expected - found         # silently dropped
    extra = found - expected           # belongs to a different release
If anything is missing or extra, we don't fail and we don't ship a wrong file. We hand the discrepancy back to the agent and ask it to fix exactly those PRs:

    for _ in range(MAX_ITERATIONS):
        missing, extra = validate(notes)
        if not missing and not extra:
            break  # matches the manifest exactly
        run_agent_fix(missing_prs=missing, extra_prs=extra)

This is the pattern that makes the whole thing trustworthy: a non-deterministic model wrapped in deterministic guardrails. The model is great at writing prose and unreliable at being exhaustive. So we let it write and let code enforce the consistency.
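A runnable sketch of that guardrail pattern follows, with a stand-in function in place of the real agent call. Everything here is an assumption for illustration: the function names, the value of `MAX_ITERATIONS`, and in particular the choice to raise an error when the loop runs out of attempts, since the post does not say what the workflow does in that case.

```python
import re

MAX_ITERATIONS = 3  # assumed value


def validate(notes_md, expected):
    """Return (missing, extra) PR numbers relative to the manifest."""
    found = {int(n) for n in re.findall(r"#(\d+)", notes_md)}
    return expected - found, found - expected


def fake_agent_fix(notes_md, missing, extra):
    """Stand-in for the agent: drop lines citing extra PRs, stub in missing ones."""
    lines = [
        line for line in notes_md.splitlines()
        if not any(f"#{n}" in line for n in extra)
    ]
    lines += [f"- TODO describe (#{n})" for n in sorted(missing)]
    return "\n".join(lines)


def draft_until_consistent(notes_md, expected, fix=fake_agent_fix):
    """Re-run the fixer until the notes match the manifest, within a fixed budget."""
    for _ in range(MAX_ITERATIONS):
        missing, extra = validate(notes_md, expected)
        if not missing and not extra:
            return notes_md  # matches the manifest exactly
        notes_md = fix(notes_md, missing, extra)
    # Check the result of the last fix attempt before giving up.
    missing, extra = validate(notes_md, expected)
    if missing or extra:
        # Assumption: surface the mismatch instead of saving an inconsistent draft.
        raise RuntimeError(f"still inconsistent: missing={missing}, extra={extra}")
    return notes_md


result = draft_until_consistent(
    "- Fix retry (#1234)\n- Unrelated change (#1250)", {1234, 1240}
)
```

The bounded loop matters as much as the check itself: the model gets a few targeted attempts, and the code decides when the draft is consistent.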

Completeness is one half. Accuracy is the other. A model summarizing a PR from its title alone will cheerfully invent a code example that doesn't match the real API.

To prevent that, when we fetch PR metadata we also pull the actual documentation diffs from each PR: the unified diff of any `.md` file under `docs/` that the PR touched.

    def fetch_doc_diffs(pr):
        return [
            {"filename": f.filename, "status": f.status, "patch": f.patch}
            for f in pr.get_files()
            if f.filename.startswith("docs/") and f.filename.endswith(".md") and f.patch
        ]

That diff goes into the model's context so when it writes "here's the new CLI command," it's quoting the example the PR author actually wrote in the docs. That's the same logic as before: give the model real source material and a narrow job.
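How the diffs are placed into the model's context is not shown in the post. One way it could look, as a sketch: render the dictionaries returned by `fetch_doc_diffs` into a text section with a size cap. The function name, the section layout, and the `max_chars` budget are all assumptions.

```python
def format_doc_diffs(pr_number, doc_diffs, max_chars=4000):
    """Render doc diffs for one PR as a prompt section, capped at max_chars of patch text."""
    if not doc_diffs:
        return ""
    parts = [f"Documentation changes in PR #{pr_number}:"]
    budget = max_chars
    for diff in doc_diffs:
        patch = diff["patch"]
        if len(patch) > budget:
            # Keep what fits and say so, rather than silently cutting the diff.
            patch = patch[:budget] + "\n[diff truncated]"
        parts.append(f"--- {diff['filename']} ({diff['status']})\n{patch}")
        budget -= min(len(diff["patch"]), budget)
        if budget <= 0:
            break
    return "\n\n".join(parts)


# Made-up input in the shape returned by fetch_doc_diffs, for illustration only.
section = format_doc_diffs(
    1234,
    [{"filename": "docs/example.md", "status": "modified", "patch": "+New paragraph."}],
)
```

Marking truncation explicitly tells the model that an example may be incomplete, which fits the goal of having it quote real documentation rather than fill gaps on its own.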

The prompts themselves liv

Source: Hugging Face Blog