Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights

Microsoft has open-sourced SkillOpt, a framework that automatically optimizes AI agent skills using deep-learning-style controls. By treating skill documents as trainable objects, it boosts agent performance without altering the underlying model's weights.
Agent skills have become an important part of real-world AI applications, providing a mechanism — a set of instructions saved in a folder of text-based markdown (.md) files, usually — for models to adapt to specific enterprise use cases and complex workflows.
However, optimizing these skills is a slow process and faulty process, as they cannot be trained in the same way as the parameters of the underlying AI model. Instead, users typically must update them manually by retyping the instructions in each file, playing a "guessing game" as to what changes might improve agentic AI performance and reduce errors.
SkillOpt, a new, open source (MIT Licensed) framework developed by Microsoft, does one better: it introduces an optimizer designed for agent skills, turning the agent's skill .md document as a trainable object that evolves based on performance feedback.
It uses deep-learning-style optimization to make it possible for the AI to systematically explore modifications to the document and find the best combination of instructions. Most importantly, it accomplishes this procedural adaptation without making changes to the underlying model's weights.
On various industry benchmarks, SkillOpt outperforms existing baselines, significantly boosting accuracy for models like GPT-5.5 and Qwen. The result is a set of compact, transferable skill artifacts that allow AI agents to adapt to new domains effortlessly.
The challenge of optimizing agent skills
Agent skills package procedural knowledge into natural-language specifications, including domain heuristics, tool-use policies, output constraints, and known failure modes. These skills provide an external interface for agents to adapt to complex enterprise workflows. In practice, agent skills are stored as text documents and inserted into the agent's context before execution.
One of the key benefits of skills is that they customize the behavior of the underlying model without changing its weights. However, the skill document itself needs to be tweaked and optimized to get the best performance out of the agent.
While deep learning relies on strict mathematical controls for stability, human prompt engineering often relies on trial and error. When attempting to automatically update a skill document based on feedback, the lack of mathematical discipline makes text highly volatile.
Yifan Yang, Senior Research SDE at Microsoft Research Asia, told VentureBeat that the problem is not making changes, but ensuring those changes are mathematically sound.
"The breaking point isn't whether a team can change a skill, it's that they can't guarantee the change is an improvement," Yang said. "Three failure modes recur: no step-size control, so skills drift; no validation, so a fix that reads as reasonable gets written in and can quietly regress performance; and no negative memory, so the same failed edit keeps coming back."
To illustrate how easily performance can drop when edits aren't mathematically validated, Yang noted that "an ungated rewrite pushed GPT-5.5 on SpreadsheetBench from 41.8 down to 41.1."
According to Yang, these failure modes are amplified in multi-step workflows "because that's where frontier models are weakest zero-shot. Not on reasoning, but on procedural discipline: format, self-verification, tool policy."
Before SkillOpt, agent skills were primarily hand-crafted, generated in a single shot, or evolved through loosely controlled self-revision pipelines that could not reliably improve under feedback.
Prompt optimization methods like TextGrad and GEPA treat language artifacts as optimizable objects and use trajectory feedback to evolve prompts, but they focus on single-prompt configurations rather than generating persistent, reusable skill artifacts.
Meanwhile, skill evolution and discovery methods like EvoSkill and Trace2Skill convert agent execution experiences into trajectory lessons to refine skill folders, build domain-specific libraries, or perform evolutionary search.
None of them apply deep-learning-style controls, such as learning rates, validation gates, and momentum, which are necessary to continuously train a single, compact skill document.
Importing mathematical discipline to text
SkillOpt optimizes a text document through an iterative propose-and-test loop that separates the model executing the tasks from the model optimizing the skill. The process unfolds in several steps:
SkillOpt starts with an initial skill document and a frozen target model (or harness), where the target model runs a batch of tasks to generate execution trajectories that act as the evidence for the current step.
An offline optimizer model analyzes these trajectories, separating successes from failures into minibatches. Looking at a minibatch helps the model identify systematic procedural errors rather than one-off anomalies. Based on these patterns, the optimizer proposes structural add, delete, or replace edits to the skill document.
The proposed edits are reviewed to filter out duplicates or contradictions, and the optimizer then ranks these candidate edits by their expected utility.
Rather than applying all proposed changes, SkillOpt clips the list to a maximum edit budget for that step, generating a candidate skill.
The candidate skill is evaluated on a held-out validation set using the target model. If the candidate improves the validation score, it is accepted and becomes the new current skill. If it fails, the edits are rejected and sent to a rejected-edit buffer, providing negative feedback so the optimizer knows not to repeat that mistake.
SkillOpt directly addresses the problem of treating text as a trainable object by importing mathematical concepts from deep learning. The creators note that “the deep-learning analogy is operational rather than decorative,” helping the framework avoid the instability issues associated with other optimization techniques.
The edit budget acts as a learning rate. By limiting how many edits can be applied at once, the skill version is prevented from moving too far from its previous state, preserving continuity while allowing new procedures to be acquired.
Just like checking validation loss in deep learning, the strict held-out examples ensure that plausible-sounding text edits are only kept if they mathematically improve the agent's actual performance on the validation split.
At the end of an epoch, SkillOpt performs a slow update by comparing tasks under the previous and current epoch's skills. This acts like a momentum term, carrying durable, long-horizon procedural lessons forward while isolating them from the fast, step-level edits.
SkillOpt in action
To evaluate the technique in practice, researchers tested SkillOpt across different models, ranging from large-scale frontier models like GPT-5.5 to smaller closed and open models including GPT-5.4-mini and Qwen3.5-4B. They also deployed the skills within different execution harnesses, using plain chat as well as complex coding harnesses like the Codex CLI and Claude Code.
The evaluation spanned diverse industry benchmarks including single-round question-answering, multi-round code generation involving tool use, and multimodal document reasoning. SkillOpt was measured against multiple baselines ranging from a default no-skill setting to human-written skills and one-shot LLM-generated skills. It was also compared against advanced prompt-optimization and skill-evolution methods, specifically Trace2Skill, TextGrad, GEPA, and EvoSkill.
SkillOpt dominated across the board, proving highly effective on all 52 evaluated combinations of model, benchmark, and harness. It was particularly effective with frontier models, delivering an average absolute improvement of +23.5 points against the no-skill baseline on GPT-5.5. Furthermore, SkillOpt outperformed a hypothetical oracle baseline that cherry-picks the best competing method for every problem.
Small target models saw immense relative gains, proving th
Source: VentureBeat












