When a field evolves rapidly, its vocabulary often evolves faster than its common understanding. Terms begin to become vague, being reused in different contexts or becoming shorthand for ideas that are never fully explained. We are currently seeing this happening in the field of AI agents. Concepts are mixed there, some are renamed, and others are widely used for a few months before quietly disappearing.
This can be overwhelming for beginners as well as practitioners trying to keep up with the latest developments. After ICLR 2026, one of us (@ariG23498) posted a question that nicely captured this confusion.
“What do the terms ‘harness’ and ‘scaffolding’ mean in the context of agents? I heard a lot of explanations when I was at ICLR, and I couldn’t understand why they didn’t converge on one explanation.”
This glossary is an attempt to ground terms that appear without a clear and consistent explanation. It is not intended to be a comprehensive dictionary covering all terms in this field. Instead, we’ll focus on concepts that are often confused, reused in different ways, or seem obvious when they aren’t.
Most of these terms will come up whether you’re building an agent, deploying an agent, or simply using tools like Claude Code, Codex, and Hermes Agent. The final section describes concepts specific to training models. This makes more sense when addressing that aspect.
Many of these terms do not yet have widely accepted definitions, and different frameworks use the same terms in different ways. The goal here is not to force one correct vocabulary, but to provide a practical mental model that makes the discussion easier to understand.
Let’s get started.
table of contents
model
The model is LLM. LLM receives text and generates text (e.g. Claude, Qwen, GPT, Kimi, DeepSeek…). By itself, there is no memory between calls and no loops. A model can express the intent to invoke a tool, but requires a harness to actually execute it. Respond to one prompt and stop. It becomes an agent when wrapped around scaffolding and harnesses.
scaffold
A behavioral definition layer around the model: system prompts, tool descriptions, how to parse the model’s response, and what to remember across steps (context management). This shapes how the model perceives the world and acts in it, either during training or inference.
Products such as Claude Code, Codex, and Antigravity CLI refer to the entire thing as a harness. Claude Code’s own documentation directly says, “Claude Code acts as a harness for agents around Claude.” This is its widespread use. Harness means anything other than the model. The distinction between scaffolds and harnesses is most important when scaffolds and harnesses need to be inferred separately, such as in training pipelines. We’ll also see “scaffolding” used more broadly to cover whatever infrastructure the harness depends on, such as hooks, runtime configuration, and even directory structures.
Some products, such as Claude Code and Codex, are tightly coupled to the provider model. Other models such as Antigravity CLI and Hermes Agent allow you to plug in any model.
harness
An execution layer within the agent: calls the model, handles its tool calls, and decides when to stop. The harness is what runs the agent. Scaffolding, as defined above, is what the model functions on: its instructions, its tools, its format.
Harness engineering is the discipline for properly designing this layer. Decide when to stop the agent, how to handle errors, and what guardrails are in place to keep the agent on track. This applies to both training and inference. Addy Osmani’s article and OpenAI’s explanation of building with Codex both cover this from the inference side.
During evaluation, the same pattern is displayed as the evaluation harness. Instead of collecting training data, run a fixed set of scenarios at model checkpoints and record metrics instead of updating weights.
Some frameworks use an orchestrator as a high-level controller that coordinates work among multiple agents. Unlike a harness, which drives a model through a run loop, an orchestrator manages agents as a unit, each running its own harness (see subagents below).
agent
The term comes from reinforcement learning, where an agent is simply a function that takes observations and returns actions. The environment performs its actions and returns new observations, and the loop repeats. This loop remains the core of the LLM agent’s operation.
In the LLM world, this term has expanded. An agent is a model plus everything around it that allows it to not only respond but also act. This turns raw text generation into something that can operate within a loop of taking in information, deciding what to do, and acting on the results.
Let’s take coding agents as a concrete example. The system prompts, tool descriptions, and output formats that the model follows form the scaffolding. The loop that calls the model, processes its tool calls, and decides when to stop is the harness. During training, the harness runs many of these loops in parallel and feeds back the results to update the model.

The community usually treats it as Agent = Model + Harness (see tweets from @Vtrivedy10 and Will Brown). If you’re not a model, you’re a harness. The subtle differences between harnesses and scaffolding that cause most of the confusion are explained in the two sections above.
When we talk about products like Claude Code, Codex, Cursor, etc., they refer to specific harnesses that are built on specific models and designed and optimized together. Two products that use the same basic model can feel completely different due to different harness choices. Your experience will also change if you replace the same harness with a better model. There are three different models, harnesses, and products.
context engineering
Design what appears in the agent’s context window: what the model displays at each step, system prompts, tool descriptions, conversation history, and acquired knowledge. This is not a one-time decision. As the model runs, previous turns shape what is reflected in future calls, and the harness actively manages this throughout the run. This is true for both training and inference, but the cost of getting it wrong is very different. During training, what the model sees shapes what it learns. If you make a mistake, you will have to retrain. Inference, this is just text. Modify the prompt and redeploy. The HF Contextual Engineering course covers this in detail.
Memory is part of this picture. Short-term memory is what remains in the context window during a single run, such as conversation history, tool results, and previous inferences. Long-term memory persists between sessions, is stored externally, retrieved on demand, and brought back into context as needed.
policy
Policies are actions that agents follow. Defines the probability of performing each possible action in any given situation. An LLM system learns some of its policies through model weights, but its behavior also depends on the surrounding scaffolding and harness. The same model can behave very differently depending on prompts, tools, memory, and run loops.
Policies are not agents. Policies define behavior. An agent is a complete system that operates within an environment. When you wrap a checkpoint in scaffolding and harnesses and deploy it, you get an agent whose behavior is the policy.
Using tools
How agents reach outside of themselves: APIs, code interpreters, databases, web searches, file systems. A model expresses your intent to use a tool in a structured format. Modern inference APIs expose this as a first-class object. The harness receives the call directly and routes it to the appropriate function. The results are fed back into the context and the loop continues.
skill
Reusable, structured knowledge packages that enable multi-step tasks. If a tool is an action (“run this command”), a skill is a collection of everything needed to accomplish a goal (“investigate this bug, form a hypothesis, and create a fix”). These are portable between agents and loaded on demand. The boundaries between tools, skills, and subagents change between frameworks. The HF Contextual Engineering course explores the skills in detail.
subagent
An agent that is called by another agent to handle a specific subtask. It has its own model and scaffolding, infers independently, and returns results. The calling agent does not need to know how it works internally. This is what separates subagents from tools (function calls) or skills (packaged knowledge). The subagents themselves can reason, use tools, and call further subagents. The calling agent is sometimes called an orchestrator.
training
The above conditions apply during training or induction. These four are specific to training, where the agent performs tasks, obtains scores, and updates model weights. All of LLM’s RL training systems are built around the same pipeline.

RL environment
The environment is anything you can interact with. That is, it is a stateful object that takes actions as input, updates its internal state, and returns observations. In an LLM context, actions are typically tool calls. A file system is a simple example. The action touch foo.txt updates the state by creating a file, and a list of updated files may be observed. Definitions vary by framework.
We recently published a dedicated guide on this. So, rather than compress it here, please see the Ultimate Guide to RL Environments for a complete breakdown of types, frameworks, and examples.
trainer
Trainers improve agents. The trainer runs many agent episodes, scores the results, and uses them to update the internal model weights. TRL’s GRPOTrainer is a concrete example. A single class that handles episode generation, reward scoring, and weight updates.
roll out
A rollout is one complete run of the agent from start to finish. That is, it shows what the agent saw, what it did, and what reward it got for each step. Also called a trajectory or trace, depending on the context. This is the raw data that the RL algorithm learns from.
reward
A score that tells the training algorithm whether the model is improving. This can be verifiable (test pass/fail, matching answers), learned (human preferences, LLM as judge), sparse (one score at the end of the episode), or dense (score at each step). This is what the trainer uses to actually update the internal model weights. For a detailed breakdown of each type, see the Reward Architecture section of Adithya’s guide.
Rubrics divide rewards into explicit dimensions using weights rather than a single number. OpenEnv and Verifier implement rubrics as combinable objects (WeightedSum, Sequential, Gate).
learn more
If you think the definitions are inaccurate, or if you come across a term that we’ve missed, we’d love to hear from you.
Thanks to Pedro Cuenca, Quentin Gallouédec, Shaun Smith, and Adithya S Kolavi for reviewing this post.

