AI Agents Should Call Models

May 18, 2026 11 min read


The MLAT paper describes a production pilot where an agent calls an XGBoost pricing model as a tool, achieves R^2 = 0.807 on held-out data, reports mean absolute error of 3688 USD, and cuts proposal generation from hours to under 10 minutes.¹

The useful idea is not the exact pricing model. The useful idea is the boundary: when a task needs a score, forecast, price, risk estimate, ranking, classifier, or detector, the agent should call the model that was trained for that job. It should not improvise a statistical answer in fluent prose.

A trained model belongs in the agent tool registry. The LLM can decide when to call it, explain the result, ask for missing inputs, and route exceptions. The fitted model should produce the numeric estimate, confidence signal, versioned output, and evidence trail.

TL;DR

LLM agents are good at orchestration. Statistical and machine-learning models are often better at bounded prediction. The Machine Learning as a Tool pattern treats a fitted ML model as a callable tool inside an agent workflow, alongside search, databases, APIs, and other tools.¹

That pattern gives teams a clean operating rule: let the agent coordinate the work, but make specialized models do specialized inference. The result should include model version, input schema, output schema, calibration notes, and a traceable call record. Without that boundary, the LLM may sound certain while silently replacing a model with a guess.

Key Takeaways

For agent builders: expose trained models as typed tools with schemas, versions, and failure modes.
For ML teams: treat the agent as a caller, not as a replacement for model evaluation, persistence, or registry discipline.
For product teams: show whether a number came from a model call, a rule, a database, or an LLM explanation.
For security teams: apply the same scoped-authority logic from Agent Keys Need Risk Budgets to model tools.
For reviewers: demand the model call, model version, inputs, output, and confidence limits before trusting the answer.

Why Should Agents Call Models Instead Of Imitating Them?

An LLM can discuss a price. A pricing model can estimate one from the features it learned. An LLM can summarize risk. A risk model can score risk from a tested feature set. An LLM can describe churn. A churn model can return a probability tied to a training process.

Those are different jobs.

Agent tools already make that split possible. OpenAI’s Agents SDK documents function tools with JSON Schema parameters, tool invocation handlers, and structured tool output.² Anthropic’s tool-use docs describe Claude calling client-side tools and external functions with JSON Schema inputs.³ The agent can ask for a model prediction through the same tool pattern it uses for search, calendar updates, shell commands, or database queries.
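As a minimal sketch of that tool pattern, a pricing model can be described with a JSON Schema style parameter block and checked before any inference runs. The tool name, the feature fields, and the `validate_arguments` helper are all hypothetical; they are not taken from the MLAT paper or from either vendor SDK, and a production system would likely use a full JSON Schema validator instead of this hand-rolled check.

```python
# Hypothetical tool definition. The tool name and feature fields are
# illustrative assumptions, not taken from the MLAT paper or a vendor SDK.
PRICING_TOOL = {
    "name": "estimate_project_price",
    "description": "Return a price estimate from the fitted pricing model.",
    "parameters": {
        "type": "object",
        "properties": {
            "team_size": {"type": "integer", "minimum": 1},
            "duration_weeks": {"type": "integer", "minimum": 1},
            "industry": {"type": "string"},
        },
        "required": ["team_size", "duration_weeks", "industry"],
        "additionalProperties": False,
    },
}

# Only the two JSON types used above are mapped in this sketch.
_JSON_TYPES = {"integer": int, "string": str}


def validate_arguments(tool, arguments):
    """Return a list of problems with the arguments; empty means acceptable."""
    schema = tool["parameters"]
    errors = [
        f"missing required field: {key}"
        for key in schema["required"]
        if key not in arguments
    ]
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unknown field: {key}")
        elif isinstance(value, bool) or not isinstance(value, _JSON_TYPES[spec["type"]]):
            # bool is a subclass of int in Python, so it is rejected explicitly.
            errors.append(f"wrong type for {key}")
        elif "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{key} is below minimum {spec['minimum']}")
    return errors
```

An agent runtime could run this check on the arguments the LLM proposes and return the error list to the LLM instead of calling the model with a malformed feature set.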

The core failure mode appears when teams skip that split. They ask the LLM for an estimate because the LLM can produce one. The answer arrives quickly. The prose looks reasonable. The interface has no visible clue that the number came from pattern completion instead of a fitted estimator.

That is a weak contract. The user does not know what produced the result. The reviewer cannot inspect model version or input features. The operator cannot replay the call. The product cannot explain why the answer changed.

The Evidence Gate applies here: confidence is not evidence. A model call can produce evidence. A prose guess usually cannot.

What Does The MLAT Pattern Add?

MLAT stands for Machine Learning as a Tool. The paper frames a trained ML model as a first-class tool that an LLM agent can invoke when the conversation needs quantitative estimation.¹

The paper’s pilot system, PitchCraft, uses two agents. A research agent gathers prospect context through parallel tool calls. A draft agent calls an XGBoost pricing model and then writes a proposal through structured outputs.¹ The ML model handles pricing. The LLM handles context, assembly, and explanation.

That split matters because it avoids two bad designs and replaces them with a third:

Design	What happens
LLM-only estimation	The LLM invents a plausible number without model lineage, calibration, or replayable inputs.
Pipeline-only automation	The ML model runs as a fixed preprocessing step even when the conversation does not need it.
MLAT-style tool call	The agent calls the model when the task needs it and keeps the output inside a traceable contract.

The agent still matters. It can decide when the pricing input is incomplete. It can ask a user for missing fields. It can call search or CRM tools before invoking the model. It can explain that the estimate came from a model, not from its own authority.

That is the right division of labor: the LLM orchestrates; the fitted model predicts.

What Should A Model Tool Return?

A model tool should not return a naked number. A serious model tool should return an evidence object.

Field	Why it belongs in the output
`model_name`	Identifies the model family or product capability.
`model_version`	Lets reviewers compare output across releases.
`input_schema_version`	Prevents silent feature-shape drift.
`features_used`	Shows which inputs shaped the estimate.
`prediction`	Carries the score, price, class, rank, or forecast.
`confidence` or `interval`	Names uncertainty when the model supports it.
`known_limits`	Keeps the answer inside the model’s valid domain.
`trace_id`	Connects the result to logs, review packets, and replay.
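The fields in the table can be sketched as a small typed record. This is one possible shape, not a standard: the class name `ModelEvidence`, the choice of a frozen dataclass, and every value in the example instance are assumptions made for illustration.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelEvidence:
    """Evidence object returned by a model tool instead of a bare number."""

    model_name: str
    model_version: str
    input_schema_version: str
    features_used: dict
    prediction: float
    # Optional because not every model can report an interval.
    interval: Optional[Tuple[float, float]] = None
    known_limits: Tuple[str, ...] = ()
    # Generated per call so logs and review packets can reference it.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Illustrative values only; none of these come from a real model.
evidence = ModelEvidence(
    model_name="pricing",
    model_version="4",
    input_schema_version="2",
    features_used={"team_size": 3, "duration_weeks": 8},
    prediction=47000.0,
    interval=(42000.0, 52000.0),
    known_limits=("trained on fixed-bid projects only",),
)
record = asdict(evidence)  # plain dict, ready for logging or serialization
```

Freezing the record is a design choice: once the model tool has produced the evidence, nothing downstream, including the agent, should be able to edit the prediction in place.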

That output shape makes model tools compatible with Agent Execution Traces Are the Runtime Contract. If an agent calls a pricing model, the trace should show the model call. If an agent skips the model and writes a number anyway, the trace should make that absence obvious.

The same logic supports Review Packets Are the New Final Answer. A final answer with a price is weak. A final answer with a model-call record, model version, feature snapshot, and confidence note gives the reviewer something to inspect.
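One way to make a skipped model call visible is to compare the amounts in the final answer against the recorded tool calls. The sketch below assumes a hypothetical trace format in which each tool-call record is a dict with a `prediction` key, and it only recognizes dollar amounts written with a `$` prefix. A real check would need to handle other units, rounding rules, and derived figures.

```python
import re

# Matches amounts such as $47,000 or $1,250.50 and captures the digits.
_AMOUNT = re.compile(r"\$\s?([0-9][0-9,]*(?:\.[0-9]+)?)")


def unbacked_amounts(final_answer, tool_calls):
    """Return dollar amounts in the answer that no recorded model call produced."""
    backed = {round(float(call["prediction"]), 2) for call in tool_calls}
    amounts = [
        float(text.replace(",", ""))
        for text in _AMOUNT.findall(final_answer)
    ]
    return [amount for amount in amounts if round(amount, 2) not in backed]
```

A reviewer tool could flag any answer for which this function returns a non-empty list, since each returned amount appears in the prose without a matching model call in the trace.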

Where Do Model Registries Fit?

Tool wrapping does not replace MLOps. It exposes MLOps to the agent runtime.

MLflow’s model registry documentation describes lineage, versioning, aliases, metadata tags, and lifecycle information for models.⁴ That registry layer matters because an agent workflow can only cite a model version if the platform tracks versions in the first place.

Scikit-learn’s model persistence docs make a related point from the serving side: persistence choices carry security and portability tradeoffs, and ONNX can serve models without a Python environment while pickle-based paths require trust in the source.⁵ A model tool should not smuggle an unsafe model artifact into an agent just because the agent asked for a prediction.

The minimum operating stack looks like this:

Layer	Responsibility
Model registry	Stores lineage, version, aliases, metadata, and lifecycle state.
Model serving	Loads the model safely and executes inference.
Tool wrapper	Defines input schema, output schema, permissions, timeout, and error shape.
Agent runtime	Decides when to call the tool and how to explain the result.
Review surface	Shows the call, version, inputs, result, and limits.

Teams often collapse those layers into one endpoint called `predict`. That shortcut works for demos. It fails when the agent starts chaining predictions into customer emails, sales proposals, underwriting notes, infrastructure plans, or medical triage drafts.

The product needs a model contract, not a magic endpoint.
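The layers above can be sketched in a few lines to show where the contract lives. Everything here is a stand-in: `REGISTRY` is a hypothetical in-memory dict rather than a real registry client, the lambda is a toy formula rather than a fitted model, and the error codes are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical registry: alias -> (version, predict function).
# The lambda is a toy stand-in for a fitted model loaded by a serving layer.
REGISTRY = {
    "pricing@production": (
        "4",
        lambda f: 1000.0 * f["team_size"] * f["duration_weeks"],
    ),
}


def call_model_tool(alias, features, timeout_s=2.0):
    """Tool wrapper: resolve the version, run inference, return one error shape."""
    if alias not in REGISTRY:
        return {"ok": False, "error": {"code": "model_not_found", "message": alias}}
    version, predict = REGISTRY[alias]
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(predict, features)
    try:
        prediction = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread may still be running; this only bounds the wait.
        return {"ok": False, "error": {"code": "timeout", "message": alias}}
    except Exception as exc:
        return {"ok": False, "error": {"code": "inference_failed", "message": str(exc)}}
    finally:
        pool.shutdown(wait=False)
    return {"ok": True, "model_version": version, "prediction": prediction}
```

The point of the wrapper is that the agent always receives either a versioned prediction or a structured error. It never receives a raw exception or a silent default that it could mistake for a model output.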

How Should Products Show Model Output?

The UI should tell the user when an answer came from a model tool.

Bad interface copy hides provenance:

UI claim	Problem
“The agent recommends $47,000.”	The source of the number is invisible.
“AI predicts high risk.”	The user cannot tell whether a fitted model, rule, or LLM produced the score.
“Best match: Vendor B.”	The ranking method disappears.

Better copy names the production path:

UI claim	Better signal
“Pricing model v4 estimated $47,000; agent adjusted the proposal language.”	Separates estimate from prose.
“Risk model returned high risk from five available features.”	Shows source and input basis.
“Ranking model v2 chose Vendor B; agent summarized the tradeoffs.”	Splits ranking from explanation.

That distinction protects user dignity. Users should not have to guess whether a number came from a tested model, a business rule, a database lookup, or a language-model completion. Agentic Design Is Control Surface Design argues that agent products need surfaces for supervision and control. Model provenance is one of those surfaces.

Model cards help with the same problem at the documentation layer. The Model Cards paper proposes structured reporting for model characteristics, intended use, metrics, and evaluation context.⁶ Agent interfaces can borrow that idea at runtime: every model answer should carry enough context for a user or reviewer to understand what kind of claim the model made.
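A provenance label can be generated from the same record the tool returns. The function below is a hypothetical sketch: the source categories mirror the ones named in this article (model, rule, database, LLM), and the exact wording of each label is an assumption a product team would replace with its own copy.

```python
def provenance_label(source, value, name=None, version=None):
    """Build interface copy that names where a value came from."""
    if source == "model":
        if not name or not version:
            # A model output without a name and version cannot be reviewed.
            raise ValueError("model outputs need a name and a version")
        return f"{name} v{version} estimated {value}."
    if source == "rule":
        return f"Business rule {name} (version {version}) returned {value}."
    if source == "database":
        return f"{value} was read from {name}."
    if source == "llm":
        return f"{value} is an unmodeled LLM estimate."
    raise ValueError(f"unknown source: {source}")
```

For example, `provenance_label("model", "$47,000", name="Pricing model", version="4")` produces copy in the style of the better UI claims above, while an unrecognized source raises an error instead of rendering an unlabeled number.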

What Should Agents Refuse?

A model-aware agent should refuse several tempting shortcuts.

It should refuse to invent a model output when the model tool is unavailable. It can say the pricing model failed. It can ask whether the user wants a rough estimate that is clearly labeled as not coming from the model. It should not silently replace the model.

It should refuse to widen the model’s domain without evidence. A churn model trained on mid-market SaaS data should not become a universal business-health oracle because the prompt asks nicely.

It should refuse to hide uncertainty. If a model returns an interval, the answer should not collapse it into a single confident number unless the product has a clear display rule.

It should refuse to call a model tool with missing or fabricated features. The agent can collect inputs, ask follow-up questions, or mark fields unknown. It should not fill the feature vector with convenient fiction.

It should refuse to treat model authority as action authority. A model can estimate fraud risk. That does not mean the agent can freeze an account. The action still needs the scoped-key discipline from Agent Keys Need Risk Budgets.
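Several of these refusals can be enforced in code before the model tool is ever called, rather than left to the prompt. The sketch below uses the churn example from above; the feature names, the segment label, and the three outcome strings are hypothetical.

```python
# Hypothetical inputs for a churn model trained on mid-market SaaS data.
REQUIRED_FEATURES = ("seats", "monthly_spend", "tenure_months")
VALID_SEGMENTS = {"mid-market-saas"}


def guard_model_call(features, segment, tool_available=True):
    """Decide whether to call the model, ask the user, or refuse."""
    if not tool_available:
        # Never substitute a guess when the model cannot run.
        return ("refuse", "model tool unavailable; no estimate produced")
    if segment not in VALID_SEGMENTS:
        # Never widen the model's domain because the prompt asks for it.
        return ("refuse", f"segment '{segment}' is outside the model's valid domain")
    missing = [name for name in REQUIRED_FEATURES if features.get(name) is None]
    if missing:
        # Never fill the feature vector with invented values.
        return ("ask_user", missing)
    return ("call_model", None)
```

This guard covers availability, domain, and missing inputs. It does not cover action authority: a `call_model` result only permits inference, and any action taken on the score still needs its own authorization check.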

The Decision Rule

Use this rule when building an agent workflow:

Task asks for	Agent should
A fact from a source	Retrieve or query the source.
A prediction from historical data	Call the trained model.
A classification with known labels	Call the classifier or ask for missing inputs.
A business rule	Execute the rule and cite the rule version.
A subjective recommendation	Separate evidence, model outputs, and judgment.
An action based on a score	Require model output plus action authorization.

That rule gives the LLM a valuable job without letting it impersonate every other system. It can coordinate the workflow, explain outputs, draft the message, and ask better questions. It cannot become the pricing model, risk model, fraud model, ranking model, or policy engine by sounding fluent.

The best agent products will not ask one model to pretend to be the whole company. They will build a tool surface where each system does the job it can prove.
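The decision rule can be sketched as a small routing table. The enum members and the route names are hypothetical labels for the six rows of the table, not identifiers from any framework.

```python
from enum import Enum


class TaskKind(Enum):
    FACT = "fact"
    PREDICTION = "prediction"
    CLASSIFICATION = "classification"
    BUSINESS_RULE = "business_rule"
    RECOMMENDATION = "recommendation"
    SCORED_ACTION = "scored_action"


# One route per row of the decision table, except scored actions,
# which depend on state and are handled in route() below.
ROUTES = {
    TaskKind.FACT: "retrieve_or_query_source",
    TaskKind.PREDICTION: "call_trained_model",
    TaskKind.CLASSIFICATION: "call_classifier_or_ask_for_inputs",
    TaskKind.BUSINESS_RULE: "execute_rule_and_cite_version",
    TaskKind.RECOMMENDATION: "separate_evidence_outputs_and_judgment",
}


def route(kind, has_model_output=False, is_authorized=False):
    """Return the next step for a task; scored actions need two gates."""
    if kind is TaskKind.SCORED_ACTION:
        if not has_model_output:
            return "call_trained_model"
        if not is_authorized:
            return "request_authorization"
        return "execute_action"
    return ROUTES[kind]
```

The scored-action branch keeps the two gates separate on purpose: having a model output never implies authorization, and authorization never stands in for a missing model call.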

FAQ

Is this only for traditional machine-learning models?

No. The same pattern applies to any specialized estimator or scorer: gradient-boosted models, classifiers, ranking systems, forecasting models, rules engines, retrieval scorers, and domain-specific detectors. The point is not the algorithm. The point is the contract around the output.

Why not let the LLM estimate directly?

Sometimes a rough qualitative estimate is fine, as long as the product labels it as one. When the user needs a price, risk score, forecast, or eligibility decision, the answer should come from a tested model or rule path with traceable inputs and limits.

Does a model tool make the answer automatically correct?

No. A model tool can still be stale, biased, miscalibrated, misused, or outside its valid domain. The model tool improves inspectability. It does not remove the need for evaluation, monitoring, and human review.

What is the minimum viable model-tool contract?

Start with input schema, output schema, model version, prediction, confidence or caveat, error shape, timeout, and trace ID. Add feature names, registry link, model-card reference, and calibration notes when the model affects money, access, safety, or customer-facing decisions.

How does this change agent UX?

The interface should label the source of important outputs. Users should see whether an answer came from a model call, a retrieved document, a business rule, a human approval, or LLM synthesis. That provenance changes how much trust the answer deserves.

References

1. Blake Crosley, “Machine Learning as a Tool (MLAT): A Framework for Integrating Statistical ML Models as Callable Tools within LLM Agent Workflows,” arXiv, submitted February 19, 2026. Source for the MLAT framing, PitchCraft pilot, XGBoost model tool, R^2 = 0.807, mean absolute error of 3688 USD, and proposal-generation time claim.
2. OpenAI Agents SDK, “Tools,” OpenAI documentation. Source for function tools, hosted tools, JSON Schema parameters, tool invocation handlers, and structured tool output in agent workflows.
3. Anthropic, “Tool use with Claude,” Anthropic documentation. Source for Claude calling external tools and client-side tools through JSON Schema-defined inputs.
4. MLflow, “ML Model Registry,” MLflow documentation. Source for registry concepts including lineage, versioning, aliases, metadata tagging, annotation support, and lifecycle tracking.
5. scikit-learn, “Model persistence,” scikit-learn documentation. Source for persistence methods, ONNX serving without a Python environment, and security warnings around pickle-based persistence.
6. Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru, “Model Cards for Model Reporting,” Google Research. Source for structured model reporting around model characteristics, intended use, metrics, and evaluation context.