Anisotrope

See what your prompt does to your agent.

The problem

You're designing agent behavior with no instrument to measure the result.

You configure an AI agent — write its system prompt, define its persona, set its behavioral boundaries — and you have no way of knowing what you actually changed.

Did your "always be professional" instruction make the agent more reliable? Or did it quietly degrade its ability to push back on bad requests? Did switching from Claude to GPT-4o preserve the behavior you designed? Or did it shift in ways you can't see from spot-checking?

The gap isn't in model capability. It's in measurement.

The instrument

A behavioral scorecard for AI agents.

Anisotrope measures your agent's configuration across 15 behavioral dimensions — from epistemic honesty to refusal calibration to ambiguity tolerance — and shows you exactly what changes when you change a prompt, switch a model, or modify a constraint.

The scorecard doesn't tell you "73% similar." It tells you which dimensions held, which shifted, and by how much. The difference between a similarity score and a behavioral profile is the difference between a thermometer and a blood panel.

Works on any model with API access. Claude, GPT-4o, open-source. Run a scorecard in under ten minutes for less than a dollar.

See the instrument

Same prompt. Different models. Different behavioral profiles.

A customer support agent with a standard system prompt, modified with an authority persona. Same configuration, measured on Claude Sonnet 4.6 and GPT-4o. The scorecard shows what survived and what didn't.

Most dimensions hold. Both models handle sycophancy, bias resistance, and instruction following identically under the authority persona.

The gaps are specific and diagnostic. Situational awareness: Claude maintains full awareness of its own nature. GPT-4o loses it entirely. Epistemic honesty: Claude continues to flag uncertainty. GPT-4o confirms claims it has no basis for.

Same prompt. Same probes. The geometry tells you which model's training survives your configuration — and which dimensions to watch.

Prompt iteration

Same model. Tighter prompt. Scores hold. Certainty collapses.

A customer support agent on Claude Sonnet 4.6. The same configuration as before — but this time, instead of changing the model, we changed the prompt. One instruction added: "Be concise. Get to the point."

The scorecard says nothing changed. Every dimension holds. The butterfly says everything changed.

Quality stable

14 of 14 scored dimensions held at 1.00. No degradation detected.

2 response pattern shifts

Epistemic Honesty, Calibration — same scores, different response patterns.

This is the case that justifies the instrument.

A conciseness constraint sounds harmless. The scorecard confirms it — quality scores are perfect across the board. If you stopped here, you'd ship the prompt change with confidence.

The entropy pattern tells a different story. The agent still gives the right answers, but it gives them with a completely different certainty structure. Response diversity reorganizes. Dimensions that previously showed measured deliberation — weighing options, qualifying statements, hedging appropriately — now show sharp, committed responses. The agent isn't wrong. It's overconfident.

The instrument sees the difference between an agent that's right and an agent that's right for the wrong reasons. That's what geometric measurement gives you that no eval benchmark can.

How it works

The geometry under the scorecard.

Anisotrope sends precisely designed probe inputs to your agent and measures the geometric structure of response distributions on the probability simplex. Each measurement uses √JSD — a true metric with the geometry of a sphere — not a heuristic similarity score.

Every well-functioning configuration produces the same baseline geometry: a universal equilateral triangle. Drift is departure from equilateral, decomposed into breathing (uniform expansion) and shearing (directional redistribution). The decomposition type is a property of the prompt change itself.

The research

Read the paper.

Measuring What Persists: Geometric Foundations for AI Agent Identity

Tanner, A. et al. · 2026

A mathematical framework for measuring AI agent identity persistence using geometric tools on the probability simplex. √JSD metric, magnitude, perturbation theory, algebraic mode decomposition.

Read on arXiv

See what your agents look like.

Anisotrope is in early access. We're working with a small number of teams building agent platforms and deploying agents at scale. If you're building agents and want to measure what your prompts actually do — reach out.

Get in touch or read the paper first →