Good agent workflows do not stop at fluent output. They end when you have checked the relevant evidence.
That matters because a model can sound confident while still being wrong about files, code, commands, tests, or the state of a repository. In this course, a useful default is: treat the model’s answer as a draft until the files, tools, or test results support it.
After working through this page, students should be better able to:
To keep the examples comparable, imagine the same small repository in every case:
README.md
src/demo_app/cli.py
src/demo_app/utils.py
tests/test_cli.py

The point is not this exact project. The point is to keep the repository fixed while the kind of claim changes.
Suppose an agent says:
The program starts in `src/demo_app/cli.py`.
That might be right, but the claim is not trustworthy just because it sounds plausible. Ask for the evidence path.
A better interaction is:
Show your work. Which file and command support that answer?
Now the verification can be grounded in the repository itself:
rg "__main__|main\(" -n src tests
sed -n '1,120p' src/demo_app/cli.py
If the search result and file contents really show main() being called from
src/demo_app/cli.py, then the claim is supported. If not, the answer needs to
be revised.
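The same check can also be scripted. The sketch below is one possible way to do it in Python, assuming the entry point follows the common `if __name__ == "__main__":` pattern. The helper name `has_main_guard` is hypothetical, and both sample sources are invented for illustration; they are not the real contents of `cli.py` or `utils.py`.

```python
import ast


def has_main_guard(source: str, func_name: str = "main") -> bool:
    """Return True if `source` calls `func_name` under `if __name__ == "__main__":`."""
    tree = ast.parse(source)
    for node in tree.body:  # only top-level statements are considered
        if not isinstance(node, ast.If):
            continue
        test = node.test
        is_guard = (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and test.left.id == "__name__"
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and len(test.comparators) == 1
            and isinstance(test.comparators[0], ast.Constant)
            and test.comparators[0].value == "__main__"
        )
        if not is_guard:
            continue
        # Look for a call to func_name anywhere inside the guarded block.
        for stmt in node.body:
            for inner in ast.walk(stmt):
                if (
                    isinstance(inner, ast.Call)
                    and isinstance(inner.func, ast.Name)
                    and inner.func.id == func_name
                ):
                    return True
    return False


# Invented stand-in for src/demo_app/cli.py (an assumption, not the real file).
SAMPLE_CLI = '''
def main():
    print("hello")

if __name__ == "__main__":
    main()
'''

# Invented stand-in for a module that has no entry point.
SAMPLE_UTILS = '''
def format_name(name):
    return name.title()
'''

print(has_main_guard(SAMPLE_CLI))
print(has_main_guard(SAMPLE_UTILS))
```

A script like this only confirms that the pattern is present in the text it was given. It does not replace reading the file, and it would miss entry points declared elsewhere, such as in packaging metadata.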
Suppose an agent says:
I renamed `format_name` to `render_name` everywhere.
This is a code-change claim, so the first check is not prose. The first check is the diff.
git diff
rg "format_name|render_name" -n src tests
pytest tests/test_cli.py -q
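Here the evidence should show three things: the edit landed in the intended file, no references to the old name remain, and the tests still pass.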
If the agent changed the wrong file, missed one reference, or broke a test, the claim is not yet verified.
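For illustration, the "missed one reference" case can be caught with a few lines of Python. This is a minimal sketch, not a replacement for `git diff` or `rg`: the helper names `find_references` and `build_demo_repo` are hypothetical, and the miniature repository is invented so that exactly one stale reference remains.

```python
import re
import tempfile
from pathlib import Path


def find_references(root: Path, name: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for each whole-word match of `name`."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits


def build_demo_repo(root: Path) -> None:
    """Create an invented miniature repo in which one reference was missed."""
    (root / "src" / "demo_app").mkdir(parents=True)
    (root / "tests").mkdir()
    (root / "src" / "demo_app" / "utils.py").write_text(
        "def render_name(name):\n    return name.title()\n", encoding="utf-8"
    )
    # Stale import: this file still uses the old name.
    (root / "src" / "demo_app" / "cli.py").write_text(
        "from demo_app.utils import format_name\n", encoding="utf-8"
    )
    (root / "tests" / "test_cli.py").write_text(
        "from demo_app.utils import render_name\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    build_demo_repo(repo)
    leftovers = find_references(repo, "format_name")
    renamed = find_references(repo, "render_name")

# Any hit for the old name contradicts the claim "renamed everywhere".
for path, lineno, text in leftovers:
    print(f"{path}:{lineno}: {text}")
```

An empty `leftovers` list supports the "everywhere" part of the claim; it says nothing about whether the tests still pass, which is why the test run remains a separate check.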
Suppose an agent says:
I fixed the CLI so it prints the greeting correctly.
That is a behavioural claim. The best evidence is to run the behaviour or a test that checks it.
python -m demo_app --name "Ada"
pytest tests/test_cli.py -q
If the program output or test result matches the claim, good. If not, then the model has described a fix that is not actually present.
A test is often the best first check here. If a behaviour can be tested, run the test before trusting the explanation around it.
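As a rough sketch of what such a test might look like, the example below defines a stand-in `main()` and checks its printed output. Everything about the CLI here is an assumption made for illustration: the `--name` default, the exact greeting text `Hello, Ada!`, and the helper `run_cli` are hypothetical and are not taken from the real `cli.py` or `tests/test_cli.py`.

```python
import argparse
import contextlib
import io


def main(argv=None) -> int:
    """Hypothetical CLI entry point: print a greeting for --name."""
    parser = argparse.ArgumentParser(prog="demo_app")
    parser.add_argument("--name", default="world")
    args = parser.parse_args(argv)
    print(f"Hello, {args.name}!")  # assumed greeting format
    return 0


def run_cli(argv: list[str]) -> tuple[int, str]:
    """Run main() with the given arguments and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exit_code = main(argv)
    return exit_code, buffer.getvalue()


def test_greets_by_name():
    # The claim "prints the greeting correctly" becomes an exact, checkable output.
    exit_code, output = run_cli(["--name", "Ada"])
    assert exit_code == 0
    assert output == "Hello, Ada!\n"


test_greets_by_name()
```

The design choice worth noticing is that the test asserts on observed output rather than on the code's structure, so it fails if the fix described by the agent is not actually present.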
Suppose an agent says:
The program reads configuration from the `DEMO_APP_MODE` environment variable.
This is not mainly a code-edit claim. It is an explanation claim. You still verify it against the repository:
rg "DEMO_APP_MODE" -n src tests README.md
sed -n '1,120p' src/demo_app/utils.py
The explanation is only as good as the evidence path. A helpful answer should point you to the relevant file and ideally cite the line or command that supports the explanation.
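A text search finds every mention of `DEMO_APP_MODE`, including comments and help strings that never read the variable. One way to narrow that down is sketched here, assuming Python 3.9 or later and assuming the code reads the environment through `os.getenv`, `os.environ.get`, or `os.environ[...]`. The helper `env_var_reads` is hypothetical, and the sample sources are invented rather than copied from `utils.py`.

```python
import ast


def env_var_reads(source: str, var_name: str) -> list[int]:
    """Return line numbers where `var_name` is read from the environment."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and node.args:
            # Matches os.getenv("X", ...) and os.environ.get("X", ...).
            first = node.args[0]
            if (
                ast.unparse(node.func) in {"os.getenv", "os.environ.get"}
                and isinstance(first, ast.Constant)
                and first.value == var_name
            ):
                lines.append(node.lineno)
        elif isinstance(node, ast.Subscript):
            # Matches os.environ["X"].
            key = node.slice
            if (
                ast.unparse(node.value) == "os.environ"
                and isinstance(key, ast.Constant)
                and key.value == var_name
            ):
                lines.append(node.lineno)
    return sorted(lines)


# Invented stand-in for src/demo_app/utils.py (an assumption, not the real file).
SAMPLE_UTILS = '''import os

# DEMO_APP_MODE is mentioned here, but a comment is not a read.
def get_mode():
    return os.environ.get("DEMO_APP_MODE", "default")
'''

# A mention inside a string literal only; nothing reads the variable.
MENTION_ONLY = 'HELP = "Set DEMO_APP_MODE to change behaviour."\n'

print(env_var_reads(SAMPLE_UTILS, "DEMO_APP_MODE"))
print(env_var_reads(MENTION_ONLY, "DEMO_APP_MODE"))
```

This sketch deliberately ignores aliases such as `from os import environ`, so an empty result means "not found in these forms", not "never read". The returned line numbers are the kind of citation a good explanation should be able to point to.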
After the examples, the general pattern becomes easier to see:
| If the model claims… | First check | Good evidence looks like |
|---|---|---|
| something about repository structure | search plus file inspection | matching files, paths, and contents in the repo |
| it changed code correctly | git diff plus targeted search | the intended edit is present and unintended edits are absent |
| it fixed behaviour | run the command or test | observed output or test results match the claim |
| its explanation is accurate | ask for file-backed evidence | cited files, commands, and outputs support the explanation |
When possible, ask the agent to show its evidence path rather than only its conclusion.
Weak:
The entry point is `src/demo_app/cli.py`.
Stronger:
The entry point appears to be `src/demo_app/cli.py`; I found `main()` there with `rg "__main__|main\(" -n src tests`, and the file shows it being called under `if __name__ == "__main__":`.
The stronger version is better because it gives you something to inspect. This resembles good literate-programming habits too: not just stating the conclusion, but showing the path from evidence to conclusion.
On their own, these are not enough:
In short: "the model said it" is weaker than "the tool showed evidence", which is weaker than "the files, commands, and tests confirmed it".
Before you rely on an agent answer, ask: