TL;DR: Voice-agent quality is not only about what the assistant says. It is also about whether the evidence, tests, and evaluation tools run inside clear operational boundaries.
A recent example from our local-only Vapi voice management system made this plain.
We were collecting call evidence and using Codex as an LLM judge to review the result. The evidence collection worked. The transcript and supporting data were available. But the judge failed before it could complete its review.
The issue was not the voice agent.
It was not the call evidence.
It was a sandbox boundary.
The collector asked Codex to write its temporary schema and result files to the system temp directory. On a normal machine, that seems harmless. In a sandboxed Codex run, it matters. Codex was operating from the project workspace, but the output paths pointed outside that workspace, so the tool could not initialize cleanly under the read-only sandbox.
The fix was small: keep the Codex judge files under a repo-local path, `.tmp/call-evidence-codex/`, so the generated inputs and outputs stay inside the workspace.
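As a rough illustration of that fix, here is a minimal Python sketch, not the actual collector code. It assumes the repo root is known, and the file names `schema.json` and `result.json` and the helper names are hypothetical. It builds the judge paths under `.tmp/call-evidence-codex/` and refuses any path that resolves outside the workspace.

```python
from pathlib import Path

# Repo-local directory named in the article; everything else here is illustrative.
JUDGE_SUBDIR = Path(".tmp") / "call-evidence-codex"


def is_inside(path: Path, root: Path) -> bool:
    """Return True if `path` resolves to a location under `root`."""
    try:
        path.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False


def judge_paths(repo_root: Path) -> dict:
    """Create the repo-local judge directory and return the file paths to use.

    The file names are hypothetical placeholders for the schema and result files.
    """
    workdir = repo_root / JUDGE_SUBDIR
    workdir.mkdir(parents=True, exist_ok=True)
    paths = {
        "schema": workdir / "schema.json",
        "result": workdir / "result.json",
    }
    for name, path in paths.items():
        if not is_inside(path, repo_root):
            raise ValueError(f"{name} path escapes the workspace: {path}")
    return paths
```

The guard costs a few lines, but it turns a confusing sandbox failure into an explicit error raised before the judge ever starts.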
That sounds technical, but the business lesson is simple.
Secure systems need clear boundaries.
When a voice-agent builder understands those boundaries, they can tell the difference between a failed call, a failed evaluation, and a failed tool setup. That distinction prevents panic and avoids bad fixes.
If the evidence was collected but the judge could not write its result file, the right conclusion is not, "the agent failed." The right conclusion is, "the evaluation layer needs a safer runtime path."
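One way to make that distinction explicit is to record which layer failed rather than a single pass/fail flag. The sketch below is an assumption about how such a diagnosis could look, not our system's actual code. The `RunStatus` fields and the layer labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunStatus:
    """Hypothetical summary of one evaluation run."""
    evidence_collected: bool        # transcript and supporting data were saved
    judge_started: bool             # the judge tool initialized successfully
    judge_verdict: Optional[str]    # "pass", "fail", or None if no verdict was written


def diagnose(status: RunStatus) -> str:
    """Name the layer that needs attention, checking the lowest layer first."""
    if not status.evidence_collected:
        return "evidence collection"
    if not status.judge_started:
        return "tool setup"
    if status.judge_verdict is None:
        return "evaluation"
    if status.judge_verdict == "fail":
        return "call"
    return "none"
```

In the incident described above, the evidence was collected but the judge never initialized, so `diagnose` would return `"tool setup"` rather than blaming the call.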
That is the kind of operational judgement that creates trust.
Our Vapi voice management system is intentionally local-only. It is designed for a single operator, with local evidence, local artifacts, and explicit control over when anything is synced or deployed. That makes small configuration details important.
Paths, environment variables, temp folders, and sandbox settings are not background noise. They are part of the safety model.
For a business user, this is a useful competence signal.
You do not need to know every Codex flag. You do need your voice-agent operator to understand what the tools are allowed to touch, where evidence is stored, and what a test result actually proves.
A deterministic check can prove that a phrase appeared, a tool was called, or a handoff happened. An LLM judge can add a softer semantic review. If the LLM judge fails because of a sandbox setting, the call evidence may still be useful. You have a narrower result, not a worthless one.
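To make the deterministic side concrete, here is a small Python sketch. The evidence layout (`transcript`, `tool_calls`, `handoff`) is an assumed shape for illustration, not Vapi's actual schema, and the function names are hypothetical. It runs the three checks and reports the judge as unavailable instead of discarding the whole result.

```python
from typing import Optional


def run_deterministic_checks(evidence: dict, phrase: str, tool: str) -> dict:
    """Run checks that need no LLM: phrase present, tool called, handoff made."""
    transcript = evidence.get("transcript", "").lower()
    tool_names = [call.get("name") for call in evidence.get("tool_calls", [])]
    return {
        "phrase_appeared": phrase.lower() in transcript,
        "tool_called": tool in tool_names,
        "handoff_happened": bool(evidence.get("handoff")),
    }


def summarize(checks: dict, judge_verdict: Optional[str] = None) -> dict:
    """Combine results; a missing judge verdict narrows the report but keeps it."""
    return {
        "deterministic_passed": all(checks.values()),
        "checks": checks,
        "judge": judge_verdict if judge_verdict is not None else "unavailable",
    }
```

A report with `"judge": "unavailable"` still shows what the deterministic checks proved, which is exactly the narrower but useful outcome described above.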
That nuance matters when your business depends on customer calls being handled correctly.
The point is not that every operator should obsess over temporary directories. The point is that safe automation comes from disciplined boundaries. Local files should stay local. Generated outputs should be visible. Live systems should not be mutated casually. Evaluation failures should be diagnosed at the right layer.
That is how technical settings become peace of mind.
Good voice-agent operations are not just about building an assistant that sounds polished. They are about building a management system that can explain what happened, preserve evidence, and make safe changes without guessing.
-----------
If you find this content useful, please share it with this link: [https://patrickmichael.co.za/subscribe](https://patrickmichael.co.za/subscribe)