Our laboratory is developing an intelligent agent that operates a remotely piloted aircraft with two human teammates who communicate using text chat. The task is well-defined, but the inputs during varied 40-minute missions are potentially numerous and unpredictable. To assure the reliability of agent behavior, we must run a large number of missions and analyze the behavior of the agent at millisecond resolution. To support this requirement, we have developed 1) a scripting language and control system that drives a mission with simulated teammates and environmental events, 2) scripted missions using actual chat input from a previous study, 3) output files for each mission that trace agent actions, situation state, and program events, and 4) scripts that analyze the output files based on performance heuristics and differences from known-good output. This framework allows us to verify complex agent behavior as development progresses.
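
As an illustration of the fourth component, comparing a mission's output file against known-good output, the following is a minimal sketch only. The trace format (tab-separated timestamp in milliseconds, record kind, and detail), the names `TraceEvent`, `parse_trace`, and `diff_traces`, the sample records, and the 50 ms timing tolerance are all hypothetical; the actual file format and analysis scripts are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    t_ms: int    # time since mission start, in milliseconds
    kind: str    # hypothetical record kind: "action", "state", or "event"
    detail: str  # free-form description of what happened


def parse_trace(text):
    """Parse a hypothetical trace of tab-separated lines: t_ms, kind, detail."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        t_ms, kind, detail = line.split("\t", 2)
        events.append(TraceEvent(int(t_ms), kind, detail))
    return events


def diff_traces(baseline, candidate, tolerance_ms=50):
    """Compare events pairwise, in order, and describe each difference found."""
    problems = []
    for i, (b, c) in enumerate(zip(baseline, candidate)):
        if (b.kind, b.detail) != (c.kind, c.detail):
            problems.append(
                f"event {i}: expected {b.kind} {b.detail!r}, "
                f"got {c.kind} {c.detail!r}"
            )
        elif abs(b.t_ms - c.t_ms) > tolerance_ms:
            problems.append(
                f"event {i}: {c.detail!r} shifted by {c.t_ms - b.t_ms} ms"
            )
    if len(baseline) != len(candidate):
        problems.append(
            f"event count differs: expected {len(baseline)}, got {len(candidate)}"
        )
    return problems


# Invented sample data, for illustration only.
KNOWN_GOOD = (
    "0\tevent\tmission_start\n"
    "1200\taction\tset_altitude 5000\n"
    "2500\tstate\ttarget_in_view\n"
)
NEW_RUN = (
    "0\tevent\tmission_start\n"
    "1290\taction\tset_altitude 5000\n"
    "2510\tstate\ttarget_lost\n"
)

for problem in diff_traces(parse_trace(KNOWN_GOOD), parse_trace(NEW_RUN)):
    print(problem)
```

With this invented data, the sketch reports one timing difference (the altitude action arrives 90 ms late) and one content difference (the final state record). A strict pairwise comparison is the simplest choice; a real analysis would likely need to tolerate reordered or inserted events.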