Issue-Fixture Recovery Diagnostics for Hermes-Agent

A bounded Lab v1 simulation artifact using real issue and PR-linked fixtures.

Victor de Genaro
Hermes-Agent recovery eval lab
Repo: Magaav/simulation-hermes-agent-expectation-trace
Dashboard: GitHub Pages

Abstract

This Lab v1 artifact converts real Hermes-Agent GitHub issues and PR-linked regressions into fixed input fixtures for recovery-diagnostic evaluation. It compares a baseline representation against a second node with a LeWorldModel-inspired expectation trace observer. The observer records expected-vs-actual transitions, heuristic surprise scores, failure categories, and recovery hints. The purpose is not to establish runtime superiority, but to test whether issue history can become a reusable evaluation surface for agent recovery behavior.

Final Conclusion

On 12 fixed issue/PR-linked fixtures, the expectation trace observer produced better fixture-level heuristic scores than the baseline representation: failure detection rose from 0.83 to 1.00, average recovery-hint quality rose from 0.92 to 2.75, and average diagnosis steps fell from 3.67 to 1.00. These results are bounded to the selected fixtures and scoring rubric. They are not production reliability measurements, not statistical evidence, and not evidence of a full LeWorldModel implementation.

Claim Boundary: This page is a bounded simulation artifact and reports fixture-level heuristic scores only. It does not claim issue prediction, production or runtime superiority, statistical significance, or a full LeWorldModel implementation.

Lab v1 Summary

Grouped bar chart comparing baseline representation and expectation trace observer on failure detection, recovery hint quality, and diagnosis steps.

Fixture-level heuristic scores on 12 fixed Hermes-Agent issue / PR-linked inputs. For diagnosis steps, lower is better.


Detailed Dashboard Data

Structured tables below expose the fixed fixtures, observer traces, and aggregate metrics used by the paper-style summary above.

Overview

Fixture-Level Metrics

These charts visualize fixture-level heuristic scores, not production reliability measurements.

Per-Fixture Comparison

The comparison label is a fixture-level heuristic result, not a product benchmark.

Task | Category | Baseline Detection | Observer Detection | Baseline Hint Quality | Observer Hint Quality | Baseline Steps | Observer Steps | Max Surprise | Evidence
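As a rough illustration of how per-fixture rows like these could be rolled up into the summary metrics, the sketch below averages each column over a fixture set. The class name, field names, and the two toy rows are hypothetical and are not taken from the artifact's data; the hint-quality scale is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixtureScore:
    """One per-fixture comparison row (hypothetical field names)."""
    task: str
    category: str
    baseline_detected: bool
    observer_detected: bool
    baseline_hint_quality: float  # rubric scale is an assumption
    observer_hint_quality: float
    baseline_steps: int
    observer_steps: int


def _avg(values: list[float]) -> float:
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def aggregate(rows: list[FixtureScore]) -> dict[str, float]:
    """Average each per-fixture score over the whole fixture set."""
    if not rows:
        raise ValueError("at least one fixture row is required")
    return {
        "baseline_detection": _avg([float(r.baseline_detected) for r in rows]),
        "observer_detection": _avg([float(r.observer_detected) for r in rows]),
        "baseline_hint_quality": _avg([r.baseline_hint_quality for r in rows]),
        "observer_hint_quality": _avg([r.observer_hint_quality for r in rows]),
        "baseline_steps": _avg([float(r.baseline_steps) for r in rows]),
        "observer_steps": _avg([float(r.observer_steps) for r in rows]),
    }


# Toy rows for illustration only; these are NOT the artifact's fixtures.
toy_rows = [
    FixtureScore("toy-task-a", "toy-category", True, True, 1.0, 3.0, 4, 1),
    FixtureScore("toy-task-b", "toy-category", False, True, 0.0, 2.0, 3, 1),
]
summary = aggregate(toy_rows)
```

Averaging booleans as 0/1 gives a detection rate per representation, which is one plausible reading of how a score such as a fractional failure-detection value could arise; the artifact's actual rubric may differ.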

Issue Fixture Sources

Issues and PRs are fixed input fixtures. They are not forecast or discovered by the observer.

Fixture | Title | Category | Source Type | Evidence Quality | Expected Behavior | Observed Failure | Source
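One way to keep fixtures fixed once they are recorded is an immutable record with a completeness check. The sketch below mirrors the column names above; the class name, the helper, and every example value are hypothetical placeholders rather than real Hermes-Agent fixtures.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class IssueFixture:
    """A fixed input fixture (hypothetical schema mirroring the table columns)."""
    fixture_id: str
    title: str
    category: str
    source_type: str        # assumed vocabulary, e.g. "issue" or "pr"
    evidence_quality: str
    expected_behavior: str
    observed_failure: str
    source: str             # reference to the originating issue or PR


def missing_fields(fixture: IssueFixture) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name for f in fields(fixture)
        if not getattr(fixture, f.name).strip()
    ]


# Placeholder values for illustration only; not a real fixture.
example = IssueFixture(
    fixture_id="toy-fixture-1",
    title="placeholder title",
    category="placeholder-category",
    source_type="issue",
    evidence_quality="placeholder",
    expected_behavior="placeholder expected behavior",
    observed_failure="placeholder observed failure",
    source="placeholder source reference",
)

# replace() builds a modified copy; the frozen original is never mutated.
incomplete = replace(example, observed_failure="")
```

Here `missing_fields(example)` returns an empty list, while `missing_fields(incomplete)` reports `observed_failure`, so incomplete fixtures can be rejected before scoring.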

Observer Traces

Expected-vs-actual rows emitted only by the patched second node.

Trace | Task | Action | Expected | Actual | Surprise | Hint
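To make the expected-vs-actual idea concrete, the sketch below builds one trace row and attaches a heuristic surprise score. The scoring function is a stand-in chosen for illustration (one minus a string-similarity ratio from Python's `difflib`); it is not the artifact's scoring rule, and all names and example strings are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TraceRow:
    """One expected-vs-actual row (hypothetical schema mirroring the columns)."""
    trace_id: str
    task: str
    action: str
    expected: str
    actual: str
    surprise: float
    hint: str


def surprise_score(expected: str, actual: str) -> float:
    """Stand-in heuristic in [0, 1]: 0.0 for identical text, 1.0 for no overlap."""
    similarity = SequenceMatcher(None, expected, actual).ratio()
    return round(1.0 - similarity, 3)


def observe(trace_id: str, task: str, action: str,
            expected: str, actual: str, hint_if_surprised: str) -> TraceRow:
    """Compare expectation and outcome; attach the hint only on a mismatch."""
    score = surprise_score(expected, actual)
    hint = hint_if_surprised if score > 0.0 else ""
    return TraceRow(trace_id, task, action, expected, actual, score, hint)


# Illustrative values only; not taken from the artifact's traces.
row = observe(
    trace_id="toy-trace-1",
    task="toy-task-a",
    action="call_tool",
    expected="tool returns a result",
    actual="tool raises a timeout",
    hint_if_surprised="retry with a longer timeout or fall back",
)
```

A per-task maximum over such rows would give a value comparable in shape to the Max Surprise column, under the same assumptions.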

Limitations