eval_model.py in public repo is not directly runnable and blocks baseline SR reproduction

The public repository does not currently allow straightforward reproduction
of the live success-rate (SR) baseline reported in the README. The evaluation
script contains bugs that prevent it from running as published, inconsistent
metric names, and an undocumented dependency on specific simulator versions.