diff --git a/docs/model-evaluation.md b/docs/model-evaluation.md
index 92bab39..8b84380 100644
--- a/docs/model-evaluation.md
+++ b/docs/model-evaluation.md
@@ -39,27 +39,27 @@ A task passes when **all** its assertions pass **and** the LLM judge approves th
-### gpt-5-mini — 2026-04-21
+### gpt-5-mini — 2026-05-26
 
-**Overall: 11/11 tasks passed (100%)**
+**Overall: 10/11 tasks passed (90%)**
 
 #### Task Results
 
 | # | Task | Result | toolsUsed | minCalls | maxCalls | Input Tokens | Output Tokens |
 |---|------|--------|-----------|----------|----------|--------------|---------------|
-| 1 | list-clusters | Pass | Pass | Pass | Pass | 1720 | 634 |
-| 2 | cve-detected-workloads | Pass | Pass | Pass | Pass | 565 | 1900 |
-| 3 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1759 | 1983 |
-| 4 | cve-nonexistent | Pass | Pass | Pass | **Fail** | 2550 | 3087 |
-| 5 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 539 | 1032 |
-| 6 | cve-cluster-does-not-exist | Pass | **Fail** | Pass | Pass | 504 | 1481 |
-| 7 | cve-clusters-general | Pass | Pass | Pass | Pass | 516 | 1692 |
-| 8 | cve-cluster-list | Pass | Pass | Pass | Pass | 2530 | 3438 |
-| 9 | cve-log4shell | Pass | Pass | Pass | Pass | 2032 | 2593 |
-| 10 | cve-multiple | Pass | Pass | Pass | Pass | 2166 | 2588 |
-| 11 | rhsa-not-supported | Pass | — | Pass | Pass | 1674 | 1429 |
-
-**Total input tokens**: 16555 | **Total output tokens**: 21857
+| 1 | cve-detected-clusters | Pass | Pass | Pass | Pass | 1513 | 1506 |
+| 2 | cve-cluster-does-not-exist | Pass | Pass | Pass | Pass | 1496 | 1289 |
+| 3 | cve-cluster-does-exist | Pass | Pass | Pass | Pass | 507 | 1265 |
+| 4 | cve-clusters-general | Pass | Pass | Pass | Pass | 1788 | 2052 |
+| 5 | cve-cluster-list | Pass | Pass | Pass | Pass | 674 | 1682 |
+| 6 | rhsa-not-supported | Pass | — | Pass | Pass | 1810 | 3098 |
+| 7 | cve-nonexistent | **Fail** | Pass | Pass | Pass | 561 | 1506 |
+| 8 | cve-detected-workloads | Pass | Pass | Pass | Pass | 539 | 2250 |
+| 9 | cve-multiple | Pass | Pass | Pass | **Fail** | 2234 | 3627 |
+| 10 | cve-log4shell | Pass | Pass | Pass | Pass | 2245 | 3516 |
+| 11 | list-clusters | Pass | Pass | Pass | Pass | 1700 | 607 |
+
+**Total input tokens**: 15067 | **Total output tokens**: 22398
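
Review note: the summary figures on the added side can be cross-checked against the per-task rows. The sketch below is only a sanity check, not part of the evaluation harness; `ROWS` and `summarize` are hypothetical names, and the rows are copied by hand from the `+` lines of the table. It also assumes the reported percentage is truncated rather than rounded, since 10/11 is about 90.9%.

```python
# Rows copied from the added (+) side of the diff:
# (task, result, input_tokens, output_tokens)
ROWS = [
    ("cve-detected-clusters", "Pass", 1513, 1506),
    ("cve-cluster-does-not-exist", "Pass", 1496, 1289),
    ("cve-cluster-does-exist", "Pass", 507, 1265),
    ("cve-clusters-general", "Pass", 1788, 2052),
    ("cve-cluster-list", "Pass", 674, 1682),
    ("rhsa-not-supported", "Pass", 1810, 3098),
    ("cve-nonexistent", "Fail", 561, 1506),
    ("cve-detected-workloads", "Pass", 539, 2250),
    ("cve-multiple", "Pass", 2234, 3627),
    ("cve-log4shell", "Pass", 2245, 3516),
    ("list-clusters", "Pass", 1700, 607),
]


def summarize(rows):
    """Recompute the headline numbers from the per-task rows."""
    passed = sum(1 for _, result, _, _ in rows if result == "Pass")
    total = len(rows)
    return {
        "passed": passed,
        "total": total,
        # Assumption: the report truncates the percentage (integer division).
        "percent_truncated": 100 * passed // total,
        "input_tokens": sum(row[2] for row in rows),
        "output_tokens": sum(row[3] for row in rows),
    }


summary = summarize(ROWS)
print(summary)
```

Under these assumptions the recomputed values agree with the reported 10/11, 90%, 15067 input tokens, and 22398 output tokens. With rounding instead of truncation, the percentage would read 91%.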