Feature/eval with badcase by xujiayuan0205 · Pull Request #27 · TomTraining/TomTest

xujiayuan0205 · 2026-04-16T05:37:18Z

No description provided.

…x ToMi field mapping

- Add StructuredResult dataclass wrapping parsed output with raw_response and reasoning_content
- Add src/judge.py for LLM semantic judge fallback when structured extraction fails
- Extend runner.py with collect_badcases() and build_corrected_predictions()
- Add dual metrics (strict + judge-corrected) to all task run.py scripts
- Fix ToMi field mapping: Story.full_story/Question/Answer.Correct_Answer
- Fix reasoning_content capture: support both 'reasoning' and 'reasoning_content' field names
- Fix run_all.py subprocess PYTHONPATH for src module resolution
- Update SUMMARY.md with deepseek-chat and deepseek-r1 results
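The commit message names several pieces without showing them, so here is a minimal sketch of how they might fit together. It is not the code from this PR: the field types on `StructuredResult`, the helper names `extract_reasoning` and `dual_accuracy`, and the `judge` callable signature are all assumptions made for illustration. The sketch assumes that the judge is consulted only when structured extraction fails (represented here as `parsed is None`), which is how the commit message describes the fallback.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional, Sequence


@dataclass
class StructuredResult:
    """Hypothetical shape: the parsed answer plus the raw model output."""
    parsed: Optional[str]                     # None when structured extraction failed
    raw_response: str
    reasoning_content: Optional[str] = None


def extract_reasoning(message: Mapping[str, Any]) -> Optional[str]:
    """Return the reasoning text under either field name, or None if absent."""
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if value:
            return value
    return None


def dual_accuracy(
    results: Sequence[StructuredResult],
    golds: Sequence[str],
    judge: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Compute strict accuracy and judge-corrected accuracy.

    Strict: only successfully parsed answers that equal the gold label count.
    Judge-corrected: items whose extraction failed are re-scored by `judge`,
    a stand-in for an LLM call that compares raw_response with the gold label.
    """
    if len(results) != len(golds):
        raise ValueError("results and golds must have the same length")
    if not golds:
        return {"strict": 0.0, "judge_corrected": 0.0}

    strict_hits = 0
    corrected_hits = 0
    for result, gold in zip(results, golds):
        if result.parsed is not None:
            hit = int(result.parsed == gold)
            strict_hits += hit
            corrected_hits += hit
        else:
            # Extraction failed: fall back to the semantic judge.
            corrected_hits += int(bool(judge(result.raw_response, gold)))

    n = len(golds)
    return {"strict": strict_hits / n, "judge_corrected": corrected_hits / n}
```

To try it, pass a list of `StructuredResult` objects, the gold labels, and any callable that takes `(raw_response, gold)` and returns a boolean. A simple substring check is enough for a dry run; a real judge would call a model. Reporting both numbers keeps the strict metric comparable across runs while showing how much of the gap comes from formatting failures rather than wrong answers.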

xujiayuan0205 added 2 commits April 15, 2026 17:27

chore: stop tracking datasets; align ignores for experiment_configs/r…

9d2895d

…esults/scripts

xujiayuan0205 wants to merge 2 commits into main from feature/eval_with_badcase
