fix: resolve fast-unit-tests CI collection errors and stale config assertions #77

Open
lee101 wants to merge 522 commits into main from ci-fix/stock-prediction-fast-unit-tests-76

Conversation

@lee101

@lee101 lee101 commented Mar 27, 2026

Owner

Summary

The following issues were causing the `fast-unit-tests` CI job (Python 3.13) to fail on PR #76, most of them at collection time:

  • JAX import errors (`test_jax_losses.py`, `test_jax_policy.py`, `test_jax_trainer_wandboard.py`): All three failed with `ModuleNotFoundError: No module named 'jax'` / `'flax'`. These packages are not in `requirements-ci.txt`. Added `pytest.skip(allow_module_level=True)` guards so the tests are cleanly skipped when jax/flax are absent instead of crashing collection.

  • Missing `resolve_data_path` function (`test_train_crypto_lora_sweep.py`): `ImportError: cannot import name 'resolve_data_path' from 'scripts.train_crypto_lora_sweep'`. Added `resolve_data_path(symbol, data_root)` that checks both flat (`{root}/{symbol}.csv`) and stocks-subdirectory (`{root}/stocks/{symbol}.csv`) layouts, and updated `main()` to use it.

  • Stale config assertions (`test_120d_eval_scripts.py::test_deployed_config_values`): `DEPLOYED_CONFIG` in `scripts/run_120d_worksteal_eval.py` was updated (`dip_pct` 0.20→0.18, `profit_target_pct` 0.15→0.20, `stop_loss_pct` 0.10→0.15) but the test assertions were not kept in sync. Updated test to match actual deployed values.
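A minimal sketch of the module-level skip guard described in the first bullet. The helper names `missing_modules` and `skip_module_if_missing` are hypothetical and are not taken from the PR; only `pytest.skip(..., allow_module_level=True)` is confirmed by the description above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def skip_module_if_missing(names):
    """Skip the calling test module during collection if a dependency is absent."""
    missing = missing_modules(names)
    if missing:
        # Imported lazily so this helper has no hard dependency on pytest.
        import pytest

        pytest.skip(
            f"optional dependencies not installed: {', '.join(missing)}",
            allow_module_level=True,
        )


# Intended use at the top of a test module, before `import jax` / `import flax`:
# skip_module_if_missing(["jax", "flax"])
```

pytest also ships `pytest.importorskip("jax")`, which covers the single-module case in one line; the explicit guard is mainly useful when several optional packages should be reported together.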

Tests run

```
CI=1 FAST_CI=1 CPU_ONLY=1 python -m pytest -v
-m "unit and not slow and not model_required and not cuda_required"
--tb=short --maxfail=10 tests/
```

Result: 78 passed, 14 skipped, 3963 deselected (0 failures, 0 collection errors)

Also verified `tests/test_train_crypto_lora_sweep.py::test_resolve_data_path_supports_mixed_hourly_root` passes independently.
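For illustration, a sketch of what a two-layout lookup like `resolve_data_path(symbol, data_root)` could look like. The lookup order (flat first), the `FileNotFoundError` on a miss, and the symbols `BTCUSD` and `AAPL` are assumptions; the actual function in `scripts/train_crypto_lora_sweep.py` may differ.

```python
import tempfile
from pathlib import Path


def resolve_data_path(symbol, data_root):
    """Return the CSV path for `symbol`, trying the flat layout before `stocks/`."""
    root = Path(data_root)
    candidates = [root / f"{symbol}.csv", root / "stocks" / f"{symbol}.csv"]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    # Assumption: a missing file is an error; the real function may behave differently.
    raise FileNotFoundError(f"no CSV for {symbol!r} under {root}")


def demo():
    """Build a mixed root with one flat and one stocks/ file, then resolve both."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "stocks").mkdir()
        (root / "BTCUSD.csv").write_text("timestamp,close\n")
        (root / "stocks" / "AAPL.csv").write_text("timestamp,close\n")
        flat = resolve_data_path("BTCUSD", root).relative_to(root)
        nested = resolve_data_path("AAPL", root).relative_to(root)
        return flat.as_posix(), nested.as_posix()
```

Calling `demo()` resolves one symbol from each layout inside a temporary directory, which mirrors the "mixed root" case the test name refers to.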

🤖 Generated with Claude Code

lee101 and others added 30 commits March 22, 2026 20:46
feat: MKTD v3 — 20 intraday features (vol, morning_ret, vwap_dev, gap_open)
Exports pufferlib checkpoint (MLP/Residual/Transformer) to TorchScript
format for libtorch C API inference. Includes round-trip verification,
metadata JSON output, and logits-only wrapper for C trader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- policy_infer.cpp: libtorch C++ with extern C API, optional build
- export_torchscript.py: convert pufferlib checkpoints to TorchScript
- Makefile: libcurl + libtorch optional linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add generate_markdown_report() to profile_training.py: parses Chrome
  trace, speedscope flamegraph, timing.json, and gprof to produce
  profiles/report.md with throughput, kernel hotspots, and recommendations
- Add --quick (torch.profiler only, skip py-spy) and --report-only
  (skip profiling, regenerate report from existing files) flags
- Add tools/profile_report.py: standalone CLI for report generation
- Load Chrome trace once per report (single _load_trace_events call shared
  between kernel and memory parsers, eliminating duplicate JSON read)
- Fix _parse_chrome_trace_top_kernels return type to always tuple[list, float]
- Remove unused `import re`
- Add 17 tests covering all parsers, CLI flags, and report content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fuses obs normalization + first linear layer + ReLU into a single Triton
kernel, eliminating the intermediate obs_norm (B, OBS) tensor allocation.
Integrates with TradingPolicy._encode() when --obs-norm is active.

- pufferlib_market/kernels/fused_obs_encode.py: new CuTE-style kernel
  with CC-aware autotune configs (CC>=9 / CC==8 / CC<8)
- pufferlib_market/train.py: set_obs_norm_stats(), _encode() Path 1,
  training loop skips CPU normalize when fused path active
- pufferlib_market/bench_obs_encode.py: benchmark vs baseline
- tests/test_fused_obs_encode.py: 14 correctness/dtype/integration tests
- pufferlib_market/kernels/fused_mlp.py: H100 warp specialization note

Benchmark on RTX 5090 (CC 12.0): 1.55–1.71x speedup at stocks12 sizes
(OBS=209, H=1024); peak alloc drops from ~1144 KB to ~128 KB at B=64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: fused_obs_encode.py — CuTE-style fused obs normalization + linear + ReLU
F.linear fails when input and bias have different dtypes in PyTorch 2.x.
Cast bias to match weight dtype in the fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g fix

- autoresearch_rl.py: add --stocks12 convenience flag that uses stocks12
  data by default and runs the combined STOCK_EXPERIMENTS +
  H100_STOCK_EXPERIMENTS pool (excluding requires_gpu='h100' configs).
  Sets periods_per_year=252, fee_rate=0.001, holdout_eval_steps=90.
- train.py: add --early-stop-patience N flag (default 0=disabled) that
  stops training when ep_return does not improve by >=0.001 for N
  consecutive logging steps.
- h100_experiment_plan.md: document 90s vs 300s overfitting finding,
  update recommended command to time_budget=90, max_trials=500.
- scripts/alpaca_cli.py: fix typer 0.24+ compat (Annotated syntax for
  typer.Argument).
- tests: fix test_backout_logic.py stubs for typer and src.fixtures.
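
The `--early-stop-patience` behaviour described above could be sketched as follows. The class name `EarlyStopper` is hypothetical, and the choice to update the best value only on an improvement of at least 0.001 is an assumption, not necessarily what train.py does.

```python
class EarlyStopper:
    """Stop when ep_return fails to improve by min_delta for `patience` steps."""

    def __init__(self, patience, min_delta=0.001):
        self.patience = patience  # 0 disables early stopping
        self.min_delta = min_delta
        self.best = None
        self.stale_steps = 0

    def should_stop(self, ep_return):
        """Record one logging step and report whether training should stop."""
        if self.patience <= 0:
            return False
        if self.best is None or ep_return - self.best >= self.min_delta:
            # Meaningful improvement: remember it and reset the counter.
            self.best = ep_return
            self.stale_steps = 0
            return False
        self.stale_steps += 1
        return self.stale_steps >= self.patience
```

With `patience=2`, the returns 0.10, 0.1005, 0.1009 stop training on the third logging step, because neither later value beats 0.10 by at least 0.001.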

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Download yfinance split-adjusted data for stocks12 from 2019 (PLTR limits to 2020-09-30)
- Export stocks12_extended_train.bin: 1797 days (+38% vs original 1302)
- Export stocks11_{train,val}.bin: 2434 days without PLTR, 11 symbols
- Document eval_hours calibration: C env counts calendar days, use --eval-hours 130 for ~90 trading days
- Update H100 plan to use stocks12_extended data
- Add splits_audit_report.csv: 258 entries, 0 UNRECOGNIZED in stocks12 symbols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Scans all daily and hourly stock CSVs for unadjusted forward splits
- Fetches split history from yfinance in parallel (40 workers, ~700 symbols)
- Detects unadjusted splits via price-ratio check (tolerance 15%)
- Filters spin-off adjustments with MIN_SPLIT_FACTOR=1.9 threshold
- Fixes timezone bug: convert to UTC before normalize() to avoid 4h offset
- Deduplicates same-day rows before scanning for big drops (handles SPAC data)
- Auto-fixes CSVs: divides pre-split prices, multiplies pre-split volume
- Always backs up CSVs to .pre_split_backup before modifying
- Re-exports affected MKTD binaries (stocks12/stocks20 train+val)
- Applied 44 fixes: ANET, APH, CMG, COO, CTAS, DD, DECK, ETR, FAST, GOLD,
  GOOGL, ISRG, LRCX, MNST, NDAQ, NEE, NOW, NVO, ODFL, ORLY, PANW, SHOP,
  SHW, SMCI, SONY, SRE, TPL, TSCO, WMT, WSM (daily+hourly)
- 36 unit tests covering all helpers and edge cases
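
A rough sketch of the price-ratio check and the auto-fix listed above. Only the 15% tolerance, `MIN_SPLIT_FACTOR=1.9`, and the divide-prices / multiply-volume rule come from the commit message; the function names and the exact comparison are assumptions made for illustration.

```python
MIN_SPLIT_FACTOR = 1.9  # smaller factors are treated as spin-off adjustments
RATIO_TOLERANCE = 0.15  # allowed relative gap between price ratio and split factor


def looks_unadjusted(close_before, close_after, split_factor):
    """Return True if the price drop across the split date matches the split factor."""
    if split_factor < MIN_SPLIT_FACTOR or close_after <= 0:
        return False
    observed_ratio = close_before / close_after
    return abs(observed_ratio - split_factor) / split_factor <= RATIO_TOLERANCE


def adjust_pre_split(prices, volumes, split_factor):
    """Divide pre-split prices and multiply pre-split volumes by the split factor."""
    adjusted_prices = [price / split_factor for price in prices]
    adjusted_volumes = [volume * split_factor for volume in volumes]
    return adjusted_prices, adjusted_volumes
```

For a 4-for-1 split, a close of 100.0 followed by 25.5 gives a ratio of about 3.92, which is within 15% of 4 and would be flagged as unadjusted.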

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… missing winner configs

Dataset experiments (2026-03-22):
- stocks12_extended (1797d, 2020+) is WORSE than stocks12_daily (1302d, 2022+)
- Extra 2020-2021 COVID-era data hurts generalization on 2025-2026 val
- stocks11 (2434d, no PLTR) also worse — more data ≠ better for out-of-distribution
- Confirmed: stocks12_daily_train.bin is the right training set for H100

H100_STOCK_EXPERIMENTS additions:
- h100_rmu4424_style/wd005/slip8: h=256 variants from random_mut_4424 (0% neg, +7.3%)
- h100_h256_mut2272: h=256 with random_mut_2272 regularization
- h100_rmu1228_style/slip5/wd005: obs_norm=True variants from random_mut_1228 (0% neg, +6.8%)
- h100_mut2272_s4424, h100_rmu4424_s2272: cross-seed variants of top configs
- Pool now 141 configs (was 132); 100 random mutations still included

H100 final command updated to use stocks12_daily_train.bin

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…found

Local 50-trial sweep found stock_trade_pen_05 (just trade_penalty=0.05, all defaults)
is the best config seen so far: score=-3.5, 5% neg, +27.8% median, sortino=4.06.
This beats random_mut_2272 (score=-5.2, 0% neg, +10.7% median, sortino=2.22).

Added 8 H100 variants of trade_pen_05:
- h100_trade_pen_05 (exact match, plus seeds s123/s7/s42)
- h100_trade_pen_05_ent03, ent08 (entropy sweep)
- h100_trade_pen_05_wd005 (with weight decay)
- h100_trade_pen_05_anneal_ent (entropy annealing)

H100_STOCK_EXPERIMENTS: 149 configs (was 132)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… sweep SOTA

V2 50-trial sweep (stocks12_daily_train.bin, 90s/trial) found new SOTA:
- stock_drawdown_pen: drawdown_penalty=0.05, trade_penalty=0.03, NO training slippage
  → 0% negative windows, +22.9% median, +4.8% p10, Sortino=7.25, worst=+3.3%
  → score=+24.9 (beats random_mut_2272 at ~-5 and all previous configs)
- stock_trade_pen_05_s123: 0% negative, +16.6% median, +7.7% p10, score=+14.9

H100_STOCK_EXPERIMENTS expanded: 127 → 162 configs
  Added 13 h100_drawpen_* variants (seeds + drawdown/trade pen hyperparams)
  Added 8 h100_trade_pen_05_* variants (from previous session)
  Added 4 h100_rmu4424_* + 3 h100_rmu1228_* variants

Key finding: drawdown_penalty outperforms slippage training as regularizer.
Drawdown penalty forces policy to avoid equity dips → no reckless behavior on holdout.

Updated h100_experiment_plan.md:
  - New SOTA table (drawpen beats random_mut_2272 by 2x on median)
  - Revised deployment conditions (bar raised to match drawpen results)
  - Updated H100 pool summary (162 configs, 500 trials = 12.5h on H100)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…score scale

Bug: _quick_val_eval returns raw val_return (~0.09-0.20) but was compared against
best_trial_rank_score * 0.8 where rank_score = holdout_robust_score (~24.9).
This means threshold was 19.9 but val_return is never > 1.0 in normal cases,
so every trial after stock_drawdown_pen was automatically early-rejected.

Fix: track best_val_return separately (same scale as _quick_val_eval output)
and use that for the early rejection comparison instead of best_rank_score.
The new threshold ~0.073 (7.3%) is comparable to typical val_returns.
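
The scale mismatch can be reproduced with a small sketch. `should_early_reject` is a hypothetical stand-in for the comparison inside the sweep, and the best validation return of 0.091 is inferred from the ~0.073 threshold quoted above purely for illustration.

```python
def should_early_reject(val_return, best_reference, fraction=0.8):
    """Reject a trial whose quick validation return falls below fraction * best."""
    if best_reference is None:
        return False  # nothing to compare against yet
    return val_return < best_reference * fraction


# Buggy comparison: best_reference is a holdout rank score (~24.9), so the
# threshold is about 19.9 and a typical val_return of 0.15 is always rejected.
rejected_with_rank_score = should_early_reject(0.15, 24.9)

# Fixed comparison: best_reference is a val_return on the same scale (~0.091),
# so the threshold is about 0.073 and the same trial survives.
rejected_with_val_return = should_early_reject(0.15, 0.091)
```

Keeping the reference value on the same scale as the quantity being tested is the whole fix; the 0.8 factor itself is unchanged.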

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OT --h100-mode

Critical finding from A40 preview sweep:
- --h100-mode forces num_envs=256, minibatch_size=4096 → A40 trains 15.6M steps in 82s
  (cap hit before SIGTERM) → 5x more steps than stock_drawdown_pen discovery → OVERFITS
- ALL drawpen configs failed under h100-mode (early rejected, holdout -68 to -104)
- Real H100 would also hit 15.6M cap (in ~31s) → same overfitting

Fix: use --stocks12 --max-timesteps-per-sample 200 instead of --h100-mode
- Caps each trial at 3.1M steps (12 × 1302 × 200) regardless of GPU speed
- Matches stock_drawdown_pen discovery conditions (3.2M steps in 90s on A40)
- H100 trains 3.1M steps in ~9s, holdout ~30s → ~40s/trial → 500 trials ≈ 5.5h
- Default batch size (128 envs, 2048 minibatch) gives 94 PPO updates vs 47 with h100-mode
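
The step cap and sweep-time figures above follow from simple arithmetic. The helper names below are hypothetical and only restate the numbers in this commit message.

```python
def step_cap(num_symbols, num_days, max_timesteps_per_sample):
    """Per-trial step cap: symbols x training days x timesteps per sample."""
    return num_symbols * num_days * max_timesteps_per_sample


def sweep_hours(num_trials, seconds_per_trial):
    """Wall-clock hours for a sweep that runs trials one after another."""
    return num_trials * seconds_per_trial / 3600


cap = step_cap(12, 1302, 200)   # 12 symbols, 1302 days, 200 timesteps per sample
hours = sweep_hours(500, 40)    # 500 trials at roughly 40 s each
```

Here `cap` is 3,124,800 steps (the "3.1M" above) and `hours` is about 5.6, in line with the ≈ 5.5h estimate.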

Updated H100 recommended command in h100_experiment_plan.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key findings from standalone 13-variant drawpen verification sweep:
- ALL drawpen seed/param variants score -49 to -170 in holdout
- stock_drawdown_pen (+24.9) in v2 sweep was a ~2% lucky training run
- RL is non-deterministic; same config+seed gives wildly different results

H100 strategy revised:
- Increase max-trials from 500 to 1200 (diversity over depth)
- Early rejection is irrelevant for H100: training completes in ~9s
  before the 25% time check fires at 22.5s
- Target: realistic holdout improvements over random_mut_2272 baseline
- Expected: ~24+ positive-score configs from 1200 diverse trials

Also commit leaderboard CSVs:
- autoresearch_stocks12_v2_50trial.csv (50-trial v2 sweep, 2 positive)
- autoresearch_h100_drawpen_preview_v2.csv (partial, killed for early rejection bias)
- autoresearch_h100_drawpen_standalone.csv (13-variant verification, all negative)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tive

Full 13-variant sweep (seeds 7/42/123/999/2272, param variants tp02/tp05/dd02/dd10/ent03/wd005/slip5)
with early_reject_threshold=0.0 and correct 200x step cap:

Best: h100_drawpen_tp05 score=-37.7, neg=25%, median=+3.1%, p10=-2.4%
Most: scores -49 to -170, 20-100% negative windows

Confirms stock_drawdown_pen (+24.9, v2 sweep trial 20) was a ~2% lucky training run.
True hit rate for drawpen family: ~0/13 = 0% (unlucky batch) to ~1/50 = 2% at scale.

H100 strategy: run 1200 diverse trials, expect ~24-48 positive configs at 2-4% hit rate.
Do NOT specifically target drawpen — include in pool for coverage only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hrough)

Key finding: extending stocks12 training from 1302 days (2022-2025) to
1797 days (2020-2025) dramatically improves generalization on the hard
201-day val (Sep 2025 - Mar 2026, includes Nov 2025 - Feb 2026 bear market).

Results:
- Old training (1302d): 0/50 configs score positive on hard extended val
- New training (1797d): stock_trade_pen_03 scores +3.10 (seed 777) and
  -7.81 (seed 999) vs -102 with old data — first ever positive on hard val

Root cause: 2020-2021 data (COVID recovery + 2021 bull market) teaches the
model about market cycles and regime detection. Models trained from 2022 only
see one bear market and one recovery; they fail when encountering the 2025-2026
bear market. The extended data fixes this.

Changes:
- audit_stock_splits.py: add stocks12_daily_train_2019 config (2019-01-02 start,
  effective 2020-09-30 due to PLTR IPO, 1797 calendar days)
- h100_experiment_plan.md: v5 update with extended training breakthrough,
  corrects previous "extended data is worse" finding (that used old easy val),
  updates H100 command to use stocks12_daily_train_2019.bin,
  updates step cap to 4,312,800 (12*1797*200), updates hit rate expectation
  to 5-15% (vs 0% with old data)
- Add sweep result CSVs: extended_val_50trial, train2019_10trial, train2019_50trial

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trade_penalty=0.03 identified as sweet spot on hard 201-day val with
extended training data (2020-2025): scored +3.1 (seed 777) vs -102 with
old training data. Add seed/param variants to increase coverage:
- tp03_s7/s42/s123/s2272: seed sweep
- tp03_slip5/slip10: slippage friction variants
- tp03_wd01/wd05: weight decay variants
- tp03_obs: observation normalization
- tp03_ent03/annent: entropy variants
- tp03_h512/h2048: network size variants
- tp03_cosine: cosine LR schedule
- tp03_full_reg: combined regularization

Pool size: 253 total (95 STOCK + 158 non-GPU H100)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ariants findings

tp03 variants sweep (16 configs, seed 1337, extended 1797d training) results:
- tp03_s2272: -33.8 (best, seed 2272 is special for this config class)
- tp03_wd01: -39.4 (median=+5.6%, wd=0.01 helps)
- tp03_h2048: -50.0 (median=+6.0%, larger net benefits from 5yr data)
- tp03_slip5/slip10: -110 to -130 (AVOID: slippage training hurts bear market generalization)
- tp03_obs: -124 (AVOID: obs_norm hurts with trade_pen_03)

Add best-combo configs: tp03_s2272_wd01, tp03_h2048_wd01, tp03_s2272_h2048
Pool is now 98 STOCK + 158 non-GPU H100 = 256 total

Key rule: trade_pen_03 without slippage, without obs_norm, with wd=0.01 or h2048

Update H100 plan with full tp03 variants findings table and updated pool summary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mote pipeline

- autoresearch_rl.py: added 108 tp03 variants to STOCK_EXPERIMENTS:
  - tp03_s777 (KNOWN WINNER on hard 201-day bear val, robust=+3.10)
  - tp03_s{7,42,123,888,1111,2272,3141,4242,5678,7777,9999} for seed discovery
  - tp03_wd01_s{777,42,2272} + tp03_h2048_s{777,42,2272} (best modifiers x seeds)
  - tp03_seed_{1..50}: dense sequential seed sweep for H100 (expect ~17 positive)
  - tp03_wd01_seed_{1..25}: wd=0.01 modifier seeds for H100

- remote_training_pipeline.py: add max_timesteps_per_sample param to
  build_autoresearch_cmd() and build_remote_autoresearch_plan()

- launch_stocks_autoresearch_remote.py: add --max-timesteps-per-sample CLI arg
  (default 200, gives ~4.3M steps on 1797-day 2019 training data)

Key finding: previous tp03_variants sweep used --seed 1337 override which masked
all explicit per-config seeds. The actual tp03 hit rate at native seeds needs
testing via the tp03_multiseed sweep (no global override, early-reject disabled).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…m, cuDNN

Add torch.manual_seed(args.seed) + cuda.manual_seed_all + random.seed + numpy.seed
at training startup, plus cudnn.benchmark=False for cuDNN algorithm stability.

Previously only the C environment was seeded (via vec_init/vec_reset). Network
weight initialization was non-deterministic, causing large result variance even
with identical configs. Now each --seed value produces a reproducible training
trajectory, enabling systematic seed sweeps on local hardware before H100 runs.

Key implication: tp03_seed_{1..50} dense sweep will now give reproducible results
so we can identify which seeds work on the hard 201-day bear market val before
committing to expensive H100 time.
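
A sketch of the kind of seeding helper this commit describes. The function name `seed_everything` is hypothetical, and NumPy and PyTorch are treated as optional here so the snippet runs without them; the training script presumably imports both unconditionally.

```python
import random


def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch RNGs and disable cuDNN autotuning."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to seed
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call when CUDA is unavailable
    torch.backends.cudnn.benchmark = False  # avoid run-to-run algorithm selection
```

Calling `seed_everything(args.seed)` once at startup, before the policy network is constructed, is what makes weight initialization repeatable for a given seed.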

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep cap

Key bugs fixed:
1. --max-timesteps-per-sample default was 200 (4.3M steps) — models need 33M+ steps
   to converge. Changed to 10000 (effectively no cap; 300s wall-clock is binding).
2. --stocks12 flag was never passed to autoresearch_rl.py — remote runs used the
   default crypto EXPERIMENTS pool instead of STOCK_EXPERIMENTS.
3. --time-budget default changed from 300 to 90 for H100 (90s x 390k steps/sec
   = ~35M steps ≈ local A40 300s convergence point).

Root cause of recent 0/34 positive sweep: the 200-sample cap (4.3M steps) was
8x shorter than the ~33M steps needed for convergence (found in all winning models).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key finding: 200-sample cap (4.3M steps) was the root cause of 0/34 failures.
Winning models need 33-37M steps (300s on A40). Document correct H100 command:
time-budget=90 + no step cap = ~35M steps on H100 ≈ A40 300s convergence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (2015-2026)

Key changes:
1. STOCK_EXPERIMENTS: move tp03 dense seed sweep (75 configs) to new
   STOCK_TP03_SEED_EXPERIMENTS constant — all 75 are negative at 300s on
   extended 201-day bear market val and were blocking random mutations from
   being reached (random mutations were at index 157, now at index 82).

2. Expand random mutation slots: 30 → 300, enabling H100 500-trial sweeps
   with ~218 random mutation trials (after 82 named configs).

3. Add extend_stocks_history.py: downloads 2015-2019 historical data from
   yfinance for stocks11 (no PLTR) to extend training data 2x:
   - stocks12_daily_train_2019.bin: 12×1797 = 21,564 samples (from 2020-09-30)
   - stocks11_daily_train_2015.bin: 11×3895 = 42,845 samples (from 2015-01-02)
   Extended data includes: COVID crash (Mar 2020), 2018 Q4 correction,
   2015-2019 diverse regimes — critical for bear market generalization.

Running experiments to compare stocks12 vs stocks11-extended hit rates on
the 201-day hard val (Sep 2025–Mar 2026, includes Nov 2025–Feb 2026 bear).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y v7

- Expand random mutation slots: 300 → 450 (total pool: 532 = 82 named + 450 random)
  Supports 500-trial H100 runs with majority of trials as random mutations
- Update h100_experiment_plan.md with v7 final config:
  * Pool restructuring benefits documented
  * stocks11 extended (42,845 samples) as H100 alternative
  * Expected 17-21 positive models from 500-trial H100 run at 4-5% hit rate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace torch.no_grad() with torch.inference_mode() in cutechronos
validation and test functions. The main predict() methods already used
inference_mode; this completes the migration for the module.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…#61)

Add forecast_bias_weight to WorkStealConfig and forecast_data param to
run_worksteal_backtest. Positive forecasts boost candidate scores, negative
reduce them. Weight=0.0 (default) preserves identical behavior.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

Replace custom Triton attention with PyTorch SDPA (scale=1.0) as the
preferred CUDA backend. SDPA auto-selects FlashAttention2/cuDNN kernels
and is ~2x faster than the eager fallback on RTX 5090.

New module cutechronos/modules/flex_attention.py provides:
- sdpa_unscaled_attention: SDPA with scale=1.0 (recommended)
- flex_unscaled_attention: FlexAttention for mask-free case, SDPA fallback for masked
- eager_unscaled_attention: delegates to existing _fallbacks implementation
- Backend registry with benchmark_backends() and get_best_attention_backend()

Integration: FusedTimeSelfAttention and model.py now use SDPA on CUDA,
with Triton and eager as fallbacks for non-CUDA paths.

56 new tests, 195 total tests passing.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single kernel fuses residual-add and RMS LayerNorm, eliminating one
full read-write of hidden state per sub-layer (36 round-trips across
all encoder blocks). Provides both out-of-place and in-place variants
via compile-time INPLACE constexpr flag. 26 tests covering FP32/BF16,
2D/3D shapes, edge cases, and cross-variant consistency.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lee101 and others added 28 commits March 25, 2026 14:40
Updated top-10 5bps leaderboard (145 evaluated):
#4 s456: +8,802% (ultra-robust: 5bps > 8bps, Sortino=6.71)
#6 s452: +8,002% (ultra-robust: 5bps > 8bps, Sortino=6.65)
#7 s734: +7,160% (ultra-robust: 5bps > 8bps)
#10 s446: +6,536% (ultra-robust: 5bps > 8bps)
#15 s827: +4,801% (ultra-robust: 5bps > 8bps)

7 sweeps ongoing: s201-900 at 55-62% complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…56(+8802% ROBUST)

New absolute record: s275 at +23,595% ann, 5bps>8bps>pool (tri-consistent).
s456 enters top-5 at +8,802% ROBUST (5bps >> 8bps).
169 seeds now properly evaluated in 5bps leaderboard.
New seeds: s357(+1708% ROBUST), s359(+2998%), s437(+1112%), s751(+2135%),
s649(+1601%), s842(pending).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Leaderboard updated (169 evaluated at 5bps):
#1 s275: +23,595% (Sortino=9.0, ultra-robust: 5bps > 8bps)
#2 s240: +17,642%
#3 s434: +10,359%
#4 s71:  +9,381%
#5 s456: +8,802% (new)
#6 s507: +8,273%
#7 s452: +8,002% (new)

Top-10 mean: +10,796% ann | 7/10 ultra-robust (5bps >= 8bps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…42%), s845(+2662%), s760(+2385%)

Batch eval of 103 seeds completes 5bps coverage. Notable new ROBUST seeds:
- s467: +3242% ann, sortino=6.10 (s401-500)
- s845: +2662% ann, sortino=4.71 (s801-900)
- s760: +2385% ann, sortino=5.23 (s701-800)
- s279: +2127% ROBUST (fixed: 5bps=3.62 >> 8bps=2.63)
- s210: +4461% ROBUST confirmed
- s209: +3091% ROBUST confirmed
- s904: +986% ROBUST (s901-1000 not all bad!)
Also fixed s277/s279 swapped entries from parallel eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0(+991%)

New high-value ROBUST finds:
- s658: +1837% ann (s601-700), 5bps=3.31 > 8bps=3.21
- s279: +2127% ann (s201-300) ROBUST, corrected from earlier swap
- s277: +1230% ann (s201-300) ROBUST
- s660: +991% ann (s601-700) ROBUST
- s567: +776% ann, s470: +1028% ann (overfitters)
Also: s564(+1261%), s552(+665% ROBUST), s465(+855% overfitter)
196 seeds evaluated, 85 ROBUST confirmed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ew finds

Extraordinary s901-1000 discoveries:
- s921: 5bps=6.86, +6386% ann, sortino=5.60, ROBUST! (pool=5.76 -> honest=6.66 -> 5bps=6.86)
- s914: +1839% OVERFITTER, s915: +1414% ROBUST, s920: +680% OVERFITTER

s801-900:
- s850: 5bps=5.86, +4869% ann, sortino=5.99, ROBUST! (pool=5.12 -> 5bps=5.86)

Other new ROBUST seeds: s658(+1837%), s660(+991%), s279(+2127%), s284(+894%)
Total: 206 seeds in 5bps leaderboard, 81 ROBUST. Sweeps ~65-87% complete per range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s discovered

Top seeds by 5bps annualized return:
- s275: +23,595% (Sortino=9.0, ultra-robust) — all-time champion
- s240: +17,642% (Sortino=7.0)
- s434: +10,359% (Sortino=6.99)
- s71:  +9,381%  (Sortino=8.29)
- s456: +8,802%  (Sortino=6.71, ultra-robust)

New champions this session: s921 (+6,386%, ultra-robust), s850 (+4,869%)

Coverage: s61-120 ✓, s121-200 ✓, others 50-80% complete
Auto 5bps monitor running continuously, all seeds >800% evaluated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(+7647%), s578(+5688%), s1203(+3619%)

Key new finds (all ROBUST):
- s292: +19,815% ann, S=7.99 — new #2 ROBUST champion
- s765: +7,647% ann, S=5.76 — new #7 ROBUST
- s578: +5,688% ann, S=8.14 — new #18 ROBUST (high sortino)
- s1203: +3,619% ann, S=6.43 — new from s1201-1300 range
- s1202: +4,121% ann, S=5.65 — new from s1201-1300 range
- s1206: +1,012% ann ROBUST, s1005: +854% ann ROBUST

New ranges discovered: s1001-1100 (103 seeds), s1101-1200 (6 seeds), s1201-1300 (14 seeds)
5bps auto-monitor updated to cover all ranges up to s1300+

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Champion leaderboard (top 3 by 5bps annualized):
- s670: +29,099% (Sortino=7.56, p50=15.45x/180d) — NEW ALL-TIME CHAMPION
- s275: +23,595% (Sortino=9.00, ultra-robust)
- s292: +20,000% (Sortino=7.99, ultra-robust)

233 seeds evaluated at 5bps; sweep ~70% complete.
Updated prod.md with comprehensive top-10 leaderboard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ents, unified hourly fixes

- src/robust_trading_metrics.py: new robust trading metrics module
- scripts/evaluate_binance_lora_candidate.py: binance lora candidate evaluator
- scripts/run_binance_crypto_lora_sweep.py: expanded binance crypto lora sweep
- pufferlib_market/autoresearch_rl.py: improved autoresearch with gpu pool support
- pufferlib_market/gpu_pool_rl.py: gpu pool RL training
- pufferlib_market/replay_eval.py: improved replay evaluation
- unified_hourly_experiment/trade_unified_hourly.py: hourly trading improvements
- tests: comprehensive test coverage additions
- alpacaprogress6.md: alpaca progress notes
- leaderboard CSVs: mixed23 sweep results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…neal_ent + stocks20

- scripts/stocks_deep_sweep.sh: 5-phase sweep covering:
  - Phase A: stocks12 tp05 s100-299 @ 35M steps
  - Phase B: stocks20 tp05 s1-80 @ 35M steps
  - Phase C: stocks12 tp03 s1-60 @ 35M steps
  - Phase D: stocks12 tp07 s1-60 @ 35M steps
  - Phase E: stocks12 anneal_ent tp05 s1-60 @ 35M steps
- Inspired by crypto70: need 200+ seeds to find champions
- pufferlib_market/stocks12_seed_sweep_leaderboard.csv: s51-87 results at 15M steps
  - s55 (med=9.38%, 5/50 neg) best of first batch — retraining at 35M
- Disk cleanup: freed 110GB by removing non-champion old checkpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace manual `processed_batches += 1` counter inside the for loop with
`enumerate(loader, start=1)` as required by ruff SIM113. This fixes the
failing CI lint job (Fast CI / lint).
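
The SIM113 pattern looks roughly like this. `run_epoch` and the list-based `loader` are hypothetical stand-ins for the real training loop.

```python
def run_epoch(loader):
    """Count batches with enumerate instead of a manually incremented counter."""
    processed_batches = 0  # stays 0 if the loader yields nothing
    total_items = 0
    for processed_batches, batch in enumerate(loader, start=1):
        total_items += len(batch)  # placeholder for the real per-batch work
    return processed_batches, total_items
```

Initializing `processed_batches` before the loop keeps the name defined when the loader is empty, which the manual counter did implicitly.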

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Portfolio combiner: per-symbol models -> diversified portfolio with
equal/inverse_vol/sqrt_sortino allocation. sqrt_sortino winner:
+40.69% med ret, Sort=5.21, -0.87% DD, 100% positive (15 symbols, 10x30d).
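
One plausible reading of the three allocation schemes, sketched under stated assumptions: `inverse_vol` weights are taken as proportional to 1/volatility and `sqrt_sortino` weights as proportional to the square root of each symbol's Sortino ratio, floored at zero. The function name and the input layout are hypothetical, and the combiner's actual formulas are not shown in this commit.

```python
import math


def allocation_weights(stats, scheme):
    """Return normalized weights per symbol from {"vol": ..., "sortino": ...} stats."""
    if scheme == "equal":
        raw = {symbol: 1.0 for symbol in stats}
    elif scheme == "inverse_vol":
        raw = {symbol: 1.0 / m["vol"] for symbol, m in stats.items()}
    elif scheme == "sqrt_sortino":
        # Assumption: negative Sortino ratios receive zero weight.
        raw = {symbol: math.sqrt(max(m["sortino"], 0.0)) for symbol, m in stats.items()}
    else:
        raise ValueError(f"unknown allocation scheme: {scheme}")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("no symbol has a positive raw weight")
    return {symbol: weight / total for symbol, weight in raw.items()}
```

Under these assumptions, two symbols with Sortino ratios 4.0 and 1.0 would receive weights of 2/3 and 1/3 with `sqrt_sortino`.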

New research features in trainer:
- Spectral regularization (penalize max singular value)
- Multi-period loss (train on sub-windows for horizon diversity)
- WARP weight averaging (tested: hurts vs best single checkpoint)

R5 experiment configs: spectral, multiperiod, combos, champion tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- scripts/finetune_alpaca_symbols.sh: fine-tune Chronos2 LoRA for all 18 Alpaca live symbols
  using proven differencing preaug (33.7% MAE improvement on QUBT). Promotes any symbol
  improving >5% over baseline.
- scripts/stocks_extended_sweep.sh: Phase F sweep — stocks12 extended (7.1yr) training data,
  evaluating on standard val for fair comparison. Runs after deep sweep completes.
- e2etraining smoke test confirmed working (2026-03-27)
- Key finding: stocks seeds overfit at 35M steps — s55 fell from 9.38% to 0.48% med.
  Short scan (15M) + selective 35M retrain is the right strategy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sertions

The following issues were causing the fast-unit-tests CI job to fail, most of them at collection time:

1. tests/test_jax_losses.py, test_jax_policy.py, test_jax_trainer_wandboard.py:
   All three fail with ModuleNotFoundError for jax/flax which are not in
   requirements-ci.txt. Added pytest.skip(allow_module_level=True) guards so
   tests are cleanly skipped when jax/flax are absent rather than erroring.

2. tests/test_train_crypto_lora_sweep.py:
   ImportError for resolve_data_path which was missing from
   scripts/train_crypto_lora_sweep.py. Added resolve_data_path() that checks
   both {root}/{symbol}.csv (flat) and {root}/stocks/{symbol}.csv (sub-dir)
   layouts, and updated main() to use it.

3. tests/test_120d_eval_scripts.py::test_deployed_config_values:
   DEPLOYED_CONFIG in scripts/run_120d_worksteal_eval.py was updated (dip_pct
   0.20->0.18, profit_target_pct 0.15->0.20, stop_loss_pct 0.10->0.15) but
   the test assertions were not kept in sync. Updated test to match actual
   deployed values.
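
A sketch of the kind of assertion involved in the third item. The `DEPLOYED_CONFIG` dict below is a stand-in that contains only the three values named in this message; the real config in scripts/run_120d_worksteal_eval.py has other fields, and `mismatched_keys` is a hypothetical helper.

```python
import math

# Stand-in for the real DEPLOYED_CONFIG; only these three values are known here.
DEPLOYED_CONFIG = {"dip_pct": 0.18, "profit_target_pct": 0.20, "stop_loss_pct": 0.15}

EXPECTED = {"dip_pct": 0.18, "profit_target_pct": 0.20, "stop_loss_pct": 0.15}


def mismatched_keys(actual, expected):
    """Return the sorted keys whose values differ between actual and expected."""
    return sorted(
        key for key, value in expected.items()
        if not math.isclose(actual[key], value)
    )


def test_deployed_config_values():
    """Fail with the list of out-of-sync keys rather than on the first mismatch."""
    assert mismatched_keys(DEPLOYED_CONFIG, EXPECTED) == []
```

Reporting every stale key at once makes it easier to see that a whole config revision was missed, as happened here with all three values.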

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lee101

lee101 commented Mar 27, 2026

Owner Author

Codex Infinity
Hi! I'm Codex Infinity, your coding agent for this repo.

Start a task on this PR's branch by commenting:

Tasks and logs: https://codex-infinity.com
