fix: resolve CI fast-unit-tests collection errors (jax skip + resolve_data_path) #76

Open
lee101 wants to merge 516 commits into main from ci-fix/stock-prediction-unit-tests-fix

Conversation

lee101 (Owner) commented Mar 27, 2026

Summary

Fixes the failing Fast CI (GitHub Runners) / fast-unit-tests job that was exiting with code 2 (interrupted) before any unit tests ran.

Root cause: pytest collection errors were accumulating and triggering --maxfail=10 before any @pytest.mark.unit tests could execute.

Two collection errors were fixed:

  • tests/test_train_crypto_lora_sweep.py: imported resolve_data_path from scripts.train_crypto_lora_sweep at module level, but the function didn't exist → ImportError during collection. Added the missing resolve_data_path(symbol, data_root) function that searches stocks/ and crypto/ subdirectories before falling back to the root.

  • tests/test_jax_losses.py, tests/test_jax_policy.py, tests/test_jax_trainer_wandboard.py: these import from binanceneural.jax_* modules that require jax/flax, which are not in requirements-ci.txt. Added skip logic in pytest_ignore_collect (matching the existing pattern for pufferlib) so these files are skipped when jax is not installed.
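A minimal sketch of the lookup order described for `resolve_data_path(symbol, data_root)`: try `stocks/`, then `crypto/`, then fall back to the root. The `<symbol>.csv` file naming and the choice not to check existence on the fallback path are assumptions for illustration; the actual function in `scripts/train_crypto_lora_sweep.py` may differ.

```python
from pathlib import Path
from typing import Union


def resolve_data_path(symbol: str, data_root: Union[str, Path]) -> Path:
    """Return the data file for `symbol`, preferring stocks/ and crypto/ subdirs."""
    root = Path(data_root)
    filename = f"{symbol}.csv"  # assumed naming convention, not confirmed by the PR
    for subdir in ("stocks", "crypto"):
        candidate = root / subdir / filename
        if candidate.exists():
            return candidate
    # Fall back to the root directory; existence is left to the caller (assumption).
    return root / filename
```

With a mixed root that holds both `stocks/` and `crypto/`, a crypto symbol resolves to `crypto/<symbol>.csv` and an unknown symbol resolves to `<data_root>/<symbol>.csv`.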

Test plan

  • tests/test_train_crypto_lora_sweep.py::test_resolve_data_path_supports_mixed_hourly_root passes
  • All 86 unit tests (-m "unit and not slow and not model_required and not cuda_required") pass locally
  • Ruff lint passes on the neuraldailytraining targets checked by CI
  • No new ruff issues introduced in the modified files (lint CI only checks specific neuraldailytraining targets, not scripts/ or tests/conftest.py)

🤖 Generated with Claude Code

lee101 and others added 30 commits March 22, 2026 20:45
New experiments: robust_reg_tp005_ent at seeds 42/7/123,
h1536_robust_ent, h2048_robust_ent for A40 sweep.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ctrader sim parity verification
… --multi-period-eval to autoresearch_rl

- evaluate_fast.py: add multi_period_eval() that evaluates across multiple
  window sizes (default 5/15/30/60/90 days) and returns a smoothness_score
  (weighted avg of p10_sortino, shorter windows weighted more to penalise
  single-spike wins). Module-level _SMOOTHNESS_WEIGHTS constant (5=3,15=2,
  30=2,60=1,90=1). CLI gains --multi-windows and --n-windows-per-size flags.

- autoresearch_rl.py: add --multi-period-eval flag to run_trial; when set,
  calls multi_period_eval() in-process after training and writes smooth_score
  + per-window p10_sortino columns to the leaderboard CSV. smooth_score is
  now the top-priority rank metric in select_rank_score(). Also adds
  --multi-period-windows, --multi-period-n-per-size, --multi-period-slippage-bps
  flags and smooth_score as a valid --rank-metric choice.

- tests/test_multi_period_eval.py: 7 tests covering signature, defaults,
  CLI flag presence, error-path behaviour, and autoresearch help output.
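
The smoothness score described above can be sketched as a weighted average of per-window p10 Sortino values, using the weights quoted in the commit message (5=3, 15=2, 30=2, 60=1, 90=1). The function name, the input shape, and the default weight of 1 for unlisted window sizes are assumptions; the real `multi_period_eval()` may compute this differently.

```python
from typing import Dict

# Weights as quoted in the commit message: shorter windows count more.
_SMOOTHNESS_WEIGHTS: Dict[int, float] = {5: 3.0, 15: 2.0, 30: 2.0, 60: 1.0, 90: 1.0}


def smoothness_score(p10_sortino_by_window: Dict[int, float]) -> float:
    """Weighted average of p10 Sortino keyed by window size in days."""
    weighted_sum = 0.0
    total_weight = 0.0
    for window_days, p10_sortino in p10_sortino_by_window.items():
        # Assumption: window sizes without a listed weight get weight 1.
        weight = _SMOOTHNESS_WEIGHTS.get(window_days, 1.0)
        weighted_sum += weight * p10_sortino
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no windows supplied")
    return weighted_sum / total_weight
```

Because the 5-day window carries three times the weight of the 90-day window, a policy that only looks good over long windows scores lower than one that is steady at short horizons.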

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement Binance REST API client in C with libcurl
Introduces src/split_monitor.py to detect recent forward splits via
yfinance and force-close any held positions before the policy observes
distorted price data (stale pre-split entry_price causes fake losses).

Integrates into execute_stock_signals in the unified orchestrator:
checks held symbols once per cycle, logs any split event to
logs/split_events.log, and adds affected symbols to trail_exit_syms so
no new orders are placed on them in the same cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: MKTD v3 — 20 intraday features (vol, morning_ret, vwap_dev, gap_open)
Exports pufferlib checkpoint (MLP/Residual/Transformer) to TorchScript
format for libtorch C API inference. Includes round-trip verification,
metadata JSON output, and logits-only wrapper for C trader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- policy_infer.cpp: libtorch C++ with extern C API, optional build
- export_torchscript.py: convert pufferlib checkpoints to TorchScript
- Makefile: libcurl + libtorch optional linking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add generate_markdown_report() to profile_training.py: parses Chrome
  trace, speedscope flamegraph, timing.json, and gprof to produce
  profiles/report.md with throughput, kernel hotspots, and recommendations
- Add --quick (torch.profiler only, skip py-spy) and --report-only
  (skip profiling, regenerate report from existing files) flags
- Add tools/profile_report.py: standalone CLI for report generation
- Load Chrome trace once per report (single _load_trace_events call shared
  between kernel and memory parsers, eliminating duplicate JSON read)
- Fix _parse_chrome_trace_top_kernels return type to always tuple[list, float]
- Remove unused `import re`
- Add 17 tests covering all parsers, CLI flags, and report content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fuses obs normalization + first linear layer + ReLU into a single Triton
kernel, eliminating the intermediate obs_norm (B, OBS) tensor allocation.
Integrates with TradingPolicy._encode() when --obs-norm is active.

- pufferlib_market/kernels/fused_obs_encode.py: new CuTE-style kernel
  with CC-aware autotune configs (CC>=9 / CC==8 / CC<8)
- pufferlib_market/train.py: set_obs_norm_stats(), _encode() Path 1,
  training loop skips CPU normalize when fused path active
- pufferlib_market/bench_obs_encode.py: benchmark vs baseline
- tests/test_fused_obs_encode.py: 14 correctness/dtype/integration tests
- pufferlib_market/kernels/fused_mlp.py: H100 warp specialization note

Benchmark on RTX 5090 (CC 12.0): 1.55–1.71x speedup at stocks12 sizes
(OBS=209, H=1024); peak alloc drops from ~1144 KB to ~128 KB at B=64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: fused_obs_encode.py — CuTE-style fused obs normalization + linear + ReLU
F.linear fails when input and bias have different dtypes in PyTorch 2.x.
Cast bias to match weight dtype in the fallback path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g fix

- autoresearch_rl.py: add --stocks12 convenience flag that uses stocks12
  data by default and runs the combined STOCK_EXPERIMENTS +
  H100_STOCK_EXPERIMENTS pool (excluding requires_gpu='h100' configs).
  Sets periods_per_year=252, fee_rate=0.001, holdout_eval_steps=90.
- train.py: add --early-stop-patience N flag (default 0=disabled) that
  stops training when ep_return does not improve by >=0.001 for N
  consecutive logging steps.
- h100_experiment_plan.md: document 90s vs 300s overfitting finding,
  update recommended command to time_budget=90, max_trials=500.
- scripts/alpaca_cli.py: fix typer 0.24+ compat (Annotated syntax for
  typer.Argument).
- tests: fix test_backout_logic.py stubs for typer and src.fixtures.
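
The `--early-stop-patience N` rule (stop when `ep_return` fails to improve by at least 0.001 for N consecutive logging steps, with 0 meaning disabled) could look roughly like the sketch below. The class name and structure are hypothetical; only the threshold and the "0 = disabled" default come from the commit message.

```python
class EarlyStopper:
    """Patience-based early stopping on ep_return (illustrative sketch)."""

    def __init__(self, patience: int, min_delta: float = 0.001) -> None:
        self.patience = patience      # 0 disables early stopping
        self.min_delta = min_delta    # required improvement per logging step
        self.best = float("-inf")
        self.stale_steps = 0

    def update(self, ep_return: float) -> bool:
        """Record one logging step; return True when training should stop."""
        if self.patience <= 0:
            return False
        if ep_return >= self.best + self.min_delta:
            self.best = ep_return
            self.stale_steps = 0
        else:
            self.stale_steps += 1
        return self.stale_steps >= self.patience
```

Calling `update()` once per logging step keeps the check independent of wall-clock time, which matches the "consecutive logging steps" wording.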

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Download yfinance split-adjusted data for stocks12 from 2019 (PLTR limits to 2020-09-30)
- Export stocks12_extended_train.bin: 1797 days (+38% vs original 1302)
- Export stocks11_{train,val}.bin: 2434 days without PLTR, 11 symbols
- Document eval_hours calibration: C env counts calendar days, use --eval-hours 130 for ~90 trading days
- Update H100 plan to use stocks12_extended data
- Add splits_audit_report.csv: 258 entries, 0 UNRECOGNIZED in stocks12 symbols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Scans all daily and hourly stock CSVs for unadjusted forward splits
- Fetches split history from yfinance in parallel (40 workers, ~700 symbols)
- Detects unadjusted splits via price-ratio check (tolerance 15%)
- Filters spin-off adjustments with MIN_SPLIT_FACTOR=1.9 threshold
- Fixes timezone bug: convert to UTC before normalize() to avoid 4h offset
- Deduplicates same-day rows before scanning for big drops (handles SPAC data)
- Auto-fixes CSVs: divides pre-split prices, multiplies pre-split volume
- Always backs up CSVs to .pre_split_backup before modifying
- Re-exports affected MKTD binaries (stocks12/stocks20 train+val)
- Applied 44 fixes: ANET, APH, CMG, COO, CTAS, DD, DECK, ETR, FAST, GOLD,
  GOOGL, ISRG, LRCX, MNST, NDAQ, NEE, NOW, NVO, ODFL, ORLY, PANW, SHOP,
  SHW, SMCI, SONY, SRE, TPL, TSCO, WMT, WSM (daily+hourly)
- 36 unit tests covering all helpers and edge cases
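
A rough sketch of the price-ratio check and the CSV fix described above. The 15% tolerance and `MIN_SPLIT_FACTOR=1.9` come from the commit message; the function names, the use of consecutive closes, and the exact comparison are assumptions rather than the audit script's actual code.

```python
from typing import Tuple

MIN_SPLIT_FACTOR = 1.9   # smaller factors are treated as spin-off adjustments
RATIO_TOLERANCE = 0.15   # allowed relative deviation from the reported factor


def is_unadjusted_split(prev_close: float, next_close: float, split_factor: float) -> bool:
    """True if the price drop across the split date matches the reported factor."""
    if split_factor < MIN_SPLIT_FACTOR:
        return False
    if prev_close <= 0 or next_close <= 0:
        return False
    observed_ratio = prev_close / next_close
    return abs(observed_ratio - split_factor) / split_factor <= RATIO_TOLERANCE


def adjust_pre_split_row(price: float, volume: float, split_factor: float) -> Tuple[float, float]:
    """Divide a pre-split price and multiply pre-split volume by the factor."""
    return price / split_factor, volume * split_factor
```

Data that is already split-adjusted shows a price ratio near 1 across the split date, so it fails the check and is left untouched.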

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… missing winner configs

Dataset experiments (2026-03-22):
- stocks12_extended (1797d, 2020+) is WORSE than stocks12_daily (1302d, 2022+)
- Extra 2020-2021 COVID-era data hurts generalization on 2025-2026 val
- stocks11 (2434d, no PLTR) also worse — more data ≠ better for out-of-distribution
- Confirmed: stocks12_daily_train.bin is the right training set for H100

H100_STOCK_EXPERIMENTS additions:
- h100_rmu4424_style/wd005/slip8: h=256 variants from random_mut_4424 (0% neg, +7.3%)
- h100_h256_mut2272: h=256 with random_mut_2272 regularization
- h100_rmu1228_style/slip5/wd005: obs_norm=True variants from random_mut_1228 (0% neg, +6.8%)
- h100_mut2272_s4424, h100_rmu4424_s2272: cross-seed variants of top configs
- Pool now 141 configs (was 132); 100 random mutations still included

H100 final command updated to use stocks12_daily_train.bin

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…found

Local 50-trial sweep found stock_trade_pen_05 (just trade_penalty=0.05, all defaults)
is the best config seen so far: score=-3.5, 5% neg, +27.8% median, sortino=4.06.
This beats random_mut_2272 (score=-5.2, 0% neg, +10.7% median, sortino=2.22).

Added 8 H100 variants of trade_pen_05:
- h100_trade_pen_05 (exact match, plus seeds s123/s7/s42)
- h100_trade_pen_05_ent03, ent08 (entropy sweep)
- h100_trade_pen_05_wd005 (with weight decay)
- h100_trade_pen_05_anneal_ent (entropy annealing)

H100_STOCK_EXPERIMENTS: 149 configs (was 132)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… sweep SOTA

V2 50-trial sweep (stocks12_daily_train.bin, 90s/trial) found new SOTA:
- stock_drawdown_pen: drawdown_penalty=0.05, trade_penalty=0.03, NO training slippage
  → 0% negative windows, +22.9% median, +4.8% p10, Sortino=7.25, worst=+3.3%
  → score=+24.9 (beats random_mut_2272 at ~-5 and all previous configs)
- stock_trade_pen_05_s123: 0% negative, +16.6% median, +7.7% p10, score=+14.9

H100_STOCK_EXPERIMENTS expanded: 127 → 162 configs
  Added 13 h100_drawpen_* variants (seeds + drawdown/trade pen hyperparams)
  Added 8 h100_trade_pen_05_* variants (from previous session)
  Added 4 h100_rmu4424_* + 3 h100_rmu1228_* variants

Key finding: drawdown_penalty outperforms slippage training as regularizer.
Drawdown penalty forces policy to avoid equity dips → no reckless behavior on holdout.

Updated h100_experiment_plan.md:
  - New SOTA table (drawpen beats random_mut_2272 by 2x on median)
  - Revised deployment conditions (bar raised to match drawpen results)
  - Updated H100 pool summary (162 configs, 500 trials = 12.5h on H100)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…score scale

Bug: _quick_val_eval returns raw val_return (~0.09-0.20) but was compared against
best_trial_rank_score * 0.8 where rank_score = holdout_robust_score (~24.9).
This means threshold was 19.9 but val_return is never > 1.0 in normal cases,
so every trial after stock_drawdown_pen was automatically early-rejected.

Fix: track best_val_return separately (same scale as _quick_val_eval output)
and use that for the early rejection comparison instead of best_rank_score.
The new threshold ~0.073 (7.3%) is comparable to typical val_returns.
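
To make the scale mismatch concrete, here is a small sketch of the early-rejection comparison. The 0.8 factor and the magnitudes (rank score ~24.9, val_return ~0.09-0.20) come from the description above; the function name, the `<` comparison, and the sample values 0.15 and 0.091 are illustrative assumptions.

```python
from typing import Optional

REJECT_FRACTION = 0.8  # a trial must reach 80% of the best reference value


def should_early_reject(val_return: float, best_reference: Optional[float]) -> bool:
    """Reject a trial whose quick val_return is far below the best seen so far."""
    if best_reference is None:
        return False  # nothing to compare against yet
    return val_return < best_reference * REJECT_FRACTION


# Buggy comparison: reference is a holdout_robust_score on a different scale.
rejected_with_bug = should_early_reject(0.15, 24.9)    # threshold 19.92
# Fixed comparison: reference is the best val_return seen so far.
rejected_with_fix = should_early_reject(0.15, 0.091)   # threshold ~0.073
```

With the rank score as the reference, no realistic val_return can clear the threshold; with `best_val_return` both sides are on the same scale.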

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OT --h100-mode

Critical finding from A40 preview sweep:
- --h100-mode forces num_envs=256, minibatch_size=4096 → A40 trains 15.6M steps in 82s
  (cap hit before SIGTERM) → 5x more steps than stock_drawdown_pen discovery → OVERFITS
- ALL drawpen configs failed under h100-mode (early rejected, holdout -68 to -104)
- Real H100 would also hit 15.6M cap (in ~31s) → same overfitting

Fix: use --stocks12 --max-timesteps-per-sample 200 instead of --h100-mode
- Caps each trial at 3.1M steps (12 × 1302 × 200) regardless of GPU speed
- Matches stock_drawdown_pen discovery conditions (3.2M steps in 90s on A40)
- H100 trains 3.1M steps in ~9s, holdout ~30s → ~40s/trial → 500 trials ≈ 5.5h
- Default batch size (128 envs, 2048 minibatch) gives 94 PPO updates vs 47 with h100-mode
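
The step cap arithmetic above can be checked directly. Reading the three factors as symbols × training days × `--max-timesteps-per-sample` is an interpretation of "12 × 1302 × 200"; the helper name is hypothetical.

```python
def step_cap(num_symbols: int, num_days: int, max_timesteps_per_sample: int) -> int:
    """Total training steps allowed when each sample is visited a fixed number of times."""
    return num_symbols * num_days * max_timesteps_per_sample


stocks12_cap = step_cap(12, 1302, 200)            # 3,124,800 steps (~3.1M)
stocks12_extended_cap = step_cap(12, 1797, 200)   # 4,312,800 steps (~4.3M)
```

Because the cap depends only on dataset size, a faster GPU finishes sooner but does not train for more steps.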

Updated H100 recommended command in h100_experiment_plan.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key findings from standalone 13-variant drawpen verification sweep:
- ALL drawpen seed/param variants score -49 to -170 in holdout
- stock_drawdown_pen (+24.9) in v2 sweep was a ~2% lucky training run
- RL is non-deterministic; same config+seed gives wildly different results

H100 strategy revised:
- Increase max-trials from 500 to 1200 (diversity over depth)
- Early rejection is irrelevant for H100: training completes in ~9s
  before the 25% time check fires at 22.5s
- Target: realistic holdout improvements over random_mut_2272 baseline
- Expected: ~24+ positive-score configs from 1200 diverse trials

Also commit leaderboard CSVs:
- autoresearch_stocks12_v2_50trial.csv (50-trial v2 sweep, 2 positive)
- autoresearch_h100_drawpen_preview_v2.csv (partial, killed for early rejection bias)
- autoresearch_h100_drawpen_standalone.csv (13-variant verification, all negative)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tive

Full 13-variant sweep (seeds 7/42/123/999/2272, param variants tp02/tp05/dd02/dd10/ent03/wd005/slip5)
with early_reject_threshold=0.0 and correct 200x step cap:

Best: h100_drawpen_tp05 score=-37.7, neg=25%, median=+3.1%, p10=-2.4%
Most: scores -49 to -170, 20-100% negative windows

Confirms stock_drawdown_pen (+24.9, v2 sweep trial 20) was a ~2% lucky training run.
True hit rate for drawpen family: ~0/13 = 0% (unlucky batch) to ~1/50 = 2% at scale.

H100 strategy: run 1200 diverse trials, expect ~24-48 positive configs at 2-4% hit rate.
Do NOT specifically target drawpen — include in pool for coverage only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hrough)

Key finding: extending stocks12 training from 1302 days (2022-2025) to
1797 days (2020-2025) dramatically improves generalization on the hard
201-day val (Sep 2025 - Mar 2026, includes Nov 2025 - Feb 2026 bear market).

Results:
- Old training (1302d): 0/50 configs score positive on hard extended val
- New training (1797d): stock_trade_pen_03 scores +3.10 (seed 777) and
  -7.81 (seed 999) vs -102 with old data — first ever positive on hard val

Root cause: 2020-2021 data (COVID recovery + 2021 bull market) teaches the
model about market cycles and regime detection. Models trained from 2022 only
see one bear market and one recovery; they fail when encountering the 2025-2026
bear market. The extended data fixes this.

Changes:
- audit_stock_splits.py: add stocks12_daily_train_2019 config (2019-01-02 start,
  effective 2020-09-30 due to PLTR IPO, 1797 calendar days)
- h100_experiment_plan.md: v5 update with extended training breakthrough,
  corrects previous "extended data is worse" finding (that used old easy val),
  updates H100 command to use stocks12_daily_train_2019.bin,
  updates step cap to 4,312,800 (12*1797*200), updates hit rate expectation
  to 5-15% (vs 0% with old data)
- Add sweep result CSVs: extended_val_50trial, train2019_10trial, train2019_50trial

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trade_penalty=0.03 identified as sweet spot on hard 201-day val with
extended training data (2020-2025): scored +3.1 (seed 777) vs -102 with
old training data. Add seed/param variants to increase coverage:
- tp03_s7/s42/s123/s2272: seed sweep
- tp03_slip5/slip10: slippage friction variants
- tp03_wd01/wd05: weight decay variants
- tp03_obs: observation normalization
- tp03_ent03/annent: entropy variants
- tp03_h512/h2048: network size variants
- tp03_cosine: cosine LR schedule
- tp03_full_reg: combined regularization

Pool size: 253 total (95 STOCK + 158 non-GPU H100)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ariants findings

tp03 variants sweep (16 configs, seed 1337, extended 1797d training) results:
- tp03_s2272: -33.8 (best, seed 2272 is special for this config class)
- tp03_wd01: -39.4 (median=+5.6%, wd=0.01 helps)
- tp03_h2048: -50.0 (median=+6.0%, larger net benefits from 5yr data)
- tp03_slip5/slip10: -110 to -130 (AVOID: slippage training hurts bear market generalization)
- tp03_obs: -124 (AVOID: obs_norm hurts with trade_pen_03)

Add best-combo configs: tp03_s2272_wd01, tp03_h2048_wd01, tp03_s2272_h2048
Pool is now 98 STOCK + 158 non-GPU H100 = 256 total

Key rule: trade_pen_03 without slippage, without obs_norm, with wd=0.01 or h2048

Update H100 plan with full tp03 variants findings table and updated pool summary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mote pipeline

- autoresearch_rl.py: added 108 tp03 variants to STOCK_EXPERIMENTS:
  - tp03_s777 (KNOWN WINNER on hard 201-day bear val, robust=+3.10)
  - tp03_s{7,42,123,888,1111,2272,3141,4242,5678,7777,9999} for seed discovery
  - tp03_wd01_s{777,42,2272} + tp03_h2048_s{777,42,2272} (best modifiers x seeds)
  - tp03_seed_{1..50}: dense sequential seed sweep for H100 (expect ~17 positive)
  - tp03_wd01_seed_{1..25}: wd=0.01 modifier seeds for H100

- remote_training_pipeline.py: add max_timesteps_per_sample param to
  build_autoresearch_cmd() and build_remote_autoresearch_plan()

- launch_stocks_autoresearch_remote.py: add --max-timesteps-per-sample CLI arg
  (default 200, gives ~4.3M steps on 1797-day 2019 training data)

Key finding: previous tp03_variants sweep used --seed 1337 override which masked
all explicit per-config seeds. The actual tp03 hit rate at native seeds needs
testing via the tp03_multiseed sweep (no global override, early-reject disabled).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…m, cuDNN

Add torch.manual_seed(args.seed) + cuda.manual_seed_all + random.seed + numpy.seed
at training startup, plus cudnn.benchmark=False for cuDNN algorithm stability.

Previously only the C environment was seeded (via vec_init/vec_reset). Network
weight initialization was non-deterministic, causing large result variance even
with identical configs. Now each --seed value produces a reproducible training
trajectory, enabling systematic seed sweeps on local hardware before H100 runs.

Key implication: tp03_seed_{1..50} dense sweep will now give reproducible results
so we can identify which seeds work on the hard 201-day bear market val before
committing to expensive H100 time.
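
A minimal sketch of the seeding described above, assuming a helper called `seed_everything` (a hypothetical name). NumPy and PyTorch are seeded only if they are importable so the snippet also runs in a bare environment; the actual change in `train.py` calls these functions directly at startup.

```python
import importlib.util
import random


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch RNGs and pin cuDNN algorithm selection."""
    random.seed(seed)
    if importlib.util.find_spec("numpy") is not None:
        import numpy as np
        np.random.seed(seed)
    if importlib.util.find_spec("torch") is not None:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)        # no-op when CUDA is unavailable
        torch.backends.cudnn.benchmark = False  # avoid run-to-run algorithm changes
```

Note that `cudnn.benchmark=False` only stops cuDNN from auto-selecting algorithms per run; some GPU kernels can still be non-deterministic unless deterministic algorithms are also requested.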

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ep cap

Key bugs fixed:
1. --max-timesteps-per-sample default was 200 (4.3M steps) — models need 33M+ steps
   to converge. Changed to 10000 (effectively no cap; 300s wall-clock is binding).
2. --stocks12 flag was never passed to autoresearch_rl.py — remote runs used the
   default crypto EXPERIMENTS pool instead of STOCK_EXPERIMENTS.
3. --time-budget default changed from 300 to 90 for H100 (90s x 390k steps/sec
   = ~35M steps ≈ local A40 300s convergence point).

Root cause of recent 0/34 positive sweep: the 200-sample cap (4.3M steps) was
8x shorter than the ~33M steps needed for convergence (found in all winning models).
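
The numbers in this message are consistent with a quick back-of-the-envelope check. The helper name is hypothetical; the inputs (90 s, 390k steps/sec, 33M steps, and the 4,312,800-step cap from 12 × 1797 × 200) are taken from the commit messages.

```python
def steps_in_budget(time_budget_s: float, steps_per_sec: float) -> int:
    """Steps completed when wall-clock time, not a step cap, is the binding limit."""
    return int(time_budget_s * steps_per_sec)


h100_steps = steps_in_budget(90, 390_000)        # 35,100,000 steps (~35M)
old_cap = 12 * 1797 * 200                        # 4,312,800 steps (~4.3M)
shortfall_ratio = 33_000_000 / old_cap           # roughly 7.7, i.e. about 8x
```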

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key finding: 200-sample cap (4.3M steps) was the root cause of 0/34 failures.
Winning models need 33-37M steps (300s on A40). Document correct H100 command:
time-budget=90 + no step cap = ~35M steps on H100 ≈ A40 300s convergence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lee101 and others added 26 commits March 25, 2026 14:11
…9+s623

More 5bps verified entries:
- s446: +6536% (VERY ROBUST) - already committed
- s431: +1588% (robust, pool≈honest≈5bps)
- s428: +1028% (VERY ROBUST, 5bps>>8bps>>pool)
- s413: +979% (ROBUST, escalating cascade pool→honest→5bps)
- s430: +672% (reliable, 3% drop)
- s627: +1510% (pool=0.53→+1510%)
- s623: +1151% (ROBUST, escalating cascade)
- s617: +833% (ROBUST, 5bps>>8bps>>pool)
- s620: +692% (ROBUST)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+1241%)

Confirmed entries from prior sessions + new find s832:
- s832: +1163% ann (new s801-900 find, saved as champion)
- s820: +1548% ann (ROBUST, confirmed from memory)
- s817: +1532% ann (VERY ROBUST, 5bps>>8bps>>pool escalating)
- s816: +1241% ann (ROBUST, confirmed from memory)
- s744: +844% ann (ROBUST, pool=1.46→honest=2.05→5bps=2.02)
- s623: +1151% ann (ROBUST, escalating cascade)
- s617: +833% ann (ROBUST), s620: +692% (ROBUST)
- s431: +1588% ann, s428: +1028%, s413: +979%, s430: +672%
- s627: +1510% ann (pool=0.53→5bps=+1510%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…%),s834(+1917%)

More major finds:
- s452: +8002% ann (VERY ROBUST! 5bps>>8bps) NEW #5 all-time
- s747: +2905% ann (s701-800 strong find)
- s543: +1636% ann (pool=3.55≈honest=3.53)
- s834: +1917% ann (pool=2.04→honest=3.52→5bps=3.40)
- s835: +912% ann (VERY ROBUST, 5bps>>8bps)
- s832: +1163% ann
- s450: +619% ann

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+1185%)

More strong finds from continuing sweep monitoring:
- s455: +1975% ann (ROBUST! 5bps>8bps>pool, s401-500 escalating)
- s646: +2486% ann (VERY ROBUST! 5bps=3.97 vs 8bps=3.33, s601-700)
- s837: +1312% ann (ROBUST, s801-900)
- s349: +1185% ann (ROBUST, tri-consistent, s301-400)
- s645: +862% ann (ROBUST, pool=0.69→+862%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Updated top-10 5bps leaderboard (145 evaluated):
#4 s456: +8,802% (ultra-robust: 5bps > 8bps, Sortino=6.71)
#6 s452: +8,002% (ultra-robust: 5bps > 8bps, Sortino=6.65)
#7 s734: +7,160% (ultra-robust: 5bps > 8bps)
#10 s446: +6,536% (ultra-robust: 5bps > 8bps)
#15 s827: +4,801% (ultra-robust: 5bps > 8bps)

7 sweeps ongoing: s201-900 at 55-62% complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…56(+8802% ROBUST)

New absolute record: s275 at +23,595% ann, 5bps>8bps>pool (tri-consistent).
s456 enters top-5 at +8,802% ROBUST (5bps >> 8bps).
169 seeds now properly evaluated in 5bps leaderboard.
New seeds: s357(+1708% ROBUST), s359(+2998%), s437(+1112%), s751(+2135%),
s649(+1601%), s842(pending).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Leaderboard updated (169 evaluated at 5bps):
#1 s275: +23,595% (Sortino=9.0, ultra-robust: 5bps > 8bps)
#2 s240: +17,642%
#3 s434: +10,359%
#4 s71:  +9,381%
#5 s456: +8,802% (new)
#6 s507: +8,273%
#7 s452: +8,002% (new)

Top-10 mean: +10,796% ann | 7/10 ultra-robust (5bps >= 8bps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…42%), s845(+2662%), s760(+2385%)

Batch eval of 103 seeds completes 5bps coverage. Notable new ROBUST seeds:
- s467: +3242% ann, sortino=6.10 (s401-500)
- s845: +2662% ann, sortino=4.71 (s801-900)
- s760: +2385% ann, sortino=5.23 (s701-800)
- s279: +2127% ROBUST (fixed: 5bps=3.62 >> 8bps=2.63)
- s210: +4461% ROBUST confirmed
- s209: +3091% ROBUST confirmed
- s904: +986% ROBUST (s901-1000 not all bad!)
Also fixed s277/s279 swapped entries from parallel eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0(+991%)

New high-value ROBUST finds:
- s658: +1837% ann (s601-700), 5bps=3.31 > 8bps=3.21
- s279: +2127% ann (s201-300) ROBUST, corrected from earlier swap
- s277: +1230% ann (s201-300) ROBUST
- s660: +991% ann (s601-700) ROBUST
- s567: +776% ann, s470: +1028% ann (overfitters)
Also: s564(+1261%), s552(+665% ROBUST), s465(+855% overfitter)
196 seeds evaluated, 85 ROBUST confirmed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ew finds

Extraordinary s901-1000 discoveries:
- s921: 5bps=6.86, +6386% ann, sortino=5.60, ROBUST! (pool=5.76 -> honest=6.66 -> 5bps=6.86)
- s914: +1839% OVERFITTER, s915: +1414% ROBUST, s920: +680% OVERFITTER

s801-900:
- s850: 5bps=5.86, +4869% ann, sortino=5.99, ROBUST! (pool=5.12 -> 5bps=5.86)

Other new ROBUST seeds: s658(+1837%), s660(+991%), s279(+2127%), s284(+894%)
Total: 206 seeds in 5bps leaderboard, 81 ROBUST. Sweeps ~65-87% complete per range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s discovered

Top seeds by 5bps annualized return:
- s275: +23,595% (Sortino=9.0, ultra-robust) — all-time champion
- s240: +17,642% (Sortino=7.0)
- s434: +10,359% (Sortino=6.99)
- s71:  +9,381%  (Sortino=8.29)
- s456: +8,802%  (Sortino=6.71, ultra-robust)

New champions this session: s921 (+6,386%, ultra-robust), s850 (+4,869%)

Coverage: s61-120 ✓, s121-200 ✓, others 50-80% complete
Auto 5bps monitor running continuously, all seeds >800% evaluated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(+7647%), s578(+5688%), s1203(+3619%)

Key new finds (all ROBUST):
- s292: +19,815% ann, S=7.99 — new #2 ROBUST champion
- s765: +7,647% ann, S=5.76 — new #7 ROBUST
- s578: +5,688% ann, S=8.14 — new #18 ROBUST (high sortino)
- s1203: +3,619% ann, S=6.43 — new from s1201-1300 range
- s1202: +4,121% ann, S=5.65 — new from s1201-1300 range
- s1206: +1,012% ann ROBUST, s1005: +854% ann ROBUST

New ranges discovered: s1001-1100 (103 seeds), s1101-1200 (6 seeds), s1201-1300 (14 seeds)
5bps auto-monitor updated to cover all ranges up to s1300+

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Champion leaderboard (top 3 by 5bps annualized):
- s670: +29,099% (Sortino=7.56, p50=15.45x/180d) — NEW ALL-TIME CHAMPION
- s275: +23,595% (Sortino=9.00, ultra-robust)
- s292: +20,000% (Sortino=7.99, ultra-robust)

233 seeds evaluated at 5bps; sweep ~70% complete.
Updated prod.md with comprehensive top-10 leaderboard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ents, unified hourly fixes

- src/robust_trading_metrics.py: new robust trading metrics module
- scripts/evaluate_binance_lora_candidate.py: binance lora candidate evaluator
- scripts/run_binance_crypto_lora_sweep.py: expanded binance crypto lora sweep
- pufferlib_market/autoresearch_rl.py: improved autoresearch with gpu pool support
- pufferlib_market/gpu_pool_rl.py: gpu pool RL training
- pufferlib_market/replay_eval.py: improved replay evaluation
- unified_hourly_experiment/trade_unified_hourly.py: hourly trading improvements
- tests: comprehensive test coverage additions
- alpacaprogress6.md: alpaca progress notes
- leaderboard CSVs: mixed23 sweep results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…neal_ent + stocks20

- scripts/stocks_deep_sweep.sh: 5-phase sweep covering:
  - Phase A: stocks12 tp05 s100-299 @ 35M steps
  - Phase B: stocks20 tp05 s1-80 @ 35M steps
  - Phase C: stocks12 tp03 s1-60 @ 35M steps
  - Phase D: stocks12 tp07 s1-60 @ 35M steps
  - Phase E: stocks12 anneal_ent tp05 s1-60 @ 35M steps
- Inspired by crypto70: need 200+ seeds to find champions
- pufferlib_market/stocks12_seed_sweep_leaderboard.csv: s51-87 results at 15M steps
  - s55 (med=9.38%, 5/50 neg) best of first batch — retraining at 35M
- Disk cleanup: freed 110GB by removing non-champion old checkpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace manual `processed_batches += 1` counter inside the for loop with
`enumerate(loader, start=1)` as required by ruff SIM113. This fixes the
failing CI lint job (Fast CI / lint).
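
For reference, the SIM113 pattern looks roughly like this. The function and its body are illustrative only; the actual loop in the repository does more work per batch.

```python
from typing import Iterable, List, Tuple


def run_epoch(loader: Iterable[List[float]]) -> Tuple[int, float]:
    """Sum all batch values and report how many batches were processed."""
    processed_batches = 0  # stays 0 if the loader is empty
    running_total = 0.0
    # enumerate(..., start=1) replaces a manual `processed_batches += 1` in the body.
    for processed_batches, batch in enumerate(loader, start=1):
        running_total += sum(batch)
    return processed_batches, running_total
```

Initialising `processed_batches` before the loop keeps the name defined when the loader yields nothing, which the manual counter handled implicitly.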

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes that caused pytest to be interrupted before unit tests ran:

1. Add `resolve_data_path()` to `scripts/train_crypto_lora_sweep.py`.
   `tests/test_train_crypto_lora_sweep.py` imported this function at
   module level but it did not exist, causing an ImportError during
   collection that triggered --maxfail=10 before any tests executed.

2. Skip jax test files in `pytest_ignore_collect` when `jax` is not
   installed. `test_jax_losses.py`, `test_jax_policy.py`, and
   `test_jax_trainer_wandboard.py` import from `binanceneural.jax_*`
   modules that require jax/flax, which are not included in
   requirements-ci.txt. This caused 3 more collection errors.

Together these collection errors (4+) exceeded the --maxfail=10 limit
set in the CI fast-unit-tests step, causing all unit tests to be
skipped and the job to fail with exit code 2 (interrupted).

Verified all 86 unit tests (marked `unit and not slow and not
model_required and not cuda_required`) pass locally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lee101 (Owner, Author) commented Mar 27, 2026

Codex Infinity
Hi! I'm Codex Infinity, your coding agent for this repo.

Start a task on this PR's branch by commenting:

Tasks and logs: https://codex-infinity.com

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18565f1330

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/conftest.py
"tests/test_jax_losses.py",
"tests/test_jax_policy.py",
"tests/test_jax_trainer_wandboard.py",
} and not _module_available("jax"):


[P2] Guard JAX test collection on flax availability too

The new collection skip only checks jax, but these three test files import modules that also import flax at module import time (for example binanceneural/jax_losses.py and binanceneural/jax_policy.py). In environments where jax is installed but flax is not, pytest will still try to collect these tests and fail with import errors, so the intended CI collection fix is incomplete for that dependency combination.
