Skip to content

Step 4: numpy NormalizationManagerProtein quadratic+linear normalization#87

Open
mschwoer wants to merge 7 commits into
opt/step3-vectorize-sample-shifterfrom
opt/step4-numpy-normalize-quadratic-linear
Open

Step 4: numpy NormalizationManagerProtein quadratic+linear normalization#87
mschwoer wants to merge 7 commits into
opt/step3-vectorize-sample-shifterfrom
opt/step4-numpy-normalize-quadratic-linear

Conversation

@mschwoer

@mschwoer mschwoer commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Overrides _normalize_quadratic_and_linear with numpy, dropping the label-based .loc[tuples,:]=df round-trips on a MultiIndex (the dominant per-protein cost for >num_samples_quadratic-ion proteins).

Estimated speedup: estimate stage 102.8 s → 16.2 s (~6.3× — the single biggest win), single-core HYE benchmark.

Bit-identical protein + ion output (max_abs_diff=0); suite passes.

🤖 Generated with Claude Code


@ammarcsj I chose to overwrite the whole method, not the four individual steps, as this would have introduced 3 more instance variables and made it harder to read overall. I added references as comments ("cf. ..") instead

mschwoer and others added 6 commits June 15, 2026 14:01
Add regression tests before the numpy override of
_normalize_quadratic_and_linear: an executable-spec characterization test
(validated against the baseline pandas implementation) over mixed-NaN rows that
pins the stable k-fewest-NaN quadratic split and the linear nanmedian shift, plus
a noise-free behavioral test asserting scaled-copy profiles collapse onto one
profile. Both pass against the unmodified baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n (step 4)

Override _normalize_quadratic_and_linear with a numpy implementation that drops
the label-based .loc[list_of_tuples,:]=df round-trips on a MultiIndex (the
dominant estimate-stage cost for >num_samples_quadratic-ion proteins). Reuses
get_normfacts/apply_sampleshifts on the stable k-fewest-NaN quadratic subset,
takes the column nanmedian as the reference, and shifts the linear rows by
nanmedian(reference - row). Bit-identical protein + ion output (max_abs_diff=0);
estimate stage drops to ~18s. Suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r stage comments

---
# Conversation that produced these changes
---

## User prompt

"apply the 4-step refactor and run the tests"

## User prompt

"I don't like the introduction of new instance variables. please revert. Instead add the four stages as comments to the method"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mschwoer mschwoer requested a review from ammarcsj June 16, 2026 09:08
mschwoer added a commit that referenced this pull request Jun 16, 2026
---
# Conversation that produced these changes
---

## User prompt

"Now, the NormalizationManagerProtein: it is no longer used. This means the refactoring done in #87 was obsolete?"

## Clarifying round 1

Q: How do you want to resolve the duplicated quadratic+linear numpy logic between #87's class override and #88's _normalize_protein_values?
  - Single source in normalization.py  <-- chosen
  - Revert #87's override
  - Leave both as-is
  - Just answer, no change

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant