EffortlessMetrics · EffortlessSteven · Jun 21, 2026
@@ -5,3 +5,7 @@
 ## 2026-03-05 - Hot Loop Allocations in Token Sampling
 **Learning:** Allocating memory in the hot path of LLM token generation (e.g., `logits.to_vec()` or building a `HashMap` per token) significantly degrades performance because every step reallocates vocabulary-sized vectors (often 128K+ elements). Additionally, applying the penalty once per occurrence (`logit *= inv_penalty`) is mathematically equivalent to counting occurrences in a `HashMap` and applying `.powi(count)`, so the per-token `HashMap` allocation can be removed entirely.
 **Action:** In generation loops, reuse buffers: store a `Vec` in the generator state and use `std::mem::take` to move it out temporarily, so it can be mutated while other `&self` methods are called without a borrow-checker conflict. Avoid `HashMap` allocations for simple counting when an iterative scalar approach is mathematically equivalent.
+
+## 2026-03-05 - Repetition Penalty Power Calculation Optimization
+**Learning:** Computing a value before an `if/else` wastes work only when some branch ignores it. The previous `apply` computed `penalty.powi(count)` once and used it in both branches, so the saving here comes from removing the per-token division: floating-point division is typically several times slower than multiplication, and `logit /= penalty.powi(count)` can become `logit *= inv_penalty.powi(count)` with `inv_penalty = 1.0 / penalty` computed once outside the loop. The two forms are equal in exact arithmetic but not bit-identical in `f32`, as the one-ulp change in the `logits` snapshot (`2.909091` to `2.9090908`) shows.
+**Action:** In hot loops, hoist loop-invariant values (such as `1.0 / penalty`) out of the loop, compute branch-specific values only inside the branch that uses them, and prefer multiplying by a precomputed reciprocal over dividing. Expect last-ulp differences from this rewrite and update snapshot tests accordingly.
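A minimal Rust sketch of the buffer-reuse pattern from the first entry, assuming a hypothetical `Generator` type (the real generator struct and its field names are not part of this diff). `step` moves the scratch `Vec` out with `std::mem::take`, refills it without reallocating once it has grown to vocabulary size, calls a `&self` method on it, and puts it back. `penalize` applies the penalty once per occurrence listed in `recent_tokens`, which matches `powi(count)` in exact arithmetic (the `f32` rounding can differ slightly). The greedy argmax is only a stand-in for the real sampler.

```rust
/// Hypothetical generator state; the real struct and field names are not shown in the diff.
struct Generator {
    /// Reusable vocabulary-sized scratch buffer, allocated once and kept across steps.
    scratch: Vec<f32>,
    /// Repetition penalty (> 1.0 pushes repeated tokens down).
    penalty: f32,
}

impl Generator {
    /// Applies the penalty once per occurrence in `recent_tokens`, so a token seen `n`
    /// times is scaled by `(1/penalty)^n` or `penalty^n` (in exact arithmetic)
    /// without any `HashMap` counting.
    fn penalize(&self, logits: &mut [f32], recent_tokens: &[u32]) {
        let inv_penalty = 1.0 / self.penalty; // loop-invariant reciprocal
        for &tok in recent_tokens {
            if let Some(logit) = logits.get_mut(tok as usize) {
                if *logit > 0.0 {
                    *logit *= inv_penalty;
                } else {
                    *logit *= self.penalty;
                }
            }
        }
    }

    /// One sampling step that reuses `self.scratch` instead of calling `logits.to_vec()`.
    fn step(&mut self, logits: &[f32], recent_tokens: &[u32]) -> usize {
        // Move the buffer out of `self` so `&self` methods can run while we mutate it.
        let mut buf = std::mem::take(&mut self.scratch);
        buf.clear();
        buf.extend_from_slice(logits); // no reallocation once capacity >= logits.len()
        self.penalize(&mut buf, recent_tokens);

        // Greedy argmax as a stand-in for the real sampler.
        let mut best = 0;
        for (i, &v) in buf.iter().enumerate() {
            if v > buf[best] {
                best = i;
            }
        }

        // Put the buffer back so the next step reuses the same allocation.
        self.scratch = buf;
        best
    }
}

fn main() {
    let mut g = Generator { scratch: Vec::new(), penalty: 2.0 };
    // Token 1 appears twice, so its logit 4.0 becomes 4.0 * 0.5 * 0.5 = 1.0
    // and token 2 (logit 3.0) wins.
    let picked = g.step(&[1.0, 4.0, 3.0], &[1, 1]);
    println!("picked token {picked}");
}
```

Without `std::mem::take`, calling `self.penalize(&mut self.scratch, ...)` is rejected, because `self.scratch` would be borrowed mutably while `self` is already borrowed immutably for the method call.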
@@ -264,6 +264,9 @@
     ///
     /// `token_counts` is a slice of `(token_id, occurrence_count)` pairs.
     pub fn apply(&self, logits: &mut [f32], token_counts: &[(u32, usize)]) {
+        // ⚡ Bolt: `1.0 / count_penalty` is loop-invariant; compute it once per call instead of dividing per token
+        let inv_count_penalty = 1.0 / self.count_penalty;
+
         for &(token_id, count) in token_counts {
             let idx = token_id as usize;
             if idx >= logits.len() || count == 0 {
@@ -279,11 +282,14 @@
             // Count penalty: multiplicative
             if self.count_penalty.to_bits() != 1.0f32.to_bits() {
                 let count = i32::try_from(count).unwrap_or(i32::MAX);
-                let penalty = self.count_penalty.powi(count);
 
                 if logits[idx] > 0.0 {
-                    logits[idx] /= penalty;
+                    // ⚡ Bolt: Multiply by (1/p)^count instead of dividing by p^count; equal in exact arithmetic, but the f32 result may differ in the last ulp
+                    let inv_penalty = inv_count_penalty.powi(count);
+                    logits[idx] *= inv_penalty;
                 } else {
+                    // ⚡ Bolt: Lazily compute power inside branch to avoid unnecessary work when this branch is not taken
+                    let penalty = self.count_penalty.powi(count);
                     logits[idx] *= penalty;
                 }
             }
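The change to `apply` can be checked in isolation with a self-contained sketch. `CountPenalty` is a hypothetical stand-in that models only the `count_penalty` field; the `to_bits` check and whatever the real method does in the elided lines are left out. `apply_div` mirrors the code before this change and `apply_mul` the code after it, so running both on the same logits shows they agree to within rounding, while the reciprocal version performs one division per call instead of one per penalized token.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the type that owns `apply`; only `count_penalty` is modeled.
struct CountPenalty {
    count_penalty: f32,
}

impl CountPenalty {
    /// Before the change: compute `p^count` once, then divide or multiply.
    fn apply_div(&self, logits: &mut [f32], token_counts: &[(u32, usize)]) {
        for &(token_id, count) in token_counts {
            let idx = token_id as usize;
            if idx >= logits.len() || count == 0 {
                continue;
            }
            let count = i32::try_from(count).unwrap_or(i32::MAX);
            let penalty = self.count_penalty.powi(count);
            if logits[idx] > 0.0 {
                logits[idx] /= penalty;
            } else {
                logits[idx] *= penalty;
            }
        }
    }

    /// After the change: one division per call, none per token.
    fn apply_mul(&self, logits: &mut [f32], token_counts: &[(u32, usize)]) {
        let inv_count_penalty = 1.0 / self.count_penalty; // loop-invariant
        for &(token_id, count) in token_counts {
            let idx = token_id as usize;
            if idx >= logits.len() || count == 0 {
                continue;
            }
            let count = i32::try_from(count).unwrap_or(i32::MAX);
            if logits[idx] > 0.0 {
                // Positive logits shrink: multiply by (1/p)^count instead of dividing by p^count.
                logits[idx] *= inv_count_penalty.powi(count);
            } else {
                // Non-positive logits grow in magnitude: multiply by p^count.
                logits[idx] *= self.count_penalty.powi(count);
            }
        }
    }
}

fn main() {
    let penalty = CountPenalty { count_penalty: 1.1 };
    // Token 9 is out of range for four logits and is skipped.
    let counts: [(u32, usize); 4] = [(0, 1), (1, 2), (3, 3), (9, 1)];
    let mut a = [3.2_f32, -1.0, 0.5, 2.0];
    let mut b = a;
    penalty.apply_div(&mut a, &counts);
    penalty.apply_mul(&mut b, &counts);
    println!("division:   {a:?}");
    println!("reciprocal: {b:?}");
}
```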

@@ -6,6 +6,6 @@ expression: logits
     1.0,
     0.15026294,
     3.0,
-    2.909091,
+    2.9090908,
     5.0,
 ]
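The snapshot change is a single-ulp shift: the nearest `f32` values to `2.909091` and `2.9090908` sit next to each other on the `f32` grid. The sketch below, using hypothetical helpers `ulp_distance` and `compare_div_vs_reciprocal`, parses both values and reports their distance in ulps (one), then counts how often `x / p` and `x * (1.0 / p)` disagree for an arbitrary penalty of `1.1` over sample positive logits. The penalty and inputs are illustrative choices, not the ones used by the actual snapshot test.

```rust
/// Distance in ulps between two finite, positive `f32` values: adjacent floats
/// of the same sign have bit patterns that differ by exactly 1.
fn ulp_distance(a: f32, b: f32) -> u32 {
    a.to_bits().abs_diff(b.to_bits())
}

/// Counts how many of `n` sample positive inputs give a different `f32` result for
/// `x / p` versus `x * (1.0 / p)`, and returns the largest gap found in ulps.
fn compare_div_vs_reciprocal(p: f32, n: u32) -> (u32, u32) {
    let inv = 1.0 / p;
    let (mut differing, mut max_ulps) = (0, 0);
    for i in 1..=n {
        let x = i as f32 * 0.01; // sample positive logits
        let d = ulp_distance(x / p, x * inv);
        if d != 0 {
            differing += 1;
        }
        max_ulps = max_ulps.max(d);
    }
    (differing, max_ulps)
}

fn main() {
    // The old and new snapshot values, parsed back into f32.
    let before: f32 = "2.909091".parse().unwrap();
    let after: f32 = "2.9090908".parse().unwrap();
    println!("snapshot values are {} ulp apart", ulp_distance(before, after));

    let (differing, max_ulps) = compare_div_vs_reciprocal(1.1, 1000);
    println!("{differing} of 1000 results differ; largest gap {max_ulps} ulp");
}
```

Only some inputs round differently, so a snapshot can change in one entry while the others stay bit-identical.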