Week 4 project update by kushal281 · Pull Request #2 · OpenLake/SmartTriage

kushal281 · 2026-06-25T10:18:49Z

Summary by CodeRabbit

New Features
- Added GitHub issue webhook handling to detect possible duplicate issues and post a comment when a match is found.
- Added a health check endpoint for basic service monitoring.
- Added issue ingestion and similarity search support for storing and checking issue content.
Chores
- Added environment variable examples and ignore rules for local, editor, and database files.
Bug Fixes
- Improved issue processing to handle missing request data more safely.

coderabbitai · 2026-06-25T10:19:02Z

Walkthrough

The PR adds GitHub environment defaults, ignore rules, and README merge markers. It also introduces configuration loading, GitHub API access, embedding and vector search, issue cleaning and duplicate detection, ingestion, and FastAPI webhook and health endpoints.

Changes

Repository setup

Layer / File(s)	Summary
Environment and ignore defaults `.env.example`, `.gitignore`	Adds GitHub env placeholders and repeated example assignments, while `.gitignore` excludes Python caches, virtualenvs, `.env`, Chroma data, IDE folders, and `.DS_Store`.
README merge markers `README.md`	Unresolved merge-conflict markers are inserted at the start and end of the README.

GitHub triage flow

Layer / File(s)	Summary
Config and GitHub client `app/config.py`, `app/core/github.py`	Environment variables are loaded into configuration constants and used to construct the GitHub client wrapper for repository lookup, issue fetching, and commenting.
Embedding and vector search `app/ml/embedder.py`, `app/db/vector.py`, `app/test.py`	The SentenceTransformer embedder, ChromaDB issue collection helper, and a small similarity-search script are added around issue similarity lookup.
Issue cleaning and duplicate detection `app/ml/clean.py`, `app/ml/duplicate.py`	Issue text is cleaned into structured fields and duplicate detection compares embeddings against the configured similarity threshold.
Ingestion and webhook endpoints `app/db/ingest.py`, `app/api/webhooks.py`, `app/main.py`	The ingestion script writes cleaned issues to ChromaDB and JSON, while the webhook handler processes opened issues, checks for duplicates, comments on matches, and is mounted by the FastAPI app.

Sequence Diagram(s)

sequenceDiagram
  participant GitHub
  participant github_webhook
  participant detect_duplicate
  participant search_similar_issue
  participant GitHubClient
  participant GitHubAPI
  GitHub->>github_webhook: POST /webhook issues opened event
  github_webhook->>detect_duplicate: title and body text
  detect_duplicate->>search_similar_issue: embedding
  search_similar_issue-->>detect_duplicate: distances and metadata
  detect_duplicate-->>github_webhook: duplicate result
  github_webhook->>GitHubClient: comment_issue(issue_number, comment)
  GitHubClient->>GitHubAPI: create issue comment

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A bunny hopped through hooks with glee,
And tucked embeddings in the tree.
If issues look the same to me,
I’ll leave a comment, quick and wee.
Thump-thump! 🐇

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title is too generic and does not clearly describe the main code changes in the pull request.	Use a concise title that names the primary change, such as adding GitHub webhook, duplicate detection, and ingestion support.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Warning

⚠️ This pull request shows signs of AI-generated slop (defensive_cruft, phantom_api). It has been flagged by CodeRabbit slop detection and should be reviewed carefully.

coderabbitai

Actionable comments posted: 8

🧹 Nitpick comments (7)

.gitignore (1)
9-12: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Ignore the whole Chroma data directory, not one generated file.

This is brittle: a new Chroma persistence file or regenerated collection ID will fall back into the repo. Ignoring data/chroma/ keeps the local DB out of source control without depending on a specific UUID.
♻️ Suggested cleanup
 # Chroma database
 data/issues.json
-data/chroma/chroma.sqlite3
-data/chroma/defcbcc9-4578-4553-b3f4-648cbe1c763b
+data/chroma/
As per the app/db/vector.py snippet, Chroma persists under ./data/chroma.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.gitignore around lines 9 - 12, The .gitignore entry is too specific because
it only excludes individual Chroma-generated files instead of the whole
persistence directory. Update the Chroma ignore rules to cover the entire
data/chroma folder so any files created by the vector store, including future
collection IDs or sqlite artifacts from app/db/vector.py and the Chroma
persistence path under ./data/chroma, stay out of source control.
app/db/ingest.py (1)
13-13: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Misleading variable name.

uncleaned_issues holds the raw issue records persisted to data/issues.json; a name like raw_issues would read more clearly. Cosmetic only.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/db/ingest.py` at line 13, The variable name in the ingestion logic is
misleading because `uncleaned_issues` actually stores the raw issue records
loaded from and persisted to `data/issues.json`. Rename it to something clearer
like `raw_issues` in the `ingest` flow so the intent is obvious, and update any
nearby references in the same scope to match.
app/core/github.py (2)
39-42: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use a local variable instead of self.repo.

Assigning to self.repo introduces mutable per-call instance state that other methods could accidentally depend on. A local variable is clearer and side-effect free.
♻️ Proposed change
     def comment_issue(self, issue_number, comment):
-        self.repo = self.get_repo(TARGET_REPO)
-        issue = self.repo.get_issue(number=issue_number)
+        repo = self.get_repo(TARGET_REPO)
+        issue = repo.get_issue(number=issue_number)
         issue.create_comment(comment)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/core/github.py` around lines 39 - 42, Replace the mutable instance
assignment in comment_issue with a local repository variable instead of setting
self.repo. In github.py, update comment_issue to store the result of
get_repo(TARGET_REPO) in a local name and use that for get_issue/create_comment,
keeping the method side-effect free and avoiding accidental reliance on shared
instance state.
15-37: 🩺 Stability & Availability | 🔵 Trivial | ⚖️ Poor tradeoff

Add error handling around external GitHub calls.

get_repo, get_issues, and create_comment can raise on rate limits, network errors, or auth failures. In the webhook path these are invoked on a request thread; unhandled exceptions will surface as 500s with no recovery. Consider wrapping calls with structured error handling and respecting PyGithub's rate-limit/retry options.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/core/github.py` around lines 15 - 37, Add structured error handling
around the external GitHub API calls in the GitHub integration so request-thread
failures don’t escape as unhandled 500s. In the methods that call PyGithub,
especially fetch_issues, and anywhere create_comment is used, wrap get_repo,
get_issues, and comment creation in a try/except that catches
GitHub/PyGithub/network/rate-limit/auth failures and returns a controlled error
or logs context. Use the existing GitHub class methods as the place to
centralize this, and ensure the webhook path handles these failures gracefully
instead of propagating exceptions.
app/ml/clean.py (1)
3-3: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Module-level Embedder() loads the model at import time.

Instantiating the SentenceTransformer at import means any module importing clean_issue (ingest script, webhook handler, tests) pays the model-load cost eagerly, even when no embedding is needed. Consider lazy initialization or dependency injection.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ml/clean.py` at line 3, Module-level instantiation of Embedder is loading
the SentenceTransformer eagerly on import, so move this initialization out of
the top level in clean_issue/clean.py. Refactor Embedder usage to support lazy
creation or dependency injection, and update the callers that use embedder so
the model is only loaded when embedding is actually needed.
app/config.py (1)
6-10: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider failing fast on missing required configuration.

GITHUB_TOKEN, TARGET_REPO, and SOURCE_REPO will silently be None if unset, surfacing later as confusing runtime errors deep in the GitHub client. A startup check would make misconfiguration obvious.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/config.py` around lines 6 - 10, Fail fast on missing required
configuration in app/config.py by validating the module-level settings for
GITHUB_TOKEN, TARGET_REPO, and SOURCE_REPO as soon as they are loaded instead of
leaving them as None. Update the config initialization around WEBHOOK_SECRET,
GITHUB_TOKEN, TARGET_REPO, SIMILARITY_THRESHOLD, and SOURCE_REPO so missing
required values raise an immediate startup error with a clear message, and keep
SIMILARITY_THRESHOLD parsing unchanged except for any needed validation.
app/ml/embedder.py (1)
3-9: 🚀 Performance & Scalability | 🔵 Trivial | ⚖️ Poor tradeoff

Consider caching/sharing a single Embedder instance.

SentenceTransformer("all-MiniLM-L6-v2") is loaded on every Embedder() construction. Since duplicate.py, test.py, and ingestion each instantiate their own, the model is loaded multiple times in a process. Reusing a module-level singleton avoids redundant load time and memory.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ml/embedder.py` around lines 3 - 9, The Embedder class currently loads
SentenceTransformer("all-MiniLM-L6-v2") in every __init__ call, causing
redundant model loads across duplicate.py, test.py, and ingestion. Refactor
Embedder so the SentenceTransformer instance is created once and shared, ideally
via a module-level singleton or cached factory used by Embedder.__init__ and
generate_embedding, and update the call sites to reuse the same Embedder
instance instead of constructing new ones repeatedly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.env.example:
- Around line 1-11: The env example is inconsistent with the real config
contract and duplicates keys, so clean up the file to show one canonical set of
variables only. Update the example to match what app/config.py reads by
documenting GITHUB_WEBHOOK_SECRET, GITHUB_TOKEN, TARGET_REPO,
SIMILARITY_THRESHOLD, and SOURCE_REPO, and remove the conflicting
REPO_NAME/MY_REPO_NAME block so contributors see a single unambiguous template.

In `@app/api/webhooks.py`:
- Around line 10-21: The github_webhook handler currently accepts any POST
request without verifying the GitHub HMAC signature. In github_webhook, read the
raw request body first, compute the expected X-Hub-Signature-256 using the
configured webhook secret, and compare it with the incoming header using a
constant-time check before parsing the payload. If the signature is missing or
invalid, return an unauthorized response and only proceed with event processing
after the signature is validated.
- Around line 27-31: Guard against malformed issue payloads in the webhook
handler by updating the `issue_text` construction in `app/api/webhooks.py` to
use safe accessors instead of direct indexing on `payload["issue"]["title"]`;
validate the payload shape first, return early for missing `issue` or `title`,
and keep the existing flow in the webhook parsing logic so bad
attacker-controlled input cannot raise a `KeyError` and trigger a 500.
- Around line 33-49: The async webhook handler is calling blocking work
directly, which can stall the event loop and hurt concurrency; in the webhook
flow around detect_duplicate and github.comment_issue, move both the duplicate
detection and the PyGithub issue comment call off the event loop using a
threadpool or background task. Update the handler in app/api/webhooks.py so the
logic around detect_duplicate(payload["issue"]["body"]) and
github.comment_issue(...) is awaited through an async-safe offload path, keeping
the async request handler non-blocking.

In `@app/config.py`:
- Line 6: The webhook flow currently reads WEBHOOK_SECRET in app/config.py but
the handler in app/api/webhooks.py does not use it to authenticate incoming
requests. Update the /webhook handling logic to verify GitHub’s
X-Hub-Signature-256 against an HMAC-SHA256 computed from the raw request body
using WEBHOOK_SECRET, and reject requests when the signature is missing or
invalid before any parsing or duplicate-detection work runs. Use the webhook
handler entrypoint and request-processing logic in app/api/webhooks.py to locate
the change, and perform the comparison in constant time.

In `@app/db/vector.py`:
- Line 7: The collection created in client.get_or_create_collection for
github_issues is using the default L2 distance, which breaks the similarity math
used by app.ml.duplicate's similarity calculation. Update the collection
creation in app.db.vector so github_issues is created with an explicit cosine
distance metric/space, keeping the returned distances compatible with max(0, 1 -
distance).

In `@app/test.py`:
- Around line 2-16: The test is importing and calling the wrong vector search
helper: `app.db.vector` exposes `search_similar_issue`, not `search_similar`.
Update the import in `app/test.py` to use the सही function name, and adjust the
call site to invoke `search_similar_issue` with the embedding so the test
matches the actual API.

In `@README.md`:
- Around line 1-4: The README is still left with merge-conflict markers, so
remove the conflict artifacts and keep the intended content only. Clean up both
conflicted sections in README.md by deleting the HEAD/divider markers and
retaining the final README text, using the visible conflict blocks near the top
and bottom of the file as the locations to fix.

---

Nitpick comments:
In @.gitignore:
- Around line 9-12: The .gitignore entry is too specific because it only
excludes individual Chroma-generated files instead of the whole persistence
directory. Update the Chroma ignore rules to cover the entire data/chroma folder
so any files created by the vector store, including future collection IDs or
sqlite artifacts from app/db/vector.py and the Chroma persistence path under
./data/chroma, stay out of source control.

In `@app/config.py`:
- Around line 6-10: Fail fast on missing required configuration in app/config.py
by validating the module-level settings for GITHUB_TOKEN, TARGET_REPO, and
SOURCE_REPO as soon as they are loaded instead of leaving them as None. Update
the config initialization around WEBHOOK_SECRET, GITHUB_TOKEN, TARGET_REPO,
SIMILARITY_THRESHOLD, and SOURCE_REPO so missing required values raise an
immediate startup error with a clear message, and keep SIMILARITY_THRESHOLD
parsing unchanged except for any needed validation.

In `@app/core/github.py`:
- Around line 39-42: Replace the mutable instance assignment in comment_issue
with a local repository variable instead of setting self.repo. In github.py,
update comment_issue to store the result of get_repo(TARGET_REPO) in a local
name and use that for get_issue/create_comment, keeping the method side-effect
free and avoiding accidental reliance on shared instance state.
- Around line 15-37: Add structured error handling around the external GitHub
API calls in the GitHub integration so request-thread failures don’t escape as
unhandled 500s. In the methods that call PyGithub, especially fetch_issues, and
anywhere create_comment is used, wrap get_repo, get_issues, and comment creation
in a try/except that catches GitHub/PyGithub/network/rate-limit/auth failures
and returns a controlled error or logs context. Use the existing GitHub class
methods as the place to centralize this, and ensure the webhook path handles
these failures gracefully instead of propagating exceptions.

In `@app/db/ingest.py`:
- Line 13: The variable name in the ingestion logic is misleading because
`uncleaned_issues` actually stores the raw issue records loaded from and
persisted to `data/issues.json`. Rename it to something clearer like
`raw_issues` in the `ingest` flow so the intent is obvious, and update any
nearby references in the same scope to match.

In `@app/ml/clean.py`:
- Line 3: Module-level instantiation of Embedder is loading the
SentenceTransformer eagerly on import, so move this initialization out of the
top level in clean_issue/clean.py. Refactor Embedder usage to support lazy
creation or dependency injection, and update the callers that use embedder so
the model is only loaded when embedding is actually needed.

In `@app/ml/embedder.py`:
- Around line 3-9: The Embedder class currently loads
SentenceTransformer("all-MiniLM-L6-v2") in every __init__ call, causing
redundant model loads across duplicate.py, test.py, and ingestion. Refactor
Embedder so the SentenceTransformer instance is created once and shared, ideally
via a module-level singleton or cached factory used by Embedder.__init__ and
generate_embedding, and update the call sites to reuse the same Embedder
instance instead of constructing new ones repeatedly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e1bbede-441c-4408-a2d0-3cb0a21af05d

📥 Commits

Reviewing files that changed from the base of the PR and between c703023 and b86ddae.

📒 Files selected for processing (14)

.env.example
.gitignore
README.md
app/api/webhooks.py
app/config.py
app/core/github.py
app/db/ingest.py
app/db/vector.py
app/main.py
app/ml/clean.py
app/ml/duplicate.py
app/ml/embedder.py
app/test.py
requirements.txt

coderabbitai · 2026-06-25T10:24:46Z

+GITHUB_WEBHOOK_SECRET=
+GITHUB_TOKEN=
+REPO_NAME= #Enter repo from where you want to fetch issues
+SIMILARITY_THRESHOLD=
+MY_REPO_NAME= #Enter repo for which you want to use 
+
+GITHUB_TOKEN=your_token_here
+GITHUB_WEBHOOK_SECRET=your_secret_here
+REPO_NAME=OpenLake/your-repo
+TARGET_REPO=your-test-repo
+SIMILARITY_THRESHOLD=0.85


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Align the example with the actual env contract.

This file has two conflicting key blocks, and it never documents SOURCE_REPO, which app/config.py reads. New contributors will copy an incomplete/ambiguous config.

♻️ Suggested cleanup

-GITHUB_WEBHOOK_SECRET= -GITHUB_TOKEN= -REPO_NAME= `#Enter` repo from where you want to fetch issues -SIMILARITY_THRESHOLD= -MY_REPO_NAME= `#Enter` repo for which you want to use - GITHUB_TOKEN=your_token_here GITHUB_WEBHOOK_SECRET=your_secret_here -REPO_NAME=OpenLake/your-repo +SOURCE_REPO=OpenLake/your-repo TARGET_REPO=your-test-repo SIMILARITY_THRESHOLD=0.85

As per the app/config.py snippet, the app loads GITHUB_WEBHOOK_SECRET, GITHUB_TOKEN, TARGET_REPO, SIMILARITY_THRESHOLD, and SOURCE_REPO.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

GITHUB_WEBHOOK_SECRET=

GITHUB_TOKEN=

REPO_NAME= #Enter repo from where you want to fetch issues

SIMILARITY_THRESHOLD=

MY_REPO_NAME= #Enter repo for which you want to use

GITHUB_TOKEN=your_token_here

GITHUB_WEBHOOK_SECRET=your_secret_here

REPO_NAME=OpenLake/your-repo

TARGET_REPO=your-test-repo

SIMILARITY_THRESHOLD=0.85

GITHUB_TOKEN=your_token_here

GITHUB_WEBHOOK_SECRET=your_secret_here

SOURCE_REPO=OpenLake/your-repo

TARGET_REPO=your-test-repo

SIMILARITY_THRESHOLD=0.85

🧰 Tools

🪛 dotenv-linter (4.0.0)

[warning] 2-2: [UnorderedKey] The GITHUB_TOKEN key should go before the GITHUB_WEBHOOK_SECRET key

(UnorderedKey)

[warning] 3-3: [SpaceCharacter] The line has spaces around equal sign

(SpaceCharacter)

[warning] 3-3: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)

[warning] 5-5: [SpaceCharacter] The line has spaces around equal sign

(SpaceCharacter)

[warning] 5-5: [TrailingWhitespace] Trailing whitespace detected

(TrailingWhitespace)

[warning] 5-5: [UnorderedKey] The MY_REPO_NAME key should go before the REPO_NAME key

(UnorderedKey)

[warning] 5-5: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)

[warning] 7-7: [DuplicatedKey] The GITHUB_TOKEN key is duplicated

(DuplicatedKey)

[warning] 8-8: [DuplicatedKey] The GITHUB_WEBHOOK_SECRET key is duplicated

(DuplicatedKey)

[warning] 9-9: [DuplicatedKey] The REPO_NAME key is duplicated

(DuplicatedKey)

[warning] 11-11: [DuplicatedKey] The SIMILARITY_THRESHOLD key is duplicated

(DuplicatedKey)

[warning] 11-11: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)

[warning] 11-11: [UnorderedKey] The SIMILARITY_THRESHOLD key should go before the TARGET_REPO key

(UnorderedKey)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.env.example around lines 1 - 11, The env example is inconsistent with the real config contract and duplicates keys, so clean up the file to show one canonical set of variables only. Update the example to match what app/config.py reads by documenting GITHUB_WEBHOOK_SECRET, GITHUB_TOKEN, TARGET_REPO, SIMILARITY_THRESHOLD, and SOURCE_REPO, and remove the conflicting REPO_NAME/MY_REPO_NAME block so contributors see a single unambiguous template.

coderabbitai · 2026-06-25T10:24:46Z

+@router.post("/webhook")
+async def github_webhook(request: Request):
+
+    event = request.headers.get("X-GitHub-Event")
+
+    print(f"GitHub Event: {event}")
+
+    try:
+        payload = await request.json()
+    except Exception:
+        payload = {}
+


🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Verify the GitHub webhook signature.

The handler accepts any POST without validating the X-Hub-Signature-256 HMAC against your webhook secret. Anyone who knows the endpoint can spoof issues/opened events and trigger comments. Compute and compare the HMAC of the raw body before processing.

Sketch

import hmac, hashlib from app.config import WEBHOOK_SECRET raw = await request.body() sig = request.headers.get("X-Hub-Signature-256", "") expected = "sha256=" + hmac.new(WEBHOOK_SECRET.encode(), raw, hashlib.sha256).hexdigest() if not hmac.compare_digest(sig, expected): raise HTTPException(status_code=401, detail="Invalid signature") payload = json.loads(raw)

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 19-19: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/webhooks.py` around lines 10 - 21, The github_webhook handler currently accepts any POST request without verifying the GitHub HMAC signature. In github_webhook, read the raw request body first, compute the expected X-Hub-Signature-256 using the configured webhook secret, and compare it with the incoming header using a constant-time check before parsing the payload. If the signature is missing or invalid, return an unauthorized response and only proceed with event processing after the signature is validated.

coderabbitai · 2026-06-25T10:24:46Z

+            issue_text = (
+                payload["issue"]["title"]
+                + "\n"
+                + (payload["issue"]["body"] or "")
+            )


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Guard against malformed issues payloads.

If the payload is missing issue (or title), payload["issue"]["title"] raises KeyError and returns a 500. Since the body is attacker-controllable, prefer .get() with validation and return early on bad shape.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/webhooks.py` around lines 27 - 31, Guard against malformed issue payloads in the webhook handler by updating the `issue_text` construction in `app/api/webhooks.py` to use safe accessors instead of direct indexing on `payload["issue"]["title"]`; validate the payload shape first, return early for missing `issue` or `title`, and keep the existing flow in the webhook parsing logic so bad attacker-controlled input cannot raise a `KeyError` and trigger a 500.

coderabbitai · 2026-06-25T10:24:46Z

+            result = detect_duplicate(issue_text)
+            print("Duplicate result:", result)
+
+            if result and result["duplicate"]:
+                github.comment_issue(
+                    payload["issue"]["number"],
+
+                    f"""
+Possible duplicate issue detected.
+
+Similar issue:
+{result['issue']['url']}
+
+Similarity:
+{result['similarity']:.2f}
+"""
+                )


🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Blocking work runs on the event loop.

detect_duplicate (model encode + Chroma query) and github.comment_issue (synchronous PyGithub HTTP calls) are blocking, but this is an async handler. They will stall the event loop and degrade concurrency under load. Offload to a thread (e.g. await run_in_threadpool(...)) or a background task.

Proposed change

- result = detect_duplicate(issue_text) + from starlette.concurrency import run_in_threadpool + result = await run_in_threadpool(detect_duplicate, issue_text) print("Duplicate result:", result) if result and result["duplicate"]: - github.comment_issue( - payload["issue"]["number"], - ... - ) + await run_in_threadpool( + github.comment_issue, + payload["issue"]["number"], + comment_body, + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/api/webhooks.py` around lines 33 - 49, The async webhook handler is calling blocking work directly, which can stall the event loop and hurt concurrency; in the webhook flow around detect_duplicate and github.comment_issue, move both the duplicate detection and the PyGithub issue comment call off the event loop using a threadpool or background task. Update the handler in app/api/webhooks.py so the logic around detect_duplicate(payload["issue"]["body"]) and github.comment_issue(...) is awaited through an async-safe offload path, keeping the async request handler non-blocking.

coderabbitai · 2026-06-25T10:24:46Z

+
+load_dotenv()
+
+WEBHOOK_SECRET = os.getenv("GITHUB_WEBHOOK_SECRET")


🔒 Security & Privacy | 🔴 Critical | 🏗️ Heavy lift

WEBHOOK_SECRET is loaded but never used to verify webhook signatures.

The webhook handler in app/api/webhooks.py parses the request body without validating GitHub's X-Hub-Signature-256 header against this secret. As a result, the /webhook endpoint will process arbitrary forged payloads, which can trigger duplicate-detection work and post comments on issues on behalf of your token.

Compute an HMAC-SHA256 over the raw request body using WEBHOOK_SECRET and compare it (constant-time) to the X-Hub-Signature-256 header before processing.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/config.py` at line 6, The webhook flow currently reads WEBHOOK_SECRET in app/config.py but the handler in app/api/webhooks.py does not use it to authenticate incoming requests. Update the /webhook handling logic to verify GitHub’s X-Hub-Signature-256 against an HMAC-SHA256 computed from the raw request body using WEBHOOK_SECRET, and reject requests when the signature is missing or invalid before any parsing or duplicate-detection work runs. Use the webhook handler entrypoint and request-processing logic in app/api/webhooks.py to locate the change, and perform the comparison in constant time.

coderabbitai · 2026-06-25T10:24:46Z

+client = chromadb.PersistentClient(path="./data/chroma")
+
+
+collection = client.get_or_create_collection(name="github_issues")


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the default distance metric for a ChromaDB get_or_create_collection, and how do you configure cosine distance in chromadb 0.4.6?

💡 Result:

The default distance metric for a ChromaDB collection is l2 (Squared Euclidean distance) [1][2]. In ChromaDB version 0.4.6, the configuration dictionary parameter (which became standard in later versions like 1.0) was not yet the primary interface for index settings [2]. Instead, you configure the distance metric by passing a metadata dictionary to the get_or_create_collection method using the hnsw:space key [3][4]. To configure cosine distance in chromadb 0.4.6, use the following syntax: collection = client.get_or_create_collection( name="your_collection_name", metadata={"hnsw:space": "cosine"}) Important notes: 1. Distance vs. Similarity: ChromaDB's cosine setting returns cosine distance (1 - cosine similarity), meaning values closer to 0 indicate higher similarity [1][3]. 2. Immutability: The distance metric cannot be changed once a collection has been created [1][2]. If you need a different metric, you must delete the collection and recreate it, or clone it into a new collection with the desired setting [1]. 3. Legacy Metadata: While newer versions of Chroma now prefer a configuration argument for these settings, version 0.4.6 relies on the metadata parameter to pass HNSW-specific configurations like hnsw:space [2][3][4].

Citations:

1: https://cookbook.chromadb.dev/core/collections/

2: https://cookbook.chromadb.dev/core/configuration/

3: [Bug]: Passing {"hnsw:space": "cosine"} to create_collection method returning unexpected distance values on collection.query call chroma-core/chroma#1335

4: https://github.com/chroma-core/chroma/blob/main/chromadb/api/collection_configuration.py

Set an explicit cosine distance metric for the collection.

The default distance metric for get_or_create_collection in ChromaDB 0.4.6 is L2. Since app/ml/duplicate.py calculates similarity = max(0, 1 - distance), it expects the distance values to be bounded (specifically within [0, 2] for cosine distance). Using unbounded L2 distances with this formula produces incorrect similarity scores. You must configure the collection to use the cosine space during creation.

Proposed change

-collection = client.get_or_create_collection(name="github_issues") +collection = client.get_or_create_collection( + name="github_issues", + metadata={"hnsw:space": "cosine"}, +)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

collection = client.get_or_create_collection(name="github_issues")

collection = client.get_or_create_collection(

name="github_issues",

metadata={"hnsw:space": "cosine"},

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/db/vector.py` at line 7, The collection created in client.get_or_create_collection for github_issues is using the default L2 distance, which breaks the similarity math used by app.ml.duplicate's similarity calculation. Update the collection creation in app.db.vector so github_issues is created with an explicit cosine distance metric/space, keeping the returned distances compatible with max(0, 1 - distance).

coderabbitai · 2026-06-25T10:24:47Z

+from app.db.vector import search_similar
+
+
+embedder = Embedder()
+
+
+text = "Cannot login into application"
+
+
+embedding = embedder.generate_embedding(text)
+
+
+result = search_similar(
+    embedding
+)


🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Import name mismatch — search_similar does not exist.

app/db/vector.py defines search_similar_issue, not search_similar. This import raises ImportError and the call on Line 14 would also fail.

Proposed fix

-from app.db.vector import search_similar +from app.db.vector import search_similar_issue @@ -result = search_similar( - embedding -) +result = search_similar_issue( + embedding +)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from app.db.vector import search_similar

embedder = Embedder()

text = "Cannot login into application"

embedding = embedder.generate_embedding(text)

result = search_similar(

embedding

)

from app.db.vector import search_similar_issue

embedder = Embedder()

text = "Cannot login into application"

embedding = embedder.generate_embedding(text)

result = search_similar_issue(

embedding

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/test.py` around lines 2 - 16, The test is importing and calling the wrong vector search helper: `app.db.vector` exposes `search_similar_issue`, not `search_similar`. Update the import in `app/test.py` to use the सही function name, and adjust the call site to invoke `search_similar_issue` with the embedding so the test matches the actual API.

coderabbitai · 2026-06-25T10:24:47Z

+<<<<<<< HEAD
+
+=======
 # SmartTriage


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Remove the merge-conflict markers before merging.

The README is still in a conflicted state at both the top and bottom, which breaks the rendered docs and should block merge.

As per the provided line-range change details, both markers need to be removed.

Also applies to: 264-264

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 1 - 4, The README is still left with merge-conflict markers, so remove the conflict artifacts and keep the intended content only. Clean up both conflicted sections in README.md by deleting the HEAD/divider markers and retaining the final README text, using the visible conflict blocks near the top and bottom of the file as the locations to fix.

kushal281 added 3 commits June 25, 2026 15:36

Initial commit with SmartTriage setup

a90097c

Update README for SmartTriage

321ea6f

Merge remote main into local main

b86ddae

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Week 4 project update#2

Week 4 project update#2
kushal281 wants to merge 3 commits into
OpenLake:mainfrom
kushal281:main

kushal281 commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant


		load_dotenv()

		WEBHOOK_SECRET = os.getenv("GITHUB_WEBHOOK_SECRET")

		client = chromadb.PersistentClient(path="./data/chroma")


		collection = client.get_or_create_collection(name="github_issues")

Uh oh!

Conversation

kushal281 commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kushal281 commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading