Add Playwright-backed row extraction and local LLM provider support#145

Draft
AdamEXu wants to merge 18 commits into tinyfish-io:main from AdamEXu:exp-row

@AdamEXu AdamEXu commented Jun 16, 2026

Contributor

Summary

This PR adds a Playwright-backed row extraction path that can build, validate, cache, run, and repair reusable extractors through TinyFish Browser. It also carries the supporting fixes needed to make that reliable in practice: richer schema contracts, provider-selectable LLM setup, row extractor settings, better refresh/populate cancellation, and UI/backend wiring for the new configuration.
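
As a rough illustration of the cancellation behavior described here (stopping queued work, not just recording that a stop was requested), here is a minimal sketch. It is not the PR's code: `Step` and `drainQueue` are hypothetical names, and the steps are synchronous to keep the example small. Only the standard `AbortController`/`AbortSignal` API is assumed.

```typescript
// Hypothetical sketch: Step and drainQueue are illustrative names, not from this PR.
type Step = (signal: AbortSignal) => string;

interface DrainResult {
  done: string[];    // results of steps that actually ran
  cancelled: number; // queued steps that were dropped after a stop
}

// Runs queued steps in order. Once the shared signal is aborted,
// remaining steps are skipped instead of being started.
function drainQueue(queue: Step[], signal: AbortSignal): DrainResult {
  const done: string[] = [];
  let cancelled = 0;
  for (const step of queue) {
    if (signal.aborted) {
      cancelled += 1;
      continue;
    }
    done.push(step(signal));
  }
  return { done, cancelled };
}

// Usage: the second step simulates the user pressing stop mid-run.
const stopController = new AbortController();
const demoQueue: Step[] = [
  () => "row-1",
  () => {
    stopController.abort();
    return "row-2";
  },
  () => "row-3",
  () => "row-4",
];
const demoResult = drainQueue(demoQueue, stopController.signal);
console.log(demoResult);
```

In real async code the same signal would presumably also be handed to any in-flight work that can observe it, so the active step can stop early as well.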

What changed

  • Added a generated Playwright row extractor runtime for populate and refresh flows, including TinyFish Browser CDP sessions, extractor validation, repair, smoke/regression checks, quality checks, and persisted per-dataset/site extractor scripts.
  • Added codification metadata to schema inference so datasets can decide whether a reusable browser extractor is disabled, candidate, required, or unknown, with source-family hints and URL templates.
  • Expanded column contracts with nullable, validation regex, normalization hints, typed values, cell sources, and stricter primary key handling.
  • Added local LLM provider selection beyond OpenRouter, including provider-specific credentials, base URLs, defaults, model compatibility filtering, extractor-builder model settings, provider logos, and setup/settings UI.
  • Added row extractor controls for concurrency and browser attempts, plus higher token/step limits where the agents need them.
  • Made stop/cancel propagation work through populate, refresh, queued subagents, dataset tools, search/fetch tools, and agent generation so the stop button cancels active and queued work instead of only marking intent.
  • Hardened scheduled refresh completion/failure with run IDs so stale refresh jobs cannot overwrite newer status.
  • Updated local dev docs/setup flow and Docker/dev Makefile wiring around the new local credential/keychain behavior.
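
For reviewers, the run-ID guard on scheduled refresh can be pictured roughly as below. This is a sketch under assumptions, not the code in this PR: `RefreshState`, `startRefresh`, and `finishRefresh` are hypothetical names, and an in-memory `Map` stands in for whatever storage the backend actually uses.

```typescript
// Hypothetical shapes; the real refresh state model is not shown in this description.
type FinalStatus = "completed" | "failed";

interface RefreshState {
  runId: string;
  status: "running" | FinalStatus;
}

const refreshByDataset = new Map<string, RefreshState>();

// Starting a refresh records which run currently owns the dataset's status.
function startRefresh(datasetId: string, runId: string): void {
  refreshByDataset.set(datasetId, { runId, status: "running" });
}

// A finishing job may write its result only if it still owns the latest run.
// Returns false (and changes nothing) when the job is stale.
function finishRefresh(datasetId: string, runId: string, status: FinalStatus): boolean {
  const current = refreshByDataset.get(datasetId);
  if (!current || current.runId !== runId) {
    return false;
  }
  refreshByDataset.set(datasetId, { runId, status });
  return true;
}

// Usage: run-1 is superseded by run-2, so its late failure is ignored.
startRefresh("dataset-a", "run-1");
startRefresh("dataset-a", "run-2");
const staleWrite = finishRefresh("dataset-a", "run-1", "failed");
const freshWrite = finishRefresh("dataset-a", "run-2", "completed");
console.log(staleWrite, freshWrite, refreshByDataset.get("dataset-a"));
```

The compare-before-write check is the whole idea: a job that lost ownership simply becomes a no-op.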

Validation

  • git diff --check upstream/main...HEAD
  • cd backend && npm run build
  • cd frontend && npm run lint: passed with warnings only (existing and new <img> warnings, a React Compiler/TanStack warning, generated Convex eslint-disable warnings, and an existing analytics hook dependency warning).
  • cd frontend && npm run build

Notes for reviewers

The biggest behavioral change is that row collection can now move from one-off agent investigation to reusable browser extractors when the dataset has a stable page family. The provider/model and schema-contract changes are included because the extractor builder needs explicit model selection, validation rules, and typed normalization to avoid producing brittle scripts.
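
To make the "validation rules and typed normalization" point concrete, here is a minimal sketch of how a column contract might be applied to a raw cell. The field names loosely follow the description above (nullable, validation regex, typed values), but `ColumnContract`, `CellResult`, and `applyContract` are hypothetical and do not reflect the actual schema types in this PR.

```typescript
// Hypothetical contract shape, loosely based on the fields named in this description.
type ColumnType = "string" | "number" | "boolean";

interface ColumnContract {
  name: string;
  type: ColumnType;
  nullable: boolean;
  validationRegex?: string; // tested against the trimmed raw text
}

type CellResult =
  | { ok: true; value: string | number | boolean | null }
  | { ok: false; error: string };

// Validates a raw scraped cell against its contract, then normalizes it to a typed value.
function applyContract(contract: ColumnContract, raw: string | null): CellResult {
  const text = raw === null ? "" : raw.trim();
  if (text === "") {
    return contract.nullable
      ? { ok: true, value: null }
      : { ok: false, error: `${contract.name}: value is required` };
  }
  if (contract.validationRegex && !new RegExp(contract.validationRegex).test(text)) {
    return { ok: false, error: `${contract.name}: failed validation regex` };
  }
  switch (contract.type) {
    case "number": {
      // Normalization hint assumed here: strip thousands separators.
      const parsed = Number(text.replace(/,/g, ""));
      return Number.isFinite(parsed)
        ? { ok: true, value: parsed }
        : { ok: false, error: `${contract.name}: not a number` };
    }
    case "boolean": {
      const lower = text.toLowerCase();
      if (lower === "true" || lower === "yes") return { ok: true, value: true };
      if (lower === "false" || lower === "no") return { ok: true, value: false };
      return { ok: false, error: `${contract.name}: not a boolean` };
    }
    default:
      return { ok: true, value: text };
  }
}

// Usage with two illustrative contracts.
const priceContract: ColumnContract = {
  name: "price",
  type: "number",
  nullable: false,
  validationRegex: "^[0-9,]+(\\.[0-9]+)?$",
};
const noteContract: ColumnContract = { name: "note", type: "string", nullable: true };

console.log(applyContract(priceContract, " 1,234.50 "));
console.log(applyContract(priceContract, "call for price"));
console.log(applyContract(noteContract, null));
```

Rejecting a cell at this boundary gives an extractor repair step a specific error to act on, instead of letting a malformed value reach the dataset.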

coderabbitai Bot commented Jun 16, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 35b7a475-7c75-430b-8483-f2d2466b6887

AdamEXu commented Jun 16, 2026

Contributor Author

WIP: still need to add sandboxing and other important safeguards like that before this is ready.

AdamEXu commented Jun 16, 2026

Contributor Author

have not tested the latest commit at all yet btw
