Conversation
Pull request overview
This PR refactors the Python SLO workload runner under tests/slo/ to use environment-variable-driven configuration (workload selection, YDB connection, duration, OTLP exporter settings) instead of CLI subcommands, and updates metrics and CI workflows accordingly.
Changes:
- Replace subcommand-based orchestration with a single "run all" flow selected via `WORKLOAD_NAME` and related env vars.
- Rework metrics initialization to use standard OpenTelemetry env vars and switch latency quantiles to HDR histogram-based calculation.
- Update SLO GitHub workflows and Docker image interface to match the new env-var-based runner.
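The env-var-driven workload selection described above can be sketched as a plain dispatch on `WORKLOAD_NAME`. This is an illustrative sketch, not the PR's actual code; the names `RUNNERS` and `select_runner` are assumptions:

```python
import os

# Illustrative registry; the real PR maps WORKLOAD_NAME to runner classes.
RUNNERS = {"table": "TableRunner", "topic": "TopicRunner"}

def select_runner(env=None):
    """Pick a runner from the WORKLOAD_NAME env var, defaulting to 'table'."""
    env = os.environ if env is None else env
    name = env.get("WORKLOAD_NAME", "table")
    if name not in RUNNERS:
        raise ValueError(f"Unknown WORKLOAD_NAME: {name}")
    return RUNNERS[name]
```

Passing `env` explicitly keeps the selection testable without touching the process environment.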
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/slo/src/runners/topic_runner.py | Update metrics factory call to new signature. |
| tests/slo/src/runners/table_runner.py | Update metrics factory call to new signature. |
| tests/slo/src/root_runner.py | Replace subcommand runner with run_all() flow selected via WORKLOAD_NAME. |
| tests/slo/src/pyrightconfig.json | Add Pyright configuration for the SLO runner code. |
| tests/slo/src/options.py | Simplify CLI flags; inject connection/duration from environment variables. |
| tests/slo/src/jobs/table_jobs.py | Switch to WORKLOAD_NAME constant. |
| tests/slo/src/jobs/base.py | Always start metrics sender job (no CLI OTLP gating). |
| tests/slo/src/jobs/async_topic_jobs.py | Always start async metrics sender task (no CLI OTLP gating). |
| tests/slo/src/core/metrics.py | Move to standard OTel env vars and implement HDR-based latency quantiles. |
| tests/slo/src/main.py | Switch entrypoint to call run_all(). |
| tests/slo/requirements.txt | Replace quantile-estimator with hdrhistogram. |
| tests/slo/Dockerfile | Update documentation/comments and provide a default CMD for tuning flags. |
| pyproject.toml | Configure tool.ty extra paths for SLO sources. |
| .github/workflows/slo.yml | Refactor workload matrix and switch to ydb-slo-action v2 init/report. |
| .github/workflows/slo-report.yml | Switch report action reference to v2 and remove label-removal job. |
Comments suppressed due to low confidence (1)
.github/workflows/slo.yml:53
`concurrency.group` references `matrix.workload`, but the matrix is now defined under `matrix.sdk` (with `name`/`command`). This will make the workflow fail to evaluate at runtime. Update the concurrency group expression to use the new matrix key (e.g., `matrix.sdk.name`).

```yaml
concurrency:
  group: slo-${{ github.ref }}-${{ matrix.workload }}
  cancel-in-progress: true
```
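Applying the comment's suggestion, the corrected block would look like this (assuming the new matrix key is `sdk` with a `name` field, as the comment states):

```yaml
concurrency:
  group: slo-${{ github.ref }}-${{ matrix.sdk.name }}
  cancel-in-progress: true
```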
```python
# Metrics job calls push() with the cadence of --report-period.
# force_flush() makes the exporter send immediately.
with self._hdr_lock:
    snapshot = {k: v for k, v in self._hdr.items()}
```
In push(), the code copies the _hdr dict under the lock, but then reads percentiles from the same HdrHistogram instances after releasing the lock. Since stop() records into these histograms under the same lock, this creates a data race (HdrHistogram isn't thread-safe) and can lead to incorrect values or runtime errors. Keep the lock held while computing percentiles, or clone/snapshot each histogram under the lock before reading it.
Suggested change:

```diff
-snapshot = {k: v for k, v in self._hdr.items()}
+# Clone histograms under the lock to avoid concurrent mutation while reading.
+snapshot = {k: v.copy() for k, v in self._hdr.items()}
```
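A minimal sketch of the locking discipline the comment asks for. All names here are illustrative, and `FakeHistogram` is a stdlib stand-in for the non-thread-safe `HdrHistogram`:

```python
import threading

class FakeHistogram:
    """Stand-in for hdrhistogram.HdrHistogram (which is not thread-safe)."""
    def __init__(self):
        self.values = []

    def record_value(self, v):
        self.values.append(v)

    def get_value_at_percentile(self, p):
        ordered = sorted(self.values)
        idx = max(0, int(len(ordered) * p / 100) - 1)
        return ordered[idx]

class Metrics:
    def __init__(self):
        self._hdr_lock = threading.Lock()
        self._hdr = {"read": FakeHistogram()}

    def stop(self, kind, latency):
        # Writer path: records under the lock.
        with self._hdr_lock:
            self._hdr[kind].record_value(latency)

    def push(self):
        # Reader path: keep the lock held while computing percentiles,
        # so stop() cannot mutate a histogram mid-read.
        with self._hdr_lock:
            return {
                kind: {p: h.get_value_at_percentile(p) for p in (50, 95, 99)}
                for kind, h in self._hdr.items()
            }
```

The key point is that the percentile reads happen inside the `with self._hdr_lock:` block, not after it.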
```python
import asyncio
import ydb
import ydb.aio
import logging
from typing import Dict

from runners.topic_runner import TopicRunner
from core.metrics import WORKLOAD_NAME
from runners.table_runner import TableRunner
from runners.base import BaseRunner

logger = logging.getLogger(__name__)


class SLORunner:
    def __init__(self):
        self.runners: Dict[str, type(BaseRunner)] = {}

    def register_runner(self, prefix: str, runner_cls: type(BaseRunner)):
        self.runners[prefix] = runner_cls

    def run_command(self, args):
        subcommand_parts = args.subcommand.split("-", 1)
        if len(subcommand_parts) < 2:
            raise ValueError(f"Invalid subcommand format: {args.subcommand}. Expected 'prefix-command'")
```
asyncio and ydb.aio are imported but not used in this module. Flake8 will flag these as unused imports (F401). Remove them, or use the async driver path if it’s still intended here.
vgvoleg left a comment:
In this PR you've deleted async jobs, which is out of scope; you've reorganized arguments, which is out of scope; you've removed the ability to run jobs without the monitoring job; and you've added configs for tools we are not using.
Please keep the PR as simple as needed to solve your task.
🌋 SLO Test Results

🔴 2 workload(s) tested — 1 workload(s) exceeded failure thresholds

Threshold violations: sync-table:

Generated by ydb-slo-action
```diff
@@ -0,0 +1,4 @@
+{
```
Local tool configs should not be pushed to the main repo.
```diff
@@ -1,3 +1,9 @@
+[tool.ty.environment]
```
Local tool configs should not be pushed to the main repo.
```diff
 ---

-## Topic Chaos Testing (SLO)
+## SLO Testing
```
As we don't have a CI job for topics and they still have lots of issues during disconnects, this article was created for local topic development only. There's no need to make this doc cover table workloads as well, since that may lead to unnecessarily increased token consumption. Please revert it to topic chaos testing and update only the code snippets.
```python
if not getattr(self.args, "otlp_endpoint", None):
    return []

def _run_metric_job(self) -> list[Any]:
```
The old `_run_metric_job` had an early return: `if not getattr(self.args, "otlp_endpoint", None): return []`. This guard was removed in both the sync (base.py) and async (async_topic_jobs.py) versions — a background thread/task is now always started even when metrics are disabled.
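Restoring the old gating is a one-line guard; a sketch with a hypothetical `args` object and placeholder return values (not the PR's actual code):

```python
from types import SimpleNamespace

class MetricJobMixin:
    def __init__(self, args):
        self.args = args

    def _run_metric_job(self):
        # Skip starting the sender thread entirely when metrics are disabled.
        if not getattr(self.args, "otlp_endpoint", None):
            return []
        return ["<metrics-sender-thread>"]  # placeholder for the real thread
```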
```python
try:
    logger.info("[%s][async] Creating resources", workload_name)
    runner.create(args)
```
_run_all_async calls runner.create() and runner.cleanup() synchronously, but with a ydb.aio.Driver whose topic_client has async def create_topic/drop_topic/etc. The coroutines are never awaited — topic is never actually created, so the workload fails immediately.
```python
args = parser.parse_args()

# Aliases used by topic runner
```
args.path, args.consumer, args.partitions_count are aliased from args.topic_path, args.topic_consumer, args.topic_partitions at the bottom of parse_options() because the runner/job code was never updated to use the new names. Two names for the same field — the aliases should be removed and the runner code updated instead.
```python
def make_table_create_parser(subparsers):
    table_create_parser = subparsers.add_parser("table-create", help="Create tables and fill with initial content")
    add_common_options(table_create_parser)

def parse_options():
```
The old subcommand structure meant each subcommand only accepted its own relevant arguments — passing a table arg to a topic run was a parser error. Now all arguments are always accepted regardless of --workload-name, so e.g. --workload-name topic --min-partitions-count 42 silently ignores the table parameter. Mistakes won't be caught at parse time.
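Parse-time safety can be recovered without subcommands by validating flag/workload combinations after parsing. A hedged sketch (the flag names mirror the PR's, but the validation itself is an assumption):

```python
import argparse

def parse_options(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--workload-name", choices=("table", "topic"), required=True)
    parser.add_argument("--min-partitions-count", type=int, default=None)  # table-only
    args = parser.parse_args(argv)
    # Reject table-only flags for topic runs instead of silently ignoring them.
    if args.workload_name == "topic" and args.min_partitions_count is not None:
        parser.error("--min-partitions-count only applies to --workload-name table")
    return args
```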
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
```diff
 def create_metrics(args) -> BaseMetrics:
     """
     Factory used by SLO runners.

-    Metrics are enabled if either:
-    - OTLP_ENDPOINT env var is set, or
-    - `--otlp-endpoint` is provided (and non-empty)
-
-    If endpoint is empty, metrics are disabled (DummyMetrics).
+    Uses args.otlp_endpoint, args.workload_name, args.workload_ref from parsed CLI arguments.
+    If the endpoint is empty, returns a no-op DummyMetrics.
     """
-    endpoint = (environ.get("OTLP_ENDPOINT") or (otlp_endpoint or "")).strip()
+    endpoint = (args.otlp_endpoint or "").strip()

     if not endpoint:
-        logger.info("Creating dummy metrics (metrics disabled)")
+        logger.info("OTLP endpoint not configured — metrics disabled")
         return DummyMetrics()

-    logger.info("Creating OTLP metrics exporter to Prometheus: %s", endpoint)
     environ.setdefault("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf")
     environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint

+    logger.info("Creating OTLP metrics exporter (endpoint: %s)", endpoint)
```
```diff
 driver.stop(timeout=args.shutdown_time)

 parser.add_argument(
     "endpoint",
     nargs="?",
     default=os.environ.get("YDB_ENDPOINT", "grpc://localhost:2136"),
     help="YDB endpoint (default: $YDB_ENDPOINT)",
 )
-topic_parser.add_argument(
-    "--write-threads",
-    default=1,
-    type=int,
-    help="Number of threads for topic writing",
+parser.add_argument(
+    "db",
+    nargs="?",
+    default=os.environ.get("YDB_DATABASE", "/local"),
+    help="YDB database (default: $YDB_DATABASE)",
 )
```
```python
default="http://localhost:9090/api/v1/otlp/v1/metrics",
type=str,
help="Full Prometheus OTLP metrics endpoint (e.g. http://ydb-prometheus:9090/api/v1/otlp/v1/metrics). Empty to disable.",
```

Every flag supports a fallback chain: CLI arg > environment variable > hardcoded default.
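The fallback chain can be demonstrated with a small argparse sketch mirroring the `endpoint` argument above; passing `env` as a dict (an assumption made here for testability) keeps the precedence explicit:

```python
import argparse

def make_parser(env):
    """Build a parser whose default comes from `env`, so precedence is
    CLI arg > environment variable > hardcoded default."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "endpoint",
        nargs="?",
        default=env.get("YDB_ENDPOINT", "grpc://localhost:2136"),
    )
    return parser
```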
```yaml
- if: always()
  name: Upload logs
  uses: actions/upload-artifact@v4

workload_baseline_image: ydb-app-current
```
```dockerfile
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /src
COPY . /src

# Install runtime deps into an isolated venv so we can copy it into the final stage.
RUN python -m venv /opt/venv \
    && /opt/venv/bin/python -m pip install --no-cache-dir --upgrade pip \
    && /opt/venv/bin/pip install --no-cache-dir . \
    && /opt/venv/bin/pip install --no-cache-dir -r tests/slo/requirements.txt
RUN apt-get update && apt-get install -y --no-install-recommends gcc libc6-dev && rm -rf /var/lib/apt/lists/*

WORKDIR /src

FROM python:3.11-slim
# 1. YDB SDK
COPY setup.py pyproject.toml README.md requirements.txt ./
COPY ydb/ ydb/
RUN pip install --no-cache-dir .

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/opt/venv/bin:${PATH}"
# 2. SLO deps
COPY tests/slo/requirements.txt tests/slo/requirements.txt
RUN pip install --no-cache-dir -r tests/slo/requirements.txt
```
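The builder-stage comment above ("so we can copy it into the final stage") hints at the usual multi-stage pattern; a sketch of the final-stage copy (the stage name `builder` is an assumption, not taken from the PR):

```dockerfile
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv
# ... pip install steps populate /opt/venv ...

FROM python:3.11-slim
# Only the prebuilt venv is copied; gcc/libc6-dev never reach the final image.
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
```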
```python
# Latency gauges (fed from HDR histograms via push())
self._latency_p50 = self._meter.create_gauge(
    name="sdk_operation_latency_p50_seconds",
    unit="s",
    description="P50 operation latency in seconds.",
)
self._latency_p95 = self._meter.create_gauge(
    name="sdk_operation_latency_p95_seconds",
    unit="s",
    description="P95 operation latency in seconds.",
)
self._latency_p99 = self._meter.create_gauge(
    name="sdk_operation_latency_p99_seconds",
    unit="s",
    description="P99 operation latency in seconds.",
)
```