Skip to content

pass ddtrace_globals explicitly through sidecar/telemetry GSHUTDOWN#3886

Merged
cataphract merged 3 commits into
masterfrom
glopes/fix-gshutdown-crash
May 19, 2026
Merged

pass ddtrace_globals explicitly through sidecar/telemetry GSHUTDOWN#3886
cataphract merged 3 commits into
masterfrom
glopes/fix-gshutdown-crash

Conversation

@cataphract
Copy link
Copy Markdown
Contributor

In ZTS mode, ts_free_id iterates all per-thread TSRM storage slots from the main thread, calling zm_globals_dtor_ddtrace for each one. The dtor receives the correct zend_ddtrace_globals* for the dead worker thread, but ddtrace_sidecar_gshutdown() and
ddtrace_telemetry_flush_bgs_metrics_if_due() were using DDTRACE_G() which reads the thread's TSRM slot. Once ts_free_id has processed the main thread's slot and nulled it, subsequent calls to DDTRACE_G(sidecar) dereference NULL → SIGSEGV.

Fix: thread zend_ddtrace_globals* explicitly through both functions and replace all DDTRACE_G(field) with ddtrace_globals->field. Add DDTRACE_GLOBALS_PTR() macro for the non-GSHUTDOWN call site in ddtrace_telemetry_finalize().

Add ZtsGshutdownTests integration test (runs on ZTS + -PcheckCoreDumps) that reproduces the crash via MaxConnectionsPerChild 1 and verifies no core dump is produced. Enable -PcheckCoreDumps in the appsec integration CI jobs. Also wire up core-dump detection infrastructure in AppSecContainer (ulimit, core_pattern, copy cores to test-logs, detect on close).

The test fails around 80% of the time on my machine.

Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

In ZTS mode, ts_free_id iterates all per-thread TSRM storage slots from
the main thread, calling zm_globals_dtor_ddtrace for each one. The dtor
receives the correct zend_ddtrace_globals* for the dead worker thread,
but ddtrace_sidecar_gshutdown() and
ddtrace_telemetry_flush_bgs_metrics_if_due() were using DDTRACE_G()
which reads the thread's TSRM slot. Once ts_free_id has processed the
main thread's slot and nulled it, subsequent calls to DDTRACE_G(sidecar)
dereference NULL → SIGSEGV.

Fix: thread zend_ddtrace_globals* explicitly through both functions and
replace all DDTRACE_G(field) with ddtrace_globals->field. Add DDTRACE_GLOBALS_PTR()
macro for the non-GSHUTDOWN call site in ddtrace_telemetry_finalize().

Add ZtsGshutdownTests integration test (runs on ZTS + -PcheckCoreDumps) that
reproduces the crash via MaxConnectionsPerChild 1 and verifies no core dump
is produced. Enable -PcheckCoreDumps in the appsec integration CI jobs.
Also wire up core-dump detection infrastructure in AppSecContainer
(ulimit, core_pattern, copy cores to test-logs, detect on close).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@cataphract cataphract requested review from a team as code owners May 19, 2026 00:13
@datadog-datadog-prod-us1
Copy link
Copy Markdown

datadog-datadog-prod-us1 Bot commented May 19, 2026

Pipelines  Tests

Fix all issues with BitsAI or with Cursor

⚠️ Warnings

🚦 12 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | php-app.amd64.DOC: [public.ecr.aws/lts/ubuntu:22.04, linux/amd64, 7.2]   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. 2 failed tests: AssertionError - No traces found for request FTSHDAPTMTMLEGBJHTIVYZYERIAGGSVIKXOR and AssertionError - telemetry data length is 0.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.0]   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Connection timed out while attempting to access request-replayer service at http://request-replayer:80/replay and http://request-replayer:80/replay-stats.

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.1]   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. 3 tests failed: process timed out while sending traces, unable to install live debugger metric probe, and unable to install live debugger span probe.

View all 12 failed jobs.

❄️ 1 New flaky test detected

no crash during GSHUTDOWN when MaxConnectionsPerChild 1 triggers ZTS worker lifecycle() from com.datadog.appsec.php.integration.ZtsGshutdownTests   View in Datadog (Fix with Cursor)
java.lang.AssertionError: Apache worker segfaulted during GSHUTDOWN:
AH00558: apache2: Could not reliably determine the server&#39;s fully qualified domain name, using 172.17.0.4. Set the &#39;ServerName&#39; directive globally to suppress this message
. Expression: errorLog.contains(exit signal Segmentation)

java.lang.AssertionError: Apache worker segfaulted during GSHUTDOWN:
AH00558: apache2: Could not reliably determine the server&#39;s fully qualified domain name, using 172.17.0.4. Set the &#39;ServerName&#39; directive globally to suppress this message
. Expression: errorLog.contains(exit signal Segmentation)
	at org.codehaus.groovy.runtime.InvokerHelper.createAssertError(InvokerHelper.java:416)
	at com.datadog.appsec.php.integration.ZtsGshutdownTests.no crash during GSHUTDOWN when MaxConnectionsPerChild 1 triggers ZTS worker lifecycle(ZtsGshutdownTests.groovy:107)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
...

New test introduced in this PR is flaky.

View in Flaky Test Management

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.70% (-0.05%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 847641e | Docs | Datadog PR Page | Give us feedback!

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 19, 2026

Benchmarks [ tracer ]

Benchmark execution time: 2026-05-19 02:53:43

Comparing candidate commit 527a671 in PR branch glopes/fix-gshutdown-crash with baseline commit 9fa5e9e in branch master.

Found 2 performance improvements and 1 performance regressions! Performance is the same for 191 metrics, 0 unstable metrics.

scenario:ComposerTelemetryBench/benchTelemetryParsing

  • 🟥 execution_time [+0.358µs; +1.842µs] or [+2.183%; +11.232%]

scenario:ContextPropagationBench/benchInject64Bit

  • 🟩 execution_time [-1035.627ns; -696.373ns] or [-3.366%; -2.263%]

scenario:PDOBench/benchPDOOverhead-opcache

  • 🟩 execution_time [-13.230µs; -6.620µs] or [-4.212%; -2.108%]

Doesn't fix the fact this is called repeatedly in the main thread.
When called by `ts_free_id` the threads whose data we want to free are
gone. For this to work propoerly we would need to register an actual
thread data destructor.
@cataphract cataphract merged commit a52b578 into master May 19, 2026
2109 of 2121 checks passed
@cataphract cataphract deleted the glopes/fix-gshutdown-crash branch May 19, 2026 11:44
@github-actions github-actions Bot added this to the 1.20.0 milestone May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants