android: harden phase-3.2 datapath validation#776
android: harden phase-3.2 datapath validation#776TryDying wants to merge 11 commits intotailscale:mainfrom
Conversation
- Add AGENTS.md with project overview, documentation index, architecture highlights - Add docs/01-项目指南.md: Project overview and quick start guide - Add docs/02-开发指南.md: Development workflow and feature guidelines - Add docs/03-技术指南.md: Architecture design and technical stack - Add docs/04-更新日志.md: Version history and bug fix tracking - Add PROGRESS.md: High-signal lessons learned log template All documentation follows established language and maintenance rules.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: trydying <trydying@mail.ustc.edu.cn>
There was a problem hiding this comment.
Pull request overview
This PR hardens the Android phase-3.2 datapath validation prototype by adding stable flow correlation, richer lifecycle/relay observability, host-side baseline services, and end-to-end adb-driven validation scripts.
Changes:
- Add tsocks datapath routing rules, stable
flow_id, relay/resource snapshots, and TCP lifecycle logging across the Go datapath stack. - Introduce reusable host-side HTTP/TCP baseline test services plus scripts to trigger, run, and machine-verify scenarios (including concurrency and wrong-port
/32behavior). - Add Android debug-only adb harness pieces (WorkManager worker + activity) to drive probes and datapath HTTP checks; update docs/progress records and bump versionCode.
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/tsocks-test-trigger.sh | Scenario-based adb trigger for broadcast/activity-based tests. |
| scripts/tsocks-test-services-stop.sh | Stops the host-side baseline test server via pidfile. |
| scripts/tsocks-test-services-start.sh | Starts the host-side baseline test server and runs a health check. |
| scripts/tsocks-test-services-health.sh | Host-side readiness checks for HTTP/TCP baselines. |
| scripts/tsocks-test-run-all.sh | Orchestrates build/install/connect + runs scenario suite and summaries. |
| scripts/tsocks-test-phase32.sh | Phase-3.2 validation suite (baseline/concurrency/wrong-port/lifecycle). |
| scripts/tsocks-test-pass-fail.sh | Log-based PASS/FAIL extraction and machine-verifiable assertions. |
| scripts/tsocks-test-logs.sh | Convenience logcat filter for tsocks validation tags. |
| scripts/tsocks-test-install.sh | Installs the APK via make install. |
| scripts/tsocks-test-env.sh | Discovers default LAN/tailnet hosts and exports baseline ports. |
| scripts/tsocks-test-build.sh | Builds the APK via make apk. |
| scripts/tsocks_test_server.py | Threaded HTTP/TCP baseline server for healthz and TCP behaviors. |
| PROGRESS.md | Adds Chinese progress/lessons-learned entry for phase-3.2 prototype. |
| libtailscale/tsocks.go | Implements shared probe execution and datapath SOCKS relay with logging. |
| libtailscale/tsocks_rules.go | Centralizes rules, injected-route target derivation, and flow ID generation. |
| libtailscale/tsocks_rules_test.go | Unit tests for rule matching, injected routes, offload state, flow IDs. |
| libtailscale/tsocks_observability.go | Adds relay start/end snapshots and close-reason classification. |
| libtailscale/step0_tun.go | Adds “step0” gVisor proof-stack interception and lifecycle event logging. |
| libtailscale/net.go | Injects /32 routes and wraps the TUN device to enable step0 interception. |
| libtailscale/interfaces.go | Extends Application interface with RunTsocksProbe for adb automation. |
| libtailscale/backend.go | Wires tsocks controller into backend/netstack TCP flow handling. |
| android/src/main/java/com/tailscale/ipn/IPNReceiver.java | Adds debug-only broadcast action to enqueue probe worker requests. |
| android/src/main/java/com/tailscale/ipn/DatapathTestActivity.kt | Debug-only activity to initiate a raw-socket HTTP datapath request. |
| android/src/main/java/com/tailscale/ipn/AdbTcpHttpTestWorker.kt | WorkManager worker that calls into libtailscale RunTsocksProbe. |
| android/src/main/java/com/tailscale/ipn/AdbTcpHttpTestContract.kt | Contract constants for adb test actions/extras/log tags. |
| android/src/main/AndroidManifest.xml | Declares DatapathTestActivity component. |
| android/build.gradle | Bumps Android versionCode. |
| AGENTS.md | Adds repository workflow/documentation rules and validated commands. |
| .gitignore | Ignores local dev artifacts (.envrc, etc.). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for _, routeTarget := range tsocksInjectedRouteTargets() { | ||
| if err := builder.AddRoute(routeTarget.String(), 32); err != nil { | ||
| return err | ||
| } | ||
| b.logger.Logf("updateTUN: added tsocks injected route %s/32", routeTarget) |
There was a problem hiding this comment.
tsocksInjectedRouteTargets() currently returns hard-coded public IPs, and these /32 routes are added to the VPN unconditionally on every updateTUN call. That will divert traffic to those IPs into the TUN for all users, which is high-risk outside the test harness. Please gate these injected routes behind an explicit debug/feature flag (and ensure the default is disabled), or only add them when running the phase-3.x validation flow.
| if tunDev, err = newStep0Tun(tunDev, b.appCtx, b.tsocks); err != nil { | ||
| closeFileDescriptor() | ||
| return err | ||
| } | ||
| b.logger.Logf("updateTUN: wrapped TUN device for step0") |
There was a problem hiding this comment.
Wrapping the TUN with step0Tun unconditionally changes packet handling and enables gVisor listeners for the intercept targets in all builds. This should be guarded behind a debug/feature flag (or only enabled when tsocks/datapath validation is explicitly turned on), otherwise production users may see unexpected routing/overhead or failures if the SOCKS server isn’t reachable.
| if err != nil { | ||
| return nil, fmt.Errorf("netstack.Create: %w", err) | ||
| } | ||
| ns.GetTCPHandlerForFlow = b.tsocks.datapathHandler |
There was a problem hiding this comment.
ns.GetTCPHandlerForFlow is set unconditionally to tsocks.datapathHandler, which enables offload behavior for matching flows in all runs. Given the rules include hard-coded public IPs and a hard-coded SOCKS server, this needs a runtime/compile-time gate so the handler is only installed when tsocks validation is explicitly enabled; otherwise normal users can have unrelated traffic hijacked or fail when the SOCKS server is unreachable.
| if decision.Route != tsocksRouteTailnetSocks || !slices.Contains(tsocksInterceptTargets(), target) { | ||
| return false |
There was a problem hiding this comment.
shouldIntercept calls tsocksInterceptTargets() and then slices.Contains(...) on every packet. tsocksInterceptTargets() allocates and sorts a slice each call, so this becomes a hot-path overhead. Consider precomputing the intercept targets once (e.g., store a map/set in step0Tun during init) and reusing it for membership checks.
| func tsocksShouldOffloadTarget(dst netip.AddrPort) bool { | ||
| for _, target := range tsocksInterceptTargets() { | ||
| if target == dst { | ||
| return true | ||
| } | ||
| } | ||
| return false |
There was a problem hiding this comment.
tsocksShouldOffloadTarget calls tsocksInterceptTargets(), which rebuilds and sorts the intercept target list every time. This is avoidable overhead in routing/packet-processing paths. Consider caching the computed intercept targets (and/or a set) at init time since tsocksDatapathRules are static.
| # | ||
|
|
||
| resolve_default_lan_host() { | ||
| ip -4 -o addr show up scope global | awk '$2 != "tailscale0" { split($4, a, "/"); print a[1]; exit }' |
There was a problem hiding this comment.
resolve_default_lan_host assumes the ip command exists. Because most caller scripts run with set -e and source this file, missing ip will abort the whole test flow. Consider guarding with command -v ip (and returning an empty/default value with a clear error message) to make failures more actionable across dev environments.
| ip -4 -o addr show up scope global | awk '$2 != "tailscale0" { split($4, a, "/"); print a[1]; exit }' | |
| if command -v ip >/dev/null 2>&1; then | |
| ip -4 -o addr show up scope global | awk '$2 != "tailscale0" { split($4, a, "/"); print a[1]; exit }' | |
| else | |
| echo "tsocks-test-env: 'ip' command not found; leaving TSOCKS_TEST_LAN_HOST unset unless provided explicitly" >&2 | |
| fi |
| assert_contains "socks-close" "event=conn_close .*closeReason=" "$log_file" | ||
| assert_contains "socks-connect" "event=socks_connect .*stage=connect_success" "$log_file" | ||
| assert_not_contains "socks-test-fail" "event=TEST_FAIL" "$log_file" | ||
| assert_not_contains "socks-cross-target" "flow_id=.*dst=104\.18\.26\.120:80.*\n.*flow_id=.*dst=104\.18\.27\.120:80" "$log_file" || true |
There was a problem hiding this comment.
This cross-target assertion is currently ineffective: grep -E doesn’t match across newlines, and the check is also silenced with || true, so it can’t ever fail the run. If you want to detect flow_id reuse across different dst values, use a line-oriented check (e.g., extract flow_id->dst mappings and assert 1:1) or drop this dead assertion.
| assert_not_contains "socks-cross-target" "flow_id=.*dst=104\.18\.26\.120:80.*\n.*flow_id=.*dst=104\.18\.27\.120:80" "$log_file" || true | |
| if ! awk ' | |
| { | |
| flow_id = "" | |
| dst = "" | |
| if (match($0, /flow_id=[^[:space:]]+/)) { | |
| flow_id = substr($0, RSTART + 8, RLENGTH - 8) | |
| } | |
| if (match($0, /dst=[^[:space:]]+/)) { | |
| dst = substr($0, RSTART + 4, RLENGTH - 4) | |
| } | |
| if (flow_id != "" && dst != "") { | |
| if ((flow_id in seen_dst) && seen_dst[flow_id] != dst) { | |
| printf "flow_id %s reused across dst values: %s and %s\n", flow_id, seen_dst[flow_id], dst > "/dev/stderr" | |
| exit 1 | |
| } | |
| seen_dst[flow_id] = dst | |
| } | |
| } | |
| ' "$log_file"; then | |
| printf 'ASSERT FAIL [%s]: flow_id reused across different dst values\n' "socks-cross-target" >&2 | |
| exit 1 | |
| fi |
| <activity | ||
| android:name=".DatapathTestActivity" | ||
| android:excludeFromRecents="true" | ||
| android:exported="true" |
There was a problem hiding this comment.
DatapathTestActivity is declared as exported in the main manifest. Even though it exits immediately when !BuildConfig.DEBUG, in debug builds any app can start it and provide an arbitrary url extra, turning it into a network request primitive. Prefer moving this activity declaration to a debug-only manifest/source set, or mark it non-exported / protect it with a signature permission so it’s not externally invokable.
| android:exported="true" | |
| android:exported="false" |
| headers := fmt.Sprintf("%s %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\nUser-Agent: tailscale-android-tsocks-test\r\n", method, path, req.Host) | ||
| hostHeader := strings.TrimSpace(req.HostHeader) | ||
| if hostHeader == "" { | ||
| hostHeader = req.Host | ||
| } | ||
| headers = fmt.Sprintf("%s %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\nUser-Agent: tailscale-android-tsocks-test\r\n", method, path, hostHeader) |
There was a problem hiding this comment.
probeHTTP assigns headers twice (first using req.Host, then immediately overwriting with hostHeader). This is dead code and makes the request construction harder to follow. Please remove the first assignment and build the headers once using the chosen Host header value.
| headers := fmt.Sprintf("%s %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\nUser-Agent: tailscale-android-tsocks-test\r\n", method, path, req.Host) | |
| hostHeader := strings.TrimSpace(req.HostHeader) | |
| if hostHeader == "" { | |
| hostHeader = req.Host | |
| } | |
| headers = fmt.Sprintf("%s %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\nUser-Agent: tailscale-android-tsocks-test\r\n", method, path, hostHeader) | |
| hostHeader := strings.TrimSpace(req.HostHeader) | |
| if hostHeader == "" { | |
| hostHeader = req.Host | |
| } | |
| headers := fmt.Sprintf("%s %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\nUser-Agent: tailscale-android-tsocks-test\r\n", method, path, hostHeader) |
| minSdkVersion 26 | ||
| targetSdkVersion 35 | ||
| versionCode 468 | ||
| versionCode 503 |
There was a problem hiding this comment.
AGENTS.md states that after modifying any code, android/build.gradle versionCode must be incremented by 1. This change jumps from 468 to 503, which conflicts with that rule. Either adjust the documented rule (if larger bumps are allowed) or update versionCode to match the stated +1 policy.
| versionCode 503 | |
| versionCode 469 |
Summary
flow_id, richer lifecycle logs, and relay resource snapshots/32behavior, and TCP lifecycle casesTesting
make apkmake installBUILD_FIRST=false INSTALL_FIRST=false sh scripts/tsocks-test-phase32.sh