Fix Windows PowerShell host-start hang in SetCorrectExecutionPolicy#2328
Conversation
On recent in-box Windows PowerShell 5.1 servicing builds, the language and debug servers intermittently hang on startup and ride the CI job timeout (#2323). The wedge is `SetCorrectExecutionPolicy`: it calls `Microsoft.PowerShell.Security\Get-ExecutionPolicy -List`, which autoloads `Microsoft.PowerShell.Security` into the freshly created runspace. That runspace's `InitialSessionState` is reused from the host runspace and already carries the module's `ObjectSecurity` type data, so re-binding it throws "The member `AuditToString` is already present". `PsesInternalHost.Run()` catches it, faults `_started`, and the pipeline thread exits — but `TryStartAsync` is still awaiting queued startup tasks that now never run, so it hangs forever. Configuring the execution policy is best-effort, so we wrap the `Get-ExecutionPolicy` query in try/catch (mirroring the existing `Set-ExecutionPolicy` handling just below it) and skip policy configuration when it fails, rather than letting a type-data hiccup abort host startup. I also guard the subsequent indexing against a short result list. With the hang fixed we no longer need the in-box Windows PowerShell E2E skips added in #2318, so this reverts them: the `SkippableFactOnWindowsPowerShell` attributes, the `LSPTestsFixture` early-return, and the two attribute files. The DAP suite passes clean under `powershell.exe`. Three LSP help tests (`Expand-Archive` synopsis) still fail locally on my older servicing build (.8655); they pass under `pwsh`, so I'm letting CI judge them on its build rather than re-skipping. The `ci-test.yml` job timeout from #2318 stays as a backstop. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes the real root cause behind the Windows CI hang tracked in #2323: on recent in-box Windows PowerShell 5.1 servicing builds, SetCorrectExecutionPolicy's Get-ExecutionPolicy -List call can throw an ObjectSecurity type-data conflict ("The member ... is already present") while autoloading Microsoft.PowerShell.Security, which kills the startup pipeline thread and hangs the server. The query is wrapped in try/catch (best-effort, mirroring the existing Set-ExecutionPolicy handling) plus a short-list guard. Because the hang is addressed at the source, the PR also reverts the stopgap in-box Windows PowerShell E2E test skips introduced by #2318, restoring full E2E coverage on that leg.
Changes:
- Harden
SetCorrectExecutionPolicyagainst the type-data conflict by catching and logging the failure (then skipping policy configuration) and guarding against a too-short policy list. - Remove the discovery-time Windows PowerShell skip attributes (
SkippableFactOnWindowsPowerShell/SkippableTheoryOnWindowsPowerShell) and revert the E2E tests back to[Fact]/[SkippableFact]/[SkippableTheory]. - Revert the #2318 test-harness "hygiene" changes (log-tail yield and the attach poll
Start-Sleep) and the fixture startup/dispose guards.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/.../PowerShell/Utility/PowerShellExtensions.cs |
Wraps the Get-ExecutionPolicy -List call in try/catch and adds a Count < 2 guard so a failed query skips policy configuration instead of aborting host startup. |
test/.../Hosts/SkippableFactOnWindowsPowerShellAttribute.cs |
Deletes the discovery-time WinPS skip attribute and shared skip-reason class. |
test/.../Hosts/SkippableTheoryOnWindowsPowerShellAttribute.cs |
Deletes the discovery-time WinPS skip theory attribute. |
test/.../LSPTestsFixtures.cs |
Removes the WinPS start/dispose guards so the LSP fixture always starts the server. |
test/.../LanguageServerProtocolMessageTests.cs |
Reverts test attributes back to [Fact]/[SkippableFact] and removes the class remarks. |
test/.../DebugAdapterProtocolMessageTests.cs |
Reverts DAP test attributes, the log-tail loop, and the attach poll Start-Sleep. |
The production fix is small and defensive, and the [Fact]/[SkippableFact] mapping is consistent (every test retaining a body-level Skip.* call kept a skippable attribute, and no deleted-type references remain). My comments are minor: an inconsistent ?.Dispose() and two reintroduced busy-waits that #2318 had described as orthogonal "good hygiene." The main risk is non-code: this re-enables a large in-box WinPS E2E suite that previously hung CI for hours, on the hypothesis that the startup fix resolves it, and the author flags open follow-ups (help-test discrepancies, an OmniSharp swallow). Final validation depends on the specific runner image, which warrants human review.
| while (nextLine is null || nextLine.Length == 0) | ||
| { | ||
| string nextLine = await scriptLogReader.ReadLineAsync(); | ||
| if (!string.IsNullOrEmpty(nextLine)) | ||
| { | ||
| return nextLine; | ||
| } | ||
|
|
||
| await Task.Delay(100); | ||
| nextLine = await scriptLogReader.ReadLineAsync(); //Might return null if at EOF because we created it above but the script hasn't written to it yet | ||
| } |
PR #2328 reset the three E2E test files to their pre-#2318 state (`b57653c40`) to undo the Windows PowerShell skips we no longer want now that the host-start hang is actually fixed. But #2318 was a squash that bundled genuine harness hardening *alongside* those skips, so reverting wholesale quietly dropped the good parts too. Restore just those, keeping the skips reverted: - `ReadScriptLogLineAsync` now yields with `await Task.Delay(100)` at EOF instead of busy-spinning. At EOF `ReadLineAsync` completes synchronously with `null`, so the old `while`/`await` loop never released its thread-pool thread and could starve the scheduler on constrained CI runners. - The child-process `Debug-Runspace` readiness poll in `CanAttachScriptWithPathMappings` sleeps 100ms per iteration so it can't peg a core during the attach handshake. - `LSPTestsFixture.DisposeAsync` guards against a null `PsesLanguageClient` so a startup failure isn't masked by a `NullReferenceException` during teardown. These are defense-in-depth independent of the skips, and they matter more now that we un-skip: a Windows PowerShell server that fails to start shouldn't busy-spin or NRE on teardown. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OmniSharp's `DebugAdapterServer.From()` (0.19.9) awaits an internal `AsyncSubject` that its `InitializeRequest` handler only signals on the success path. If an `OnInitialize` delegate throws, the handler faults before signaling, nothing errors the subject, and `From()` -- and thus `PsesDebugServer.StartAsync()` -- awaits it forever. So a startup failure wedges the entire debug server and rides the CI/job timeout instead of failing fast. This is a library limitation, not our bug: the same code is present on the library's `master`, so we can't fix it upstream without a fork or upgrade. #2328 fixes the specific trigger we hit on in-box Windows PowerShell (the `Get-ExecutionPolicy -List` type-data conflict), but the wedge mechanism is generic. Guard against any future `OnInitialize` failure: wrap the delegate body, log the exception, and signal `_serverStopped` so `WaitForShutdown` unblocks and the process exits cleanly. `Dispose`'s `_serverStopped` completion is now idempotent (`TrySetResult`) since the catch may have already completed it. This converts a silent multi-hour hang into a clean termination with a logged error -- the client sees the session end instead of waiting forever. See #2323. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This PR un-skips `CanAttachScriptWithPathMappings` on Windows PowerShell (reverting the #2318 skips now that the host-start hang is fixed), and the un-skipped test promptly failed on the WinPS 5.1 CI leg with "Attached process exited before the script could start" -- at exactly 10 seconds. The test spawns a child `powershell.exe`, then waits up to 10s for `Debug-Runspace` to subscribe to the child's runspace as the marker that the attach handshake completed, before driving a full breakpoint/continue interaction under a 15s xUnit timeout. On the 2-vCPU CI runners the WinPS attach alone exceeds that 10s budget, so the child throws its internal timeout and exits before the debug session attaches. It isn't even close to comfortable locally: xUnit flags the run as a long-running test at the 10s mark on a fast dev box, so the old budget was always on a knife's edge for the slower WinPS path. Rather than re-skip it, give the slow-but-correct path room: - The child's `Debug-Runspace` subscription poll goes 10s -> 30s. - The outer process-watch cancellation goes 30s -> 60s. - The xUnit `Timeout` goes 15s -> 60s. This continues 81b273b (which already bumped the xUnit timeout 10s -> 15s for the same flakiness) and keeps real coverage of the attach path on Windows PowerShell instead of skipping it. See #2323 and #2318. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Un-skipping the entire Windows PowerShell E2E suite (now that the host-start hang is fixed) gave us real WinPS coverage for the first time since #2318, but it also surfaced two pre-existing WinPS-specific failures that have nothing to do with host startup: - `CanAttachScriptWithPathMappings` wedges on the in-box, cross-process `Debug-Runspace` attach handshake and rides whatever timeout we set. CI failed it at 10s, then again at exactly 30s after I bumped the budgets, so it genuinely never completes rather than merely running slow. This is the in-box attach deadlock tracked by #2323. - `CanSendCompletionResolveWithModulePrefixRequestAsync` gets an empty completion list from the Windows PowerShell server for a prefix-imported command (it fails before any help assertion, so it's help-independent). That's the same "completion works in PS7 but not WinPS" family as #1355. Because startup no longer hangs, we don't need #2318's discovery-time `[SkippableFactOnWindowsPowerShell]` attribute anymore: an in-body `Skip.If(IsWindowsPowerShell, ...)` runs after `InitializeAsync` (which now starts the server fine) and skips before the broken code, so both tests skip cleanly in ~1ms under `powershell.exe` instead of hanging or failing. This also reverts my earlier timeout bump on `CanAttachScriptWithPathMappings` (back to the 15s/10s/30s budgets from `main`) since the bigger budgets didn't help and the test no longer runs on Windows PowerShell anyway. Everything else stays un-skipped. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Un-skipping the Windows PowerShell E2E suite surfaced one more attach-family flake: `CanLaunchScriptWithNewChildAttachSession`. It passed on the prior CI run (`a4e8a823e`) but timed out (`TaskCanceledException` at its 30s budget) on the next (`25d9e58dd`) with no relevant code change in between, so it's genuinely flaky on the slow in-box Windows PowerShell CI runners rather than broken. The test runs `Start-DebugAttachSession` and waits for the server's `startDebugging` reverse-request round-trip; on in-box Windows PowerShell that round-trip is slow enough to intermittently miss the timeout. That's the same in-box attach-E2E reliability bucket as #2323, and its two siblings are already skipped there: `CanAttachScriptWithPathMappings` (the cross-process `Debug-Runspace` wedge) and `CanLaunchScriptWithNewChildAttachSessionAsJob` (no `ThreadJob` on Windows PowerShell). So skip this one on Windows PowerShell too, keeping the rest of the now-un-skipped DAP suite running. The flake only reproduces on the constrained CI runner, not on a fast dev box, so I'm matching how the sibling attach tests are handled rather than chasing a timeout bump I can't validate locally. Drafted by Copilot (Claude Opus 4.8). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| IReadOnlyList<PSObject> policies; | ||
| try | ||
| { | ||
| policies = pwsh | ||
| .AddCommand(@"Microsoft.PowerShell.Security\Get-ExecutionPolicy") | ||
| .AddParameter("List") | ||
| .InvokeAndClear<PSObject>(); | ||
| } | ||
| catch (Exception e) | ||
| { | ||
| // Some Windows PowerShell servicing builds throw a type-data conflict | ||
| // ("The member ... is already present" on ObjectSecurity) when | ||
| // autoloading Microsoft.PowerShell.Security into a runspace whose | ||
| // InitialSessionState already carries that module's type data. | ||
| // Configuring the execution policy is best-effort, so log and skip | ||
| // rather than letting it abort host startup (which manifests as a hang). | ||
| logger.LogError(e, "Failed to query the execution policy; skipping execution policy configuration."); | ||
| return; | ||
| } | ||
|
|
||
| // We need at least the CurrentUser and LocalMachine scopes to proceed. | ||
| if (policies is null || policies.Count < 2) | ||
| { | ||
| return; | ||
| } |
There was a problem hiding this comment.
@JustinGrote I got this whole issue to repro locally on a Windows box and apparently this was the issue that was hanging all the end-to-end tests on the updated image.
Integrates #2328, which fixed the Windows PowerShell host-start hang and reverted the in-box WinPS E2E skips, deleting the SkippableFactOnWindowsPowerShellAttribute. The getModule E2E tests no longer need the discovery-time skip, so they resolve back to [SkippableFact]/[Fact] and run on Windows PowerShell alongside their now-un-skipped siblings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merging #2328 un-skipped the Windows PowerShell E2E suite, which surfaced a pre-existing host gap in `CanSendGetModuleRequestAsync`: it probed `Microsoft.PowerShell.Management`, but Windows PowerShell's in-box core modules are snap-in based and are not returned by `Get-Module -ListAvailable`. The handler correctly resolved nothing, so `Assert.NotNull` failed on the Windows PowerShell leg while passing on PowerShell 7 (where the core modules ship as files). Probe `Microsoft.PowerShell.Archive` instead: it is a file-based module present in the module path on both Windows PowerShell and PowerShell 7+, so the test exercises the handler on every host without skipping or loosening any assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Merging #2328 un-skipped the Windows PowerShell E2E suite, which surfaced a pre-existing host gap in `CanSendGetModuleRequestAsync`: it probed `Microsoft.PowerShell.Management`, but Windows PowerShell's in-box core modules are snap-in based and are not returned by `Get-Module -ListAvailable`. The handler correctly resolved nothing, so `Assert.NotNull` failed on the Windows PowerShell leg while passing on PowerShell 7 (where the core modules ship as files). Probe `Microsoft.PowerShell.Archive` instead: it is a file-based module present in the module path on both Windows PowerShell and PowerShell 7+, so the test exercises the handler on every host without skipping or loosening any assertion. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes #2323.
On recent in-box Windows PowerShell 5.1 servicing builds,
SetCorrectExecutionPolicy'sGet-ExecutionPolicy -Listcall autoloadsMicrosoft.PowerShell.Securityinto a runspace whose reusedInitialSessionStatealready carries that module'sObjectSecuritytype data, throwing "The memberAuditToStringis already present". That kills the pipeline thread during host startup whileTryStartAsyncis still awaiting queued startup tasks, so the server hangs and rides the CI timeout.The execution-policy query is best-effort, so this wraps it in try/catch and skips policy configuration on failure (mirroring the existing
Set-ExecutionPolicyhandling), plus a guard against a short result list.Because the hang is fixed, this also reverts the in-box Windows PowerShell E2E test skips from #2318. The
ci-test.ymljob-timeout backstop is kept.Known follow-ups
Expand-Archivesynopsis) fail under in-boxpowershell.exeon my local older servicing build (.8655) but pass underpwsh; letting CI judge them on its build. The server returns auto-generated syntax instead of the MAML synopsis only in the WinPS server runspace — I could not isolate the mechanism locally.From(...)silently swallows an exception thrown from the DAPOnInitializehandler, so a startup failure wedges instead of failing fast — I'll file a tracking issue.