[SPARK-57191][YARN] Fix driver hang when MonitorThread encounters unexpected exception #56274

Open
shrirangmhalgi wants to merge 3 commits into apache:master from shrirangmhalgi:SPARK-57191-yarn-driver-hang
@shrirangmhalgi
Contributor

What changes were proposed in this pull request?

In YARN client mode, YarnClientSchedulerBackend's MonitorThread only catches InterruptedException / InterruptedIOException. If any other exception occurs during application monitoring (e.g., network failure, credential expiration, or other runtime errors), the thread dies silently. Since the driver JVM has active non-daemon threads (SparkUI, heartbeats), the process hangs indefinitely in a zombie state.

This patch adds a NonFatal catch clause that logs the error and calls sc.stop(), ensuring the driver shuts down cleanly.
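
The catch pattern described above can be sketched without any Spark dependency. This is a minimal illustration, not the actual patch: `MonitorSketch`, `monitor`, and `stop` are hypothetical stand-ins for `MonitorThread`, `Client.monitorApplication`, and `sc.stop()`, and `println` stands in for Spark's logging.

```scala
import java.io.InterruptedIOException
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Hypothetical stand-in for MonitorThread: `monitor` plays the role of
// Client.monitorApplication and `stop` the role of sc.stop().
class MonitorSketch(monitor: () => Unit, stop: () => Unit) extends Thread {
  setDaemon(true)

  override def run(): Unit = {
    try {
      monitor()
    } catch {
      // Interruption is the expected way to end monitoring.
      // This case must come first: InterruptedIOException would
      // otherwise also match NonFatal.
      case _: InterruptedException | _: InterruptedIOException =>
        println("Monitor interrupted, exiting")
      // Any other non-fatal error: log it and trigger shutdown
      // instead of letting the thread die silently.
      case NonFatal(e) =>
        println(s"Unexpected error in monitor: ${e.getMessage}")
        stop()
    }
  }
}

object MonitorSketchDemo {
  def main(args: Array[String]): Unit = {
    val stopped = new AtomicBoolean(false)
    val t = new MonitorSketch(
      () => throw new RuntimeException("simulated network failure"),
      () => stopped.set(true))
    t.start()
    t.join()
    println(s"stop called: ${stopped.get}")
  }
}
```

`NonFatal` deliberately does not match `InterruptedException`, `VirtualMachineError`, or `LinkageError`, so truly fatal errors still propagate rather than being handled here.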

Why are the changes needed?

In managed environments (cloud platform agents, workflow schedulers), a hung driver is indistinguishable from one doing legitimate post-execution work. This causes resource leakage, orphaned processes, and extended job timeout durations.

Does this PR introduce any user-facing change?

Yes. Previously, certain failures in the monitor thread caused the driver to hang forever. Now the driver shuts down cleanly with an error log.

How was this patch tested?

Added a new test in YarnClientSchedulerBackendSuite that mocks Client.monitorApplication to throw a RuntimeException and asserts that sc.stop() is called (observed via SparkListener.onApplicationEnd).
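
The shape of such a test can be sketched in plain Scala. This is an assumption-laden illustration rather than the test in this PR: `EndListener` is a hypothetical stand-in for `SparkListener.onApplicationEnd`, `startMonitor` for the monitor thread, and the failing function for the mocked `Client.monitorApplication`.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import scala.util.control.NonFatal

object MonitorTestSketch {
  // Hypothetical listener standing in for SparkListener.onApplicationEnd.
  trait EndListener { def onEnd(): Unit }

  // Runs `monitor` on a daemon thread and notifies the listener when a
  // non-fatal error escapes it (the analogue of sc.stop() being called).
  def startMonitor(monitor: () => Unit, listener: EndListener): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        try monitor()
        catch { case NonFatal(_) => listener.onEnd() }
      }
    })
    t.setDaemon(true)
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    val ended = new CountDownLatch(1)
    val listener = new EndListener {
      override def onEnd(): Unit = ended.countDown()
    }
    // The "mock": a monitor function that fails immediately.
    startMonitor(() => throw new RuntimeException("simulated failure"), listener)
    // Bounded wait, so a regression fails the test instead of hanging it.
    assert(ended.await(5, TimeUnit.SECONDS), "shutdown was not triggered")
    println("shutdown observed")
  }
}
```

Waiting on a latch with a timeout matters here because the bug under test is itself a hang: an unbounded wait would reproduce the problem in the test suite.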

Was this patch authored or co-authored using generative AI tooling?

Yes.

…xpected exception

In YARN client mode, YarnClientSchedulerBackend's MonitorThread only catches InterruptedException/InterruptedIOException. If any other exception occurs (e.g., network failure, credential expiration) during application monitoring, the thread dies silently while the driver JVM hangs indefinitely due to non-daemon threads (SparkUI, heartbeats) keeping the process alive.

This patch adds a NonFatal catch clause that logs the error and calls sc.stop() to ensure the driver shuts down cleanly instead of hanging.
@shrirangmhalgi shrirangmhalgi marked this pull request as ready for review June 2, 2026 13:25
@shrirangmhalgi
Contributor Author

@pan3793 / @sarutak / @LuciferYang Could you please review this small fix? In YARN client mode, the MonitorThread dies silently on any non-interrupt exception, leaving the driver hung. The fix adds a NonFatal catch that calls sc.stop().

…path

Wire the reflected thread into backend.monitorThread so that when sc.stop() triggers YarnClientSchedulerBackend.stop(), the full production path (stop -> monitorThread.stopMonitor()) is exercised.
@pan3793
Member

pan3793 commented Jun 3, 2026

It sounds like a good idea to expand spark.yarn.am.clientModeExitOnError to cover this case. cc @AngersZhuuuu, have you experienced this issue, where the MonitorThread crashes?

…comment wording

- Use structured logging API (logError(log"...", e)) per sarutak's review
- Add System.exit(1) when AM_CLIENT_MODE_EXIT_ON_ERROR is set, matching the existing happy-path behavior for FAILED/KILLED states
- Fix test comment: 'fatal error' -> 'unexpected non-fatal error'
@shrirangmhalgi
Contributor Author

Thanks @sarutak and @pan3793 for the reviews! All the feedback is addressed in the latest commit:

  1. Fixed comment wording ("unexpected non-fatal error")
  2. Switched to structured logging API (logError(log"...", e))
  3. Added System.exit(1) guarded by AM_CLIENT_MODE_EXIT_ON_ERROR, matching the existing pattern for FAILED/KILLED states
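
The guarded exit in item 3 could look roughly like the sketch below. It is not the committed code: `handleMonitorFailure` is a hypothetical helper, `exitOnError` stands in for the value of `AM_CLIENT_MODE_EXIT_ON_ERROR`, and the ordering (stop first, then exit) is an assumption. `stop` and `exit` are injected so the sketch runs without terminating the JVM.

```scala
import scala.collection.mutable.ArrayBuffer

object ExitOnErrorSketch {
  // Hypothetical helper: `stop` stands in for sc.stop() and `exit`
  // for System.exit. Stopping before exiting is an assumption.
  def handleMonitorFailure(
      e: Throwable,
      exitOnError: Boolean,
      stop: () => Unit,
      exit: Int => Unit): Unit = {
    println(s"Unexpected error in monitor: ${e.getMessage}")
    stop()
    // Exit with a non-zero code only when the flag is set.
    if (exitOnError) exit(1)
  }

  def main(args: Array[String]): Unit = {
    val events = ArrayBuffer.empty[String]
    val failure = new RuntimeException("simulated credential expiration")

    // Flag set: stop, then exit with code 1.
    handleMonitorFailure(failure, exitOnError = true,
      () => events += "stop", code => events += s"exit($code)")

    // Flag unset: stop only.
    handleMonitorFailure(failure, exitOnError = false,
      () => events += "stop", code => events += s"exit($code)")

    println(events.mkString(", "))
  }
}
```

In real code `exit` would be `System.exit`; passing it in as a function keeps the decision logic testable.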
