Skip to content

Fix StateMachineProcedure EOF state re-execution#18099

Open
Caideyipi wants to merge 4 commits into
masterfrom
fix/pipe-procedure-eof-state-npe
Open

Fix StateMachineProcedure EOF state re-execution#18099
Caideyipi wants to merge 4 commits into
masterfrom
fix/pipe-procedure-eof-state-npe

Conversation

@Caideyipi

Copy link
Copy Markdown
Collaborator

Description

Problem

A ConfigNode may occasionally re-schedule a StateMachineProcedure whose internal state stack already points to the EOF sentinel after procedure recovery or leader-switch timing. In that case getCurrentState() returns null.

For pipe procedures such as PipeHandleLeaderChangeProcedure, the next execution path can then pass that null state back into the state-machine framework or the concrete enum mapping. The observed production symptom is an NPE like:

Cannot invoke "org.apache.iotdb.confignode.procedure.state.pipe.task.OperatePipeTaskState.ordinal()" because "" is null

The procedure can roll back and the ConfigNode can continue starting, but this is still a framework robustness bug during recovery.

Root Cause

StateMachineProcedure uses an internal EOF sentinel to represent no-more-state. When the procedure is scheduled while the last state is EOF but stateFlow still allows execution, getCurrentState() returns null.

The framework previously did not guard this case at the beginning of execute(). That allowed subclasses to receive a null state, even though their getStateId() implementations normally expect a real enum state and call ordinal().

Fix

This PR makes StateMachineProcedure.execute() treat a null current state as an already-completed EOF state:

  • log a warning with the procedure id and procedure details;
  • set stateFlow to NO_MORE_STATE;
  • clear the deserialized-state marker;
  • return without calling subclass executeFromState().

This keeps EOF handling inside the state-machine framework instead of requiring every enum-based procedure subclass to be null-safe.

UT / IT Coverage Check

UT coverage is added in STMProcedureTest:

  • testEofStateReexecutionDoesNotCallExecuteFromState constructs the exact framework edge case by forcing the internal state stack to EOF while stateFlow remains HAS_MORE_STATE;
  • the test verifies doExecute() returns normally;
  • the test also verifies subclass executeFromState() is not called, so no enum ordinal() path can receive a null state.

Existing PipeHandleLeaderChangeProcedureTest continues to cover serialization compatibility for the pipe leader-change procedure. The new UT targets the shared framework path, which is where the null state is produced.

IT coverage was reviewed and not added for this PR. The reported production reproduction depends on a timing-sensitive ConfigNode startup / Ratis leader-election / pipe leader-change recovery race, which is not deterministic enough for an integration test. The deterministic failure condition is the framework EOF/null state, and the new unit test covers that condition directly. Cluster ITs would mainly duplicate startup orchestration without reliably asserting the root cause.

Validation

Passed locally:

mvn spotless:apply -pl iotdb-core/confignode
git diff --check

Attempted but blocked locally before reaching ConfigNode tests:

mvn test -pl iotdb-core/confignode -Dtest=STMProcedureTest,PipeHandleLeaderChangeProcedureTest

This failed during dependency resolution because local/remote repositories did not provide required 2.0.11-SNAPSHOT reactor artifacts for the standalone ConfigNode module.

mvn test -pl iotdb-core/confignode -am -Dtest=STMProcedureTest,PipeHandleLeaderChangeProcedureTest -Dsurefire.failIfNoSpecifiedTests=false -DfailIfNoTests=false -DfailIfNoSpecifiedTests=false

This failed before reaching ConfigNode tests while compiling iotdb-protocol/thrift-datanode generated sources because javax.annotation was missing from the current local build environment.


This PR has:

  • been self-reviewed.
    • concurrent read
    • concurrent write
    • concurrent read and write
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage.
  • added integration tests. Not added because the deterministic framework edge is covered by UT; the full cluster reproduction is timing-sensitive and not suitable as a stable IT.
  • been tested in a test IoTDB cluster. Not run locally because Maven did not reach ConfigNode tests in this environment, as described above.

Key changed/added classes (or packages if there are too many classes) in this PR
  • org.apache.iotdb.confignode.procedure.impl.StateMachineProcedure
  • org.apache.iotdb.confignode.procedure.STMProcedureTest
Comment on lines +153 to +156
LOG.warn(
"StateMachineProcedure pid={} is scheduled with EOF state, skip execution: {}",
getProcId(),
this);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i18n

DO NOT LET ME SAY IT AGAIN

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 52632c4: moved the EOF-state error messages into the ConfigNode procedure i18n messages with both English and Chinese entries.

@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.55%. Comparing base (5ce95b1) to head (0670b67).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18099      +/-   ##
============================================
+ Coverage     41.53%   41.55%   +0.02%     
  Complexity      318      318              
============================================
  Files          5296     5296              
  Lines        371679   371825     +146     
  Branches      48088    48107      +19     
============================================
+ Hits         154360   154528     +168     
+ Misses       217319   217297      -22     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants