Enhancing Build Integrity With A Robust State Machine

Aug 9, 2025 by Henrik Larsen

Introduction

In this article, we'll dive deep into how to enhance build integrity using a robust state machine. Currently, our build system faces challenges like invalid state transitions and race conditions. To tackle these issues, we're implementing a state machine that ensures build integrity and prevents invalid operations. This is a P1-HIGH priority task, critical for the stability and reliability of our build process. Let's explore the current problems, the proposed solution, and the implementation plan.

Current Issues

Currently, our build system struggles with several key issues that compromise its integrity. These problems not only lead to unpredictable behavior but also increase the risk of failures during the build and deployment processes. Addressing these issues is crucial for maintaining a stable and efficient development pipeline.

Invalid Transitions

One of the primary issues is the lack of validation for state changes. For instance, the system might allow a transition from a success state directly to a running state, which is illogical and can lead to errors. Without proper validation, these invalid transitions can cause the build process to become inconsistent and unreliable. This lack of validation stems from the absence of a structured mechanism to govern state transitions, which a state machine aims to rectify.
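As a minimal sketch of the validation that is missing today, the check can be as small as a lookup table. The reduced state set and the names allowedNext and isValidTransition are assumptions made for this illustration; they are not part of the existing codebase.

```typescript
// Hypothetical, reduced state set used only for this illustration.
type State = 'pending' | 'running' | 'success' | 'failed' | 'cancelled';

// Legal successor states for each state. Terminal states have none.
const allowedNext: Record<State, readonly State[]> = {
  pending: ['running', 'failed', 'cancelled'],
  running: ['success', 'failed', 'cancelled'],
  success: [],
  failed: [],
  cancelled: [],
};

// Returns true only when `to` is listed as a legal successor of `from`.
function isValidTransition(from: State, to: State): boolean {
  return allowedNext[from].some((candidate) => candidate === to);
}
```

With this table, isValidTransition('success', 'running') returns false, so the illogical transition described above would be rejected before any write happens.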

Race Conditions

Race conditions occur when multiple processes attempt to update the build status simultaneously. This can lead to inconsistent states and unpredictable outcomes. Imagine two processes trying to update the build status at the same time; one might set the status to success, while the other sets it to failed. Without proper synchronization, the final state might be incorrect, leading to erroneous deployments or rollbacks. Implementing a state machine with atomic transitions is crucial to preventing these race conditions and ensuring data consistency.
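One common way to make such an update safe is compare-and-set: write the new status only if the row still holds the status the writer expected. The sketch below simulates this in memory; BuildRow and compareAndSetStatus are hypothetical names, and a real system would express the same idea as a conditional database update (in SQL terms, an UPDATE whose WHERE clause includes the expected status).

```typescript
type Status = 'running' | 'success' | 'failed';

// In-memory stand-in for a build row (an assumption for illustration only).
interface BuildRow {
  id: string;
  status: Status;
}

// Compare-and-set: write `next` only if the row still holds `expected`.
// Returns true when this caller won the update.
function compareAndSetStatus(row: BuildRow, expected: Status, next: Status): boolean {
  if (row.status !== expected) {
    return false;
  }
  row.status = next;
  return true;
}

const row: BuildRow = { id: 'b1', status: 'running' };

// Both writers observed "running" before either of them wrote.
const first = compareAndSetStatus(row, 'running', 'success');
const second = compareAndSetStatus(row, 'running', 'failed');
```

Here the first writer wins and the second is told that its expectation no longer holds, so the final status is determined by exactly one of them instead of by whichever write lands last.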

Inconsistent Logic

Our current build status checks are scattered and inconsistent across the codebase. This means that the same status might be interpreted differently in various parts of the system, leading to confusion and potential bugs. For example, one module might consider a pending state as valid for cancellation, while another might not. This inconsistent logic makes it challenging to reason about the system's behavior and increases the likelihood of errors. A centralized state machine helps consolidate and standardize these checks, ensuring that the build status is handled uniformly throughout the system.
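A small sketch of what consolidation could look like for the cancellation example: one table decides which statuses may be cancelled, and every module asks the same function. The names cancellable and isCancellable are assumptions for this illustration.

```typescript
type Status =
  | 'pending'
  | 'queued'
  | 'running'
  | 'success'
  | 'failed'
  | 'cancelled'
  | 'timeout';

// Single source of truth for cancellation. Using Record<Status, boolean>
// makes the compiler demand a decision whenever a new status is added.
const cancellable: Record<Status, boolean> = {
  pending: true,
  queued: true,
  running: true,
  success: false,
  failed: false,
  cancelled: false,
  timeout: false,
};

// Every caller uses this function instead of writing its own status check.
function isCancellable(status: Status): boolean {
  return cancellable[status];
}
```

The membership shown here mirrors the cancel transitions in the plan later in this article; the design point is that the rule lives in one place.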

No Rollback

When deployments fail, our current system lacks a mechanism to return to a previous valid state. This means that a failed deployment can leave the system in an inconsistent state, potentially causing further issues. The absence of a rollback capability makes it difficult to recover from failures and maintain system stability. A well-designed state machine can include transitions that allow for rolling back to previous states, providing a safety net in case of deployment failures. This ensures that the system can gracefully recover from errors and maintain a consistent state.
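One possible building block for rollback, sketched under the assumption that each transition is recorded in a history list with from and to fields (the plan below stores such a list in build metadata). The function name previousState is hypothetical, and choosing the rollback target this way is an assumption rather than a decided design.

```typescript
// Shape of one recorded transition (assumed for this illustration).
interface HistoryEntry {
  from: string;
  to: string;
  action: string;
}

// Returns the state the build was in before its most recent transition,
// or undefined when no transition has been recorded yet.
function previousState(history: readonly HistoryEntry[]): string | undefined {
  if (history.length === 0) {
    return undefined;
  }
  return history[history.length - 1].from;
}
```

A rollback would still need to go through the state machine as an explicit, validated transition; this helper only identifies the candidate target state.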

Current Code Problems

Let's take a look at the current code snippets that highlight these issues. The lack of a state machine leads to direct state updates without validation and inconsistent status checks, making the system prone to errors.

// builds.ts - No validation of state transitions
await prisma.build.update({
  where: { id: buildId },
  data: { status: "cancelled" as any }, // No check if cancellation is valid
});

// build status checks are inconsistent
if (build.status !== "pending" && build.status !== "running") {
  return reply.status(400).send({ error: "Build cannot be cancelled" });
}

In the first snippet, the build status is set to cancelled without checking whether that transition is valid from the current state, and the as any cast bypasses the type checker as well. If the build had already reached the success state, the update would still go through. The second snippet rejects cancellation unless the build is pending or running. That check is reasonable on its own, but it lives in a single route handler, so other parts of the system can apply different rules. Together, these snippets illustrate the need for a more structured approach to managing build states, which a state machine can provide.

Implementation Plan

To address these issues, we're implementing a robust state machine. This will involve defining the states, transitions, conditions, and side effects for the build process. Here's a detailed plan:

1. Define Build State Machine

The first step is to define the states and transitions for our build process. We'll create an enum for the build states and an interface for state transitions. This will provide a clear and structured representation of the build's lifecycle.

// types/build-state-machine.ts
import { BuildStatus } from '@prisma/client';

export enum BuildState {
  PENDING = 'pending',
  QUEUED = 'queued',
  RUNNING = 'running',
  SUCCESS = 'success',
  FAILED = 'failed',
  CANCELLED = 'cancelled',
  TIMEOUT = 'timeout',
}

export interface StateTransition {
  from: BuildState;
  to: BuildState;
  action: string;
  conditions?: (build: any, context?: any) => boolean;
  sideEffects?: (build: any, context?: any) => Promise<void>;
}

export const buildStateTransitions: StateTransition[] = [
  // Initial state transitions
  {
    from: BuildState.PENDING,
    to: BuildState.QUEUED,
    action: 'queue_build',
    conditions: (build) => !!build.projectId && !!build.commitSha,
    sideEffects: async (build) => {
      // Add to build queue
      await buildQueueService.enqueueBuild(build);
    }
  },
  {
    from: BuildState.PENDING,
    to: BuildState.FAILED,
    action: 'fail_to_queue',
  },
  {
    from: BuildState.PENDING,
    to: BuildState.CANCELLED,
    action: 'cancel_before_queue',
  },

  // Queue state transitions
  {
    from: BuildState.QUEUED,
    to: BuildState.RUNNING,
    action: 'start_build',
    sideEffects: async (build) => {
      // Record the build start in the audit log
      await auditLog({
        action: 'build.started',
        resourceId: build.id,
        userId: build.userId || 'system'
      });
    }
  },
  {
    from: BuildState.QUEUED,
    to: BuildState.CANCELLED,
    action: 'cancel_queued',
    sideEffects: async (build) => {
      // Remove from queue
      await buildQueueService.removeBuild(build.id);
    }
  },
  {
    from: BuildState.QUEUED,
    to: BuildState.TIMEOUT,
    action: 'queue_timeout',
    conditions: (build) => {
      // Timeout after 30 minutes in queue
      const queueTime = Date.now() - new Date(build.updatedAt).getTime();
      return queueTime > 30 * 60 * 1000;
    }
  },

  // Running state transitions
  {
    from: BuildState.RUNNING,
    to: BuildState.SUCCESS,
    action: 'complete_build',
    conditions: (build, context) => !!context?.imageUrl,
    sideEffects: async (build, context) => {
      // Trigger deployment if configured
      await triggerAutoDeployment(build, context.imageUrl);
    }
  },
  {
    from: BuildState.RUNNING,
    to: BuildState.FAILED,
    action: 'fail_build',
    sideEffects: async (build, context) => {
      // Clean up resources
      await cleanupFailedBuild(build.id);
      
      // Notify user
      await notifyBuildFailure(build, context?.error);
    }
  },
  {
    from: BuildState.RUNNING,
    to: BuildState.CANCELLED,
    action: 'cancel_running',
    sideEffects: async (build) => {
      // Stop build process
      await builderService.cancelBuild(build.id);
    }
  },
  {
    from: BuildState.RUNNING,
    to: BuildState.TIMEOUT,
    action: 'build_timeout',
    conditions: (build) => {
      // Timeout after 1 hour
      const buildTime = Date.now() - new Date(build.updatedAt).getTime();
      return buildTime > 60 * 60 * 1000;
    },
    sideEffects: async (build) => {
      await builderService.cancelBuild(build.id);
    }
  },

  // Retry transitions (only from failed states)
  {
    from: BuildState.FAILED,
    to: BuildState.QUEUED,
    action: 'retry_build',
    conditions: (build, context) => {
      const retryCount = build.metadata?.retryCount || 0;
      return retryCount < 3 && !!context?.allowRetry;
    },
    sideEffects: async (build) => {
      // Increment retry count
      await prisma.build.update({
        where: { id: build.id },
        data: {
          metadata: {
            ...build.metadata,
            retryCount: (build.metadata?.retryCount || 0) + 1
          }
        }
      });
    }
  },
  {
    from: BuildState.TIMEOUT,
    to: BuildState.QUEUED,
    action: 'retry_timeout',
    conditions: (build, context) => !!context?.allowRetry,
  },

  // Terminal state transitions (for cleanup)
  {
    from: BuildState.SUCCESS,
    to: BuildState.CANCELLED,
    action: 'mark_obsolete',
    conditions: (build, context) => !!context?.newBuildAvailable,
  },
];

// Build valid transitions lookup
export const validTransitions = new Map<BuildState, Map<string, BuildState>>();
buildStateTransitions.forEach(transition => {
  if (!validTransitions.has(transition.from)) {
    validTransitions.set(transition.from, new Map());
  }
  validTransitions.get(transition.from)!.set(transition.action, transition.to);
});

This code defines the BuildState enum with the states PENDING, QUEUED, RUNNING, SUCCESS, FAILED, CANCELLED, and TIMEOUT. The StateTransition interface describes one transition: the from state, the to state, the action that triggers it, an optional condition, and optional side effects. The buildStateTransitions array lists every valid transition, grouped into initial, queue, running, retry, and terminal-state transitions. A condition must return true before its transition can occur, and side effects run after the transition. Finally, the code builds a validTransitions map for quick lookup of the target state for a given state and action. Because the inner map is keyed by action, each action name must be unique within a from state. Any transition that is not in this table is treated as invalid, which gives the state machine a single, explicit definition of the build lifecycle.
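Since a Map silently overwrites an earlier entry with the same key, it can be worth checking the table for duplicate state and action pairs when the module loads. The following is a minimal sketch of such a check; the Transition shape is reduced to three string fields and findDuplicateTransitions is a hypothetical helper name.

```typescript
// Reduced transition shape (assumed for this illustration).
interface Transition {
  from: string;
  action: string;
  to: string;
}

// Returns every "from:action" key that appears more than once in the table.
function findDuplicateTransitions(table: readonly Transition[]): string[] {
  const seen: Record<string, number> = {};
  for (const t of table) {
    const key = `${t.from}:${t.action}`;
    seen[key] = (seen[key] || 0) + 1;
  }
  return Object.keys(seen).filter((key) => seen[key] > 1);
}
```

A non-empty result means two entries compete for the same lookup key, which would make the lookup map disagree with the array.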

2. Create State Machine Service

Next, we'll create a service that handles state transitions. This service will ensure that all transitions are valid, conditions are met, and side effects are executed. It will also handle race conditions by using database transactions.

// services/build-state-machine.service.ts
export class BuildStateMachine {
  async transitionState(
    buildId: string,
    action: string,
    context?: any,
    userId?: string
  ): Promise<{ success: boolean; build?: any; error?: string }> {
    
    // Read the current build. This read takes no lock; the state is re-checked inside the transaction below
    const build = await prisma.build.findUnique({
      where: { id: buildId }
    });

    if (!build) {
      return { success: false, error: 'Build not found' };
    }

    const currentState = build.status as BuildState;
    const transition = buildStateTransitions.find(
      t => t.from === currentState && t.action === action
    );

    if (!transition) {
      return {
        success: false,
        error: `Invalid transition: cannot ${action} from ${currentState}`
      };
    }

    // Check conditions
    if (transition.conditions && !transition.conditions(build, context)) {
      return {
        success: false,
        error: `Transition conditions not met for ${action}`
      };
    }

    // Perform atomic state transition
    try {
      const updatedBuild = await prisma.$transaction(async (tx) => {
        // Re-check current state to prevent race conditions
        const currentBuild = await tx.build.findUnique({
          where: { id: buildId }
        });

        if (currentBuild?.status !== currentState) {
          throw new Error(`State changed during transition: expected ${currentState}, got ${currentBuild?.status}`);
        }

        // Update build state
        const updated = await tx.build.update({
          where: { id: buildId },
          data: {
            status: transition.to,
            metadata: {
              ...build.metadata,
              stateHistory: [
                ...(build.metadata?.stateHistory || []),
                {
                  from: currentState,
                  to: transition.to,
                  action,
                  timestamp: new Date().toISOString(),
                  userId,
                  context: context ? JSON.stringify(context) : undefined
                }
              ]
            }
          }
        });

        // Log state transition
        await tx.auditLog.create({
          data: {
            id: `audit_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`,
            timestamp: new Date(),
            type: 'build_state',
            severity: getTransitionSeverity(transition.to),
            userId: userId || 'system',
            resourceType: 'build',
            resourceId: buildId,
            action: `state.${action}`,
            result: 'success',
            metadata: {
              from: currentState,
              to: transition.to,
              context
            }
          }
        });

        return updated;
      });

      // Execute side effects after successful state change
      if (transition.sideEffects) {
        try {
          await transition.sideEffects(updatedBuild, context);
        } catch (sideEffectError) {
          // Log but don't fail the transition
          console.error(`Side effect failed for ${action}:`, sideEffectError);
        }
      }

      // Broadcast state change
      logStreamingService.broadcastStatusChange(buildId, transition.to, {
        action,
        previous: currentState,
        context
      });

      return { success: true, build: updatedBuild };

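    } catch (error) {
      // Under strict TypeScript the caught value is `unknown`, so narrow it before reading the message
      const message = error instanceof Error ? error.message : String(error);
      return {
        success: false,
        error: `Transaction failed: ${message}`
      };
    }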
  }

  // Get valid actions for current build state
  getValidActions(buildId: string): Promise<string[]> {
    return prisma.build.findUnique({
      where: { id: buildId },
      select: { status: true }
    }).then(build => {
      if (!build) return [];
      
      const currentState = build.status as BuildState;
      const transitions = buildStateTransitions.filter(t => t.from === currentState);
      return transitions.map(t => t.action);
    });
  }

  // Check if transition is valid
  canTransition(currentState: BuildState, action: string): boolean {
    return buildStateTransitions.some(
      t => t.from === currentState && t.action === action
    );
  }
}

function getTransitionSeverity(state: BuildState): string {
  switch (state) {
    case BuildState.FAILED:
    case BuildState.TIMEOUT:
      return 'high';
    case BuildState.CANCELLED:
      return 'medium';
    case BuildState.SUCCESS:
      return 'low';
    default:
      return 'info';
  }
}

export const buildStateMachine = new BuildStateMachine();

The BuildStateMachine class provides methods for transitioning the build state, listing valid actions, and checking whether a transition is valid. The transitionState method is the core of the state machine. It first reads the current build; this initial read does not take a lock. It then looks up the transition that matches the current state and the requested action. If a transition is found and its conditions are met, the method opens a Prisma transaction, re-reads the build, and aborts if the status changed in the meantime. The status update and the audit log entry are then written in the same transaction, so either both are stored or neither is. How much protection the re-check gives against concurrent writers depends on the database's isolation level; making the update itself conditional on the expected status is a stricter alternative. After the transaction commits, the method runs any side effects attached to the transition. A failing side effect is logged but does not undo the state change. Finally, it broadcasts the state change through the log streaming service. The getValidActions method returns the actions available from a build's current state, and canTransition checks whether a given state and action pair exists in the transition table.
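The return type of transitionState uses optional build and error fields, so callers have to check both. One possible refinement, shown here as a sketch rather than as the service's actual signature, is a discriminated union in which the success flag tells the compiler which field is present. The Build shape and the describeResult helper are assumptions for this illustration.

```typescript
// Hypothetical, narrowed build shape for illustration.
interface Build {
  id: string;
  status: string;
}

// Exactly one of `build` or `error` exists, selected by `success`.
type TransitionResult =
  | { success: true; build: Build }
  | { success: false; error: string };

// Inside each branch the compiler knows which field is available.
function describeResult(result: TransitionResult): string {
  if (result.success) {
    return `build ${result.build.id} is now ${result.build.status}`;
  }
  return `transition rejected: ${result.error}`;
}
```

With this shape, reading result.build in the failure branch is a compile-time error instead of an undefined value at runtime.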

3. Update Build Routes with State Machine

We'll update our build routes to use the new state machine service. This will ensure that all state transitions are handled through the state machine, preventing direct state updates and ensuring consistency.

// builds.ts - Updated with state machine
import { BuildStatus } from '@prisma/client';
import { buildStateMachine } from '../services/build-state-machine.service';
import { BuildState } from '../types/build-state-machine';

export const buildRoutes: FastifyPluginAsync = async (app) => {
  // Create build
  app.post("/", {
    preHandler: [
      app.authenticate,
      requirePermission([Permission.BUILD_CREATE]),
    ],
  }, async (request, reply) => {
    const body = createBuildSchema.parse(request.body);
    const userId = request.auth!.user.id;

    // ... validation logic ...

    // Create build in PENDING state
    const build = await prisma.build.create({
      data: {
        projectId: body.projectId,
        commitSha: body.commitSha,
        branch: body.branch,
        status: BuildState.PENDING,
        userId,
        metadata: {
          dockerfilePath: body.dockerfilePath,
          buildArgs: body.buildArgs,
          stateHistory: []
        }
      },
    });

    // Transition to QUEUED state
    const queueResult = await buildStateMachine.transitionState(
      build.id,
      'queue_build',
      { 
        repoUrl: project.repository,
        dockerfilePath: body.dockerfilePath,
        buildArgs: body.buildArgs 
      },
      userId
    );

    if (!queueResult.success) {
      // Transition to FAILED if queueing fails
      await buildStateMachine.transitionState(
        build.id,
        'fail_to_queue',
        { error: queueResult.error },
        userId
      );
      
      return reply.status(500).send({ 
        error: 'Failed to queue build',
        details: queueResult.error
      });
    }

    return reply.status(201).send(queueResult.build);
  });

  // Update build status (used by builder service)
  app.patch("/:buildId", {
    preHandler: [
      app.authenticate,
      requirePermission([Permission.BUILD_CANCEL_ANY]),
    ],
  }, async (request, reply) => {
    const { buildId } = request.params as { buildId: string };
    const { action, status, imageUrl, logs, error } = request.body as {
      action?: string;
      status?: BuildStatus;
      imageUrl?: string;
      logs?: string;
      error?: string;
    };

    const userId = request.auth!.user.id;

    // Use state machine for transitions
    if (action) {
      const result = await buildStateMachine.transitionState(
        buildId,
        action,
        { imageUrl, logs, error },
        userId
      );

      if (!result.success) {
        return reply.status(400).send({
          error: result.error
        });
      }

      return result.build;
    }

    // Legacy status update (deprecated)
    if (status) {
      app.log.warn('Direct status update is deprecated, use action-based transitions');
      
      // Try to infer action from status change
      const build = await prisma.build.findUnique({ where: { id: buildId } });
      if (!build) {
        return reply.status(404).send({ error: 'Build not found' });
      }

      let inferredAction: string;
      switch (status) {
        case 'running':
          inferredAction = 'start_build';
          break;
        case 'success':
          inferredAction = 'complete_build';
          break;
        case 'failed':
          inferredAction = 'fail_build';
          break;
        case 'cancelled':
          inferredAction = 'cancel_running';
          break;
        default:
          return reply.status(400).send({ error: 'Invalid status transition' });
      }

      const result = await buildStateMachine.transitionState(
        buildId,
        inferredAction,
        { imageUrl, logs, error },
        userId
      );

      if (!result.success) {
        return reply.status(400).send({ error: result.error });
      }

      return result.build;
    }

    return reply.status(400).send({ error: 'Action or status required' });
  });

  // Cancel build
  app.post("/:buildId/cancel", {
    preHandler: [
      app.authenticate,
      requireResourceAccess('build'),
    ],
  }, async (request, reply) => {
    const { buildId } = request.params as { buildId: string };
    const userId = request.auth!.user.id;

    // Check user permissions for cancellation
    const build = await prisma.build.findUnique({
      where: { id: buildId }
    });

    if (!build) {
      return reply.status(404).send({ error: 'Build not found' });
    }

    const canCancel = await checkCancelPermission(userId, build);
    if (!canCancel) {
      return reply.status(403).send({ error: 'Cannot cancel this build' });
    }

    // Determine appropriate cancel action based on current state
    const currentState = build.status as BuildState;
    let cancelAction: string;

    switch (currentState) {
      case BuildState.PENDING:
        cancelAction = 'cancel_before_queue';
        break;
      case BuildState.QUEUED:
        cancelAction = 'cancel_queued';
        break;
      case BuildState.RUNNING:
        cancelAction = 'cancel_running';
        break;
      default:
        return reply.status(400).send({ 
          error: `Build cannot be cancelled in ${currentState} state` 
        });
    }

    const result = await buildStateMachine.transitionState(
      buildId,
      cancelAction,
      { cancelledBy: userId },
      userId
    );

    if (!result.success) {
      return reply.status(400).send({ error: result.error });
    }

    return result.build;
  });

  // Get valid actions for build
  app.get("/:buildId/actions", {
    preHandler: [
      app.authenticate,
      requirePermission([Permission.BUILD_READ]),
    ],
  }, async (request, reply) => {
    const { buildId } = request.params as { buildId: string };
    
    const validActions = await buildStateMachine.getValidActions(buildId);
    const build = await prisma.build.findUnique({
      where: { id: buildId },
      select: { status: true, metadata: true }
    });

    return {
      buildId,
      currentState: build?.status,
      validActions,
      stateHistory: build?.metadata?.stateHistory || []
    };
  });
};

This updated code integrates the BuildStateMachine into the build routes. When a new build is created, it starts in the PENDING state and transitions to the QUEUED state using the state machine. If queueing fails, it transitions to the FAILED state. The PATCH route for updating build status now uses the state machine to handle transitions based on actions, deprecating direct status updates. The cancel build route determines the appropriate cancel action based on the current state and uses the state machine to perform the transition. Finally, a new route is added to get the valid actions for a build, providing insight into the possible transitions. These updates ensure that all state changes are validated and handled consistently through the state machine, enhancing the integrity of the build process.
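Both the legacy status branch and the cancel route pick an action with a switch statement. As an alternative sketch, the action can be derived from the transition table itself by searching for the entry that leads from the current state to the requested one. The table below is a subset of the transitions defined earlier, and inferAction is a hypothetical helper name.

```typescript
// Reduced transition shape (assumed for this illustration).
interface Transition {
  from: string;
  to: string;
  action: string;
}

// Subset of the article's transition table, enough to show the lookup.
const transitions: readonly Transition[] = [
  { from: 'pending', to: 'cancelled', action: 'cancel_before_queue' },
  { from: 'queued', to: 'cancelled', action: 'cancel_queued' },
  { from: 'queued', to: 'running', action: 'start_build' },
  { from: 'running', to: 'cancelled', action: 'cancel_running' },
  { from: 'running', to: 'success', action: 'complete_build' },
];

// Finds the action that moves `current` to `target`, or undefined if none exists.
function inferAction(current: string, target: string): string | undefined {
  for (const t of transitions) {
    if (t.from === current && t.to === target) {
      return t.action;
    }
  }
  return undefined;
}
```

This keeps the routes in step with the table: a queued build asked to become cancelled resolves to cancel_queued, and a request with no matching transition yields undefined, which the route can turn into a 400 response.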

4. Add Automatic State Transitions (Cleanup Jobs)

To ensure the system remains consistent, we'll add automatic state transitions using cleanup jobs. These jobs will handle timeouts and stuck builds, ensuring that builds don't remain in intermediate states indefinitely.

// jobs/build-state-cleanup.job.ts
export async function buildStateCleanupJob(): Promise<void> {
  // Handle timeouts
  await handleBuildTimeouts();
  
  // Handle stuck builds
  await handleStuckBuilds();
  
  // Clean up old builds
  await cleanupOldBuilds();
}

async function handleBuildTimeouts(): Promise<void> {
  const timeoutThreshold = new Date(Date.now() - 60 * 60 * 1000); // 1 hour ago
  
  const timedOutBuilds = await prisma.build.findMany({
    where: {
      status: BuildState.RUNNING,
      updatedAt: { lt: timeoutThreshold }
    }
  });

  for (const build of timedOutBuilds) {
    await buildStateMachine.transitionState(
      build.id,
      'build_timeout',
      { reason: 'Automatic timeout after 1 hour' },
      'system'
    );
  }
}

async function handleStuckBuilds(): Promise<void> {
  const stuckThreshold = new Date(Date.now() - 30 * 60 * 1000); // 30 minutes ago
  
  const stuckBuilds = await prisma.build.findMany({
    where: {
      status: BuildState.QUEUED,
      updatedAt: { lt: stuckThreshold }
    }
  });

  for (const build of stuckBuilds) {
    await buildStateMachine.transitionState(
      build.id,
      'queue_timeout',
      { reason: 'Stuck in queue for 30+ minutes' },
      'system'
    );
  }
}

This code defines a buildStateCleanupJob that handles build timeouts and stuck builds. The handleBuildTimeouts function finds builds in the RUNNING state that have not been updated for more than an hour and applies the build_timeout action, which moves them to TIMEOUT. The handleStuckBuilds function finds builds that have sat in the QUEUED state for more than 30 minutes and applies queue_timeout, which also moves them to TIMEOUT. Both thresholds match the conditions on the corresponding transitions. The job also calls cleanupOldBuilds, whose implementation is not shown here. These automatic transitions keep builds from staying in intermediate states indefinitely.
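The two thresholds appear both in the transition conditions and in the cleanup queries. A minimal sketch of keeping them in one place is a pure function that takes the current time as a parameter, which also makes the rule easy to test. TIMEOUT_MS and isTimedOut are hypothetical names; the limits are the ones stated in this article.

```typescript
type TimedState = 'queued' | 'running';

// Limits taken from the article: 30 minutes in the queue, 1 hour running.
const TIMEOUT_MS: Record<TimedState, number> = {
  queued: 30 * 60 * 1000,
  running: 60 * 60 * 1000,
};

// True when more than the allowed time has passed since `updatedAt`.
// `now` is passed in so the function stays deterministic.
function isTimedOut(state: TimedState, updatedAt: Date, now: Date): boolean {
  return now.getTime() - updatedAt.getTime() > TIMEOUT_MS[state];
}
```

For a build last updated 45 minutes ago, the function reports a timeout in the queued state but not in the running state.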

Success Criteria

To ensure the successful implementation of the state machine, we've defined the following success criteria:

[x] All build state transitions are validated
[x] Invalid transitions are rejected with clear error messages
[x] State history is tracked for debugging
[x] Race conditions in state updates are prevented
[x] Automatic timeouts handle stuck builds
[x] Side effects (notifications, cleanup) execute reliably
[x] State machine is thoroughly tested

Testing Strategy

A robust testing strategy is crucial to ensure the state machine functions correctly. We'll use unit tests to verify the behavior of the state machine and its interactions with other services.

describe('Build State Machine', () => {
  it('should prevent invalid state transitions', async () => {
    const build = await createBuildWithStatus(BuildState.SUCCESS);
    
    const result = await buildStateMachine.transitionState(
      build.id,
      'start_build', // Invalid: success -> running
      {},
      'user123'
    );
    
    expect(result.success).toBe(false);
    expect(result.error).toContain('Invalid transition');
  });

  it('should handle race conditions', async () => {
    const build = await createBuildWithStatus(BuildState.RUNNING);
    
    // Simulate concurrent state changes
    const [result1, result2] = await Promise.all([
      buildStateMachine.transitionState(build.id, 'complete_build', { imageUrl: 'img1' }),
      buildStateMachine.transitionState(build.id, 'fail_build', { error: 'failed' })
    ]);
    
    // Only one should succeed
    expect([result1.success, result2.success]).toContain(true);
    expect([result1.success, result2.success]).toContain(false);
  });

  it('should execute side effects', async () => {
    const build = await createBuildWithStatus(BuildState.RUNNING);
    
    const result = await buildStateMachine.transitionState(
      build.id,
      'complete_build',
      { imageUrl: 'test-image:latest' }
    );
    
    expect(result.success).toBe(true);
    // Verify side effects were executed (auto-deployment triggered)
    expect(mockTriggerAutoDeployment).toHaveBeenCalledWith(
      expect.objectContaining({ id: build.id }),
      'test-image:latest'
    );
  });
});

The testing strategy includes tests for preventing invalid state transitions, handling race conditions, and executing side effects. The first test verifies that an invalid transition (e.g., from SUCCESS to RUNNING) is prevented. The second test simulates concurrent state changes to ensure that the state machine handles race conditions correctly. The third test verifies that side effects, such as triggering auto-deployment, are executed after a successful state transition. These tests ensure that the state machine behaves as expected and maintains the integrity of the build process.

Priority

P1-HIGH: Critical for build system integrity and preventing race conditions.

Estimated Effort

1-2 weeks (7-10 days)

State machine design and implementation: 3-4 days
Update all build routes: 2-3 days
Cleanup jobs and automation: 2 days
Testing and validation: 2-3 days

Dependencies

Build queue system (#114) for queue-related transitions
SSE log streaming (#119) for state change broadcasts
Consistent error handling (#117) for error responses

Related Issues

Builds on build queue system (#114)
Integrates with role-based permissions (#118)
Works with SSE streaming (#119)

Conclusion

Implementing a robust state machine is crucial for enhancing build integrity and preventing race conditions. By defining clear states, transitions, conditions, and side effects, we can ensure that our build process is consistent, reliable, and resilient to errors. This article has outlined the current issues, the proposed solution, the implementation plan, success criteria, testing strategy, and dependencies. With a dedicated effort, we can significantly improve our build system and streamline our development pipeline.