Enhancing Build Integrity With A Robust State Machine
Introduction
This article describes how we plan to improve build integrity with a state machine. Our build system currently allows invalid state transitions and is exposed to race conditions when several processes update a build's status at once. The state machine validates every transition and rejects invalid operations. This is a P1-HIGH priority task, critical for the stability and reliability of our build process. The sections below cover the current problems, the proposed solution, and the implementation plan.
Current Issues
Currently, our build system struggles with several key issues that compromise its integrity. These problems not only lead to unpredictable behavior but also increase the risk of failures during the build and deployment processes. Addressing these issues is crucial for maintaining a stable and efficient development pipeline.
Invalid Transitions
One of the primary issues is the lack of validation for state changes. For instance, the system might allow a transition from a success state directly to a running state, which is illogical and can lead to errors. Without proper validation, these invalid transitions can cause the build process to become inconsistent and unreliable. This lack of validation stems from the absence of a structured mechanism to govern state transitions, which a state machine aims to rectify.
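As a minimal sketch of what this validation could look like, the snippet below keeps the allowed moves in a table and rejects anything that is not listed. The from/to pairs mirror the transitions defined later in this article, ignoring conditions and side effects. The names `BuildStatusName`, `allowedNext`, and `isAllowedTransition` are hypothetical and used only for illustration.

```typescript
// Hypothetical names; the pairs mirror this article's transition table.
type BuildStatusName =
  | "pending" | "queued" | "running" | "success"
  | "failed" | "cancelled" | "timeout";

const allowedNext: Record<BuildStatusName, BuildStatusName[]> = {
  pending: ["queued", "failed", "cancelled"],
  queued: ["running", "cancelled", "timeout"],
  running: ["success", "failed", "cancelled", "timeout"],
  success: ["cancelled"], // only to mark a build obsolete
  failed: ["queued"], // retry
  timeout: ["queued"], // retry
  cancelled: [],
};

// Returns true only when the move is explicitly listed in the table.
function isAllowedTransition(from: BuildStatusName, to: BuildStatusName): boolean {
  return allowedNext[from].includes(to);
}
```

With a table like this, the illogical move from success to running is rejected simply because it is not listed.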
Race Conditions
Race conditions occur when multiple processes attempt to update the build status simultaneously. This can lead to inconsistent states and unpredictable outcomes. Imagine two processes trying to update the build status at the same time; one might set the status to success, while the other sets it to failed. Without proper synchronization, the final state might be incorrect, leading to erroneous deployments or rollbacks. Implementing a state machine with atomic transitions is crucial to preventing these race conditions and ensuring data consistency.
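The idea behind an atomic transition can be sketched as a compare-and-set: a writer states which status it expects, and the write only happens if that expectation still holds. The example below is an in-memory illustration under that assumption, not the database code used in the implementation plan; `buildStatusById` and `compareAndSetStatus` are hypothetical names.

```typescript
// In-memory stand-in for the builds table (hypothetical; a real system uses the database).
const buildStatusById = new Map<string, string>([["build-1", "running"]]);

// Compare-and-set: write `next` only if the stored status still equals `expected`.
// In this single-threaded sketch the check and the write cannot be interleaved;
// a database offers the same guarantee when both happen in one conditional UPDATE.
function compareAndSetStatus(id: string, expected: string, next: string): boolean {
  if (buildStatusById.get(id) !== expected) {
    return false; // another writer changed the status first
  }
  buildStatusById.set(id, next);
  return true;
}

// Two writers both observed "running" and now try to finish the build.
const firstWriterWon = compareAndSetStatus("build-1", "running", "success");
const secondWriterWon = compareAndSetStatus("build-1", "running", "failed");
```

The first writer moves the build to success. The second writer's expectation no longer holds, so its update is refused instead of silently overwriting the result.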
Inconsistent Logic
Our current build status checks are scattered and inconsistent across the codebase. This means that the same status might be interpreted differently in various parts of the system, leading to confusion and potential bugs. For example, one module might consider a pending state as valid for cancellation, while another might not. This inconsistent logic makes it challenging to reason about the system's behavior and increases the likelihood of errors. A centralized state machine helps consolidate and standardize these checks, ensuring that the build status is handled uniformly throughout the system.
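One way to picture the consolidation is a single predicate that every module calls instead of writing its own comparison. The sketch below is illustrative only: `CANCELLABLE_STATUSES` and `isCancellable` are hypothetical names, and the set of statuses is taken from the cancel route shown later in this article.

```typescript
// Single source of truth for "can this build be cancelled?" (hypothetical helper).
const CANCELLABLE_STATUSES: ReadonlySet<string> = new Set(["pending", "queued", "running"]);

// Every caller asks the same question the same way.
function isCancellable(buildStatus: string): boolean {
  return CANCELLABLE_STATUSES.has(buildStatus);
}
```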
No Rollback
When deployments fail, our current system lacks a mechanism to return to a previous valid state. This means that a failed deployment can leave the system in an inconsistent state, potentially causing further issues. The absence of a rollback capability makes it difficult to recover from failures and maintain system stability. A well-designed state machine can include transitions that allow for rolling back to previous states, providing a safety net in case of deployment failures. This ensures that the system can gracefully recover from errors and maintain a consistent state.
Current Code Problems
Let's take a look at the current code snippets that highlight these issues. The lack of a state machine leads to direct state updates without validation and inconsistent status checks, making the system prone to errors.
// builds.ts - No validation of state transitions
await prisma.build.update({
where: { id: buildId },
data: { status: "cancelled" as any }, // No check if cancellation is valid
});
// build status checks are inconsistent
if (build.status !== "pending" && build.status !== "running") {
return reply.status(400).send({ error: "Build cannot be cancelled" });
}
In the first snippet, the build status is set directly to cancelled without checking whether that transition is valid from the current state, so a build that already succeeded could still be marked as cancelled. The second snippet is one of several hand-written status checks: it rejects cancellation unless the build is pending or running. Other parts of the system apply different checks, which is where the inconsistencies come from. Both snippets show the need for a more structured approach to managing build states, which a state machine provides.
Implementation Plan
To address these issues, we're implementing a robust state machine. This will involve defining the states, transitions, conditions, and side effects for the build process. Here's a detailed plan:
1. Define Build State Machine
The first step is to define the states and transitions for our build process. We'll create an enum for the build states and an interface for state transitions. This will provide a clear and structured representation of the build's lifecycle.
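// types/build-state-machine.ts
// Note: prisma, buildQueueService, auditLog, triggerAutoDeployment, cleanupFailedBuild,
// notifyBuildFailure, and builderService are used below and are assumed to be imported
// from elsewhere in the codebase; their import paths are not shown in this article.
import { BuildStatus } from '@prisma/client';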
export enum BuildState {
PENDING = 'pending',
QUEUED = 'queued',
RUNNING = 'running',
SUCCESS = 'success',
FAILED = 'failed',
CANCELLED = 'cancelled',
TIMEOUT = 'timeout',
}
export interface StateTransition {
from: BuildState;
to: BuildState;
action: string;
conditions?: (build: any, context?: any) => boolean;
sideEffects?: (build: any, context?: any) => Promise<void>;
}
export const buildStateTransitions: StateTransition[] = [
// Initial state transitions
{
from: BuildState.PENDING,
to: BuildState.QUEUED,
action: 'queue_build',
conditions: (build) => !!build.projectId && !!build.commitSha,
sideEffects: async (build) => {
// Add to build queue
await buildQueueService.enqueueBuild(build);
}
},
{
from: BuildState.PENDING,
to: BuildState.FAILED,
action: 'fail_to_queue',
},
{
from: BuildState.PENDING,
to: BuildState.CANCELLED,
action: 'cancel_before_queue',
},
// Queue state transitions
{
from: BuildState.QUEUED,
to: BuildState.RUNNING,
action: 'start_build',
sideEffects: async (build) => {
// Record the start of the build in the audit log
await auditLog({
action: 'build.started',
resourceId: build.id,
userId: build.userId || 'system'
});
}
},
{
from: BuildState.QUEUED,
to: BuildState.CANCELLED,
action: 'cancel_queued',
sideEffects: async (build) => {
// Remove from queue
await buildQueueService.removeBuild(build.id);
}
},
{
from: BuildState.QUEUED,
to: BuildState.TIMEOUT,
action: 'queue_timeout',
conditions: (build) => {
// Timeout after 30 minutes in queue
const queueTime = Date.now() - new Date(build.updatedAt).getTime();
return queueTime > 30 * 60 * 1000;
}
},
// Running state transitions
{
from: BuildState.RUNNING,
to: BuildState.SUCCESS,
action: 'complete_build',
conditions: (build, context) => !!context?.imageUrl,
sideEffects: async (build, context) => {
// Trigger deployment if configured
await triggerAutoDeployment(build, context.imageUrl);
}
},
{
from: BuildState.RUNNING,
to: BuildState.FAILED,
action: 'fail_build',
sideEffects: async (build, context) => {
// Clean up resources
await cleanupFailedBuild(build.id);
// Notify user
await notifyBuildFailure(build, context?.error);
}
},
{
from: BuildState.RUNNING,
to: BuildState.CANCELLED,
action: 'cancel_running',
sideEffects: async (build) => {
// Stop build process
await builderService.cancelBuild(build.id);
}
},
{
from: BuildState.RUNNING,
to: BuildState.TIMEOUT,
action: 'build_timeout',
conditions: (build) => {
// Timeout after 1 hour
const buildTime = Date.now() - new Date(build.updatedAt).getTime();
return buildTime > 60 * 60 * 1000;
},
sideEffects: async (build) => {
await builderService.cancelBuild(build.id);
}
},
// Retry transitions (only from failed states)
{
from: BuildState.FAILED,
to: BuildState.QUEUED,
action: 'retry_build',
conditions: (build, context) => {
const retryCount = build.metadata?.retryCount || 0;
return retryCount < 3 && context?.allowRetry;
},
sideEffects: async (build) => {
// Increment retry count
await prisma.build.update({
where: { id: build.id },
data: {
metadata: {
...build.metadata,
retryCount: (build.metadata?.retryCount || 0) + 1
}
}
});
}
},
{
from: BuildState.TIMEOUT,
to: BuildState.QUEUED,
action: 'retry_timeout',
conditions: (build, context) => context?.allowRetry,
},
// Terminal state transitions (for cleanup)
{
from: BuildState.SUCCESS,
to: BuildState.CANCELLED,
action: 'mark_obsolete',
conditions: (build, context) => context?.newBuildAvailable,
},
];
// Build valid transitions lookup
export const validTransitions = new Map<BuildState, Map<string, BuildState>>();
buildStateTransitions.forEach(transition => {
if (!validTransitions.has(transition.from)) {
validTransitions.set(transition.from, new Map());
}
validTransitions.get(transition.from)!.set(transition.action, transition.to);
});
This code defines the BuildState enum with the states PENDING, QUEUED, RUNNING, SUCCESS, FAILED, CANCELLED, and TIMEOUT. The StateTransition interface describes a single transition: the from state, the to state, the action that triggers it, an optional condition, and optional side effects. The buildStateTransitions array lists every valid transition, grouped into initial, queue, running, retry, and terminal-state transitions. A transition's condition must hold before the transition is allowed, and its side effects run after the state change. Finally, the code builds a validTransitions map, keyed by state and then by action, for quick lookup of the target state. Any pair of state and action that is missing from this table is treated as invalid.
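To show how such a two-level lookup map can be queried, here is a small self-contained sketch. It copies three entries from the transition list above and builds the map the same way. `TransitionEntry`, `nextStateByAction`, and `resolveNextState` are hypothetical names introduced for this example.

```typescript
// Reduced copy of the transition list (three entries taken from the table above).
interface TransitionEntry {
  from: string;
  to: string;
  action: string;
}

const transitionEntries: TransitionEntry[] = [
  { from: "pending", to: "queued", action: "queue_build" },
  { from: "queued", to: "running", action: "start_build" },
  { from: "running", to: "success", action: "complete_build" },
];

// Same shape as validTransitions: state -> (action -> next state).
const nextStateByAction = new Map<string, Map<string, string>>();
for (const entry of transitionEntries) {
  const actionsForState = nextStateByAction.get(entry.from) ?? new Map<string, string>();
  actionsForState.set(entry.action, entry.to);
  nextStateByAction.set(entry.from, actionsForState);
}

// Hypothetical helper: the target state, or undefined if the action is not allowed here.
function resolveNextState(from: string, action: string): string | undefined {
  return nextStateByAction.get(from)?.get(action);
}
```

A lookup for `start_build` from `queued` resolves to `running`, while the same action from `success` resolves to `undefined` because no such entry exists.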
2. Create State Machine Service
Next, we'll create a service that handles state transitions. This service ensures that every transition is valid, that its conditions are met, and that its side effects are executed. It also guards against race conditions by re-checking the build's state inside a database transaction.
// services/build-state-machine.service.ts
export class BuildStateMachine {
async transitionState(
buildId: string,
action: string,
context?: any,
userId?: string
): Promise<{ success: boolean; build?: any; error?: string }> {
// Load the current build (this read is not locked; the state is verified again inside the transaction below)
const build = await prisma.build.findUnique({
where: { id: buildId }
});
if (!build) {
return { success: false, error: 'Build not found' };
}
const currentState = build.status as BuildState;
const transition = buildStateTransitions.find(
t => t.from === currentState && t.action === action
);
if (!transition) {
return {
success: false,
error: `Invalid transition: cannot ${action} from ${currentState}`
};
}
// Check conditions
if (transition.conditions && !transition.conditions(build, context)) {
return {
success: false,
error: `Transition conditions not met for ${action}`
};
}
// Perform atomic state transition
try {
const updatedBuild = await prisma.$transaction(async (tx) => {
// Conditional update: the status filter turns the write into a no-op if another
// process changed the state after it was read above
const { count } = await tx.build.updateMany({
where: { id: buildId, status: currentState },
data: {
status: transition.to,
metadata: {
...build.metadata,
stateHistory: [
...(build.metadata?.stateHistory || []),
{
from: currentState,
to: transition.to,
action,
timestamp: new Date().toISOString(),
userId,
context: context ? JSON.stringify(context) : undefined
}
]
}
}
});
if (count !== 1) {
throw new Error(`State changed during transition: expected ${currentState}`);
}
// Reload the row so callers and side effects see the committed values
const updated = await tx.build.findUniqueOrThrow({
where: { id: buildId }
});
// Log state transition
await tx.auditLog.create({
data: {
id: `audit_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`,
timestamp: new Date(),
type: 'build_state',
severity: getTransitionSeverity(transition.to),
userId: userId || 'system',
resourceType: 'build',
resourceId: buildId,
action: `state.${action}`,
result: 'success',
metadata: {
from: currentState,
to: transition.to,
context
}
}
});
return updated;
});
// Execute side effects after successful state change
if (transition.sideEffects) {
try {
await transition.sideEffects(updatedBuild, context);
} catch (sideEffectError) {
// Log but don't fail the transition
console.error(`Side effect failed for ${action}:`, sideEffectError);
}
}
// Broadcast state change
logStreamingService.broadcastStatusChange(buildId, transition.to, {
action,
previous: currentState,
context
});
return { success: true, build: updatedBuild };
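} catch (error) {
// In strict TypeScript the caught value is `unknown`, so narrow it before reading `.message`
const message = error instanceof Error ? error.message : String(error);
return {
success: false,
error: `Transaction failed: ${message}`
};
}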
}
// Get valid actions for current build state
getValidActions(buildId: string): Promise<string[]> {
return prisma.build.findUnique({
where: { id: buildId },
select: { status: true }
}).then(build => {
if (!build) return [];
const currentState = build.status as BuildState;
const transitions = buildStateTransitions.filter(t => t.from === currentState);
return transitions.map(t => t.action);
});
}
// Check if transition is valid
canTransition(currentState: BuildState, action: string): boolean {
return buildStateTransitions.some(
t => t.from === currentState && t.action === action
);
}
}
function getTransitionSeverity(state: BuildState): string {
switch (state) {
case BuildState.FAILED:
case BuildState.TIMEOUT:
return 'high';
case BuildState.CANCELLED:
return 'medium';
case BuildState.SUCCESS:
return 'low';
default:
return 'info';
}
}
export const buildStateMachine = new BuildStateMachine();
The BuildStateMachine class provides methods for transitioning the build state, listing valid actions, and checking whether a transition is valid. The transitionState method is the core of the state machine. It first loads the current build; this read is not locked, so it only provides the state to validate against. It then looks up the transition that matches the current state and the action, and evaluates the transition's condition if it has one. If both checks pass, it applies the change inside a Prisma transaction that verifies the status still matches the one that was validated, so a concurrent update causes the transition to fail instead of being overwritten. The state update and the audit log entry are written in the same transaction and therefore commit together or not at all. After the transaction commits, the method runs the transition's side effects; a failing side effect is logged and does not undo the state change. Finally, it broadcasts the state change through the log streaming service. The getValidActions method returns the actions available from a build's current state, and the canTransition method checks whether a given state and action form a valid transition. Neither of the two evaluates conditions, so an action they report as valid can still be rejected by transitionState.
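The rule that a failing side effect must not undo a committed transition can be isolated in a small helper. The sketch below is one possible shape, assuming side effects are plain async functions; `SideEffectOutcome` and `runSideEffectSafely` are hypothetical names and not part of the service above.

```typescript
// Outcome of running a side effect; failures are reported as values, not thrown.
type SideEffectOutcome = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: run a side effect after the state change has been committed.
// A failure is captured and returned so the caller can log it without failing the transition.
async function runSideEffectSafely(effect: () => Promise<void>): Promise<SideEffectOutcome> {
  try {
    await effect();
    return { ok: true };
  } catch (err) {
    // The caught value is `unknown` in strict mode, so narrow it before reading `.message`.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, reason };
  }
}
```

Returning the outcome rather than logging inside the helper leaves the caller free to record the failure, retry the side effect, or alert on it.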
3. Update Build Routes with State Machine
We'll update our build routes to use the new state machine service. This will ensure that all state transitions are handled through the state machine, preventing direct state updates and ensuring consistency.
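// builds.ts - Updated with state machine
import { BuildStatus } from '@prisma/client';
import { buildStateMachine } from '../services/build-state-machine.service';
// BuildState is defined in types/build-state-machine.ts, not in the service file
import { BuildState } from '../types/build-state-machine';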
export const buildRoutes: FastifyPluginAsync = async (app) => {
// Create build
app.post("/", {
preHandler: [
app.authenticate,
requirePermission([Permission.BUILD_CREATE]),
],
}, async (request, reply) => {
const body = createBuildSchema.parse(request.body);
const userId = request.auth!.user.id;
// ... validation logic ...
// Create build in PENDING state
const build = await prisma.build.create({
data: {
projectId: body.projectId,
commitSha: body.commitSha,
branch: body.branch,
status: BuildState.PENDING,
userId,
metadata: {
dockerfilePath: body.dockerfilePath,
buildArgs: body.buildArgs,
stateHistory: []
}
},
});
// Transition to QUEUED state
const queueResult = await buildStateMachine.transitionState(
build.id,
'queue_build',
{
repoUrl: project.repository,
dockerfilePath: body.dockerfilePath,
buildArgs: body.buildArgs
},
userId
);
if (!queueResult.success) {
// Transition to FAILED if queueing fails
await buildStateMachine.transitionState(
build.id,
'fail_to_queue',
{ error: queueResult.error },
userId
);
return reply.status(500).send({
error: 'Failed to queue build',
details: queueResult.error
});
}
return reply.status(201).send(queueResult.build);
});
// Update build status (used by builder service)
app.patch("/:buildId", {
preHandler: [
app.authenticate,
requirePermission([Permission.BUILD_CANCEL_ANY]),
],
}, async (request, reply) => {
const { buildId } = request.params as { buildId: string };
const { action, status, imageUrl, logs, error } = request.body as {
action?: string;
status?: BuildStatus;
imageUrl?: string;
logs?: string;
error?: string;
};
const userId = request.auth!.user.id;
// Use state machine for transitions
if (action) {
const result = await buildStateMachine.transitionState(
buildId,
action,
{ imageUrl, logs, error },
userId
);
if (!result.success) {
return reply.status(400).send({
error: result.error
});
}
return result.build;
}
// Legacy status update (deprecated)
if (status) {
app.log.warn('Direct status update is deprecated, use action-based transitions');
// Try to infer action from status change
const build = await prisma.build.findUnique({ where: { id: buildId } });
if (!build) {
return reply.status(404).send({ error: 'Build not found' });
}
let inferredAction: string;
switch (status) {
case 'running':
inferredAction = 'start_build';
break;
case 'success':
inferredAction = 'complete_build';
break;
case 'failed':
inferredAction = 'fail_build';
break;
case 'cancelled':
inferredAction = 'cancel_running';
break;
default:
return reply.status(400).send({ error: 'Invalid status transition' });
}
const result = await buildStateMachine.transitionState(
buildId,
inferredAction,
{ imageUrl, logs, error },
userId
);
if (!result.success) {
return reply.status(400).send({ error: result.error });
}
return result.build;
}
return reply.status(400).send({ error: 'Action or status required' });
});
// Cancel build
app.post("/:buildId/cancel", {
preHandler: [
app.authenticate,
requireResourceAccess('build'),
],
}, async (request, reply) => {
const { buildId } = request.params as { buildId: string };
const userId = request.auth!.user.id;
// Check user permissions for cancellation
const build = await prisma.build.findUnique({
where: { id: buildId }
});
if (!build) {
return reply.status(404).send({ error: 'Build not found' });
}
const canCancel = await checkCancelPermission(userId, build);
if (!canCancel) {
return reply.status(403).send({ error: 'Cannot cancel this build' });
}
// Determine appropriate cancel action based on current state
const currentState = build.status as BuildState;
let cancelAction: string;
switch (currentState) {
case BuildState.PENDING:
cancelAction = 'cancel_before_queue';
break;
case BuildState.QUEUED:
cancelAction = 'cancel_queued';
break;
case BuildState.RUNNING:
cancelAction = 'cancel_running';
break;
default:
return reply.status(400).send({
error: `Build cannot be cancelled in ${currentState} state`
});
}
const result = await buildStateMachine.transitionState(
buildId,
cancelAction,
{ cancelledBy: userId },
userId
);
if (!result.success) {
return reply.status(400).send({ error: result.error });
}
return result.build;
});
// Get valid actions for build
app.get("/:buildId/actions", {
preHandler: [
app.authenticate,
requirePermission([Permission.BUILD_READ]),
],
}, async (request, reply) => {
const { buildId } = request.params as { buildId: string };
const validActions = await buildStateMachine.getValidActions(buildId);
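const build = await prisma.build.findUnique({
where: { id: buildId },
// metadata must be selected too, because stateHistory is read from it below
select: { status: true, metadata: true }
});
if (!build) {
return reply.status(404).send({ error: 'Build not found' });
}
return {
buildId,
currentState: build.status,
validActions,
stateHistory: build.metadata?.stateHistory || []
};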
});
};
This updated code integrates the BuildStateMachine into the build routes. A new build is created in the PENDING state and then moved to QUEUED through the state machine; if queueing fails, it is moved to FAILED instead. The PATCH route now drives transitions by action. Direct status updates are deprecated but still accepted: the route maps the requested status to an action and sends it through the state machine. The cancel route picks the cancel action that matches the build's current state and uses the state machine to perform the transition. Finally, a new route returns the valid actions for a build, which shows callers which transitions are possible. With these changes, every state change made through these routes is validated by the state machine.
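The state-to-action choice in the cancel route can also be expressed as data. The sketch below is an alternative to the switch statement, not the code used in the route; `cancelActionByState` and `cancelActionFor` are hypothetical names, while the action names come from the transition table.

```typescript
// Maps the current state to the cancel action defined in the transition table.
// States without an entry (success, failed, cancelled, timeout) cannot be cancelled.
const cancelActionByState = new Map<string, string>([
  ["pending", "cancel_before_queue"],
  ["queued", "cancel_queued"],
  ["running", "cancel_running"],
]);

// Returns the cancel action for the state, or undefined if cancellation is not possible.
function cancelActionFor(currentState: string): string | undefined {
  return cancelActionByState.get(currentState);
}
```

A route using this helper would return a 400 response when the result is `undefined`, which matches the default branch of the switch.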
4. Add Automatic State Transitions (Cleanup Jobs)
To ensure the system remains consistent, we'll add automatic state transitions using cleanup jobs. These jobs will handle timeouts and stuck builds, ensuring that builds don't remain in intermediate states indefinitely.
// jobs/build-state-cleanup.job.ts
export async function buildStateCleanupJob(): Promise<void> {
// Handle timeouts
await handleBuildTimeouts();
// Handle stuck builds
await handleStuckBuilds();
// Clean up old builds (cleanupOldBuilds is not shown in this article)
await cleanupOldBuilds();
}
async function handleBuildTimeouts(): Promise<void> {
const timeoutThreshold = new Date(Date.now() - 60 * 60 * 1000); // 1 hour ago
const timedOutBuilds = await prisma.build.findMany({
where: {
status: BuildState.RUNNING,
updatedAt: { lt: timeoutThreshold }
}
});
for (const build of timedOutBuilds) {
await buildStateMachine.transitionState(
build.id,
'build_timeout',
{ reason: 'Automatic timeout after 1 hour' },
'system'
);
}
}
async function handleStuckBuilds(): Promise<void> {
const stuckThreshold = new Date(Date.now() - 30 * 60 * 1000); // 30 minutes ago
const stuckBuilds = await prisma.build.findMany({
where: {
status: BuildState.QUEUED,
updatedAt: { lt: stuckThreshold }
}
});
for (const build of stuckBuilds) {
await buildStateMachine.transitionState(
build.id,
'queue_timeout',
{ reason: 'Stuck in queue for 30+ minutes' },
'system'
);
}
}
This code defines a buildStateCleanupJob that handles build timeouts and stuck builds. The handleBuildTimeouts function finds builds in the RUNNING state that have not been updated in the last hour and transitions them to the TIMEOUT state. The handleStuckBuilds function finds builds in the QUEUED state that have not been updated for more than 30 minutes and transitions them to the TIMEOUT state as well. Both go through the state machine, so the same conditions and side effects apply as for any other transition. These automatic transitions keep builds from sitting in intermediate states indefinitely.
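The timeout rules themselves are easy to test when they are written as a pure function of the build's status, its last update time, and the current time. The following sketch assumes the same thresholds as the cleanup job; `timeoutActionFor` and the two constants are hypothetical names.

```typescript
// Thresholds taken from the cleanup job above.
const QUEUE_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes in the queue
const RUN_TIMEOUT_MS = 60 * 60 * 1000; // 1 hour running

// Hypothetical helper: which timeout action, if any, applies to a build right now?
function timeoutActionFor(
  buildStatus: string,
  updatedAt: Date,
  now: Date
): "queue_timeout" | "build_timeout" | undefined {
  const idleMs = now.getTime() - updatedAt.getTime();
  if (buildStatus === "queued" && idleMs > QUEUE_TIMEOUT_MS) return "queue_timeout";
  if (buildStatus === "running" && idleMs > RUN_TIMEOUT_MS) return "build_timeout";
  return undefined;
}
```

Passing `now` as a parameter instead of calling `Date.now()` inside the function lets tests check the boundaries without waiting or mocking the clock.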
Success Criteria
To ensure the successful implementation of the state machine, we've defined the following success criteria:
- [x] All build state transitions are validated
- [x] Invalid transitions are rejected with clear error messages
- [x] State history is tracked for debugging
- [x] Race conditions in state updates are prevented
- [x] Automatic timeouts handle stuck builds
- [x] Side effects (notifications, cleanup) execute reliably
- [x] State machine is thoroughly tested
Testing Strategy
A robust testing strategy is crucial to ensure the state machine functions correctly. We'll use unit tests to verify the behavior of the state machine and its interactions with other services.
describe('Build State Machine', () => {
it('should prevent invalid state transitions', async () => {
const build = await createBuildWithStatus(BuildState.SUCCESS);
const result = await buildStateMachine.transitionState(
build.id,
'start_build', // Invalid: success -> running
{},
'user123'
);
expect(result.success).toBe(false);
expect(result.error).toContain('Invalid transition');
});
it('should handle race conditions', async () => {
const build = await createBuildWithStatus(BuildState.RUNNING);
// Simulate concurrent state changes
const [result1, result2] = await Promise.all([
buildStateMachine.transitionState(build.id, 'complete_build', { imageUrl: 'img1' }),
buildStateMachine.transitionState(build.id, 'fail_build', { error: 'failed' })
]);
// Only one should succeed
expect([result1.success, result2.success]).toContain(true);
expect([result1.success, result2.success]).toContain(false);
});
it('should execute side effects', async () => {
const build = await createBuildWithStatus(BuildState.RUNNING);
const result = await buildStateMachine.transitionState(
build.id,
'complete_build',
{ imageUrl: 'test-image:latest' }
);
expect(result.success).toBe(true);
// Verify side effects were executed (auto-deployment triggered)
expect(mockTriggerAutoDeployment).toHaveBeenCalledWith(
expect.objectContaining({ id: build.id }),
'test-image:latest'
);
});
});
The testing strategy includes tests for preventing invalid state transitions, handling race conditions, and executing side effects. The first test verifies that an invalid transition (e.g., from SUCCESS to RUNNING) is prevented. The second test simulates concurrent state changes to ensure that the state machine handles race conditions correctly. The third test verifies that side effects, such as triggering auto-deployment, are executed after a successful state transition. These tests ensure that the state machine behaves as expected and maintains the integrity of the build process.
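Beyond behavioral tests, the transition table itself can be checked for consistency. One property worth asserting is that no state has two transitions with the same action, since a lookup by state and action can only return one target. The sketch below shows such a check under that assumption; `TransitionKey` and `findDuplicateTransitions` are hypothetical names.

```typescript
interface TransitionKey {
  from: string;
  action: string;
}

// Hypothetical helper: list every (from, action) pair that appears more than once.
// Duplicates matter because a lookup by state and action can only return one target.
function findDuplicateTransitions(transitions: TransitionKey[]): string[] {
  const seen = new Set<string>();
  const duplicates: string[] = [];
  for (const t of transitions) {
    const key = `${t.from}:${t.action}`;
    if (seen.has(key)) {
      duplicates.push(key);
    } else {
      seen.add(key);
    }
  }
  return duplicates;
}
```

A test could pass buildStateTransitions to this helper and expect an empty result.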
Priority
P1-HIGH: Critical for build system integrity and preventing race conditions.
Estimated Effort
About 2 weeks (9-12 days, summing the items below)
- State machine design and implementation: 3-4 days
- Update all build routes: 2-3 days
- Cleanup jobs and automation: 2 days
- Testing and validation: 2-3 days
Dependencies
- Build queue system (#114) for queue-related transitions
- SSE log streaming (#119) for state change broadcasts
- Consistent error handling (#117) for error responses
Related Issues
- Builds on build queue system (#114)
- Integrates with role-based permissions (#118)
- Works with SSE streaming (#119)
Conclusion
A state machine with clearly defined states, transitions, conditions, and side effects makes the build process consistent and protects it against invalid transitions and race conditions. This article has outlined the current issues, the proposed solution, the implementation plan, success criteria, testing strategy, and dependencies.