Harvester Node Drain Bug: Stuck In Maintenance Mode

by Henrik Larsen

Hey guys! Let's dive into a tricky bug we've been seeing in Harvester where nodes sometimes fail to drain and enter maintenance mode. This can be a real headache when you're trying to perform updates or reboots, so let's break down the issue, how to reproduce it, and what we can do about it.

Describe the Bug

So, the main issue is that when you try to put a node into maintenance mode via the UI, the process gets stuck. Ideally, what should happen is:

  1. The node gets cordoned off.
  2. VM migrations are scheduled to move VMs off the node.
  3. The node is drained using kubectl drain --ignore-daemonsets.
  4. Once the drain is complete, maintenance mode labels are applied, and you're good to go.
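If you'd rather follow along from a shell than the UI, the first three steps map roughly onto stock kubectl commands. This is only a sketch of the equivalent manual flow, not Harvester's exact internals; <node-name> is a placeholder, and the VMI listing assumes the KubeVirt CRDs that Harvester ships with.

  # Cordon the node so nothing new gets scheduled onto it
  kubectl cordon <node-name>

  # Watch the VirtualMachineInstances (vmi) leave the node as migrations run
  kubectl get vmi -A -o wide -w

  # Drain the node, skipping DaemonSet-managed pods
  kubectl drain <node-name> --ignore-daemonsets

Step 4 (the maintenance-mode labels) is applied by Harvester itself once its own drain returns, so there's no manual equivalent worth showing here.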

But sometimes, the nodes just migrate all the VMs and then… nothing. They sit there in a cordoned state, refusing to complete the drain. We've seen this go on for up to 18 hours! That's when you start getting impatient and might consider rebooting the node. Manually killing all the pods on the node also seems to get the drain to complete, which is a clue but not ideal.

The core problem revolves around nodes failing to complete the drain process, leaving them stuck in a cordoned state despite VMs having been migrated or shut down. This directly impacts the ability to perform necessary maintenance tasks like updates and reboots. When a node is stuck, it can disrupt cluster operations and delay critical maintenance procedures.
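When a node gets wedged like this, the first useful data point is what's actually still running on it. These are standard kubectl queries with a placeholder node name:

  # A cordoned node shows Ready,SchedulingDisabled in the STATUS column
  kubectl get node <node-name>

  # List everything still scheduled on the node; any pod that isn't part of
  # a DaemonSet is a candidate for what's blocking the drain
  kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>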

A frustrating part of this issue is that it's intermittent: it doesn't happen every time a node goes into maintenance mode, which makes it hard to pinpoint what triggers it. The inconsistency suggests it may come down to specific configurations, resource utilization, or timing issues within the cluster.

Further complicating matters is the interaction with Longhorn, the distributed block storage that Harvester ships with. The error messages observed during failed drain attempts point at Longhorn-related pods, suggesting a deeper integration issue. That means solving the problem may require understanding how Harvester and Longhorn interact during node maintenance.

To Reproduce

To try and reproduce this, simply place a node (or several) into maintenance mode. The issue seems to pop up a few minutes after the last VM has migrated off the node or shut down. It's not a guaranteed thing, but it happens often enough to be a pain. The best way to reproduce this issue is to consistently place nodes into maintenance mode across different cluster states and workloads. Documenting the specific circumstances under which the failure occurs can help in identifying common patterns or triggers.

It's also essential to monitor the node draining process closely, especially after the last VM has been migrated or shut down. Keeping an eye on the events and logs during this period can provide valuable insights into the cause of the failure. Detailed logging and monitoring strategies are key to capturing the elusive conditions that lead to the drain process getting stuck.
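For that monitoring, a couple of stock kubectl commands cover most of it; nothing here is Harvester-specific, and the node name is a placeholder:

  # Stream cluster events while the node drains; failed evictions show up here
  kubectl get events -A --watch

  # Keep an eye on the longhorn-system namespace in particular, since that's
  # where the eviction errors in our case originated
  kubectl get events -n longhorn-system --watch

  # Watch the node object itself for cordon and condition changes
  kubectl get node <node-name> --watch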

Additionally, try varying the load on the cluster before initiating maintenance mode. High resource utilization, network congestion, or pending operations might exacerbate the issue. Testing with different levels of load can help determine if resource contention plays a role in the problem. Simulating real-world production scenarios is crucial for uncovering the root cause of the bug.

Expected Behavior

What we want to happen is simple: the drain completes, and the node smoothly enters maintenance mode without any manual intervention. The expected behavior is a clear and timely transition to maintenance mode once all VMs have been migrated or shut down. This involves the successful eviction of all non-essential pods, application of maintenance mode labels, and confirmation that the node is ready for maintenance tasks.
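If you want a concrete way to confirm the node actually got there, checking the node object is enough. I'm deliberately not asserting the exact label or annotation key Harvester uses for maintenance state, so the grep below is loose:

  # A drained node should report Ready,SchedulingDisabled
  kubectl get node <node-name>

  # Look for Harvester's maintenance marker on the node object; grep loosely
  # rather than assuming a specific key name
  kubectl get node <node-name> -o yaml | grep -i maintain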

A smooth transition not only ensures minimal disruption to running workloads but also allows administrators to perform updates, reboots, and other maintenance activities with confidence. The system should handle the node draining process efficiently, ensuring that all necessary steps are completed without manual intervention. Any deviation from this expected behavior indicates a problem that needs to be addressed.

For users, the expected behavior translates into a reliable, predictable maintenance process: when a node is placed into maintenance mode, administrators should be able to trust the system to migrate VMs and evict pods gracefully. That predictability is essential for the overall stability and availability of the cluster, and for trusting the system's operational capabilities.

Additional context

This has happened a few times now, and each time it's been a pain to resolve. Digging deeper, we've noticed an endless error message like this:

evicting pod longhorn-system/instance-manager-<UUID>
error when evicting pods/"instance-manager-<UUID>" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

This just goes on and on. Most recently, we had three nodes stuck in this state. Killing the pod manually kicked off the drain for the other two nodes as well, which is a bit weird. This seems like it might be a Longhorn bug since it involves those pods, but since we're only seeing it in Harvester and during maintenance mode, we figured we'd start here.

The persistent error message highlights the role of Pod Disruption Budgets (PDBs) in preventing the eviction of critical pods. PDBs are designed to protect applications by ensuring that a minimum number of replicas are running at all times. In this case, the instance-manager pods from Longhorn appear to be protected by a PDB that prevents their eviction during the drain process. Understanding how PDBs are configured and applied in the cluster is crucial for diagnosing this issue.
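To see which budget is doing the blocking, you can inspect the PDBs in the longhorn-system namespace directly. Longhorn creates and names these itself, so <pdb-name> below is a placeholder for whatever shows up in your cluster:

  # List Longhorn's PodDisruptionBudgets; the ALLOWED DISRUPTIONS column is
  # the interesting one
  kubectl get pdb -n longhorn-system

  # Inspect the budget guarding the pod named in the eviction error
  kubectl describe pdb <pdb-name> -n longhorn-system

An allowed-disruptions value of 0 on the relevant budget is exactly what would make the eviction retry forever, which matches the endless log loop above.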

The fact that manually killing the pod triggered the drain for other nodes suggests a potential dependency or synchronization problem. It's possible that the drain process is blocked by a single pod, and once that pod is removed, the process can proceed for other nodes. This behavior points to a need for better coordination and dependency management in the drain process.

The connection to Longhorn also raises questions about the interplay between storage and compute resources during maintenance operations. Ensuring that storage-related pods can be safely and efficiently evicted or migrated is crucial for a smooth maintenance process. The interaction between Harvester and Longhorn, particularly in the context of PDBs and pod eviction, warrants further investigation.

Workaround and Mitigation

So, what can you do in the meantime? Here are a couple of workarounds:

  • Use kubectl drain <node> --ignore-daemonsets to see which pods are blocking the drain.
  • Manually stop/delete those pods.

Alternatively, if you're feeling brave:

  • YOLO and reboot the node after verifying that the VMs have migrated (not ideal, but sometimes necessary).

Detailed Explanation of Workarounds

  1. Using kubectl drain <node> --ignore-daemonsets: This command is your first line of defense. It attempts to drain the specified node, evicting every evictable pod so it gets rescheduled elsewhere. The --ignore-daemonsets flag is needed because DaemonSet pods are pinned to every node and would simply be recreated by their controller, so drain refuses to proceed unless you tell it to skip them. By running this command, you can identify the specific pods that are preventing the drain from completing; the output highlights the pods that fail to evict and why.

    This approach is particularly useful because it gives you direct feedback on the drain process. It allows you to see exactly which pods are causing issues and why they might be failing to evict. Understanding the specific pods involved is the first step in developing a targeted solution.

  2. Manually Stopping/Deleting Pods: Once you've identified the problematic pods, you can take manual action to resolve the issue. Stopping or deleting the pods can unblock the drain process, allowing the node to enter maintenance mode. However, it's essential to exercise caution when using this method. Before stopping or deleting any pod, make sure you understand its role in the system and the potential impact of its removal.

    For example, if a pod is part of a critical application, stopping it might cause service disruption. In such cases, it's best to coordinate with the application owners or consult the documentation to determine the safest course of action. Sometimes, scaling up the application's deployment before stopping the pod can help maintain availability. A combined sketch of both steps follows this list.
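Putting the two workarounds together, here's the rough sequence they describe. Treat it as a sketch with placeholder names, not a copy-paste fix, and double-check what a pod does before deleting it:

  # 1. Re-run the drain to surface exactly which pods refuse to evict
  kubectl drain <node-name> --ignore-daemonsets --timeout=120s

  # 2. Delete the pod named in the eviction error; in our case this was the
  #    Longhorn instance-manager, and removing it let the drain finish
  kubectl delete pod instance-manager-<UUID> -n longhorn-system

  # 3. Re-run the drain (or let Harvester's own retry pick it up) to confirm
  #    the node finishes entering maintenance mode
  kubectl drain <node-name> --ignore-daemonsets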

The YOLO Reboot Option (Use with Caution!)

  • Rebooting the Node: This is the last resort. After verifying that every VM has migrated off the node (or been shut down), you reboot the node and let it come back up. It's not ideal, since it skips graceful eviction entirely, so save it for when the workarounds above don't get you anywhere.
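Before pulling that trigger, it's worth double-checking that nothing important is still on the node. These are ordinary queries with a placeholder node name, and the VMI check assumes the KubeVirt CRDs Harvester ships with:

  # No VirtualMachineInstances should still be placed on this node
  kubectl get vmi -A -o wide | grep <node-name>

  # Ideally nothing beyond DaemonSet pods remains scheduled there either
  kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>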