
Longhorn CSI Plugin Fails to Start

Note

This is not official documentation for Automation Suite.

Issue Description

During the installation of Automation Suite or AI Center via Automation Suite, the longhorn-csi-plugin fails to start.

Root Cause

The Longhorn CSI plugin is the interface between Kubernetes (the container orchestration platform used in Automation Suite) and Longhorn (the storage solution used for persistent data).

In other words, it allows the Kubernetes engine to talk to the storage engine. A CSI plugin that fails to start can be caused by many issues, but it is usually symptomatic of a more fundamental problem.
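
A quick way to sanity-check that path is to confirm that the Longhorn backend the plugin talks to is present and healthy. This is a minimal sketch; the longhorn-backend service name matches the error shown later in this article, while the app=longhorn-manager label is the Longhorn default and may differ in customized deployments:

  # Confirm the backend service and its endpoints exist
  kubectl -n longhorn-system get svc longhorn-backend
  kubectl -n longhorn-system get endpoints longhorn-backend

  # Confirm the Longhorn manager pods backing that service are running
  kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide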

Diagnosis

  1. If the longhorn-csi-plugin pod is not starting, it is probably in a CrashLoopBackOff state. The first thing to do is list the pods to get more information about their state.

    1. Run this command: kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o wide
    2. In the output, pay attention to the READY column. If it shows 1/2, it most likely means that the CSI plugin cannot talk to the Longhorn backend. This will be covered in later steps.
      • Example: longhorn-system longhorn-csi-plugin-jwvnj 1/2 CrashLoopBackOff
    3. Also pay attention to which node the failed pod is running on.
  2. Next, describe the pod to surface any other important information

    1. First, list the pods again and note the one that is in a crashed state: kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin
    2. kubectl -n longhorn-system describe pod <pod name>
    3. Alternatively run the command: kubectl -n longhorn-system describe pod -l app=longhorn-csi-plugin
    4. Look at the events associated with the Pod. They are at the end of the output and usually they include some helpful information.
    5. Most likely the error will be: Back-off restarting failed container
  3. Try to get the logs from the crashed instance (a consolidated sketch of these commands follows this list)

    1. Identify the pod that is in a crashed state: kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin
    2. kubectl -n longhorn-system logs <pod name> -c longhorn-csi-plugin (Note: the -p option, which returns the logs of the previous container instance, might need to be added at the end of the command.)
      • Check for the error: Failed to initialize Longhorn API client: Get "http://longhorn-backend:9500/v1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    3. Get the logs for the node-driver-registrar
      • kubectl -n longhorn-system logs <pod name> -c node-driver-registrar (Note: the -p option might need to be added at the end of the command.) Check for the error: W0423 04:16:48.096462 11095 connection.go:173] Still connecting to unix:///csi/csi.sock
    4. If the two errors mentioned above are seen, the issue is most likely that the nodes in the cluster cannot talk to each other and something is wrong with the overlay network. Go to the section Debugging Node Overlay Network Issue below.
  4. If the logs did not show the errors mentioned, or if the issue is still not clear, create a support bundle using the Linux Log Collector Script and send the information to UiPath.
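
The diagnosis commands above can also be strung together into a single pass. This is a minimal sketch rather than official tooling; the LH_POD variable is illustrative and simply picks the first matching pod, and -p may be needed when a container has already restarted:

  # List the CSI plugin pods, their READY count, and the node each one runs on
  kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o wide

  # Pick the name of one pod to inspect (adjust the index if several pods are failing)
  LH_POD=$(kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o jsonpath='{.items[0].metadata.name}')

  # Events are printed at the end of the describe output
  kubectl -n longhorn-system describe pod "$LH_POD"

  # Logs from the two containers of interest; add -p to read the previous, crashed instance
  kubectl -n longhorn-system logs "$LH_POD" -c longhorn-csi-plugin
  kubectl -n longhorn-system logs "$LH_POD" -c node-driver-registrar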

Debugging Node Overlay Network Issue

Note

These steps only apply to Multi-node installations.

  1. First, go through the following KB article to see if the overlay network is functioning: overlay networking test

    • Hint: This can also be verified via step 5 below by using a tcpdump capture to check whether UDP packets are reaching the nodes
    • For air-gapped environments, the tcpdump test will most likely be easier to perform
  2. If the tests fail, that means the overlay network is not working

  3. The most likely culprit is that the nodes cannot communicate over UDP port 8472

    • This port is mentioned in the documentation
    • Typically there are three culprits that can cause this
      • Local Firewall
      • IPTables on the VM
      • Network firewall that exists outside of the VM
    • The default Red Hat firewall is firewalld. Its state can be checked with the command:
      • firewall-cmd --state
      • To open the port,
        • sudo firewall-cmd --zone=public --permanent --add-port=8472/udp
        • sudo firewall-cmd --reload
      Important:

      It is best to talk to the Linux Administrator and Security teams before making changes like this. Additionally, they may be using a different firewall solution or they may have a private zone.

      • Finally, it might be worth asking whether the local firewall can be turned off. Some Linux administrators prefer to keep things simple and rely on a network-level firewall that sits outside of the VM.
  4. If the firewall is disabled, then the next culprit could be iptables

    • iptables can be complicated to troubleshoot. However, if logging is enabled, a check can be done to see if packets are being dropped.
      • Note: iptables rules are easier to read with the -S option. For example, to see the rules for the INPUT chain, try: iptables -S INPUT
    • Make sure to be on the node where the pod is crashing. This information was gathered in step 1 of the Diagnosis section.
    • Run the command: iptables -A INPUT -j LOG --log-prefix DROPPED_INPUT
    • This will cause dropped packets to be logged in /var/log/messages
    • Run the following command to see if packets are being dropped: tail -f /var/log/messages | grep "DROPPED_INPUT.*DPT=8472"
    • If the command returns anything, packets are being dropped.
    • Talk to the administrator who set up iptables to get the port unblocked or an exception added to the rules
    • Also note that some Linux administrators recommend not configuring iptables on specific machines and instead relying on the network firewall
  5. To determine if the issue is caused by an external firewall, perform a packet capture:

    • To do a packet capture, tcpdump must be installed: yum install tcpdump
    • To see if traffic is coming through, run: tcpdump -i any udp and port 8472
    • Let the command run for a few minutes. If no traffic is coming through, it is most likely being blocked by an external firewall.
    • It is also possible the issue is intermittent if there is an external dynamic firewall.
  6. Another quick test is to check UDP connectivity on a different port (a sketch of this test follows this list)

    • On the node where the pod is crashing, run: nc -ul <random port number> - This creates a UDP listener
    • On another node, run: nc -u <first node name> <port number from previous step>
    • Then, on the second node, type something and hit enter. The typed text should show up on the first node where the UDP listener is running
    • If this test does not work, then that means either UDP is only allowed on some ports, or UDP is indeed blocked.
  7. If the above steps do not help, create a support bundle using the Linux Log Collector Script and send it to UiPath
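
The UDP checks above can be run end to end as follows. This is a minimal sketch; node-a is a placeholder for the hostname or IP of the node where the pod is crashing, port 9999 is an arbitrary test port, and any firewall or iptables changes should first be agreed with the Linux administrator:

  # On node-a (the node where the pod is crashing): watch for VXLAN traffic on UDP 8472
  sudo tcpdump -i any udp and port 8472

  # Generic UDP listener test on an arbitrary port
  # On node-a, start a listener:
  nc -ul 9999
  # On a second node, connect and type a few characters; they should appear on node-a:
  nc -u node-a 9999

  # If nothing arrives, check the local firewall and (assuming the logging rule from
  # step 4 has been added) look for dropped VXLAN packets in the system log
  sudo firewall-cmd --state
  sudo tail -f /var/log/messages | grep "DROPPED_INPUT.*DPT=8472"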