Backup failed because of TooManySnapshot error
This is not official documentation for AutomationSuite
During a backup, Longhorn volumes are backed up by taking a snapshot of each volume and sending the snapshot to a remote location. If Longhorn cannot create a snapshot for a volume, that is, if the volume already has more than 248 snapshots, the backup does not succeed because snapshot creation fails.
To identify whether a backup is failing because of the TooManySnapshot error, check the following:
- Check the Velero logs with the following command:
kubectl logs -n velero -l app.kubernetes.io/name=velero -c velero |grep "Waiting for volumesnapshotcontents"
Sample output:
time="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csi
time="2023-12-15T08:15:59Z" level=info msg="Waiting for volumesnapshotcontents snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef to have snapshot handle. Retrying in 5s" backup=velero/daily-2 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:182" pluginName=velero-plugin-for-csi
- Check the Longhorn pod logs with the following command:
kubectl logs -n longhorn-system -l app=csi-snapshotter --tail=-1 |grep "too many snapshots created"
Sample Output:
I1215 08:39:41.707351 1 snapshot_controller.go:291] createSnapshotWrapper: CreateSnapshot for content snapcontent-f073b88e-bd01-40a8-a645-ec929e276cef returned error: rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=failed to create snapshot: proxyServer=10.42.7.56:8501 destination=10.42.7.56:10004: failed to snapshot volume: rpc error: code = Unknown desc = failed to create snapshot snapshot-f073b88e-bd01-40a8-a645-ec929e276cef for volume 10.42.7.56:10004: rpc error: code = Unknown desc = too many snapshots created] from [http://longhorn-backend:9500/v1/volumes/pvc-7d89efa4-3d60-4837-a632-f190cd3cd9ed?action=snapshotCreate]
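When the csi-snapshotter log contains many such lines, it can help to reduce them to the names of the affected volumes. The following is a minimal sketch, assuming the log lines match the sample output above, where the volume name appears as /v1/volumes/<volume-name>?action=snapshotCreate. The function name affected_volumes is hypothetical and not part of any tool.

```shell
# Hypothetical helper: read csi-snapshotter log lines on stdin and print the
# unique volume names whose snapshot creation failed with
# "too many snapshots created".
# Assumption: the request URL has the form seen in the sample output:
#   .../v1/volumes/<volume-name>?action=snapshotCreate
affected_volumes() {
  grep "too many snapshots created" \
    | sed -n 's|.*/v1/volumes/\([^?]*\)?action=snapshotCreate.*|\1|p' \
    | sort -u
}

# Usage (requires cluster access):
#   kubectl logs -n longhorn-system -l app=csi-snapshotter --tail=-1 | affected_volumes
```

Each name printed this way is a volume that needs snapshot cleanup before the next backup can succeed.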
Solution:
To clean up the snapshots for all volumes, run the script longhorn-snapshot-cleanup.sh.
When running the script, pass the following arguments:
- -u -> Longhorn backend URL. To fetch the URL, run:
kubectl get svc -n longhorn-system longhorn-backend -o json | jq -r '.spec | (.clusterIP|tostring) + ":" + (.ports[0].port|tostring)'
- -d -> Number of days. Snapshots older than the given number of days are deleted.
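Putting the two arguments together, an invocation might look like the sketch below. It assumes that longhorn-snapshot-cleanup.sh is in the current directory and executable, and that kubectl and jq are installed. The function names is_valid_days and run_snapshot_cleanup and the retention value of 10 days are illustrative only.

```shell
# Hypothetical helper: succeed only if the argument is a non-negative integer.
is_valid_days() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: look up the longhorn-backend URL and run the cleanup
# script with the -u and -d arguments described above.
run_snapshot_cleanup() {
  days="$1"
  if ! is_valid_days "$days"; then
    echo "usage: run_snapshot_cleanup <days>" >&2
    return 1
  fi
  # Same lookup as in the -u description; yields <clusterIP>:<port>.
  url=$(kubectl get svc -n longhorn-system longhorn-backend -o json \
    | jq -r '.spec | (.clusterIP|tostring) + ":" + (.ports[0].port|tostring)') || return 1
  ./longhorn-snapshot-cleanup.sh -u "$url" -d "$days"
}

# Usage (requires cluster access):
#   run_snapshot_cleanup 10
```

Validating the number of days first keeps a mistyped argument from ever reaching the cleanup script.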
Use the following command to list the volumes affected by the TooManySnapshot error:
kubectl get volumes -n longhorn-system -o json | jq -r '.items[] | select(([ .status.conditions[] | select(.type == "toomanysnapshots" and .status == "True") ] | length ) == 1 ) | .metadata.name'
How to automate snapshot cleanup
To delete snapshots automatically, create a CronJob using the file below. Run kubectl apply -f snapshot-cleanup.yaml to create the CronJob. You can change the schedule parameter as needed.
The script can be found here: longhorn-snapshot-cleanup.yaml