Migrate to CSI
It would probably be good at this point to rewrite ghost using CSI (the Container Storage Interface), replacing the old and by now unmaintained "flexvolume" approach.
This would be a significant undertaking, but it has the potential to simplify the design of ghost, delegating more parts to standard Kubernetes components. It also makes the system more future-proof.
Advantages of the CSI driver API over the flexvolume one:
- There are separate commands for attaching vs. mounting, and for detaching vs. unmounting. That helps with better error reporting and with situations where one of those actions fails while the other succeeds, and it eliminates the need for tracking attachment/mount state locally.
- We can set a static "volume context" on a disk when it's created, containing the cosmos ID. This context is passed to all subsequent attach and detach calls, so the driver always knows which cosmos disk is involved.
- CSI has capabilities (LIST_VOLUMES, LIST_VOLUMES_PUBLISHED_NODES) that let Kubernetes ask the driver which volumes are currently attached; flexvolumes lack this. Hopefully Kubernetes actually uses this when a node is rebooted, checking which old disks are still present so it doesn't need to reattach all of them.
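As a rough illustration of why the call split helps, here is a minimal Go sketch of how these operations could look on ghost's side. The type and function names are illustrative stand-ins, not the real CSI gRPC interface (which defines methods like ControllerPublishVolume and NodePublishVolume), and the cosmosID context key is an assumption:

```go
package main

import "fmt"

// VolumeContext is set once when the disk is created and handed back to
// the driver on every later call, so the driver knows which cosmos disk
// is involved without keeping local state.
type VolumeContext map[string]string

// Driver holds in-memory state purely for this sketch; in real CSI,
// Kubernetes tracks which step (attach or mount) still needs to run.
type Driver struct {
	attached map[string]bool   // volumeID -> attached to this node?
	mounted  map[string]string // volumeID -> mount target
}

func NewDriver() *Driver {
	return &Driver{attached: map[string]bool{}, mounted: map[string]string{}}
}

// Attach stands in for the controller-side publish step. It can fail or
// succeed independently of Mount, which gives clearer error reporting.
func (d *Driver) Attach(volumeID string, ctx VolumeContext) error {
	fmt.Printf("attach %s (cosmos disk %s)\n", volumeID, ctx["cosmosID"])
	d.attached[volumeID] = true
	return nil
}

// Mount stands in for the node-side publish step; it is only valid once
// the volume has been attached.
func (d *Driver) Mount(volumeID, target string, ctx VolumeContext) error {
	if !d.attached[volumeID] {
		return fmt.Errorf("volume %s not attached", volumeID)
	}
	d.mounted[volumeID] = target
	fmt.Printf("mount %s at %s\n", volumeID, target)
	return nil
}

// ListVolumes is the capability flexvolume lacks: after a node reboot,
// Kubernetes could ask which volumes are still attached instead of
// blindly reattaching everything.
func (d *Driver) ListVolumes() []string {
	var ids []string
	for id := range d.attached {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	d := NewDriver()
	ctx := VolumeContext{"cosmosID": "cosmos-1234"} // set at create time
	d.Attach("vol-1", ctx)
	d.Mount("vol-1", "/var/lib/kubelet/pods/x/volumes/vol-1", ctx)
	fmt.Println("attached:", d.ListVolumes())
}
```

Because attach and mount are separate calls with separate results, a failed mount after a successful attach is visible to Kubernetes directly, with no local bookkeeping in the driver.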
All options
Given that this rewrite would be a significant undertaking, we should reexamine all our options at this point.
Rewrite ghost to use CSI
Facts
- Ghost as it is now is a bash script; a CSI version would be a Go program.
- There are separate commands for attaching vs. mounting, and for detaching vs. unmounting.
Pro
- The separate commands eliminate the need for tracking attachment/mount state locally.
- CSI has capabilities (LIST_VOLUMES, LIST_VOLUMES_PUBLISHED_NODES) that let Kubernetes ask the driver which volumes are currently attached; flexvolumes lack this. Hopefully Kubernetes actually uses this when a node is rebooted, checking which old disks are still present so it doesn't need to reattach all of them.
- simplifies design
- We can set a static "volume context" on a disk when it's created, containing the cosmos ID. This will be passed to subsequent attach and detach calls, so the driver actually knows which cosmos disk is involved.
Con
- We are inexperienced in Go.
Feelings (we do not debate these, they are always true)
- Feels like throwing away some work invested in creating current ghost.
Fix ghost without a full rewrite
In particular we'd need to fix #18 (closed); see discussion in that issue for options.
Facts
Pro
- Hopefully relatively small time investment.
Con
Feelings
- Not sure we can solve all the issues.
- Putting more time in a solution that we now know has inherent problems.
- The "local administration" necessary for ghost feels brittle, thus prone to future problems.
Ceph native
From the cluster's point of view it would be natural to talk to Ceph directly, bypassing the cosmos2 disk API.
Facts
Pro
- A clean option from the Kubernetes perspective.
- Fewer of our own tools to manage.
- Less dependency on Dom0.
Con
- Requires a lot of dev/ops work.
- Requires a lot of extra security work.
- On the current platform it requires IPv6 for the VMs.
Feelings
- We might never be able to secure this well enough.
NFS
Another option is to bypass the cosmos2 disk API, but instead of accessing Ceph disks directly, create an intermediate VM that gets a single big virtual disk for all cluster storage, and have the cluster access that storage over NFS.
Facts
- There is an existing kubernetes project that does the dynamic provisioning in this case: nfs-subdir-external-provisioner.
Pro
- Easy to set up from the cluster side: existing provisioner, no driver required.
Con
- Database performance over NFS might not suffice.
- Requires some manual work for every cluster to set up an NFS server instance.
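For illustration, the cluster-side configuration could look roughly like this, assuming nfs-subdir-external-provisioner is already deployed under its default provisioner name (the class and claim names, server details, and parameter values below are placeholders):

```yaml
# StorageClass backed by the external provisioner; the provisioner name
# must match the PROVISIONER_NAME the deployment was installed with.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner
parameters:
  archiveOnDelete: "false"
---
# Any PVC referencing the class gets its own subdirectory on the NFS export.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```

No custom driver is involved: the provisioner creates a subdirectory per claim, and the standard in-tree NFS mount handles the rest.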