Migrate to CSI
It would probably be good at this point to rewrite ghost using CSI (the Container Storage Interface), replacing the old and by now unmaintained "flexvolume" approach.
This would be a significant undertaking, but it has the potential to simplify the design of ghost, delegating more parts to standard Kubernetes components. It also makes the system more future-proof.
Advantages of the CSI driver API over the flexvolume one:
- There are separate commands for attaching vs. mounting, and for detaching vs. unmounting. That helps with better error reporting and with situations where one of those actions fails while the other succeeds, and it eliminates the need for tracking attachment/mount state locally.
- We can set a static "volume context" on a disk when it's created, containing the cosmos ID. This context is passed to all subsequent attach and detach calls, so the driver always knows which cosmos disk is involved.
- CSI has capabilities (LIST_VOLUMES, LIST_VOLUMES_PUBLISHED_NODES) that let Kubernetes ask the driver which volumes are currently attached; flexvolumes lack this. Hopefully Kubernetes actually uses this when a node is rebooted, checking which old disks are still present so it doesn't need to reattach all of them.
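As a rough illustration of why the call split helps, here is a minimal Go sketch of how these operations could look on ghost's side. The type and function names are illustrative stand-ins, not the real CSI gRPC interface (which defines methods like ControllerPublishVolume and NodePublishVolume), and the cosmosID context key is an assumption:

```go
package main

import "fmt"

// VolumeContext is set once when the disk is created and handed back to
// the driver on every later call, so the driver knows which cosmos disk
// is involved without keeping local state.
type VolumeContext map[string]string

// Driver holds in-memory state purely for this sketch; in real CSI,
// Kubernetes tracks which step (attach or mount) still needs to run.
type Driver struct {
	attached map[string]bool   // volumeID -> attached to this node?
	mounted  map[string]string // volumeID -> mount target
}

func NewDriver() *Driver {
	return &Driver{attached: map[string]bool{}, mounted: map[string]string{}}
}

// Attach stands in for the controller-side publish step. It can fail or
// succeed independently of Mount, which gives clearer error reporting.
func (d *Driver) Attach(volumeID string, ctx VolumeContext) error {
	fmt.Printf("attach %s (cosmos disk %s)\n", volumeID, ctx["cosmosID"])
	d.attached[volumeID] = true
	return nil
}

// Mount stands in for the node-side publish step; it is only valid once
// the volume has been attached.
func (d *Driver) Mount(volumeID, target string, ctx VolumeContext) error {
	if !d.attached[volumeID] {
		return fmt.Errorf("volume %s not attached", volumeID)
	}
	d.mounted[volumeID] = target
	fmt.Printf("mount %s at %s\n", volumeID, target)
	return nil
}

// ListVolumes is the capability flexvolume lacks: after a node reboot,
// Kubernetes could ask which volumes are still attached instead of
// blindly reattaching everything.
func (d *Driver) ListVolumes() []string {
	var ids []string
	for id := range d.attached {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	d := NewDriver()
	ctx := VolumeContext{"cosmosID": "cosmos-1234"} // set at create time
	d.Attach("vol-1", ctx)
	d.Mount("vol-1", "/var/lib/kubelet/pods/x/volumes/vol-1", ctx)
	fmt.Println("attached:", d.ListVolumes())
}
```

Because attach and mount are separate calls with separate results, a failed mount after a successful attach is visible to Kubernetes directly, with no local bookkeeping in the driver.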
All options
Given that this rewrite would be a significant undertaking, we should reexamine all our options at this point.
Rewrite ghost to use CSI
Facts
- Ghost as it is now is a bash script; a CSI version would be a Go program.
- There are separate commands for attaching vs. mounting, and for detaching vs. unmounting.
Pro
- The separate commands eliminate the need for tracking attachment/mount state locally.
- CSI has capabilities (LIST_VOLUMES, LIST_VOLUMES_PUBLISHED_NODES) that let Kubernetes ask the driver which volumes are currently attached; flexvolumes lack this. Hopefully Kubernetes actually uses this when a node is rebooted, checking which old disks are still present so it doesn't need to reattach all of them.
- simplifies design
- We can set a static "volume context" on a disk when it's created, containing the cosmos ID. This will be passed to subsequent attach and detach calls, so the driver actually knows which cosmos disk is involved.
Con
- We are inexperienced in Go.
Feelings (we do not debate these, they are always true)
- Feels like throwing away some work invested in creating current ghost.
Fix ghost without a full rewrite
In particular we'd need to fix #18 (closed); see discussion in that issue for options.
Facts
Pro
- Hopefully relatively small time investment.
Con
Feelings
- Not sure we can solve all the issues.
- Putting more time in a solution that we now know has inherent problems.
- The "local administration" necessary for ghost feels brittle, thus prone to future problems.
Ceph native
From the cluster's point of view it would be natural to talk to Ceph directly, bypassing the cosmos2 disk API.
Facts
Pro
- A clean option from the Kubernetes perspective.
- Fewer of our own tools to manage.
- Less dependency on Dom0.
Con
- Requires a lot of dev/ops work.
- Requires a lot of extra security work.
- On the current platform it requires IPv6 for the VMs.
Feelings
- We might never be able to secure this well enough.
NFS
Another option is to bypass the cosmos2 disk API, but instead of accessing Ceph disks directly, create an intermediate VM that gets a single big virtual disk for all cluster storage, and have the cluster access that storage over NFS.
Facts
- There is an existing kubernetes project that does the dynamic provisioning in this case: nfs-subdir-external-provisioner.
Pro
- Easy to set up from the cluster side: existing provisioner, no driver required.
Con
- Database performance over NFS might not suffice.
- Requires some manual work for every cluster to set up an NFS server instance.
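For illustration, the cluster-side configuration could look roughly like this, assuming nfs-subdir-external-provisioner is already deployed under its default provisioner name (the class and claim names, server details, and parameter values below are placeholders):

```yaml
# StorageClass backed by the external provisioner; the provisioner name
# must match the PROVISIONER_NAME the deployment was installed with.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner
parameters:
  archiveOnDelete: "false"
---
# Any PVC referencing the class gets its own subdirectory on the NFS export.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```

No custom driver is involved: the provisioner creates a subdirectory per claim, and the standard in-tree NFS mount handles the rest.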