Garbage Collection

One key difference between Concourse and other CI systems is that everything runs in isolated environments. Where some CI systems may just run builds one at a time on a single VM, reusing a working directory, Concourse creates fresh containers and volumes to ensure things can safely run in a repeatable environment, isolated from other workloads running on the same worker.

This introduces a new problem of knowing when Concourse should remove these containers and volumes. Safely identifying things for removal and then getting rid of them, releasing their resources, is the process of garbage collection.

Goals

Let's define our metrics for success:

  • Safe. There should never be a case where a build is running and a container or volume is removed out from under it, causing the build to fail. Resource checking should also never result in errors from check containers being removed. No one should even know garbage collection is happening.

  • Airtight. Everything Concourse creates, whether it's a container or volume on a worker or an entry in the database, should never leak. Each object should have a fully defined lifecycle such that there is a clear end to its use. The ATC should be interruptible at any point in time and at the very least be able to remove any state it had created beforehand.

  • Resilient. Garbage collection should never be outpaced by the workload. A single misbehaving worker should not prevent garbage collection from being performed on other workers. A slow delete of a volume should not prevent garbage collection of other things on the same worker.

How it Works

The garbage collector is a batch operation that runs every 30 seconds. The interval was chosen arbitrarily and may be reduced in the future. What matters is that the collector runs frequently enough not to be outpaced by the workload producing things, so each batch operation should be able to complete fairly quickly.
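As a rough sketch (the names and interfaces here are illustrative, not Concourse's actual code), the collector can be modeled as a ticker-driven loop that runs one batch at a time and never lets a failed pass stop future passes:

    // Hypothetical sketch of the batch garbage collector loop.
    package gc

    import (
        "log"
        "time"
    )

    type Collector interface {
        Run() error
    }

    // RunForever runs one GC pass per tick. A failed pass is logged and
    // simply retried on the next tick, so collection is never stopped for good.
    func RunForever(c Collector, interval time.Duration, stop <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()

        for {
            select {
            case <-ticker.C:
                if err := c.Run(); err != nil {
                    log.Printf("gc: batch run failed: %v", err)
                }
            case <-stop:
                return
            }
        }
    }

Something like go RunForever(collector, 30*time.Second, stop) would then keep a pass running in the background every 30 seconds.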

The batch operation first performs garbage collection within the database alone, removing rows that are no longer needed. The removal of rows in one stage will often result in removals in a later stage, so the stages are run in a fixed order.

If any of these stages fails, the garbage collector will just log an error and move on. This is so that failure to collect one class of objects does not prevent everything else from being garbage collected. Failure at any point of garbage collection is OK; it can just retry on the next pass.
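A minimal sketch of that behavior: each class of rows becomes a stage, run in order, with failures logged and skipped (the types here are invented for illustration):

    package gc

    import "log"

    // stage is one class of database rows to collect.
    type stage struct {
        name string
        run  func() error
    }

    // collectDatabaseRows runs the stages in order. A failing stage is logged
    // and skipped rather than aborting the whole pass; the next pass retries it.
    func collectDatabaseRows(stages []stage) {
        for _, s := range stages {
            if err := s.run(); err != nil {
                log.Printf("gc: failed to collect %s: %v", s.name, err)
            }
        }
    }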

After the initial pass of garbage collection in the database, there should now be a set of volumes and containers that meet the criteria for garbage collection. These two are a bit more complicated to garbage-collect; they both require talking to a worker and waiting on a potentially slow delete.

Containers and volumes are the costliest resources consumed by Concourse. There are also many of them created over time as builds execute and pipelines perform their resource checking. Therefore it is important to parallelize this aspect of garbage collection so that one slow delete or one slow worker does not cause them to pile up.

So, the next two steps are Container Collection and Volume Collection.

Container Collection

First, a fairly simple query is executed to find containers that meet the criteria for removal.

Once these containers are found, they are all deleted in parallel, with a max-in-flight limit per worker so that the worker doesn't get hammered by a burst of writes.
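One way to picture that fan-out is a goroutine per container, gated by a per-worker semaphore. This is only a sketch with invented names, not the actual implementation:

    package gc

    import "sync"

    type container struct {
        handle string
        worker string
    }

    // destroyAll deletes containers in parallel, but caps the number of
    // in-flight deletes per worker so no single worker is flooded.
    func destroyAll(containers []container, maxInFlight int, destroy func(container)) {
        sems := map[string]chan struct{}{}
        for _, c := range containers {
            if _, ok := sems[c.worker]; !ok {
                sems[c.worker] = make(chan struct{}, maxInFlight)
            }
        }

        var wg sync.WaitGroup
        for _, c := range containers {
            wg.Add(1)
            go func(c container) {
                defer wg.Done()
                sem := sems[c.worker]
                sem <- struct{}{}        // take a slot for this container's worker
                defer func() { <-sem }() // free the slot when the delete finishes
                destroy(c)
            }(c)
        }
        wg.Wait()
    }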

The deletion of each container is a careful process to ensure containers never leak and are never deleted while a user is hijacked into them:

  • If the container is CREATING, we mark it CREATED. This is a bit wonky, but it makes it easier to step the container through the rest of the lifecycle: if a container was being created on the worker, we still need to clean it up.

  • If the container is CREATED, we first check to see if it was hijacked. If not, we transition it to DESTROYING.

    If the container is hijacked, we try to find the container on the worker.

    If the worker container is found, we set a grace time on it (a period of inactivity after which the container will be reaped by the worker itself), mark the database container as discontinued, and transition the container to DESTROYING.

    If the worker container is not found, we transition the container to DESTROYING, just to funnel it down the same code path as below.

  • If the container is DESTROYING, and the container is discontinued, we check if the container has expired yet (via the grace time) by looking for it on the worker. If it's still there, we leave it alone, and leave the container in the database. If it's gone, we reap the container from the database.

    If the container is not discontinued, we destroy the container on the worker and reap the container from the database.

Note that if any step of the above process fails, the container is left in its current state in the database. A container is only ever removed from the database when it's guaranteed that everything has been cleaned up.
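Roughly, the lifecycle above boils down to a small state machine per container. The following sketch uses invented types and a made-up grace time; it mirrors the steps described above but is not Concourse's real code:

    package gc

    import "time"

    // All of these types are invented for illustration.
    type dbContainer interface {
        State() string
        Hijacked() bool
        Discontinued() bool
        Handle() string
        TransitionTo(state string) error
        Discontinue() error
        RemoveFromDB() error
    }

    type workerContainer interface {
        SetGraceTime(d time.Duration) error
    }

    type workerClient interface {
        LookupContainer(handle string) (workerContainer, bool, error)
        DestroyContainer(handle string) error
    }

    // collectContainer advances one container through its lifecycle. Any error
    // leaves the container in its current state, to be retried on the next pass.
    func collectContainer(c dbContainer, w workerClient) error {
        switch c.State() {
        case "creating":
            // Promote to CREATED so it funnels through the rest of the lifecycle.
            return c.TransitionTo("created")

        case "created":
            if !c.Hijacked() {
                return c.TransitionTo("destroying")
            }
            workerC, found, err := w.LookupContainer(c.Handle())
            if err != nil {
                return err
            }
            if found {
                // Let the worker itself reap the container after a period of
                // inactivity (the duration here is an arbitrary placeholder).
                if err := workerC.SetGraceTime(5 * time.Minute); err != nil {
                    return err
                }
                if err := c.Discontinue(); err != nil {
                    return err
                }
            }
            return c.TransitionTo("destroying")

        case "destroying":
            if c.Discontinued() {
                // Only reap the row once the worker has reaped the container.
                _, stillThere, err := w.LookupContainer(c.Handle())
                if err != nil || stillThere {
                    return err
                }
                return c.RemoveFromDB()
            }
            if err := w.DestroyContainer(c.Handle()); err != nil {
                return err
            }
            return c.RemoveFromDB()
        }

        return nil
    }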

Volume Collection

Volume collection is quite a bit simpler than Container Collection.

First, volumes are found for deletion. This is just a query for volumes that have NULL references for all four volume owners.
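For illustration only, assuming owner columns along the lines of container_id, worker_resource_cache_id, worker_base_resource_type_id, and worker_task_cache_id (the real column names may differ), the query is conceptually:

    package gc

    import "database/sql"

    // The column names below are assumptions for the sake of the sketch.
    // A volume is orphaned once none of its possible owners reference it.
    const orphanedVolumesQuery = `
        SELECT handle, worker_name, state
        FROM volumes
        WHERE container_id IS NULL
        AND worker_resource_cache_id IS NULL
        AND worker_base_resource_type_id IS NULL
        AND worker_task_cache_id IS NULL
    `

    func findOrphanedVolumes(db *sql.DB) (*sql.Rows, error) {
        return db.Query(orphanedVolumesQuery)
    }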

Next, each CREATED volume is transitioned to DESTROYING. This transition can fail if the volume is being used as the parent of a copy-on-write volume that is still in use (e.g. by a build).

Then, for each volume in the DESTROYING state, including those that were just transitioned, we execute the following in parallel (as with containers, there is a max-in-flight limit per worker):

  • First, look up the volume on the worker and destroy it if it's found.

  • Next, delete the volume from the database.

As with containers, if any part of the deletion sequence returns an error, the volume is skipped. A volume is only ever removed from the database when it's guaranteed that everything has been cleaned up.
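Put together, each volume goes through a short sequence that only touches the database once the worker-side volume is gone. The sketch below uses invented interfaces and is only meant to illustrate the ordering:

    package gc

    // Invented interfaces for illustration only.
    type dbVolume interface {
        State() string
        Handle() string
        TransitionToDestroying() error // fails if a copy-on-write child is still in use
        RemoveFromDB() error
    }

    type volumeWorker interface {
        LookupVolume(handle string) (found bool, err error)
        DestroyVolume(handle string) error
    }

    // collectVolume advances one volume. Any error leaves the row in place,
    // to be retried on the next pass.
    func collectVolume(v dbVolume, w volumeWorker) error {
        if v.State() == "created" {
            if err := v.TransitionToDestroying(); err != nil {
                return err
            }
        }

        // Destroy the real volume on the worker first...
        found, err := w.LookupVolume(v.Handle())
        if err != nil {
            return err
        }
        if found {
            if err := w.DestroyVolume(v.Handle()); err != nil {
                return err
            }
        }

        // ...and only remove the database row once nothing is left to leak.
        return v.RemoveFromDB()
    }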