Gitlab Runner autoscaling infrastructure on Hetzner Cloud with Terraform

We have been running Gitlab with its integrated CI system for a couple of years. To process our CI builds, we ran just 3 workers distributed across idle machines (e.g. a server in the office, a support/internal testing server etc.). Since we now have more developers committing regularly, our CI builds sometimes take a long time to get to the front of the queue. So it was about time to add more workers.

Also, this seemed like a good time to play around with Hetzner’s newish Cloud offering, with VMs starting at 3-10 EUR for a typical runner size.

First iteration: running additional static (but immutable) workers

My first idea was to just use Terraform to deploy 1-3 more workers once. Following DevOps practices, all the configuration should be fully automated. As we use Ansible internally for almost all operations, the plan was to use Terraform to order the VMs and then use Ansible as the provisioner.

Turns out that is not necessary, as Hetzner allows you to pass a Cloud-Init YAML configuration, which covers a lot of the simpler provisioning use cases. In the end we ended up with the configuration that I documented in this Gist.

Basically:

use the Terraform Hcloud provider
use terraform apply to order $count VMs
use a cloudinit.yml and pass it as “user_data”. That takes care of:
- adding Docker and Gitlab apt-repositories
- installing both docker and Gitlab-runner
- running gitlab-runner register at the end to automatically register the runner at your Gitlab instance

Great! Only 2 config files!

Deployed, and it worked relatively well after figuring some things out. Things that went wrong the first few times / lessons learned:

How to pass array values to gitlab-runner register: use the flag multiple times.
No swap. By default the VM has no swap. One of our tests uses a virus scanner (clamd), Selenium (Firefox), Elasticsearch and the app, which blew through the 4GB RAM of the second-smallest instance. Luckily you can specify swap in cloud-init.
Unregister on infrastructure change. If you change part of the config or want to remove some hosts, you need to unregister the runner (or I guess you don’t “need” to, but otherwise you end up with zombie runners in the UI).
The first line of cloudinit.yml MUST be #cloud-config.
Both Docker and Gitlab always use some curl|sudo bash install scripts for adding keys. To find out the correct key, I needed to use keyserver.ubuntu.com, gpg --recv-keys, or just try adding all the keys that apt yells at you about after updating from an unverified source.
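
To illustrate the first lesson, here is a minimal shell sketch of how a list of values turns into the same flag repeated once per value. The helper name repeat_flag is hypothetical and only for illustration; it is not part of gitlab-runner.

```shell
#!/bin/sh
# repeat_flag FLAG VALUE...  (hypothetical helper, not part of gitlab-runner)
# Prints one "FLAG VALUE" pair per line, i.e. the flag is repeated
# once for every value in the list.
repeat_flag() {
  flag="$1"
  shift
  for value in "$@"; do
    printf '%s %s\n' "$flag" "$value"
  done
}

# Three service patterns become three occurrences of the same flag.
repeat_flag --docker-allowed-services 'redis:*' 'postgres:*' 'mysql:*'
```

The register script further down uses exactly this shape by hand: each --docker-allowed-images and --docker-allowed-services value gets its own occurrence of the flag.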

Full source code follows (normally, I would split up the tf files). Also generate an SSH key beforehand at keys/id_rsa(.pub):

# main.tf - putting all the variables into files like creds.auto.tfvars
variable "hcloud_token" {}
provider "hcloud" {
  token = "${var.hcloud_token}"
}

resource "hcloud_ssh_key" "default" {
  name = "Terraform Key"
  public_key = "${file("keys/id_rsa.pub")}"
}

variable "default_image" {}
variable "allowed_images_extra" {}
variable "registration_token" {}
variable "gitlab_instance_url" {}

data "template_file" "cloudinit" {
  template = "${file("templates/cloudinit.yml.cfg")}"
  vars = {
    gitlab_url = "${var.gitlab_instance_url}"
    registration_token = "${var.registration_token}"
    allowed_images_extra = "${var.allowed_images_extra}"
    default_image = "${var.default_image}"
    tag_list = "docker"
  }
}

resource "hcloud_server" "worker" {
  count = 0
  name = "hcworker-${count.index}"
  image = "ubuntu-18.04"
  server_type = "cx21"
  location = "nbg1"
  ssh_keys = ["${hcloud_ssh_key.default.id}"]
  user_data = "${data.template_file.cloudinit.rendered}"

  provisioner "remote-exec" {
    inline = [
      "gitlab-runner unregister --all-runners || true"
    ]
    when = "destroy"
    connection {
      type     = "ssh"
      user     = "root"
      private_key = "${file("keys/id_rsa")}"
    }
  }
}

#cloud-config
# templates/cloudinit.yml.cfg
groups:
- docker
users:
- name: gitlab-runner
  groups: docker
apt:
  sources:
    docker.list:
      source: 'deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable'
      keyid: 0EBFCD88 # GPG key ID published on a key server
    gitlab.list:
      source: 'deb https://packages.gitlab.com/runner/gitlab-runner/ubuntu/ $RELEASE main'
      keyid: "0x14219A96E15E78F4"
      keyserver: keyserver.ubuntu.com
    gitlab2.list:
      keyid: "0x3CFCF9BAF27EAB47"
      keyserver: keyserver.ubuntu.com

package_upgrade: true
package_update: true

packages:
- debian-archive-keyring
- apt-transport-https
- ca-certificates
- software-properties-common
- htop
- docker-ce
- golang-go
- gitlab-runner
- fail2ban
- vim

write_files:
  - owner: root:root
    path: /etc/cron.d/your_cronjob
    content: "0 5 * * * root (/usr/bin/docker ps --filter status=dead --filter status=exited -aq   |  /usr/bin/xargs /usr/bin/docker rm -v 2> /dev/null) || true"

  - owner: root:root
    path: /root/register.sh
    content: |
      gitlab-runner register --executor docker \
        -u ${gitlab_url} \
        --run-untagged \
        --tag-list ${tag_list} \
        --locked=false \
        --non-interactive \
        -r ${registration_token} \
        --docker-privileged=true \
        --docker-pull-policy=if-not-present --docker-shm-size=268435456 \
        --docker-volumes='/cache' \
        --docker-image="${default_image}" \
        --docker-allowed-images '*' \
        --docker-allowed-images '*/*' \
        --docker-allowed-images '*/*/*' \
        --docker-allowed-images '${allowed_images_extra}' \
        --docker-allowed-services 'redis:*' \
        --docker-allowed-services 'postgres:*' \
        --docker-allowed-services 'mysql:*' \
        --docker-allowed-services '${allowed_images_extra}'
runcmd:
  - [/bin/bash, /root/register.sh]

power_state:
  delay: "now"
  mode: reboot
  message: First reboot
  condition: True
swap:
  filename: /var/swap.img
  size: "auto" # or size in bytes
  maxsize: 4294967296
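
Since a missing #cloud-config header was one of the lessons learned, a small pre-flight check before running terraform apply can help. This is only a sketch: the function name has_cloud_config_header is made up for illustration, and the demo uses a throwaway file rather than the real template.

```shell
#!/bin/sh
# has_cloud_config_header FILE  (hypothetical helper)
# Succeeds only if the very first line of FILE is exactly "#cloud-config".
has_cloud_config_header() {
  first_line="$(head -n 1 "$1")"
  [ "$first_line" = "#cloud-config" ]
}

# Demo with a throwaway file instead of templates/cloudinit.yml.cfg
demo="$(mktemp)"
printf '#cloud-config\npackages:\n- htop\n' > "$demo"
if has_cloud_config_header "$demo"; then
  echo "header ok"
else
  echo "header missing"
fi
rm -f "$demo"
```

Pointing the function at templates/cloudinit.yml.cfg would catch the case where a comment or blank line slipped in above the header.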

Nice! Worked. But then I stumbled upon this idea of commanding Docker Machine to create & destroy Hetzner machines on demand. Gitlab Runner supports Docker Machine and autoscaling and has really nice business-related options, like:

OffPeakPeriods - At which times is there normally no CI activity, such as night-time, weekends etc.? For these periods you can configure a different set of available runners and save money in the end.
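
To make the idea concrete, here is a rough shell sketch that mirrors the two off-peak patterns used in the register call later in this article ('* * 0-8,18-23 * * mon-fri *' and '* * * * * sat,sun *'). It assumes the third field is the hour and the sixth the weekday, which is how I read those patterns. The function is_off_peak is hypothetical and is not how Gitlab Runner evaluates the periods internally.

```shell
#!/bin/sh
# is_off_peak HOUR WEEKDAY  (hypothetical helper)
# HOUR is an integer 0-23 without a leading zero,
# WEEKDAY is a lowercase three-letter name (mon, tue, ... sun).
# Succeeds when the given time falls into an off-peak period.
is_off_peak() {
  hour="$1"
  weekday="$2"
  # Saturday and Sunday are off-peak all day.
  case "$weekday" in
    sat|sun) return 0 ;;
  esac
  # Monday to Friday: hours 0-8 and 18-23 are off-peak.
  if [ "$hour" -le 8 ] || [ "$hour" -ge 18 ]; then
    return 0
  fi
  return 1
}

if is_off_peak 12 wed; then echo "off-peak"; else echo "peak"; fi
```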

Second iteration: autoscaling with docker-machine

It’s almost the same config. We just need the smallest VM possible as the Gitlab-Runner broker, as very little code runs on that machine. Otherwise, we added a bunch of config options to the register call and destroy all machines on shutdown.

variable "autoscaling_server" {
  default = "cx31"
}
variable "autoscaling_os" {
  default = "ubuntu-18.04"
}
variable "autoscaling_concurrency" {
  default = 3
}

data "template_file" "broker_cloudinit" {
  template = "${file("templates/broker_cloudinit.yml.cfg")}"
  vars = {
    gitlab_url = "${var.gitlab_instance_url}"
    registration_token = "${var.registration_token}"
    allowed_images_extra = "${var.allowed_images_extra}"
    default_image = "${var.default_image}"
    hcloud_token = "${var.hcloud_token}"
    tag_list = "ruby"
    autoscaling_os = "${var.autoscaling_os}"
    autoscaling_server = "${var.autoscaling_server}"
    autoscaling_concurrency = "${var.autoscaling_concurrency}"
  }
}
resource "hcloud_server" "autoscaling_boss" {
  count = 1
  name = "hcbroker-${count.index}"
  image = "ubuntu-18.04"
  server_type = "cx11"
  location = "nbg1"
  ssh_keys = ["${hcloud_ssh_key.default.id}"]
  user_data = "${data.template_file.broker_cloudinit.rendered}"

  provisioner "remote-exec" {
    inline = [
      "docker-machine kill `docker-machine ls -f '{{.Name}}'` || true",
      "docker-machine rm `docker-machine ls -f '{{.Name}}'` || true",
      "gitlab-runner unregister --all-runners || true"
    ]
    when = "destroy"
    connection {
      type     = "ssh"
      user     = "root"
      private_key = "${file("keys/id_rsa")}"
    }
  }
}

output "broker_ssh_ip" {
  value = "ssh -F ssh.config root@${join(" ", hcloud_server.autoscaling_boss.*.ipv4_address)}"
}

#cloud-config
groups:
- docker
users:
- name: gitlab-runner
  groups: docker
apt:
  sources:
    docker.list:
      source: 'deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable'
      keyid: 0EBFCD88 # GPG key ID published on a key server
    gitlab.list:
      source: 'deb https://packages.gitlab.com/runner/gitlab-runner/ubuntu/ $RELEASE main'
      keyid: "0x14219A96E15E78F4"
      keyserver: keyserver.ubuntu.com
    gitlab2.list:
      keyid: "0x3CFCF9BAF27EAB47"
      keyserver: keyserver.ubuntu.com

package_upgrade: true
package_update: true

packages:
- debian-archive-keyring
- apt-transport-https
- ca-certificates
- software-properties-common
- htop
- docker-ce
- golang-go
- gitlab-runner
- fail2ban
- vim
- curl
- tar
- jq

write_files:
  - owner: 'root:root'
    path: /root/download_driver.sh
    content: |
      echo "Downloading docker-machine"
      wget -q https://github.com/docker/machine/releases/download/v0.16.1/docker-machine-Linux-x86_64 -O /usr/bin/docker-machine
      chmod +x /usr/bin/docker-machine

      echo "Downloading docker-machine-driver-hetzner"
      mkdir /build
      cd /build
      curl -sLo hetzner.tar.gz $(curl --silent https://api.github.com/repos/JonasProgrammer/docker-machine-driver-hetzner/releases | jq -r '. | first | .assets[] | select(.name|contains("linux_amd64")).browser_download_url')
      tar xf hetzner.tar.gz
      chmod +x docker-machine-driver-hetzner
      mv docker-machine-driver-hetzner /usr/local/bin/

  - owner: 'root:root'
    path: /root/register.sh
    content: |
      gitlab-runner register \
        --executor docker+machine \
        -u ${gitlab_url} \
        --run-untagged \
        --tag-list ${tag_list} \
        --locked=false \
        --non-interactive \
        -r ${registration_token} \
        --docker-privileged=true \
        --docker-pull-policy=if-not-present --docker-shm-size=268435456 \
        --docker-volumes='/cache' \
        --docker-image="${default_image}" \
        --machine-idle-nodes 1 \
        --machine-idle-time 1800 \
        --machine-max-builds 100 \
        --machine-machine-driver "hetzner" \
        --machine-machine-name "runner-%s" \
        --machine-machine-options "hetzner-api-token=${hcloud_token}" \
        --machine-machine-options "hetzner-image=${autoscaling_os}" \
        --machine-machine-options "hetzner-server-type=${autoscaling_server}" \
        --machine-off-peak-periods '* * 0-8,18-23 * * mon-fri *' \
        --machine-off-peak-periods '* * * * * sat,sun *' \
        --machine-off-peak-timezone "Europe/Berlin" \
        --docker-allowed-images '*' \
        --docker-allowed-images '*/*' \
        --docker-allowed-images '*/*/*' \
        --docker-allowed-images '${allowed_images_extra}' \
        --docker-allowed-services 'redis:*' \
        --docker-allowed-services 'postgres:*' \
        --docker-allowed-services 'mysql:*' \
        --docker-allowed-services '${allowed_images_extra}'

      sed -i 's/concurrent = [0-9]\+/concurrent = ${autoscaling_concurrency}/' /etc/gitlab-runner/config.toml

runcmd:
  - [/bin/bash, /root/download_driver.sh]
  - [/bin/bash, /root/register.sh]

power_state:
  delay: "now"
  mode: reboot
  message: First reboot
  condition: True
swap:
  filename: /var/swap.img
  size: "auto" # or size in bytes
  maxsize: 4294967296
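
Both cloud-init files give sizes in raw bytes (--docker-shm-size=268435456 and maxsize: 4294967296). As a quick sanity check, this shell arithmetic shows where those numbers come from. The helper names are hypothetical, and a shell with 64-bit arithmetic is assumed.

```shell
#!/bin/sh
# Hypothetical helpers to convert binary units into bytes.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

mib_to_bytes 256   # 268435456, the --docker-shm-size value
gib_to_bytes 4     # 4294967296, the swap maxsize value
```

So the shared memory size is 256 MiB and the swap file is capped at 4 GiB.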

Things of note:

We download the latest Docker Machine Hetzner driver from JonasProgrammer/docker-machine-driver-hetzner
The Docker Machine version is still hardcoded to the latest one; maybe it could be fetched the same way as the driver
The number of concurrently allowed runners must be set with sed … this is not possible on register yet (Issue)
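
The sed workaround can be tried out safely on a scratch file first. The following is a sketch under assumptions: set_concurrent is a made-up helper name, and it uses a portable sed pattern plus a temporary file instead of the GNU-specific -i and \+ from the register script.

```shell
#!/bin/sh
# set_concurrent FILE VALUE  (hypothetical helper)
# Rewrites the top-level "concurrent = N" line of a config.toml-style file.
set_concurrent() {
  file="$1"
  value="$2"
  tmp="$(mktemp)"
  # Write to a temp file, then copy back so the original file keeps its permissions.
  sed "s/^concurrent = [0-9][0-9]*/concurrent = $value/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demo on a scratch file that imitates the top of /etc/gitlab-runner/config.toml
scratch="$(mktemp)"
printf 'concurrent = 1\ncheck_interval = 0\n' > "$scratch"
set_concurrent "$scratch" 3
grep '^concurrent' "$scratch"
rm -f "$scratch"
```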

But, in the end, your bill will look a little funny:

Hetzner bill with a lot of positions.

Third iteration: adding S3-compatible proxy and registry cache

Autoscaling is great, but when running on new VMs all the time, there is no cache that can be used by dependency management, and the Docker images in use have to be downloaded over and over again (if you are hosting your Docker images on S3, that also impacts your bill needlessly). So, adding a caching layer is the next step.

See this gist for the full source code (and the final version).

In this iteration, we order a volume from Hetzner and mount it on the server via cloud-init. This volume can be resized on demand later (but not downsized) if the cache or registry grows too large. We then run the S3-compatible proxy (Minio) and the Docker registry with Docker and point both to the new volume. That should improve the initial performance by quite a bit and reduce intermittent failures when contacting Rubygems or NPM repositories.

The Gitlab docs are amazing and already present the snippets that add an S3-compatible caching layer and a registry. As we had already installed Docker (for no reason anymore) on the Runner Docker Machine host, we can easily add that into the flow and host both services on the same mini-VM. Currently, those 2 services plus Gitlab-runner take about 300MB of RAM, so using the smallest machine (2 GB), we still have plenty of margin.

If this article helped you, shoot me a Thank You on Twitter (@stefanwienert) :)

Enjoy!