Scientyfic World

How I Got n8n Running on Kubernetes

A comprehensive guide to help you deploy n8n on Google Cloud using Google Kubernetes Engine for scalable, secure, and automated workflow management.

Most teams reach for n8n because they want workflow automation without paying per-execution. Then they try to self-host it on a single VM and it works—until it doesn’t. Queue backs up, the process dies, nobody knows why, and suddenly someone’s manually re-running 40 workflows at midnight.

I’ve seen this pattern enough times that the fix is predictable: move it to Kubernetes, run workers separately from the main process, and let the cluster handle scaling. GKE is where most teams end up because it integrates cleanly with the rest of their GCP stack. This is how that actually gets done.

Why GKE and Not Just a Bigger VM

The short answer: a bigger VM still fails the same way. The root problem isn’t resources—it’s that n8n’s main process is doing too many things at once. It’s handling the UI, the API, incoming webhooks, and executing workflows. When volume spikes, everything degrades together.

Queue mode fixes this. You separate the main process (UI + scheduling) from workers (actual execution). Workers can scale horizontally. If one crashes, others keep running. The main process stays stable.

GKE handles the orchestration side of this well—auto-scaling, node repair, rolling updates. You do give up simplicity. There’s more to configure, more that can go wrong during setup. That tradeoff is worth it past a certain workflow volume, not worth it for a hobby project.

The Cluster Setup

Standard cluster, not Autopilot. Autopilot restricts some configurations that matter for stateful workloads. Regional deployment across three zones—worth the cost if this is anything close to production.

gcloud container clusters create n8n-production-cluster \
  --region us-central1 \
  --node-locations us-central1-a,us-central1-b,us-central1-c \
  --num-nodes 1 \
  --machine-type e2-standard-2 \
  --disk-size 50GB \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

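Cluster creation takes 5–10 minutes. In a regional cluster, --num-nodes, --min-nodes and --max-nodes apply per zone, so this starts with three nodes and can grow to thirty. Once it’s up: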

gcloud container clusters get-credentials n8n-production-cluster --region us-central1
kubectl get nodes

If your nodes aren’t showing Ready, check your quota. This cluster needs six vCPUs just for its three starting e2-standard-2 nodes, and new or free-trial projects can hit regional CPU quotas without an obvious error message.

PostgreSQL First

n8n uses SQLite unless DB_TYPE and the Postgres connection variables actually reach the container. This is easy to miss. You deploy everything, it seems to work, and then you notice the logs say it’s using SQLite. Everything you thought was saved in Postgres isn’t.

Set up Postgres before you touch n8n configs.

Create a namespace first:

kubectl create namespace n8n

The storage class. There’s no separate PVC for Postgres: the StatefulSet below requests its own volume through volumeClaimTemplates, and a standalone claim would just sit unused. WaitForFirstConsumer creates each disk in the zone where its pod is scheduled, which is what you want in a three-zone cluster.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: postgres-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

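Secrets. Values under data: must be base64-encoded. That’s an encoding, not encryption: anyone who can read the Secret can decode it. Encode without a trailing newline (hence -n):

echo -n 'your-secure-password' | base64

Then put the encoded values in the Secret:

apiVersion: v1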
kind: Secret
metadata:
  name: postgres-secret
  namespace: n8n
type: Opaque
data:
  POSTGRES_USER: bjhuX3VzZXI=
  POSTGRES_PASSWORD: <your-base64-password>
  POSTGRES_DB: bjhu
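Getting this encoding wrong is one of the common reasons the database variables never take effect (see Things That Actually Break). A minimal sketch of two helpers, b64 and unb64 (hypothetical names), that encode without the newline plain echo adds and decode a value to double-check it. If you’d rather skip manual encoding entirely, kubectl create secret generic postgres-secret -n n8n --from-literal=POSTGRES_USER=n8n_user ... does it for you.

```shell
#!/bin/sh
# Hypothetical helpers for filling in a Secret's data: values.

# Encode a value with no trailing newline. Plain `echo` appends "\n",
# which would then become part of the user name or password.
# tr removes the line wrapping GNU base64 adds to long output.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a value to check what the container will actually receive.
unb64() {
  printf '%s' "$1" | base64 -d
}

# b64 n8n_user                -> bjhuX3VzZXI=
# echo n8n_user | base64      -> bjhuX3VzZXIK  (newline included by mistake)
```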

A StatefulSet fits Postgres better than a Deployment. It gives the pod a stable name (postgres-0) and a volume from volumeClaimTemplates that stays bound to it. Using a Deployment for Postgres is one of those things that seems fine until you have a pod reschedule.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: n8n
spec:
  serviceName: postgres-service
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:15-alpine
        ports:
        - containerPort: 5432
        envFrom:
        - secretRef:
            name: postgres-secret
        env:
        - name: PGDATA
          value: "/var/lib/postgresql/data/pgdata"
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          exec:
            command: ["pg_isready", "-U", "n8n_user"]
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "n8n_user"]
          initialDelaySeconds: 5
          periodSeconds: 5
  volumeClaimTemplates:
  - metadata:
      name: postgres-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: postgres-ssd
      resources:
        requests:
          storage: 20Gi

Service to expose it internally:

apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: n8n
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
    targetPort: 5432
  type: ClusterIP

Apply it all, then check that the pod is actually Running, not just Pending:

kubectl apply -f postgres-secret.yaml
kubectl apply -f postgres-storage.yaml
kubectl apply -f postgres-deployment.yaml
kubectl apply -f postgres-service.yaml
kubectl get pods -n n8n

Redis for Queue Mode

This part is straightforward. Redis sits between the main n8n process and the workers—main process puts jobs on the queue, workers pick them up.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: n8n
spec:
  replicas: 1
  strategy:
    type: Recreate   # one ReadWriteOnce volume: never run old and new pods side by side
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        ports:
        - containerPort: 6379
        args:
        - redis-server
        - --appendonly
        - "yes"
        - --maxmemory
        - "512mb"
        - --maxmemory-policy
        - "noeviction"   # an evicting policy can silently drop queued jobs
        volumeMounts:
        - name: redis-storage
          mountPath: /data
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "768Mi"   # headroom above --maxmemory 512mb so Redis isn't OOM-killed first
            cpu: "500m"
        livenessProbe:
          tcpSocket:
            port: 6379
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command: ["redis-cli", "ping"]
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: redis-storage
        persistentVolumeClaim:
          claimName: redis-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: postgres-ssd
  resources:
    requests:
      storage: 5Gi

Redis service:

apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: n8n
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379
  type: ClusterIP
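Once Redis is up, it’s worth confirming which eviction policy the running server actually uses. Redis-backed job queues like the Bull library n8n uses need every job key to stay put; with an evicting policy such as allkeys-lru, Redis can quietly drop queued jobs under memory pressure. A minimal sketch with hypothetical helper names, assuming the redis Deployment above in namespace n8n:

```shell
#!/bin/sh
# Hypothetical helper: print the maxmemory-policy of the running Redis.
# When its output is piped, redis-cli prints CONFIG GET as two plain lines
# (the setting name, then the value), so keep the second line.
redis_policy() {
  kubectl exec -n n8n deployment/redis -- redis-cli config get maxmemory-policy |
    sed -n '2p'
}

# Succeeds only when Redis will refuse writes instead of evicting keys.
check_redis_policy() {
  policy=$(redis_policy)
  if [ "$policy" = "noeviction" ]; then
    echo "ok: maxmemory-policy is noeviction"
  else
    echo "maxmemory-policy is '${policy:-unknown}'; queued jobs can be evicted" >&2
    return 1
  fi
}
```

If the check fails, change the --maxmemory-policy argument in the manifest and re-apply. A CONFIG SET from redis-cli only lasts until the pod restarts, because this Redis runs without a config file.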

Deploying n8n — Main Process and Workers Separately

This is where queue mode configuration has to be right or nothing works. The EXECUTIONS_MODE: "queue" env var is what switches n8n into this mode. If it’s missing or wrong, n8n runs in regular mode and the workers do nothing.

ConfigMap holds the non-sensitive config. Secrets hold passwords and encryption keys.

apiVersion: v1
kind: Secret
metadata:
  name: n8n-secret
  namespace: n8n
type: Opaque
data:
  N8N_ENCRYPTION_KEY: <base64-encoded-32-char-key>
  DB_POSTGRESDB_PASSWORD: <base64-same-as-postgres-secret>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: n8n-config
  namespace: n8n
data:
  DB_TYPE: "postgresdb"
  DB_POSTGRESDB_HOST: "postgres-service"
  DB_POSTGRESDB_PORT: "5432"
  DB_POSTGRESDB_DATABASE: "n8n"
  DB_POSTGRESDB_USER: "n8n_user"
  DB_POSTGRESDB_SCHEMA: "public"
  EXECUTIONS_MODE: "queue"
  QUEUE_BULL_REDIS_HOST: "redis-service"
  QUEUE_BULL_REDIS_PORT: "6379"
  QUEUE_BULL_REDIS_DB: "0"
  N8N_HOST: "n8n.yourdomain.com"
  N8N_PROTOCOL: "https"
  N8N_PORT: "5678"
  WEBHOOK_URL: "https://n8n.yourdomain.com/"
  N8N_SECURE_COOKIE: "true"
  N8N_BLOCK_ENV_ACCESS_IN_NODE: "true"
  N8N_PAYLOAD_SIZE_MAX: "16"   # in MiB, not bytes
  EXECUTIONS_DATA_PRUNE: "true"
  EXECUTIONS_DATA_MAX_AGE: "168"
  GENERIC_TIMEZONE: "America/New_York"
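The Secret above leaves N8N_ENCRYPTION_KEY as a placeholder. n8n encrypts stored credentials with this key, and main and workers must share it (both read n8n-secret here), so generate it once and keep a copy outside the cluster: if it’s lost, saved credentials can’t be decrypted. A minimal sketch using od and /dev/urandom (openssl rand -hex 16 produces the same shape); the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for N8N_ENCRYPTION_KEY.

# 16 random bytes printed as hex: exactly 32 characters from [0-9a-f].
gen_n8n_key() {
  od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
}

# Base64 form for the n8n-secret data: field, without a trailing newline.
n8n_key_b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Usage:
#   key=$(gen_n8n_key)
#   echo "$key"           # keep this copy somewhere safe, outside the cluster
#   n8n_key_b64 "$key"    # paste the result into n8n-secret
```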

Main process deployment: exactly one replica. Outside n8n’s licensed multi-main setup, this is not something you scale horizontally:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-main
  namespace: n8n
spec:
  replicas: 1
  selector:
    matchLabels:
      app: n8n-main
  template:
    metadata:
      labels:
        app: n8n-main
    spec:
      containers:
      - name: n8n-main
        image: docker.n8n.io/n8nio/n8n:latest
        ports:
        - containerPort: 5678
        envFrom:
        - configMapRef:
            name: n8n-config
        - secretRef:
            name: n8n-secret
        volumeMounts:
        - name: n8n-data
          mountPath: /home/node/.n8n
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 5678
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /healthz
            port: 5678
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: n8n-data
        persistentVolumeClaim:
          claimName: n8n-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: n8n-data
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: postgres-ssd
  resources:
    requests:
      storage: 10Gi

Workers are what actually scale. Notice the command: ["n8n", "worker"]—that’s what makes this a worker pod instead of another main process. Easy thing to miss:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-worker
  namespace: n8n
spec:
  replicas: 2
  selector:
    matchLabels:
      app: n8n-worker
  template:
    metadata:
      labels:
        app: n8n-worker
    spec:
      containers:
      - name: n8n-worker
        image: docker.n8n.io/n8nio/n8n:latest
        command: ["n8n", "worker"]
        envFrom:
        - configMapRef:
            name: n8n-config
        - secretRef:
            name: n8n-secret
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          exec:
            command: ["/bin/sh", "-c", "ps aux | grep '[n]8n worker' || exit 1"]
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          exec:
            command: ["/bin/sh", "-c", "ps aux | grep '[n]8n worker' || exit 1"]
          initialDelaySeconds: 10
          periodSeconds: 10

Service to front the main process:

apiVersion: v1
kind: Service
metadata:
  name: n8n-service
  namespace: n8n
spec:
  selector:
    app: n8n-main
  ports:
  - port: 5678
    targetPort: 5678
  type: ClusterIP

SSL and Ingress

Teams underestimate how much time this step takes. The configuration looks simple. Getting cert-manager to actually issue a certificate, with DNS propagated and the ACME challenge resolving correctly—that’s where hours disappear.

Install cert-manager:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
kubectl get pods --namespace cert-manager

Wait until all cert-manager pods are Running before proceeding. Don’t skip this.

ClusterIssuer for Let’s Encrypt:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your-email>
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          name: n8n-ingress   # reuse the existing Ingress; on GKE a separate challenge Ingress would get its own IP

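Create the Ingress below, wait for its ADDRESS to appear (kubectl get ingress n8n-ingress -n n8n), and point an A record for your domain at that IP. If you’d rather set up DNS first, reserve a global static IP with gcloud compute addresses create n8n-ip --global and reference it from the Ingress with the kubernetes.io/ingress.global-static-ip-name annotation. The certificate won’t issue until DNS resolves, which takes a few minutes (sometimes longer, depending on your DNS provider’s TTL).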

Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: n8n-ingress
  namespace: n8n
  annotations:
    kubernetes.io/ingress.class: "gce"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    kubernetes.io/ingress.allow-http: "true"   # HTTP-01 needs port 80 for issuance and every renewal
spec:
  tls:
  - hosts:
    - n8n.yourdomain.com
    secretName: n8n-tls-secret
  rules:
  - host: n8n.yourdomain.com
    http:
      paths:
      - path: /*
        pathType: ImplementationSpecific
        backend:
          service:
            name: n8n-service
            port:
              number: 5678

If the certificate stays in Pending, check cert-manager logs:

kubectl describe certificate n8n-tls-secret -n n8n
kubectl logs -n cert-manager deployment/cert-manager

Usually it’s DNS not propagated yet, or the HTTP challenge endpoint isn’t reachable (check your ingress is actually handling port 80 for the ACME challenge).
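To rule out the DNS half quickly, compare the IP the Ingress got with what your resolver returns. A minimal sketch with hypothetical helper names, assuming the n8n-ingress name from above and a Linux machine with glibc’s getent (on macOS, dig +short gives the same answer):

```shell
#!/bin/sh
# Hypothetical DNS check for a certificate stuck in Pending.

# External IP the GKE load balancer assigned to the Ingress (empty until ready)
ingress_ip() {
  kubectl get ingress n8n-ingress -n n8n \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# First IPv4 address the local resolver returns for the host
dns_ip() {
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

check_dns() {
  host=$1
  want=$(ingress_ip)
  got=$(dns_ip "$host")
  if [ -z "$want" ]; then
    echo "Ingress has no external IP yet" >&2
    return 1
  fi
  if [ "$want" = "$got" ]; then
    echo "ok: $host -> $got"
  else
    echo "mismatch: $host -> ${got:-nothing}, Ingress IP is $want" >&2
    return 1
  fi
}
```

Run check_dns n8n.yourdomain.com. A mismatch usually means the A record is stale or still propagating; no IP at all means the load balancer isn’t ready yet.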

Auto-Scaling Workers

This is the part that makes the whole architecture worth it. Workers scale on CPU and memory utilization, measured against the pods’ resource requests; with two metrics, the HPA follows whichever one asks for more replicas:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker-hpa
  namespace: n8n
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60

The stabilizationWindowSeconds on scale-down matters. Without a cooldown, HPA will scale workers down too aggressively between workflow bursts and you’ll constantly be spinning them back up. 5 minutes is a reasonable starting point.
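To get a feel for what these numbers do, here’s a simplified model of the HPA’s per-metric calculation. It applies the documented formula desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), skips changes within the default 10% tolerance, and clamps to minReplicas/maxReplicas, but it ignores stabilization windows and unready pods, so treat it as a sketch rather than the controller’s exact behavior. scale_up_cap models the 50%-per-60s Percent policy. Both function names are hypothetical.

```shell
#!/bin/sh
# Simplified HPA model for one metric (utilization in percent of requests).
# desired_replicas CURRENT UTIL TARGET MIN MAX
desired_replicas() {
  cur=$1 util=$2 target=$3 min=$4 max=$5
  # Default tolerance: no change while 0.9 <= util/target <= 1.1
  if [ $((10 * util)) -ge $((9 * target)) ] && [ $((10 * util)) -le $((11 * target)) ]; then
    want=$cur
  else
    want=$(( (cur * util + target - 1) / target ))   # integer ceiling
  fi
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

# Most replicas a Percent scaleUp policy allows after one period:
#   ceil(current * (100 + percent) / 100)
scale_up_cap() {
  cur=$1 pct=$2
  echo $(( (cur * (100 + pct) + 99) / 100 ))
}
```

With this manifest’s values, two workers averaging 140% CPU against the 70% target want 4 replicas, but the Percent policy caps the first step at 3. If demand keeps climbing, going from 2 to 10 takes about four periods: 2, 3, 5, 8, then 10 (8 × 1.5 = 12, clamped to maxReplicas).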

Verify metrics-server is running—HPA doesn’t work without it:

kubectl get deployment metrics-server -n kube-system

Backups

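A CronJob that runs pg_dump nightly is the minimum. Don’t skip this because “Postgres is on persistent storage.” The persistent volume survives pod restarts; it does not protect you from accidentally deleting a workflow, a bad migration, or a corrupted volume. Dumps kept on a PVC inside the same cluster also disappear with the cluster, so copy them somewhere off-cluster (a Cloud Storage bucket, for example) on a schedule.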

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: n8n
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: postgres-backup
            image: postgres:15-alpine
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: POSTGRES_PASSWORD
            command:
            - /bin/sh
            - -c
            - |
              set -eu   # stop and fail the Job if pg_dump fails, instead of reporting success
              BACKUP_FILE="/backup/n8n-backup-$(date +%Y%m%d-%H%M%S).sql"
              pg_dump -h postgres-service -U n8n_user -d n8n -f "$BACKUP_FILE"
              gzip "$BACKUP_FILE"
              find /backup -name "*.sql.gz" -mtime +7 -delete
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-storage
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo   # GKE's built-in balanced PD class; backups don't need SSD
  resources:
    requests:
      storage: 50Gi
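It also helps to be able to run this backup on demand (right before an upgrade, say) and to know what a restore looks like before you need one. A minimal sketch with hypothetical helper names. It assumes the postgres-backup CronJob and postgres-0 pod from above and, for the restore, a .sql.gz file you’ve already copied to your machine (for example with kubectl cp from a pod that mounts backup-storage). Stop n8n first by scaling n8n-main and n8n-worker to 0, and restore into an empty database; otherwise the dump’s CREATE statements collide with existing tables.

```shell
#!/bin/sh
# Hypothetical helper: run the postgres-backup CronJob once, now,
# and wait for the resulting Job to complete.
backup_now() {
  job="manual-backup-$(date +%Y%m%d-%H%M%S)"
  kubectl create job --from=cronjob/postgres-backup "$job" -n n8n || return 1
  kubectl wait --for=condition=complete "job/$job" -n n8n --timeout=600s
}

# Hypothetical helper: stream a local gzipped plain-SQL dump into psql
# inside postgres-0. ON_ERROR_STOP makes psql exit on the first error.
restore_dump() {
  if [ ! -f "$1" ]; then
    echo "no such file: $1" >&2
    return 1
  fi
  gunzip -c "$1" |
    kubectl exec -i -n n8n postgres-0 -- psql -U n8n_user -d n8n -v ON_ERROR_STOP=1
}
```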

Things That Actually Break

n8n silently uses SQLite. Check the logs after deployment. If you see anything about SQLite, your DB env vars aren’t reaching the container. Common cause: secret encoding wrong, or the secret name in secretRef doesn’t match what you created.

kubectl exec -n n8n deployment/n8n-main -- env | grep DB_

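Workers running but not picking up jobs. Usually Redis connectivity. Test both Redis itself and the path the workers use:

kubectl exec -n n8n deployment/redis -- redis-cli ping
kubectl run redis-test --rm -it --restart=Never -n n8n --image=redis:7-alpine -- redis-cli -h redis-service -p 6379 ping

The first command only proves Redis is up inside its own pod. The second goes through redis-service, the same way the workers connect. If the first answers PONG and the second fails, check the Service and QUEUE_BULL_REDIS_HOST/QUEUE_BULL_REDIS_PORT in the n8n ConfigMap.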

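Certificate stuck in Pending. Either DNS hasn’t propagated or the Ingress isn’t serving plain HTTP (ACME’s HTTP-01 challenge needs port 80). With kubernetes.io/ingress.allow-http: "false", the GKE load balancer has no HTTP frontend at all, so the challenge can’t complete. Turning HTTP off after the first certificate is issued doesn’t fix it either, because every renewal runs the same challenge. Keep HTTP enabled for as long as you rely on HTTP-01, or switch the issuer to a DNS-01 solver, which doesn’t need port 80.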

HPA not scaling. Metrics-server not running, or your worker pods don’t have resource requests set. HPA calculates utilization as a percentage of the requested amount. If requests aren’t set, the calculation is undefined and HPA won’t scale.
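A quick way to sanity-check the number the HPA sees: it divides the pods’ total usage by their total requests. A minimal sketch with a hypothetical helper that takes usage/request pairs in millicores, which is how kubectl top pods reports CPU. The worker manifest requests 250m, so a worker using 175m sits at 70%, exactly the target.

```shell
#!/bin/sh
# Hypothetical helper: utilization as the HPA computes it,
#   total usage across pods * 100 / total requests (integer division).
# Arguments are pairs: usage1 request1 [usage2 request2 ...]
utilization_pct() {
  used=0 req=0
  while [ "$#" -ge 2 ]; do
    used=$((used + $1))
    req=$((req + $2))
    shift 2
  done
  if [ "$req" -eq 0 ]; then
    echo "no resource requests set: utilization is undefined" >&2
    return 1
  fi
  echo $((used * 100 / req))
}
```

For two workers using 100m and 300m against 250m requests each, utilization_pct 100 250 300 250 gives 80, above the 70% target, so the HPA would add replicas.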

Updating n8n

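Rolling updates work well here, as long as the image reference actually changes. Running kubectl set image with :latest against a Deployment that already says :latest leaves the pod template untouched, so nothing rolls out. Pin a version tag (in the manifests as well) and bump it for main and workers together:

kubectl set image deployment/n8n-main n8n-main=docker.n8n.io/n8nio/n8n:<new-version> -n n8n
kubectl set image deployment/n8n-worker n8n-worker=docker.n8n.io/n8nio/n8n:<new-version> -n n8n
kubectl rollout status deployment/n8n-main -n n8n
kubectl rollout status deployment/n8n-worker -n n8n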

If something breaks, rollback is one command per Deployment:

kubectl rollout undo deployment/n8n-main -n n8n
kubectl rollout undo deployment/n8n-worker -n n8n
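The pinned-version update steps can be wrapped in a small helper so main and workers never drift apart. A minimal sketch with a hypothetical function name. The main-before-workers order is my assumption (let main finish any database migration before workers on the new version start pulling jobs), not an n8n-documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: move main and workers to the same pinned n8n version.
# Refuses an empty version or "latest": re-setting an unchanged tag leaves
# the pod template as it was, so no rollout would happen.
upgrade_n8n() {
  version=$1
  case $version in
    '' | latest)
      echo "usage: upgrade_n8n <pinned-version>" >&2
      return 1
      ;;
  esac
  image="docker.n8n.io/n8nio/n8n:$version"
  # Main first and wait for it, then the workers.
  kubectl set image deployment/n8n-main n8n-main="$image" -n n8n &&
    kubectl rollout status deployment/n8n-main -n n8n &&
    kubectl set image deployment/n8n-worker n8n-worker="$image" -n n8n &&
    kubectl rollout status deployment/n8n-worker -n n8n
}
```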

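This is one of the actual day-to-day benefits of running on Kubernetes: updates that would mean planned downtime on a VM become a short, observable rollout. Two caveats. Rolling back the image doesn’t roll back database migrations the new version already ran, so take a backup first. And n8n-main mounts a ReadWriteOnce volume, so if the replacement pod lands on a different node the rollout can hang on a Multi-Attach error; setting that Deployment’s strategy to Recreate avoids this at the cost of a brief restart.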

Cost

A regional cluster with auto-scaling configured this way typically runs $80–150/month at low workflow volume, depending on how often workers scale up. Putting workers on Spot VMs (GCP’s successor to preemptible VMs) cuts compute costs significantly. Workers keep no state of their own, so losing one is survivable, but an execution that is running when a node is reclaimed can fail and may need a rerun.

gcloud container node-pools create spot-workers \
  --cluster n8n-production-cluster \
  --region us-central1 \
  --machine-type e2-standard-2 \
  --spot \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 5 \
  --node-taints spot=true:NoSchedule

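Add a toleration and node selector to the worker Deployment’s pod template (under spec.template.spec) so pods actually schedule on spot nodes. With the nodeSelector, workers run only on spot nodes; drop it if you want them to fall back to the default pool when spot capacity runs out: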

tolerations:
- key: spot
  operator: Equal
  value: "true"
  effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-spot: "true"
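If you’d rather apply this to the running Deployment than edit the manifest by hand, a JSON merge patch works. A minimal sketch with a hypothetical function name; keep the manifest as the source of truth by adding the same block there too, otherwise a cluster rebuilt from the manifests won’t have it.

```shell
#!/bin/sh
# Hypothetical helper: add the spot toleration and nodeSelector to the
# running n8n-worker Deployment. Both belong to the pod template
# (spec.template.spec), which is why the patch nests them there.
# A JSON merge patch replaces the whole tolerations list.
pin_workers_to_spot() {
  kubectl patch deployment n8n-worker -n n8n --type merge -p '{
    "spec": {"template": {"spec": {
      "tolerations": [
        {"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
      ],
      "nodeSelector": {"cloud.google.com/gke-spot": "true"}
    }}}
  }'
}
```

Because this changes the pod template, it triggers a rolling restart that moves the workers onto the spot pool.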

Monitor actual usage before adjusting resource requests. First-pass requests are often well above what the pods actually use.

kubectl top pods -n n8n
kubectl top nodes

That’s the full setup. It’s more moving parts than a single VM, but each part does one thing, which makes debugging much more tractable. When something goes wrong—and it will—you can isolate which component is failing and fix it without touching everything else.

FAQs

What are the main benefits of deploying n8n on Google Kubernetes Engine (GKE) instead of Compute Engine or a standalone VM?

Deploying n8n on GKE lets you scale worker pods horizontally as load changes, restarts failed pods and repairs nodes automatically, and handles rolling updates and certificate management through Kubernetes resources. The result is better reliability and easier maintenance than a single VM, in exchange for more moving parts to set up.

How do I enable HTTPS and secure my n8n deployment with SSL certificates?

You can configure HTTPS on GKE by using Kubernetes Ingress resources along with cert-manager (for Let’s Encrypt certificates) or Google-managed SSL certificates. This setup allows you to automatically provision, renew, and manage SSL certificates for your domain, ensuring all web traffic to your n8n instance is encrypted.

What is the best way to ensure high availability and disaster recovery for my n8n instance on GKE?

For high availability:
1. Deploy your GKE cluster across multiple zones (regional clusters).
2. Use PersistentVolumes with SSD-backed storage for PostgreSQL.
3. Enable Horizontal Pod Autoscaling and cluster autoscaling.
For disaster recovery:
1. Schedule regular backups of your PostgreSQL database using Kubernetes CronJobs.
2. Backup your Kubernetes manifests (secrets, ConfigMaps, deployments) so you can redeploy quickly if necessary.

How can I optimize costs when running n8n on GKE for production workloads?

To manage costs effectively:
1. Enable node and pod autoscaling, so resources only scale during peak times.
2. Use Google Cloud’s Spot VM node pools for n8n workers. Workers keep no local state, so losing one is survivable, though an execution running on a reclaimed node can fail and need a rerun.
3. Regularly monitor resource requests/limits and adjust based on actual usage.
4. Use standard storage for backups and SSD storage for databases.

Can I update n8n to the latest version without downtime, and how do I handle updates on GKE?

Yes. With Kubernetes, you can perform rolling updates:
1. Update the Docker image tag in your n8n deployment manifests to the desired version.
2. Apply the changes using kubectl apply to trigger a rolling update, ensuring new pods come up before the old ones terminate.
3. Monitor the rollout to ensure no errors occur.
4. This keeps upgrade downtime to a minimum and makes rollbacks easy if issues are detected.

Snehasish Konger
Developed @scientyficworld.org | Technical writer @Nected | Content Developer