Most teams reach for n8n because they want workflow automation without paying per-execution. Then they try to self-host it on a single VM and it works—until it doesn’t. Queue backs up, the process dies, nobody knows why, and suddenly someone’s manually re-running 40 workflows at midnight.
I’ve seen this pattern enough times that the fix is predictable: move it to Kubernetes, run workers separately from the main process, and let the cluster handle scaling. GKE is where most teams end up because it integrates cleanly with the rest of their GCP stack. This is how that actually gets done.
Why GKE and Not Just a Bigger VM
The short answer: a bigger VM still fails the same way. The root problem isn’t resources—it’s that n8n’s main process is doing too many things at once. It’s handling the UI, the API, incoming webhooks, and executing workflows. When volume spikes, everything degrades together.
Queue mode fixes this. You separate the main process (UI + scheduling) from workers (actual execution). Workers can scale horizontally. If one crashes, others keep running. The main process stays stable.
GKE handles the orchestration side of this well—auto-scaling, node repair, rolling updates. You do give up simplicity. There’s more to configure, more that can go wrong during setup. That tradeoff is worth it past a certain workflow volume, not worth it for a hobby project.
The Cluster Setup
Standard cluster, not Autopilot. Autopilot restricts some configurations that matter for stateful workloads. Regional deployment across three zones—worth the cost if this is anything close to production.
gcloud container clusters create n8n-production-cluster \
  --region us-central1 \
  --node-locations us-central1-a,us-central1-b,us-central1-c \
  --num-nodes 1 \
  --machine-type e2-standard-2 \
  --disk-size 50GB \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade
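Cluster creation takes 5–10 minutes. In a regional cluster, --num-nodes, --min-nodes, and --max-nodes apply per zone, so this starts with three nodes and can grow to 30. Once it’s up: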
gcloud container clusters get-credentials n8n-production-cluster --region us-central1
kubectl get nodes
If your nodes aren’t showing Ready, check your quota. GCP free tier accounts sometimes hit vCPU limits silently and the error message isn’t obvious.
PostgreSQL First
n8n uses SQLite unless DB_TYPE and the rest of the database settings actually reach the container. This is easy to miss. You deploy everything, it seems to work, and then you notice the logs say it’s using SQLite. Everything you thought was saved in Postgres isn’t.
Set up Postgres before you touch n8n configs.
Create a namespace first:
kubectl create namespace n8n
The storage class (the StatefulSet below creates its own claim through volumeClaimTemplates, so no standalone PVC is needed for Postgres):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: postgres-ssd
# GKE's Persistent Disk CSI driver
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
# Create each disk in the zone where its pod is scheduled. This replaces the
# deprecated "zones" parameter.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Secrets—base64 encoded, not plain text:
echo -n 'your-secure-password' | base64
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: n8n
type: Opaque
data:
  POSTGRES_USER: bjhuX3VzZXI=
  POSTGRES_PASSWORD: <your-base64-password>
  POSTGRES_DB: bjhu
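A quick way to sanity-check encoded values before they go into the manifest, as a minimal sketch. The helper names b64 and unb64 are made up for illustration. It also shows why the -n flag matters: without it, the trailing newline is encoded into the secret and the password no longer matches.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of kubectl or n8n.

# Encode a value without a trailing newline (same effect as `echo -n`).
b64() {
  printf '%s' "$1" | base64
}

# Decode a value to confirm what will actually end up in the container.
unb64() {
  printf '%s' "$1" | base64 -d
}

b64 'n8n_user'        # bjhuX3VzZXI=
b64 'n8n'             # bjhu
unb64 'bjhuX3VzZXI='  # n8n_user
echo

# Plain `echo` appends a newline, which becomes part of the encoded value.
echo 'n8n' | base64   # bjhuCg==
```

The first two outputs match the POSTGRES_USER and POSTGRES_DB values in the Secret above.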
The StatefulSet matters here more than a Deployment. Postgres needs stable network identity and persistent storage. Using a Deployment for Postgres is one of those things that seems fine until you have a pod reschedule.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: n8n
spec:
  serviceName: postgres-service
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          ports:
            - containerPort: 5432
          envFrom:
            - secretRef:
                name: postgres-secret
          env:
            - name: PGDATA
              value: "/var/lib/postgresql/data/pgdata"
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "n8n_user"]
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "n8n_user"]
            initialDelaySeconds: 5
            periodSeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: postgres-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: postgres-ssd
        resources:
          requests:
            storage: 20Gi
Service to expose it internally:
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: n8n
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
  type: ClusterIP
Apply it all, then check the pod is actually Running not just Pending:
kubectl apply -f postgres-secret.yaml
kubectl apply -f postgres-storage.yaml
kubectl apply -f postgres-deployment.yaml
kubectl apply -f postgres-service.yaml
kubectl get pods -n n8n
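Rather than eyeballing the pod list, you can block until Postgres is genuinely accepting connections. This is a sketch that assumes the names used above (namespace n8n, StatefulSet postgres, user n8n_user); wait_for_postgres is a made-up helper, not a kubectl feature.

```shell
#!/bin/sh
# Sketch: wait for the Postgres StatefulSet to roll out, then confirm the
# server accepts connections. Names are the ones used in this article.
wait_for_postgres() {
  ns="${1:-n8n}"
  # Fails if the StatefulSet's pod is not ready within the timeout.
  kubectl rollout status statefulset/postgres -n "$ns" --timeout=180s || return 1
  # StatefulSet pods get stable names, so the first replica is postgres-0.
  kubectl exec -n "$ns" postgres-0 -- pg_isready -U n8n_user || return 1
  echo "postgres is ready"
}

# Usage: wait_for_postgres n8n
```

If the rollout times out, kubectl describe pod postgres-0 -n n8n usually shows whether the volume or the image pull is the holdup.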
Redis for Queue Mode
This part is straightforward. Redis sits between the main n8n process and the workers—main process puts jobs on the queue, workers pick them up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: n8n
spec:
  replicas: 1
  # The ReadWriteOnce volume can attach to only one node at a time, so stop
  # the old pod before starting its replacement.
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          args:
            - redis-server
            - --appendonly
            - "yes"
            # Kept below the 512Mi container limit to leave headroom.
            - --maxmemory
            - "384mb"
            # A job queue must never lose keys: refuse writes when full
            # instead of evicting queued jobs.
            - --maxmemory-policy
            - "noeviction"
          volumeMounts:
            - name: redis-storage
              mountPath: /data
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            tcpSocket:
              port: 6379
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: redis-storage
          persistentVolumeClaim:
            claimName: redis-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: postgres-ssd
  resources:
    requests:
      storage: 5Gi
Redis service:
apiVersion: v1
kind: Service
metadata:
  name: redis-service
  namespace: n8n
spec:
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
  type: ClusterIP
Deploying n8n — Main Process and Workers Separately
This is where queue mode configuration has to be right or nothing works. The EXECUTIONS_MODE: "queue" env var is what switches n8n into this mode. If it’s missing or wrong, n8n runs in regular mode and the workers do nothing.
ConfigMap holds the non-sensitive config. Secrets hold passwords and encryption keys.
apiVersion: v1
kind: Secret
metadata:
  name: n8n-secret
  namespace: n8n
type: Opaque
data:
  N8N_ENCRYPTION_KEY: <base64-encoded-32-char-key>
  DB_POSTGRESDB_PASSWORD: <base64-same-as-postgres-secret>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: n8n-config
  namespace: n8n
data:
  DB_TYPE: "postgresdb"
  DB_POSTGRESDB_HOST: "postgres-service"
  DB_POSTGRESDB_PORT: "5432"
  DB_POSTGRESDB_DATABASE: "n8n"
  DB_POSTGRESDB_USER: "n8n_user"
  DB_POSTGRESDB_SCHEMA: "public"
  EXECUTIONS_MODE: "queue"
  QUEUE_BULL_REDIS_HOST: "redis-service"
  QUEUE_BULL_REDIS_PORT: "6379"
  QUEUE_BULL_REDIS_DB: "0"
  N8N_HOST: "n8n.yourdomain.com"
  N8N_PROTOCOL: "https"
  N8N_PORT: "5678"
  WEBHOOK_URL: "https://n8n.yourdomain.com/"
  N8N_SECURE_COOKIE: "true"
  N8N_BLOCK_ENV_ACCESS_IN_NODE: "true"
  # Value is in MiB, not bytes.
  N8N_PAYLOAD_SIZE_MAX: "16"
  EXECUTIONS_DATA_PRUNE: "true"
  # Value is in hours (168 = 7 days).
  EXECUTIONS_DATA_MAX_AGE: "168"
  GENERIC_TIMEZONE: "America/New_York"
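For the 32-character encryption key placeholder above, one way to produce a value is sketched below. gen_encryption_key is a hypothetical helper, and reading from /dev/urandom is just one reasonable source of randomness, not something n8n prescribes.

```shell
#!/bin/sh
# Sketch: generate a 32-character hex key from the kernel's random source.
# gen_encryption_key is a hypothetical helper, not an n8n command.
gen_encryption_key() {
  # 16 random bytes printed as hex give 32 characters.
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

key="$(gen_encryption_key)"
echo "key length: ${#key}"                      # key length: 32
printf '%s' "$key" | base64 | tr -d '\n'; echo  # value for N8N_ENCRYPTION_KEY
```

Store the raw key somewhere safe outside the cluster. n8n uses it to encrypt saved credentials, so a lost or changed key leaves them unreadable.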
Main process deployment—always exactly one replica. This is not a thing you scale horizontally:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-main
  namespace: n8n
spec:
  replicas: 1
  # The ReadWriteOnce volume can attach to only one node at a time, so stop
  # the old pod before starting its replacement.
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: n8n-main
  template:
    metadata:
      labels:
        app: n8n-main
    spec:
      containers:
        - name: n8n-main
          # Pin a specific version tag in production; :latest makes rollouts
          # and rollbacks ambiguous.
          image: docker.n8n.io/n8nio/n8n:latest
          ports:
            - containerPort: 5678
          envFrom:
            - configMapRef:
                name: n8n-config
            - secretRef:
                name: n8n-secret
          volumeMounts:
            - name: n8n-data
              mountPath: /home/node/.n8n
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 5678
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /healthz
              port: 5678
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: n8n-data
          persistentVolumeClaim:
            claimName: n8n-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: n8n-data
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: postgres-ssd
  resources:
    requests:
      storage: 10Gi
Workers are what actually scale. Notice the command: ["n8n", "worker"]—that’s what makes this a worker pod instead of another main process. Easy thing to miss:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: n8n-worker
  namespace: n8n
spec:
  replicas: 2
  selector:
    matchLabels:
      app: n8n-worker
  template:
    metadata:
      labels:
        app: n8n-worker
    spec:
      containers:
        - name: n8n-worker
          image: docker.n8n.io/n8nio/n8n:latest
          command: ["n8n", "worker"]
          envFrom:
            - configMapRef:
                name: n8n-config
            - secretRef:
                name: n8n-secret
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          livenessProbe:
            exec:
              command: ["/bin/sh", "-c", "ps aux | grep '[n]8n worker' || exit 1"]
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            exec:
              command: ["/bin/sh", "-c", "ps aux | grep '[n]8n worker' || exit 1"]
            initialDelaySeconds: 10
            periodSeconds: 10
Service to front the main process:
apiVersion: v1
kind: Service
metadata:
  name: n8n-service
  namespace: n8n
spec:
  selector:
    app: n8n-main
  ports:
    - port: 5678
      targetPort: 5678
  type: ClusterIP
SSL and Ingress
Teams underestimate how much time this step takes. The configuration looks simple. Getting cert-manager to actually issue a certificate, with DNS propagated and the ACME challenge resolving correctly—that’s where hours disappear.
Install cert-manager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
kubectl get pods --namespace cert-manager
Wait until all cert-manager pods are Running before proceeding. Don’t skip this.
ClusterIssuer for Let’s Encrypt:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your-email-address>
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: gce
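The load balancer only gets its external IP once the Ingress exists, so you have two options. Either reserve a global static IP up front and reference it from the Ingress with the kubernetes.io/ingress.global-static-ip-name annotation, or create the Ingress, read the address from kubectl get ingress -n n8n, and then create the A record. Either way, the certificate won’t issue until DNS resolves to that IP. This takes a few minutes (sometimes longer depending on your DNS provider’s TTL).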
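Since most certificate trouble traces back to DNS, it helps to confirm the record before involving cert-manager at all. The sketch below is one way to do that; ip_in_list and dns_points_to are hypothetical helper names, and getent ahostsv4 assumes a glibc-based system.

```shell
#!/bin/sh
# Sketch: confirm the domain resolves to the load balancer IP before asking
# cert-manager for a certificate. Both function names are hypothetical.

# Succeeds if the expected IP ($1) appears in a newline-separated list ($2).
ip_in_list() {
  printf '%s\n' "$2" | grep -Fxq "$1"
}

# Resolve a hostname with getent (glibc-based systems) and compare.
dns_points_to() {
  host="$1"; expected="$2"
  resolved="$(getent ahostsv4 "$host" | awk '{print $1}' | sort -u)"
  ip_in_list "$expected" "$resolved"
}

# Usage (placeholders, not real values):
#   dns_points_to n8n.yourdomain.com <ingress-ip> && echo "DNS is ready"
```

Keep in mind this only reflects what your local resolver sees. Let’s Encrypt resolves the name from its own servers, so a passing check right after a record change is a good sign rather than a guarantee.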
Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: n8n-ingress
  namespace: n8n
  annotations:
    kubernetes.io/ingress.class: "gce"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    kubernetes.io/ingress.allow-http: "false"
spec:
  tls:
    - hosts:
        - n8n.yourdomain.com
      secretName: n8n-tls-secret
  rules:
    - host: n8n.yourdomain.com
      http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: n8n-service
                port:
                  number: 5678
If the certificate stays in Pending, check cert-manager logs:
kubectl describe certificate n8n-tls-secret -n n8n
kubectl logs -n cert-manager deployment/cert-manager
Usually it’s DNS not propagated yet, or the HTTP challenge endpoint isn’t reachable (check your ingress is actually handling port 80 for the ACME challenge).
Auto-Scaling Workers
This is the part that makes the whole architecture worth it. Workers scale based on CPU and memory. The HPA watches utilization of both and adjusts the replica count to satisfy whichever metric asks for more pods:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: n8n-worker-hpa
  namespace: n8n
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: n8n-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
The stabilizationWindowSeconds on scale-down matters. Without a cooldown, HPA will scale workers down too aggressively between workflow bursts and you’ll constantly be spinning them back up. 5 minutes is a reasonable starting point.
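To get a feel for how the targets translate into replica counts, here is a sketch of the HPA’s core calculation. hpa_desired is a hypothetical helper; the real controller also applies a tolerance band and the behavior policies shown above, which this deliberately leaves out.

```shell
#!/bin/sh
# Sketch of the HPA's core calculation:
#   desired = ceil(current_replicas * current_utilization / target_utilization)
# clamped to [min, max]. hpa_desired is a hypothetical helper; the real
# controller also applies a tolerance band and the behavior policies.
hpa_desired() {
  current="$1"; util="$2"; target="$3"; min="$4"; max="$5"
  # Integer ceiling division.
  desired=$(( (current * util + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired="$min"
  [ "$desired" -gt "$max" ] && desired="$max"
  echo "$desired"
}

hpa_desired 2 140 70 2 10   # 4  (CPU at twice the target doubles the workers)
hpa_desired 4 20 70 2 10    # 2  (idle workers shrink back to minReplicas)
hpa_desired 8 140 70 2 10   # 10 (capped at maxReplicas)
```

Utilization here is a percentage of the pod’s resource request, which is why the worker requests (250m CPU, 512Mi memory) effectively set the scaling sensitivity.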
Verify metrics-server is running—HPA doesn’t work without it:
kubectl get deployments -n kube-system | grep metrics-server
Backups
A CronJob that runs pg_dump nightly is the minimum. Don’t skip this because “Postgres is on persistent storage.” The persistent volume survives pod restarts; it does not protect you from accidentally deleting a workflow, bad migrations, or a corrupted volume.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: n8n
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: postgres-backup
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: POSTGRES_PASSWORD
              command:
                - /bin/sh
                - -c
                - |
                  # Stop on the first error so a failed dump is never compressed and kept.
                  set -e
                  BACKUP_FILE="/backup/n8n-backup-$(date +%Y%m%d-%H%M%S).sql"
                  pg_dump -h postgres-service -U n8n_user -d n8n > "$BACKUP_FILE"
                  gzip "$BACKUP_FILE"
                  # Keep one week of compressed dumps.
                  find /backup -name "*.sql.gz" -mtime +7 -delete
              volumeMounts:
                - name: backup-storage
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: backup-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-storage
  namespace: n8n
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: postgres-ssd
  resources:
    requests:
      storage: 50Gi
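A backup you have never checked is only a hope. The sketch below is a cheap integrity test for the files this job produces; verify_backup is a hypothetical helper. It relies on the fact that plain-text pg_dump output begins with a "PostgreSQL database dump" comment header. It does not replace an occasional real restore into a scratch database.

```shell
#!/bin/sh
# Sketch: sanity-check a compressed dump before trusting it. verify_backup is
# a hypothetical helper. It checks that the gzip stream is intact and that
# the file starts with the header pg_dump writes for plain-text dumps.
verify_backup() {
  file="$1"
  [ -s "$file" ] || { echo "empty or missing: $file"; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file"; return 1; }
  gzip -dc "$file" | head -n 5 | grep -q 'PostgreSQL database dump' \
    || { echo "not a pg_dump file: $file"; return 1; }
  echo "ok: $file"
}

# Usage: verify_backup /backup/n8n-backup-<timestamp>.sql.gz
```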
Things That Actually Break
n8n silently uses SQLite. Check the logs after deployment. If you see anything about SQLite, your DB env vars aren’t reaching the container. Common cause: secret encoding wrong, or the secret name in secretRef doesn’t match what you created.
kubectl exec -n n8n deployment/n8n-main -- env | grep DB_
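Both checks are easy to script so they can run right after every deploy. This is a sketch with made-up helper names (db_type_ok, uses_sqlite); each reads text on stdin, and the log check only looks for the word "sqlite" because the exact wording varies by n8n version.

```shell
#!/bin/sh
# Sketch: two small filters for post-deploy checks. Both helper names are
# hypothetical and both read their input from stdin.

# Succeeds only if the container environment sets DB_TYPE to postgresdb.
db_type_ok() {
  grep -qx 'DB_TYPE=postgresdb'
}

# Succeeds if the log text mentions SQLite anywhere (case-insensitive).
uses_sqlite() {
  grep -qi 'sqlite'
}

# Usage:
#   kubectl exec -n n8n deployment/n8n-main -- env | db_type_ok \
#     || echo "DB_TYPE is missing or not postgresdb"
#   kubectl logs -n n8n deployment/n8n-main | uses_sqlite \
#     && echo "WARNING: n8n appears to be running on SQLite"
```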
Workers running but not picking up jobs. Usually Redis connectivity. Test from inside the cluster:
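kubectl exec -n n8n deployment/redis -- redis-cli -h redis-service -p 6379 ping
If that fails, the Redis Service isn’t resolving or routing to the pod. If it returns PONG but workers stay idle, check that QUEUE_BULL_REDIS_HOST and QUEUE_BULL_REDIS_PORT in the n8n ConfigMap match the Service name and port.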
Certificate stuck in Pending. Either DNS hasn’t propagated or the Ingress isn’t handling HTTP traffic correctly (ACME needs port 80 to work). The kubernetes.io/ingress.allow-http: "false" annotation can interfere with this depending on GKE version—you may need to let HTTP through temporarily, get the cert issued, then lock it down.
HPA not scaling. Metrics-server not running, or your worker pods don’t have resource requests set. HPA calculates utilization as a percentage of the requested amount. If requests aren’t set, the calculation is undefined and HPA won’t scale.
Updating n8n
Rolling updates work well here, provided you deploy a specific version tag. Re-applying :latest changes nothing in the pod template, so Kubernetes has nothing to roll out. Set the new tag and watch the rollout:
kubectl set image deployment/n8n-main n8n-main=docker.n8n.io/n8nio/n8n:<new-version> -n n8n
kubectl set image deployment/n8n-worker n8n-worker=docker.n8n.io/n8nio/n8n:<new-version> -n n8n
kubectl rollout status deployment/n8n-main -n n8n
If something breaks, rollback is one command:
kubectl rollout undo deployment/n8n-main -n n8n
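This is one of the actual day-to-day benefits of running on Kubernetes. Worker updates that would require downtime on a VM are non-events here. The single main pod still has to restart, so the UI and webhooks can be briefly unavailable during its rollout.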
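The update and rollback commands combine naturally into one guarded step. The sketch below assumes the deployment and container names used in this article; update_n8n is a hypothetical helper and the 300-second timeout is an arbitrary starting point.

```shell
#!/bin/sh
# Sketch: roll both deployments to a pinned tag and undo both if either
# rollout does not finish in time. update_n8n is a hypothetical helper.
update_n8n() {
  tag="$1"
  image="docker.n8n.io/n8nio/n8n:${tag}"
  kubectl set image deployment/n8n-main "n8n-main=${image}" -n n8n || return 1
  kubectl set image deployment/n8n-worker "n8n-worker=${image}" -n n8n || return 1
  for deploy in n8n-main n8n-worker; do
    if ! kubectl rollout status "deployment/${deploy}" -n n8n --timeout=300s; then
      echo "rollout of ${deploy} failed, rolling back both deployments"
      # Keep main and workers on the same version.
      kubectl rollout undo deployment/n8n-main -n n8n
      kubectl rollout undo deployment/n8n-worker -n n8n
      return 1
    fi
  done
  echo "updated to ${image}"
}

# Usage: update_n8n <new-version>
```

Rolling back the image does not roll back database migrations the new version may already have applied, which is one more reason to take a pg_dump before upgrading.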
Cost
A regional cluster with auto-scaling configured this way typically runs $80–150/month at low workflow volume, depending on how often workers scale up. Running worker nodes on Spot VMs (the successor to preemptible VMs in GCP) cuts compute costs significantly. Workers hold no state of their own, so a reclaimed node affects only the executions that were in flight on it.
gcloud container node-pools create spot-workers \
  --cluster n8n-production-cluster \
  --region us-central1 \
  --machine-type e2-standard-2 \
  --spot \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 5 \
  --node-taints spot=true:NoSchedule
Add a toleration and a node selector to the worker Deployment’s pod spec (under spec.template.spec) so pods actually schedule on spot nodes:
tolerations:
  - key: spot
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-spot: "true"
Monitor actual usage before adjusting resource requests. Most initial deployments over-provision by 2x.
kubectl top pods -n n8n
kubectl top nodes
That’s the full setup. It’s more moving parts than a single VM, but each part does one thing, which makes debugging much more tractable. When something goes wrong—and it will—you can isolate which component is failing and fix it without touching everything else.
FAQs
What are the main benefits of deploying n8n on Google Kubernetes Engine (GKE) instead of Compute Engine or a standalone VM?
Deploying n8n on GKE offers superior scalability, auto-healing, and resource optimization. GKE enables horizontal scaling of n8n worker pods to handle fluctuating workloads and provides built-in features such as automated failover, managed SSL, and infrastructure-as-code deployment. This results in greater reliability, easier maintenance, and often lower operational costs compared to standalone VM-based deployments.
How do I enable HTTPS and secure my n8n deployment with SSL certificates?
You can configure HTTPS on GKE by using Kubernetes Ingress resources along with cert-manager (for Let’s Encrypt certificates) or Google-managed SSL certificates. This setup allows you to automatically provision, renew, and manage SSL certificates for your domain, ensuring all web traffic to your n8n instance is encrypted.
What is the best way to ensure high availability and disaster recovery for my n8n instance on GKE?
For high availability:
1. Deploy your GKE cluster across multiple zones (regional clusters).
2. Use PersistentVolumes with SSD-backed storage for PostgreSQL.
3. Enable Horizontal Pod Autoscaling and cluster autoscaling.
For disaster recovery:
1. Schedule regular backups of your PostgreSQL database using Kubernetes CronJobs.
2. Backup your Kubernetes manifests (secrets, ConfigMaps, deployments) so you can redeploy quickly if necessary.
How can I optimize costs when running n8n on GKE for production workloads?
To manage costs effectively:
1. Enable node and pod autoscaling, so resources only scale during peak times.
2. Use Google Cloud’s spot/preemptible VM node pools for n8n workers, since they are stateless and interruptions won’t impact task durability.
3. Regularly monitor resource requests/limits and adjust based on actual usage.
4. Use standard storage for backups and SSD storage for databases.
Can I update n8n to the latest version without downtime, and how do I handle updates on GKE?
Yes. With Kubernetes, you can perform rolling updates:
1. Update the Docker image tag in your n8n deployment manifests to the desired version.
2. Apply the changes using kubectl apply to trigger a rolling update, ensuring new pods come up before the old ones terminate.
3. Monitor the rollout to ensure no errors occur.
This method keeps worker upgrades free of downtime, limits the main process to a brief restart, and makes rollbacks easy if issues are detected.