• Script is called node recycler

    • 1 million workflows in progress
  • Previously EKS

    • Bring up new nodes; take the old nodes down only once the new ones are up
    • Every 2-4 months
    • kube-cycle
      • had awareness of node availability
  • Now

    • Still zone-specific, or one node at a time (see the drain sketch after this list)
    • 3-4 minutes (as before)
    • Regular deployments into production are no particular issue (i.e. 3-4 minutes)
      • Helm deploy from the Jenkins pipeline => runs serially
      • Bumping replica counts from 3 to 6 actually made it worse
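
A minimal sketch of the one-node-at-a-time recycle step, using the Python kubernetes client; the function name, node handling, and retry behaviour are assumptions, not the actual node-recycler code. The point is that evicting through the Eviction API (rather than letting the ASG kill the instance) is what gives kube-cycle-style awareness, because the API server refuses evictions that would breach a pod disruption budget.

```python
# Sketch only: cordon one node, then evict its pods via the Eviction API.
# Unlike an ASG instance termination, evictions respect PodDisruptionBudgets.
# Assumes a recent `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def recycle_node(node_name: str) -> None:
    # Cordon: mark the node unschedulable so no new pods land on it.
    core.patch_node(node_name, {"spec": {"unschedulable": True}})

    # List the pods currently running on this node.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}").items

    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(ref.kind == "DaemonSet" for ref in owners):
            continue  # DaemonSet pods stay with the node
        # The API server rejects an eviction that would violate a PDB;
        # a real recycler would back off and retry instead of failing.
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=client.V1Eviction(
                metadata=client.V1ObjectMeta(
                    name=pod.metadata.name,
                    namespace=pod.metadata.namespace)))
```

Only one node would be cycled at a time, and the next node not touched until the evicted pods are running elsewhere, which is roughly the "awareness of node availability" that kube-cycle provided.
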
  • 40 nodes in prod

  • ASG does one AZ at a time

  • Issue

    • During a full recycle
    • Taking one node out suddenly leaves ~20 pods to reschedule
  • When we see a spot (instance) failure

    • Services recover within a 'reasonable' amount of time
  • spiral

    • ASG has no notion of workload
    • managed nodes
    • Pod disruption budgets: should never leave us with less than 50% of the pods (see the PDB sketch below)
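
A minimal sketch of the "never less than 50%" budget via the Python kubernetes client; the object name, namespace, and app label are assumptions.

```python
# Sketch only: a PodDisruptionBudget that refuses voluntary evictions
# (e.g. node drains) that would drop availability below 50% of the pods.
from kubernetes import client, config

config.load_kube_config()
policy = client.PolicyV1Api()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="apollo-camunda-orchestrator-pdb"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available="50%",  # never take us below half the pods
        selector=client.V1LabelSelector(
            match_labels={"app": "apollo-camunda-orchestrator"}),  # assumed label
    ),
)
policy.create_namespaced_pod_disruption_budget(namespace="prod", body=pdb)
```

Worth noting: a PDB only guards voluntary evictions; a spot reclaim or a raw ASG instance termination bypasses it unless the node is drained first, which is exactly the "ASG has no notion of workload" problem above.
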
  • liveness / readiness probes

    • (A) Wait longer before a pod counts as up (see the probe sketch below)
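
A minimal sketch of action (A): give the readiness probe a longer initial delay, and add a startup probe so a slow-starting JVM is not killed by liveness checks. The health path, port, image, and timing values are assumptions, not the chart's real settings.

```python
# Sketch only: probe objects built with the Python client types;
# numbers are placeholders, not measured values.
from kubernetes import client

readiness = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080),
    initial_delay_seconds=60,  # (A) wait longer before the first check
    period_seconds=10,
    failure_threshold=3,
)

startup = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080),
    period_seconds=10,
    failure_threshold=30,  # allow up to ~5 minutes of JVM start-up
)

container = client.V1Container(
    name="orchestrator",
    image="apollo-camunda-orchestrator:latest",  # placeholder image
    readiness_probe=readiness,
    startup_probe=startup,
)
```
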
  • Set log level to info in prod

    • turn off reconciler
    • On start (e.g. add an extra pod)
  • CPU and RAM requests

    • Lower limit (requests) only? Maybe the upper limit is too low? Think not, but it may be forcing ASG upscaling (see the resources sketch below)
    • (A) Check ASG logs; require more 'high' readings before triggering scale-up
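
A minimal sketch for the requests/limits question; the CPU and memory numbers are placeholders. Requests are what the scheduler (and, via pending pods, the ASG scale-up path) reacts to, and the memory limit needs headroom above the 3 GB max heap noted further down.

```python
# Sketch only: resource requests and limits for the orchestrator container;
# values are illustrative, not the service's real settings.
from kubernetes import client

resources = client.V1ResourceRequirements(
    # Requests drive scheduling and, through pending pods, ASG scale-up.
    requests={"cpu": "500m", "memory": "2.5Gi"},
    # Limits cap usage; a memory limit below JVM max heap + overhead risks OOM kills.
    limits={"cpu": "2", "memory": "3.5Gi"},
)
```
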
  • Is the horizontal pod autoscaler on? (No)

  • NewRelic

    • JVM GCs?
    • aggressive GC
  • David ???

    • JVM heap: min 2 GB, max 3 GB (see the sketch below)
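
The stated heap bounds expressed as container environment, as a sketch only; the JAVA_OPTS variable name and the G1 flag (relevant to the "aggressive GC" question above) are assumptions, not confirmed settings.

```python
# Sketch only: min 2 GB / max 3 GB heap as an env var on the container.
from kubernetes import client

java_opts = client.V1EnvVar(
    name="JAVA_OPTS",
    value="-Xms2g -Xmx3g -XX:+UseG1GC",  # GC choice is illustrative
)

container = client.V1Container(
    name="orchestrator",
    image="apollo-camunda-orchestrator:latest",  # placeholder image
    env=[java_opts],
)
```
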
  • cct-camunda-orchestrator

    • books
  • apollo-camunda-orchestrator (99%)

    • logbooks
  • Also, language services and others can sometimes get into the spiral