• Script is called node recycler

    • 1 million workflows in progress
  • Previously EKS

    • Bring up new nodes; take the old nodes down only once the new ones are up
    • Every 2-4 months
    • kube-cycle
      • had awareness of node availability
  • Now

    • Still zone-specific, or one node at a time (see the drain sketch after this list)
    • 3-4 minutes (as before)
    • Regular deployments into production are no particular issue (i.e. 3-4 minutes)
      • Helm deploy from the Jenkins pipeline => runs serially
      • Bumping replica counts from 3 to 6 actually made it worse
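
A minimal sketch of the one-node-at-a-time recycle step, using the Python kubernetes client; the function name, node handling, and retry behaviour are assumptions, not the actual node-recycler code. The point is that evicting through the Eviction API (rather than letting the ASG kill the instance) is what gives kube-cycle-style awareness, because the API server refuses evictions that would breach a pod disruption budget.

```python
# Sketch only: cordon one node, then evict its pods via the Eviction API.
# Unlike an ASG instance termination, evictions respect PodDisruptionBudgets.
# Assumes a recent `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def recycle_node(node_name: str) -> None:
    # Cordon: mark the node unschedulable so no new pods land on it.
    core.patch_node(node_name, {"spec": {"unschedulable": True}})

    # List the pods currently running on this node.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}").items

    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(ref.kind == "DaemonSet" for ref in owners):
            continue  # DaemonSet pods stay with the node
        # The API server rejects an eviction that would violate a PDB;
        # a real recycler would back off and retry instead of failing.
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=client.V1Eviction(
                metadata=client.V1ObjectMeta(
                    name=pod.metadata.name,
                    namespace=pod.metadata.namespace)))
```

Only one node would be cycled at a time, and the next node not touched until the evicted pods are running elsewhere, which is roughly the "awareness of node availability" that kube-cycle provided.
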
  • 40 nodes in prod

  • ASG does one AZ at a time

  • Issue

    • During a full recycle
    • Taking one node out suddenly leaves ~20 pods to reschedule
  • When we see a spot (instance) failure

    • Services recover within a 'reasonable' amount of time
  • spiral

    • ASG has no notion of workload
    • managed nodes
    • Pod disruption budgets: should never leave us with less than 50% of the pods (see the PDB sketch below)
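
A minimal sketch of the "never less than 50%" budget via the Python kubernetes client; the object name, namespace, and app label are assumptions.

```python
# Sketch only: a PodDisruptionBudget that refuses voluntary evictions
# (e.g. node drains) that would drop availability below 50% of the pods.
from kubernetes import client, config

config.load_kube_config()
policy = client.PolicyV1Api()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="apollo-camunda-orchestrator-pdb"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available="50%",  # never take us below half the pods
        selector=client.V1LabelSelector(
            match_labels={"app": "apollo-camunda-orchestrator"}),  # assumed label
    ),
)
policy.create_namespaced_pod_disruption_budget(namespace="prod", body=pdb)
```

Worth noting: a PDB only guards voluntary evictions; a spot reclaim or a raw ASG instance termination bypasses it unless the node is drained first, which is exactly the "ASG has no notion of workload" problem above.
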
  • liveness / readiness probes

    • (A) Wait longer before a pod counts as up (see the probe sketch below)
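
A minimal sketch of action (A): give the readiness probe a longer initial delay, and add a startup probe so a slow-starting JVM is not killed by liveness checks. The health path, port, image, and timing values are assumptions, not the chart's real settings.

```python
# Sketch only: probe objects built with the Python client types;
# numbers are placeholders, not measured values.
from kubernetes import client

readiness = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080),
    initial_delay_seconds=60,  # (A) wait longer before the first check
    period_seconds=10,
    failure_threshold=3,
)

startup = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080),
    period_seconds=10,
    failure_threshold=30,  # allow up to ~5 minutes of JVM start-up
)

container = client.V1Container(
    name="orchestrator",
    image="apollo-camunda-orchestrator:latest",  # placeholder image
    readiness_probe=readiness,
    startup_probe=startup,
)
```
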
  • Set log level to info in prod

    • turn off reconciler
    • On start (e.g. add an extra pod)
  • CPU and RAM requests

    • Lower limit (requests) only? Maybe the upper limit is too low? Think not, but it may be forcing ASG upscaling (see the resources sketch below)
    • (A) Check ASG logs; require more 'high' readings before triggering scale-up
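
A minimal sketch for the requests/limits question; the CPU and memory numbers are placeholders. Requests are what the scheduler (and, via pending pods, the ASG scale-up path) reacts to, and the memory limit needs headroom above the 3 GB max heap noted further down.

```python
# Sketch only: resource requests and limits for the orchestrator container;
# values are illustrative, not the service's real settings.
from kubernetes import client

resources = client.V1ResourceRequirements(
    # Requests drive scheduling and, through pending pods, ASG scale-up.
    requests={"cpu": "500m", "memory": "2.5Gi"},
    # Limits cap usage; a memory limit below JVM max heap + overhead risks OOM kills.
    limits={"cpu": "2", "memory": "3.5Gi"},
)
```
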
  • Is the horizontal pod autoscaler on? (No)

  • NewRelic

    • JVM GCs?
    • aggressive GC
  • David ???

    • JVM heap: min 2 GB, max 3 GB (see the sketch below)
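
The stated heap bounds expressed as container environment, as a sketch only; the JAVA_OPTS variable name and the G1 flag (relevant to the "aggressive GC" question above) are assumptions, not confirmed settings.

```python
# Sketch only: min 2 GB / max 3 GB heap as an env var on the container.
from kubernetes import client

java_opts = client.V1EnvVar(
    name="JAVA_OPTS",
    value="-Xms2g -Xmx3g -XX:+UseG1GC",  # GC choice is illustrative
)

container = client.V1Container(
    name="orchestrator",
    image="apollo-camunda-orchestrator:latest",  # placeholder image
    env=[java_opts],
)
```
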
  • cct-camunda-orchestrator

    • books
  • apollo-camunda-orchestrator (99%)

    • logbooks
  • Also, language services and others can sometimes get into the spiral