2022-10-31

CWS call

  • Biz Svcs
    • cannot use PlatMan 0.50 because cannot reduce spot to 0 and multiple issues occurred with 1
  • Rob W
    • supposition that EIP will have same issue as CWS with PlatMan 1.0
    • recommend EIP run recycle of worker nodes to test own experience
  • Tim M
    • PDBs definitely work but node may be dropped if everything is not done in 2-3 mins
  • Felipe
    • Need improved communications from Cortex team, more nuance needed.
    • Some things are more aggressive in PlatMan and we are taking steps there
    • Similarly, some best practices being documented (in Tim M’s blog)
      • run through reason behind each recommendation
    • When do you propose to migrate? Sounds like you’re currently saying never.
  • Rob W
    • Check if we have maintenance window, don’t think so.
    • ‘Believe we have most of that in place’
  • Neil
    • ‘Believe we have most of that in place’
    • Need to test
  • James C
    • Don’t think anyone has said never.
  • Rob W
    • Will be months to be able to get any application level code change
    • Can we make Cortex be less aggressive?
  • Irfan
    • Sort AMIs first then what?
  • Rob W
    • Wants to do PDB configuration (minAvailable) + priority class + PDB test
    • Then maybe AMIs
    • Then perhaps can propose to business 30 mins downtime for AMI reconcile temporarily
    • Then Java changes
  • Irfan
    • Opportunity to actually test the AMI change (don’t think I have seen what this is or at least it’s not clarified yet - Tim’s sleep idea?)
  • Neil: How do we know if EIP has the same issue?
    • Irfan: simply schedule a recycle, with CortexOps involved
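
Rob W's first step (PDB with minAvailable + priority class) could be sketched roughly as below. This is a minimal illustration, not the agreed configuration: the names `cws-api`, `cws-api-pdb`, `cws-critical`, the namespace and the priority value are all hypothetical. It builds the two manifests as plain dicts and prints them as JSON, which kubectl accepts alongside YAML. The `blocks_all_evictions` helper relates to Tim M's point: if minAvailable equals the replica count, no pod can be evicted voluntarily, so a drain can only wait until the node is dropped anyway.

```python
import json


def pdb_manifest(name, namespace, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def priority_class_manifest(name, value, description=""):
    """Build a scheduling.k8s.io/v1 PriorityClass as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def blocks_all_evictions(replicas, min_available):
    """True when the PDB leaves no pod free to be evicted voluntarily.

    Assumes an integer minAvailable and that all replicas are healthy.
    """
    return replicas - min_available <= 0


if __name__ == "__main__":
    # All names and values below are hypothetical placeholders.
    pdb = pdb_manifest("cws-api-pdb", "cws", "cws-api", 2)
    pc = priority_class_manifest("cws-critical", 100000, "hypothetical example")
    bundle = {"apiVersion": "v1", "kind": "List", "items": [pdb, pc]}
    print(json.dumps(bundle, indent=2))
```

For the PDB test itself, checking `blocks_all_evictions(replicas, min_available)` before a recycle would flag a budget that can never be satisfied during a drain.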

2022-10-27,28 Vacation

2022-10-20

2022-10-19

  • apollo
    • now able to spring-boot:run locally under Java 17
    • no embedded LDAP so cannot login

2022-10-18

  • TIO town hall

    • metrics
      • eNPS = 43% - great!
      • motivation = 88%
      • satisfaction = 78%
    • sentiment
      • resourcing and hiring stands out as needing attention
    • reason for eNPS (text analysis)
      • culture, work-life balance, pay
    • lifecycle management (patching)
  • apollo

    • tests now passing with Java 17
    • conditionalise beans so runs

2022-10-17

Planning

  • SRE training (at least one module)
  • Write one capability (at least)
    • a) show me b) how to configure
    • there should be HOWTO for installing NewRelic module (collect own metrics)

Retro

  • IDEA: ‘envelope’ as known environment where can execute reliably. Paired with client side app (instead of runbook.sh perhaps)

Cortex demo mtg

  • Matteo & Luis: end to end testing, but bigger than that!

    • every commit to main will get deployed to alpha PlatMan

    • big impact on reconciler queue because no scaling

    • no quality gates

    • future

      • add dev as ‘nightly’ that would only impact cortex dev clusters
      • e2e tests gate from dev to alpha + another gate from alpha to beta
    • Felipe highlights that there is still quite some vagueness.

    • my concern would be repeatability since reconcile implies a certain initial state that is not ‘clean’

  • Ashish

    • runbook.sh script aimed at automating some of the diagnostics
      • current limitation on having to be already authenticated, looking to integrate
      • also note have to pipe through regex to get desired output
    • target-cluster-status-check.sh
      • a bunch of checks to see if good condition or not
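
The ‘bunch of checks’ idea plus the regex piping could look something like the sketch below. This is not the content of runbook.sh or target-cluster-status-check.sh, which I have not seen in detail. The check names and the `nodes ready: N/M` output format are invented for illustration. Each check returns (ok, detail) and the runner aggregates them into one good/not-good verdict.

```python
import re
from typing import Callable, Dict, List, Tuple

# A check takes no arguments and returns (ok, human-readable detail).
Check = Callable[[], Tuple[bool, str]]


def nodes_ready(raw: str) -> Tuple[bool, str]:
    """Parse a hypothetical 'nodes ready: N/M' line from raw command output."""
    match = re.search(r"nodes ready:\s*(\d+)/(\d+)", raw)
    if match is None:
        return False, "could not parse output"
    ready, total = int(match.group(1)), int(match.group(2))
    return ready == total, f"{ready}/{total} nodes ready"


def run_checks(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (overall_ok, report lines)."""
    healthy = True
    report = []
    for name, check in checks.items():
        try:
            ok, detail = check()
        except Exception as exc:
            # A crashing check counts as a failure rather than stopping the run.
            ok, detail = False, f"check raised {exc!r}"
        healthy = healthy and ok
        report.append(f"{'PASS' if ok else 'FAIL'} {name}: {detail}")
    return healthy, report


if __name__ == "__main__":
    sample_output = "cluster alpha\nnodes ready: 5/6\n"  # made-up output
    ok, lines = run_checks({"nodes": lambda: nodes_ready(sample_output)})
    print("\n".join(lines))
    print("good condition" if ok else "needs attention")
```

Keeping the parsing inside each check would remove the need to pipe the script output through a separate regex.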

2022-10-13

2022-10-12

Product management session from IP

  • Parent deck to mix and match but limit different messages
  • Rebase: Cortex is just the Infrastructure platform. SRE, dev portal and CI/CD are separated
  • Infrastructure Platform = Cloud infrastructure (AWS) + PlatMan
  • Why? Dedicated team to manage infra will reduce effort (and cost) for product teams
    • eg could plug in something like container scanning to the platform and product teams get it for free
    • for example investment made but still not able to quickly address something like log4j vulnerability
    • TPR infra questions automatically accepted if ‘on cortex’
  • Buy vs build:
    • ELS has unique problems, at least we’re following one set of best practices instead of one per team
    • ‘just’ EKS
  • Team benefits
    • Should be easy to build POCs with Cortex

SSDR incident

  • Sketchy documentation (cow-path analogy)
  • Black box behaviour with no explanation
  • No validation of input
  • Meaningless PR process: separation of responsibilities but no understanding
  • Oodles of tacit knowledge, only minimally aware that is even a problem
  • ‘product’ without product mindset
  • Building in-house preferred, resulting in ever more burden to maintain. No ‘eroding platform’

2022-10-11

Migrate to PlatMan 1.0 from 0.50

https://github.com/elsevier-centraltechnology/cortex-operations/blob/f2aed6b7650fea4edb665b325736e5bfd7dfdd21/admin-manual/migrate-to-cortex-platform-manager.md

  • connect to aws-tio-platform-prod

  • if don’t have admin access (that is, when connecting to a partner cluster) need to assume a role

after merge check NR log_dispatcher first, then log_platform_manager

  • in AWS cluster will see no managed node groups initially then both ASGs and MNGs

  • then will migrate workloads (manually) and finally need to remove ASGs

  • cortex action runner configuration resulted in helm chart failure

    • secret controller-manager already exists
    • trigger reconcile manually
      • currently requires manual purge of queue
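
The ASG/MNG sequence noted above (no managed node groups at first, then both, then ASGs removed after workloads move) can be turned into a small progress check. This is a sketch under assumptions: the phase names are my own, and how the two lists of names are obtained (console, CLI, API) is left out on purpose.

```python
def migration_phase(asg_names, mng_names):
    """Classify migration progress from the node groups seen in the cluster.

    asg_names: names of self-managed Auto Scaling Groups still present
    mng_names: names of EKS managed node groups present
    Phase labels are hypothetical, not taken from the migration guide.
    """
    has_asg = bool(asg_names)
    has_mng = bool(mng_names)
    if has_asg and not has_mng:
        return "not-started"   # only the old ASGs exist
    if has_asg and has_mng:
        return "in-progress"   # both exist; workloads still to move, ASGs to remove
    if has_mng:
        return "complete"      # ASGs removed, only MNGs remain
    return "unknown"           # no node groups seen at all


if __name__ == "__main__":
    # Example names are placeholders.
    print(migration_phase(["workers-asg-a"], []))
    print(migration_phase(["workers-asg-a"], ["workers-mng-a"]))
    print(migration_phase([], ["workers-mng-a"]))
```

Note that "in-progress" cannot distinguish between workloads still running on the ASGs and workloads already moved but ASGs not yet removed; that would need a look at where pods are scheduled.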

2022-10-10