October logbook

2022-10-31

CWS call

Biz Svcs
- cannot use PlatMan 0.50 because cannot reduce spot to 0 and multiple issues occurred with 1
Rob W
- supposition that EIP will have same issue as CWS with PlatMan 1.0
- recommend EIP run recycle of worker nodes to test own experience
Tim M
- PDBs definitely work but node may be dropped if everything is not done in 2-3 mins
Felipe
- Need improved communications from Cortex team, more nuance needed.
- Some things are more aggressive in PlatMan and we are taking steps there
- Similarly, some best practices being documented (in Tim M’s blog)
  - run through reason behind each recommendation
- When do you propose to migrate? Sounds like you’re currently saying never.
Rob W
- Check if we have maintenance window, don’t think so.
- ‘Believe we have most of that in place’
Neil
- ‘Believe we have most of that in place’
- Need to test
James C
- Don’t think anyone has said never.
Rob W
- Will be months to be able to get any application level code change
- Can we make Cortex be less aggressive?
Irfan
- Sort AMIs first then what?
Rob W
- Wants to do PDB configuration (minAvailable) + priority class + PDB test
- Then maybe AMIs
- Then perhaps can propose to business 30 mins downtime for AMI reconcile temporarily
- Then Java changes
Irfan
- Opportunity to actually test the AMI change (don’t think I have seen what this is or at least it’s not clarified yet - Tim’s sleep idea?)
Neil: How do we know if EIP has the same issue?
- Irfan, simply schedule a recycle, with CortexOps involved
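
Sketch of the PDB + priority class step Rob W wants to do first. This is a minimal illustration only: it builds the two manifests as Python dicts and prints them as JSON (kubectl accepts JSON as well as YAML). The names `cws-engine`, `cws-engine-pdb`, `cws-critical`, the `app` label and all numbers are hypothetical placeholders, not the real CWS configuration. `allowed_disruptions` assumes an integer minAvailable (not a percentage).

```python
import json


def pdb_manifest(name, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Voluntary evictions (e.g. node drain) are refused if they
            # would leave fewer than this many healthy pods.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def priority_class_manifest(name, value):
    """Build a scheduling.k8s.io/v1 PriorityClass as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,  # higher value = scheduled ahead of lower values
        "globalDefault": False,
        "description": "Hypothetical priority class for the PDB test",
    }


def allowed_disruptions(healthy_pods, min_available):
    """How many pods may be evicted right now under an integer minAvailable."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(json.dumps(pdb_manifest("cws-engine-pdb", "cws-engine", 2), indent=2))
    print(json.dumps(priority_class_manifest("cws-critical", 100000), indent=2))
    print(allowed_disruptions(healthy_pods=3, min_available=2))
```

Point for the PDB test: with 2 healthy replicas and minAvailable 2, allowed disruptions is 0, so a drain has to wait. That is where Tim M's 2-3 mins remark matters.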

2022-10-27,28 Vacation

2022-10-20

1-2-1
- AWS cert
- work experience
- OpsGenie (no SSO)
- CWS
  - No Spring Boot 3 option (Camunda not ready);
  - No Spring Native (though 2 years out of date) https://javahippie.net/java/graal-vm/native-image/camunda/2020/05/31/camundanative.html
  - OpenJ9
    - ‘Liberty InstantOn’: https://www.openliberty.io/blog/2022/09/29/instant-on-beta.html
    - Getting started: https://blog.openj9.org/2022/09/26/getting-started-with-openj9-criu-support/
    - No official builds: https://www.eclipse.org/openj9/docs/builds/
SonarQube
- has to run as server, postpone for now

2022-10-19

apollo
- now able to spring-boot:run locally under Java 17
- no embedded LDAP so cannot login

2022-10-18

TIO town hall
- metrics
  - eNPS = 43% - great!
  - motivation = 88%
  - satisfaction = 78%
- sentiment
  - resourcing and hiring stands out as needing attention
- reason for eNPS (text analysis)
  - culture, work-life balance, pay
- lifecycle management (patching)
  - 48% servers have end of life software
  - Tanium reports on obsolete and end of life stuff
    - https://insights.bi.tio.systems/#/site/Elsevier/workbooks/11018/views
    - problem snuck back in by ???? mechanism
apollo
- tests now passing with Java 17
- conditionalise beans so runs

2022-10-17

Planning

SRE training (at least one module)
Write one capability (at least)
- a) show me b) how to configure
- there should be HOWTO for installing NewRelic module (collect own metrics)

Retro

IDEA: ‘envelope’ as known environment where can execute reliably. Paired with client side app (instead of runbook.sh perhaps)

Cortex demo mtg

Matteo & Luis: end to end testing, but bigger than that!
- every commit to main will get deployed to alpha PlatMan
- big impact on reconciler queue because no scaling
- no quality gates
- future
  - add dev as ‘nightly’ that would only impact cortex dev clusters
  - e2e tests gate from dev to alpha + another gate from alpha to beta
- Felipe highlights that there is still quite some vagueness.
- my concern would be repeatability since reconcile implies a certain initial state that is not ‘clean’
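
Sketch of the ‘future’ promotion flow above (dev to alpha to beta, each step behind a gate). Everything here is my own illustration, not how Cortex implements it: the stage list, the `promote` function and the gate callables are hypothetical. A missing gate is treated as blocking (fail closed), which is an assumption on my part.

```python
STAGES = ["dev", "alpha", "beta"]


def promote(commit, gates):
    """Return the stages a commit reaches.

    gates maps (from_stage, to_stage) to a callable taking the commit and
    returning True if the gate passes (e.g. an e2e test run).
    """
    reached = [STAGES[0]]  # every commit to main lands in dev
    for src, dst in zip(STAGES, STAGES[1:]):
        gate = gates.get((src, dst))
        # Stop at the first missing or failing gate.
        if gate is None or not gate(commit):
            break
        reached.append(dst)
    return reached


if __name__ == "__main__":
    gates = {
        ("dev", "alpha"): lambda commit: True,   # stand-in for e2e tests
        ("alpha", "beta"): lambda commit: False,  # stand-in for second gate
    }
    print(promote("abc123", gates))
```

With the first gate passing and the second failing, the commit stops at alpha.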
Ashish
- runbook.sh script aimed at automating some of the diagnostics
  - current limitation on having to be already authenticated, looking to integrate
  - also note have to pipe through regex to get desired output
- target-cluster-status-check.sh
  - a bunch of checks to see if good condition or not

2022-10-13

DSL for pipelines: https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.helpers.scm.GitContext.branches (maybe not the right one?)
Accelerate (DORA) report: https://services.google.com/fh/files/misc/2022_state_of_devops_report.pdf
- The Four Key Metrics of software delivery performance: deployment frequency, lead time for changes, change failure rate, and time to restore service.
- In short, loose coupling of software services impacts more than just technical impact. It also affects the socio-technical aspects of software development. Coupling is at the root of Conway’s Law—the idea that an organization’s design systems mirror their own communication structure. More loosely-coupled systems mean more loosely-coupled organizations with a more distributed, scalable, approach to development
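
Sketch of how the four key metrics could be computed from a list of deployment records. Assumptions of mine, not from the report: the `Deployment` record shape, using the median for lead time and restore time, and measuring time to restore from the deploy time of the failed change (a real setup would more likely measure from incident start).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did this change cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if failed


def four_key_metrics(deployments, period_days):
    """Compute the four key metrics over a period of period_days days."""
    n = len(deployments)
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": n / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / n if n else None,
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


if __name__ == "__main__":
    t0 = datetime(2022, 10, 1, 9, 0)  # made-up sample data
    sample = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4),
                   failed=True, restored_at=t0 + timedelta(days=1, hours=5)),
        Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    ]
    for name, value in four_key_metrics(sample, period_days=3).items():
        print(name, value)
```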

2022-10-12

Product management session from IP

Parent deck to mix and match but limit different messages
Rebase: Cortex is just the Infrastructure platform. SRE, dev portal and CI/CD are separated
Infrastructure Platform = Cloud infrastructure (AWS) + PlatMan
Why? Dedicated team to manage infra will reduce effort (and cost) for product teams
- eg could plug in something like container scanning to the platform and product teams get it for free
- for example investment made but still not able to quickly address something like log4j vulnerability
- TPR infra questions automatically accepted if ‘on cortex’
Buy vs build:
- ELS has unique problems, at least we’re following one set of best practices instead of one per team
- ‘just’ EKS
Team benefits
- Should be easy to build POCs with Cortex

SSDR incident

Sketchy documentation (cow-path analogy)
Black box behaviour with no explanation
No validation of input
Meaningless PR process: separation of responsibilities but no understanding
Oodles of tacit knowledge, only minimally aware that is even a problem
‘product’ without product mindset
build in-house preferred, resulting in ever more burden to maintain. No ‘eroding platform’

2022-10-11

Migrate to PlatMan 1.0 from 0.50

https://github.com/elsevier-centraltechnology/cortex-operations/blob/f2aed6b7650fea4edb665b325736e5bfd7dfdd21/admin-manual/migrate-to-cortex-platform-manager.md

connect to aws-tio-platform-prod
if don’t have admin access (that is connecting to a partner cluster) need to assume a role

after merge check NR log_dispatcher first, then log_platform_manager

in AWS cluster will see no managed node groups initially then both ASGs and MNGs
then will migrate workloads (manually) and finally need to remove ASGs
cortex action runner configuration resulted in helm chart failure
- secret controller-manager already exists
- trigger reconcile manually
  - currently requires manual purge of queue

2022-10-10

AWS sessions 2022