2022-10-31
CWS call
- Biz Svcs
- cannot use PlatMan 0.50 because cannot reduce spot to 0 and multiple issues occurred with 1
- Rob W
- supposition that EIP will have same issue as CWS with PlatMan 1.0
- recommend EIP run recycle of worker nodes to test own experience
- Tim M
- PDBs definitely work but node may be dropped if everything is not done in 2-3 mins
- Felipe
- Need improved communications from Cortex team, more nuance needed.
- Some things are more aggressive in PlatMan and we are taking steps there
- Similarly, some best practices being documented (in Tim M’s blog)
- run through reason behind each recommendation
- When do you propose to migrate? Sounds like you’re currently saying never.
- Rob W
- Check if we have maintenance window, don’t think so.
- ‘Believe we have most of that in place’
- Neil
- ‘Believe we have most of that in place’
- Need to test
- James C
- Don’t think anyone has said never.
- Rob W
- Will be months to be able to get any application level code change
- Can we make Cortex be less aggressive?
- Irfan
- Sort AMIs first then what?
- Rob W
- Wants to do PDB configuration (minAvailable) + priority class + PDB test
- Then maybe AMIs
- Then perhaps can propose to business 30 mins downtime for AMI reconcile temporarily
- Then Java changes
- Irfan
- Opportunity to actually test the AMI change (don’t think I have seen what this is or at least it’s not clarified yet - Tim’s sleep idea?)
- Neil: How do we know if EIP has the same issue?
- Irfan: simply schedule a recycle, with CortexOps involved
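Sketch of the PDB (minAvailable) + priority class step Rob W wants to try first. This is my own illustration, not the actual CWS config: the names cws-pdb, cws-worker and cws-critical and the numbers are made up. It builds the two manifests as plain dicts and prints them as one JSON List (kubectl accepts JSON as well as YAML).

```python
import json


def build_pdb(name, app_label, min_available):
    """PodDisruptionBudget (policy/v1): keep at least `min_available` matching
    pods running during voluntary disruptions such as a node drain."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def build_priority_class(name, value):
    """PriorityClass (scheduling.k8s.io/v1): higher value is scheduled first."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Hypothetical priority class for the PDB test",
    }


if __name__ == "__main__":
    # Hypothetical values: minAvailable=2 assumes the deployment runs at least
    # 3 replicas, otherwise a drain can never evict a pod.
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            build_pdb("cws-pdb", "cws-worker", 2),
            build_priority_class("cws-critical", 100000),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

The PDB only limits how many pods may be evicted at once; whether the node still gets dropped after the 2-3 mins Tim M mentions is what the recycle test would have to show.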
2022-10-27,28 Vacation
2022-10-20
1-2-1
- AWS cert
- work experience
- OpsGenie (no SSO)
- CWS
- No Spring Boot 3 option (Camunda not ready);
- No Spring Native (though 2 years out of date) https://javahippie.net/java/graal-vm/native-image/camunda/2020/05/31/camundanative.html
- OpenJ9
- ‘Liberty InstantOn’: https://www.openliberty.io/blog/2022/09/29/instant-on-beta.html
- Getting started: https://blog.openj9.org/2022/09/26/getting-started-with-openj9-criu-support/
- No official builds: https://www.eclipse.org/openj9/docs/builds/
SonarQube
- has to run as server, postpone for now
2022-10-19
- apollo
- now able to spring-boot:run locally under Java 17
- no embedded LDAP so cannot login
2022-10-18
TIO town hall
- metrics
- eNPS = 43% - great!
- motivation = 88%
- satisfaction = 78%
- sentiment
- resourcing and hiring stands out as needing attention
- reason for eNPS (text analysis)
- culture, work-life balance, pay
- lifecycle management (patching)
- 48% servers have end of life software
- Tanium reports on obsolete and end of life stuff
- https://insights.bi.tio.systems/#/site/Elsevier/workbooks/11018/views
- problem snuck back in by ???? mechanism
apollo
- tests now passing with Java 17
- conditionalise beans so runs
2022-10-17
Planning
- SRE training (at least one module)
- Write one capability (at least)
- a) show me b) how to configure
- there should be HOWTO for installing NewRelic module (collect own metrics)
Retro
- IDEA: ‘envelope’ as known environment where can execute reliably. Paired with client side app (instead of runbook.sh perhaps)
Cortex demo mtg
Matteo & Luis: end to end testing, but bigger than that!
every commit to main will get deployed to alpha PlatMan
big impact on reconciler queue because no scaling
no quality gates
future
- add dev as ‘nightly’ that would only impact cortex dev clusters
- e2e tests gate from dev to alpha + another gate from alpha to beta
Felipe highlights that there is still quite some vagueness.
my concern would be repeatability since reconcile implies a certain initial state that is not ‘clean’
Ashish
runbook.sh script aimed at automating some of the diagnostics
- current limitation on having to be already authenticated, looking to integrate
- also note have to pipe through regex to get desired output
target-cluster-status-check.sh
- a bunch of checks to see if good condition or not
2022-10-13
- DSL for pipelines: https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.helpers.scm.GitContext.branches (maybe not the right one?)
- Accelerate (DORA) report: https://services.google.com/fh/files/misc/2022_state_of_devops_report.pdf
- The Four Key Metrics of software delivery performance: deployment frequency, lead time for changes, change failure rate, and time to restore service.
- In short, loose coupling of software services impacts more than just technical impact. It also affects the socio-technical aspects of software development. Coupling is at the root of Conway’s Law—the idea that an organization’s design systems mirror their own communication structure. More loosely-coupled systems mean more loosely-coupled organizations with a more distributed, scalable, approach to development
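Rough sketch of how the four key metrics could be computed from a list of deployment records. Everything here is my assumption, not something from the report: the Deployment record shape, using medians, and measuring time to restore from the failed deploy to the restore timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta):
    """Convert a timedelta to hours as a float."""
    return delta.total_seconds() / 3600


def four_key_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four key metrics over a reporting period of `period_days`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restore_times) if restore_times else None,
    }


if __name__ == "__main__":
    # Made-up sample data for one week.
    sample = [
        Deployment(datetime(2022, 10, 3, 9), datetime(2022, 10, 3, 11)),
        Deployment(datetime(2022, 10, 4, 9), datetime(2022, 10, 4, 15),
                   failed=True, restored_at=datetime(2022, 10, 4, 16)),
        Deployment(datetime(2022, 10, 5, 10), datetime(2022, 10, 6, 10)),
        Deployment(datetime(2022, 10, 7, 9), datetime(2022, 10, 7, 13)),
    ]
    for name, value in four_key_metrics(sample, period_days=7).items():
        print(name, value)
```

In practice the records would have to come from the pipeline and incident tooling; the sample data here is invented.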
2022-10-12
Product management session from IP
- Parent deck to mix and match but limit different messages
- Rebase: Cortex is just the Infrastructure platform. SRE, dev portal and CI/CD are separated
- Infrastructure Platform = Cloud infrastructure (AWS) + PlatMan
- Why? Dedicated team to manage infra will reduce effort (and cost) for product teams
- eg could plug in something like container scanning to the platform and product teams get it for free
- for example investment made but still not able to quickly address something like log4j vulnerability
- TPR infra questions automatically accepted if ‘on cortex’
- Buy vs build:
- ELS has unique problems, at least we’re following one set of best practices instead of one per team
- ‘just’ EKS
- Team benefits
- Should be easy to build POCs with Cortex
SSDR incident
- Sketchy documentation (cow-path analogy)
- Black box behaviour with no explanation
- No validation of input
- Meaningless PR process: separation of responsibilities but no understanding
- Oodles of tacit knowledge, only minimally aware that is even a problem
- ‘product’ without product mindset
- build in-house preferred resulting in ever more burden to maintain. No ‘eroding platform’
2022-10-11
Migrate to PlatMan 1.0 from 0.50
connect to aws-tio-platform-prod
if don’t have admin access (that is, connecting to a partner cluster) need to assume a role
after merge check NR log_dispatcher first, then log_platform_manger
in AWS cluster will see no managed node groups initially then both ASGs and MNGs
then will migrate workloads (manually) and finally need to remove ASGs
cortex action runner configuration resulted in helm chart failure
- secret controller-manager already exists
- trigger reconcile manually
- currently requires manual purge of queue