2023-06-30
- dev duty investigating drifts that turned out to be broken bootstrap
- further sbom drift cleanup
- began investigation of calico beta deployment only to find 3.26.1 chart is missing
2023-06-29
- delete eu-west-1 again - success, need to confirm CIDR block released (manual)
- cortex engineering form - delete a cluster
- (A) need to update procedure
- platform defs has dependency graph, largely for helm ordering
- deletion simply reverses this
- because terraform outputs are used to pass data to downstream reconcilers, must do platform update first before delete (that’s why cluster has to be in ‘good’ state before starting)
- same risks of being blocked by PDB eviction as when updating
- does not actively remove Helm, just trusts that it goes when the nodes go. Skipper cloud foundation / load balancer is the exception to this.
- ephemeral clusters will get new CIDR range; cannot reuse because of CIDR allocation but also because Cortex assumes it creates the VPC
- argo will simply complain unable to connect to cluster (will have to enhance)
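The ordering idea in the 2023-06-29 notes (dependency graph for install, reversed for delete) can be sketched as below. This is a minimal illustration only: the component names and the `DEPENDS_ON` map are hypothetical, not the real platform defs.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: component -> components it depends on.
# Names are illustrative only; the real platform defs are not shown in these notes.
DEPENDS_ON = {
    "vpc": set(),
    "cluster": {"vpc"},
    "calico": {"cluster"},
    "ingress": {"cluster", "calico"},
}


def install_order(depends_on):
    """Return components with dependencies first (the helm-style install order)."""
    return list(TopologicalSorter(depends_on).static_order())


def delete_order(depends_on):
    """Deletion simply reverses the install order, so dependents go first."""
    return list(reversed(install_order(depends_on)))


if __name__ == "__main__":
    print("install:", install_order(DEPENDS_ON))
    print("delete: ", delete_order(DEPENDS_ON))
```

Keeping a single graph and deriving both orders from it means the delete path cannot drift from the install path, at the cost of inheriting any gaps in the graph.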
2023-06-28
- cluster role & role binding
- resolve reconciler failure on calico-apiserver in ap-southeast-1-test
2023-06-27
- send alpha drift
- cluster role & role binding
2023-06-26
- delete eu-west-1
- drift to slack inc. tests
- NR training, getting ready for AIOps
- resync alpha platform.yaml to apiserver enabled: true (although all alphas have apiserver with it False)
- attempt beta sync of calico to 3.21.1 and apiserver enabled
2023-06-21
- drift detection fixes (goal of running each morning and writing to slack)
- write up roadmap item for post-Skipper in light of go-oidc response
- purl type for helm
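A rough sketch of the morning drift report, assuming drift means "desired version differs from deployed version" and that the report goes to a Slack incoming webhook (which accepts a JSON body with a `text` field). The function names, cluster name and versions are hypothetical; this is not how crtxctl does it.

```python
import json


def find_drift(desired, deployed):
    """Return {component: (desired_version, deployed_version)} for every mismatch.

    A component present on only one side is reported with None on the other.
    """
    drift = {}
    for name in sorted(set(desired) | set(deployed)):
        want, have = desired.get(name), deployed.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


def slack_payload(cluster, drift):
    """Build the JSON body for a Slack incoming webhook ({"text": ...})."""
    if not drift:
        text = f"{cluster}: no drift detected"
    else:
        lines = [f"{cluster}: drift in {len(drift)} component(s)"]
        lines += [f"- {n}: desired {w}, deployed {h}" for n, (w, h) in drift.items()]
        text = "\n".join(lines)
    return json.dumps({"text": text})


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    desired = {"calico": "3.26.1", "example-addon": "1.0.0"}
    deployed = {"calico": "3.25.0", "example-addon": "1.0.0"}
    print(slack_payload("example-cluster", find_drift(desired, deployed)))
```

Building the payload separately from sending it keeps the logic unit-testable without network access.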
2023-06-20
- calico
- upgrade to 3.26.1
- start writing unit tests for crtxctl to catch issues with sbom / drift
2023-06-19
- calico
- changed understanding of replace behaviour: only create-delete if --force
- validate instructions on calico site with minor tweaks
2023-06-14
- calico
- check if CWS using calico
- https://elsevier.atlassian.net/wiki/spaces/SRE/pages/119600974265468/DRAFT+-+Technical+Resilience+-+Infrastructure+Platform+Instance
- BTS were told must use network policies by ISDP / IA / DBS
- poke on this requirement
- Irfan expects we need to support global deny all
- could deliver by kyverno policy that cannot create namespace without accompanying policy
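As a complement to the kyverno idea, an after-the-fact audit can list namespaces that lack a default-deny policy. Note this is a different technique (reporting, not admission control), and the sketch below works on plain dicts shaped like NetworkPolicy manifests; the helper names are hypothetical and fetching the objects from a cluster is left out.

```python
def is_default_deny(policy):
    """True if the manifest looks like a default-deny-ingress NetworkPolicy.

    Simplification: requires an empty podSelector (all pods), an explicit
    "Ingress" entry in policyTypes, and no ingress rules.
    """
    spec = policy.get("spec", {})
    selects_all_pods = spec.get("podSelector") == {}
    denies_ingress = "Ingress" in spec.get("policyTypes", []) and not spec.get("ingress")
    return selects_all_pods and denies_ingress


def namespaces_missing_default_deny(namespaces, policies):
    """Return the sorted namespaces that have no default-deny policy."""
    covered = {
        p["metadata"]["namespace"] for p in policies if is_default_deny(p)
    }
    return sorted(set(namespaces) - covered)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    policies = [
        {
            "metadata": {"name": "default-deny", "namespace": "team-a"},
            "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
        }
    ]
    print(namespaces_missing_default_deny(["team-a", "team-b"], policies))
```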
2023-06-12
- AWS training survey
- Consider reading / awareness of AWS well architected programme for dev plan
- Calico rollout
- core-engineering test cluster failure
- Delete cortex-build-team-eu-west-1-alpha - issues using: core-kube-cycle-cli
- planning
- propose PR to go-oidc
- numbers and causes for failed releases in nonprod
- green light on crtxctl
- integration tests following Thomas (pytest)
2023-06-05
- Raise support call for patching of EKS addons:
- C3
- Add COPS-P-000 runbook
2023-06-02
- migrate runbooks to ops site
2023-06-01
- office day for release process
MC gave a pretty good summary of current concerns and limitations from last call
queue: not getting benefits but getting complexity
alternatives
- rework reconciler as GHA
- remove async (queue) complexity
- increase visibility with std GHA definition and reporting
discussion
- concern over reliance on GHA: not an issue if use GHA merely as glue between binaries
RPC - Rights and permissions controller
- Nigel inherited from Phil Hibberd (along with CWS and QAS)
- old style EC2
- considerable tech debt
- kill multiple birds with one stone by merging into CWS platform
- Dave Cockram: solution architect: interested in moving PPE (and later PPM) to Cortex
- Abirami Manaharan (Chennai): lead dev on RPC
- Rob
RPC: Tomcat (Java 8, MVC, no Boot), Mongo -> Postgres, Elastic Search -> Postgres
Small: Single repo for UI and backend
timescales: Q3 & Q4
Multi-tenancy
- already have network isolation by
- also resource
- ongoing effort to resolve reconciliation failures
ACTIONS
- TPR1 Nigel
- runway, slack channel, assessment (Tim)