2023-02-28
- long slack conversation about Skipper
- concluded on needed test procedure: https://global-elsevier.slack.com/archives/C04MCB1J0F8/p1677584987026579
- Learned that Skipper uses host networking to bypass k8s networking! Argument is lower latency.
- work on bom-first troubleshoot generation
2023-02-27
planning
- blog for Ingress controller WG update
- meet with Claire / Steve
- NPS w Felipe
AWS and crossplane
HM Core Platform
ingress controller working group
- not enough evidence for a change based on interviews
- multi-tenancy is the issue
- comes down to controller requiring access to native network? (Garett)
2023-02-24
- bom
- explore what may be generated, what needs manual authoring
- explore units and relationships between them
- write proposal
2023-02-23
- dev duty
- bom
- directory example
    git clone git@github.com:cowboysysop/charts.git cowboysysop-charts
    cd cowboysysop-charts/
    git checkout 794c87ef211bbd7eb112c4035df2d32c513b4d5e
    git switch -c vertical-pod-autoscaler-6.0.0
    bom generate -n https://cortex.elsevier.systems/vertical-pod-autoscaler-6.0.0 -o vertical-pod-autoscaler-6.0.0.spdx --dirs=charts/vertical-pod-autoscaler
    bom document outline vertical-pod-autoscaler-6.0.0.spdx
- appears to just list files; it does not read the deployment to find the images
- vertical-pod-autoscaler results in:
- vpa-vertical-pod-autoscaler-admission-controller
    image:
      repository: k8s.gcr.io/autoscaling/vpa-admission-controller
      tag: 0.12.0
      pullPolicy: IfNotPresent
- vpa-vertical-pod-autoscaler-recommender
    image:
      repository: k8s.gcr.io/autoscaling/vpa-recommender
      tag: 0.12.0
      pullPolicy: IfNotPresent
- vpa-vertical-pod-autoscaler-updater
    image:
      repository: k8s.gcr.io/autoscaling/vpa-updater
      tag: 0.12.0
      pullPolicy: IfNotPresent
- image example
    # This way causes a stack overflow!
    bom generate -n https://cortex.elsevier.systems/vertical-pod-autoscaler-6.0.0 \
      -o vertical-pod-autoscaler-6.0.0.spdx \
      --dirs=charts/vertical-pod-autoscaler \
      --image=k8s.gcr.io/autoscaling/vpa-admission-controller:0.12.0 \
      --image=k8s.gcr.io/autoscaling/vpa-recommender:0.12.0 \
      --image=k8s.gcr.io/autoscaling/vpa-updater:0.12.0

    # This one works ok
    bom generate -n https://cortex.elsevier.systems/vpa-vertical-pod-autoscaler-updater-6.0.0 \
      -o vpa-vertical-pod-autoscaler-updater-6.0.0.spdx \
      --image=k8s.gcr.io/autoscaling/vpa-updater:0.12.0 \
      --format json
- so what do we want?
- "Thinking about your release" makes it clear that many possible SBOMs exist, so we need to consider how the bom will be used.
My user story is: As an Ops Engineer, given that I have access to a cluster and a platform blueprint, then I can know they match (or detect the drift).
MUST:
- Provide a complete, versioned list of packages.
SHOULD:
- Understand the version of the packaging (for example the Helm chart) as well as the version of the software package.
- Clarify the difference between Package Supplier and Package Originator. For example, the vertical pod autoscaler originator is the Kubernetes project but the supplier is cowboysysop.
COULD:
- Provide checksums, but not essential since (implicitly) we rely on the integrity of the CI/CD process to provide this.
- Provide license information, but not necessary since software should all be pre-approved via the IAW process.
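The user story above can be sketched as a comparison of two package lists: one taken from the platform blueprint (for instance derived from an SBOM) and one observed on the cluster. This is only a sketch under stated assumptions: the `detect_drift` helper, the input files and the one-`name version`-pair-per-line format are hypothetical and are not something the bom tool provides.

```shell
# Hypothetical helper: compare an expected package list (blueprint) with an
# observed one (cluster). Each file holds one "name version" pair per line.
# Prints "match" and returns 0 when the lists agree; otherwise prints the
# differences and returns 1.
detect_drift() {
  blueprint="$1"
  cluster="$2"
  tmp=$(mktemp -d)

  # comm requires sorted input, so normalise both lists first.
  sort -u "$blueprint" > "$tmp/want"
  sort -u "$cluster" > "$tmp/have"

  # -23 keeps lines only in the blueprint, -13 lines only in the cluster.
  missing=$(comm -23 "$tmp/want" "$tmp/have")
  unexpected=$(comm -13 "$tmp/want" "$tmp/have")
  rm -rf "$tmp"

  if [ -z "$missing" ] && [ -z "$unexpected" ]; then
    echo "match"
    return 0
  fi
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing" | sed 's/^/missing: /'
  fi
  if [ -n "$unexpected" ]; then
    printf '%s\n' "$unexpected" | sed 's/^/unexpected: /'
  fi
  return 1
}
```

Because name and version are compared as one line, a version change shows up as a `missing` entry for the old version plus an `unexpected` entry for the new one. That is crude, but it is enough to flag drift for a closer look.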
2023-02-22
troubleshoot
- focus on understanding the content of kube-system.json and more intelligent exclusions
bom
- https://github.com/kubernetes-sigs/bom
    go install sigs.k8s.io/bom/cmd/bom@latest
- can generate and document spdx files from a directory or, more usefully, from a docker image
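Since generating from a Docker image is the more useful mode, one option is to produce one SPDX document per image. The sketch below only prints the `bom generate` commands (a dry run) rather than running them. The `print_bom_commands` helper, the placeholder namespace prefix and the naming of each document after the image's base name are assumptions for illustration; the `-n`, `-o`, `--image` and `--format json` flags are the ones already used in these notes.

```shell
# Hypothetical helper: print (not run) one `bom generate` command per image.
# $1 is a namespace prefix; the remaining arguments are image references.
print_bom_commands() {
  prefix="$1"
  shift
  for image in "$@"; do
    # Derive a document name from the image: drop the registry path, then
    # replace the ":" before the tag with "-".
    name=$(basename "$image" | tr ':' '-')
    echo "bom generate -n $prefix/$name -o $name.spdx --image=$image --format json"
  done
}

# Placeholder prefix; substitute the real document namespace.
print_bom_commands https://example.invalid/sbom \
  k8s.gcr.io/autoscaling/vpa-recommender:0.12.0 \
  k8s.gcr.io/autoscaling/vpa-updater:0.12.0
```

Printing the commands first makes it easy to review the generated names before anything is written to disk.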
town hall
- 100% new CI/CD adopts standard
- 25% improvement in DORA on Cortex
2023-02-21
- troubleshoot.sh
- investigate fetching the helm charts for comparison
- summarise investigation in ticket and seek new direction
- investigate a reference file of cluster state and a straightforward diff.
2023-02-20
- IPI spot replacement incident (no outage):
- I’ve a couple of questions that I noted here: https://elsevier.atlassian.net/wiki/spaces/SRE/pages/119601079531737/Spot+MNG+ASG+recommendations including the suggestion from Thomas of prioritising Cortex components. I also added a couple of links to the current docs: https://github.com/elsevier-centraltechnology/cortex-documentation/pull/83
- Skipper
- getting in the way of network isolation (IPI)
- trend of change
- is it actively developed?
- can we [continue to] fill any gaps in upstream docs?
2023-02-15
- book accommodation for KubeCon
- dev duty
- ctd troubleshoot generation
- RDP
- multi-tenant
- cost allocation by tag
- also use istio for network isolation (not a heavy use)
- ashish recommends skipper for network isolation
- Gabriel Oscares
- ingress controller
- istio, but happy to move mostly to skipper; one place (kube-flow) may need it, will tackle in a later phase
- phasing
- q1 assessment
- pilot with one domain in q1
- will be document domain
- no TPR 1 yet but Gabriel will do it
- then assessment
- want to avoid big bang
- target to complete not set
- Adrian, complete nuke and pave every few days
- might retain some basics like config bucket (part of bootstrap)
- ashish: might also retain vpc, cidr, vpc transit gateway???
- Shane Bedggood
- Shantanu
- PR on mins: 0 on demand, min: 1, desired
- multi-tenant
2023-02-14
- chat / support TiMi on boundary effort
and do we have any clue why it didn’t run? because at that time you were using the role boundaries?
Tim Stephenson 5:58 PM so if we run on main now with a brand new boundary-test-42 we should expect success?
Tim Miller 6:21 PM maybe? it DID work on platform-nonprod so maybe rp-test is f’d
Tim Stephenson 7:54 PM Hmm, we have been fooled by a false hypothesis a few times. Maybe we can try it again tomorrow
- an ingress call with shadow health
2023-02-13
- made some decent progress on troubleshoot preflight generation
- convo with Daniel and Khush on Inspector
2023-02-10
- install and explore troubleshoot.sh
2023-02-09
- CEIP-2672: Jenkins runbook
- quick wins:
- CEIP-3002: broken doc link
- CEIP-2409: requirements for platman q
- review and pick inspector tickets
- alternate NPS?
- what user activity can we get from k8s?
2023-02-08
- complete ceip-2855/3093: release process
- couple of ingress calls
2023-02-07
- merge and revise ami upgrade procedure
- had to go home lunchtime for power, doh!
2023-02-06
- ceip-3076: catch up 3 pages missed in cortex-ops revamp.
- ceip-2855/3093: release process
- merged comments from last week review
- started bpmn
- retro and (long) planning
2023-02-02 & 03
- #FF Charity Majors
- move cortex operations user guide material to backstage
- complete review and publication of revised cortex-operations
2023-02-01
- Reviewing Cortex Ops guide (completed README)
- Reviewing post mortems
- Reviewing release process