2023-01-31
- Investigation of Skipper 5.3.0 incident Friday in webpresence staging
- Review with Khush, run skipper predicate rollout in Alpha
- write post mortem for CRD / Skipper incident
- Major review of Cortex Ops guide.
2023-01-30
- Review blog & helm charts with Khush, rollout in Alpha
- Planning
2023-01-27
- review predicate stuff with Khushnood
- mostly taken up with P1 incident and post-mortem
2023-01-26
- spike for Cortex Operations?
- setup hugo (theme + action)
- integrate existing content
- address hangover from ceip-3113
- ceip-2855: release page
- 1-2-1
- strategy for building ELS industry profile
- Lead role advertised reporting to Irfan
2023-01-25
- create ingress working group tickets
- make Kong super-admin
    curl -H "Kong-Admin-Token: XXXXXXXXXXXXX" -X POST https://sandbox.kong-nonprod.cortex.elsevier.systems/_api/default/admins/<YOUR_USER_EMAIL>/roles --data 'roles=super-admin'
- ceip-3113 onboarding ticket
- talk to Jonathan about 404 page
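A minimal sketch of the Kong super-admin call as a reusable shell function, reading the token from the environment rather than pasting it inline. The variable names (`KONG_ADMIN_URL`, `KONG_WORKSPACE`, `KONG_ADMIN_TOKEN`) and function names are assumptions for illustration; only the endpoint shape and the `roles=super-admin` payload come from the note.

```shell
# Assumed environment variables (hypothetical names, not from the notes).
KONG_ADMIN_URL="${KONG_ADMIN_URL:-https://sandbox.kong-nonprod.cortex.elsevier.systems/_api}"
KONG_WORKSPACE="${KONG_WORKSPACE:-default}"

# Build the roles endpoint for the admin identified by e-mail ($1).
kong_roles_url() {
  printf '%s/%s/admins/%s/roles' "$KONG_ADMIN_URL" "$KONG_WORKSPACE" "$1"
}

# POST the super-admin role for the admin in $1.
# Aborts with a message if KONG_ADMIN_TOKEN is not set.
kong_grant_super_admin() {
  curl --fail --silent --show-error \
    -H "Kong-Admin-Token: ${KONG_ADMIN_TOKEN:?set KONG_ADMIN_TOKEN first}" \
    -X POST "$(kong_roles_url "$1")" \
    --data 'roles=super-admin'
}
```

Usage would be `KONG_ADMIN_TOKEN=... kong_grant_super_admin <YOUR_USER_EMAIL>`; nothing is sent until the function is called.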
2023-01-24
- CEIP-2855: Document Cortex release classification and procedure
- CEIP-3071: Skipper/Ingress documentation, correct the subnets
2023-01-23
- vacation
2023-01-19
- 1-2-1:
- okrs:
- quality of cortex
- recruiting
- build & ops
- new regions
- training
- Raven: review assessment
- Review cortex-warnings
- Ashish: [Haven't seen this in bigger cluster, CWS is always low in spec on alpha and probably reason why there are more restarts.](https://tioengineering.slack.com/archives/C048YN9PMR9/p1674115887399789)
    nri-bundle-nrk8s-kubelet-9jpxd ● 1/2 46 Crashback
    nri-bundle-nrk8s-kubelet-x79b7 ● 2/2 15 Running
- Ashish: [All cpu 90+ request, and 150+ limits. (Overprovisioned)](https://tioengineering.slack.com/archives/C048YN9PMR9/p1674132909976159)
- Karpenter workshop, London, 2 Feb
2023-01-18
- Raven Q&A
- prod: 7 on demand, max 3, desired: 1, min 0
- updates to runway tracker, jira and assessment
- pair with Giani on Graal
- HM Graph
- looking for less EKS management (not upgrading EKS versions for example)
- happy with managing nodes and instance types
- using ‘std’ AWS ingress controller, prefer to retain it as they know how it works
- search uses preferred solr chart (want to switch to solr operator)
- search tightly coupled with graph; it has been suggested they use Kong for this intra-VPC comms
- will be in same account at some stage
- knowledge miner could be first, integrate with poc search spanning two teams
- solr uses zookeeper for persistence, one impl uses EFS
- think solr has own PVC (stack dfs solr state)
- Rota planning
2023-01-16
planning
- tickets for each partner to report ingress usage.
- release new skipper helm chart (alpha, beta & prod)
- CEIP-3051: ticket to make skipper default ingressclass
CEIP-2909
- where is timeout on load balancer, verify and document (relies on AWS ALB default, could expose via skipper chart)
- diagram on how the ingress controller works
- raised question of whether to rename ingress controller to ‘exposing apps internally / externally’?
2023-01-13
H-Graph
- Paul Piombino and David Childs requested new assessment for proposed new cluster
- Matthew Morgis: no TPR3 since 2019, very old
- Paul: Knowledge cluster = H-Graph+Search
CEIP-2909
examine tio-platform-nonprod account
- 3 clusters: cortex-platform-manager-non-prod, sandbox-cluster, test-cluster
- 4 load balancers:
- 1 classic: ArgoCD in sandbox
- 3 ALBs:
- kube-ing-LB-OK2BKFQSL9N3: cortex-build-team-sandbox-cluster-alpha
- kube-ing-LB-1V9ZUQS1CZLDZ: cortex-platform-manager-non-prod
- kube-ing-LB-1TAF6A408ZUPV: core-engineering-test-cluster-alpha
In each case, all apps on a given cluster route internal traffic through the same load balancer.
Further, when any particular ingress is set to external (no load balancer type annotation, or the annotation set to internet-facing), a new LB is created to route its traffic.
Bug: no ingressclass defined (should list Skipper and mark it as default) Ref
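A minimal sketch of what the missing IngressClass could look like, written out as a manifest file. The `is-default-class` annotation is the standard Kubernetes mechanism for marking a default class; the `spec.controller` value is a hypothetical placeholder, since the notes do not record what the deployed Skipper controller identifies itself as.

```shell
# Write an IngressClass manifest that lists Skipper and marks it as the default.
cat > ingressclass-skipper.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: skipper
  annotations:
    # Ingresses created without an explicit class pick up this one.
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  # Hypothetical controller identifier; replace with the real one.
  controller: example.org/skipper
EOF
echo "wrote ingressclass-skipper.yaml"
```

It could then be applied with `kubectl apply -f ingressclass-skipper.yaml` and checked with `kubectl get ingressclass`.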
2023-01-12
CEIP-2909
- kong-nonprod
Raven
- Alex, junior TIO
- Raven: self serve notification
- on EKS but old
- minimise op burden
- what kind email or more?
- Giana, Tech Lead
- old
- java 8
- bad terraform
- want k8s
- mix of terraform, kustomize and helm
- example:
- last year started migrate nginx ingress
- VPC management not easy / well known
- clients need to provide an IP range to connect to raven
- want java17, reliant on slade (intercept http and validate)
- first ELS account (shared services?)
- dual running,
- fulfillment (separate acct)
- Kim, product inc. support
- sponsor
- Ahmed, soft eng, new dad
- Mateus, soft eng
- Danyna, Prometheus, junior soft eng
- Terry, arch, MUST do TPR
- some transition alongside replatform
- newer java
- have arch diag.
- determine phasing
- focus on no new features, pay down tech debt
- TPR 1 on Monday?
- ‘Raven foundations 2023’
- Felipe
- do you really need to tell partners everything
- Thomas
CEIP-2909
- Skipper ‘official’ Helm chart (NOTE: changes by Marcus Noble)
- Where is load balancer timeout specified?
- K8s does not include a controller for ingress as it does for deployments and services; you have to bring your own.
- architecture example (single replica CWS app):
- ingress (ELB): apollo-airflow-reporting, class=skipper, rules route host/path to backend (pod)
    app-dev-services.apollo-np.elsevier.com /airflow/ -> apollo-airflow-reporting:80 (100.67.139.112:8080)
- service: apollo-airflow-reporting, type=ClusterIP, cluster-ip=172.20.113.70, external-ip=none
- pod: apollo-airflow-reporting-6ddc8fb646-kz4vq, ip=100.67.139.112, address=ip-10-183-19-222.eu-west-1.compute.internal
- kong-nonprod example:
- ingress: statuscode-tester.kong-nonprod.cortex.elsevier.systems / statuscode-tester:web (100.67.105.81:8080)
    internal-kube-ing-lb-wyt9pxgu2u4x-1108020136.elsevier.systems
    kubernetes.io/ingress.class: skipper
- service: statuscode-tester, type=ClusterIP, ip=172.20.76.52, external-ip=none
- pod: statuscode-tester-84f687f579-4qfms, ip=100.67.105.81, node=ip-10-183-33-13.eu-west-1.
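The ingress -> service -> pod chain in the two examples above can be walked with a handful of `kubectl get` calls. A rough sketch, assuming (as in both examples) that the Ingress and Service share a name; the function name and its two arguments are hypothetical.

```shell
# Trace a route from Ingress to Service to Pod.
#   $1 = namespace, $2 = shared Ingress/Service name
trace_ingress_route() {
  ns="$1"
  name="$2"
  # Host/path rules, ingress class and load balancer address.
  kubectl -n "$ns" get ingress "$name" -o wide
  # ClusterIP service sitting behind the ingress rule.
  kubectl -n "$ns" get service "$name" -o wide
  # Pod IP:port pairs the service currently resolves to.
  kubectl -n "$ns" get endpoints "$name"
  # Pods with their IPs and the nodes they run on.
  kubectl -n "$ns" get pods -o wide
}
```

For the second example this would be called as `trace_ingress_route <namespace> statuscode-tester`; the pod IP in the endpoints output should match the backend address shown on the ingress.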
2023-01-11
- CEIP-2909: Skipper (purely bug around 62 secs)
- Helm chart for Skipper to specify timeout = 62
- CEIP-3018
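On the 62-second figure in CEIP-2909: a likely reading is that it sits just above the AWS ALB default idle timeout of 60 seconds, so that the load balancer, not Skipper, is the side that closes idle connections. The sketch below only encodes that comparison; the variable and function names are hypothetical, and the notes do not record which chart value carries the setting (Skipper's `-idle-timeout-server` flag is the probable target, stated here as an assumption).

```shell
ALB_IDLE_TIMEOUT_SECS=60      # AWS ALB default idle timeout
SKIPPER_IDLE_TIMEOUT_SECS=62  # value from the notes

# Succeeds only when the backend ($2) keeps idle connections open
# longer than the load balancer ($1) does.
idle_timeouts_ok() {
  [ "$2" -gt "$1" ]
}

if idle_timeouts_ok "$ALB_IDLE_TIMEOUT_SECS" "$SKIPPER_IDLE_TIMEOUT_SECS"; then
  echo "ok: skipper ${SKIPPER_IDLE_TIMEOUT_SECS}s > ALB ${ALB_IDLE_TIMEOUT_SECS}s"
else
  echo "risk: skipper may close connections the ALB still considers open"
fi
```

If the ALB timeout is ever raised, the Skipper value would need to move with it to keep the same margin.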
2023-01-10
- CEIP-2967: cross account secret
reapply the terraform role
- go to
- run manual action on argocd-spike branch
manifests/alpha/app/ contains
- cluster-tests: broken
- csi-secrets: working
- new-relic: WIP for cross account secret reading
try kustomize approach documented by MC
    $ kustomize build documentation/gitops/manifests/alpha/overlay/sandbox/
    Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
    # and run from the sandbox dir…
    $ cd documentation/gitops/manifests/alpha/overlay/sandbox/
    $ kustomize build .
    Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
    # what version of kustomize are you on?
    $ kustomize version
    {Version:kustomize/v4.5.7 GitCommit:56d82a8378dfc8dc3b3b1085e5a6e67b82966bd7 BuildDate:2022-08-02T16:28:01Z GoOs:darwin GoArch:amd64}
return to NewRelic application set approach
issue with helm chart 3.1:
    $ kubectl apply -k manifests/alpha/app/newrelic/
    applicationset.argoproj.io/appset-newrelic created
within ArgoCD UI:
    rpc error: code = Unknown desc = Manifest generation error (cached): rpc error: code = FailedPrecondition desc = Failed to unmarshal "clusterrole.yaml": <nil>
that is the chart version we are currently using, hmm
manual install direct to the sandbox cluster fails due to missing secret (fair enough)
    helm install nr-3.1 ~/Downloads/newrelic-3.1.0\(2\).tgz
    Error: INSTALLATION FAILED: execution error at (newrelic/templates/newrelic-prometheus/deployment.yaml:39:22): A license key is required
return to kustomize, fix is here
- potential blogs:
just CLI or picocli or format or JReleaser
2023-01-09
- ArgoCD demo (discussion and delivery)
- retro
- planning
- Skipper meetings blocked for now
- follow up about ISDP and GHA after Weds
- CEIP-2909: Skipper (purely bug around 62 secs)
- refine EFS and Kyverno tickets
- re:Invent wash-up
- too big
- get more out of each return visit
- ‘learning conference’ not networking
- Karin H arguing for sustainability pillar, Steve S more prosaic: efficient apps = less energy = better
- Karpenter
- cleaned up and archived GitHub gateway spike
2023-01-05
CEIP-2972: document Skipper secure configuration
CEIP-2974: complete FAQs update
1-2-1
- Kong envelope: Mark Williamson to take on so we step back.
- convert to spike and park as done
- exclude log aggregation parts
- OKRs:
- need to be more measurable this year
- workshops / Cortex U
- K8s CKA? (started Udemy course but too basic so far)
Investigate alert and encourage KA to add to runbook
2023-01-04
- Ingress mtg
- Reviews with Felipe: ingress, documentation and support epics, Jenkins
- Ingress comms (email by committee)
- discussion of Skipper with Khushnood
- CEIP-2974:
- restructure doc tree
- remove duplicate material
- from FAQs:
- remove ‘What is the Cortex Infrastructure Platform?’ (already in ‘What is Cortex?’)
- remove ‘What is Platform Manager (PlatMan)? Why is it important?’ (already in ‘What is Cortex?’)
2023-01-03
Planning
- Skipper replacement: IPI (part of A&G) requires Cortex direction before deciding whether to come off Skipper; they are likely to do the same thing as core-engineering
- issues with Skipper due to ‘host port’ usage (whatever that is)
- Garrett’s view: can we extend timeout, can we configure ???
- working assumption: no partner can be using Skipper, as currently that would mean removing a necessary annotation => safe to make the change
- apply annotation to all Cortex ingress
- skipper level change so no outage
- IDEA: Kyverno to enforce (prevent ingress creation without applicable annotation)
- Fully managed service: IP talks of direction of travel being fully managed but not yet in a position to do it.
‘Fargate of ELS’
- examples: common CI/CD initiative, remove partner access to kube-system etc.
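For the Kyverno idea above (prevent ingress creation without the applicable annotation), a policy along these lines might work. This is a sketch under assumptions: the notes do not name the annotation, so `kubernetes.io/ingress.class` is taken from the kong-nonprod example elsewhere in this log, and the policy and rule names are made up. Older Kyverno releases may expect lowercase `enforce`.

```shell
# Write a Kyverno ClusterPolicy that rejects Ingresses lacking a class annotation.
cat > require-ingress-class.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-class-annotation   # hypothetical name
spec:
  validationFailureAction: Enforce         # block, rather than only audit
  background: true
  rules:
    - name: check-ingress-class-annotation # hypothetical name
      match:
        any:
          - resources:
              kinds:
                - Ingress
      validate:
        message: "Ingress must set the kubernetes.io/ingress.class annotation."
        pattern:
          metadata:
            annotations:
              # "?*" means: present and non-empty (assumed annotation key).
              kubernetes.io/ingress.class: "?*"
EOF
echo "wrote require-ingress-class.yaml"
```

Starting with `validationFailureAction: Audit` would show which existing ingresses fail before anything is blocked.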
CEIP-2970
- rework and fight with the partially documented Jenkins pipeline DSL