January logbook

2023-01-31

Investigation of Skipper 5.3.0 incident Friday in webpresence staging
Review with Khush, run skipper predicate rollout in Alpha
write post mortem for CRD / Skipper incident
Major review of Cortex Ops guide.

2023-01-30

Review blog & helm charts with Khush, rollout in Alpha
Planning

2023-01-27

review predicate stuff with Khushnood
mostly taken up with P1 incident and post-mortem

2023-01-26

spike for Cortex Operations?
- setup hugo (theme + action)
- integrate existing content
- address hangover from ceip-3113
ceip-2855: release page
1-2-1
- strategy for building ELS industry profile
- Lead role advertised reporting to Irfan

2023-01-25

create ingress working group tickets

make Kong super-admin

curl -H "Kong-Admin-Token: XXXXXXXXXXXXX" -X POST https://sandbox.kong-nonprod.cortex.elsevier.systems/_api/default/admins/<YOUR_USER_EMAIL>/roles --data 'roles=super-admin'

ceip-3113 onboarding ticket
- talk to jonathan about 404 page

2023-01-24

CEIP-2855: Document Cortex release classification and procedure
CEIP-3071: Skipper/Ingress documentation adjust correct subnets

2023-01-23 - vacation

2023-01-19

1-2-1:
- okrs:
  - quality of cortex
  - recruiting
    - build & ops
  - new regions
  - training
Raven: review assessment
Review cortex-warnings
- Ashish: [Haven't seen this in bigger cluster, CWS is always low in spec on alpha and probably reason why there are more restarts.](https://tioengineering.slack.com/archives/C048YN9PMR9/p1674115887399789)
```
nri-bundle-nrk8s-kubelet-9jpxd     ●     1/2               46 Crashback
nri-bundle-nrk8s-kubelet-x79b7     ●     2/2               15 Running
```
- Ashish: [All cpu 90+ request, and 150+ limits. (Overprovisioned)](https://tioengineering.slack.com/archives/C048YN9PMR9/p1674132909976159)
- Karpenter workshop, London, 2 Feb

2023-01-18

Raven Q&A
- prod: 7 on demand, max 3, desired: 1, min 0
- updates to runway tracker, jira and assessment
- pair with Giani on Graal
HM Graph
- looking for less EKS management (not upgrading EKS versions for example)
- happy with managing nodes and instance types
- using ‘std’ AWS ingress controller, prefer to retain as know how it works
- search uses preferred solr chart (want to switch to solr operator)
- search tightly coupled with graph and been suggested they use Kong for this intra-VPC comms
- will be in same account at some stage
- knowledge miner could be first, integrate with poc search spanning two teams
- solr uses zookeeper for persistence, one impl uses EFS
- think solr has own PVC (stack dfs solr state)
Rota planning
- TODO: codify what are the rules for Incident Response, including

2023-01-16

planning
- tickets for each partner to report ingress usage.
- release new skipper helm chart (alpha, beta & prod)
- CEIP-3051: ticket to make skipper default ingressclass
CEIP-2909
- where is timeout on load balancer, verify and document (relies on AWS ALB default, could expose via skipper chart)
- diagram on how the ingress controller works
- raised question of whether to rename ingress controller to ‘exposing apps internally / externally’?

2023-01-13

H-Graph
- Paul Piombino and David Childs requested new assessment for proposed new cluster
- Matthew Morgis: no TPR3 since 2019, very old,
- Paul: Knowledge cluster = H-Graph+Search
CEIP-2909
- examine tio-platform-nonprod account
  - 3 clusters: cortex-platform-manager-non-prod, sandbox-cluster, test-cluster
  - 4 load balancers:
    - 1 classic: ArgoCD in sandbox
    - 3 ALBs:
      - kube-ing-LB-OK2BKFQSL9N3: cortex-build-team-sandbox-cluster-alpha
      - kube-ing-LB-1V9ZUQS1CZLDZ: cortex-platform-manager-non-prod
      - kube-ing-LB-1TAF6A408ZUPV: core-engineering-test-cluster-alpha
  - In each case all apps on a given cluster route internal traffic through the same load balancer. Further, setting any particular ingress to external (no load balancer type annotation, or set to internet-facing) creates a new LB to route this traffic.
- Bug: no ingressclass defined (should list Skipper and mark it as default) Ref
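
A minimal sketch of what the missing IngressClass could look like, assuming the standard `networking.k8s.io/v1` API and the upstream default-class annotation. The file name and the `spec.controller` value are placeholders of mine, not taken from any cluster; the real controller identifier needs checking against the Skipper deployment.

```shell
#!/bin/sh
# Sketch only: write an IngressClass manifest that lists Skipper
# and marks it as the cluster default.
# ASSUMPTION: the controller string below is a hypothetical placeholder.
cat > skipper-ingressclass.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: skipper
  annotations:
    # Upstream annotation that makes this class the default
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: example.com/skipper   # hypothetical controller identifier
EOF

# Review, then apply against the target cluster (left commented out here):
# kubectl apply -f skipper-ingressclass.yaml
# kubectl get ingressclass
```

Only one IngressClass per cluster should carry the default annotation, so this would need coordinating with any other ingress controllers present.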

2023-01-12

CEIP-2909
- kong-nonprod
  - has 2 ALBs
Raven
- Alex, junior TIO
  - Raven: self serve notification
  - on EKS but old
  - minimise op burden
  - what kind email or more?
- Giana, Tech Lead
  - old
  - java 8
  - bad terraform,
  - want k8s
  - mix of terraform, kustomize and helm
  - example:
    - last year started migrate nginx ingress
    - VPC management not easy / well known
  - clients need to provide an IP range to connect to Raven
  - want java17, reliant on slade (intercept http and validate)
  - first ELS account (shared services?)
  - dual running,
  - fulfillment (separate acct)
- Kim, product inc. support
  - sponsor
- Ahmed, soft eng, new dad
- Mateus, soft eng
- Danyna, Prometheus, junior soft eng
- Terry, arch, MUST do TPR
  - some transition alongside replatform
    - newer java
    - have arch diag.
    - determine phasing
    - focus on no feature pay down tech debt
  - TPR 1 on Monday?
  - ‘Raven foundations 2023’
- Felipe
  - do you really need to tell partners everything
- Thomas
CEIP-2909
- Skipper ‘official’ Helm chart (NOTE: changes by Marcus Noble)
- Where is load balancer timeout specified?
- K8s does not include controller for ingress as it does for deployment and services, have to bring own.
- architecture example (single replica CWS app):
  - ingress (ELB): apollo-airflow-reporting, class=skipper, rules route host/path to backend (pod): host app-dev-services.apollo-np.elsevier.com, path /airflow/ -> apollo-airflow-reporting:80 (100.67.139.112:8080)
  - service: apollo-airflow-reporting, type=ClusterIP, cluster-ip=172.20.113.70, external-ip=none
  - pod: apollo-airflow-reporting-6ddc8fb646-kz4vq, ip=100.67.139.112, address=ip-10-183-19-222.eu-west-1.compute.internal
- kong-nonprod example:
  - ingress: host statuscode-tester.kong-nonprod.cortex.elsevier.systems, path / -> statuscode-tester:web (100.67.105.81:8080), address=internal-kube-ing-lb-wyt9pxgu2u4x-1108020136.elsevier.systems, annotation kubernetes.io/ingress.class: skipper
  - service: statuscode-tester, type=ClusterIP, ip=172.20.76.52, external-ip=none
  - pod: statuscode-tester-84f687f579-4qfms, ip=100.67.105.81, node=ip-10-183-33-13.eu-west-1.
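
For reference, a rough reconstruction of the Ingress behind the apollo-airflow-reporting example above, using only the host, path, service name and port recorded in these notes. The file name and `pathType: Prefix` are my assumptions, as is the use of the `kubernetes.io/ingress.class` annotation (seen on the kong-nonprod ingress) rather than `spec.ingressClassName`.

```shell
#!/bin/sh
# Sketch only: an Ingress that routes host/path to the ClusterIP service,
# which in turn fronts the pod (100.67.139.112:8080 in the notes).
# ASSUMPTIONS: pathType and the annotation form are guesses, not read from the cluster.
cat > apollo-airflow-reporting-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apollo-airflow-reporting
  annotations:
    kubernetes.io/ingress.class: skipper
spec:
  rules:
    - host: app-dev-services.apollo-np.elsevier.com
      http:
        paths:
          - path: /airflow/
            pathType: Prefix        # assumed
            backend:
              service:
                name: apollo-airflow-reporting
                port:
                  number: 80
EOF

# To compare with the live objects (needs cluster access, so commented out):
# kubectl get ingress/apollo-airflow-reporting service/apollo-airflow-reporting -o wide
```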

2023-01-11

CEIP-2909: Skipper (purely bug around 62 secs)
- Helm chart for Skipper to specify timeout = 62
CEIP-3018

2023-01-10

CEIP-2967: cross account secret

reapply the terraform role
- go to
- run manual action on argocd-spike branch
manifests/alpha/app/ contains
- cluster-tests: broken
- csi-secrets: working
- new-relic: WIP for cross account secret reading

try kustomize approach documented by MC

issues thread

$ kustomize build documentation/gitops/manifests/alpha/overlay/sandbox/
Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
# and run from the sandbox dir…
$ cd documentation/gitops/manifests/alpha/overlay/sandbox/
$ kustomize build .
Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
# what version of kustomize are you on?
$ kustomize version 
{Version:kustomize/v4.5.7 GitCommit:56d82a8378dfc8dc3b3b1085e5a6e67b82966bd7 BuildDate:2022-08-02T16:28:01Z GoOs:darwin GoArch:amd64}

return to NewRelic application set approach
- issue with helm chart 3.1:
```
$ kubectl apply -k manifests/alpha/app/newrelic/
applicationset.argoproj.io/appset-newrelic created
```
  within ArgoCD UI: rpc error: code = Unknown desc = Manifest generation error (cached): rpc error: code = FailedPrecondition desc = Failed to unmarshal "clusterrole.yaml": <nil>
- that is the chart version we are currently using, hmm
- manual install direct to the sandbox cluster fails due to missing secret (fair enough)
```
helm install nr-3.1 ~/Downloads/newrelic-3.1.0$2$.tgz
Error: INSTALLATION FAILED: execution error at (newrelic/templates/newrelic-prometheus/deployment.yaml:39:22): A license key is required
```
return to kustomize, fix is here

potential blogs: just CLI or picocli or format or JReleaser

2023-01-09

ArgoCD demo (discussion and delivery)
retro
planning
- Skipper meetings blocked for now
- follow up about ISDP and GHA after Weds
- CEIP-2909: Skipper (purely bug around 62 secs)
- refine EFS and Kyverno tickets
re:Invent wash-up
- too big
- get more out of each return visit
- ‘learning conference’ not networking
- Karin H arguing for sustainability pillar, Steve S more prosaic: efficient apps = less energy = better
- Karpenter
cleaned up and archived GitHub gateway spike

2023-01-05

CEIP-2972: document Skipper secure configuration
CEIP-2974: complete FAQs update
1-2-1
- Kong envelope: Mark Williamson to take on so we step back.
  - convert to spike and park as done
  - exclude log aggregation parts
- OKRs:
  - need to be more measurable this year
  - workshops / Cortex U
  - K8s CKA? (started Udemy course but too basic so far)
Investigate alert and encourage KA to add to runbook

2023-01-04

Ingress mtg
Reviews with Felipe: ingress, documentation and support epics, Jenkins
Ingress comms (email by committee)
discussion of Skipper with Khushnood
CEIP-2974:
- restructure doc tree
- remove duplicate material
  - from FAQs:
    - remove ‘What is the Cortex Infrastructure Platform?’ (already in ‘What is Cortex?’)
    - remove ‘What is Platform Manager (PlatMan)? Why is it important?’ (already in ‘What is Cortex?’)

2023-01-03

Planning
- Skipper replacement: IPI (part of A&G) needs Cortex direction before deciding whether or not to come off Skipper; likely to do the same thing as core-engineering
  - issues with Skipper due to ‘host port’ usage (whatever that is)
    - Garrett’s view: can we extend timeout, can we configure ???
  - working assumption: no partner can be using Skipper as things stand, since that would mean removing the necessary annotation => safe to make the change
  - apply annotation to all Cortex ingress
  - skipper level change so no outage
  - IDEA: Kyverno to enforce (prevent ingress creation without applicable annotation)
- Fully managed service: IP talks of direction of travel being fully managed but not yet in a position to do it. ‘Fargate of ELS’
  - examples: common CI/CD initiative, remove partner access to kube-system etc.
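
The Kyverno idea in the Skipper replacement notes above could look roughly like the policy below. This is a sketch under assumptions: the annotation key, policy name and file name are placeholders (the notes do not record which annotation is meant), and the exact field names can differ between Kyverno versions.

```shell
#!/bin/sh
# Sketch only: a Kyverno ClusterPolicy that rejects any Ingress
# created without a given annotation.
# ASSUMPTION: "example.com/required-annotation" is a hypothetical key.
cat > require-ingress-annotation.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-annotation
spec:
  validationFailureAction: Enforce   # block, rather than only audit
  background: true
  rules:
    - name: check-ingress-annotation
      match:
        any:
          - resources:
              kinds:
                - Ingress
      validate:
        message: "Ingress must carry the required annotation."
        pattern:
          metadata:
            annotations:
              example.com/required-annotation: "?*"   # any non-empty value
EOF

# Review, then apply against a non-prod cluster first (commented out here):
# kubectl apply -f require-ingress-annotation.yaml
```

Starting with `validationFailureAction: Audit` would show which existing partner ingresses lack the annotation before anything is blocked.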
CEIP-2970
- rework and fight with the partially documented Jenkins pipeline DSL