2023-01-31

  • Investigation of Skipper 5.3.0 incident Friday in webpresence staging
  • Review with Khush, run skipper predicate rollout in Alpha
  • write post mortem for CRD / Skipper incident
  • Major review of Cortex Ops guide.

2023-01-30

  • Review blog & helm charts with Khush, rollout in Alpha
  • Planning

2023-01-27

  • review predicate stuff with Khushnood
  • mostly taken up with P1 incident and post-mortem

2023-01-26

  • spike for Cortex Operations?
    • setup hugo (theme + action)
    • integrate existing content
    • address hangover from CEIP-3113
  • CEIP-2855: release page
  • 1-2-1
    • strategy for building ELS industry profile
    • Lead role advertised reporting to Irfan

2023-01-25

  • create ingress working group tickets
  • make Kong super-admin
    curl -H "Kong-Admin-Token: XXXXXXXXXXXXX" -X POST https://sandbox.kong-nonprod.cortex.elsevier.systems/_api/default/admins/<YOUR_USER_EMAIL>/roles --data 'roles=super-admin'
    
  • CEIP-3113 onboarding ticket
    • talk to Jonathan about 404 page

2023-01-24

  • CEIP-2855: Document Cortex release classification and procedure
  • CEIP-3071: adjust Skipper/Ingress documentation to list the correct subnets

2023-01-23 - vacation

2023-01-19

  • 1-2-1:
    • okrs:
      • quality of cortex
      • recruiting
        • build & ops
      • new regions
      • training
  • Raven: review assessment
  • Review cortex-warnings
    • Ashish: "Haven't seen this in a bigger cluster. CWS is always low in spec on alpha, which is probably the reason why there are more restarts." (https://tioengineering.slack.com/archives/C048YN9PMR9/p1674115887399789)
      nri-bundle-nrk8s-kubelet-9jpxd     ●     1/2               46 Crashback
      nri-bundle-nrk8s-kubelet-x79b7     ●     2/2               15 Running
      
    • Ashish: "All CPU 90+ request, and 150+ limits (overprovisioned)." (https://tioengineering.slack.com/archives/C048YN9PMR9/p1674132909976159)
    • Karpenter workshop, London, 2 Feb

2023-01-18

  • Raven Q&A
    • prod: 7 on demand, max 3, desired: 1, min 0
    • updates to runway tracker, jira and assessment
    • pair with Giani on Graal
  • HM Graph
    • looking for less EKS management (not upgrading EKS versions for example)
    • happy with managing nodes and instance types
    • using ‘std’ AWS ingress controller, prefer to retain as know how it works
    • search uses preferred solr chart (want to switch to solr operator)
    • search tightly coupled with graph and been suggested they use Kong for this intra-VPC comms
    • will be in same account at some stage
    • knowledge miner could be first, integrate with poc search spanning two teams
    • solr uses zookeeper for persistence, one impl uses EFS
    • think solr has own PVC (stack dfs solr state)
  • Rota planning

2023-01-16

  • planning

    • tickets for each partner to report ingress usage.
    • release new skipper helm chart (alpha, beta & prod)
    • CEIP-3051: ticket to make skipper default ingressclass
  • CEIP-2909

    • where is timeout on load balancer, verify and document (relies on AWS ALB default, could expose via skipper chart)
    • diagram on how the ingress controller works
    • raised question of whether to rename ingress controller to ‘exposing apps internally / externally’?

2023-01-13

  • H-Graph

    • Paul Piombino and David Childs requested new assessment for proposed new cluster
    • Matthew Morgis: no TPR3 since 2019, very old
    • Paul: Knowledge cluster = H-Graph+Search
  • CEIP-2909

    • examine tio-platform-nonprod account

      • 3 clusters: cortex-platform-manager-non-prod, sandbox-cluster, test-cluster
      • 4 load balancers:
        • 1 classic: ArgoCD in sandbox
        • 3 ALBs:
          • kube-ing-LB-OK2BKFQSL9N3: cortex-build-team-sandbox-cluster-alpha
          • kube-ing-LB-1V9ZUQS1CZLDZ: cortex-platform-manager-non-prod
          • kube-ing-LB-1TAF6A408ZUPV: core-engineering-test-cluster-alpha
      • in each case, all apps on a given cluster route internal traffic through the same load balancer. Further, setting any particular ingress to external (no load balancer type annotation, or set to internet-facing) causes a new LB to be created to route this traffic.
    • Bug: no ingressclass defined (should list Skipper and mark it as default) Ref
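
The missing IngressClass flagged in the bug note above could be filled with a manifest along these lines. This is only a sketch: `ingressclass.kubernetes.io/is-default-class` is the standard Kubernetes annotation for marking a default class, but the resource name `skipper` and the `spec.controller` value are placeholders assumed for illustration, not values taken from the Cortex clusters.

```shell
# Sketch: write an IngressClass manifest that marks Skipper as the default class.
# ASSUMPTION: the name "skipper" and the controller string below are placeholders.
cat > ingressclass-skipper.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: skipper
  annotations:
    # Standard Kubernetes annotation: Ingresses that name no class get this one.
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  # Hypothetical controller identifier; replace with the value the installed controller expects.
  controller: example.com/skipper
EOF

# Print the manifest for review. To apply it against a cluster (not run here):
#   kubectl apply -f ingressclass-skipper.yaml
#   kubectl get ingressclass
cat ingressclass-skipper.yaml
```

Only one IngressClass per cluster should carry the default annotation, so it is worth checking `kubectl get ingressclass` before applying.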

2023-01-12

  • CEIP-2909

  • Raven

    • Alex, junior TIO
      • Raven: self serve notification
      • on EKS but old
      • minimise op burden
      • what kind email or more?
    • Giana, Tech Lead
      • old
      • java 8
      • bad terraform,
      • want k8s
      • mix of terraform, kustomize and helm
      • example:
        • last year started migrate nginx ingress
        • VPC management not easy / well known
      • clients need to provide an IP range in order to connect to Raven
      • want java17, reliant on slade (intercept http and validate)
      • first ELS account (shared services?)
      • dual running,
      • fulfillment (separate acct)
    • Kim, product inc. support
      • sponsor
    • Ahmed, soft eng, new dad
    • Mateus, soft eng
    • Danyna, Prometheus, junior soft eng
    • Terry, arch, MUST do TPR
      • some transition alongside replatform
        • newer java
        • have arch diag.
        • determine phasing
        • focus on no new features, pay down tech debt
      • TPR 1 on Monday?
      • ‘Raven foundations 2023’
    • Felipe
      • do you really need to tell partners everything
    • Thomas
  • CEIP-2909

    • Skipper ‘official’ Helm chart (NOTE: changes by Marcus Noble)
    • Where is load balancer timeout specified?
    • K8s does not include a controller for Ingress as it does for Deployments and Services; you have to bring your own.
    • architecture example (single replica CWS app):
      • ingress (ELB): apollo-airflow-reporting, class=skipper, rules route host/path to backend (pod): app-dev-services.apollo-np.elsevier.com /airflow/ -> apollo-airflow-reporting:80 (100.67.139.112:8080)
      • service: apollo-airflow-reporting, type=ClusterIP, cluster-ip=172.20.113.70, external-ip=none
      • pod: apollo-airflow-reporting-6ddc8fb646-kz4vq, ip=100.67.139.112, address=ip-10-183-19-222.eu-west-1.compute.internal
    • kong-nonprod example:
      • ingress: statuscode-tester.kong-nonprod.cortex.elsevier.systems / -> statuscode-tester:web (100.67.105.81:8080), address=internal-kube-ing-lb-wyt9pxgu2u4x-1108020136.elsevier.systems, annotation kubernetes.io/ingress.class: skipper
      • service: statuscode-tester, type=ClusterIP, ip=172.20.76.52, external-ip=none
      • pod: statuscode-tester-84f687f579-4qfms, ip=100.67.105.81, node=ip-10-183-33-13.eu-west-1.
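
The CWS architecture example above corresponds roughly to an Ingress like the one sketched here. The host, path, service name and port come from the notes; the `pathType` value and the use of the `kubernetes.io/ingress.class` annotation (as seen on the kong-nonprod ingress) rather than `spec.ingressClassName` are assumptions.

```shell
# Sketch: reconstruct the apollo-airflow-reporting Ingress from the notes.
# ASSUMPTIONS: pathType and the annotation-based class selection are guesses.
cat > apollo-airflow-reporting-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apollo-airflow-reporting
  annotations:
    kubernetes.io/ingress.class: skipper
spec:
  rules:
    - host: app-dev-services.apollo-np.elsevier.com
      http:
        paths:
          - path: /airflow/
            pathType: Prefix        # assumed, not recorded in the notes
            backend:
              service:
                name: apollo-airflow-reporting   # ClusterIP service
                port:
                  number: 80                     # service port; the pod listens on 8080
EOF

# Print the manifest for review. To trace the chain on a live cluster (not run here):
#   kubectl get ingress,service,endpoints apollo-airflow-reporting -o wide
cat apollo-airflow-reporting-ingress.yaml
```

The ingress controller resolves the service to its pod endpoints, which is why the ingress shows the pod address (100.67.139.112:8080) as the backend rather than the cluster IP.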

2023-01-11

  • CEIP-2909: Skipper (purely bug around 62 secs)
    • Helm chart for Skipper to specify timeout = 62
  • CEIP-3018

2023-01-10

  • CEIP-2967: cross account secret
    • reapply the terraform role

      • go to
      • run manual action on argocd-spike branch
    • manifests/alpha/app/ contains

      • cluster-tests: broken
      • csi-secrets: working
      • new-relic: WIP for cross account secret reading
    • try kustomize approach documented by MC

      • issues thread

        $ kustomize build documentation/gitops/manifests/alpha/overlay/sandbox/
        Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
        # and run from the sandbox dir…
        $ cd documentation/gitops/manifests/alpha/overlay/sandbox/
        $ kustomize build .
        Error: accumulating resources: accumulation err='accumulating resources from '../../base': '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base' must resolve to a file': recursed accumulation of path '/Users/stephensont/git/cortex-argocd-spike/documentation/gitops/manifests/alpha/base': no matches for Id HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system; failed to find unique target for patch HelmChartInflationGenerator.builtin.[noGrp]/kube-resource-report.kube-system
        # what version of kustomize are you on?
        $ kustomize version 
        {Version:kustomize/v4.5.7 GitCommit:56d82a8378dfc8dc3b3b1085e5a6e67b82966bd7 BuildDate:2022-08-02T16:28:01Z GoOs:darwin GoArch:amd64}
        
    • return to NewRelic application set approach

      • issue with helm chart 3.1:

        $ kubectl apply -k manifests/alpha/app/newrelic/
        applicationset.argoproj.io/appset-newrelic created
        

        within ArgoCD UI: rpc error: code = Unknown desc = Manifest generation error (cached): rpc error: code = FailedPrecondition desc = Failed to unmarshal "clusterrole.yaml": <nil>

      • that is the chart version we are currently using, hmm

      • manual install direct to the sandbox cluster fails due to missing secret (fair enough)

        helm install nr-3.1 ~/Downloads/newrelic-3.1.0\(2\).tgz
        Error: INSTALLATION FAILED: execution error at (newrelic/templates/newrelic-prometheus/deployment.yaml:39:22): A license key is required
        
    • return to kustomize, fix is here

  • potential blogs: just CLI or picocli or format or JReleaser

2023-01-09

  • ArgoCD demo (discussion and delivery)
  • retro
  • planning
    • Skipper meetings blocked for now
    • follow up about ISDP and GHA after Weds
    • CEIP-2909: Skipper (purely bug around 62 secs)
    • refine EFS and Kyverno tickets
  • re:Invent wash-up
    • too big
    • get more out of each return visit
    • ‘learning conference’ not networking
    • Karin H arguing for sustainability pillar, Steve S more prosaic: efficient apps = less energy = better
    • Karpenter
  • cleaned up and archived GitHub gateway spike

2023-01-05

  • CEIP-2972: document Skipper secure configuration

  • CEIP-2974: complete FAQs update

  • 1-2-1

    • Kong envelope: Mark Williamson to take on so we step back.
      • convert to spike and park as done
      • exclude log aggregation parts
    • OKRs:
      • need to be more measurable this year
      • workshops / Cortex U
      • K8s CKA? (started Udemy course but too basic so far)
  • Investigate alert and encourage KA to add to runbook

2023-01-04

  • Ingress mtg
  • Reviews with Felipe: ingress, documentation and support epics, Jenkins
  • Ingress comms (email by committee)
  • discussion of Skipper with Khushnood
  • CEIP-2974:
    • restructure doc tree
    • remove duplicate material
      • from FAQs:
        • remove ‘What is the Cortex Infrastructure Platform?’ (already in ‘What is Cortex?’)
        • remove ‘What is Platform Manager (PlatMan)? Why is it important?’ (already in ‘What is Cortex?’)

2023-01-03

  • Planning

    • Skipper replacement: IPI (part of A&G) requires Cortex direction before deciding whether or not to come off Skipper; they are likely to do the same thing as core-engineering
      • issues with Skipper due to ‘host port’ usage (whatever that is)
        • Garrett’s view: can we extend timeout, can we configure ???
      • working assumption: no partner can be using Skipper as it currently stands, since that would mean removing a necessary annotation => safe to make the change
      • apply annotation to all Cortex ingress
      • skipper level change so no outage
      • IDEA: Kyverno to enforce (prevent ingress creation without applicable annotation)
    • Fully managed service: IP talks of direction of travel being fully managed but not yet in a position to do it. ‘Fargate of ELS’
      • examples: common CI/CD initiative, remove partner access to kube-system etc.
  • CEIP-2970

    • rework and fight with the partially documented Jenkins pipeline DSL
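
The Kyverno idea noted under the Skipper replacement planning above (prevent ingress creation without the applicable annotation) might look something like the policy below. This is a hedged sketch, not a tested policy: the notes do not name the annotation, so `kubernetes.io/ingress.class` is assumed here, the policy and rule names are invented, and the field layout should be checked against the Kyverno version actually installed.

```shell
# Sketch: Kyverno ClusterPolicy that rejects Ingresses lacking a class annotation.
# ASSUMPTIONS: annotation key, policy name and rule name are illustrative only.
cat > require-ingress-annotation.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-class-annotation
spec:
  validationFailureAction: Enforce   # block the request instead of only auditing
  background: true
  rules:
    - name: check-ingress-class-annotation
      match:
        any:
          - resources:
              kinds:
                - Ingress
      validate:
        message: "Ingress resources must set the kubernetes.io/ingress.class annotation."
        pattern:
          metadata:
            annotations:
              # "?*" matches any non-empty value
              kubernetes.io/ingress.class: "?*"
EOF

# Print the policy for review. To try it on a sandbox cluster (not run here):
#   kubectl apply -f require-ingress-annotation.yaml
cat require-ingress-annotation.yaml
```

Starting in audit mode before enforcing would show which existing ingresses would be rejected without causing an outage.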