November logbook

2022-11-25

script to push cluster yaml to PR and thus avoid errors
build identity clusters
- share AMIs
NewRelic training
End of year OKRs review with Felipe

2022-11-24

go to Oxford office for laptop domain repair
rebuild id-infra-use cluster
PRs for all other identity clusters
1-2-1 with both Felipe and Irfan
NewRelic training

2022-11-23

build cluster id-infra-use
- and rebuild to correct GitHub actions configuration
build clusters id-dev-use and id-dev-euw, but failed, pick up in the morning.
rework onboarding requirements docs

CEIP-2819: Argo spike

Kyverno app

kubectl config set-context --current --namespace=argo-system
argocd app create cortex-kyverno --repo https://github.com/elsevier-centraltechnology/cortex-platform-definitions.git --path platform-definitions/platform-common/alpha/cortex-kyverno --dest-server https://kubernetes.default.svc --dest-namespace argo-system

TODO: repo auth reqd
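
One way the repo auth TODO might be closed is to register the repository with Argo CD before creating the app. This is a minimal sketch, assuming HTTPS access with a GitHub token; `GIT_USER` and `GIT_TOKEN` are hypothetical variable names, and the function only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: register the private definitions repo with Argo CD before `argocd app create`.
# GIT_USER / GIT_TOKEN are hypothetical variable names, not taken from these notes.
REPO_URL="https://github.com/elsevier-centraltechnology/cortex-platform-definitions.git"

argocd_repo_add() {
  # $1 = repo URL, $2 = username, $3 = token
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: show the command but never echo the token itself.
    echo "argocd repo add $1 --username $2 --password ****"
  else
    argocd repo add "$1" --username "$2" --password "$3"
  fi
}

argocd_repo_add "$REPO_URL" "${GIT_USER:-example-user}" "${GIT_TOKEN:-}"
```

Passing the token as a flag exposes it to the local process list, so an SSH key or a credentials template may be a better fit once the spike moves on.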

Camunda spike for release process support?

2022-11-22

post-mortem for HM incident 2022-11-17
- https://elsevier.atlassian.net/browse/CEIP-2810
meetings on releases and argo and C3
update Team Interaction Model with latest diags:
- https://elsevier.atlassian.net/wiki/spaces/SRE/pages/119600964234270/Team+Interaction+Model

2022-11-21

slack catchup
deploy restbucks to k8s
- tag image:
  - docker image tag restbucks:1.0.0.BUILD-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/restbucks:1.0.0.20221118
- docker login using aws credentials:
  - aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
- push image:
  - docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/restbucks:1.0.0.20221118
- helm uninstall -n cws-test restbucks
- helm install -n cws-test restbucks ./helm/ -f helm/values.yaml
much confusion about what a release is
further helm uninstall kyverno for prod to leave helm in clean state

X core-engineering/prod
X emealasp/prod
X hm-core-platform/cluster-prod-ap
X hm-core-platform/cluster-prod
X identity/id-infra
X ipi/admin
X ipi/prod
X ipi/sandbox
X kong/prod
X ssdrapim/cluster-prod
X webpresence/prod

X cws/prod/
X business-services/prod

2022-11-18

pm: uninstall Kyverno

HM incident more diagnostics, long call on next steps
announcement to partners
removal of kyverno in all prod clusters

X X core-engineering/prod
X X emealasp/prod
X X hm-core-platform/cluster-prod-ap
X hm-core-platform/cluster-prod-ap - done yesterday
identity/id-infra - not yet installed, cluster still being built
X X ipi/admin
X X ipi/prod
X X ipi/sandbox
X X kong/prod
X X ssdrapim/cluster-prod
X X webpresence/prod

X cws/prod/
X business-services/prod

for i in $(aws route53 list-resource-record-sets --hosted-zone-id $(aws route53 list-hosted-zones-by-name --dns-name='cortex.elsevier.systems' | jq -r '.HostedZones[] | .Id' | sed 's#/hostedzone/##g') --query "ResourceRecordSets[?Type == 'NS']" | jq -r '.[] | .Name'); do curl -I -L dashboard.$i; done

morning

HM incident more diagnostics, long call on next steps
announcement to partners
removal of kyverno in all prod clusters

UPDATE TO PRODUCTION CLUSTERS

As you may be aware we had an incident yesterday that appears to have a number of contributing, or at least coincidental, factors. We continue to investigate both internally and with AWS. In the interim we plan to put the following steps in place to ensure that any repeat will not cause your workloads to be locked down.

Immediately, we will be removing Kyverno from your production clusters. This will ensure no future failures cause it to fail in a closed state, locking services out in the process. THIS MEANS YOU SHOULD NOT MAKE ANY CHANGES TO PRODUCTION CLUSTERS, AS RUNNING THE RECONCILER WILL REPLACE IT. As a result the platform policies will not be enforced, so take extra care when modifying your workloads.
We will modify Kyverno to fail open while we work on ensuring it runs stably. In the event of such a Kyverno failure your workloads will be unimpeded but you will also be without the policy protections as above. This will roll out through the different platform classes (alpha, beta, prod) progressively starting next week.
We will revert a recent upgrade of the aws-efs-csi-driver to the previous version (app version 1.3.8) whilst retaining the grouped rollout strategy, as the previous one-by-one strategy caused problems. This too will roll through the platform classes progressively.

Behind all these steps we are also enhancing monitoring and test suites to close some gaps identified.

I will confirm on this thread once Kyverno is removed from all prod clusters.

2022-11-17

CEIP-2162: EMEALASP prod cluster
Blog idea: Onion architectures make me cry
HM prod incident

2022-11-16

CEIP-2752 refinement
- 3 high level user stories achieved
CEIP-2162: EMEALASP prod cluster
- missed pre-reqs (param store + bootstrap reconcile)
- missed addition of new account (undocumented) to https://github.com/elsevier-centraltechnology/tio-terraformcontrol-ce/blob/master/183742092277/eu-west-1/platform-central/variables.tf
Matteo comment about: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
I’m going to respectfully disagree with the srcco.de post. Yes, liveness probes are dangerous, but the key point of ‘do not depend on external dependencies (like data stores)’, whilst valid, is already built into the Spring check (see docs) and is precisely why I listed the /health/liveness endpoint rather than merely /health. I think it is incorrect to complain that the Redis datasource health check breaks the ‘external’ rule: if I ask for the health of the Redis datasource I presumably want that external check. More importantly, I think the post is simply out of date in regard to readiness checks. I would have completely agreed in a world without startup probes, which the post references as being ‘new’, as they were in 2019: K8s 1.16, which introduced them, came out in September 2019. What I have written, I believe correctly in 2022, reserves readiness probes for directing traffic away from pods that become unready after initial startup.
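
The split argued for here (startup probe for the slow boot, liveness with no external dependencies, readiness only to steer traffic) could be sketched as the probe fragment below. The `/actuator/...` paths assume Spring Boot 2.3+ health groups with the default base path; the port and thresholds are illustrative assumptions, not values from any of our charts.

```shell
#!/bin/sh
# Sketch: print a container probe fragment for a Spring Boot app.
# Port, paths and thresholds are illustrative assumptions.
probe_fragment() {
  port="${1:-8080}"
  cat <<EOF
startupProbe:            # gives a slow-starting JVM time before liveness applies
  httpGet:
    path: /actuator/health/liveness
    port: ${port}
  periodSeconds: 5
  failureThreshold: 30   # up to 150s to start
livenessProbe:           # restart only when the app itself is broken
  httpGet:
    path: /actuator/health/liveness
    port: ${port}
  periodSeconds: 10
readinessProbe:          # take the pod out of service without restarting it
  httpGet:
    path: /actuator/health/readiness
    port: ${port}
  periodSeconds: 10
EOF
}

probe_fragment 8080
```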

ArgoCD expt

encapsulates an ‘application’ and provides full management by UI and CLI inc. install, monitor and delete
- replace terraform, platman and jenkins in one go!
- TODO can we encapsulate application ‘suites’ such as CWS / BTS? (project?)
manage all partner clusters from one platform cluster (not the only option but the one we’d want I think)
permit cortex platform components to be managed more finely, if desired.
- TODO check: Yes we could still have one application that pointed to all upstream components (managed by helm or otherwise) and sync (reconcile) all clusters at once. Or we could do things in smaller chunks.
can manage a specific branch, tag etc. of upstream code via Kustomize (https://youtu.be/Q97O6iLJguk?t=769)
can ‘diff’ what is currently in place with what is coming if we sync (reconcile) https://youtu.be/Q97O6iLJguk?t=999
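
The diff-before-sync idea might look like this from the CLI. A minimal sketch, assuming a recent Argo CD release where `argocd app diff` exits 0 for no diff, 1 when a diff is found and 2 on error; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: preview what a sync (reconcile) would change, and only sync when there is a diff.
# Assumes `argocd app diff` exit codes: 0 = no diff, 1 = diff found, anything else = error.
sync_if_changed() {
  app="$1"
  rc=0
  argocd app diff "$app" || rc=$?
  case "$rc" in
    0) echo "no changes for $app" ;;
    1) echo "changes found for $app, syncing"
       argocd app sync "$app" ;;
    *) echo "diff failed for $app (exit $rc)" >&2
       return "$rc" ;;
  esac
}
```

Usage would be something like `sync_if_changed cortex-kyverno`, with a human reading the diff output before anything is applied if that is the control we want.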

2022-11-15

CEIP-2529
- ArgoCD can manage itself, even manage other clusters
- may be simple case but can install controller into each cluster

2022-11-14

Cortex demo

J: daemon sets rolled out by percentage rather than one by one
LM: tio-helmcharts-core-platform
- gen docs alone instead of go install
- push helm charts to artifactory on merge to main
FG: run thru escalate incident process

Planning

Investigate BTS apps
- some with single replica
- Speed coming up
Feel free to get involved on Platman E2E testing
CEIP-2513 Platform definition validations
- Kyverno policy?
- Volunteer to write one?
Self service troubleshooting
Capabilities: 3 in review plus more to write
NPS, plan in mid-Dec for use in Jan
Multi-tenancy:
- no ELS-wide answer, has to be defined at the cluster ownership level
- tickets existing now are ‘generic docs’ [Irfan] to give a steer on doing it yourself
- first port of call will be business unit who defines policy and implementation
  - ‘our’ role (cortex-team) will be to provide specific expertise such as Skipper advice in pre-prod setting.

2022-11-11

finish Java post, at least complete draft

2022-11-10

https://elsevier.atlassian.net/browse/CEIP-2752: Define release requirements
https://elsevier.atlassian.net/browse/CEIP-2753: Define verification requirements

2022-11-09

CWS experimenting with cx.xlarge and then c.large
- ec2-instance-selector --current-generation false --memory 8 --vcpus 4 --cpu-architecture x86_64 -r eu-west-1
CEIP-2734 - removal of default services Requires: https://elsevier.atlassian.net/wiki/spaces/TIOCE/pages/119483629222295/Kong+APIOps+-+Roadmap#Portal:-First-Deployment
running platman locally: https://github.com/elsevier-centraltechnology/cortex-platform-manager#local-development-gotcha

apollo-camunda-orchestrator

set source to 1.8 and attempt comparison of startup on cluster
- login into aws using saml2aws
- tag image:
  - docker image tag apollo-orchestrator-client:1.2.7.20221109-java17 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.2.7.20221109-java17
  - docker image tag apollo-orchestrator-client:1.2.7.20221109-java8 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java8:1.2.7.20221109-java8
- docker login using aws credentials: aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
- push image:
  - docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.2.7.20221109-java17
  - docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java8:1.2.7.20221109-java8
need to check (bakery? isdp?) on how to get JDK / images approved
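
The tag and push steps above are easy to get out of step by hand, since the push only works if it names exactly the tag that was created. A minimal sketch that derives each image reference once; registry, image names and version come from the steps above, while the function names and `DRY_RUN` switch are made up for illustration.

```shell
#!/bin/sh
# Sketch: derive each image reference once so `docker image tag` and `docker push`
# always use the same string. Commands are only printed unless DRY_RUN=0.
REGISTRY="781632261136.dkr.ecr.eu-west-1.amazonaws.com"
VERSION="1.2.7.20221109"

image_ref() {
  # $1 = java variant, e.g. java8 or java17
  echo "${REGISTRY}/apollo-camunda-orchestrator-$1:${VERSION}-$1"
}

publish_variant() {
  variant="$1"
  ref=$(image_ref "$variant")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    run="echo"   # print the command instead of running it
  else
    run=""
  fi
  $run docker image tag "apollo-orchestrator-client:${VERSION}-${variant}" "$ref"
  $run docker push "$ref"
}

for v in java8 java17; do
  publish_variant "$v"
done
```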

2022-11-08

CWS update

useful to find that doubling CPU from 2 to 4 makes a big difference to reconcile times
recorded some extra details on the runway tracker and test report
loathed: refusal to recognise PlatMan as a product that must be treated as a complete solution not a box of parts.
- https://tioengineering.slack.com/archives/C0309C8J10A/p1667905392481429
- TS: ’Where can I read the bill of materials included in 2) ‘the platform definition / blueprint’ for a given tier at a given point in time?’ was a sincere question. Do I have to go read code at https://github.com/elsevier-centraltechnology/cortex-platform-manager until I trace it to the terraform? MC: okay so it isn’t just CWS who hasn’t read the README here https://github.com/elsevier-centraltechnology/cortex-platform-definitions
reflection (learned): organisation is more comfortable optimising infra than app. How is it that investing ~4 infra + mgmt + product is more attractive than 1 or 2 devs + mgmt / product cover for similar time?

Kong tickets

curl -H "Kong-Admin-Token: xxxxxxxxxxxx" -X POST https://sandbox.kong-nonprod.cortex.elsevier.systems/_api/default/admins/a.nepal@elsevier.com/roles --data 'roles=super-admin'
aws ssm get-parameters --names /tio/jenkins/kong-infra-token --region eu-west-1 --with-decryption | jq -r '.Parameters[] | .Value'

so putting it together, something like this..

curl -H "Kong-Admin-Token: `aws ssm get-parameters --names /tio/jenkins/kong-infra-token --region eu-west-1 --with-decryption | jq -r '.Parameters[] | .Value'`" -X POST https://infra.kong-nonprod.cortex.elsevier.systems/_api/default/admins/a.nepal@elsevier.com/roles --data 'roles=super-admin'
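
A slightly more defensive variant of the combined command, as a sketch only: it fetches the token once, stops if the token comes back empty, and passes the header on stdin so the token stays out of the process list. It assumes curl 7.55.0 or later for `-H @-` and uses the AWS CLI's own `--query` in place of jq; the function names are made up.

```shell
#!/bin/sh
# Sketch: fetch the Kong admin token, fail early if it is empty, then grant the role.
kong_token() {
  aws ssm get-parameter --name /tio/jenkins/kong-infra-token \
    --region eu-west-1 --with-decryption \
    --query 'Parameter.Value' --output text
}

grant_super_admin() {
  host="$1"    # Kong admin host
  admin="$2"   # admin user to promote
  token=$(kong_token) || return 1
  [ -n "$token" ] || { echo "empty Kong admin token" >&2; return 1; }
  # Header goes in on stdin (-H @-) so the token is not visible in `ps`.
  printf 'Kong-Admin-Token: %s\n' "$token" |
    curl -sS -H @- -X POST "https://${host}/_api/default/admins/${admin}/roles" \
      --data 'roles=super-admin'
}
```

Usage: `grant_super_admin infra.kong-nonprod.cortex.elsevier.systems a.nepal@elsevier.com`.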

new relic training

fundamentals: 30mins+14:00-16:30 though many longish interruptions for Slack-warfare
- Objective: Know when and how to use each telemetry data type (Metrics, Events, Logs, Traces).
- ‘When all site or app materials reside in one central location like this, they’re called monolithic applications. And they’re usually built with a workflow called waterfall development.’ !!!
- ‘While cloud hosts are far less complex to operate and scale than physical on-premises servers, maintaining them still takes time.’ !!!
- ‘…monitoring is building your systems to collect data, with the goal of knowing when something goes wrong and starting your response quickly.’ –The Age of Observability
- Monitoring: Plan, Instrument, Observe, Detect and Resolve
- ‘Observability lets you understand why something is wrong, compared to monitoring, which simply tells you when something is wrong.’ –The Age of Observability
- Open instrumentation [1 of 3]: ‘Provide for visibility over the entire surface area of a distributed application.’ - I think this is worded very carefully: true that it ‘provides [the potential] for’ but is it credible that it will actually deliver that? I suspect it just dodges responsibility and actually provides ‘garbage-in-garbage-out’
- Connected entities: claims entity is a tech agnostic term and in that case is no more than common sense. You have to define some identifier to correlate things on. Nothing really inherent to cloud native or observability about that.
  - Some stuff to follow up (maybe)
    - The Age of Observability
    - The 10 Principles of Observability by Buddy Brewer and Alberto Gomez
    - What is Observability? By Alexis Jones
- ‘When you hear the term traces, it’s usually in reference to distributed traces.’ <- a crucial definition of terms, I would have just called them ‘distributed logs’ - trace is just a log threshold to me.
- An observability platform that is open, connected, and programmable:
  - ‘Platforms like New Relic interface directly with a wide variety of open source tools, allowing teams to collect telemetry from their many existing open source solutions in one central location.’- pitching to be the ‘one ring’
  - ‘When issues occur, there’s often a domino effect’ -> alerts need to be grouped and fine-tuned
  - Connected entities: represents itself as ‘automatically’ creating things like context (browser loads web page loads 5 microservices). I’ll wager it does not! Still, it would be good if the instrumentation cost is reasonable.
  - Defines Programmability as Monitoring (site loads slowly) with Observability (why it’s slow) to give business impact. Again, what’s not to like?
  - ‘Alert fatigue’: Applied intelligence provides:
    - proactive detection
    - noise reduction
    - intelligent alert routing
    - enriched incident response

crtxctl <30 mins

complete interface: https://www.alexedwards.net/blog/interfaces-explained

2022-11-07

TODO

java
- deploy camunda orchestrator to test-cluster
- test deploy process
- get camunda ui working (requires embedded directory service)
blogs x3
kong tickets x2
new relic training
new cluster (target weds)

deploy camunda orchestrator to test-cluster

create ECR repo in console
create docker image as normal
tag image: docker image tag apollo-orchestrator-client:1.0-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101
login into aws using saml2aws
docker login using aws credentials: aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
push image: docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101
image is in platform nonprod: https://eu-west-1.console.aws.amazon.com/ecr/repositories/private/781632261136/apollo-camunda-orchestrator-java17?region=eu-west-1
kubectl -n cws-test apply -f helm/apollo-camunda-orchestrator-service/spring-config.yaml
kubectl -n cws-test apply -f helm/apollo-camunda-orchestrator-service/newrelic-config-camunda.yaml
helm uninstall -n cws-test camunda-orchestrator-service
helm install -n cws-test camunda-orchestrator-service ./helm/apollo-camunda-orchestrator-service/ -f helm/apollo-camunda-orchestrator-service/values-core-engineering-test-cluster-alpha.yaml
started!
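
The uninstall-then-install pair above could probably be collapsed into one repeatable command with `helm upgrade --install`. A minimal sketch using the release, chart path and values file from the steps above; the wrapper function and `DRY_RUN` switch are made up for illustration.

```shell
#!/bin/sh
# Sketch: one idempotent command in place of `helm uninstall` followed by `helm install`.
# Prints the command unless DRY_RUN=0.
deploy_release() {
  ns="$1"; release="$2"; chart="$3"; values="$4"
  set -- helm upgrade --install "$release" "$chart" -n "$ns" -f "$values"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

deploy_release cws-test camunda-orchestrator-service \
  ./helm/apollo-camunda-orchestrator-service/ \
  helm/apollo-camunda-orchestrator-service/values-core-engineering-test-cluster-alpha.yaml
```

Note this is not identical to uninstall plus install: an upgrade keeps release history and does not delete existing resources first, so the uninstall route may still be wanted when a clean slate is the point.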

Setting Active Processor Count to 2
Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx4002080K -XX:MaxMetaspaceSize=204511K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 4608M, Thread Count: 250, Loaded Class Count: 33693, Headroom: 0%)
Enabling Java Native Memory Tracking
Adding 127 container CA certificates to JVM truststore
Spring Cloud Bindings Enabled
Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=/layers/paketo-buildpacks_bellsoft-liberica/java-security-properties/java-security.properties -XX:+ExitOnOutOfMemoryError -XX:ActiveProcessorCount=2 -XX:MaxDirectMemorySize=10M -Xmx4002080K -XX:MaxMetaspaceSize=204511K -XX:ReservedCodeCacheSize=240M -Xss1M -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics -Dorg.springframework.cloud.bindings.boot.enable=true


- return `managementPort` to default (same as HTTP port) per advice:
'Exposing management endpoints using the default HTTP port is a sensible choice for cloud based deployments' [Ref](https://docs.spring.io/spring-boot/docs/1.5.4.RELEASE/reference/html/production-ready-monitoring.html#production-ready-customizing-management-server-port)

- cws uat nonprod issue causing pod restarts

Caused by: org.camunda.bpm.engine.impl.javax.el.PropertyNotFoundException: Cannot resolve identifier 'bamArticleNumber'
    at org.camunda.bpm.engine.impl.juel.AstIdentifier.eval(AstIdentifier.java:83)
    at org.camunda.bpm.engine.impl.juel.AstEval.eval(AstEval.java:50)
    at org.camunda.bpm.engine.impl.juel.AstNode.getValue(AstNode.java:26)
    at org.camunda.bpm.engine.impl.juel.TreeValueExpression.getValue(TreeValueExpression.java:114)
    at org.camunda.bpm.engine.impl.dmn.el.ProcessEngineElExpression.getValue(ProcessEngineElExpression.java:47)
    at org.camunda.bpm.dmn.engine.impl.evaluation.ExpressionEvaluationHandler.evaluateElExpression(ExpressionEvaluationHandler.java:121)


## 2022-11-04

### crtxctl

- 0:15 interface for plugins

### T.I.M.

- 1h review & update with Felipe
- 1:15h write up and PR

## 2022-11-03

### `crtxctl`

- 1h45 for basic app and first command module
- 1h add command to read file from GitHub

### 1-2-1

- (A) Gartner quadrant of multi-tenancy
- (A) Team Interaction Model: Escalate Incident
    - OCC (2 entry points)
      - Platform Manager
      - Cortex target cluster incidents
    - Ops identified a couple of examples of each and they go to #cortex-critical and OpsGenie
      - OCC looks at them and checks Grafana
      - Anything else goes to #cortex-warning; acknowledge and open ticket (if needed)
    - Partners: 
      - actionable alerts - already handled
      - OCC 
    - Open questions
      - how ops engage with partner: leverage OCC
      - what if OCC slow? can't be helped
    - What about important but non-production environments: best efforts
- Facilitation
- (A) Request Change
    - informal is good but gets us only so far
    - ticket on their board and expectation of delivery date

### IPI Incident (3h)

### T.I.M.

- Escalate incident process, Request change process, Quadrant.
- 4h but monitoring CWS deployment at the same time.

### CWS upgrade

- Things in flight:

`while true; do printf '=%.0s' $(seq 1 63); echo; date; echo; kubectl get pods --no-headers -n apollo-uat | awk -F ' ' '{print "- " $2 " " $3}' | sort | uniq -c; echo; echo; sleep 15; done`

- old new relic daemon set suspected of holding back other pods (acceleration after it came up).
- first test a failure, second test a big improvement (not sure if we got under the hour target)

## 2022-11-02

### EIP reconcile

- triggered and eventually ran with Kyverno failure
- learned how to query NewRelic for reconciler status: https://onenr.io/0ERPeZaAajW
- things that have to be monitored all at once:
> TS: The AWS console+GitHub code+New Relic+at least 2 GitHub markdown docs+GitHub actions+shell+k9s+2 slack channels is the minimum requirement to do anything right now and that sucks.

> FG: all of the above plus the partner asking questions on zoom
- one tool to rule them all:
- CLI (more compact, easier copy and paste than web)
- python (excellent integration, easy and quick to write)
- thoughts on commands:
  - `crtxctl login <product> <cluster> <platformClass>`: roll-up of `cortex-prod-admrole`, `cortex-kconf`, `set-eks-cli-access.sh`
  - `crtxctl status`: open or get via API something like https://onenr.io/0ERPeZ23gjW
  - `crtxctl logs`: open something like https://onenr.io/07j9b9EDAQO
  - `crtxctl get <product> <cluster> <platformClass>`: display platform definition
  - `crtxctl edit <product> <cluster> <platformClass>`: open platform definition for edit (using $EDITOR)
  - `crtxctl awsconsole <product>`: open AWS console as TIO platform prod _and_ switch role (if this is poss.)
  - `crtxctl reconcile`: invoke reconciler on the latest definition rather like manual workflow in GitHub
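
To get a feel for the command surface before writing the real thing, here is a throwaway shell mock-up. It is a stand-in only (the notes above favour Python for the actual tool), and every branch just reports what it would do; none of the behaviour is implemented.

```shell
#!/bin/sh
# Mock-up of the crtxctl command surface only; nothing here talks to AWS, GitHub or New Relic.
crtxctl() {
  cmd="${1:-}"
  [ $# -gt 0 ] && shift
  case "$cmd" in
    login|get|edit)
      [ $# -eq 3 ] || { echo "usage: crtxctl $cmd <product> <cluster> <platformClass>" >&2; return 2; }
      echo "$cmd: product=$1 cluster=$2 platformClass=$3" ;;
    awsconsole)
      [ $# -eq 1 ] || { echo "usage: crtxctl awsconsole <product>" >&2; return 2; }
      echo "awsconsole: product=$1" ;;
    status|logs|reconcile)
      echo "$cmd: not implemented in this sketch" ;;
    *)
      echo "usage: crtxctl {login|status|logs|get|edit|awsconsole|reconcile} ..." >&2
      return 2 ;;
  esac
}
```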

## 2022-11-01

- Amith
- PR for Camunda change
- blog for Java 17
- Twistlock announcement
- Saw EOF error from POST to Twistlock authenticate
  - think intermittent issue on the Twistlock side
- Many critical things in kube-system!
- Get `apollo-camunda-orchestrator` running in test cluster
- create ECR repository in console
- create docker image as normal
- tag image: `docker image tag apollo-orchestrator-client:1.0-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101`
- log in to AWS using `saml2aws`
- docker login using AWS credentials: `aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com`
- push image: `docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101`
- Complete blog for Java 17
- Migration from `config-service` to ???