2022-11-25
- script to push cluster yaml to PR and thus avoid errors
- build identity clusters
- share AMIs
- NewRelic training
- End of year OKRs review with Felipe
2022-11-24
- go to Oxford office for laptop domain repair
- rebuild id-infra-use cluster
- PRs for all other identity clusters
- 1-2-1 with both Felipe and Irfan
- NewRelic training
2022-11-23
- build cluster id-infra-use and rebuild to correct GitHub actions configuration
- build clusters id-dev-use and id-dev-euw, but failed, pick up in the morning.
- rework onboarding requirements docs
CEIP-2819: Argo spike
Kyverno app
kubectl config set-context --current --namespace=argo-system
argocd app create cortex-kyverno --repo https://github.com/elsevier-centraltechnology/cortex-platform-definitions.git --path platform-definitions/platform-common/alpha/cortex-kyverno --dest-server https://kubernetes.default.svc --dest-namespace argo-system
TODO: repo auth reqd
Camunda spike for release process support?
2022-11-22
- post-mortem for HM incident 2022-11-17
- meetings on releases and argo and C3
- update Team Interaction Model with latest diags:
2022-11-21
slack catchup
deploy restbucks to k8s
- tag image:
docker image tag restbucks:1.0.0.BUILD-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/restbucks:1.0.0.20221118
- docker login using aws credentials:
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
- push image:
docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/restbucks:1.0.0.20221118
helm uninstall -n cws-test restbucks
helm install -n cws-test restbucks ./helm/ -f helm/values.yaml
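The restbucks steps above can be rolled into one small script. This is a minimal sketch, not the script actually used: the function names (`ecr_ref`, `run`, `deploy_restbucks`) and the `DRY_RUN` switch are made up for illustration, and it swaps the uninstall/install pair for `helm upgrade --install`, which is a substitution rather than what was run on the day.

```shell
#!/usr/bin/env bash
# Sketch only: registry, region, namespace and chart path are copied from the notes;
# the function names and DRY_RUN are hypothetical.
REGISTRY="781632261136.dkr.ecr.eu-west-1.amazonaws.com"
REGION="eu-west-1"

# Build the full ECR reference for a repository and tag.
ecr_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Tag a local image, log in to ECR, push, then install or upgrade the chart.
deploy_restbucks() {
  local_tag="$1"    # e.g. 1.0.0.BUILD-SNAPSHOT
  remote_tag="$2"   # e.g. 1.0.0.20221118
  target="$(ecr_ref restbucks "$remote_tag")"

  run docker image tag "restbucks:${local_tag}" "$target" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "# would log in to ${REGISTRY}"
  else
    aws ecr get-login-password --region "$REGION" |
      docker login --username AWS --password-stdin "$REGISTRY" || return 1
  fi
  run docker push "$target" || return 1
  run helm upgrade --install -n cws-test restbucks ./helm/ -f helm/values.yaml
}
```

Running `DRY_RUN=1 deploy_restbucks 1.0.0.BUILD-SNAPSHOT 1.0.0.20221118` prints the commands without touching Docker, AWS or the cluster, which is a cheap way to check the tag before pushing.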
much confusion about what a release is
further
helm uninstall kyverno for prod to leave helm in clean state
- X core-engineering/prod
- X emealasp/prod
- X hm-core-platform/cluster-prod-ap
- X hm-core-platform/cluster-prod
- X identity/id-infra
- X ipi/admin
- X ipi/prod
- X ipi/sandbox
- X kong/prod
- X ssdrapim/cluster-prod
- X webpresence/prod
- X cws/prod/
- X business-services/prod
2022-11-18
pm: uninstall Kyverno
- HM incident more diagnostics, long call on next steps
- announcement to partners
- removal of kyverno in all prod clusters
- X X core-engineering/prod
- X X emealasp/prod
- X X hm-core-platform/cluster-prod-ap
- X hm-core-platform/cluster-prod-ap - done yesterday
- identity/id-infra - not yet installed, cluster still being built
- X X ipi/admin
- X X ipi/prod
- X X ipi/sandbox
- X X kong/prod
- X X ssdrapim/cluster-prod
- X X webpresence/prod
- X cws/prod/
- X business-services/prod
for i in $(aws route53 list-resource-record-sets --hosted-zone-id $(aws route53 list-hosted-zones-by-name --dns-name='cortex.elsevier.systems' | jq -r '.HostedZones[] | .Id' | sed 's#/hostedzone/##g') --query "ResourceRecordSets[?Type == 'NS']" | jq -r '.[] | .Name'); do curl -I -L dashboard.$i; done
morning
- HM incident more diagnostics, long call on next steps
- announcement to partners
- removal of kyverno in all prod clusters
UPDATE TO PRODUCTION CLUSTERS
As you may be aware we had an incident yesterday that appears to have a number of contributing, or at least coincidental, factors. We continue to investigate both internally and with AWS. In the interim we plan to put the following steps in place to ensure that any repeat will not cause your workloads to be locked down.
- Immediately, we will be removing Kyverno from your production clusters. This will ensure no future failures cause it to fail in a closed state, locking services out in the process. THIS MEANS YOU SHOULD NOT MAKE ANY CHANGES TO PRODUCTION CLUSTERS, AS RUNNING THE RECONCILER WILL REPLACE IT. As a result the platform policies will not be enforced, so take extra care when modifying your workloads.
- We will modify Kyverno to fail open while we work on ensuring it runs stably. In the event of such a Kyverno failure your workloads will be unimpeded but you will also be without the policy protections as above. This will roll out through the different platform classes (alpha, beta, prod) progressively starting next week.
- We will revert a recent upgrade of the aws-efs-csi-driver to previous version (app version 1.3.8) whilst retaining the grouped rollout strategy as the previous one by one strategy caused problems. This too will roll through the platform classes progressively.
Behind all these steps we are also enhancing monitoring and test suites to close some gaps identified.
I will confirm on this thread once Kyverno is removed from all prod clusters.
2022-11-17
CEIP-2162: EMEALASP prod cluster
Blog idea: Onion architectures make me cry
HM prod incident
2022-11-16
CEIP-2752 refinement
- 3 high level user stories achieved
CEIP-2162: EMEALASP prod cluster
- missed pre-reqs (param store + bootstrap reconcile)
- missed addition of new account (undocumented) to https://github.com/elsevier-centraltechnology/tio-terraformcontrol-ce/blob/master/183742092277/eu-west-1/platform-central/variables.tf
Matteo comment about: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
I’m going to respectfully disagree with the srcco.de post. Yes, liveness probes are dangerous, but the key point of ‘do not depend on external dependencies (like data stores)’, whilst valid, is already built into the Spring check (see docs) and is precisely why I listed the /health/liveness endpoint rather than merely /health. I think it is incorrect to complain that the Redis datasource health check breaks the ‘external’ rule: if I ask for the health of the Redis datasource I presumably want that external check. More importantly, I think the post is simply out of date in regard to readiness checks. I would have completely agreed in a world without startup probes, which the post references as being ‘new’, as they were in 2019 (K8S 1.16 came out in Sept 2019). What I have written, I believe correctly in 2022, reserves readiness probes for directing traffic away from pods that become unready after initial startup.
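To make that split of responsibilities concrete, here is a rough sketch of the three probes side by side, printed by a small shell function so the port can be varied. The function name `probe_fragment`, the `/actuator/health/...` paths (Spring Boot's default actuator base path), port 8080 and all thresholds are assumptions for illustration, not values taken from any real deployment.

```shell
# Sketch only: prints a container-level probe fragment.
# Paths, port and thresholds are assumptions, not real deployment values.
probe_fragment() {
  port="${1:-8080}"
  cat <<EOF
startupProbe:            # covers slow starts; liveness and readiness wait until it succeeds
  httpGet:
    path: /actuator/health/liveness
    port: ${port}
  periodSeconds: 5
  failureThreshold: 60   # up to 5s x 60 = 300s allowed for startup
livenessProbe:           # restart only when the app itself is broken; no external dependencies
  httpGet:
    path: /actuator/health/liveness
    port: ${port}
  periodSeconds: 10
readinessProbe:          # after startup, steers traffic away from pods that are temporarily unready
  httpGet:
    path: /actuator/health/readiness
    port: ${port}
  periodSeconds: 10
EOF
}
```

`probe_fragment 8080` prints the fragment for pasting under a container spec. The point of the layout is that slow startup is absorbed by the startup probe alone, so the liveness probe can stay strict without risking restart loops.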
ArgoCD expt
encapsulates an ‘application’ and provides full management by UI and CLI inc. install, monitor and delete
- replace terraform, platman and jenkins in one go!
- TODO can we encapsulate application ‘suites’ such as CWS / BTS? (project?)
manage all partner clusters from one platform cluster (not the only option but the one we’d want I think)
permit cortex platform components to be managed more finely, if desired.
- TODO check: Yes we could still have one application that pointed to all upstream components (managed by helm or otherwise) and sync (reconcile) all clusters at once. Or we could do things in smaller chunks.
can manage specific branch, tag etc of upstream code via Kustomize (https://youtu.be/Q97O6iLJguk?t=769)
can ‘diff’ what is currently in place with what is coming if we sync (reconcile) https://youtu.be/Q97O6iLJguk?t=999
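The pin-a-revision and diff-before-sync ideas above might look roughly like this on the command line. It is a sketch under assumptions: the app name, repo and path are borrowed from the cortex-kyverno experiment elsewhere in these notes, the function names and `DRY_RUN` switch are hypothetical, and it uses Argo CD's own `--revision` flag rather than the Kustomize route mentioned in the video.

```shell
# Sketch only: app name, repo and path come from the cortex-kyverno experiment in these notes;
# the function names and the DRY_RUN switch are hypothetical.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Register the application pinned to a branch or tag of the upstream repo.
argo_create() {
  revision="$1"   # a branch or tag name
  run argocd app create cortex-kyverno \
    --repo https://github.com/elsevier-centraltechnology/cortex-platform-definitions.git \
    --path platform-definitions/platform-common/alpha/cortex-kyverno \
    --revision "$revision" \
    --dest-server https://kubernetes.default.svc \
    --dest-namespace argo-system
}

# Show what a sync would change, then sync only when --apply is given.
# Note: argocd app diff exits non-zero when it finds differences, so its status is not checked.
argo_preview_and_sync() {
  app="$1"
  run argocd app diff "$app"
  if [ "${2:-}" = "--apply" ]; then
    run argocd app sync "$app"
  fi
  return 0
}
```

`DRY_RUN=1 argo_preview_and_sync cortex-kyverno --apply` prints the diff and sync commands in order without contacting Argo CD.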
2022-11-15
- CEIP-2529
- ArgoCD can manage itself, even manage other clusters
- may be simple case but can install controller into each cluster
2022-11-14
Cortex demo
- J: daemon sets rolled out by percentage rather than one by one
- LM: tio-helmcharts-core-platform
- gen docs alone instead of go install
- push helm charts to artifactory on merge to main
- FG: run thru escalate incident process
Planning
- Investigate BTS apps
- some with single replica
- Speed coming up
- Feel free to get involved on Platman E2E testing
- CEIP-2513 Platform definition validations
- Kyverno policy?
- Volunteer to write one?
- Self service troubleshooting
- Capabilities: 3 in review plus more to write
- NPS, plan in mid-Dec for use in Jan
- Multi-tenancy:
- no ELS-wide answer, has to be defined at the cluster ownership level
- tickets existing now are ‘generic docs’ [Irfan] to give a steer on doing it yourself
- first point of call will be business unit who defines policy and implementation
- ‘our’ role (cortex-team) will be to provide specific expertise such as Skipper advice in pre-prod setting.
2022-11-11
- finish Java post, at least complete draft
2022-11-10
- https://elsevier.atlassian.net/browse/CEIP-2752: Define release requirements
- https://elsevier.atlassian.net/browse/CEIP-2753: Define verification requirements
2022-11-09
- CWS experimenting with cx.xlarge and then c.large
ec2-instance-selector --current-generation false --memory 8 --vcpus 4 --cpu-architecture x86_64 -r eu-west-1
- CEIP-2734 - removal of default services Requires: https://elsevier.atlassian.net/wiki/spaces/TIOCE/pages/119483629222295/Kong+APIOps+-+Roadmap#Portal:-First-Deployment
- running platman locally: https://github.com/elsevier-centraltechnology/cortex-platform-manager#local-development-gotcha
apollo-camunda-orchestrator
set source to 1.8 and attempt comparison of startup on cluster
- log in to AWS using saml2aws
- tag image:
docker image tag apollo-orchestrator-client:1.2.7.20221109-java17 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.2.7.20221109-java17
docker image tag apollo-orchestrator-client:1.2.7.20221109-java8 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java8:1.2.7.20221109-java8
- docker login using aws credentials:
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
- push image:
docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.2.7.20221109-java17
docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java8:1.2.7.20221109-java8
need to check (bakery? isdp?) on how to get JDK / images approved
2022-11-08
CWS update
- useful to find that doubling CPU from 2 to 4 makes a big difference to reconcile times
- recorded some extra details on the runway tracker and test report
- loathed: refusal to recognise PlatMan as a product that must be treated as a complete solution not a box of parts.
- https://tioengineering.slack.com/archives/C0309C8J10A/p1667905392481429
- TS: ‘Where can I read the bill of materials included in 2) ‘the platform definition / blueprint’ for a given tier at a given point in time?’ was a sincere question. Do I have to go read code at https://github.com/elsevier-centraltechnology/cortex-platform-manager until I trace it to the terraform? MC: okay so it isn’t just CWS who hasn’t read the README here https://github.com/elsevier-centraltechnology/cortex-platform-definitions
- reflection (learned): organisation is more comfortable optimising infra than app. How is it that investing ~4 infra + mgmt + product is more attractive than 1 or 2 devs + mgmt / product cover for similar time?
Kong tickets
curl -H "Kong-Admin-Token: xxxxxxxxxxxx" -X POST https://sandbox.kong-nonprod.cortex.elsevier.systems/_api/default/admins/a.nepal@elsevier.com/roles --data 'roles=super-admin'
aws ssm get-parameters --names /tio/jenkins/kong-infra-token --region eu-west-1 --with-decryption | jq -r '.Parameters[] | .Value'
so putting it together, something like this..
curl -H "Kong-Admin-Token: `aws ssm get-parameters --names /tio/jenkins/kong-infra-token --region eu-west-1 --with-decryption | jq -r '.Parameters[] | .Value'`" -X POST https://infra.kong-nonprod.cortex.elsevier.systems/_api/default/admins/a.nepal@elsevier.com/roles --data 'roles=super-admin'
new relic training
- fundamentals: 30mins+14:00-16:30 though many longish interruptions for Slack-warfare
- Objective: Know when and how to use each telemetry data type (Metrics, Events, Logs, Traces).
- ‘When all site or app materials reside in one central location like this, they’re called monolithic applications. And they’re usually built with a workflow called waterfall development.’ !!!
- ‘While cloud hosts are far less complex to operate and scale than physical on-premises servers, maintaining them still takes time.’ !!!
- ‘…monitoring is building your systems to collect data, with the goal of knowing when something goes wrong and starting your response quickly.’ –The Age of Observability
- Monitoring: Plan, Instrument, Observe, Detect and Resolve
- ‘Observability lets you understand why something is wrong, compared to monitoring, which simply tells you when something is wrong.’ –The Age of Observability
- Open instrumentation [1 of 3]: ‘Provide for visibility over the entire surface area of a distributed application.’ - I think this is worded very carefully: true that it ‘provides [the potential] for’ but is it credible that it will actually deliver that? I suspect it just dodges responsibility and actually provides ‘garbage-in-garbage-out’
- Connected entities: claims entity is a tech agnostic term and in that case is no more than common sense. You have to define some identifier to correlate things on. Nothing really inherent to cloud native or observability about that.
- Some stuff to follow up (maybe)
- The Age of Observability
- The 10 Principles of Observability by Buddy Brewer and Alberto Gomez
- What is Observability? By Alexis Jones
- ‘When you hear the term traces, it’s usually in reference to distributed traces.’ <- a crucial definition of terms, I would have just called ‘distributed logs’ - trace is just a log threshold to me.
- An observability platform that is open, connected, and programmable:
- ‘Platforms like New Relic interface directly with a wide variety of open source tools, allowing teams to collect telemetry from their many existing open source solutions in one central location.’ - pitching to be the ‘one ring’
- ‘When issues occur, there’s often a domino effect’ -> alerts need to be grouped and fine-tuned
- Connected entities: represents itself as ‘automatically’ creating things like context (browser loads web page loads 5 microservices). I’ll wager it does not! Still, it would be good if instrumentation cost is reasonable.
- Defines Programmability as Monitoring (site loads slowly) with Observability (why it’s slow) to give business impact. Again, what’s not to like?
- ‘Alert fatigue’: Applied intelligence provides:
- proactive detection
- noise reduction
- intelligent alert routing
- enriched incident response
crtxctl <30 mins
- complete interface: https://www.alexedwards.net/blog/interfaces-explained
2022-11-07
TODO
- java
- deploy camunda orchestrator to test-cluster
- test deploy process
- get camunda ui working (requires embedded directory service)
- blogs x3
- kong tickets x2
- new relic training
- new cluster (target weds)
deploy camunda orchestrator to test-cluster
create ECR repo in console
create docker image as normal
tag image: docker image tag apollo-orchestrator-client:1.0-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101
log in to AWS using saml2aws
docker login using aws credentials: aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com
push image: docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101
image is in platform nonprod: https://eu-west-1.console.aws.amazon.com/ecr/repositories/private/781632261136/apollo-camunda-orchestrator-java17?region=eu-west-1
kubectl -n cws-test apply -f helm/apollo-camunda-orchestrator-service/spring-config.yaml
kubectl -n cws-test apply -f helm/apollo-camunda-orchestrator-service/newrelic-config-camunda.yaml
helm uninstall -n cws-test camunda-orchestrator-service
helm install -n cws-test camunda-orchestrator-service ./helm/apollo-camunda-orchestrator-service/ -f helm/apollo-camunda-orchestrator-service/values-core-engineering-test-cluster-alpha.yaml
started!
Setting Active Processor Count to 2
Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx4002080K -XX:MaxMetaspaceSize=204511K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 4608M, Thread Count: 250, Loaded Class Count: 33693, Headroom: 0%)
Enabling Java Native Memory Tracking
Adding 127 container CA certificates to JVM truststore
Spring Cloud Bindings Enabled
Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=/layers/paketo-buildpacks_bellsoft-liberica/java-security-properties/java-security.properties -XX:+ExitOnOutOfMemoryError -XX:ActiveProcessorCount=2 -XX:MaxDirectMemorySize=10M -Xmx4002080K -XX:MaxMetaspaceSize=204511K -XX:ReservedCodeCacheSize=240M -Xss1M -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics -Dorg.springframework.cloud.bindings.boot.enable=true
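The heap figure in that startup log can be sanity-checked with a little arithmetic. The sketch below assumes the Paketo memory calculator works as its documentation describes (heap = total memory, minus headroom, minus direct memory, metaspace, reserved code cache and thread stacks); that formula is an assumption here, not something these notes confirm. All inputs are copied from the log line.

```shell
# Sketch only: reproduces the heap figure from the startup log (all values in KiB).
total_k=$((4608 * 1024))          # Total Memory: 4608M
direct_k=$((10 * 1024))           # -XX:MaxDirectMemorySize=10M
metaspace_k=204511                # -XX:MaxMetaspaceSize=204511K
code_cache_k=$((240 * 1024))      # -XX:ReservedCodeCacheSize=240M
stacks_k=$((250 * 1 * 1024))      # Thread Count 250 x -Xss1M
headroom_k=0                      # Headroom: 0%

non_heap_k=$((direct_k + metaspace_k + code_cache_k + stacks_k))
heap_k=$((total_k - headroom_k - non_heap_k))
echo "non-heap=${non_heap_k}K heap=${heap_k}K"
```

This gives a heap of 4002081K, within 1K of the logged `-Xmx4002080K`; the small gap is presumably rounding inside the calculator. The practical reading is that roughly 700M of the 4608M limit goes to non-heap regions, most of it thread stacks and code cache.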
- return `managementPort` to default (same as HTTP port) per advice:
'Exposing management endpoints using the default HTTP port is a sensible choice for cloud based deployments' [Ref](https://docs.spring.io/spring-boot/docs/1.5.4.RELEASE/reference/html/production-ready-monitoring.html#production-ready-customizing-management-server-port)
- cws uat nonprod issue causing pod restarts
Caused by: org.camunda.bpm.engine.impl.javax.el.PropertyNotFoundException: Cannot resolve identifier 'bamArticleNumber'
  at org.camunda.bpm.engine.impl.juel.AstIdentifier.eval(AstIdentifier.java:83)
  at org.camunda.bpm.engine.impl.juel.AstEval.eval(AstEval.java:50)
  at org.camunda.bpm.engine.impl.juel.AstNode.getValue(AstNode.java:26)
  at org.camunda.bpm.engine.impl.juel.TreeValueExpression.getValue(TreeValueExpression.java:114)
  at org.camunda.bpm.engine.impl.dmn.el.ProcessEngineElExpression.getValue(ProcessEngineElExpression.java:47)
  at org.camunda.bpm.dmn.engine.impl.evaluation.ExpressionEvaluationHandler.evaluateElExpression(ExpressionEvaluationHandler.java:121)
## 2022-11-04
### crtxctl
- 0:15 interface for plugins
### T.I.M.
- 1h review & update with Felipe
- 1:15h write up and PR
## 2022-11-03
### `crtxctl`
- 1h45 for basic app and first command module
- 1h add command to read file from GitHub
### 1-2-1
- (A) Gartner quadrant of multi-tenancy
- (A) Team Interaction Model: Escalate Incident
- OCC (2 entry points)
- Platform Manager
- Cortex target cluster incidents
- Ops identified a couple of examples of each and they go to #cortex-critical and OpsGenie
- OCC looks at them and checks Grafana
- Anything else goes to #cortex-warning; acknowledge and open ticket (if needed)
- Partners:
- actionable alerts - already handled
- OCC
- Open questions
- how ops engage with partner: leverage OCC
- what if OCC slow? can't be helped
- What about important but non-production environments: best efforts
- Facilitation
- (A) Request Change
- informal is good but gets us only so far
- ticket on their board and expectation of delivery date
### IPI Incident (3h)
### T.I.M.
- Escalate incident process, Request change process, Quadrant.
- 4h but monitoring CWS deployment at the same time.
### CWS upgrade
- Things in flight:
while true; do printf "=%.0s" $(seq 1 63); echo; date; echo; kubectl get pods --no-headers -n apollo-uat | awk -F ' ' '{print "- " $2 " " $3}' | sort | uniq -c; echo; echo; sleep 15; done
- old new relic daemon set suspected of holding back other pods (acceleration after it came up).
- first test a failure, second test a big improvement (not sure if we got under the hour target)
## 2022-11-02
### EIP reconcile
- triggered and eventually ran with Kyverno failure
- learned how to query NewRelic for reconciler status: https://onenr.io/0ERPeZaAajW
- things that have to be monitored all at once:
> TS: The AWS console+GitHub code+New Relic+at least 2 GitHub markdown docs+GitHub actions+shell+k9s+2 slack channels is the minimum requirement to do anything right now and that sucks.
> FG: all of the above plus the partner asking questions on zoom
- one tool to rule them all:
- CLI (more compact, easier copy and paste than web)
- python (excellent integration, easy and quick to write)
- thoughts on commands:
- `crtxctl login <product> <cluster> <platformClass>`: roll-up of `cortex-prod-admrole`, `cortex-kconf`, `set-eks-cli-access.sh`
- `crtxctl status`: open or get via API something like https://onenr.io/0ERPeZ23gjW
- `crtxctl logs`: open something like https://onenr.io/07j9b9EDAQO
- `crtxctl get <product> <cluster> <platformClass>`: display platform definition
- `crtxctl edit <product> <cluster> <platformClass>`: open platform definition for edit (using $EDITOR)
- `crtxctl awsconsole <product>`: open AWS console as TIO platform prod _and_ switch role (if this is poss.)
- `crtxctl reconcile`: invoke reconciler on the latest definition rather like manual workflow in GitHub
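As a way of pressure-testing the command list, here is a rough dispatcher sketch. It is written in shell only to match the other commands in these notes (the thinking above leans towards Python for the real thing), and every handler is a placeholder that reports what it would do; none of this is the actual tool.

```shell
# Sketch only: a command dispatcher for the crtxctl ideas above.
# Every handler is a placeholder; argument checks mirror the command list.
crtxctl() {
  cmd="${1:-}"
  if [ $# -gt 0 ]; then
    shift
  fi
  case "$cmd" in
    login|get|edit)
      # These three take <product> <cluster> <platformClass>.
      if [ $# -ne 3 ]; then
        echo "usage: crtxctl $cmd <product> <cluster> <platformClass>" >&2
        return 2
      fi
      echo "$cmd: product=$1 cluster=$2 platformClass=$3"
      ;;
    awsconsole)
      if [ $# -ne 1 ]; then
        echo "usage: crtxctl awsconsole <product>" >&2
        return 2
      fi
      echo "awsconsole: product=$1"
      ;;
    status|logs|reconcile)
      echo "$cmd: not implemented in this sketch"
      ;;
    *)
      echo "usage: crtxctl {login|status|logs|get|edit|awsconsole|reconcile} ..." >&2
      return 2
      ;;
  esac
}
```

Grouping `login`, `get` and `edit` under one argument check reflects that they share the same `<product> <cluster> <platformClass>` triple, which suggests the triple could become a single "target" concept in the real CLI.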
## 2022-11-01
- Amith
- PR for Camunda change
- blog for Java 17
- Twistlock announcement
- Saw EOF error from POST to Twistlock authenticate
- think intermittent issue on the Twistlock side
- Many critical things in kube-system!
- Get `apollo-camunda-orchestrator` running in test cluster
- create ECR repository in console
- create docker image as normal
- tag image: `docker image tag apollo-orchestrator-client:1.0-SNAPSHOT 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101`
- log in to AWS using saml2aws
- docker login using AWS credentials: `aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 781632261136.dkr.ecr.eu-west-1.amazonaws.com`
- push image: `docker push 781632261136.dkr.ecr.eu-west-1.amazonaws.com/apollo-camunda-orchestrator-java17:1.0-20221101`
- Complete blog for Java 17
- Migration from `config-service` to ???