2022-07-26
- catchup
- CEIP-2085
- CEIP-2052: Review doc for Felipe
- CEIP-2091 (Thomas Tran)
- Wiki page edited by R Lange
- mtg plan
- Review: https://github.com/elsevier-centraltechnology/core-cicd-reference/pull/6/
- is the terraform control repository an Elsevier idea or broader?
- this is one of the 3 repositories I was going to document in a blog post
- Planning
- Dev portal: tickets to be agreed w Felipe
- Jenkins: Shim w SQS or whatever (liaise w Felipe)
2022-07-25: Vacation
2022-07-22
- problematic comms -4h
- reviewing / collating jenkins issues
2022-07-21
- pain points for Jenkins / API Ops
- https://one.newrelic.com/infra/services?account=3412598&begin=1657704331190&end=1657707931190&state=14483165-562f-b091-0bf4-88ff56579113
- TechDesk
- access to logs
- order of jobs
2022-07-20
- CEIP-1865: blog
- For retro:
- Loathe: Confluence. Many pages are stubs with no real content; others (e.g. Kong) have deep hierarchies of endless detail with no progression between them.
- CEIP-1500: monitoring
- FG: diff scenarios
- target clusters
- platform components on target clusters
- reconciler
- platman cluster (not on cortex to avoid chicken and egg)
- IP:
- focus on partner clusters
- eg lots of restarts of kube-system pods such as calico
- spot when partner pods are behaving strangely
- currently flying blind between reconciler runs
- at the moment have many ‘drop rules’ to avoid breaching quota
- eg if something supposed to be daemon set but not running as such => flag error
- configured in tio-terraformcontrol-ce/183…/region/platform-central/newrelic-drop….
- platman cluster could potentially run on cortex (future)
- filter at source vs New Relic
- filter at source offers dramatic saving at egress level
- final drop rule (all Metric not on platform cluster) made most diff
- reduce sample rate also offers potential (at cost of slightly slower alert)
- NR allows consumption of Prometheus metrics without a Prometheus server
- Metric is created by nr-prom bridge: => largest volume
- kinesis report can provide when nearing quota:
- something like quota blown on single day not month or on sliding window (std deviation)
- look first at ’native’ not ‘prometheus’
- eg K8sVolumeSample or K8sDeploymentSample
- start from platman/components.yaml and see what may be checked (eg expect replica 2, do we have?)
- look at data already in NR and question whether needed (drill down by volume)
- lots of duplication between native container and K8s specific data
- https://elsevier.atlassian.net/wiki/spaces/SRE/pages/119600926471772/Target+Logs+migration+to+New+Relic
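The sliding-window idea noted above for the quota report (flag a single day that blows the budget rather than waiting for the monthly total) could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the 7-day window, the 2-sigma threshold and the sample figures are made up, and the daily ingest numbers would have to come from whatever the Kinesis report actually exposes.

```python
from statistics import mean, stdev


def flag_ingest_spikes(daily_gb, window=7, k=2.0):
    """Return the indices of days whose ingest is unusually high.

    A day is flagged when its volume exceeds the mean of the preceding
    `window` days by more than `k` sample standard deviations.
    All names and defaults here are hypothetical.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a std deviation")
    flagged = []
    for i in range(window, len(daily_gb)):
        history = daily_gb[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if daily_gb[i] > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up daily ingest volumes in GB; day 7 is the spike.
    sample = [10, 11, 9, 10, 12, 10, 11, 30, 10]
    print(flag_ingest_spikes(sample))
```

The first `window` days are never flagged because there is no history to compare against, so a real report would need some warm-up data.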
2022-07-19
- CEIP-1877: incorporate NPS survey feedback
- Various reviews
- CEIP-2018: Python & JS with Khush
2022-07-18
- Bump various dependabot libs and figure out test strategy
- Jenkins familiarisation, learning to run orchestration on a branch / PR
2022-07-15
- set-cluster-access
- log out existing aws
- need ad group and account access
- X make sure yq installed
- make it work with bash
- write bashrc.includes like zsh one
- CEIP-1701:
- suggestion from FG that we need every step of Platman provisioning published to NR
- get access into the cluster and trace each thing as it happens?
- CEIP-2019
- adopt new module for ourselves, talk to Jack
2022-07-14
- Kong call
- Single subdomain of dev.elsevier.com per BU
- Kong looking to deprecate workspaces
- 1-2-1:
- Dev plan!
- Add attend product roadmap for first dev goal
- Reach out to DBS, RDP?
- CEIP-2019:
- CEIP-1865: write up resources
- review PR
- CEIP-2016:
- fix bootstrap for remaining accounts
- return to 1701
- notify RDP team
- CEIP-1701:
- one more failure down the line: now kyverno
- CEIP-1877: NPS
2022-07-13
- CEIP-1877:
- Cortex NPS pydantic and pytest
- CEIP-2018:
- TIP:
awk -v cmd='openssl x509 -subject -issuer -dates -noout' '/BEGIN/{close(cmd)};{print | cmd}' < infra_dp
- guide KA
- CEIP-2016:
- CEIP-2019:
- Why have we set these resource limits?
- Kong docs talk about problems arising from constraints of either CPU or RAM but
- RAM is proportionately more important since it provides not only working memory but also db cache
- Therefore recommend provisioning twice as much RAM (in GB) as CPU (in vCPU)
- To decide the absolute numbers we reviewed the Kong control plane under zero load (https://one.newrelic.com/nr1-core/kubernetes-cluster-explorer/k8s-cluster-explorer/MzQxMjU5OHxJTkZSQXxOQXw1NjMxNzMxODY4NjM2NTA5MzU3?account=3412598&duration=300000&filters=%28domain%20%3D%20%27INFRA%27%20AND%20type%20%3D%20%27KUBERNETESCLUSTER%27%29&state=0c5986cd-2b24-b44f-6a44-79375fdd2285) and found it to be using negligible CPU and almost 600MB of RAM. Therefore we have provisioned … which we feel is the bare minimum; monitor and adjust upwards as needed.
- Similar results for data plane: https://one.newrelic.com/nr1-core/kubernetes-cluster-explorer/k8s-cluster-explorer/MzQxMjU5OHxJTkZSQXxOQXw1NjMxNzMxODY4NjM2NTA5MzU3?account=3412598&duration=1800000&filters=%28domain%20%3D%20%27INFRA%27%20AND%20type%20%3D%20%27KUBERNETESCLUSTER%27%29&state=f3da78fe-b1cc-28a8-49a5-d672969ce8b2
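For reference, the awk TIP earlier in this entry (pipe each certificate of a bundle through `openssl x509` separately) can be sketched in Python as below. This is an equivalent under assumptions, not what was actually run: the function names are made up, and it assumes the bundle is PEM text with standard `BEGIN CERTIFICATE` / `END CERTIFICATE` markers and that `openssl` is on the PATH.

```python
import re
import subprocess

# Matches one PEM certificate block, markers included.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text):
    """Return every certificate block found in a PEM bundle, in order."""
    return PEM_CERT.findall(text)


def describe_cert(pem):
    """Run the same openssl command as the awk tip on a single certificate."""
    result = subprocess.run(
        ["openssl", "x509", "-subject", "-issuer", "-dates", "-noout"],
        input=pem,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python certs.py < infra_dp
    for cert in split_pem_bundle(sys.stdin.read()):
        print(describe_cert(cert))
```

Unlike the awk version, this drops any text between certificates before calling openssl, which should not change the output since `openssl x509` only reads the certificate itself.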
2022-07-12
- Dev10: Forms (MS or Backstage)
2022-07-11
CEIP-2016:
- BTS core bootstrap repo / bootstrap.tf: add extra module ’nr_xxx’; if nr_enabled, include the terraform, not the terraform legacy
- download all logs from bts jenkins
- don't worry about dupes, grep those using legacy.bootstrap
- go to RT jenkins and see if patch branch detects drift
- not all accounts touched, just ‘canary’ accounts
- then go
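The log-trawling step above (download the Jenkins logs, ignore duplicates, find the ones using legacy.bootstrap) might be scripted along these lines. This is a sketch only: the function name and the directory layout are assumptions, and it presumes the logs have already been downloaded as plain-text files into one local folder.

```python
from pathlib import Path


def logs_mentioning(log_dir, needle="legacy.bootstrap"):
    """Return the sorted relative paths of log files containing `needle`.

    Each file is reported once, however many times the string appears,
    which takes care of the 'don't worry about dupes' part.
    """
    root = Path(log_dir)
    hits = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Jenkins console logs can contain odd bytes; replace rather than fail.
        text = path.read_text(errors="replace")
        if needle in text:
            hits.add(path.relative_to(root).as_posix())
    return sorted(hits)


if __name__ == "__main__":
    # "bts-jenkins-logs" is a hypothetical local folder name.
    for name in logs_mentioning("bts-jenkins-logs"):
        print(name)
```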
Retro
- Testing
- pytest
- TM: terratest is hard (go and long running)
- what about static analysis
CEIP-2019:
- recap decision from Fri, TM has implemented, review
CEIP-2018: debug w Khushnood
release mtg:
- No plat release since May (except one patch)
- IP pushing for vendor upgrades
- Plat calc estimates pods supportable and how many are system vs user
- FG clarify
2022-07-08
- CEIP-2019:
- confirm w team to provide and document explicit inputs
- d3
- platform calculator (backstage)
- getting started w backstage
- kt on auth
2022-07-07
- Dev duty, w KA
- scheduled d3 timeline for tomorrow
- 1:1
- CEIP-2019: Kong data plane
2022-07-06
- Backstage getting started
- CEIP-1835: Summarised 2 actions as a result of investigation and got agreement to revisit in Aug
- CEIP-1701:
- Felipe identified that nr_* do exist in SSM (Jenkins log)
- Yet cluster creation cronjob still fails
"clusterName":"sandbox","awsAccount":"xxxxxxxxxxxx","error":"Unable to link AWS account"
- Searched GitHub for this error string and found this
- So concluded reconcileAWSProviderLink is failing
- Looked at the possible failures in that code
- Checked CloudHealth
- Via a bit of a leap (because the RDP account is old) we deduced that ’legacy’ bootstrap ran. That does not include the newrelic link.
- Searched again in Github, this time in the RT bootstrap repo, and there is no reference to ‘relic’
- Searching in BTS bootstrap finds potential places to fix
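To spot this kind of failure faster next time, the cronjob output could be scanned for error records and grouped by cluster, roughly as sketched below. It assumes, without confirmation, that each log record is a single-line JSON object carrying the `clusterName` and `error` fields seen in the fragment above; the function name is made up.

```python
import json
from collections import defaultdict


def errors_by_cluster(lines):
    """Group 'error' messages from JSON log lines by their 'clusterName'.

    Lines that are not JSON objects, or that carry no 'error' field,
    are skipped silently.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        error = record.get("error")
        if error:
            grouped[record.get("clusterName", "unknown")].append(error)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        '{"clusterName":"sandbox","awsAccount":"xxxxxxxxxxxx",'
        '"error":"Unable to link AWS account"}',
        "plain text line from the job wrapper",
    ]
    print(errors_by_cluster(sample))
```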
2022-07-05
- Chat w Khushnood on GitOps, Kong, API gateway
- CEIP-1701:
- up-kconf core-elsevier-platform
- trigger cluster again
- fail: notes added to JIRA
- CWS
- prep
brew install maven
brew tap adoptopenjdk/openjdk
brew install --cask adoptopenjdk8
brew install --cask eclipse-java
- compile
mvn clean install
produces
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.0:compile (default-compile) on project apollo-orchestrator-util: Fatal error compiling: java.lang.ExceptionInInitializerError: Unable to make field private com.sun.tools.javac.processing.JavacProcessingEnvironment$DiscoveredProcessors com.sun.tools.javac.processing.JavacProcessingEnvironment.discoveredProcs accessible: module jdk.compiler does not "opens com.sun.tools.javac.processing" to unnamed module @31d26296 -> [Help 1]
implying use of a module-aware JDK (v9 or later).
mvn -v
reports
Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: /usr/local/Cellar/maven/3.8.6/libexec
Java version: 18.0.1.1, vendor: Homebrew, runtime: /usr/local/Cellar/openjdk/18.0.1.1/libexec/openjdk.jdk/Contents/Home
so need to set JAVA_HOME:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/
couple of test failures, skip those and move on
- run
java -jar apollo-orchestrator-client/target/apollo-orchestrator-client.jar
failed to write to /opt/logs/camunda.log (doesn’t exist on mac) so I took the opportunity to replace log4j2.xml with application.properties that can be overridden by environment variables. That included removing the classpath reference in bootstrap.yaml (bootstrap.yaml seems to be the preferred way to override at runtime when using Spring Cloud).
java.lang.IllegalArgumentException: Could not resolve placeholder 'ldap.authentication.url' in value "${ldap.authentication.url}"
This is an example of the dependency on the config service.
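A pre-flight check for the JDK mismatch above could parse the `mvn -v` output and warn when Maven is running on a module-aware JDK (9 or later) before attempting the build. A minimal sketch follows; the function name is made up, and it assumes the `Java version:` line keeps the format shown in these notes.

```python
import re


def java_major_version(mvn_output):
    """Extract the Java major version from `mvn -v` output, or None.

    Handles both the legacy scheme (1.8.0_292 means Java 8) and the
    modern scheme (18.0.1.1 means Java 18).
    """
    match = re.search(r"^Java version:\s*([\d._]+)", mvn_output, re.MULTILINE)
    if match is None:
        return None
    parts = match.group(1).split(".")
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    return int(parts[0])


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        ["mvn", "-v"], capture_output=True, text=True, check=True
    ).stdout
    major = java_major_version(output)
    if major is not None and major >= 9:
        print(f"Maven is using Java {major}; point JAVA_HOME at a JDK 8 install first")
```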
2022-07-04
- CEIP-1930: Update bakery diags and submit PR
- Finished!
- CEIP-1911: Review comments and update
- Finished!
- Kong
- conversation w DBS, Terry, Felipe led to desire to push Kong for more robust solution.
- CWS
- conversation on startup probes is positive: need to write up.
- need to analyse the logs and do a clear proposal email.
- APIM alerts and escalations
- PR for runbook link