Objective: Get to know Partners and build team performance and maturity

Rating: Very successful; the lack of candidates, and the rate of progress of existing candidates, was beyond my control.

Key results:

  1. Each candidate Partner helped to reach a ‘go/no-go’ decision quickly, based on their specific needs, with both documentary and face-to-face support.

    Measures will be quantitative (Net Promoter Score) and qualitative.

    NPS of 53% in April 23 has risen to 67% in November (provisional data).

    Brief, often daily, chats to answer questions and ensure alignment. Progress was good and rapid.

  2. Each Partner supported through completing Cortex Assessment

    Measures will be quantitative (Net Promoter Score) and qualitative.

    NPS of 53% in April 23 has risen to 67% in November (provisional data). Several partners arrived with challenges that we became better at identifying through the creation of the ‘Litmus test’, a pre-qualification check resulting from the Sierra and Raven assessments. PPE and DKP progressed quickly as a direct result of this, though DKP seems to be stuck at the TPR1 stage.

    Assessments completed this year:

    • A&G Identity - Migration as of 25 Jul 2023
    • A&G IPI - Closed 25 Jul 2023
    • A&G RDP - Migration 19 Sep 2023
    • A&G SSDR Raven - Assessment 25 Jul 2023
    • A&G DKP - Assessment 25 Jul 2023
    • DBS PPE - Closure 23 Nov 2023

    Assessments stopped / placed on hold:

    • CM Entellect - Assessment 25 Jul 2023
    • CM Knovel - Assessment 25 Jul 2023
    • CM Life Sciences Innovation Lab - Assessment 25 Jul 2023
    • DBS EPay - Cortex Tracker
    • DBS Project Sierra - B2B - Cortex Tracker
    • DBS Project Sierra - Magento - Cortex Tracker
    • DBS Project Sierra - Merchandising - Cortex Tracker
  3. Identify team metrics with management, agree and document ways of measuring them.

    • Roof: Metrics manually collected monthly
    • Moon: Automated collection of metrics

    • Metrics, and the means to collect them, identified: https://elsevier.atlassian.net/wiki/spaces/TIOCORTEX/pages/119601117614938/Cortex+Metrics
    • Management agreement sought and achieved.
    • Manual collection happened, with others proving able to do so without difficulty: https://github.com/elsevier-centraltechnology/tio-terraformcontrol-ce/pull/1157#pullrequestreview-1762000758
    • Proof of concept for automated collection delivered, but later set aside in preference to the technology subsequently selected for automation in other areas: https://github.com/elsevier-centraltechnology/cortex-operations/pull/279

  4. Ongoing improvement to Ways of Working both in terms of streamlining practice and documenting it.

    Measured by evaluating how complete and up to date the Ops Procedures site is.

    • Created a simple push-to-publish site to pull together all documentation for the Ops persona (distinct from the Partner personas).
    • Reviewed all existing docs for obsolescence and ISO 27001 compliance.
    • Achieved adoption by the team at the same time as regularly contributing updates to ensure it remains relevant. Evidenced by providing 33 of the team’s 75 documentation commits in 2023.

Objective: Grow understanding of dependencies of, and interaction between, Cortex application components.

Rating: Successful; of my objectives this was the one most under my control, even though it addresses a large, industry-wide problem.

Key results & measures

Cortex-managed components have been selected from the vast range of solutions available in the industry. Deploying these together is the problem solved by the Build team. However, post-deployment it is not always clear which component a specific resource belongs to, whether it is up to date, or whether it is obsolete.

  • Tim researched industry solutions to this problem; although tools such as troubleshoot, trivy and kbom appeared at different points, all seemed to be incomplete or early-stage solutions, tied too tightly to their existing toolchains.
  • Software Bill of Materials (SBOM) standards (CycloneDX and SPDX) were beginning to gain traction at the end of 2022.

Key results

  1. Traceability from source code to deployed environment of all resources in Cortex-managed namespaces.

    • The chain of traceability of Cortex components flows from open source repositories through their open source container images and helm charts into our ‘gitops’ repos and on into the K8s clusters.
      • The initial problem with this was that we did not know whether the correct cluster-wide resources, such as CRDs, were deployed.
      • A further problem is that only at the cluster level do we get to apply vulnerability detection through Twistlock.
    • By introducing an enriched set of metadata in parallel to our deployment pipeline, the SBOM tooling made it possible to (sketched after this list):
      • know at a glance the ownership of those tricky, untagged or global resources such as CRDs.
      • find when clusters have diverged from the blueprint.
      • cross-reference components against vulnerability lists earlier in the process.
    • The SBOM PoC delivered a partially automated stop-gap until the proposed ArgoCD deployment becomes available, but it did not seem appropriate to develop it further at that time. What remains outstanding is a UI / cronjob to make results available on a periodic basis.
      • Full automation could be integrated into the Cortex Advisor roadmap in 2024.
    • Twistlock analysis in alpha clusters effectively delivers the vulnerability checking, albeit we tend only to get time to address findings retrospectively.
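    A minimal sketch of the kind of cross-referencing this metadata enables is below. It assumes a CycloneDX-style SBOM file and a hypothetical ‘cortex:installs’ property mapping components to the cluster-wide resources they install; neither the file name nor the property is the actual PoC schema.

```python
# Map each CRD found in the cluster back to the component that owns it,
# using hypothetical ownership metadata carried in a CycloneDX SBOM.
import json
import subprocess

def load_component_index(sbom_path="blueprint.json"):
    """Build a resource-name -> owning-component index from a CycloneDX SBOM."""
    with open(sbom_path) as f:
        sbom = json.load(f)
    index = {}
    for component in sbom.get("components", []):
        # Hypothetical convention: each component carries a property naming
        # the cluster-wide resources (such as CRDs) it installs.
        for prop in component.get("properties", []):
            if prop.get("name") == "cortex:installs":
                for resource in prop.get("value", "").split(","):
                    index[resource.strip()] = component["name"]
    return index

def crds_in_cluster():
    """List CRD names via kubectl (assumes a configured kubeconfig)."""
    out = subprocess.run(
        ["kubectl", "get", "crds", "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

if __name__ == "__main__":
    index = load_component_index()
    for crd in crds_in_cluster():
        owner = index.get(crd, "UNKNOWN: possible divergence from blueprint")
        print(f"{crd}: {owner}")
```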
  2. Ability to report on a given cluster’s state with respect to the reference blueprint. Include both confirmation that expected resources are present and identification of unexpected resources.

    A PoC was delivered via a command-line tool aware of the Cortex Platform Definitions repository. HTML (comprehensive) and Slack (summary) outputs were also delivered. The core comparison is sketched below.
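    At its heart the report is a set comparison; the sketch below shows that core with illustrative data standing in for the Cortex Platform Definitions repository and a live cluster query.

```python
# The heart of the cluster/blueprint report is a set comparison. The data here
# is illustrative, standing in for resources read from the Cortex Platform
# Definitions repository and from a live cluster query.
def diff_cluster_state(expected: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Report blueprint resources missing from the cluster and vice versa."""
    return {
        "missing": expected - actual,     # in the blueprint, not the cluster
        "unexpected": actual - expected,  # in the cluster, not the blueprint
    }

expected = {"deployment/ingress", "crd/certificates.cert-manager.io"}
actual = {"deployment/ingress", "deployment/legacy-debug"}
report = diff_cluster_state(expected, actual)
print("missing:", sorted(report["missing"]))
print("unexpected:", sorted(report["unexpected"]))
```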

  3. Ability to define reports that span more than one component or partner. For example: find all partner applications using docker shim in order to support containerd, or find all uses of a specific component deprecated in advance of a Kubernetes upgrade.

    The extensibility mechanism delivered as part of the Cortex Control (crtxctl) tool has proved simple enough for (almost?) all team members to use to deliver features such as this quickly and efficiently. Most recently, a K8s 1.26 readiness test was developed in less than 2 business days at approximately 50% full-time equivalent (FTE). A sketch of that kind of check follows.
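    crtxctl’s actual plugin API is internal, so the sketch below is only a self-contained illustration of the kind of cross-cutting check the mechanism makes cheap to write: scanning rendered manifests for API versions removed at or before Kubernetes 1.26. The version list is a sample, not exhaustive.

```python
# Scan YAML manifests for Kubernetes API versions removed at or before 1.26.
import sys
import yaml  # pip install pyyaml

REMOVED_API_VERSIONS = {
    "autoscaling/v2beta2",                   # HPA, removed in 1.26
    "flowcontrol.apiserver.k8s.io/v1beta1",  # removed in 1.26
    "batch/v1beta1",                         # CronJob, removed in 1.25
}

def readiness_report(manifest_paths):
    """Print every object in the given YAML manifests using a removed API."""
    for path in manifest_paths:
        with open(path) as f:
            for doc in yaml.safe_load_all(f):
                if doc and doc.get("apiVersion") in REMOVED_API_VERSIONS:
                    name = doc.get("metadata", {}).get("name", "?")
                    print(f"{path}: {doc.get('kind', '?')}/{name} uses removed "
                          f"{doc['apiVersion']}")

if __name__ == "__main__":
    readiness_report(sys.argv[1:])
```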

  4. Blueprint expressed in an industry-standard form to permit merging with other data sources or integration with other tools.

    CycloneDX was selected as the older and more widely adopted standard. It also offers a more natural way to break the SBOM into parts, perhaps aligned to each upstream vendor. A minimal illustration of the shape follows.
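    For illustration only, a minimal CycloneDX document has the shape below, expressed as a Python dict mirroring the JSON form; the component entries are examples, not our actual blueprint content. Each per-vendor part could be a document of this shape, merged downstream.

```python
# A minimal, hand-written illustration of the CycloneDX document shape.
# Field values are examples only.
import json

bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.4",
    "version": 1,
    "components": [
        {"type": "container", "name": "ingress-nginx/controller", "version": "v1.9.4"},
        {"type": "application", "name": "cert-manager", "version": "v1.13.2"},
    ],
}

print(json.dumps(bom, indent=2))
```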

  5. Provide an extensibility mechanism so that investigation of novel issues can be quickly added to existing capabilities.

    Achieved, as shown by the K8s 1.26 migration readiness example.

Objective: Improve application diagnostic capabilities of Cortex

Rating: Very successful, given headwinds with Build and the scope of the problem.

Key results & measures

Identify and develop application-level diagnostics and troubleshooting capabilities. These might fall under the Cortex Inspector or Cortex Adviser initiatives.

Key results

  1. Review of available tools / approaches that could deliver these diagnostic capabilities.

    Achieved:

    • Three tools evaluated in conjunction with Ops and Build teams.
    • Documented in an RFP and a number of webinars.
  2. Document evaluation of more than one approach and seek consensus on the best one to choose within the team

    • Roof: Ops team consensus
    • Moon: Build and Ops team consensus

    The Ops team find the tool attractive and quick to pick up. There was some qualified endorsement from 3 of the Build team, but full endorsement proved elusive. Management were also persuaded after seeing that the selected tool could address wider problems of browser automation.

  3. Performance or error conditions detectable in Cortex core applications (those with the ‘cortex-critical’ priority class).

    • Roof: A pattern defined for one component that can be replicated for others
    • Moon: 5 components covered using the pattern previously defined.

    An initial end-to-end pattern and implementation were defined. Tests were delivered for two components, with two further in progress. The shape of the detection pattern is sketched below.
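    A hedged sketch of that pattern, not the delivered implementation: find pods in the ‘cortex-critical’ priority class and flag simple error conditions such as restarts or unready containers. It assumes the official Kubernetes Python client and a configured kubeconfig.

```python
# Flag restart/readiness problems in pods with the 'cortex-critical'
# priority class, via the official Kubernetes Python client.
from kubernetes import client, config  # pip install kubernetes

def cortex_critical_errors():
    """Yield (namespace, pod, container, restarts, ready) for suspect containers."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.spec.priority_class_name != "cortex-critical":
            continue
        for status in pod.status.container_statuses or []:
            if status.restart_count > 0 or not status.ready:
                yield (pod.metadata.namespace, pod.metadata.name,
                       status.name, status.restart_count, status.ready)

if __name__ == "__main__":
    for ns, pod, container, restarts, ready in cortex_critical_errors():
        print(f"{ns}/{pod}/{container}: restarts={restarts} ready={ready}")
```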

  4. Increase confidence in component releases

    • Roof: the team has confidence to release components that have test coverage, with minimal manual oversight.
    • Moon: Team has confidence to enable fully automated release of covered components.

    Confidence is good for covered components. However, the depth of component knowledge required has been shown to be greater than that which currently exists. Although this has contributed to delay, it must be seen as a good thing, since it drives depth of understanding within the team and provides working examples. This has been recognised in retros.

Objective: Produce Cortex training materials

Rating: Very successful, given the context of organisational divisions and inertia.

Training objective is: “Product Managers and Application Leads will be able to evaluate the suitability of their applications to run on Cortex by identifying the similarity / dissimilarity of their workloads with those Cortex can handle.”

Key results & measures

  1. Define an initial scope with the focus on partners not yet on Cortex.

    • There was an initial agreement with Cortex management to focus on ‘What is Cortex?’.
    • This recognised that:
      • some partners, the ones needing most Ops support, approached Cortex with minimal understanding of it as ‘a platform’, not fully grasping either the benefits the platform offers or the demands it makes of them.
      • follow-on work would be needed to deliver the bigger goal of ‘Cortex University’, and it is tricky to know how to do that fully without duplicating big chunks of existing Udemy or other courses.
  2. Deliver an initial training on the Elsevier training platform including video, text and ‘knowledge-securing’ quizzes.

    Achieved.

  3. Pilot training with users from Partner teams and gather feedback for the roadmap.

    • Roof: review with sponsors and at least 3 engineers actually take the course
    • Moon: At least one member of each new potential Partner in Q4 uses training

    Roof more or less achieved (reviewed with 2 business sponsors; engineers outstanding at the time of writing).

  4. Prepare specific roadmap items for the future.

    • Roof: general idea of areas that it would be useful to cover.
    • Moon: actionable descriptions of specific course(s) to deliver.

    The general idea certainly exists: taking developers into the worlds of containerisation and observability. As Corey Quinn noted in recent coverage of AWS re:Invent, enabling developers to take on these additional tasks is a big challenge. As yet, we have not found the appetite (budget) to tackle it in a structured way, so it remains the case that we point to fairly generic Udemy courses. The most promising routes at this stage seem to me to be:

    • ‘immersion days’: part-prepared, part-Q&A sessions on using a Cortex cluster.
    • the capability tests as a repository of working examples that could also become courseware if desired.

Behavioral Feedback

Put ourselves in customers’ shoes:

I think this is probably the quality I demonstrate most, and the way in which I can contribute most to the team. I say this because my own background is more development or product than infrastructure, so I am probably closer to our customers. Recent examples include:

Observing that it is strange to require Cortex Partners to supply a sufficient range of instance types to run their workloads. I think it would be better to allow partners to specify that they have memory- or compute-heavy workloads, perhaps noting that they have successfully used ‘xlarge’ nodes, and have the Platform extrapolate that to, for example: m6a.xlarge, m5a.xlarge, m5.xlarge, m6i.xlarge, m5ad.xlarge, m5n.xlarge, m5d.xlarge, m5dn.xlarge and m6id.xlarge. A sketch of that extrapolation follows.
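A sketch of the extrapolation I have in mind; the family lists are examples, not an agreed Cortex policy:

```python
# Expand a partner-supplied size class (and optional workload shape) into a
# list of equivalent EC2 instance types. Family lists are illustrative only.
GENERAL_PURPOSE_FAMILIES = ["m6a", "m5a", "m5", "m6i", "m5ad",
                            "m5n", "m5d", "m5dn", "m6id"]
MEMORY_HEAVY_FAMILIES = ["r6a", "r5a", "r5", "r6i"]  # illustrative only

def expand_instance_types(size: str, workload: str = "general") -> list[str]:
    families = (MEMORY_HEAVY_FAMILIES if workload == "memory"
                else GENERAL_PURPOSE_FAMILIES)
    return [f"{family}.{size}" for family in families]

print(expand_instance_types("xlarge"))
# ['m6a.xlarge', 'm5a.xlarge', 'm5.xlarge', 'm6i.xlarge', ...]
```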
Similarly, many partners need to ship application logs to their observability interface, such as NewRelic. To my mind this should be done by specifying the desired tool, credentials and so on, and letting Cortex make it happen; advanced usage might include a way to specify filtering criteria. Instead we delivered a fluentbit operator and then asked them how they would like to use it! Yes, the Ops team walked them through the process and supported them to find the answers, but it was a back-to-front way of doing it. The declarative shape I have in mind is sketched below.
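The partner-facing, declarative shape I am arguing for might look like the sketch below; every key is hypothetical rather than an existing Cortex API. Cortex would translate this intent into the underlying fluentbit configuration.

```python
# Hypothetical partner-facing log-shipping spec: declare intent, let the
# platform generate the fluentbit configuration behind the scenes.
log_shipping_spec = {
    "destination": "newrelic",
    "licenseKeySecret": "newrelic-license",  # name of an existing K8s secret
    "namespaces": ["my-app"],
    "filters": [
        # Advanced usage: optional filtering criteria.
        {"exclude": {"field": "log", "regex": "healthcheck"}},
    ],
}
```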

Humble, authentic and collaborative

I learned early in my career that discussions about technology are very difficult. Not only are problems often abstract, and therefore hard to describe and discuss, but there are typically many valid ways to solve them, with preferences coloured by familiarity and experience. I therefore always seek to hear what colleagues are saying before progressing to how we should proceed. I apply the ‘pigs vs chickens’ principle, that the view of the person on the hook for delivery carries greater weight than that of the opinionated bystander, and I seek to speak up and clarify when I hear miscommunication.

Overall

Rating: Very Strong Performance

Comment: Overall, I feel the evidence presented shows me meeting, and sometimes exceeding, the agreed goals. I feel this is particularly successful given the broad and open problem domains that needed to be explored, scoped and bounded before they could be solved.