By Adarsh Shah
In my previous article, I talked about what Platform Engineering (PE) is, when it is useful, and the challenges I have seen working with it. In this article, we will go through solutions that have worked for my teams and me in resolving those challenges. I have led platform engineering teams at various companies of different sizes (from startup to enterprise), and these challenges and solutions are based on my experience.
Let’s go through each of the challenges and what I have done to solve them.
As mentioned in the previous article, one of the DevOps movement’s critical aspects is reducing silos between teams. Creating another team sounds like the opposite of that, and you should be careful not to create another silo. Below are some practices you should consider to avoid it.
PE team members should be focused on enabling other teams, providing the tools to get their job done. They shouldn’t be in the business of doing everything for them. This sometimes becomes challenging as it requires a cultural change in the organization, especially if you already have silos within your organization, but I would highly recommend it.
For example, provide templates and example pipelines and let the dev teams create and manage their pipelines. This way, developers still own their application pipelines but get the tools needed to accelerate development.
Another example: the PE team should create and maintain common Terraform modules that other teams consume to provision their own infrastructure. The consuming teams then own that infrastructure, instead of the PE team doing the provisioning for them.
In cross-team internships, we encourage members of various teams (devs, ops, infosec, etc.) to join the PE team for a short period (maybe a few weeks or a sprint) and pair with the team members to understand how the team operates and understand the internals of a Platform. After they are done, they take the knowledge back to their team and become subject matter experts for the aspects they learned about the platform during the internship and share it with their team. This is a practice that I have used at multiple places successfully.
Internships are not just one way. PE team members should also join other teams to understand how they operate and the challenges they face. This also helps in understanding how customers use the platform, which in turn helps improve it.
As a PE team member, you should look for opportunities to pair with devs from other teams. For example, if someone is having an issue using the platform and wants help from the PE team, this might be an opportunity to pair with them. Troubleshoot the issue together instead of solving it for them, so the next time a similar issue happens, they know how to troubleshoot it themselves.
Conduct regular knowledge-sharing sessions (lunch and learns, brown bags, etc.), especially for significant new features. This helps educate your customers on the platform’s features. If there is enough interest, also run sessions that explain the platform’s internals.
Documentation is crucial for both the PE team members and the consumers, especially keeping documentation updated as things change. Here are some of the tips for creating better documentation.
In the last couple of places, one thing that worked for me was creating an onboarding story for new PE team members and for anyone new joining a team that consumes the platform. The onboarding story usually involves making a small change to one of the applications and using the platform to deploy it to production, preferably on their first day on the team. Every time someone goes through the onboarding story, ask them how the experience was, whether they easily found all the information they needed to deploy the application, and so on. Then use that feedback to improve the onboarding story, the documentation, or the platform.
Runbooks, as defined in the AWS Well-Architected Framework documentation:
Enable consistent and prompt responses to well-understood events by documenting procedures in runbooks.
Read more about Runbooks here: https://wa.aws.amazon.com/wat.concept.runbook.en.html
Here are some great examples: https://containersolutions.github.io/runbooks/
Runbooks help the on-call person during an incident or a different kind of event. Even if they are not an expert, they can leverage the knowledge from others documented in the runbook.
You are more likely to keep documentation updated if it lives close to the code. If you are using GitHub, try to keep documentation in your GitHub repo. You can link to it from your wiki if you use one, so it is still searchable in the wiki where all your other documentation resides. Include documentation in the Pull Request checklist, so the person reviewing the Pull Request verifies that no documentation update required by the code change was missed.
PE team shouldn’t be a service center that has a ticket-based system where customers create tickets for things that need to be done, and the PE team does the work in a black box. The platform should be self-service, and there should be better collaboration between teams in defining/building features.
Team and individual goals and incentives play a big part in how folks work together. If the goals and incentives don’t align, people will be incentivized to do things that don’t help the common company goals. I can’t stress enough how important this point is. It’s leadership’s responsibility to make sure goals and incentives are aligned for teams and individuals.
We talked about avoiding silos between teams above, but what about silos within the team? A PE team usually works with many different programming languages, tools, and techniques, and the tools and practices in this area of software engineering change continuously. For these reasons, I have seen silos form within the team. A classic example is when only one team member knows how a particular part of the platform works.
Here are some of the practices that I have seen work.
Pair Programming is a practice that helps not just in improving productivity but also in reducing silos by enabling knowledge sharing between various members of the team. You can use a technique like Pair Stairs to ensure everyone is getting to pair with all other team members.
Pair Stairs is a way to make sure there is enough pair rotation between team members while doing pair programming. The vertical and horizontal axes in the diagram below are the team members. The numbers in the boxes are the number of days the two team members have been pairing. For example, as seen in diagram #1 below, Martha and Destiny have been pairing for three consecutive days, and it's probably better to rotate one of them with someone else. It's a good idea to keep one of the two people on the same story (if there is more work to be done on it) and rotate the other person. This way, there is continuity but also rotation to share knowledge about the story.
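The bookkeeping behind a pair-stairs chart can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tool the practice requires: the function names, the `MAX_DAYS` threshold of 3, and the rule that a counter resets when a pair does not work together on a given day are all hypothetical choices made for the example.

```python
from itertools import combinations

# Hypothetical threshold: suggest a rotation once a pair reaches this many days in a row.
MAX_DAYS = 3


def new_pair_stairs(members):
    """Map every unordered pair of team members to a day counter starting at 0."""
    return {frozenset(pair): 0 for pair in combinations(members, 2)}


def record_day(stairs, todays_pairs):
    """Increment the counter for pairs that worked together today; reset the rest.

    Resetting non-active pairs is an assumption so the counter means
    'consecutive days', matching the example in the text.
    """
    active = {frozenset(pair) for pair in todays_pairs}
    for pair in stairs:
        stairs[pair] = stairs[pair] + 1 if pair in active else 0


def pairs_to_rotate(stairs, max_days=MAX_DAYS):
    """Return pairs (as sorted name tuples) that have paired for max_days or more."""
    return sorted(
        tuple(sorted(pair)) for pair, days in stairs.items() if days >= max_days
    )


stairs = new_pair_stairs(["Martha", "Destiny", "David"])
for _ in range(3):
    record_day(stairs, [("Martha", "Destiny")])
print(pairs_to_rotate(stairs))
```

A spreadsheet or whiteboard works just as well. The point is that the counter makes long-running pairs visible so the team can decide when to rotate.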
Skills Matrix exercise helps track how proficient the team members are in various languages/tools that the team is using. It can then be used to ensure the team members’ skills keep improving so the silos within the team can be reduced.
See below an example skills matrix. To start, all team members rate themselves on each of the languages/tools. Every day after the team stand-up, look at the matrix and see if it can be improved by giving folks who are beginners in a language/tool an opportunity to work with it, potentially by pairing them with a teammate who is an expert.
As you can see in the animation in Diagram #2, after David (2) pairs with Martha (3) multiple times and works with other Kubernetes experts on Kubernetes stories between Sprints 2 and 6, David becomes an expert in it.
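As a rough sketch of how the daily look at the matrix could be supported, the Python snippet below suggests learner/expert pairings for one skill. The rating scale (1 = beginner, 3 = expert), the Terraform ratings, and the `suggest_pairings` function are assumptions made up for this example. Only the Kubernetes ratings for David (2) and Martha (3) come from the example above.

```python
# Hypothetical scale: 1 = beginner, 2 = intermediate, 3 = expert.
EXPERT = 3

# Sample data: the Terraform ratings are invented for illustration.
skills = {
    "David": {"Kubernetes": 2, "Terraform": 3},
    "Martha": {"Kubernetes": 3, "Terraform": 1},
}


def suggest_pairings(matrix, skill, expert_level=EXPERT):
    """Pair each non-expert in `skill` with an expert, cycling through the experts."""
    experts = sorted(
        name for name, ratings in matrix.items()
        if ratings.get(skill, 0) >= expert_level
    )
    learners = sorted(
        name for name, ratings in matrix.items()
        if ratings.get(skill, 0) < expert_level
    )
    if not experts:
        return []  # nobody on the team can mentor this skill yet
    return [
        (learner, experts[i % len(experts)]) for i, learner in enumerate(learners)
    ]


print(suggest_pairings(skills, "Kubernetes"))
```

An empty result for a skill is itself useful information: it flags an area where the whole team needs outside help or training.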
Since the team is building a product, they need to support it. Support is a great way to learn more about the issue you are investigating and how to troubleshoot it. It’s a great way to share knowledge within the team.
The support pair is responsible for answering any questions the customers have and looking at any incidents. Making sure that there is rotation, so everyone gets to do support is recommended. Same as pairing on stories, pairing on support helps in sharing knowledge. Also, look for opportunities to pair with customers to troubleshoot an issue and train them to do that themselves.
As mentioned multiple times above, since the PE team uses many different languages, tools, and techniques, it is good to have a glossary of terms to define what a particular word means to the team. This is so that the team has a shared understanding of various terms to avoid confusion and improve communication.
For example, in one place, the term “environment” meant different things in different contexts. Defining that and agreeing to use the term as per the definition helped communication within the team. Every time someone misused it, one of the team members would correct them, and over time we all started using the term appropriately, which helped the team immensely.
If you deploy & operate your application in production, incidents are bound to happen. It's essential to learn from them. As John Allspaw says in the tweet below, “Incidents are unplanned investments; their costs have already been incurred. Your org’s challenge is to get ROI on those events.”
When you do postmortems, make sure they are blameless. Read more about blameless postmortems here.
The real value of blameless postmortems is in the dialogue during these debriefings.
- John Allspaw
I have seen this as a challenge for most PE teams since they are usually dealing with various stakeholders from various teams, so defining and prioritizing features is not straightforward. Here are some of the practices that worked for me in the past.
Product thinking is about promoting holistic thinking instead of individual features when building the platform. As per the article Product over Project thinking, product-mode allows teams to reorient quickly, reduce their end-to-end cycle time, and validate actual benefits by using short-cycle iterations while maintaining their software’s architectural integrity to preserve their long-term effectiveness. It has benefits that help in defining and prioritizing features by thinking about the long term.
Path to Production exercise (or Roadmap) is a way to define everything needed to get your application to production and operate it in production efficiently. This includes all Non-functional requirements, Disaster Recovery, Support, etc. See below the example output of an exercise. You can see the timeline on the vertical axis and the various categories on the horizontal axis. The Now, Soon, Later, Near Release, and Post Release should be defined based on your timeline. The categories will also be different for you based on what you are building. Each entry in the table is essentially an epic or a super-epic that can then be prioritized and broken down into stories. I walk through this table with various stakeholders periodically and prioritize them based on their requirements.
Like with any product, make sure your customers are at the center. Observe how they use the platform and speak to them about improvements needed. Look out for opportunities for doing this during pairing sessions, lunch and learns, etc. Once you observe and talk to them, take that feedback back to your team and improve the customer's experience.
Come up with KPIs for features and use them to look back and measure performance for the feature. For example, did your customers get a lower feature lead time after releasing a new prominent feature?
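One way to make such a KPI concrete is to compare the median feature lead time before and after a platform release. The Python sketch below assumes lead time is measured in days from when work on a feature starts to when it is released. The dates are made-up sample data and the function names are hypothetical.

```python
from datetime import date
from statistics import median


def lead_time_days(started, released):
    """Lead time for one feature, in whole days."""
    return (released - started).days


def median_lead_time(features):
    """Median lead time over (started, released) date pairs."""
    return median(lead_time_days(start, end) for start, end in features)


# Made-up sample data: customer features shipped before and after a platform release.
before = [
    (date(2021, 1, 4), date(2021, 1, 18)),
    (date(2021, 1, 11), date(2021, 1, 21)),
    (date(2021, 2, 1), date(2021, 2, 13)),
]
after = [
    (date(2021, 3, 1), date(2021, 3, 6)),
    (date(2021, 3, 8), date(2021, 3, 15)),
    (date(2021, 3, 15), date(2021, 3, 21)),
]

improvement = median_lead_time(before) - median_lead_time(after)
print(f"Median lead time improved by {improvement} days")
```

The median is used here rather than the mean so that a single unusually slow feature does not dominate the comparison. Either choice is a judgment call for your team.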
Innovation & experimentation are a big part of what this team should do. Ensure that you continuously improve on the practices, tools & techniques while building/enhancing your customer’s features.
Because this team’s customers are technical teams, its features are more technical, and I have seen PE teams struggle with applying Agile principles and practices.
Story writing and analysis of the stories will be more technical. If you have a Business Analyst, they need to understand the technical jargon and what the team is working on. Keep that in mind and try to come up with a way to describe what’s expected out of the story. There tend to be many more spikes (proofs of concept) with a PE team since they constantly try out new practices, tools, and techniques. Make sure you timebox those spikes, as they often end up being open-ended and never-ending. Timeboxing is a simple and effective way to manage spikes: it ensures that you don't spend longer on a spike than it is worth, and it pushes you to reach a conclusion within the timeframe.
All other Agile practices are essential for Platform Engineering as well. I have seen PE teams either make excuses for not being Agile or struggle with it. With Agile practices, it's important to do what works for you, and because the work is more technical, you need to identify the practices that work for your teams.
Decide on the Team Norms (how is the team going to communicate, standup timing, etc.), which will help everyone in the team be on the same page.
Reporting plan, progress & status of the features, dependencies, blockers, etc. are critical in keeping your customers informed and happy. Make sure you have a good process and cadence for reporting these. The goal here should be to make it easy for your customers and you to understand where you are and enable communication.
As the practices, technologies, and tools used by the PE team change frequently, the team needs to embrace rapid adoption, which can be challenging as the team members need to continually keep learning and keep their customers updated & educated on the changes. Here are some of the ways you can solve that.
Showcases are a way to show progress to your customers. They are a great way to showcase what you have been working on (even if it’s not fully ready yet), get early feedback, and start conversations with your customers, especially when things are frequently changing. Also, provide an easy way to look at the team’s Path to Production (or Roadmap) and talk about it regularly with customers, so they know what’s coming in the future.
Enable your customers to use the platform and troubleshoot its problems using the techniques mentioned above, such as pairing, knowledge sharing, and cross-team internships.
Make sure the features you are building are maintainable in the long term. Start small & simple but also think about maintainability. Remember, the focus is to build a self-service platform so the customers can use it themselves rather than needing the team to use the features.
Constantly breaking stuff in production doesn’t give confidence to the customers. Ensure you have good practices in place to communicate (why, how, and when) outages & new releases.
For example, have good versioning practices for Terraform modules, with beta versions to try out and stable versions that customers know are reliable.
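To illustrate the stable-versus-beta idea, here is a small Python sketch that picks the newest version a customer should use from a list of release tags. It assumes tags look like `MAJOR.MINOR.PATCH` for stable releases and `MAJOR.MINOR.PATCH-beta.N` for betas. That tag format and the `latest` helper are hypothetical, and the snippet does not use or model Terraform's own version-constraint syntax.

```python
def parse(tag):
    """Turn 'MAJOR.MINOR.PATCH[-beta.N]' into a sortable key.

    The key is (core_version, is_stable, beta_number), so a stable release
    sorts after any beta of the same core version.
    """
    core, _, pre = tag.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    is_stable = pre == ""
    beta_number = 0 if is_stable else int(pre.split(".")[1])
    return (major, minor, patch), is_stable, beta_number


def latest(tags, allow_beta=False):
    """Return the newest tag, ignoring betas unless allow_beta is True."""
    candidates = [tag for tag in tags if allow_beta or parse(tag)[1]]
    return max(candidates, key=parse) if candidates else None


# Hypothetical release tags for a shared module.
tags = ["1.1.0", "1.2.0", "1.3.0-beta.1", "1.3.0-beta.2"]
print(latest(tags))                   # newest stable release
print(latest(tags, allow_beta=True))  # newest release including betas
```

Customers who want reliability stay on the default, while teams willing to give early feedback can opt in to betas explicitly.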
I have also given talks on Platform Engineering that you can watch here & here. I hope this article was helpful. If you have any questions or comments you can reach out to me via twitter or email: email@example.com.