Role Overview
In this role you will be part of the team building and supporting the Internal Developer Platform (IDP) where all Apptio applications are deployed. You will interact with GitHub, Linux, Kubernetes, ArgoCD, Docker, Confluence, Jira, Slack, and AWS.
The Platform and Reliability Engineering (PRE) team at Apptio is responsible for enhancing and maintaining our IDP and driving the adoption of platform best practices across our engineering teams. The team is distributed across the United States, Poland, and Australia.
Responsibilities
* Lead the design, architecture, and delivery of major platform initiatives and self‐service capabilities that improve developer velocity at scale.
* Define and enforce platform engineering standards, patterns, and best practices across the PRE team and the broader engineering organization.
* Act as a technical mentor and coach to junior and mid‐level Platform Engineers, providing guidance through design reviews and knowledge sharing sessions.
* Drive architectural decisions for the Internal Developer Platform, including evaluating and introducing new tooling, frameworks, and cloud‐native technologies.
* Define and improve the observability strategy for platform services, including ownership of KPI dashboards, SLOs, alerting, and incident response frameworks.
* Participate in and help lead swarm collaboration sessions, bringing technical clarity and driving decisions during complex incidents.
* Collaborate closely with Apptio product developers, architects, and engineering leadership to influence the roadmap and advocate for platform‐first solutions.
* Participate in and help coordinate the on‐call rotation, helping to improve on‐call health and reduce toil.
* Lead platform maintenance initiatives including patching strategies, upgrade planning, and capacity management.
* Contribute to team strategy by identifying gaps, proposing improvements to team processes, and helping shape the PRE roadmap.
Required Technical and Professional Expertise
* 5+ years of experience in Platform Engineering, DevOps, SRE, or an adjacent role, with demonstrated progression in scope and technical ownership.
* Strong proficiency with at least one programming or scripting language (preferably Golang).
* Deep experience designing, operating, and troubleshooting distributed application platforms at scale.
* Advanced expertise with Kubernetes, including multi‐cluster management, custom controllers, admission webhooks, networking, RBAC, and resource optimization.
* Proven experience designing and implementing Infrastructure‐as‐Code (IaC) solutions in production environments.
* Strong expertise with cloud provider services, particularly AWS (EKS, IAM, VPC, S3, RDS, and related services).
* Advanced experience with container technologies (Docker, OCI standards) and container security best practices.
* Demonstrated ability to lead technical initiatives end‐to‐end, including requirements gathering, design, implementation, and stakeholder communication.
* Experience mentoring and coaching engineers of varying experience levels.
Preferred Technical and Professional Experience
* Experience with monitoring and observability technologies (Prometheus, Grafana, Splunk, Datadog, OpenTelemetry, etc.), including defining SLOs and building alerting strategies.
* Deep knowledge of the CNCF landscape and hands‐on experience with CNCF projects such as Cilium, Karpenter, Crossplane, Cert‐Manager, and others.
* Experience with platform security practices including supply chain security, vulnerability scanning, and policy enforcement.
* Familiarity with FinOps principles and cloud cost optimization strategies at the platform level.