
Tooling Standards

Purpose

This document defines approved technologies and patterns for cloud-native development. The goal is clear: build solutions that run anywhere without vendor lock-in.

Our philosophy: cloud-agnostic, cloud-native. Choose portable technologies and document patterns, not specific vendor products. If you can't move your workload to another cloud provider in a reasonable timeframe, you're doing it wrong.

Principles

Patterns over products: We document the capabilities you need, not which vendor's product to use. Need a message queue? Great - use Service Bus on Azure, SQS on AWS, or Pub/Sub on GCP. The pattern (asynchronous messaging) matters more than the brand name.

Portability first: Technologies in this guide work on any major cloud provider. If a tool only runs on one cloud, it doesn't belong here.

Managed services preferred: Let cloud providers handle operational complexity so you can focus on business logic. Self-hosting databases, message queues, or caches should be the exception, not the default. Valid exceptions include: regulatory requirements (data residency), cost-prohibitive managed pricing at scale, or specific performance requirements unavailable in managed offerings. Document exceptions in an ADR.

Standards-based: Choose technologies built on open standards and interoperable protocols. Avoid proprietary APIs when standard alternatives exist.

Core Cloud-Native Technologies

These are specific, approved technologies that work everywhere. They're portable standards, not vendor products.

| Technology | Purpose | Version | Documentation |
| --- | --- | --- | --- |
| Docker | Container runtime and image format | 24.0+ | Docker Docs |
| Kubernetes | Container orchestration | 1.28+ | Kubernetes Docs |
| Helm | Kubernetes package manager | 3.12+ | Helm Docs |
| Envoy | Service mesh proxy and load balancer | 1.28+ | Envoy Docs |
| OpenTelemetry | Observability (traces, metrics, logs) | 1.0+ | OpenTelemetry Docs |
| OPA | Policy enforcement and authorization | 0.60+ | OPA Docs |

Docker is required for all services. It's the container standard - if your code doesn't run in a container, it doesn't deploy to our infrastructure. All services must be containerized and stateless regardless of architecture pattern (microservices or monolith).
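As an illustration, a containerized stateless service typically uses a multi-stage build so the runtime image carries no build toolchain. This sketch assumes a Go service; the paths, image tags, and port are placeholders:

```dockerfile
# Build stage: compile with the full SDK image
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/service ./cmd/service

# Runtime stage: minimal, non-root image with just the binary
FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/service /service
EXPOSE 8080
ENTRYPOINT ["/service"]
```

Nothing here is cloud-specific - the same image runs on any provider's container platform.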

Kubernetes is our orchestration platform. Whether you're running on AKS (Azure), EKS (AWS), or GKE (Google Cloud), the YAML manifests and Helm charts work the same way.
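For example, a Deployment manifest is identical across providers; only the cluster it's applied to changes. Names, image, and probe path below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```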

Serverless K8s

Cloud providers offer serverless container platforms - for example Azure Container Apps and Amazon EKS on Fargate - that remove the overhead of operating a Kubernetes cluster yourself. These are worth considering when the team can't take on cluster management at the time. If you want to go this route, let's talk about it first.

Envoy handles service-to-service communication. Use it as a sidecar proxy for traffic management, load balancing, and observability. It's the data plane for Istio, Consul Connect, and other service meshes.

OpenTelemetry is the observability standard. It captures traces, metrics, and logs with vendor-neutral APIs and exports them to any backend (Jaeger, Prometheus, our APM of choice, etc.). See our observability guides for implementation details.
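The vendor-neutral part is usually expressed in an OpenTelemetry Collector pipeline: services export OTLP, and only the exporter block names a backend. The endpoint below is a placeholder for whichever backend you run:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com  # swap backends here, not in app code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```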

OPA enforces policy decisions. Use it for authorization logic, admission control in Kubernetes, and any decision-making that should be separate from application code.
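As a sketch, an authorization rule in OPA's Rego language might look like this - the `admin` role and the shape of `input` are assumptions, not a prescribed schema:

```rego
package httpapi.authz

default allow = false

# Admins may call any endpoint
allow {
    input.user.role == "admin"
}

# Any authenticated user may read their own profile
allow {
    input.method == "GET"
    input.path == ["users", input.user.id]
}
```

The application (or Envoy, or the Kubernetes admission controller) queries this policy instead of hard-coding the rules.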

Managed Services: Pattern-Based

These are capabilities you should use via managed cloud services. Pick what fits your cloud provider, but the pattern is required.

Databases

| Pattern | When to Use | Examples (not prescriptive) |
| --- | --- | --- |
| Managed relational database | Structured data, ACID transactions, SQL queries | Azure SQL Database, Amazon RDS (PostgreSQL/MySQL), Google Cloud SQL |
| Managed document database | Semi-structured data, flexible schemas, JSON documents | Azure Cosmos DB, Amazon DocumentDB, Google Firestore |
| Managed time-series database | Metrics, IoT data, high-write workloads | Azure Data Explorer, Amazon Timestream, Google Cloud Bigtable |

Key requirements:

  • Automated backups and point-in-time recovery
  • High availability across availability zones
  • Encryption at rest and in transit
  • Connection pooling support
  • Read replicas for scaling read-heavy workloads
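Connection pooling is normally handled by your database driver, but the pattern itself is small enough to sketch. This Python example uses a stand-in `connect` factory and sqlite3 purely for illustration - it is not a production pool:

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal fixed-size pool: borrow a connection, return it when done."""

    def __init__(self, connect, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free, bounding concurrent connections
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# sqlite3 stands in for a managed database driver here
pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```

The point is the bound: the pool caps open connections, which managed databases often limit per tier.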

Do not self-host databases unless you have a specific technical requirement that managed services can't satisfy. Managing database infrastructure is undifferentiated heavy lifting.

Messaging

| Pattern | When to Use | Examples (not prescriptive) |
| --- | --- | --- |
| Message queue | Task distribution, command processing, async workflows | Azure Service Bus, Amazon SQS, Google Cloud Tasks |
| Event streaming | Event sourcing, real-time analytics, log aggregation | Azure Event Hubs, Amazon Kinesis, Google Cloud Pub/Sub |
| Pub/Sub messaging | Event broadcasting, fan-out patterns | Azure Service Bus Topics, Amazon SNS, Google Cloud Pub/Sub |

Key requirements:

  • At-least-once delivery guarantees
  • Dead-letter queue support
  • Message retention for replay
  • Ordering guarantees (when needed)
  • Idempotent message handling in consumers
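At-least-once delivery means consumers will occasionally see the same message twice, so handlers must be idempotent. A minimal sketch of the dedup pattern - in production the seen-set would live in a shared store (a Redis set or a dedup table), not in process memory:

```python
class IdempotentConsumer:
    """Skip messages whose ID has already been processed."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in production: Redis SET or a dedup table

    def handle(self, message_id, payload):
        if message_id in self._seen:
            return False  # duplicate delivery - safely ignored
        self._handler(payload)
        self._seen.add(message_id)  # record only after successful processing
        return True

processed = []
consumer = IdempotentConsumer(processed.append)
consumer.handle("msg-1", {"order": 42})
consumer.handle("msg-1", {"order": 42})  # redelivery: no double processing
```

Recording the ID only after the handler succeeds means a crash mid-processing leads to a retry, not a lost message.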

See Scale and High Availability for patterns on handling message queue failures and backpressure.

Secrets Management

| Pattern | When to Use | Examples (not prescriptive) |
| --- | --- | --- |
| Managed key vault | API keys, connection strings, certificates, encryption keys | Azure Key Vault, AWS Secrets Manager, Google Secret Manager |

Key requirements:

  • Automatic secret rotation support
  • Access auditing and logging
  • Integration with Kubernetes (CSI driver or external secrets operator)
  • Role-based access control (RBAC)
  • Encryption key management (customer-managed keys)

Never commit secrets to source control. Never store secrets in environment variables unless they're injected at runtime from a vault. See Flexible Application Configuration and Security in Development for implementation guidance.
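One common vault integration is the Kubernetes CSI driver mounting secrets as files, with runtime-injected environment variables as the fallback for local development. A sketch - the mount path and naming convention here are illustrative assumptions:

```python
import os
from pathlib import Path

def load_secret(name, mount_dir="/mnt/secrets"):
    """Prefer a vault-mounted file; fall back to a runtime-injected env var."""
    secret_file = Path(mount_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    # e.g. "db-password" -> DB_PASSWORD, injected at runtime, never committed
    value = os.environ.get(name.upper().replace("-", "_"))
    if value is None:
        raise RuntimeError(f"secret {name!r} not available at runtime")
    return value

# Usage: the value never appears in source control or image layers
# db_password = load_secret("db-password")
```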

Caching

| Pattern | When to Use | Examples (not prescriptive) |
| --- | --- | --- |
| Managed Redis | Session storage, distributed caching, rate limiting | Azure Cache for Redis, Amazon ElastiCache (Redis), Google Cloud Memorystore |

Key requirements:

  • High availability (replication across zones)
  • Persistence options for durability
  • TLS encryption for data in transit
  • Eviction policies appropriate for use case
  • Connection pooling in client libraries
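The usual access pattern with a managed cache is cache-aside with a TTL: check the cache, fall back to the origin on a miss, and invalidate on writes. A sketch using an in-memory dict as a stand-in for Redis:

```python
import time

class CacheAside:
    """Check cache first; on miss or expiry, load from origin and store with a TTL."""

    def __init__(self, loader, ttl_seconds=60):
        self._loader = loader
        self._ttl = ttl_seconds
        self._store = {}  # stand-in for a managed Redis instance

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]  # cache hit, still fresh
        value = self._loader(key)  # miss or expired: hit the origin
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        self._store.pop(key, None)  # explicit invalidation on writes

calls = []
cache = CacheAside(lambda k: calls.append(k) or f"value-for-{k}", ttl_seconds=60)
cache.get("user:1")
cache.get("user:1")  # served from cache: the loader ran only once
```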

See Scale and High Availability for caching strategies and cache invalidation patterns.

Development and Collaboration

These are capabilities every team needs. Choose tools that fit your organization, but make sure they provide these patterns.

Version Control

| Capability | Why Required | Examples |
| --- | --- | --- |
| Git-based repository | Source control, branching, code review | GitHub, Bitbucket, Azure DevOps Repos, GitLab |

Key requirements:

  • Pull request/merge request workflow
  • Code review tools and approvals
  • Protected branches (main/master)
  • Webhook support for CI/CD integration
  • Branch policies (require reviews, status checks)

See Managing Our Source for branching strategies and workflow.

Documentation

| Capability | Why Required | Examples |
| --- | --- | --- |
| Wiki or documentation platform | Architectural decisions, runbooks, team knowledge | Confluence, Notion, GitBook, GitHub Wiki, Azure DevOps Wiki |

Key requirements:

  • Search functionality
  • Version history
  • Access control (public vs. team vs. private)
  • Markdown support
  • Link validation

See Importance of Documentation for documentation standards.

Issue Tracking

| Capability | Why Required | Examples |
| --- | --- | --- |
| Issue and project tracking | Feature planning, bug tracking, sprint management | Jira, GitHub Issues, Azure Boards, Linear, GitLab Issues |

Key requirements:

  • Backlog management
  • Sprint/iteration planning
  • Integration with VCS (link commits to issues)
  • Custom workflows (if needed)
  • Labels/tags for categorization

CI/CD Pipeline

| Capability | Why Required | Examples |
| --- | --- | --- |
| Pipeline automation | Build, test, deploy automation | GitHub Actions, Azure Pipelines, GitLab CI/CD, CircleCI |

Key requirements:

  • Triggered by source control events (push, PR)
  • Matrix builds (test multiple versions/platforms)
  • Artifact storage and versioning
  • Deployment approvals and gates
  • Secret injection (vault integration)
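As one concrete instance of these requirements, a minimal GitHub Actions pipeline - the job steps, Go versions, and secret name are illustrative, and other CI systems express the same ideas with different syntax:

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ["1.21", "1.22"]  # matrix build across versions
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
      - run: go test ./...

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production  # deployment approvals and gates live here
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}  # injected at runtime, never committed
```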

See Builds and Deployments for pipeline patterns and Versioning Our Solutions for artifact versioning.

Metrics and Monitoring

| Capability | Why Required | Examples |
| --- | --- | --- |
| Prometheus-compatible metrics | Application and infrastructure monitoring | Prometheus, our APM of choice, Grafana Cloud, Azure Monitor, AWS CloudWatch (with exporters) |

Key requirements:

  • PromQL query language support (or equivalent)
  • Alerting rules and notification routing
  • Dashboard creation and templating
  • Long-term metrics storage
  • Integration with OpenTelemetry
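"Prometheus-compatible" ultimately means exposing metrics in the Prometheus text exposition format. This sketch renders that format by hand to show what's on the wire; real services would use a Prometheus client library or the OpenTelemetry SDK instead:

```python
def render_metrics(metrics):
    """Render {name: (type, help, value)} in Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical metric names for illustration
exposition = render_metrics({
    "http_requests_total": ("counter", "Total HTTP requests served", 1027),
    "queue_depth": ("gauge", "Messages waiting in the work queue", 3),
})
```

A scraper hitting your /metrics endpoint expects exactly this shape: HELP and TYPE comments followed by sample lines.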

See Observability: Metrics for instrumentation patterns and Observability: Distributed Tracing for OpenTelemetry implementation.

Local Development Tools

Developers need minimal tooling to build and test locally. These tools should be installed on every development machine.

Required tools:

  • Docker Desktop or Podman: Container runtime for local development and testing
  • kubectl: Kubernetes CLI for interacting with clusters
  • Helm: Kubernetes package manager for deploying charts
  • Git: Version control client
  • Language-specific SDK: Go 1.21+, .NET 8+, Node.js 20+, Python 3.11+ (as needed)

Recommended tools:

  • k9s or Lens: Kubernetes cluster UI/TUI for easier debugging
  • jq: JSON processing for working with API responses and logs
  • curl or httpie: HTTP client for API testing
  • yq: YAML processing for working with Kubernetes manifests

Local environment:

Use Docker Compose for running dependencies locally (databases, message queues, caches). Every repository should include a docker-compose.yml file that spins up required services with sensible defaults.

Example local stack:

```yaml
# docker-compose.yml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: localdev
      POSTGRES_USER: myapp
      POSTGRES_DB: myapp
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  azurite: # Azure Storage emulator
    image: mcr.microsoft.com/azure-storage/azurite
    ports:
      - "10000:10000" # Blob
      - "10001:10001" # Queue
```
Developers should be able to run docker compose up and have a working local environment. Configuration should use environment variables or local config files (never committed) that point to these local services.
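Application configuration can default to the local stack so docker compose up plus a plain run just works, while production injects real values at runtime. A sketch - the variable names and URL formats are illustrative:

```python
import os

def local_config():
    """Read connection settings from env vars, defaulting to the local stack."""
    return {
        # Defaults match the local docker-compose services; production
        # deployments inject real values at runtime instead.
        "database_url": os.environ.get(
            "DATABASE_URL", "postgresql://myapp:localdev@localhost:5432/myapp"
        ),
        "redis_url": os.environ.get("REDIS_URL", "redis://localhost:6379"),
        "blob_endpoint": os.environ.get("BLOB_ENDPOINT", "http://localhost:10000"),
    }

config = local_config()
```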

Tool Governance

Technology choices aren't permanent. Tools move through lifecycle states based on adoption, support, and business needs.

Lifecycle States

Evaluating: We're testing this technology on non-critical projects to see if it fits our needs.

  • No production usage yet
  • Spike projects and proof-of-concepts only
  • Document findings and decision criteria
  • Time-boxed evaluation period (typically 3-6 months)

Approved: This technology is vetted and ready for production use.

  • Documented in this guide
  • Support and training available
  • Used in production by at least one team
  • Maintained by vendor/community with regular updates

Deprecated: We're phasing this technology out, but existing usage is supported.

  • No new projects should use this
  • Existing projects should plan migration
  • Security patches and critical fixes only
  • End-of-life date communicated

Retired: This technology is no longer supported.

  • No exceptions for new usage
  • Existing usage must be migrated
  • Documentation archived
  • Lessons learned captured

Proposing New Tools

Have a technology that should be in this guide? Here's how to propose it:

  1. Document the need: What problem does this solve that current tools don't?
  2. Evaluate portability: Does it work on all major cloud providers? Is it based on open standards?
  3. Assess operational impact: What's the learning curve? What new skills are needed? What's the operational overhead?
  4. Pilot it: Use it in a non-critical project and document the experience.
  5. Present findings: Share results with architecture review board (or equivalent) including pros, cons, and migration costs.
  6. Decide together: Tool adoption affects everyone - make it a team decision, not an individual one.

Annual Review

Every year, we review this guide to ensure it stays relevant. Questions we ask:

  • Are version requirements still current?
  • Have better alternatives emerged?
  • Are deprecated tools ready for retirement?
  • Should any approved tools move to deprecated status?
  • What new technologies should we evaluate?

Technology moves fast - our standards should evolve thoughtfully, not chaotically.

Cross-References

Implementation details for these tools and patterns are documented elsewhere:

  • Managing Our Source - branching strategies and workflow
  • Builds and Deployments - pipeline patterns
  • Versioning Our Solutions - artifact versioning
  • Scale and High Availability - caching, backpressure, and failure-handling patterns
  • Observability: Metrics and Observability: Distributed Tracing - instrumentation and OpenTelemetry
  • Flexible Application Configuration and Security in Development - configuration and secrets handling
  • Importance of Documentation - documentation standards

Summary

Good tooling standards balance consistency with flexibility. Be specific about portable standards (Docker, Kubernetes, OpenTelemetry) and pattern-based about cloud services (managed databases, message queues).

The goal isn't standardization for its own sake - it's enabling teams to build reliable, portable, cloud-native solutions without reinventing the wheel every time.

Choose patterns over products. Build for portability. Let managed services handle undifferentiated heavy lifting. Focus on business value, not infrastructure complexity.

