# Tooling Standards

## Purpose
This document defines approved technologies and patterns for cloud-native development. The goal is clear: build solutions that run anywhere without vendor lock-in.
Our philosophy: cloud-agnostic, cloud-native. Choose portable technologies and document patterns, not specific vendor products. If you can't move your workload to another cloud provider in a reasonable timeframe, you're doing it wrong.
## Principles
**Patterns over products**: We document the capabilities you need, not which vendor's product to use. Need a message queue? Great - use Service Bus on Azure, SQS on AWS, or Pub/Sub on GCP. The pattern (asynchronous messaging) matters more than the brand name.

**Portability first**: Technologies in this guide work on any major cloud provider. If a tool only runs on one cloud, it doesn't belong here.

**Managed services preferred**: Let cloud providers handle operational complexity so you can focus on business logic. Self-hosting databases, message queues, or caches should be the exception, not the default. Valid exceptions include regulatory requirements (data residency), cost-prohibitive managed pricing at scale, or specific performance requirements unavailable in managed offerings. Document exceptions in an ADR.

**Standards-based**: Choose technologies built on open standards and interoperable protocols. Avoid proprietary APIs when standard alternatives exist.
## Core Cloud-Native Technologies
These are specific, approved technologies that work everywhere. They're portable standards, not vendor products.
| Technology | Purpose | Version | Documentation |
|---|---|---|---|
| Docker | Container runtime and image format | 24.0+ | Docker Docs |
| Kubernetes | Container orchestration | 1.28+ | Kubernetes Docs |
| Helm | Kubernetes package manager | 3.12+ | Helm Docs |
| Envoy | Service mesh proxy and load balancer | 1.28+ | Envoy Docs |
| OpenTelemetry | Observability (traces, metrics, logs) | 1.0+ | OpenTelemetry Docs |
| OPA | Policy enforcement and authorization | 0.60+ | OPA Docs |
**Docker** is required for all services. It's the container standard - if your code doesn't run in a container, it doesn't deploy to our infrastructure. All services must be containerized and stateless regardless of architecture pattern (microservices or monolith).
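As a sketch, a minimal service container might look like this (the base image version and `myapp` module name are illustrative, not prescribed):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer caches between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Stateless: configuration arrives via env vars at runtime, not baked into the image
CMD ["python", "-m", "myapp"]
```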
**Kubernetes** is our orchestration platform. Whether you're running on AKS (Azure), EKS (AWS), or GKE (Google Cloud), the YAML manifests and Helm charts work the same way.
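For example, a minimal Deployment manifest (names, image, and port are illustrative) applies unchanged to any of those managed clusters:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```

Nothing in this manifest is provider-specific; portability breaks only when you reach for vendor-specific annotations or CRDs.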
> **Serverless K8s**: Cloud providers offer "serverless Kubernetes" options such as Azure Container Apps and AWS EKS on Fargate. These are worth considering when the team can't yet take on the overhead of managing a full Kubernetes cluster. If you want to go this route, let's talk about it first.
**Envoy** handles service-to-service communication. Use it as a sidecar proxy for traffic management, load balancing, and observability. It's the data plane for Istio, Consul Connect, and other service meshes.
**OpenTelemetry** is the observability standard. It captures traces, metrics, and logs with vendor-neutral APIs and exports them to any backend (Jaeger, Prometheus, our APM of choice, etc.). See our observability guides for implementation details.
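As a sketch of that vendor neutrality, an OpenTelemetry Collector configuration (endpoints here are illustrative assumptions) can receive OTLP from any service and fan out to whichever backends you run:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  # Expose metrics for Prometheus to scrape
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Forward traces to a tracing backend (Jaeger here, but any OTLP sink works)
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

Swapping backends means editing this file, not re-instrumenting application code.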
**OPA** enforces policy decisions. Use it for authorization logic, admission control in Kubernetes, and any decision-making that should be separate from application code.
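As an illustration, a minimal Rego policy (the package name and `input` shape are assumptions for this sketch) denies by default and grants access only when a rule matches:

```rego
package authz

# Deny unless a rule below explicitly allows
default allow = false

# Illustrative rule: authenticated users may read any resource
allow {
    input.user.authenticated
    input.action == "read"
}
```

Because the policy lives outside the application, it can be reviewed, versioned, and changed without redeploying services.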
## Managed Services: Pattern-Based
These are capabilities you should use via managed cloud services. Pick what fits your cloud provider, but the pattern is required.
### Databases
| Pattern | When to Use | Examples (not prescriptive) |
|---|---|---|
| Managed relational database | Structured data, ACID transactions, SQL queries | Azure SQL Database, Amazon RDS (PostgreSQL/MySQL), Google Cloud SQL |
| Managed document database | Semi-structured data, flexible schemas, JSON documents | Azure Cosmos DB, Amazon DocumentDB, Google Firestore |
| Managed time-series database | Metrics, IoT data, high-write workloads | Azure Data Explorer, Amazon Timestream, Google Cloud Bigtable |
Key requirements:
- Automated backups and point-in-time recovery
- High availability across availability zones
- Encryption at rest and in transit
- Connection pooling support
- Read replicas for scaling read-heavy workloads
Do not self-host databases unless you have a specific technical requirement that managed services can't satisfy. Managing database infrastructure is undifferentiated heavy lifting.
### Messaging
| Pattern | When to Use | Examples (not prescriptive) |
|---|---|---|
| Message queue | Task distribution, command processing, async workflows | Azure Service Bus, Amazon SQS, Google Cloud Tasks |
| Event streaming | Event sourcing, real-time analytics, log aggregation | Azure Event Hubs, Amazon Kinesis, Google Cloud Pub/Sub |
| Pub/Sub messaging | Event broadcasting, fan-out patterns | Azure Service Bus Topics, Amazon SNS, Google Cloud Pub/Sub |
Key requirements:
- At-least-once delivery guarantees
- Dead-letter queue support
- Message retention for replay
- Ordering guarantees (when needed)
- Idempotent message handling in consumers
See Scale and High Availability for patterns on handling message queue failures and backpressure.
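The idempotent-handling requirement above can be sketched with a consumer that records processed message IDs. This is an in-memory stand-in with illustrative names; in production the dedupe store would be Redis or a database table, but the shape is the same:

```python
class IdempotentConsumer:
    """Processes each message at most once, even under at-least-once delivery."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # in production: Redis set or DB table

    def handle(self, message):
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return False  # duplicate delivery: skip
        self.handler(message)
        self.processed_ids.add(msg_id)  # mark only after successful handling
        return True

# Duplicate deliveries (expected with at-least-once queues) are absorbed:
seen = []
consumer = IdempotentConsumer(lambda m: seen.append(m["body"]))
consumer.handle({"id": "m1", "body": "charge card"})
consumer.handle({"id": "m1", "body": "charge card"})  # redelivery, ignored
```

The key design choice is marking the message as processed only after the handler succeeds, so a crash mid-handling leads to a retry rather than a lost message.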
### Secrets Management
| Pattern | When to Use | Examples (not prescriptive) |
|---|---|---|
| Managed key vault | API keys, connection strings, certificates, encryption keys | Azure Key Vault, AWS Secrets Manager, Google Secret Manager |
Key requirements:
- Automatic secret rotation support
- Access auditing and logging
- Integration with Kubernetes (CSI driver or external secrets operator)
- Role-based access control (RBAC)
- Encryption key management (customer-managed keys)
Never commit secrets to source control. Never store secrets in environment variables unless they're injected at runtime from a vault. See Flexible Application Configuration and Security in Development for implementation guidance.
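A minimal sketch of the runtime-injection pattern (function and variable names are illustrative): the vault integration - CSI driver or external secrets operator - sets the environment variable before the process starts, and the application fails fast if a secret is missing rather than deep inside a request path:

```python
import os

def require_secret(name: str) -> str:
    """Fetch a secret injected at runtime (e.g. by a vault CSI driver).

    Raises at startup if the secret is absent, instead of failing later.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Simulated injection: in a real deployment the vault integration sets this,
# and the application never talks to the vault directly.
os.environ["DB_CONNECTION_STRING"] = "postgresql://example"
conn = require_secret("DB_CONNECTION_STRING")
```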
### Caching
| Pattern | When to Use | Examples (not prescriptive) |
|---|---|---|
| Managed Redis | Session storage, distributed caching, rate limiting | Azure Cache for Redis, Amazon ElastiCache (Redis), Google Cloud Memorystore |
Key requirements:
- High availability (replication across zones)
- Persistence options for durability
- TLS encryption for data in transit
- Eviction policies appropriate for use case
- Connection pooling in client libraries
See Scale and High Availability for caching strategies and cache invalidation patterns.
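The cache-aside pattern that a managed Redis serves in production can be sketched in a few lines (an in-memory stand-in with illustrative names, assuming a simple per-entry TTL):

```python
import time

class TTLCache:
    """Minimal cache-aside helper with per-entry TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[1] > now:
            return entry[0]  # cache hit, not yet expired
        value = loader(key)  # cache miss: go to the source of truth
        self.store[key] = (value, now + self.ttl)
        return value

calls = []
def load_user(key):
    calls.append(key)  # track trips to the backing store
    return {"id": key, "name": "example"}

cache = TTLCache(ttl_seconds=60)
a = cache.get_or_load("user:1", load_user)
b = cache.get_or_load("user:1", load_user)  # served from cache, no second load
```

The eviction-policy requirement above is what replaces this sketch's naive dictionary in production: Redis bounds memory and evicts per its configured policy.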
## Development and Collaboration
These are capabilities every team needs. Choose tools that fit your organization, but make sure they provide these patterns.
### Version Control
| Capability Required | Why | Examples |
|---|---|---|
| Git-based repository | Source control, branching, code review | GitHub, Bitbucket, Azure DevOps Repos, GitLab |
Key requirements:
- Pull request/merge request workflow
- Code review tools and approvals
- Protected branches (main/master)
- Webhook support for CI/CD integration
- Branch policies (require reviews, status checks)
See Managing Our Source for branching strategies and workflow.
### Documentation
| Capability Required | Why | Examples |
|---|---|---|
| Wiki or documentation platform | Architectural decisions, runbooks, team knowledge | Confluence, Notion, GitBook, GitHub Wiki, Azure DevOps Wiki |
Key requirements:
- Search functionality
- Version history
- Access control (public vs. team vs. private)
- Markdown support
- Link validation
See Importance of Documentation for documentation standards.
### Issue Tracking
| Capability Required | Why | Examples |
|---|---|---|
| Issue and project tracking | Feature planning, bug tracking, sprint management | Jira, GitHub Issues, Azure Boards, Linear, GitLab Issues |
Key requirements:
- Backlog management
- Sprint/iteration planning
- Integration with VCS (link commits to issues)
- Custom workflows (if needed)
- Labels/tags for categorization
### CI/CD Pipeline
| Capability Required | Why | Examples |
|---|---|---|
| Pipeline automation | Build, test, deploy automation | GitHub Actions, Azure Pipelines, GitLab CI/CD, CircleCI |
Key requirements:
- Triggered by source control events (push, PR)
- Matrix builds (test multiple versions/platforms)
- Artifact storage and versioning
- Deployment approvals and gates
- Secret injection (vault integration)
See Builds and Deployments for pipeline patterns and Versioning Our Solutions for artifact versioning.
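If your team happens to use GitHub Actions, a pipeline covering several of these requirements (source-control triggers, matrix builds) might look like the sketch below; Azure Pipelines and GitLab CI/CD have direct equivalents:

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Matrix build: run the suite against multiple runtime versions
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: pytest
```

Deployment approvals, artifact publishing, and vault-backed secret injection layer on top of this skeleton; see Builds and Deployments for those patterns.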
### Metrics and Monitoring
| Capability Required | Why | Examples |
|---|---|---|
| Prometheus-compatible metrics | Application and infrastructure monitoring | Prometheus, our APM of choice, Grafana Cloud, Azure Monitor, AWS CloudWatch (with exporters) |
Key requirements:
- PromQL query language support (or equivalent)
- Alerting rules and notification routing
- Dashboard creation and templating
- Long-term metrics storage
- Integration with OpenTelemetry
See Observability: Metrics for instrumentation patterns and Observability: Distributed Tracing for OpenTelemetry implementation.
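As a taste of PromQL, the RED-style latency query below assumes a conventional histogram metric name (`http_request_duration_seconds`); adapt it to whatever your instrumentation actually emits:

```promql
# 95th-percentile HTTP request latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```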
## Local Development Tools
Developers need minimal tooling to build and test locally. These tools should be installed on every development machine.
**Required tools:**

- **Docker Desktop or Podman**: Container runtime for local development and testing
- **kubectl**: Kubernetes CLI for interacting with clusters
- **Helm**: Kubernetes package manager for deploying charts
- **Git**: Version control client
- **Language-specific SDK**: Go 1.21+, .NET 8+, Node.js 20+, Python 3.11+ (as needed)

**Recommended tools:**

- **k9s or Lens**: Kubernetes cluster UI/TUI for easier debugging
- **jq**: JSON processing for working with API responses and logs
- **curl or httpie**: HTTP client for API testing
- **yq**: YAML processing for working with Kubernetes manifests

**Local environment:**

Use Docker Compose for running dependencies locally (databases, message queues, caches). Every repository should include a `docker-compose.yml` file that spins up required services with sensible defaults.
Example local stack:
```yaml
# docker-compose.yml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: localdev
      POSTGRES_USER: myapp
      POSTGRES_DB: myapp
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  azurite: # Azure Storage emulator
    image: mcr.microsoft.com/azure-storage/azurite
    ports:
      - "10000:10000" # Blob
      - "10001:10001" # Queue
```
Developers should be able to run `docker compose up` and have a working local environment. Configuration should use environment variables or local config files (never committed) that point to these local services.
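For example, a local config file (kept out of source control) matching the compose stack above might look like this; the variable names are illustrative, not a required convention:

```ini
# .env.local -- never committed; points the app at docker-compose services
DATABASE_URL=postgresql://myapp:localdev@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
# "UseDevelopmentStorage=true" is the standard shorthand for the local Azurite emulator
AZURE_STORAGE_CONNECTION_STRING=UseDevelopmentStorage=true
```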
## Tool Governance
Technology choices aren't permanent. Tools move through lifecycle states based on adoption, support, and business needs.
### Lifecycle States
**Evaluating**: We're testing this technology on non-critical projects to see if it fits our needs.

- No production usage yet
- Spike projects and proof-of-concepts only
- Document findings and decision criteria
- Time-boxed evaluation period (typically 3-6 months)

**Approved**: This technology is vetted and ready for production use.

- Documented in this guide
- Support and training available
- Used in production by at least one team
- Maintained by vendor/community with regular updates

**Deprecated**: We're phasing this technology out, but existing usage is supported.

- No new projects should use this
- Existing projects should plan migration
- Security patches and critical fixes only
- End-of-life date communicated

**Retired**: This technology is no longer supported.

- No exceptions for new usage
- Existing usage must be migrated
- Documentation archived
- Lessons learned captured
### Proposing New Tools
Have a technology that should be in this guide? Here's how to propose it:
1. **Document the need**: What problem does this solve that current tools don't?
2. **Evaluate portability**: Does it work on all major cloud providers? Is it based on open standards?
3. **Assess operational impact**: What's the learning curve? What new skills are needed? What's the operational overhead?
4. **Pilot it**: Use it in a non-critical project and document the experience.
5. **Present findings**: Share results with the architecture review board (or equivalent), including pros, cons, and migration costs.
6. **Decide together**: Tool adoption affects everyone - make it a team decision, not an individual one.
### Annual Review
Every year, we review this guide to ensure it stays relevant. Questions we ask:
- Are version requirements still current?
- Have better alternatives emerged?
- Are deprecated tools ready for retirement?
- Should any approved tools move to deprecated status?
- What new technologies should we evaluate?
Technology moves fast - our standards should evolve thoughtfully, not chaotically.
## Cross-References
Implementation details for these tools and patterns are documented elsewhere:
- Observability: Metrics - Prometheus instrumentation and RED/USE methods
- Observability: Distributed Tracing - OpenTelemetry implementation and sampling
- Observability: Logging - Structured logging and aggregation patterns
- Scale and High Availability - Caching strategies and message queue patterns
- Flexible Application Configuration - Configuration management and secrets handling
- Security in Development - Security tooling and policy enforcement with OPA
- Builds and Deployments - CI/CD pipeline patterns
- Versioning Our Solutions - Artifact versioning and release management
- Managing Our Source - Git workflows and branching strategies
- Deliver Solutions That Work - Testing strategies and quality assurance
## Summary
Good tooling standards balance consistency with flexibility. Be specific about portable standards (Docker, Kubernetes, OpenTelemetry) and pattern-based about cloud services (managed databases, message queues).
The goal isn't standardization for its own sake - it's enabling teams to build reliable, portable, cloud-native solutions without reinventing the wheel every time.
Choose patterns over products. Build for portability. Let managed services handle undifferentiated heavy lifting. Focus on business value, not infrastructure complexity.