Cloud Tagging Best Practices for Better Cost Allocation, Part 2

Learn cloud tagging strategies that work at scale and how to tag resources with Infrastructure-as-Code (IaC)


In the previous post, we discussed tagging 101, how tagging can help with cloud cost reporting, how to promote a tagging strategy that supports cost attribution in your business and engineering context, and how to cultivate a culture of cloud resource ownership at scale.

This post continues the series and discusses tagging strategies that work at scale and how to tag resources with Infrastructure-as-Code (IaC). We suggest key-value pairs (tags) that could fit your environments, along with a tag hierarchy. Use this as a reference for your tagging enhancements.

Cloud Tagging Hierarchy

Tagging cloud resources is the first step toward cost visibility and attribution. Tags are needed at several levels to ensure precise attribution.

From a business perspective, you could tag based on your organization structure, which is usually built on cost centers and organizational hierarchies.

From an infrastructure perspective, a cloud resource tagging hierarchy looks like the following:

  • Account/project level tags
  • Cloud Infrastructure resource level tags
  • Microservice tags (K8s)

Let's explore each in detail.

ACCOUNT-LEVEL TAGS

These tags help with cost attribution at the account level (AWS) or project level (GCP) and are foundational steps toward budgeting and forecasting, as well as building chargeback/showback models.

Note: Having tags defined as required in the list below will help your organization’s tagging strategy thrive. We’ve seen large enterprises use 1:1 mapping for accounts and teams. This works well when there’s a landing zone structure in place; otherwise, it adds a lot of overhead. 1:1 mapping also helps build precise cost attribution and chargeback models. (Network costs and other shared costs are quite tricky to handle, and this mapping helps with that.)

  • service_owner tag: If the account is owned by a service/team and not shared across teams (Required)
  • point_of_contact tag: Email of the owner of the account or requester of the account (Required)
  • tf_managed / cf_managed tag: Indicates whether the cloud resource is managed via Terraform (tf), CloudFormation (cf), or another Infrastructure-as-Code (IaC) tool (Required; if managed by Terraform, set it to true; otherwise, false)
  • cost_center tag: This tag identifies the cost center that the resources belong to (Required)
  • executive_sponsor tag: Could represent costs and expenditures from an executive perspective. This tag can promote budget alignment for cloud spending (Required)
  • cob (cost_of_business): Indicates whether it’s direct production costs vs. R&D (non-prod)
    • Values: opex or cogs (Operational Expense or Cost of Goods Sold)
  • account_name: Name the account with naming conventions that the organization uses (Required)
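
A required-tag list is only useful if something checks it. Below is a minimal Python sketch of such a check for the account-level tags listed above. The function name and the idea of returning a list of problems are illustrative assumptions, not part of any cloud provider's API; in practice the tags dictionary would come from your cloud inventory or IaC plan output.

```python
# Required account-level tag keys, taken from the list above.
REQUIRED_ACCOUNT_TAGS = (
    "service_owner",
    "point_of_contact",
    "cost_center",
    "executive_sponsor",
    "account_name",
)
# At least one IaC marker tag must be present (assumption: either key is acceptable).
IAC_TAGS = ("tf_managed", "cf_managed")
ALLOWED_COB = {"opex", "cogs"}


def validate_account_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the tags are compliant."""
    problems = []
    for key in REQUIRED_ACCOUNT_TAGS:
        if not tags.get(key, "").strip():
            problems.append(f"missing required tag: {key}")
    if not any(tags.get(key) in ("true", "false") for key in IAC_TAGS):
        problems.append("tf_managed or cf_managed must be set to true or false")
    cob = tags.get("cob")  # optional tag, but its value is constrained when present
    if cob is not None and cob not in ALLOWED_COB:
        problems.append(f"cob must be opex or cogs, got {cob!r}")
    return problems
```

Returning every problem at once, rather than failing on the first, makes the output easier to hand to the account's point_of_contact.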

Note: These tags should be enough when accounts/projects are owned by a single engineering team and not shared with others. This level of granularity won’t be sufficient for cost attribution in shared accounts or for microservices. In the sections below, we’ll cover how to overcome that problem and allocate costs for shared services.

CLOUD INFRASTRUCTURE RESOURCE-LEVEL TAGS

A cloud account is often shared by several teams. In these scenarios, tagging has to be performed at the resource level to build cost attribution.

Note: From a cloud resource deployment perspective, the recommendation is to use environment (dev, qa, staging, prod, etc.) to identify resources from specific environments.

  • service_name tag: Name of the service the resource belongs to (Required)
    • Example: front-end
  • service_owner tag: Name of the Eng. team that owns the service (Required)
    • A manager or an IC responsible for the service
    • Pro tip: A team alias works best in this scenario
    • If it’s a shared service, we need to add multiple owners separated by “,”
  • shared_service tag: Takes boolean values (yes or no as a value). If the value is yes, we’ll have to add all the service owners under the service_owner tag separated by “,”
    • While this does not solve the problem of cost attribution to teams, central cloud teams will know who consumes the resource
    • Shared services cost attribution will be discussed further in the next section
  • cost_center tag: Could be used to identify resources under a business unit
  • cob (cost_of_business): Indicates whether it’s direct production costs vs. R&D (non-prod)
    • Values: opex or cogs (Operational Expense or Cost of Goods Sold)
  • managed_by tag: Team alias/IC email of the team that manages the service
  • point_of_contact tag: Username (everything before @Organization.net in your official email, not an alias) of the primary POC for that service (Conditional)
  • requestor tag: The name of the team that requests the service; might be required in cases of creating an account (Conditional)
  • env tag: dev, test, qa, prod, etc. If the infrastructure falls under one of these categories, adding this tag is essential (Conditional)
  • name tag: Any special names with which the team can identify a resource meaningfully. This is for service owners to decide how they can name their services to identify quickly (Conditional)
    • Example: us-west-2a-front-end-01
  • tf_managed / cf_managed tag: Indicates whether the resource is managed via Terraform or CloudFormation (Required; if managed by Terraform, set it to true; otherwise, false. Ideally, all infrastructure should be managed via Terraform.)
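
The shared_service rule above (a yes value means every owning team is listed in service_owner, separated by commas) is easy to get wrong by hand. Here is a small Python sketch that parses the owners and flags inconsistent combinations. The function name and error messages are hypothetical, and treating "multiple owners without shared_service=yes" as an error is an assumption about how strictly you want to enforce the convention.

```python
def parse_service_owners(tags: dict[str, str]) -> list[str]:
    """Split the service_owner tag on commas and check it against shared_service."""
    owners = [o.strip() for o in tags.get("service_owner", "").split(",") if o.strip()]
    is_shared = tags.get("shared_service", "no").strip().lower() == "yes"
    if not owners:
        raise ValueError("service_owner is required")
    if is_shared and len(owners) < 2:
        raise ValueError("shared_service=yes needs every owning team in service_owner")
    if not is_shared and len(owners) > 1:
        raise ValueError("multiple owners listed but shared_service is not yes")
    return owners
```

For example, `parse_service_owners({"service_owner": "team-a, team-b", "shared_service": "yes"})` yields the two team names with surrounding whitespace removed.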

Tip: Resource Cleanup-related Tags (Save dollars with these tags + automation)

Pro tip #1: This tag helps to clean up resources that are no longer needed after a particular date.

  • remove_after_date tag (Required for resources created outside of the regular IaC process and for other temporary environments): If additional infrastructure is created in response to an incident or for testing purposes, this tag helps remove those cloud resources once the specified date has passed and they are no longer needed.
    • Example: remove_after_date = “12/21/2021”
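
As a rough illustration of the automation side, the Python sketch below decides whether a resource is past its remove_after_date. It assumes the MM/DD/YYYY format used in the example above and that "after" means strictly later than the tagged date; the function name is hypothetical, and the actual deletion call depends on your cloud provider and is left out.

```python
from datetime import date, datetime


def is_past_removal_date(tags: dict[str, str], today: date) -> bool:
    """True when the resource carries remove_after_date and that date has passed."""
    raw = tags.get("remove_after_date")
    if raw is None:
        return False  # untagged resources are never selected for cleanup
    remove_after = datetime.strptime(raw, "%m/%d/%Y").date()
    return today > remove_after
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check easy to test and to dry-run against a future date.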

Pro tip #2: This tag helps to shut down non-prod resources when they are not needed, such as outside business hours.

  • shut-down tag (boolean): This tag is to be used for non-prod workloads where resources can be turned off during non-business hours and weekends.
    • For instance, if this is set to true, then a Lambda function or some other automation script can turn off a resource with this tag at 5 pm and bring it back at 8 am. (You can set the schedule that works best for you.)
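
The scheduling decision itself is small enough to sketch in Python. This is only the "should it be running right now?" logic under the 8 am to 5 pm weekday schedule mentioned above; the function name and default hours are assumptions, and the calls that actually stop or start a resource are provider-specific and not shown.

```python
from datetime import datetime


def should_be_running(tags: dict[str, str], now: datetime,
                      start_hour: int = 8, stop_hour: int = 17) -> bool:
    """Decide whether a resource should be up at the given local time."""
    if tags.get("shut-down", "false").strip().lower() != "true":
        return True  # resource has not opted in to scheduled shutdown; leave it alone
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < stop_hour
```

A scheduled job could call this for each tagged resource and stop or start it whenever the result differs from the resource's current state.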

In my previous life, we stopped QA clusters using K8s labels in combination with cronjobs, and EC2 instances using AWS EC2 tags in combination with a Lambda function.

Security-related Tags

This is a curated list of tags that could come in handy for security teams with regard to incident response and similar processes. While these tags do not directly contribute to cost allocation, they can help your security teams implement guardrails and automate processes.

  • criticality tag: This tag tells security teams how critical an environment or resource is. It could help set up automated incident response based on that criticality.
    • low
    • medium
    • high
    • business unit-critical
    • mission-critical
  • dr (disaster recovery) tag: This tag may be useful for cloud infra teams to identify failover environments during dr. This could potentially help to identify the costs of running dr.
    • mission-critical
    • critical
    • essential
  • security:incident_response
    • pii (Required): This tag helps identify if env contains PII. This can help identify costs for securing envs and enable IAM guarding policies.
      • False
      • True

MICROSERVICES TAGS (K8s LABELS)

Kubernetes Cluster-level Tags

All our Kubernetes clusters must contain the following tags at the instance level.

  • cluster_name tag: Organization-* name for the cluster (Required)
    • Example: prodn1 or Organization-prodn1
  • tf_managed tag: Used to indicate it’s managed via Terraform (Required; if managed by Terraform, set it to true; otherwise, false. Launching all cloud infrastructure via Infrastructure-as-Code is a recommended best practice.)

Kubernetes Resource-level Tags (popularly known as Labels)

In a microservice architecture, container resources must be labeled to achieve precise cost attribution. In Kubernetes, tagging is done via labels. All our Kubernetes resources (pods, nodes, etc.) must contain the following labels.

  • service_name label: Name of the service the resource belongs to (Required)
    • Example: back-end
  • point_of_contact label: Name of the primary POC for that service. A manager or an IC responsible for the service. (Required)
  • service_owner label: Name of the engineering team that owns the service. (Required)
  • shared_service label: Takes boolean values (yes or no). If the value is yes, we’ll have to add all the service owners under the service_owner label separated by “,”. (Conditional)
  • env label: dev, test, qa, prod, etc. If infra. falls under one of these categories, you must add this tag. (Conditional)
  • name label: Any special names with which teams can identify their resources meaningfully. (Optional; this is for service owners to decide how they can name their services to identify easily.)
    • Example: region-az-service-#
  • remove_after_date label: If any additional infrastructure is created in response to an incident or for testing purposes, this label helps remove it after the specified date.
    • Example: remove_after_date = “12/21/2021”
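
One caveat when carrying these conventions over to Kubernetes: label values are more restricted than cloud tags. A label value can be at most 63 characters, and unless it is empty it must begin and end with an alphanumeric character, with only alphanumerics, dashes, underscores, and dots in between. Values such as 12/21/2021, a comma-separated owner list, or an email address are therefore not valid label values (annotations are one alternative for free-form values). The Python sketch below checks a value against that rule; the helper that rewrites the date as YYYY-MM-DD is a hypothetical workaround, not something the list above prescribes.

```python
import re

# Kubernetes label value syntax: empty, or alphanumeric at both ends with
# '-', '_', '.' and alphanumerics allowed in between (63 characters maximum).
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Check a string against the Kubernetes label value rules."""
    return len(value) <= 63 and _LABEL_VALUE.fullmatch(value) is not None


def to_label_date(us_date: str) -> str:
    """Rewrite MM/DD/YYYY as YYYY-MM-DD, which is a valid label value."""
    month, day, year = us_date.split("/")
    return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
```

Running a check like this in CI against manifests can catch invalid label values before the API server rejects them.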

How to Tag AWS Resources via Terraform (tf)

AWS resources can be tagged with an Infrastructure-as-Code (IaC) tool; either Terraform or CloudFormation works. Terraform has gained popularity due to its cloud-agnostic approach to IaC, its state management, and its integrations with cloud vendors.

Terraform Example

First, create a block like the one below in main.tf that calls the tags module from terraform-utils.

module "account_tags" {
  source = "github.com/Organization-dev/terraform-utils/modules/tags/account-tags?ref=v0.1"

  account_name  = "devtest"
  service_owner = "Ops"
  requestor     = "ops@Organization.net"
}

Then apply those tags to every resource the provider creates by populating default_tags with the output of the module defined in main.tf above.

provider "aws" {
  region  = "us-west-2"
  profile = "Organization-devtest"

  default_tags {
    tags = module.account_tags.tags
  }
}

How to Tag Your K8s Resources

Here’s how to label your K8s resources via the labels section of a manifest file.

apiVersion: v1
kind: Pod
metadata:
  name: front-end-app
  labels:
    env: dev
    app: nginx
    service_owner: team-xyz
    service_name: nginx

Shared Services Cost Attribution

Shared costs range from taxes, support fees, and credits to databases and microservices. A shared service could be an S3 bucket, an RDS database consumed by multiple teams, a K8s microservice used to process data, an EMR cluster that processes data, etc. Building chargeback models for shared services is painful.

Cost attribution for taxes, support fees, and credits

  • These charges from the cloud service provider (CSP) are not split across accounts/projects; they’re usually charged at the payer account level with no granular attribution.
  • A fair way to charge back (allocate these charges to engineering teams) is to split them in proportion to each team’s cloud spending.
    • Example: If Team A spends 10% of the total bill, 10% of taxes, support fees, and credits should be allocated to Team A.
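
The proportional split described above can be sketched in a few lines of Python. The function name and the input shape (a mapping of team name to direct spend) are illustrative assumptions; the shared amount can be negative, which is how credits would be distributed.

```python
def allocate_shared_charges(team_spend: dict[str, float],
                            shared_charges: float) -> dict[str, float]:
    """Split payer-level charges across teams in proportion to their spend."""
    total = sum(team_spend.values())
    if total <= 0:
        raise ValueError("total team spend must be positive")
    return {
        team: shared_charges * spend / total
        for team, spend in team_spend.items()
    }
```

With Team A at 1,000 of a 10,000 total bill (10%) and 500 in taxes and support fees, Team A is allocated 50 and the remaining teams share the other 450.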

Cost attribution for shared services like RDS and K8s (microservices)

Shared services chargeback is often tricky. Every situation can be unique; there is no one-size-fits-all solution. Here are some pointers that can help address this problem.

A meaningful, quantifiable metric is vital in building a shared-service chargeback framework. Examples of such metrics could be:

  • Time taken for the query to run: This metric can help build a chargeback model for associating costs to customers. This will help identify which customers are profitable.
  • Amount of data transferred/processed: This metric helps charge back costs on a system that is heavy on data processing.
    • A way to chargeback here is to proportionally distribute costs to users based on the amount of data processed.

For shared microservices, use systems metrics (% CPU utilization, % Memory utilization) together with # of API invocations and/or Network IOPS.

  • Use a combination of these metrics to proportionally distribute resource costs such as EC2, S3, RDS, etc.
  • You could use a formula that looks like this:
    • shared_service_cost = cloud_resource_cost * (percent_distribution_of_metric1 * weight_of_metric1 + percent_distribution_of_metric2 * weight_of_metric2 + …)
    • cloud_resource_cost refers to the actual price of cloud resources such as EC2, S3, RDS, etc.
    • weight_of_metric needs consensus from within engineering teams and service owners
    • Use the metrics that make sense for your workloads. The proportional weight of metrics is a prerequisite to building this model.
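
To make the formula concrete, here is a Python sketch that computes one team's share of a shared resource. It assumes each percent_distribution is expressed as a fraction between 0 and 1 (the team's share of that metric) and that the agreed weights sum to 1, so that the shares of all teams add up to the full resource cost; both are assumptions layered on top of the formula above, and the metric names are made up for the example.

```python
def shared_service_cost(cloud_resource_cost: float,
                        metric_shares: dict[str, float],
                        metric_weights: dict[str, float]) -> float:
    """Apply the weighted-metric formula for a single team."""
    if set(metric_shares) != set(metric_weights):
        raise ValueError("every metric needs both a share and a weight")
    if abs(sum(metric_weights.values()) - 1.0) > 1e-9:
        raise ValueError("metric weights must sum to 1")
    # Blend the team's share of each metric using the agreed weights.
    blended_share = sum(
        metric_shares[name] * metric_weights[name] for name in metric_weights
    )
    return cloud_resource_cost * blended_share
```

For a resource that costs 1,000, a team responsible for 40% of CPU utilization and 20% of API invocations, with both metrics weighted equally at 0.5, would be charged about 300.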

Note: This requires additional metrics such as CPU, memory, network IOPS, and storage to be collected in a centralized monitoring system such as Prometheus or CloudWatch in order to report percent_distribution_of_metrics.

AWS has published a blog post on implementing tenant-based cost allocation, and such a model can be applied to shared services cost attribution for K8s, RDS, and S3.

Parting Thoughts

Tagging cloud resources is essential for granular cost visibility and allocation. Tags are needed at multiple levels, including account/project-level and cloud infrastructure resource-level tags, to ensure precise cost attribution. The tags recommended here are what we’ve seen work best at scale in large enterprises.

In addition, tags can help with security, with automation for incident response, and with acting on cost savings opportunities. Shared services cost allocation can be tough; we have discussed tagging and metric-based strategies for attributing those costs.

Remember, though, that to be truly useful to the business, tags should be used in the context of business hierarchies and rolled up to teams, departments, business units, divisions, etc. When your infrastructure or business organization changes, it should be relatively easy to update your tags accordingly. Manually tagging and retagging resources is a surefire way to waste engineering resources and lower morale. Your cloud cost management tool should be able to make tag management simple.

Contact us to learn more about how Yotascale simplifies cloud tagging management.