Thoughts on service provider rooted identities

When I started in datacenters, we had places where we used network addressing and layer 3/layer 4 firewalls as the primary way to manage service identity for the internal services that needed it. Even then there were annoying attack scenarios to deal with in multitenant environments, along with compromise scenarios targeting your layer 2 and 3 infrastructure (MAC spoofing, ARP poisoning, etc.), which meant it was never a totally reliable source of identity information. Luckily, for those reasons and other trends, people have largely moved away from source IP as a source of workload identity.

Network Service

With the advent of highly automated infrastructure systems and the birth of “cloud” infrastructure, other approaches to providing identity to compute workloads emerged. The best known of these is probably the AWS EC2 metadata service. Fundamentally, this method boils down to using a link-local address to let the workload call an API over the network to get information about itself. This includes operational characteristics as well as the user data that powers cloud-config, which drives many of the first-boot features like executing arbitrary scripts or configuring the system based on data provided by the metadata service.
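To make that concrete, here is a minimal sketch of querying the link-local endpoint from inside an instance, using the IMDSv2 session-token flow AWS now recommends; the paths are the standard documented EC2 metadata paths, and none of this works from outside the instance.

```python
import urllib.request

IMDS = "http://169.254.169.254"

def imds_token(ttl_seconds: int = 21600) -> str:
    # IMDSv2: fetch a session token first; plain GETs are rejected when the
    # instance is configured to require tokens.
    req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def imds_get(path: str, token: str) -> str:
    req = urllib.request.Request(
        f"{IMDS}{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

token = imds_token()
print(imds_get("/latest/meta-data/instance-id", token))                # e.g. i-0abc...
print(imds_get("/latest/meta-data/placement/availability-zone", token))
```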

However, if the instance is configured with an IAM role and instance profile, this service also provides a set of AWS IAM credentials that can be used to call permitted AWS services (based on the access configuration of that role), and by default those credentials are effectively accessible to any process that can source network traffic from the system. Another endpoint this metadata service provides is an identity document: a cryptographically signed document containing configuration information about the instance. Because this identity document is made available only to the workload itself via the link-local address, people also use possession of it as proof of workload identity.
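A sketch of what retrieving those two things looks like, again over the link-local address with an IMDSv2 token; the field names are the ones the documented endpoints return, and as above this only runs on an EC2 instance with an instance profile attached.

```python
import json
import urllib.request

IMDS = "http://169.254.169.254"

def _imds(path, method="GET", headers=None):
    req = urllib.request.Request(IMDS + path, method=method, headers=headers or {})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

# IMDSv2 session token, as in the previous sketch.
token = _imds("/latest/api/token", method="PUT",
              headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
auth = {"X-aws-ec2-metadata-token": token}

# Temporary credentials for the attached instance profile role.
role = _imds("/latest/meta-data/iam/security-credentials/", headers=auth).strip()
creds = json.loads(_imds("/latest/meta-data/iam/security-credentials/" + role, headers=auth))
# creds carries AccessKeyId, SecretAccessKey, Token, and an Expiration timestamp.

# The signed identity document and its PKCS7 signature.
doc = json.loads(_imds("/latest/dynamic/instance-identity/document", headers=auth))
pkcs7 = _imds("/latest/dynamic/instance-identity/pkcs7", headers=auth)
print(doc["instanceId"], doc["accountId"], creds["Expiration"])
```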

This approach has some shortcomings, many of which these service providers have worked to mitigate. AWS, on its documentation page for the metadata service (at the time of writing), calls out one of the largest: any workload that can communicate over the network to that link-local address can see all of that data and those credentials. Outside of obvious attacks resulting in system compromise, there are a surprising number of attack vectors (server-side request forgery being the classic one) that allow an attacker to make a vulnerable workload or system perform a web request on their behalf – and with this network metadata approach, all of them potentially expose your service identity and credential data.
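The main mitigation AWS added is IMDSv2, which is what the session token in the earlier sketches is for: callers must first obtain a token with an HTTP PUT, and the token response can be limited to a single network hop so it never makes it past the instance itself. Turning that enforcement on for an existing instance looks roughly like this with boto3; the instance ID is a placeholder.

```python
import boto3  # assumes AWS credentials and a region are configured in the environment

ec2 = boto3.client("ec2")

# Require IMDSv2 session tokens and keep token responses from crossing an
# extra network hop, which blunts many SSRF-style and proxied attacks
# against the link-local metadata endpoint.
ec2.modify_instance_metadata_options(
    InstanceId="i-0123456789abcdef0",   # placeholder instance ID
    HttpTokens="required",              # reject IMDSv1-style unauthenticated GETs
    HttpPutResponseHopLimit=1,          # token responses stop at the instance itself
    HttpEndpoint="enabled",
)
```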

It is easy to build broad API support for this credential discovery approach. The service provider can bake it into their own libraries and SDKs, as AWS does, and third parties can support it by accepting presigned URLs generated with instance role credentials or by validating the instance identity document.
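One concrete shape of the presigned-URL variant, similar in spirit to what tools like aws-iam-authenticator do: the workload signs an sts:GetCallerIdentity request with its instance role credentials and hands the URL to the verifier, which replays it and trusts whatever identity AWS reports back. This is a sketch under the assumption that boto3’s generic presigning behaves as expected for this call, not a tested recipe; the region is arbitrary.

```python
import urllib.request
import boto3

# On the workload: sign a GetCallerIdentity request with whatever credentials
# the default chain finds (on EC2, the instance role from the metadata
# service) and hand the resulting URL to the service that needs to know who
# we are.
sts = boto3.client("sts", region_name="us-east-1")
presigned = sts.generate_presigned_url("get_caller_identity", ExpiresIn=60)

# On the verifier: replay the URL. Only AWS can answer it correctly, and the
# response names the account and role ARN the caller's credentials belong to.
with urllib.request.urlopen(presigned, timeout=5) as resp:
    print(resp.read().decode())   # XML containing Arn, UserId, Account
```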

Application Environment

Another approach we see, particularly beyond the IaaS level, is providing credentials to applications by injecting data into the process’s runtime environment. This broadly comes in two flavors: actual process environment variables, or a file provided within the filesystem presented to the process. The dangers of storing credentials and secrets in environment variables have been documented extensively, but in summary, neither infrastructure systems nor, in many cases, applications themselves (or their dependent libraries and frameworks) protect the process environment as if it contained secret data. It is still common to find logs that freely record environment variables or send them as part of exception reports or debug error messages/screens. Applications that allow attackers to read arbitrary files from the filesystem are rarer, but they do exist – generally by honest developer mistake or misconfiguration rather than by design, as many of the environment variable exposures appear to be.
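To make that exposure concrete, here is a small Linux-only sketch (the variable name is made up) of how freely environment data travels once it is set:

```python
import os
import subprocess

# Anything placed in a process's environment is inherited by every child it
# spawns, and on Linux a process's starting environment is readable by
# anything running as the same user via /proc/<pid>/environ. One careless
# subprocess, crash reporter, or debug endpoint is enough to leak it.
env = dict(os.environ, API_TOKEN="not-actually-secret-here")  # hypothetical variable

# The child process sees the value without asking for it...
subprocess.run(["sh", "-c", "echo child sees: $API_TOKEN"], env=env)

# ...and so can any same-user process that reads its /proc entry, shown here
# by the child dumping its own initial environment from /proc/self/environ.
subprocess.run(
    ["sh", "-c", "tr '\\0' '\\n' < /proc/self/environ | grep ^API_TOKEN"],
    env=env,
)
```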

These patterns are commonly used with Kubernetes secrets or config maps, and with container PaaS and function-as-a-service platforms like Lambda. The filesystem approach in particular feels like the most prudent path for a service provider to me at this point.
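A minimal sketch of the file-based pattern from the application’s side, assuming the scheduler (a Kubernetes Secret volume, for example) mounts the credential at a path you choose; the mount point and token name here are hypothetical:

```python
from pathlib import Path

# The scheduler mounts the credential into the container; this path is
# whatever mount point you configure, not a platform default.
TOKEN_PATH = Path("/var/run/secrets/myapp/service-token")   # hypothetical mount point

def load_service_token() -> str:
    # Read at call time rather than at import so a rotated or refreshed
    # mount is picked up without restarting the process.
    return TOKEN_PATH.read_text().strip()

token = load_service_token()
# Use the token to authenticate outbound calls; never log it or copy it into
# the process environment, which would reintroduce the leak paths above.
```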

It is noteworthy that these Application Environment methods require more awareness of, and integration between, the scheduler and the workload. In high-level cases like function- or container-as-a-service, this probably feels natural. However, it gets more complex if you want to support many operating systems, and it depends heavily on the features available in your scheduler/hypervisor layer.

Why bother?

Very few workloads we deploy today are an island. Most communicate with service providers we leverage by consuming PaaS services, or with other services we write and deploy separately in a modern microservice environment. Traditionally we dealt with long-lived credentials that we stored and treated as secrets in our production configuration management or deployment processes.

There are other solutions to machine identity out there. Enterprise organizations have been dealing with Active Directory and its machine account dynamics on Windows for decades, and even earlier, systems like Kerberos had solutions to this problem. Most of these solutions do not fare well when applied to the service provider use case. I do not know of any service providers today that operate with multiple customers sharing an Active Directory environment, for instance, and I think many customers would be uncomfortable with that scenario if they did.

As technology and security practices have evolved, the benefits of moving to short-lived credentials have been widely discussed. They cap the duration of risk when someone gets access to credentials that have been used in your production environment, including attack vectors like an attacker compromising a production backup, snapshot, or set of log files that inadvertently included credentials for your environment.

Many of the advancements people think of as cloud are really highly automated infrastructure and modern deployment practices. Giving customers a secure root of identity can encourage them to adopt other useful infrastructure automation patterns, like treating servers as short-lived and replaceable rather than long-lived and maintained. This sort of identity can be foundational to first-boot scripts that interact with other services from your provider, or that reach out to your centralized identity provider to get the other credentials the workload needs to access other systems. Getting people to think about their infrastructure differently, rather than any technical capability, is the most impactful improvement most organizations make during a cloud transformation.

Infrastructure for side projects

Today’s cloud landscape has no shortage of options when it comes to deploying HTTP-based applications. For small ideas and side projects I still have a hard time justifying running an entire container cluster, so I find myself reaching for options like Google Cloud Run or AWS Fargate because they provide a clear path towards something like Knative if my small idea starts to grow into something more substantial. At some point, if you have enough of these small workloads running, the cost starts to push you toward building a cluster that multiple workloads could share.

The minimum footprint for a side project is a small amount of CI/CD configuration in GitLab, GitHub, Google Cloud Build, or AWS CodeBuild, plus a deployment target (or several, if you run a proper multi-stage deployment process with dev, staging, and production). This keeps it quick and inexpensive to get things up and running so you can iterate on your idea.

My toolchain of choice continues to change often because the technology landscape around PaaS and serverless deployment options is evolving rapidly. AWS Lambda, Google Cloud Functions, and similar offerings do not feel like the right fit, and the tradeoff between ease of use and vendor gravity doesn’t seem worth it to me at this moment. I have built many solutions using these services, and for many use cases they are a better fit than something container-based, but I have committed technology sins by using serverless functions in places I should not have. That scar tissue is still fresh.