Securing Your AI Agent Infrastructure
Governance, security, and cost control matter with AI more than ever
This is part of my series of blog posts on creating an AWS Bootstrap Script to set up secure AI agent infrastructure.
I explained my journey down the path of using AI to code a bootstrap script to secure an AWS organization in a prior post called What I’ve Vibe Coded in 2.5 Weeks. But that post was more about the value of AI in creating the script compared to the non-AI attempt than the actual contents of the script itself and the architecture of what I am deploying. I was able to do almost as much, or more, in 2.5 weeks as I did in four years in a prior attempt.
In this and the related posts I want to explain what is in the script and why. I started this journey to securely deploy batch jobs (now AI agents) because I wanted to quickly tell someone how to write a batch job and provide some code to do that. As I have explained before, an AI agent is like a batch job but uses AI somehow inside of it to carry out its task.
Giving someone code to run a batch job or AI agent is simple. I’ve written some blog posts on agents already. Writing some code to do something was never that hard, even before AI. The problem is that if I just gave someone code to run a batch job, where would they run it? I was promoting deterministic batch jobs for security use cases, but if people did not deploy them in a secure manner (like OpenClaw, poorly written AI vibe-coded software, on a Mac Mini on an unsecured network with access to way too much data) they might get hacked and blame me because I provided the code and the idea.
So I set out to show how to create a secure environment to run the batch jobs with proper networking, encryption, and IAM controls. But then it became a question of how do I deploy those things? And are the tools I use to deploy those things secure? And where do I start if I have a brand new AWS Organization? And what if I’m in a large company where different people are responsible for deployment of different resources like a network team and an encryption team? And how do I set up a production and development environment so I’m not testing and changing code in production directly (secure software development 101)?
If you are a new vibe coder, those are all the things that organizations do and in some cases must do to be compliant with regulations. There are reasons for all of that: to keep software secure. So the script I have been writing sets up an AWS account and automatically deploys a lot of security controls. It basically sets up my whole environment and AWS account to handle things a large company will need and a person trying to secure a smaller company should be doing as well.
What I realized when I tried to share my other posts with people new to security and AWS or even someone experienced in AWS was that they were a bit hard to follow. They are kind of a stream of consciousness where I was hacking on and researching how to deploy an AWS organization and various security controls. It was complicated and I hit a lot of bugs and cryptic error messages along the way and wrote about those. I also tested out some new ideas behind the scenes I never wrote about.
I’m going to start from the top in this series and explain how I’m deploying an organization these days with some new insights and ideas in a more step by step manner that will hopefully be easier to follow. I’m not going to go into the nitty gritty details so you’ll need to refer to my prior posts for some of that. It will incorporate a few new ideas.
I will start by explaining why. Why do I need to deploy all these things instead of just firing up an EC2 instance and running my agents on it or using a Lambda function without a VPC? Well, for one thing, all those vulnerabilities found by AI agents won’t be accessible if they are behind network controls or have other controls like encryption and MFA in place. A layered approach to security, otherwise known as Defense in Depth, will help you more than a single security control alone.
Risk, Security, and Governance
The goal of this method of architecting an organization is to build in security and governance to reduce risk from the ground up. For starters, you have to understand those terms and why they matter.
Risk: any vulnerability, misconfiguration, or architectural flaw that exposes your cloud environment to security breaches, compliance violations (i.e. fines and fees for not following industry or government regulations), or unexpected financial costs.
Security: The controls, rules, and policies that prevent malicious access, unintended harm, or a data breach in your AWS account (or in other words the goal of security is to reduce risk).
Governance: Processes used to ensure adherence to your security and internal organization rules and policies as well as those mandated by external regulations that could lead to fines and fees if you don’t follow them (to reduce risk).
The whole point of all of this is to reduce risk. Why do we care about risk? Because organizations exist to carry out a mission whether that’s a business mission or a non-profit or a government agency or branch of the military. The risk is what prevents achieving objectives and a successful mission. Risk can derail the mission entirely or severely degrade the outcome. It can also mean you miss out on other opportunities while dealing with the aftermath of a data breach or unexpected financial costs.
I’ll let you do your own research about the data breaches occurring in your industry, compliance regulations and fees, and businesses that closed down due to unexpected costs. Search on data breaches or data breaches in [industry] in Google and click the news link for starters. Also search on compliance action or compliance settlement. Search for AI vibe coded vulnerabilities and security problems caused by MCP servers. It’s all out there if you just take like two minutes to research it. Most people are more excited about what they built and how it works than the consequences of what can go wrong if attackers can get into that software.
If you’re a vibe coder with an idea and you’re coding it up and throwing it out there, consider what will happen if the data is accessed or stolen. Would you be liable? Can someone sue you? Are you losing some secret sauce that explains how you do your work? Are you giving away your IP to an AI model? Are attackers able to get onto your machine and steal whatever is on it like access to your email to reset your banking password and so on? Can they delete your emails, send emails on your behalf to trick other people, or steal cryptocurrency insecurely stored on your system?
You may think it doesn’t matter today because you are making money, but if you continue down a bad path, at some point you might not be. Small companies especially are often put out of business by a data breach. While larger companies can withstand one, the costs are astronomical in some cases and could be put to better use doing other things. In addition, some systems have been breached in ways that threaten our national security.
Mission-oriented architecture
There are various ways to deploy guardrails and rules to protect your organization. Many people have attempted to implement heavy-handed guardrails and restrictions that prevent organizations from getting things done in a timely manner. These attempts to control an organization too tightly generally result in total failure. Eventually your controls are completely kicked to the curb by people who need to ensure the organization can meet its objective.
If your guardrails are so stringent or unmanageable that your organization cannot function you have introduced another type of risk that is blocking your mission.
That said, if your controls are not stringent enough, they don’t really work. Period.
The balance of those two statements is what makes this problem especially challenging. We need to let agents go off and work autonomously, but they have been shown over and over to be untrustworthy. They will do things you don’t expect and that’s just the nature of the technology.
We need extra security in this age of AI with rogue agents doing all kinds of crazy things. Agents really require some detailed sandboxing controls but that starts with your cloud infrastructure - properly segregating non-production and production for one thing or what I like to call environments.
If you skip logging you may not be able to figure out what happened when something goes wrong. After a security breach that can cost extra money due to not being able to define the scope of a breach. If you don’t secure your logs properly an attacker or rogue agent might delete them. That’s worse than not having them in the first place. You paid for them and you can’t tell what happened.
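To make the point about protecting logs concrete, here is a minimal sketch of a service control policy (SCP) document, built in Python, that denies deleting log objects and tampering with CloudTrail. The bucket name `example-org-logs` and the function name are hypothetical, chosen only for illustration, and this is not necessarily what the bootstrap script deploys. A real policy would likely need an exception for a dedicated log administrator role.

```python
import json

# Hypothetical bucket name, used only for illustration.
LOG_BUCKET = "example-org-logs"


def build_log_protection_scp(log_bucket: str) -> dict:
    """Return an SCP document that denies deleting logs or stopping CloudTrail."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Block removal of log objects and their versions.
                "Sid": "DenyLogObjectDeletion",
                "Effect": "Deny",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{log_bucket}/*",
            },
            {
                # Block turning off or deleting the trail itself.
                "Sid": "DenyTrailTampering",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }


policy = build_log_protection_scp(LOG_BUCKET)
print(json.dumps(policy, indent=2))
```

Because an SCP is enforced outside the agent, it is a hard boundary and does not depend on the agent following a prompt.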
The other problem that makes a solution for this challenging is that every organization is different. Organizations deploy different applications, manage different projects, and come in all shapes and sizes. Some organizations are small startups or budget-conscious non-profits. Others are medium sized businesses with a number of different departments. Large companies may have multiple lines of business. In addition, companies have varying degrees of centralization or distributed ownership of security and cost controls.
Attempting to support all these organizations, applications, and business strategies, and every variation that may exist, may lead to an overly complex solution. The introduction of complexity weakens security. We want to keep things as simple as possible - but not too simple. If things are too simple, security holes exist. I’ve written about those phrases many times before including in my book Cybersecurity For Executives in the Age of Cloud.
Unfortunately organizations may still throw caution to the wind with well thought out controls when executives and developers don’t understand why they exist. It takes top down leadership and direction to do more than say “we care about security.” Executives need to have some concept of what that entails.
Otherwise it’s all talk and people at the bottom who don’t buy in will just do whatever they want. Trust me, I know. I’ve seen it work both ways at the companies where I’ve worked. And I’ve seen developers with good intentions do risky, insecure things when they didn’t even understand what they were doing and how it might enable a data breach. This is especially true as people start vibe coding all kinds of software.
No matter how hard you try to train everyone - including your AI agents - prompting doesn’t stop the unwanted actions. Controls do. But the controls need to be appropriately architected so people can do what they are supposed to do and need to do without being allowed to do the egregious and unnecessary things that put your organization at risk.
Fundamentals
In my book I explain a lot of the fundamentals that go into defining an organizational architecture that prevents data breaches. It’s like a security 101 class that defines the different aspects of security and how they apply to the cloud - but really they are the same principles that apply to any environment including one running AI agents. I also cover some of this in my class materials in the Learn Security portion of this blog.
Before you can build a proper control you need to fully understand the threats. My book explains the security fundamentals to consider as you define organizational governance and security controls and why they matter based on real world threats that exist and are affecting organizations every day.
Although AI seems to be changing everything right now, there are some things that don’t change - security, software engineering, and cost management principles. These same principles will be missing or improperly implemented if you’re using AI alone to implement your AWS infrastructure.
Application Versus Infrastructure Security
There are different layers and levels of security. My organizational bootstrap script is focused on security of the cloud infrastructure. The security of an individual application or framework is more dependent on the application specific code and configuration. That’s a whole other complex problem to solve.
The main thing missing from my book in the age of AI is prompt injection, but really that’s just another of the many forms of injection anywhere untrusted input can enter your application. If you’re not familiar with that concept I wrote about it here in this post on Cybersecurity For My Mom. Bad things will get injected into systems anywhere and any way an attacker can inject them to get the system to do their bidding.
You can also learn a lot more about application security in these posts on Secure Code By Design. They cover a lot of security basics like data validation, error handling, data types and other important security concepts you need to think about when writing code - and your agents are definitely not implementing all these things properly. I know because I check the code, and even if I explicitly give my agent rules to do these things it doesn’t always do them.
There are complexities that come into play when trying to track costs on an application by application basis. In the organizational structure I’m creating, you can create a separate organizational unit or separate account for each cost center within your AWS Organization. In general, aligning your invoices with these boundaries will simplify your life because you can code a single invoice to a single cost center. If you must split costs in more detail, your other option at this point is tagging, but right now not every resource supports tags.
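As a rough sketch of why account boundaries simplify cost allocation, the example below rolls up per-account charges into cost centers with a simple lookup. The account IDs, cost center names, and amounts are all made up for illustration, and the line items are assumed to come from whatever billing export you use.

```python
from collections import defaultdict

# Hypothetical mapping of AWS account IDs to cost centers.
ACCOUNT_COST_CENTER = {
    "111111111111": "website",
    "222222222222": "ai-agents",
    "333333333333": "ai-agents",
}


def costs_by_cost_center(line_items, account_map):
    """Sum (account_id, amount) pairs into totals per cost center."""
    totals = defaultdict(float)
    for account_id, amount in line_items:
        # Accounts with no mapping are surfaced rather than silently dropped.
        center = account_map.get(account_id, "unassigned")
        totals[center] += amount
    return dict(totals)


# Made-up charges for illustration only.
line_items = [
    ("111111111111", 12.5),
    ("222222222222", 30.0),
    ("333333333333", 7.5),
    ("444444444444", 1.0),
]

totals = costs_by_cost_center(line_items, ACCOUNT_COST_CENTER)
print(totals)
```

When every account belongs to exactly one cost center, no per-resource tagging is needed to produce these totals.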
If you are not familiar with AWS Organizations check out the documentation as I’m going to presume you know something about this and not re-explain it all. Though you may get the gist of it by reading my posts.
https://docs.aws.amazon.com/organizations/
Naming conventions
Using a bootstrap script ensures things are named consistently in each account and environment. I can quickly identify the environment in which a resource exists and who created it when looking at logs. In addition, I can define policies based on resource names if they are all named consistently. I’ve written a lot of other posts where I explored naming conventions and how they can help you more easily identify resources and write policies:
https://medium.com/cloud-security/aws-resources-organization-and-naming-conventions-262676d6e202
https://medium.com/cloud-security/cloud-network-architecture-and-naming-conventions-97f5a2b89ab9
I haven’t perfected my naming convention in this script (yet) as described in the above posts, but it does define the environment in which a resource exists in a consistent way. It would be easy to add some code to define who deployed the resources as I did in my prior posts. I might do that later and I will if I start allowing someone else to work in my environment.
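Here is a minimal sketch of how consistent names can feed into policies. The name format (environment, then service, then purpose) and both function names are assumptions for illustration, not the exact convention the script uses.

```python
import json


def resource_name(environment: str, service: str, purpose: str) -> str:
    """Build a name such as 'dev-s3-agent-output' with the environment first."""
    parts = [environment, service, purpose]
    return "-".join(p.strip().lower() for p in parts)


def environment_bucket_policy(environment: str) -> dict:
    """Allow read access only to S3 buckets whose names start with the environment."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # The wildcard only works because every bucket name
                # begins with its environment.
                "Resource": [
                    f"arn:aws:s3:::{environment}-*",
                    f"arn:aws:s3:::{environment}-*/*",
                ],
            }
        ],
    }


name = resource_name("dev", "s3", "agent-output")
print(name)
print(json.dumps(environment_bucket_policy("dev"), indent=2))
```

If names were inconsistent, the policy would have to list every bucket individually and would need an update each time a bucket was added.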
CloudFormation
Although I love CloudFormation I didn’t use it in this script because I was in a hurry and am only one person. If I was in a larger organization I’d probably define my resources more specifically with CloudFormation.
I’ve written a lot about CloudFormation if you are trying to learn how to use it. A lot of my posts were about the cryptic error messages I get back from it. Some of the things that I’ve written about in those posts have been implemented or fixed by AWS already.
If I use a specific bootstrap script for the initial deployment of my organization and environments that only comes from source control I know the deployment is repeatable and consistent. For a single person that’s probably good enough.
One thing I’m missing by not using CloudFormation is drift detection. There are a few other benefits provided by CloudFormation but for a single person this script is pretty good. I could transition it to CloudFormation later. I don’t think that would be too difficult with AI.
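For readers new to the term, drift means the deployed configuration no longer matches what source control says it should be. The sketch below illustrates the idea with plain dictionaries. It is not how CloudFormation implements drift detection, and the setting names and values are hypothetical; in practice the actual values would come from querying the deployed resources.

```python
def find_drift(expected: dict, actual: dict) -> dict:
    """Return settings whose deployed value differs from the expected value."""
    drift = {}
    # Check keys from both sides so unexpected additions are reported too.
    for key in expected.keys() | actual.keys():
        want = expected.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = {"expected": want, "actual": have}
    return drift


# Hypothetical settings, as defined in source control.
expected = {"bucket_encryption": "aws:kms", "public_access_block": True}

# Hypothetical settings, as found in the account.
actual = {
    "bucket_encryption": "AES256",
    "public_access_block": True,
    "extra_tag": "temp",
}

drift = find_drift(expected, actual)
for key in sorted(drift):
    print(key, drift[key])
```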
However, by not using CloudFormation that’s one less VPC endpoint I need to pay for and VPC costs are my second highest costs in my particular organization. That cost would be negligible for a large organization but for my small business every endpoint adds up so for me this script will do for now. That was one of the main reasons I didn’t use it here. You probably should if you have any business with more than one employee or person in your AWS account managing the deployment of things in this script.
And by the way, I want VPC endpoints with true network layer controls for my most critical assets, not application layer or IAM controls, and I have explained why in other posts. Application layer and IAM controls let an attacker get farther up the stack. I would use a VPC endpoint everywhere if I could afford it.
Goals of My AWS Organization Bootstrap Script
Why bother with a bootstrap script? Because I can quickly define my entire organization with code and redeploy it if something goes wrong. I can also extend it as needed with additional environments. As I will show you, my script deploys all new environments in a consistent way and ensures all new accounts meet my security, cost control, and governance needs.
I’m using the concept of “environments” to maintain security and cost boundaries between different groups of accounts. I can put resources in an environment that are not allowed to access any other environment. I can quickly deploy complex networking and initial AMIs, IAM roles, repositories, etc. for each environment in a consistent manner. I am going to put my AI agents in a separate environment from the one running my production website and where I manage primary domain names in my account.
I also have the concept of a management environment where I deploy the organization-wide resources used for AWS Organizations delegated administrator accounts for things like IPAM, GuardDuty, logging, cost management, and security services that support a delegated administrator. These are the services that can be used to view and manage all the accounts in your organization. I definitely do not want my agents to have access to that, and I don’t want to run those services in my primary management account, where service control policies (SCPs) do not apply.
I want my script to be flexible enough to use for small or large companies to demonstrate concepts. It’s not that they would use this exact script, but they could use the way it is architected and the processes and concepts it uses to manage organizational security. It also needs to deploy controls consistently so there are no human errors along the way, and to make it easy to redeploy everything if something goes wrong.
There’s also a teardown script - often overlooked when deploying things. How do you back them out and remove them? This is more complicated sometimes than it seems. I need to do more testing of my teardown script but I did use it to tear down and redeploy some resources a few times.
To summarize, the goals of this bootstrap script are as follows:
Governance
Security
Cost management
Flexible controls that allow users (and agents) to operate within a certain boundary.
Hard boundaries, not AI guardrails which are easily bypassed.
Works for any size organization
Works for any kind of project or application
Supports different levels of autonomy by different organizational units
Segregation of resources that should not interact with or access each other
Separation of duties - supports different administrators for different types of resources
Consistent resource naming
This blog series is going to be a living document. I may revise the posts over time as the architecture changes or new resources are added. I’ll make a note of any changes in a change log which I’ll put somewhere. I haven’t figured that out yet but maybe I’ll put it in its own post if I need it.
What I’ll do is go through the different resources the script deploys and why. I will explain why I do not let AI deploy the base security controls for my organization and environment directly in my account. My posts won’t contain the minutiae and details of my past posts in terms of code and I can’t just give you a prompt to do it all. I’ll explain why that won’t really work and how you can generate your own bootstrap script and what you should be aware of as you do so.
For starters - be aware that if you are using the AWS free tier you lose some of that when you switch to using an AWS Organization. Read the documentation before you start vibe coding. You still need to know how things work. You can just build it a lot faster with AI.
Here’s my prior post on how AI is making a difference in getting this all done if you are interested. I can implement my ideas so much faster. Check out the Vibe Coding section of this blog if you want to see how I’m doing this.
For more posts like this subscribe and follow AWS Security.
— Teri Radichel
In the next post I explain how I am writing a bootstrap script to secure my AWS environment where I run AI agents.





