"Can you help me understand our AWS infrastructure?" turned into a multi-hour audit. The account runs in eu-north-1 (Stockholm) and hosts a freelancer management platform plus a secondary marketplace product.
What followed was systematic discovery across VPCs, subnets, security groups, databases, Lambda functions, ECS services, and CI/CD pipelines. The infrastructure was well-organized - consistent naming conventions made the mapping straightforward.
Initial Discovery
The first step was verifying credentials and getting oriented. A quick call to AWS STS get-caller-identity confirmed access through the kt-admin IAM user. From there, discovery proceeded systematically across all major services. The infrastructure turned out to be well-organized, with consistent naming conventions: company prefix, environment indicator (prod or test), and resource type suffix.
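For reference, a minimal boto3 sketch of that first orientation step might look like this, assuming the kt-admin credentials are already configured locally:

```python
import boto3

# Confirm which principal and account the audit is running as before touching anything else.
sts = boto3.client("sts", region_name="eu-north-1")
identity = sts.get_caller_identity()
print(identity["Arn"])      # e.g. arn:aws:iam::<account-id>:user/kt-admin
print(identity["Account"])
```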
Architecture Overview
VPC Architecture
The infrastructure follows AWS best practices with proper network segmentation. The production VPC uses the 10.32.0.0/16 CIDR block, while the test VPC mirrors the structure at 10.132.0.0/16.
Public subnets are prefixed with "if-" for internet-facing. These include the porch subnets for bastion/jump servers and the web subnets for the application load balancer, each with its own CIDR block in each of the three availability zones.
Private subnets use the "nif-" prefix for non-internet-facing. The app subnets host Lambda functions and ECS tasks, while the db subnets contain RDS and Redis instances.
Each subnet tier spans three availability zones - eu-north-1a, 1b, and 1c - for redundancy. The naming convention is clear and consistent throughout.
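To reproduce that mapping, a short boto3 sketch can group subnets by name prefix and availability zone; the VPC id below is a placeholder, and the prefix split simply follows the if-/nif- convention described above:

```python
import boto3
from collections import defaultdict

ec2 = boto3.client("ec2", region_name="eu-north-1")

# List the subnets of one VPC and bucket them by the first token of their Name tag.
resp = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}])
tiers = defaultdict(list)
for subnet in resp["Subnets"]:
    name = next((t["Value"] for t in subnet.get("Tags", []) if t["Key"] == "Name"), "unnamed")
    tiers[name.split("-")[0]].append((name, subnet["AvailabilityZone"], subnet["CidrBlock"]))

for tier, subnets in sorted(tiers.items()):
    print(tier)
    for name, az, cidr in sorted(subnets, key=lambda s: s[1]):
        print(f"  {name:30} {az}  {cidr}")
```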
One notable finding: all private subnets route through a single NAT Gateway in AZ-1. This is a cost/availability tradeoff - if eu-north-1a has issues, private subnets lose outbound internet access. For production workloads with strict uptime requirements, AWS recommends one NAT per AZ.
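Verifying the finding takes one query per VPC: count the NAT Gateways and note which AZ each one sits in. A hedged sketch (NAT Gateways report a subnet, so the AZ comes from a subnet lookup):

```python
import boto3
from collections import Counter

ec2 = boto3.client("ec2", region_name="eu-north-1")

# Map each available NAT Gateway to the AZ of the subnet it lives in.
subnet_az = {s["SubnetId"]: s["AvailabilityZone"] for s in ec2.describe_subnets()["Subnets"]}
nats = ec2.describe_nat_gateways(Filters=[{"Name": "state", "Values": ["available"]}])["NatGateways"]

print(Counter(subnet_az[n["SubnetId"]] for n in nats))  # a single AZ with count 1 = single point of failure
```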
Security Groups Analysis
The security model uses a layered approach with security-group-to-security-group references instead of CIDR ranges where possible. This is cleaner than managing IP lists.
The SSH access chain works like this: seven specific whitelisted IPs can access the ssh-access security group on TCP port 22. That security group is then referenced by the jumpserver security group, which allows all traffic from the ssh-access group. This creates a clean chain of trust.
Key security groups include ssh-access for the SSH whitelist, jumpserver for bastion hosts, and marketplace-g1-alb for the public ALB accepting HTTP and HTTPS from anywhere. Further in, marketplace-g1-app covers the ECS tasks and accepts traffic only from the ALB security group, main-platform-db allows PostgreSQL on port 5432 from the jumpserver, and redis-cache allows Redis on port 6379 from the Lambda security groups.
The Lambda functions have outbound-only security groups - no inbound rules needed since they're invoked through API Gateway, not direct network access.
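Those SG-to-SG references can be reconstructed programmatically by walking UserIdGroupPairs; a sketch that prints, for each group, which other groups it accepts traffic from:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")
groups = ec2.describe_security_groups()["SecurityGroups"]
names = {g["GroupId"]: g["GroupName"] for g in groups}

# For every inbound rule that references another security group, print the trust edge.
for g in groups:
    for rule in g["IpPermissions"]:
        for ref in rule.get("UserIdGroupPairs", []):
            port = rule.get("FromPort", "all")
            print(f'{names.get(ref["GroupId"], ref["GroupId"])} -> {g["GroupName"]} (port {port})')
```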
Database Discovery
Four RDS instances, each serving a specific purpose.
Production Databases
The main-platform-microservices database runs PostgreSQL 13.20 and serves as the primary database for the platform. It shows the highest activity with 7-9 average connections, peaking at 38. It hosts both the main-platform_prod and main-platform_prod_g2 schemas, used by the main-platform-api and main-platform-admin Lambda functions. The G2 suffix suggests a parallel schema for migration or A/B testing - both pointing to the same host but different databases.
The payment database also runs PostgreSQL 13.20 but is isolated for financial data compliance. It shows roughly 1 average connection, peaking at 8. The payment-main-platform and payment-registration Lambda functions use it to store transactions, bank reports, and PAIN XML data.
The marketplace-g1-cluster runs Aurora MySQL 5.7 for the legacy marketplace. Its metrics are interesting: almost zero database connections yet 13% CPU usage. Heavy Redis caching explains the paradox, and the database name "legacy_db" hints at a Nordic bidding platform origin.
Test Database
The test-general database runs PostgreSQL 13.20 as a consolidated test database for all services. It uses a single main-platform_test schema for cost optimization versus prod's isolated databases.
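All four showed up from two calls; a sketch that lists engine and version for the standalone instances and the Aurora cluster:

```python
import boto3

rds = boto3.client("rds", region_name="eu-north-1")

# Standalone instances (the three PostgreSQL databases).
for db in rds.describe_db_instances()["DBInstances"]:
    print(db["DBInstanceIdentifier"], db["Engine"], db["EngineVersion"], db["DBInstanceClass"])

# Aurora clusters (the marketplace database).
for cluster in rds.describe_db_clusters()["DBClusters"]:
    print(cluster["DBClusterIdentifier"], cluster["Engine"], cluster["EngineVersion"])
```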
The Marketplace Mystery
The marketplace database metrics were puzzling: almost zero DB connections, yet the service is clearly active with 2,000-8,000 ALB requests per day. The ECS task definition's environment variables revealed the answer: heavy Redis caching. Most read operations hit Redis first and only fall back to Aurora on cache misses. The 27,000 write IOPS recorded over seven days suggest the application is alive but primarily cache-driven.
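The pattern is visible straight from CloudWatch; a sketch comparing average DatabaseConnections on the cluster with daily RequestCount on the ALB over the last week (the cluster identifier and LoadBalancer dimension value are placeholders):

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="eu-north-1")
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

def daily(namespace, metric, dimensions, stat):
    # One datapoint per day for the last week.
    resp = cw.get_metric_statistics(Namespace=namespace, MetricName=metric, Dimensions=dimensions,
                                    StartTime=start, EndTime=end, Period=86400, Statistics=[stat])
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

conns = daily("AWS/RDS", "DatabaseConnections",
              [{"Name": "DBClusterIdentifier", "Value": "marketplace-g1-cluster"}], "Average")
reqs = daily("AWS/ApplicationELB", "RequestCount",
             [{"Name": "LoadBalancer", "Value": "app/marketplace-g1-alb/0123456789abcdef"}], "Sum")

for c, r in zip(conns, reqs):
    print(c["Timestamp"].date(), f'avg connections {c["Average"]:.1f}', f'requests {r["Sum"]:.0f}')
```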
Lambda Functions Inventory
30 Lambda functions running on provided.al2 runtime - likely Rust or Go compiled binaries. The production functions follow a consistent naming pattern with services like main-platform-api, main-platform-admin, payment-main-platform, payment-registration, currency-converter, invoice-generator, and marketplace-g1.
Each service has the same function pattern: a website function for HTTP handling, console for CLI or scheduled tasks, worker for async queue processing, and preHook for deployment warmup. Test functions mirror the production structure with "-test-" in their names.
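A sketch of the inventory step, grouping every function by runtime (container-image functions carry no Runtime field, hence the fallback):

```python
import boto3
from collections import defaultdict

lam = boto3.client("lambda", region_name="eu-north-1")

# Page through all functions and group them by runtime.
by_runtime = defaultdict(list)
for page in lam.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        by_runtime[fn.get("Runtime", "container-image")].append(fn["FunctionName"])

for runtime, fns in sorted(by_runtime.items()):
    print(runtime, len(fns))
    for name in sorted(fns):
        print(" ", name)
```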
CI/CD Pipeline Discovery
All deployments use GitHub Actions OIDC - no stored AWS credentials. Each repository has a dedicated IAM role with a trust policy scoped to that specific repo. The mappings connect each GitHub role to its repository: marketplace-g1, main-platform-api, main-platform-admin, main-platform-webapp, payment, payment-registration, invoice-generator, and currency-converter.
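The repo-scoping can be audited directly: list the roles whose trust policy federates through GitHub's OIDC provider and print their conditions. A sketch (boto3 returns the trust policy already decoded as a dict):

```python
import boto3, json

iam = boto3.client("iam")

# Find roles trusted by the GitHub Actions OIDC provider and show the repo condition on each.
for page in iam.get_paginator("list_roles").paginate():
    for role in page["Roles"]:
        statements = role["AssumeRolePolicyDocument"].get("Statement", [])
        if isinstance(statements, dict):
            statements = [statements]
        for stmt in statements:
            federated = str(stmt.get("Principal", {}).get("Federated", ""))
            if "token.actions.githubusercontent.com" in federated:
                print(role["RoleName"], json.dumps(stmt.get("Condition", {})))
```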
Deployment Methods by Service Type
Lambda APIs deploy using the Serverless Framework via serverless deploy. ECS services use Docker builds pushed to ECR, followed by aws ecs update-service. Static sites sync to S3 via aws s3 sync, followed by a CloudFront cache invalidation.
The marketplace service hasn't been deployed since September 2024, based on the last ECR image timestamp. Combined with the low database activity, this suggests a stable legacy system with minimal active development.
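Deployment recency falls out of ECR directly; a sketch that prints the most recent image push per repository (describe_images is itself paginated, which a production script would handle):

```python
import boto3

ecr = boto3.client("ecr", region_name="eu-north-1")

# Latest image push per repository; a stale timestamp flags a dormant service.
for page in ecr.get_paginator("describe_repositories").paginate():
    for repo in page["repositories"]:
        images = ecr.describe_images(repositoryName=repo["repositoryName"])["imageDetails"]
        if images:
            print(repo["repositoryName"], max(i["imagePushedAt"] for i in images).date())
```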
Prod vs Test Comparison
A comparison revealed some drift between environments.
Security Groups Gap
Production has several security groups not present in test: currency-converter, marketplace-g1-alb, marketplace-g1-app, marketplace-g1-db, marketplace-g1-efs, invoice-generator, payment-db, payment-document-reader, main-platform-db, and launch-wizard-1.
Test has a general-db security group not in production, which supports the consolidated test database approach. The missing marketplace infrastructure in test is intentional - it's a separate product that doesn't require a parallel test environment. The launch-wizard-1 group looks like a manual EC2 launch artifact and is a cleanup candidate that should be reviewed.
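The drift list comes from a simple set difference on group names between the two VPCs; a sketch with placeholder VPC ids for prod and test:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")

def group_names(vpc_id):
    # Security group names in one VPC, ignoring the built-in default group.
    resp = ec2.describe_security_groups(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    return {g["GroupName"] for g in resp["SecurityGroups"]} - {"default"}

prod = group_names("vpc-0aaaaaaaaaaaaaaaa")  # 10.32.0.0/16
test = group_names("vpc-0bbbbbbbbbbbbbbbb")  # 10.132.0.0/16

print("prod only:", sorted(prod - test))
print("test only:", sorted(test - prod))
```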
NACL Rule Differences
The web tier has 42 rules in both environments. The porch tier has 42 rules in prod versus 43 in test. The app tier has 38 rules in prod versus 35 in test. The db tier has 14 rules in prod versus 19 in test. Minor drift but worth investigating to prevent surprises when promoting code to production.
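The rule counts are easy to regenerate; a sketch that counts entries per network ACL, keyed by Name tag:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")

# One line per NACL: its Name tag (or id) and how many rules it carries.
for acl in ec2.describe_network_acls()["NetworkAcls"]:
    name = next((t["Value"] for t in acl.get("Tags", []) if t["Key"] == "Name"), acl["NetworkAclId"])
    print(name, len(acl["Entries"]))
```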
Service-to-Database Mapping
Pulling the threads together: main-platform-api and main-platform-admin connect to the main-platform-microservices instance (the main-platform_prod and main-platform_prod_g2 databases), payment-main-platform and payment-registration use the isolated payment database, the marketplace-g1 ECS service reads from the marketplace-g1-cluster Aurora database through its Redis cache, and all test functions share the consolidated test-general database.
Findings Summary
What's Done Well
The infrastructure demonstrates proper network segmentation with a four-tier architecture from porch to web to app to db, with NACLs and security groups at each layer. OIDC authentication means no AWS credentials stored in GitHub - just federated identity with repo-scoped permissions. Database isolation keeps payment data separated from the main workload for compliance. Consistent naming using the client-environment-service pattern everywhere makes discovery easy. Everything is IaC managed through CloudFormation and Serverless Framework stacks, not click-ops.
Items to Review
The single NAT Gateway is a low severity finding - it's a cost optimization, but it creates a single point of failure for private subnet outbound traffic. Two expired ACM certificates were found in us-east-1 - medium severity. The manually created launch-wizard-1 security group is low severity and should be brought under IaC or removed. The marketplace being inactive since September 2024 is informational - it may be intentional for a legacy/stable product. Two IPs whitelisted directly in some DB security groups have direct database access - low severity.
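The expired certificates are easy to re-check; a sketch against us-east-1 (the region named in the finding), filtering directly on the EXPIRED status:

```python
import boto3

# The expired certificates were found in us-east-1 (commonly used for CloudFront-facing certs).
acm = boto3.client("acm", region_name="us-east-1")

for page in acm.get_paginator("list_certificates").paginate(CertificateStatuses=["EXPIRED"]):
    for cert in page["CertificateSummaryList"]:
        print("expired:", cert["DomainName"], cert["CertificateArn"])
```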
Cost Optimization Opportunities
The marketplace Aurora could potentially migrate to RDS MySQL, saving roughly $15/month if Aurora features aren't needed. NAT Gateway redundancy could be improved by adding two more gateways for high availability at roughly $32/month each, or kept at one for cost savings. Test database consolidation is already implemented - good practice.
Business Context (Inferred)
Based on the infrastructure, Main Platform is a Swedish freelancer/contractor management platform with core functionality for managing freelancers, contracts, and timesheets, payment processing for salary disbursements, PAIN XML support for EU bank transfers, and invoice generation for billing.
Marketplace appears to be a separate or legacy freelancer marketplace focused on the Swedish market: BankID integration via Zignsec, email marketing through Mailchimp, a "legacy_db" database name suggesting a Nordic bidding/gig platform, heavy caching with low direct database load, and possibly an acquisition or an earlier version of the product.
The infrastructure supports both products with shared services like Redis and IAM roles but isolated databases and deployment pipelines.
Takeaways
First, systematic discovery pays off. Running parallel queries across services builds a complete picture faster than sequential exploration. VPCs, subnets, security groups, databases - they all connect.
Second, naming conventions are documentation. The client-env-service pattern made mapping trivial. No guessing which resource belongs where.
Third, metrics tell the story. Database connection counts and IOPS revealed that marketplace is cache-heavy before we even looked at the code. CloudWatch is the source of truth.
Fourth, OIDC beats stored credentials. The GitHub Actions integration is clean. Each repo has scoped permissions, no key rotation needed, audit trail is built-in.
Fifth, prod/test parity matters. The security group drift between environments is small but worth addressing. Test should mirror prod for accurate pre-deployment validation.
The infrastructure is solid. The team that built this knew what they were doing.