What started as "let's replace this PHP service" turned into a multi-day infrastructure journey. The service itself is simple: a PHP Lambda for currency conversion - fetch rates, cache them in Redis, return the result. The Node.js rewrite was easy. The AWS plumbing was not.
The Architecture
The old currency-converter service was a PHP Lambda sitting behind API Gateway, using Redis for caching exchange rates. Simple enough:
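Roughly, the old setup looked like this (flow only, based on the description above):

```
client → API Gateway (currency-converter.example.com)
       → Lambda (PHP 8.1 on Bref)
       → ElastiCache Redis (cached exchange rates)
```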
The new client-payment service would be Node.js 20 with TypeScript, same pattern but modern tooling. The domain would change from currency-converter.example.com to payment-api.example.com for production and payment-api-test.example.com for testing.
Creating Dedicated Security Groups
The first issue: the new service was using shared security groups. In an AWS VPC, security groups control which services can talk to each other. Sharing security groups means sharing access - not ideal for isolation.
Using the AWS CLI, I created a dedicated security group for the test environment with the name sgr-client-test-client-payment and a description indicating it's for the client-payment Lambda in the test VPC. The command returned a new group ID.
The same process for production created another security group named sgr-client-prod-client-payment in the production VPC, returning its own group ID.
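For reference, the shape of those two commands was roughly this - the VPC IDs are placeholders, not the real values, and each call prints the new GroupId:

```bash
# Test VPC security group for the client-payment Lambda
aws ec2 create-security-group \
  --group-name sgr-client-test-client-payment \
  --description "client-payment Lambda (test VPC)" \
  --vpc-id vpc-0testplaceholder

# Same again for production
aws ec2 create-security-group \
  --group-name sgr-client-prod-client-payment \
  --description "client-payment Lambda (prod VPC)" \
  --vpc-id vpc-0prodplaceholder
```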
With those security groups created, I updated the serverless.yml configuration. The custom vpc section now has stage-specific settings: dev has no VPC configuration, test uses the new security group ID along with three subnet IDs for the test VPC, and prod uses the production security group ID with its corresponding three subnet IDs.
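A sketch of that stage-keyed section - all IDs are placeholders, and the wiring into provider.vpc is my assumption about how the stage value is picked up:

```yaml
custom:
  vpc:
    dev: {}                          # dev: no VPC attachment
    test:
      securityGroupIds:
        - sg-0aaa111122223333        # sgr-client-test-client-payment
      subnetIds:
        - subnet-0test00000000001
        - subnet-0test00000000002
        - subnet-0test00000000003
    prod:
      securityGroupIds:
        - sg-0bbb444455556666        # sgr-client-prod-client-payment
      subnetIds:
        - subnet-0prod00000000001
        - subnet-0prod00000000002
        - subnet-0prod00000000003

provider:
  vpc: ${self:custom.vpc.${sls:stage}}   # assumption: stage-based lookup
```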
The Redis Connection Mystery
First deployment: health checks worked. Currency conversion: timeout.
Testing with an authorized curl request to the convert endpoint, asking to convert 10,000 SEK to EUR, returned nothing but an internal server error.
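The request was along these lines - host, path, token, and body field names are approximations:

```bash
curl -s -X POST "https://payment-api-test.example.com/convert" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"amount": 10000, "from": "SEK", "to": "EUR"}'
# → {"message": "Internal server error"}
```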
Lambda logs showed 10-second timeouts - the function was hanging trying to connect to something. The Redis URL looked correct, pointing to the replication group endpoint with the ".ng." segment in the hostname.
But wait - the GitHub secret had a different URL. It was pointing to the node endpoint, which doesn't have the ".ng." segment. That's the difference between a replication group endpoint and a node endpoint.
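Side by side, the two hostname shapes look something like this (cluster name, hash, and region are placeholders):

```
# Replication group (primary) endpoint - has the ".ng." node-group segment:
redis-cluster.abc123.ng.0001.eu-west-1.cache.amazonaws.com:6379

# Individual node endpoint - no ".ng.", bound to a single node:
redis-cluster-001.abc123.0001.eu-west-1.cache.amazonaws.com:6379
```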
For ElastiCache with replication enabled, you need the replication group endpoint, not the node endpoint. Fixed the GitHub secret. Redeployed. Still timing out.
The Real Problem: Security Group References
The new security group existed, but Redis didn't know to trust it. Security group inbound rules in AWS work by reference - you allow traffic from specific security groups, not CIDR blocks.
Checking the Redis security group inbound rules with the AWS CLI revealed the problem. The query returned a list of UserIdGroupPairs showing which security groups are allowed to connect on TCP port 6379. The list included client-api, client-admin, payment-registration-client, and payment-client - but the new client-payment security group was nowhere in the list.
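The check was roughly this - the Redis group ID is a placeholder:

```bash
# List the security groups allowed to connect on TCP 6379
aws ec2 describe-security-groups \
  --group-ids sg-0ccc777788889999 \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`6379`].UserIdGroupPairs[]'
```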
Redis was correctly rejecting connections from unknown security groups.
The fix required adding the new security group to the Redis inbound rules. Using authorize-security-group-ingress, I added the new security group as an allowed source for TCP port 6379. Then I added a description to the rule so it's clear what service it allows.
The same process was repeated for production, authorizing the production security group to access the production Redis cluster.
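The rule additions looked roughly like this - group IDs are placeholders, and production repeats the same pair of commands with the prod IDs:

```bash
# Allow the new test security group to reach Redis on 6379
aws ec2 authorize-security-group-ingress \
  --group-id sg-0ccc777788889999 \
  --protocol tcp --port 6379 \
  --source-group sg-0aaa111122223333

# Add a description so the rule explains itself later
aws ec2 update-security-group-rule-descriptions-ingress \
  --group-id sg-0ccc777788889999 \
  --ip-permissions 'IpProtocol=tcp,FromPort=6379,ToPort=6379,UserIdGroupPairs=[{GroupId=sg-0aaa111122223333,Description="client-payment test Lambda"}]'
```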
Testing again with the same curl request now returned a successful response. The JSON showed status success, the current date, an exchange ratio, and the converted amount: 10,000 SEK converts to 934 EUR at the current rate. Both test and prod working.
CI/CD Race Conditions
During development, deployments kept failing intermittently with an error message saying "Cannot delete ChangeSet in status CREATE_IN_PROGRESS".
The problem: rapid commits to main triggered multiple GitHub Actions workflows simultaneously. CloudFormation can only process one stack update at a time.
The fix: add concurrency control to the workflow files. In the deploy-test.yml workflow, I added a concurrency block with a group name of "deploy-test" and cancel-in-progress set to false. The same pattern was applied to deploy-prod.yml with its own group name "deploy-prod".
Setting cancel-in-progress to false is key - instead of canceling the running job, new jobs wait in queue. Deployments run sequentially, CloudFormation stays happy.
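In deploy-test.yml that's just the following block, with deploy-prod.yml using its own group name:

```yaml
concurrency:
  group: deploy-test
  cancel-in-progress: false
```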
Decommissioning the Old Service
With the new service stable, it was time to remove the old currency-converter. The decommissioning process follows a specific order.
First, I removed the security group reference from the Redis inbound rules using revoke-security-group-ingress, removing the old service's security group from the allowed list.
Second, I deleted the API Gateway custom domain mapping using the delete-api-mapping command, then deleted the custom domain name itself.
Third, I deleted the CloudFormation stack for the old service. This removes the Lambda function, API Gateway resources, and IAM roles all at once.
Fourth, I deleted the Route53 DNS records. This required a change-resource-record-sets command with a change batch containing DELETE actions for both the A record and AAAA record, which were alias records pointing to the API Gateway domain.
Fifth, I cleaned up the SSM parameter store by deleting the app-secret parameter for the old service.
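Condensed, the sequence was roughly this - IDs, stack name, zone, and parameter path are placeholders, and the Route53 change batch has to repeat the records' existing alias configuration with DELETE actions:

```bash
# 1. Drop the old service's SG from the Redis inbound rules
aws ec2 revoke-security-group-ingress \
  --group-id sg-0ccc777788889999 \
  --protocol tcp --port 6379 \
  --source-group <old-currency-converter-sg-id>

# 2. Remove the API mapping, then the custom domain itself
aws apigatewayv2 delete-api-mapping \
  --api-mapping-id <mapping-id> \
  --domain-name currency-converter.example.com
aws apigatewayv2 delete-domain-name \
  --domain-name currency-converter.example.com

# 3. Delete the stack (Lambda, API Gateway resources, IAM roles)
aws cloudformation delete-stack --stack-name <old-currency-converter-stack>

# 4. Delete the alias A and AAAA records via a DELETE change batch
aws route53 change-resource-record-sets \
  --hosted-zone-id <zone-id> \
  --change-batch file://delete-old-dns.json

# 5. Remove the old app-secret from SSM
aws ssm delete-parameter --name <old-service-app-secret-path>
```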
The CloudFormation stack deletion took several minutes - Lambda VPC ENIs (Elastic Network Interfaces) take time to clean up. AWS eventually releases them, but it's not instant.
The Downstream Impact
The day after decommissioning, errors started appearing in client-api. The logs showed a message indicating it could not resolve the host currency-converter.example.com, with the exception occurring in the CurrencyExchange.php file.
The main API was still pointing to the old URL. Checking the serverless.yml for client-api revealed the problem: the CURRENCY_CONVERTER_ENDPOINT environment variable was still set to the old currency-converter.example.com domain for both test and production environments.
The fix was straightforward: update the environment variables to point to the new endpoints. Test now uses payment-api-test.example.com and production uses payment-api.example.com.
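In client-api's serverless.yml that's a two-value change per stage; the surrounding layout here is my guess, only the endpoint values are the real ones:

```yaml
custom:
  currencyConverterEndpoint:
    test: https://payment-api-test.example.com
    prod: https://payment-api.example.com

provider:
  environment:
    CURRENCY_CONVERTER_ENDPOINT: ${self:custom.currencyConverterEndpoint.${sls:stage}}
```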
Lesson: when replacing a service, grep for the old URL across all repositories. DNS errors in production are not fun.
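Something as blunt as this, run across local checkouts of every repository, would have caught it before the cutover:

```bash
grep -rn 'currency-converter.example.com' ~/repos
```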
Final State
The migration transformed several aspects of the service. The runtime changed from PHP 8.1 running on Bref to Node.js 20. The language shifted from PHP to TypeScript. The domain moved from currency-converter.example.com to payment-api.example.com. The security group configuration changed from shared to dedicated, with a specific security group ID for the new service. And the CI/CD pipeline now has queue-based concurrency control instead of running without any concurrency management.
The new service is live, the old service is gone, and downstream dependencies are updated. The actual code change was small - the infrastructure work was everything.
Key Takeaways
First, security groups are identity-based. Redis doesn't allow "any Lambda". It allows "Lambda with security group X". New services need explicit references added.
Second, ElastiCache endpoints matter. Node endpoints versus replication group endpoints make a real difference. The ".ng." in the hostname is the difference between working and timeout.
Third, CloudFormation is single-threaded per stack. Multiple concurrent deploys cause race conditions. GitHub Actions concurrency blocks solve this.
Fourth, Lambda ENI cleanup is slow. VPC-enabled Lambdas create ENIs that persist after function deletion. Security groups can't be deleted until ENIs are gone, which can take up to 20 minutes.
Fifth, grep for old URLs before decommissioning. Dependencies don't update themselves. One forgotten serverless.yml means production errors.
Sixth, order of deletion matters. Security group references first, then domains, then stack, then DNS, then SSM. Follow reverse dependencies.
The migration took longer than expected - not because the code was complex, but because distributed systems have distributed failure modes. Every connection is a contract, and contracts need updating when parties change.