Strange issue with an AWS VPC PrivateLink endpoint

Hi guys, sorry for neglecting this blog for so long; I got distracted by work pressure and a lot of changes in my life! The Covid-19 lockdown made me think about my blog again, and I have decided to restart it.

Recently I was working for a client, setting up an AWS account for their internal product. Meanwhile, one of my team members ran into a strange issue while creating a PrivateLink endpoint from an endpoint service. I got involved and learned that it was an issue with AWS availability zone assignment! I will explain how the issue came to our notice and what AWS asked us to do to resolve it.

Before getting into the issue, let me explain what exactly I was trying to achieve. My client has multiple products, with multiple teams working on different projects on the AWS platform. One of the projects wanted to access a service running in a different AWS account; that service runs entirely on the private network and is not exposed to the public network.

To achieve this connectivity, we used the AWS service called PrivateLink, with a VPC endpoint service on one side and an interface VPC endpoint on the other. The high-level architecture looks like this.

[Figure: AWS Pvt Link architecture]

How to create an endpoint service in AWS VPC:

  • Create a Network Load Balancer for your application in your VPC and configure it for each subnet (Availability Zone az1, az2, az3) in which the service should be available.
  • Create a VPC endpoint service configuration and specify your Network Load Balancer created above.
  • Grant permissions to specific service consumers (AWS accounts) to create a connection to the endpoint service.
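The last two steps above can be sketched in Python. This is only a minimal sketch under stated assumptions: `ec2` is expected to be a boto3 EC2 client created in the provider account (Account B), the function name `publish_endpoint_service` is hypothetical, and the NLB is assumed to exist already. Consumers are allowed by their account root ARN, which is one common choice and not necessarily what we used.

```python
def publish_endpoint_service(ec2, nlb_arn, consumer_account_ids, auto_accept=True):
    """Create an endpoint service for an existing NLB and allow consumers.

    `ec2` is assumed to be a boto3 EC2 client in the provider account,
    e.g. boto3.client("ec2"); it is passed in so the function stays testable.
    """
    created = ec2.create_vpc_endpoint_service_configuration(
        NetworkLoadBalancerArns=[nlb_arn],
        # AcceptanceRequired=False means connection requests are auto-accepted.
        AcceptanceRequired=not auto_accept,
    )
    config = created["ServiceConfiguration"]

    # Allow each consumer account (by its root principal) to connect.
    principals = [f"arn:aws:iam::{account}:root" for account in consumer_account_ids]
    ec2.modify_vpc_endpoint_service_permissions(
        ServiceId=config["ServiceId"],
        AddAllowedPrincipals=principals,
    )

    # The zones reported here are the ones taken from the NLB at creation time.
    return config["ServiceName"], config.get("AvailabilityZones", [])
```

The returned zone list is worth recording, because it is what the consumer side is later expected to match.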

Steps to enable service consumers to connect to endpoint service:

  • Create an interface endpoint using the endpoint service name.
  • Choose the respective VPC and availability zones. We used CloudFormation with the default option, which means the endpoint should be created in all the zones of the Account B NLB; Account A has 3 subnets, in az1, az2, and az3.
  • To activate the connection, accept the interface endpoint connection request. It is set to be accepted automatically in account B, so no action was required in our case.
  • Attach a security group with outgoing traffic enabled for the service ports on the VPC CIDR.
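For comparison with the CloudFormation route, here is a minimal sketch of the consumer-side steps in Python. It assumes `ec2` is a boto3 EC2 client in the consumer account (Account A); the function name `create_interface_endpoint` and all IDs are hypothetical. The subnets are passed explicitly, one per availability zone, instead of relying on any default.

```python
def create_interface_endpoint(ec2, vpc_id, service_name, subnet_ids, security_group_ids):
    """Create an interface endpoint for an endpoint service.

    `ec2` is assumed to be a boto3 EC2 client in the consumer account.
    Passing one subnet per availability zone makes the zone coverage explicit.
    """
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=service_name,
        SubnetIds=list(subnet_ids),
        SecurityGroupIds=list(security_group_ids),
    )
    endpoint = response["VpcEndpoint"]
    # State is "pendingAcceptance" until the provider accepts, unless
    # the service is configured to accept requests automatically.
    return endpoint["VpcEndpointId"], endpoint["State"]
```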

So far everything looks good, but it's not! When we tried to access or telnet to the endpoint DNS name on the service port from account A, we got a timeout error.
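The telnet test can be scripted so that every IP address behind the endpoint DNS name is checked separately, which helps when only some addresses time out. This is a small sketch using only the Python standard library; the function name `probe_endpoint` is hypothetical, and the host and port in the usage comment are placeholders.

```python
import socket


def probe_endpoint(host, port, timeout=5.0):
    """Return {ip: reachable} for every address the DNS name resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    results = {}
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip in results:
            continue  # same address returned more than once
        try:
            # A successful TCP handshake is what a telnet test would show.
            with socket.create_connection((ip, port), timeout=timeout):
                results[ip] = True
        except OSError:
            # Covers timeouts, refused connections and unreachable networks.
            results[ip] = False
    return results


# Usage (placeholder values):
# probe_endpoint("my-endpoint-dns-name.example", 443)
```

A result where some addresses are reachable and others are not would point at a per-zone problem and not at the service itself.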

Root cause:

When I validated the setup, I noticed that the interface endpoint in account A had been created with only 2 availability zones. As per the AWS documentation, CloudFormation should have created the interface endpoint with 3 availability zones, matching the NLB in account B, which covers 3 availability zones!
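This kind of mismatch can be detected with a short check. The sketch below assumes `ec2` is a boto3 EC2 client in the consumer account and that the endpoint service is visible to that account; the function name `missing_endpoint_zones` is hypothetical. It compares the zones the service advertises with the zones of the subnets the endpoint actually landed in, both as seen from the same account.

```python
def missing_endpoint_zones(ec2, endpoint_id):
    """List zones offered by the endpoint service but not covered by the endpoint.

    `ec2` is assumed to be a boto3 EC2 client in the consumer account.
    """
    endpoint = ec2.describe_vpc_endpoints(VpcEndpointIds=[endpoint_id])["VpcEndpoints"][0]

    # Zones in which the endpoint actually has a network interface.
    subnets = ec2.describe_subnets(SubnetIds=endpoint["SubnetIds"])["Subnets"]
    endpoint_zones = {subnet["AvailabilityZone"] for subnet in subnets}

    # Zones the endpoint service makes available to this account.
    services = ec2.describe_vpc_endpoint_services(ServiceNames=[endpoint["ServiceName"]])
    service_zones = set(services["ServiceDetails"][0]["AvailabilityZones"])

    return sorted(service_zones - endpoint_zones)
```

An empty list means the endpoint covers every zone of the service; in our case a check like this would have reported one missing zone.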

I raised this issue with AWS, and they came back with the following reply:

When creating endpoint service CloudFormation do not have the option to give AZs. It takes AZs from the NLBs attached.

If you add a subnet later to the NLB in different AZ that change wont take effect on endpoint service. i.e. when you add a subnet to the NLB AFTER you created the Endpoint Service.

But we didn't add or update any subnet in either account; it was the same old VPC and subnets in both accounts! AWS also asked us to delete and recreate the endpoint service and the interface endpoint.

I also noticed that when we create the interface endpoint from the AWS console, there is no issue: it takes the AZs from the attached NLB and works as expected.

Issues with Amazon Elastic Load Balancing (ELB) and fixes

I have been using Amazon cloud services for some time. Amazon is an amazing service provider, but you also run into many restrictions and limitations when you are looking for flexible cloud infrastructure and services. In my experience with Amazon EC2 and ELBs, I have noticed some issues that can be very difficult to identify.

AWS ELB

ELBs sometimes behave very strangely when servers go offline. As expected, the ELB health checks handle EC2 downtime as documented. But I have noticed that once all the servers behind an ELB go down, for maintenance or for any other reason, they are marked as Out of Service on the ELB page.

Yes, this looks normal. The issue comes when these servers come back up; at that moment the ELB behaves very strangely. The ELB shows the servers as In Service after the health checks, and you can even see the health checks hitting all the servers. But if you try to access the ELB URL, it responds as unavailable or says the service is down.
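The symptom can be expressed as a simple check: every instance reports In Service, yet a request through the ELB fails. The sketch below assumes a Classic Load Balancer and a boto3 `elb` client (boto3.client("elb")); the function name `elb_looks_stuck` and the `probe` callable are hypothetical, with `probe` standing for whatever end-to-end request you already use against the ELB URL.

```python
def elb_looks_stuck(elb, load_balancer_name, probe):
    """Return True when the ELB reports all instances healthy but traffic fails.

    `elb` is assumed to be a boto3 Classic ELB client.
    `probe` is any zero-argument callable returning True if a request
    through the ELB URL succeeds.
    """
    health = elb.describe_instance_health(LoadBalancerName=load_balancer_name)
    states = health["InstanceStates"]

    # With no instances registered there is nothing to compare against.
    all_in_service = bool(states) and all(s["State"] == "InService" for s in states)

    return all_in_service and not probe()
```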

Surprised, right? Yes, this is the issue I noticed. The solution was to remove the servers from the ELB and add them back. When I checked with some AWS experts, they said the ELB should come up and go live on its own, but if it does not, remove the servers and add them back.
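The remove-and-add-back workaround can be scripted. This is a minimal sketch assuming a Classic Load Balancer and a boto3 `elb` client (boto3.client("elb")); the function name `reregister_instances` is hypothetical. Note that the instances receive no traffic between the two calls, so it is best run only when the ELB is already failing to serve.

```python
def reregister_instances(elb, load_balancer_name, instance_ids):
    """Remove instances from a Classic ELB and add them straight back.

    `elb` is assumed to be a boto3 Classic ELB client.
    Returns the instance IDs registered with the load balancer afterwards.
    """
    instances = [{"InstanceId": instance_id} for instance_id in instance_ids]

    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=instances,
    )
    response = elb.register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=instances,
    )
    return [item["InstanceId"] for item in response["Instances"]]
```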

So the conclusion is: keep a sharp eye on it using monitoring tools. I hope this will be helpful for someone facing a similar kind of issue with ELBs. Thank you, I will be back with a new topic soon.