We maintain dedicated infrastructure to host development environments (Dev environments) for various projects, independent of client resources, to streamline the development process.
The Incident: One day, we started receiving reports that Dev web resources were inaccessible. The root cause was traced to our Kubernetes master server, the node that manages the cluster hosting these Dev environments. The master had become unavailable, and because all Dev domain names were routed through a load balancer managed by the Kubernetes master, the environments became unreachable even though the workloads were still running on the Kubernetes worker nodes.
Our Response: To minimize downtime and reduce the impact on development workflows, we took the following immediate steps:
- Traffic Redirection:
Because the Ingress controller was also reachable on every worker node, we re-pointed the primary Dev domain name at all worker nodes simultaneously. This balanced traffic across the workers and made the environments accessible again for client demonstrations (see the DNS check sketch after this list).
- Restoring Full Functionality:
Although the Dev environments were accessible again, the CI/CD pipelines remained down because the master server was unavailable. To address this, we added multiple Kubernetes master nodes, keeping an odd number for quorum and high availability (the quorum arithmetic is sketched after this list). We then updated the cluster configuration to route the primary domain name across these new masters, providing both availability and load balancing.
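Re-pointing one domain at several nodes at once typically means publishing one A record per worker so resolvers rotate through them (DNS round-robin). The sketch below is a minimal check that the domain really resolves to every worker; the domain name and worker IP addresses are placeholders, not our actual values.

```python
import socket

# Placeholder values: substitute the real Dev domain and worker node IPs.
DEV_DOMAIN = "dev.example.com"
WORKER_IPS = {"10.0.1.11", "10.0.1.12", "10.0.1.13"}

def resolved_addresses(host: str) -> set[str]:
    """Return the set of IPv4 addresses the domain currently resolves to."""
    infos = socket.getaddrinfo(host, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

if __name__ == "__main__":
    addresses = resolved_addresses(DEV_DOMAIN)
    missing = WORKER_IPS - addresses
    if missing:
        print(f"WARNING: {DEV_DOMAIN} is not resolving to all worker nodes; missing {sorted(missing)}")
    else:
        print(f"{DEV_DOMAIN} resolves to every worker node: {sorted(addresses)}")
```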
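The reason for keeping an odd number of master nodes is quorum: the control plane's etcd store stays writable only while a strict majority of members is healthy, so an even member count adds no extra fault tolerance. A small illustration of that arithmetic:

```python
def quorum(masters: int) -> int:
    """Minimum number of healthy control-plane/etcd members needed to keep a majority."""
    return masters // 2 + 1

def tolerated_failures(masters: int) -> int:
    """How many master nodes can fail before the cluster loses quorum."""
    return masters - quorum(masters)

if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} master(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
    # 2 masters tolerate no more failures than 1, and 4 no more than 3,
    # which is why odd counts (3, 5, ...) are the usual choice.
```

With three masters, the cluster survives the loss of any single master, which is exactly the failure mode behind this incident.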
Outcome:
- Development environments were restored and operational for client use within a short time frame.
- CI/CD processes resumed once the new master nodes were integrated.
Lessons Learned and Improvements: This incident highlighted weaknesses in our Dev cluster architecture, above all the single master node as a single point of failure. To prevent similar issues in the future, we implemented the following:
- Improved High Availability:
Added redundant Kubernetes master nodes to ensure the cluster remains operational even if a single master node fails.
- Enhanced Monitoring and Alerts:
Upgraded our monitoring system to detect and alert on master node issues earlier (a minimal health-probe sketch follows this list).
- Disaster Recovery Testing:
Began conducting regular failover tests to verify smooth recovery during unexpected events (a simple cordon-based drill is sketched after this list).
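One lightweight way to catch a failing master early is to probe each API server's /healthz endpoint from outside the cluster. The sketch below is only an illustration: the endpoint addresses are placeholders, it assumes anonymous access to /healthz is allowed (the Kubernetes default), and it skips TLS verification for brevity, which a production probe should not do.

```python
import ssl
import urllib.request

# Placeholder endpoints: substitute the real API server addresses of the cluster.
MASTER_ENDPOINTS = [
    "https://10.0.0.11:6443",
    "https://10.0.0.12:6443",
    "https://10.0.0.13:6443",
]

def master_healthy(endpoint: str, timeout: float = 5.0) -> bool:
    """Probe the API server's /healthz endpoint; anything but HTTP 200 counts as unhealthy."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # skipped for brevity only;
    ctx.verify_mode = ssl.CERT_NONE     # a real probe should trust the cluster CA instead
    try:
        with urllib.request.urlopen(f"{endpoint}/healthz", timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except OSError as exc:
        print(f"{endpoint}: probe failed ({exc})")
        return False

if __name__ == "__main__":
    for ep in MASTER_ENDPOINTS:
        print(f"{ep}: {'OK' if master_healthy(ep) else 'DOWN -- raise an alert'}")
```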
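A failover test does not need to be elaborate; even a scripted drill that takes one node out of rotation and confirms the Dev environments keep answering will catch regressions. Below is a rough sketch of such a drill, assuming the node name and URL shown (both hypothetical) and using `kubectl cordon`/`uncordon` rather than actually powering a machine off.

```python
import subprocess
import time
import urllib.request

# Hypothetical names: a node to take out of rotation and a Dev URL to watch.
NODE = "master-1"
DEV_URL = "https://dev.example.com/"

def run(cmd: list[str]) -> None:
    """Run a CLI command, echoing it first and failing loudly on a non-zero exit."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def dev_reachable() -> bool:
    """Return True if the Dev URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(DEV_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

# Cordon the node so it stops taking new work, wait, then confirm the Dev
# environments still answer while the "failed" node is out of rotation.
run(["kubectl", "cordon", NODE])
try:
    time.sleep(30)
    assert dev_reachable(), "Dev environments became unreachable during the drill"
    print("Failover drill passed")
finally:
    run(["kubectl", "uncordon", NODE])
```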
Ensuring continuous development is a critical priority for our company, and this incident reinforced the importance of resilient infrastructure to support round-the-clock workflows.