Unexpected host reboots in us-west-3 region
Incident Report for Lambda Labs GPU Cloud
Resolved
This incident has been resolved.
Posted Feb 13, 2024 - 17:47 PST
Update
We are deploying a remediation for the discovered issues, this may cause a brief reboot of VMs.
Posted Feb 13, 2024 - 09:43 PST
Monitoring
We have identified the root cause and implemented a fix, we are monitoring the situation.
Posted Feb 12, 2024 - 18:54 PST
Identified
We believe we have identified the root cause, have implemented a fix.
Posted Feb 12, 2024 - 18:50 PST
Update
Root cause investigation is still ongoing.
Posted Feb 12, 2024 - 16:25 PST
Investigating
We're seeing hosts rebooting again, moving status back to investigating.
Posted Feb 12, 2024 - 15:32 PST
Monitoring
Change is deployed, hosts now appear to be stable, we have moved to actively monitoring them.
Posted Feb 12, 2024 - 15:19 PST
Identified
Root cause has been identified, have pushed a change to address it.
Posted Feb 12, 2024 - 15:16 PST
Update
We believe we have identified the root cause, are in the process of confirming.
Posted Feb 12, 2024 - 14:40 PST
Investigating
We are seeing issues with H100 hosts in us-west-3 rebooting, are actively investigating.
Posted Feb 12, 2024 - 14:10 PST
This incident affected: Infrastructure (Virtual Machines).