Some background:
- Large, highly available, distributed vRealize Automation 6.2.2 environment spanning 3 datacenters
- 2 DEM workers per datacenter
- 2 proxy agents per datacenter
- 1 DEM Orchestrator
The problem:
Last night a customer asked for 18 VMs in a single request. They received 14 of them. The other 4 changed to state 'BuildingMachine', but our workflow stubs never kicked off. Shortly afterwards the audit log shows their state changing to 'Disposing' and finally 'Finalized'. Our hunch is that the number of concurrently running workflows caused timeouts, but when I looked through the logs on the DEM workers I couldn't find anything to indicate that, and no error was given for why the 4 machines failed. The request itself returned a status of failed and provided the customer with the 14 successfully provisioned VMs.
So my question is: how can I track down what occurred here when there was no helpful error and I cannot find anything in the logs? Am I looking in the wrong place (the DEM worker logs)? We are thinking that we may need to scale out the number of DEM workers in our environment to make sure this does not happen again, but before we do that it would be nice to know whether concurrent workflow executions were actually the issue we ran into.
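To make the log search less hit-or-miss, one option is to copy the logs from every DEM worker, the DEM Orchestrator and the proxy agents into one directory and scan them all for the names of the 4 failed machines. Below is a minimal sketch of that idea, not anything vRealize Automation provides: the directory name, the `.log` extension, the plain-text encoding and the machine names are all assumptions to replace with real values.

```python
from pathlib import Path


def find_mentions(log_dir, needles):
    """Return (file name, line number, text) for each log line mentioning a needle."""
    hits = []
    # Assumes the collected logs are plain text files ending in .log
    for path in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log contains odd bytes
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(needle in line for needle in needles):
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Hypothetical directory and machine names; substitute the real ones.
    failed_machines = ["vm-015", "vm-016", "vm-017", "vm-018"]
    for name, number, text in find_mentions("collected_logs", failed_machines):
        print(f"{name}:{number}: {text}")
```

If a failed machine name never shows up in any worker log, that would at least suggest the workflow was never picked up by a DEM worker at all.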
If anyone has experience tracking down this kind of issue, any pointers would be helpful. Thanks!