Hosting workflow in WCF and concurrency matters
Recently I looked at an application that leverages a lot of the Workflow Runtime and Windows Communication Foundation bits shipped with .NET 3.0. It essentially had a state machine workflow hosted by a WCF service. Each operation invoked on the service fired an event to a local service, which in turn resumed the execution of the state machine workflow. The state machine was also considerably long running, so the SQL Server persistence service was used.
Just like any other typical WCF service, it was designed to serve multiple users concurrently. Consequently, two users could send the same request at the same time, which in turn fires the same event twice at the underlying workflow. The workflow runtime is smart here and does not allow the same workflow instance to execute on two threads in parallel. Therefore the request that gets hold of the workflow goes through, and the second one waits until the first one finishes its unit of work. But what happens if the first one moves the workflow to another state (which is what was happening in the application I'm talking about, BTW)? When the second request tries to fire the event, the workflow runtime detects that the event is not valid in the new state and throws an EventDeliveryFailedException.
From the service's perspective it's important to let the clients know what went wrong. In this case especially, a clue would help the client decide whether to retry. At this point you might be thinking that it's not a big deal, as you could catch EventDeliveryFailedException and translate it to a proper fault pretty easily. But in reality, EventDeliveryFailedException is too generic to reveal the exact cause of the problem. For example, you might get an EventDeliveryFailedException if you are:
1. Trying to fire an event to the workflow which is invalid according to the current workflow state.
2. Trying to fire an event that does not exist in the workflow (probably due to multiple versions running side by side).
3. Trying to fire an event whose event arguments are not marked as serializable (this would probably happen only when you are debugging).
4. Trying to fire an event while the workflow is owned by another thread (user) and your ownership timeout expires.
Therefore catching EventDeliveryFailedException by itself does not help. The answer lies in its InnerException property. The workflow runtime creates a distinguishable exception for each of those scenarios (at least for the scenarios I was testing) and assigns it to InnerException. For example, if the workflow could not deliver the event because of the current state of the workflow, the inner exception is a System.Workflow.Runtime.QueueException. If it was because of the ownership timeout, on the other hand, the inner exception is a System.Workflow.Runtime.WorkflowOwnershipException. Unfortunately, these exception types are marked as internal, so we cannot catch them directly. Consequently, if we want to send a more sensible fault to the client, we have to check the type names and create the appropriate fault to be returned from the service.
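The type-name check described above could look roughly like the following. This is only a minimal sketch: the EventDeliveryFailure enum and the Classify helper are hypothetical names made up for illustration, and the two type-name strings are simply the ones I observed in my tests, so they may not cover every case.

```csharp
using System;

// Hypothetical categories a service might translate into faults.
public enum EventDeliveryFailure
{
    InvalidForCurrentState, // inner exception was QueueException
    OwnershipTimeout,       // inner exception was WorkflowOwnershipException
    Unknown                 // anything else, including no inner exception
}

public static class EventDeliveryFailureClassifier
{
    // Full type names as observed in this post. They are compared as
    // strings because the post assumes the types cannot be referenced.
    private const string QueueExceptionName =
        "System.Workflow.Runtime.QueueException";
    private const string OwnershipExceptionName =
        "System.Workflow.Runtime.WorkflowOwnershipException";

    // Pass the caught EventDeliveryFailedException here.
    public static EventDeliveryFailure Classify(Exception deliveryFailure)
    {
        if (deliveryFailure == null || deliveryFailure.InnerException == null)
        {
            return EventDeliveryFailure.Unknown;
        }

        string innerTypeName = deliveryFailure.InnerException.GetType().FullName;

        if (innerTypeName == QueueExceptionName)
        {
            return EventDeliveryFailure.InvalidForCurrentState;
        }
        if (innerTypeName == OwnershipExceptionName)
        {
            return EventDeliveryFailure.OwnershipTimeout;
        }
        return EventDeliveryFailure.Unknown;
    }
}
```

In the service operation you would catch EventDeliveryFailedException, pass it to Classify, and pick the fault to return based on the result. Comparing strings is brittle (a rename in a future release would silently fall through to Unknown), which is exactly why public types would be nicer.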
After playing around with the 3.0 bits, I checked out how this scenario is handled by the Workflow Services available in the 3.5 bits. The 3.5 runtime automatically generates a generic fault saying "Operation is currently not available on the service". But I think it's always good to send custom fault messages that clients can deal with specifically in heavily concurrent environments (especially when retries are possible).
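For a hand-hosted 3.0 service like the one described here, a custom fault of that kind might be declared along these lines. Treat it as a sketch under assumptions: ConcurrencyFault, IOrderService, Approve and the ConcurrencyFaults helper are all hypothetical names invented for the example, and this is not something the 3.5 Workflow Services expose.

```csharp
using System;
using System.Runtime.Serialization;
using System.ServiceModel;

// Hypothetical fault detail; the name and members are illustrative only.
[DataContract]
public class ConcurrencyFault
{
    [DataMember]
    public string Reason;

    // Hint for the client: is it worth sending the request again?
    [DataMember]
    public bool CanRetry;
}

// Hypothetical contract. Declaring the fault lets clients catch
// FaultException<ConcurrencyFault> instead of a generic fault.
[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    [FaultContract(typeof(ConcurrencyFault))]
    void Approve(Guid workflowInstanceId);
}

public static class ConcurrencyFaults
{
    // Builds the typed fault that the service operation would throw.
    public static FaultException<ConcurrencyFault> Create(string reason, bool canRetry)
    {
        ConcurrencyFault detail = new ConcurrencyFault();
        detail.Reason = reason;
        detail.CanRetry = canRetry;
        return new FaultException<ConcurrencyFault>(detail, new FaultReason(reason));
    }
}
```

The service implementation would throw the result of ConcurrencyFaults.Create from its catch block, for example with CanRetry set to true for an ownership timeout and false when the event is no longer valid in the current state. That split is my own guess at a sensible policy, not something the runtime prescribes.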
Wrapping up this post, I would love to see those inner exception types made public so that workflow developers can write consistent exception handling code. It would also be nice to have a knob in the 3.5 Workflow Services to hook up custom faults (faults that appear in the contract) to notify clients of concurrency issues.