Error Handling in Service Broker procedures

October 31st, 2007

Error handling in T-SQL traditionally has been a sort of misterious voo-doo for most developers, with it’s mixture of error severities, SET settings like XACT_ABORT, ARITHABORT or ARITHIGNORE and the options to handle the error on the server or at the client. For a long time now the best resource I know on this subject was, and perhaps still is, Erland Sommarskog set of articles at http://www.sommarskog.se/error-handling-I.html and http://www.sommarskog.se/error-handling-II.html. But the introduction of Service Broker activated procedures adds some new issues to consider when designing your application and this post is about what these issues are these and how to best cope with them. 

@@ERROR or TRY…CATCH?

Well first things first, lets get this out of our way: check @@error or use BEGIN TRY… BEGIN CATCH? In my opinion, the advantages offered by the TRY… CATCH blocks are overwhelming and you should steer away from checking @@ERROR after each statement. Even if you may find some immediate scenario that would seem simpler using an explicit check for @@error, writing code using TRY…CATCH blocks pays of on the long run. For me the most important differentiatin factor between the two is actually the human factor: code writen with @@error checks is difficult to read and impossible to maintain. Keeping the discipline in writing code to check for @@error after each statement is tedious and errorprone. @@error checks are not ‘programming in the future tense‘ as Scott Meyers would put it: future versions in the database engine could introduce behavior changes that raise new errors your application is not prepared to deal with. The TRY…CATCH blocks solve so many of these issues that they are clearly the winner in my book.

Service Broker Error Handling

Why is Service Broker error handling something special? Shouldn’t the normal common wisdom of writing Transact-SQL apply to Service Broker as well? Well yes, but as it turns out there is a twist to the story. Service Broker is an integrated part of the SQL Server database engine and it follows all the normal Transact-SQL language parsing, compilation and execution rules like any other database engine componet, including everything related to errors and exception handling. But the typical Service Broker application is deployed in a quite different fashion than the non-broker applications: it’s an activated procedure that runs in the server background threads. The first major difference is that there is no client application to surface the error to, so any error that happens will have to be dealt with by the procedure. This means that the very purpose of error handling in a Service Broker application is different than in an ordinary procedure: Service Broker error handling has to capture any error and somehow react to this error at the level of Service Broker semantics, namely conversations. Most times this means that the conversation that received the message that triggered the error has to be ended with an error so that the partener service is notified about this problem and, sometimes, some compensation logic has to be run to undo the effects of this conversation had so far. The partener will receive the error message and in turn run its own compensation to complete (with error) the business transaction represented by the conversation..

Lets roll a quick example. Supose we have a trivial service whose role is to insert received message content into a target table (similar to what an audit service perhaps would do). Lets go ahead and build our DDL objects:

use tempdb;

go

       

– A table for the purpose of the demo

create table [target_table] (

      [id] int not null primary key,

      [key] nvarchar(256),

      [value] nvarchar(256));

go

       

– Our initiator service

create queue [sender_queue];

create service [sender_service]

      on queue [sender_queue];

go

       

– Our target service

create queue [target_queue];

create service [target_service]

      on queue [target_queue]

      ([DEFAULT]);

go

       

create procedure [usp_target]

as

begin

      set nocount on;

      declare @handle uniqueidentifier;

      declare @messageType sysname;

      declare @payload xml;

     

      begin try

            begin transaction;

            waitfor (receive top (1)

                  @handle = conversation_handle,

                  @messageType = message_type_name,

                  @payload = cast(message_body as xml)

                  from [target_queue]), timeout 1000;

            if (0 != @@rowcount)

            begin

                  if (@messageType = ‘DEFAULT’)

                  begin

                        insert into [target_table] (

                              [id], [key], [value])

                              select e.value(‘@id’, ‘int’) as [id],

                                    e.value(‘@key’, ‘nvarchar(256)’) as [key],

                                    e.value(‘@value’, ‘nvarchar(256)’) as [value]

                                    from @payload.nodes(‘//payload/entry’) p(e);

                  end;

                  end conversation @handle;

            end

            commit;

      end try

      begin catch

            declare @error int, @message nvarchar(4000);

            select @error = ERROR_NUMBER(), @message = ERROR_MESSAGE();

            end conversation @handle with error = @error description = @message;

            commit;

      end catch;

end

go

 

       

– attach the procedure to the target_queue as an activated procedure

alter queue [target_queue]

      with activation (

            status = on,

            max_queue_readers = 1,

            procedure_name = [usp_target],

            execute as owner);

go

 

As you can see, the error handling in the stored procedure is handled by a BEGIN TRY/BEGIN CATCH block, and the error handling consist of ending the conversation with an error, sending back the error number and message to the original sender. Of course, in a real application it would probably make sense to clearly define error numbers and messages reported by the target service, as is not really useful to send back to the sender some internal database engine codes and messages. So now lets go ahead and test our service:

– send a message to the target service

declare @handle uniqueidentifier, @payload xml;

begin transaction;

begin dialog conversation @handle

      from service [sender_service]

      to service ‘target_service’, ‘current database’

      with encryption = off;

       

select @payload = N‘<payload><entry id=”1″ key=”A” value=”B”/></payload>’;

send on conversation @handle (@payload);

commit;

go

 

If we look into the target_table we find the entry we sent (1, A, B). The response that came back from the target service is sitting in the sender_queue and is a mundane EndDialog message, indicating the succesfull completion of the ‘assignment’.

Now if we simply run the very same SEND example, we’re gonna trigger an error, because the same values would violate the primary key constraint on the target_table. So go ahead and run this again:

– send a message to the target service

declare @handle uniqueidentifier, @payload xml;

begin transaction;

begin dialog conversation @handle

      from service [sender_service]

      to service ‘target_service’, ‘current database’

      with encryption = off;

       

select @payload = N‘<payload><entry id=”1″ key=”A” value=”B”/></payload>’;

send on conversation @handle (@payload);

commit;

go

 

Now the sender_queue has received an error message:

<Error xmlns=“http://schemas.microsoft.com/SQL/ServiceBroker/Error”>

  <Code>2627</Code>

  <Description>Violation of PRIMARY KEY constraint ‘PK__target_table__0AD2A005′. Cannot insert duplicate key in object ‘dbo.target_table’.</Description>

</Error>

So our error handling works as we expected and the service is handling error conditions fine. So is this the end of the story?

Doomed Transactions

 

As it turns out, we’ve only skimmed the surface and we’re about to open one nasty can of worms. To proove this lets produce the one most common mistake that happens during development: a malformed payload:

– send a message to the target service

declare @handle uniqueidentifier, @payload nvarchar(max);

begin transaction;

begin dialog conversation @handle

      from service [sender_service]

      to service ‘target_service’, ‘current database’

      with encryption = off;

       

select @payload = N‘<payload><entry id=”1″ key=”A” value=”B”></payload>’;

send on conversation @handle (@payload);

commit;

go

The payload in this case is incorrect, the <entry> tag is not closed. So what happens with our message? Nothing apparently, the message simply stays in the target_queue and refuses to be processed, as a simple check using  select * from [target_queue] shows! The obvious culprit should be that message failed to be processed 5 times and the poison message detection has intervened and deactivated the target_queue so lets go ahead and check it:

select [name], is_receive_enabled from sys.service_queues

Now I reckon that as I was writing this piece, I had a big surprise to see that the queue is … not disabled! In fact I had to delete a paragraph I had already wrote,  and come back and revisit this behavior. So what happened? It turns out that the activation mechanism was actualy fooled by our erroneous message to put the queue in the NOTIFIED state, and leave it as such, as a check on the relevant DMV shows: select * from sys.dm_broker_queue_monitors. The NOTIFIED state occurs when an activated procedure was launched but the RECEIVE verb has not been run on the activated queue. But our activated procedure was not running, as a quick check on select * from sys.dm_broker_activated_tasks shows. A second check on the ERRORLOG (or the system event viewer) revealed the problem:

2007-10-31 16:31:52.57 spid52s The activated proc [dbo].[usp_target] running on queue tempdb.dbo.target_queue output the following: 'The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.'

So our procedure wad run, had hit an exception, but our exception handler was not capable of dealing with the problem. The error in the ERRORLOG indicates that we are dealing with an uncommittable transaction issue. So apparently we need to augment our error handling CATCH block to deal with such problems, and the XACT_STATE documentation shows how this can be achieved. So we’re gonna modify our CATCH block to deal with doomed transactions and we’re also gonna modify our RECEIVE code slightly not to cast to XML during the RECEIVE statement, since that apparently is fooling the activation machinery which behaves as if the RECEIVE never occured. Here is our modified procedure:

– Our modified procedure

alter procedure [usp_target]

as

begin

      set nocount on;

      declare @handle uniqueidentifier;

      declare @messageType sysname;

      declare @messageBody varbinary(max);

      declare @payload xml;

     

      begin try

            begin transaction;

            waitfor (receive top (1)

                  @handle = conversation_handle,

                  @messageType = message_type_name,

                  @messageBody = message_body

                  from [target_queue]), timeout 1000;

            if (0 != @@rowcount)

            begin

                  if (@messageType = ‘DEFAULT’)

                  begin

                        select @payload = cast(@messageBody as xml);

                        insert into [target_table] (

                              [id], [key], [value])

                              select e.value(‘@id’, ‘int’) as [id],

                                    e.value(‘@key’, ‘nvarchar(256)’) as [key],

                                    e.value(‘@value’, ‘nvarchar(256)’) as [value]

                                    from @payload.nodes(‘//payload/entry’) p(e);

                  end;

                  end conversation @handle;

            end

            commit;

      end try

      begin catch

            – Test XACT_STATE for 0, 1, or -1.

            – If 1, the transaction is committable.

            – If -1, the transaction is uncommittable and should

                be rolled back.

 

            – Test whether the transaction is uncommittable.

            if (XACT_STATE()) = -1

            begin

                  rollback transaction;

            end;

 

            – Test wether the transaction is active and valid.

            if (XACT_STATE()) = 1

            begin

                  declare @error int, @message nvarchar(4000);

                  select @error = ERROR_NUMBER(), @message = ERROR_MESSAGE();

                  end conversation @handle with error = @error description = @message;

                  commit;

            end

            end catch;

end

go

We alter the proceure, turn activation back on and… yet again nothig, the message is still sitting in the target_queue. This time though the queue was correctly disabled, we check the select [name], is_receive_enabled from sys.service_queues view and sure enough the queue is disabled. But wait a minute! I’m sure TRY/CATCH has its place, but this is not quite what we wanted to achieve, a simple common XML formatting error is taking down our service! How come?

Error severities

As it turns out, XML casting and validation errors are raised with severity 16, and this severity will always put the transaction into an uncommittable state. For Service Broker, this is quite bad, as parsing and shreding XML is a day to day operation for most service applications. Unfortunately, there’s simply no workaround for this issue, which has the implication that one cannot use the SQL Server built-in XML methods to validate the received XML programmatically, one has to ensure that the XML is valid upfront. For XML payloads fortunately there is a solution: declare the message validation on the message type and even enforce an XML schema. This will cause validation to occur prior to the message being enqueued and will ensure that the activate dprocedure has to deal only with valid input. In our case, this means to change the contract of the service to use an XML well formatted validation message:

create message type [valid_payload] validation = well_formed_xml;

create contract [validated_contract] ([valid_payload] sent by initiator);

go

 

alter service [target_service] (add contract [validated_contract]);

 

We also need to change our activated procedure to deal with the new message type name, the ‘valid_payload’ type:

            if (0 != @@rowcount)

            begin

                  if (@messageType = N‘valid_payload’)

                  begin

                        select @payload = cast(@messageBody as xml);

                        insert into [target_table] (

                              [id], [key], [value])

We can now reset the broker in tempdb (to get rif of the erroneous message that would prevent the target_queue from ever becomming enabled) and enable back the target_queue:

alter database tempdb set new_broker with rollback immediate;

go

alter queue target_queue with status = on;

go

So now we can send again the erroneous message, this time using the validated contract and message type:

– send a message to the target service

declare @handle uniqueidentifier, @payload nvarchar(max);

begin transaction;

begin dialog conversation @handle

      from service [sender_service]

      to service ‘target_service’, ‘current database’

      on contract [validated_contract]

      with encryption = off;

 

select @payload = N‘<payload><entry id=”1″ key=”A” value=”B”></payload>’;

send on conversation @handle message type [valid_payload] (@payload);

commit;

go

And, as expected, the Service Broker responds immedeately with an error on our conversation because the message did not pass the XML payload validation:

<Error xmlns=http://schemas.microsoft.com/SQL/ServiceBroker/Error>

  <Code>-9615</Code>

  <Description>A message of type ‘valid_payload’ failed XML validation on the target service.  XML parsing: line 1, character 51, end tag does not match start tag This occurred in the message with Conversation ID ‘…’, Initiator: 1, and Message sequence number: 0.</Description>

</Error>

So we were able to cope with the case of invalid XML, but how about other cases? What if the message is still valid XML, but has some invalid attributes? Ie. the payload ‘<payload><entry id=”foobar” key=”A” value=”B”/></payload>’will pass the well_formed_xml test, but will still cause the service to disable itslef, because ‘foobar’ cannot be casted to int. You could raise the bar even further by specifying an XML schema for validation, but not all of us are so confident in our XML schema knowledge to be sure that we covered everything. And besides, there are binary messages too, XML payload is not a requirement for Service Broker. The truth is that the severity of such trivial errors forcing the transaction to rollback is quite a burden on the application, since it means that pretty much any malformed message can disable a service. And there simply isn’t any buletproof defense against it.

Post Rollback corrective actions

Many developers I talked with have been at this point already and realised that they simply cannot cover all the angles on this problem so they tried another approach: what if one rolls back, but then does some corrective action. Surely we could write some code in the catch block that remedies the problem, like dequeue the message and error the conversation immedeately, without trying to process it’s payload. Usually this looks something like the following, int he BEGIN CATCH block:

            – Test whether the transaction is uncommittable.

            if (XACT_STATE()) = -1

            begin

                  rollback transaction;

                  – take corrective actions

                  declare @error int, @message nvarchar(4000);

                  select @error = ERROR_NUMBER(), @message = ERROR_MESSAGE();

                  begin transaction;

                  receive top(1) @messageBody = message_body

                        from [target_queue]

                        where conversation_handle = @handle;

                  end conversation with error = @error description = @message;

                  commit;

            end;

 

But this is riddled with problems. The moment the transaction was rollbed back, the code has absolutely no guarantee over the state of this conversation and its messages. Another instance of the activated procedure might had already picked up the very same message and tried to process it. This problem is true for any processing of this kind, it is not specific to Service Broker, but the typical Service Broker environment, with it’s activated procedures running in parallel and trying to grab the next unlocked message it is very likely to happen in production. What’s worst is that is is unlikely to happen in the test development environement, unless the test environment matches the concurrency of the production environment.

Conclusions

So it turns out error handling in Service Broker is a bit trickier than expected and presenting some challenges. Some recommendations stand valid:

  • Do use the BEGIN TRY/BEGIN CATCH blocks, they are simplifying the code tremendously
  • Use the Service Broker conversation semantics to represent business transactions, and use the END CONVERSATION … WITH ERROR to signal errors to the other service at the end of a conversation when you catch
  • Do harden your service broker contract as much as possible using XML validation schemas to catch malformed messages as early as possible, before they enter your application queues

But other thant this, dealing with errors and the problems caused by uncommittable transactions and its effects on Service Broker queues is by no means trivial, and is a subject I actually still looking for an acceptable answer.

8 Responses to “Error Handling in Service Broker procedures”

  1. Hi Remus,

    I can’t test it right now, but can a doomed transaction become “un-doomed” by rolling back to a savepoint that was taken before the error that caused the transaction to become uncommittable? If so, the solution is obvious – take a savepoint directly after reading a message from the queue, and rollback to that savepoint if an error turns out to be a poison message.

    Best, Hugo

  2. Anthony Brown says:

    I have handled errors in a different way, similar to what Hugo suggested – can you please let me know if this is a valid way of doing it.

    Basically I begin a transaction before receiving any messages off the queue. When I get a message off the queue i also store the conversation_handle. I then do processing on the message in a try..catch block. In the catch block, i rollback the transaction (which puts the message back on the queue), begin a new transaction, and then get the message off the queue again using the same conversation_handle. I then log the error in a table and send an error message back to the initiating queue.

    So, I am using code something like this

    begin transaction

    – get 1 message from the queue
    waitfor (receive top (1)
    @conversationhandle = conversation_handle,
    @messagetype = message_type_name,
    @message = cast(message_body AS xml)
    from target_queue), TIMEOUT 1000;

    – keep looping while we have a message to process
    while (@conversationhandle is not null)
    BEGIN

    begin try

    – do processing on the message
    end try
    begin catch

    – error handling
    rollback;
    begin transaction
    waitfor (receive top (1)
    @conversationhandle = conversation_handle,
    @messagetype = message_type_name,
    @message = cast(message_body AS xml)
    from target_queue
    where conversation_handle = @conversationhandle), TIMEOUT 1000;

    – log some error somewhere
    – and send an error message back to the initiating queue
    goto exit_conversation
    end catch

    exit_conversation:

    – End the conversation and commit
    END CONVERSATION @conversationhandle;
    commit;

    select @conversationhandle = null;

    BEGIN TRANSACTION;

    – lets see if there are any other messages waiting to be processed and get that if there is
    waitfor (receive top (1)
    @conversationhandle = conversation_handle,
    @messagetype = message_type_name,
    @message = cast(message_body as xml)
    from target_queue), TIMEOUT 1000;
    end
    commit;

  3. remus says:

    Unfortunately rollback to savepoints are not supported on doomed transactions. The funny story is that if you have *both* Roger’s books about Service Broker (the ‘beta’ and the ’2005′ versions) the ‘beta’ one shows how to use rollback to savepoint, because in Beta 2 this was working! By RTM time though the functionality had changed.
    As for rolling back an then receiving the same message it works only if there’s one and only one queue reader. In a concurrent environment the rolled back message is up for grabs as soon as it is unlocked and a different instance of the procedure may grab it and start processing it ‘anew’, so the risk of deactivating the queue is still there.

  4. DuCKa says:

    This blog has saved me hours! Thanks muchly :)

  5. [...] were few and far between and depended upon user input, this was not trivial.  You can see this article for a tutorial on [...]

  6. Chris says:

    Great articles! Thanks for your help.

    I came across another site that talked about message poisoning. Was I misunderstanding them or should I end the conversation before/after the rollback in the code above, as well? Thanks.

  7. remus says:

    @Chris: If you end *before*, it will be rolled back so is a no-op. If you end *after*, the conversation was already unlocked by the rollback and it can be already locked and changed by another thread. Getting poison message to be handled *automatically* is very tricky. My recommendation is to handle all cases possible in the commit path, and when faced with a true poison message that causes rollback, analyze it manually and fix the procedure to cause it to go on the commit path. Cycle, rinse repeat until all errors are properly handled.

  8. Chris says:

    Hi, Remus. I am having some issues with my deployment of service broker, and I am stumped. I am getting the error – “…The conversation handle “” is not found…”. The service broker send/receive occurs on the same instance within the same db. I am using an activation procedure from the queue. I am using a trigger on a table to start the messaging. Below is my code.

    Trigger:

    DECLARE @XML XML;
    IF EXISTS(SELECT * FROM inserted)
    BEGIN
    BEGIN TRANSACTION
    SELECT @XML = (SELECT ID FROM inserted FOR XML AUTO);
    – Send the XML records to the Service Broker queue:
    DECLARE @DialogHandle UNIQUEIDENTIFIER

    BEGIN DIALOG CONVERSATION @DialogHandle
    FROM SERVICE BrokerServiceName
    TO SERVICE ‘BrokerServiceName’
    ON CONTRACT BrokerContractName
    WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @DialogHandle
    MESSAGE TYPE BrokerMessageType
    (@XML);
    COMMIT TRANSACTION

    Activation SP:

    DECLARE @XML XML,
    @MessageBody VARBINARY(MAX),
    @MessageTypeName SYSNAME,
    @EmpID UNIQUEIDENTIFIER,
    @ConversationID UNIQUEIDENTIFIER,
    @ID UNIQUEIDENTIFIER;
    DECLARE @Queue TABLE (
    ConversationID UNIQUEIDENTIFIER,
    MessageBody VARBINARY(MAX),
    MessageTypeName SYSNAME);

    BEGIN TRY

    BEGIN TRANSACTION

    WHILE (1 = 1)
    BEGIN
    WAITFOR (
    RECEIVE conversation_handle, message_body, message_type_name
    FROM BrokerQueue
    INTO @Queue
    ), TIMEOUT 5000;
    – If no messages exist, then break out of the loop:
    IF NOT EXISTS(SELECT * FROM @Queue) BREAK;
    DECLARE Cursor1 CURSOR FAST_FORWARD
    FOR SELECT * FROM @Queue;
    OPEN Cursor1;
    FETCH NEXT FROM Cursor1
    INTO @ConversationID, @MessageBody, @MessageTypeName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
    – Let’s only deal with messages of Message Type
    IF @MessageTypeName = ‘BrokerMessageType’
    BEGIN
    SET @XML = CAST(@MessageBody AS XML);
    SELECT @EmpID = tbl.rows.value(‘@ID’, ‘UNIQUEIDENTIFIER’)
    FROM @XML.nodes(‘/inserted’) tbl(rows);

    exec somestoredprocedure @EmpID;

    END

    END CONVERSATION @ConversationID

    FETCH NEXT FROM Cursor1
    INTO @ConversationID, @MessageBody, @MessageTypeName;
    END
    CLOSE Cursor1;
    DEALLOCATE Cursor1;
    – Purge the temporary in-proc table:
    DELETE FROM @Queue;

    END

    COMMIT TRANSACTION

    END TRY

    BEGIN CATCH

    IF (XACT_STATE()) = -1
    BEGIN

    ROLLBACK TRANSACTION

    END

    IF (XACT_STATE()) = 1
    BEGIN

    declare @error int, @message nvarchar(4000);

    select @error = ERROR_NUMBER(), @message = ERROR_MESSAGE();

    end conversation @ConversationID with error = @error description = @message;

    commit transaction;

    END

    END CATCH

    Can you see anything that would cause this error? Am I doing something wrong here? Any assistance/direction you can provide would be greatly appreciated. Thank you.