Life always exceeds your expectations, and this works both ways - positive and negative. This story is about a situation that went wrong at first but resulted in something good.
One quite big organisation (ah, screw it - it's a bank; I won't name it, but you should know this fact) ran into a problem with their message processing system and hired my team to speed it up. Another fact you should know: if you're a passionate and active person, you'd better avoid working in a bank. This type of business lives by the rule of thumb "if something works, don't touch it" (I'll post an old Russian engineering joke about this rule someday). It's not that bad a principle, but it has another side: it's a euphemism for conservatism and slow response in critical situations. The whole story began after the bank took over another bank and unexpectedly discovered that a month and a half later its interactions with external systems would double in number. Most of those interactions produce messages that arrive in IBM MQ and are then consumed by internal systems. Absolutely normal, you might say, and it would be, if not for some strange architecture decisions in the internal systems we were supposed to speed up:
- messages live in a twisted workflow: a Windows service reads them and writes them to an MS SQL database,
- after that, another service reads them from the database and tries to process them.
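So instead of a one-step service that reads a message from MQ, tries to process it and marks it as processed (in a transactional way, of course), they created another message queue on a platform that, frankly speaking, is not so good for it - MS SQL Server. What it should have looked like: IBM MQ feeding a single service that processes each message.
What we got to work with instead: IBM MQ, then a Windows service, then a SQL Server table, then a second service that does the actual processing.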
Later we were informally told why they did so: a lack of MQ admins. It was much easier (and cheaper, BTW) to manage SQL Server than IBM MQ. When I finally saw the code, I understood another reason: the guys from the bank's internal development team are good at describing business logic but know little about the thread pool, TPL internals, blocking and transactions. Yes, with all of this background they managed to make it work, but the speed... it was very, very slow, and untrustworthy too. In Russia we call such systems - working, but slow, unreliable and held together with tape - "crutches". It worried no one until the bank, as I mentioned before, acquired another one...
After a first analysis of the code we considered a lot of ideas for improving it; that was the experimenting stage. I tried:
- RabbitMQ - nice and open, robust messaging for applications, highly recommended,
- SQL Server Queue - a nightmarish thing, initially "invented" for Service Broker, hard to set up, hard to profile, highly recommended for masochists,
- an SQL DB-based queue, but done right - that was the original system's idea, but remastered.
Rabbit was rejected: if they couldn't cope with IBM MQ, they would hardly cope with RabbitMQ. And again, bringing a separate system (even one as slim and fit as Rabbit, but still requiring some minimal administration) into some organisations is a pain. SQL Server Queues (for Service Broker) are a painful thing too, so we chose to rebuild things on the basis of an SQL table queue. The zest: it's quite common to use a table as a queue - I googled a lot of solutions, but found none that was both failsafe (stateful) and able to handle multiple consumers. In the following posts I'll tell in detail how we managed to make it thread-safe, failsafe and scalable. Consider this an announcement of a few not so widely known things about TPL and the SQL Server blocking subsystem.
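
For readers who have never seen the table-as-queue idea, here is a rough sketch of its general shape. It is not the solution we built for the bank (that comes in the following posts). It uses Python with SQLite purely so it can run anywhere. The table layout, the names `open_queue`, `enqueue`, `claim` and `complete`, and the `LEASE_SECONDS` value are all made up for illustration. I read "failsafe (stateful)" here as: every message carries a state, and a message claimed by a consumer that died becomes available again once its lease expires.

```python
import sqlite3
import time

LEASE_SECONDS = 30  # hypothetical lease: how long one consumer may hold a message


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS queue (
            id           INTEGER PRIMARY KEY AUTOINCREMENT,
            body         TEXT NOT NULL,
            state        TEXT NOT NULL DEFAULT 'pending',  -- pending / claimed / done
            locked_by    TEXT,
            locked_until REAL
        )
        """
    )
    return conn


def enqueue(conn, body):
    conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))


def claim(conn, consumer, now=None):
    """Atomically take the oldest available message, or return None."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so two consumers
    # cannot select the same row and then both mark it as claimed.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, body FROM queue "
            "WHERE state = 'pending' "
            "   OR (state = 'claimed' AND locked_until < ?) "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE queue SET state = 'claimed', locked_by = ?, locked_until = ? "
                "WHERE id = ?",
                (consumer, now + LEASE_SECONDS, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def complete(conn, consumer, msg_id):
    """Mark a message done; returns False if this consumer no longer owns it."""
    cur = conn.execute(
        "UPDATE queue SET state = 'done', locked_by = NULL, locked_until = NULL "
        "WHERE id = ? AND state = 'claimed' AND locked_by = ?",
        (msg_id, consumer),
    )
    return cur.rowcount == 1
```

A consumer loop would call `claim`, process the body, then call `complete`. If the consumer dies in between, the row stays `claimed` until the lease runs out and another consumer picks it up. On SQL Server the claim step is commonly written with table hints such as `UPDLOCK` and `READPAST` so that consumers skip rows locked by each other instead of waiting; that part is not shown in this sketch.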

