SQL Server clustered columnstore Tuple Mover
December 2nd, 2013

The updateable clustered columnstore indexes introduced with SQL Server 2014 rely on a background task called the Tuple Mover to periodically compress deltastores into the more efficient columnar format. Deltastores store data in the traditional row mode (they are B-Trees) and as such are significantly more expensive to query than the compressed columnar segments. How much more expensive? They are equivalent to storing the data in an uncompressed heap and, due to their small size (max 1048576 rows per deltastore rowset), they get little traction from parallelism and from read-aheads. It is important for your upload, initial seed and day-to-day ETL activities to achieve a healthy columnstore index, meaning no deltastores or only a few deltastores. To achieve this desired state of a healthy columnstore it is of paramount importance to understand how deltastores are created and removed.
How are Deltastores created
Deltastores appear through one of the following events:
- Trickle INSERT: ordinary INSERT statements that do not use the BULK INSERT API. That includes all INSERT statements except INSERT ... SELECT ....
- UPDATE, MERGE statements: all updates are implemented in clustered columnstores as a delete of the old data and an insert of the new (modified) data. The insert of new data as a result of data updates will always be a trickle insert.
- Undersized BULK INSERT operations: an insufficient number of rows inserted using the BULK INSERT API (that is, using IRowsetFastLoad, the SQL Server Native Client ODBC bulk operation extensions, or the .NET managed SqlBulkCopy class) or INSERT ... SELECT ... statements. These BULK INSERT APIs are often better known by the various tools that expose them, like bcp.exe or the 'fast-load' SSIS OleDB destination. Insufficient rows means that the number of rows submitted in a batch is too small to create a compressed segment in a partition. The exact number is, I think, around 100000 rows. Below this threshold, even when the BULK INSERT API is used, the rows will be inserted as a deltastore. Above 100000 a compressed segment is created, but do note that a clustered columnstore consisting of 100k-row segments will be sub-optimal. The ideal batch size is 1 million rows, and your ETL process should strive to achieve this batch size per partition per thread (ie. each inserting thread should insert 1 million rows for each partition it inserts into). A quick way to verify what a load actually produced is shown in the sketch right after this list.
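To check whether a load landed as compressed segments or as deltastores, you can query sys.column_store_row_groups. This is just a minimal sketch; dbo.FactSales is a hypothetical table name, substitute your own:

select object_name(object_id) as table_name,
    partition_number,
    row_group_id,
    state_description,   -- OPEN/CLOSED deltastores vs. COMPRESSED segments
    total_rows,
    size_in_bytes
from sys.column_store_row_groups
where object_id = object_id(N'dbo.FactSales')
order by partition_number, row_group_id;

Rowgroups with a state_description of COMPRESSED are columnar segments; OPEN and CLOSED rowgroups are deltastores.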
Of course, the business process may simply not have enough data in an ETL pass to create the desired 1 million rows batch size, or even the minimum 100k rows. That is fine, rows will be saved as deltastores and the deltastores will be compressed later, when they fill up (reach the max size of 1048576 rows). What is important though is that during initial population (seeding) the upload process must create healthy clustered columnstores. The best option is to upload the data into a row-mode structure (a heap or a B-Tree) and then issue a CREATE CLUSTERED COLUMNSTORE INDEX statement to build the clustered columnstore. This option is the best because only CREATE INDEX on existing data will create appropriately relevant (ie. useful) global dictionaries for the compressed segments. Having good global dictionaries reduces the size of compressed segments significantly, for improved query performance. If you're afraid to run CREATE CLUSTERED COLUMNSTORE INDEX on a large data set due to possible log growth issues, remember two things: CREATE INDEX is a prime candidate for minimal logging, and a clustered columnstore index will not generate a huge row-by-row log even in fully logged mode, because it only writes compressed data.
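As a minimal sketch of this recommended seeding pattern (the table, column and index names are hypothetical):

-- stage the seed data in a plain row-mode heap first
create table dbo.FactSales (
    SaleDate date not null,
    StoreId int not null,
    ProductId int not null,
    Quantity int not null,
    Amount money not null);

-- ... bulk load the seed data here (bcp.exe, SSIS fast-load, INSERT ... SELECT ...) ...

-- then build the columnstore over the existing data; this is the step
-- that creates global dictionaries relevant for the actual data
create clustered columnstore index cci_FactSales on dbo.FactSales;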
Global dictionaries are created only during index build
If you cannot upload the initial data in row mode and then create the clustered index, the alternative is to BULK INSERT into an empty clustered columnstore. This will result in a worse-quality columnstore because global dictionaries cannot be built; only local dictionaries can be used. This, in general, results in worse compression. Of course the compression will still be an order of magnitude better than row mode, but it will probably not be as good as what a fresh index build could achieve. But the bigger problem you'll have with the initial population of an empty clustered columnstore is achieving the desired 1 million rows per partition per thread batch size. Think what that means in terms of the SSIS DefaultBufferSize and DefaultBufferMaxRows settings!
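For illustration, here is roughly what that looks like with the T-SQL BULK INSERT statement rather than SSIS; the table name and file path are hypothetical, and the point is a BATCHSIZE matching the ideal 1 million rows:

-- load straight into the empty clustered columnstore, asking for full-size batches
bulk insert dbo.FactSales
from 'c:\temp\factsales.dat'
with (batchsize = 1048576, tablock);

Anything loaded in batches well below the ~100k-row threshold will land in deltastores instead of compressed segments.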
The absolute worst option for initial population of data is to upload it in small batches into an existing columnstore index, let it settle into deltastores, and let the Tuple Mover compress it.
How are Deltastores compressed
When a deltastore fills up (ie. it reaches the max size of 1048576 rows) it is closed and becomes available for the Tuple Mover to compress. The Tuple Mover will create big, healthy segments, true. But it is not designed to be a replacement for an index build. What does that mean?
create event session cci_tuple_mover
on server
add event sqlserver.columnstore_tuple_mover_begin_compress,
add event sqlserver.columnstore_tuple_mover_end_compress
add target package0.asynchronous_file_target
(SET filename = 'c:\temp\cci_tuple_mover.xel', metadatafile = 'c:\temp\cci_tuple_mover.xem');
The Tuple Mover doesn’t expose much with regard to monitoring, but it does expose two XEvents for when it starts to compress a segment and when it completes a segment compression. When I create
a couple of ready deltastores and wait for the Tuple Mover to compress them, this is what I see in the XEvents session:
with e as (select cast(event_data as xml) as x
from sys.fn_xe_file_target_read_file ('c:\temp\*.xel', 'c:\temp\*.xem', null, null))
select x.value(N'(//event/@name)[1]', 'varchar(40)') as event,
x.value(N'(//event/@timestamp)[1]', 'datetime') as time
from e
order by time desc;
event time
---------------------------------------- -----------------------
columnstore_tuple_mover_end_compress 2013-12-02 11:26:25.047
columnstore_tuple_mover_begin_compress 2013-12-02 11:26:20.247
columnstore_tuple_mover_end_compress 2013-12-02 11:26:05.040
columnstore_tuple_mover_begin_compress 2013-12-02 11:26:00.183
You can see that the Tuple Mover takes about 5 seconds to compress a row group, which is the ballpark time span you should expect too. Wider rows (many columns) will take longer. It cannot benefit from parallelism. It is also important to notice that the Tuple Mover does nothing for about 15 seconds between the two rowgroups it compressed. This is not waiting for some resource, it is literally how it works: it compresses one rowgroup at a time per table, and then it sleeps. In addition, the Tuple Mover only wakes up and looks for 'work' every few minutes. This is intentional behavior, not some poor resource utilization problem. The Tuple Mover is not trying to race to keep up with a massive bulk insert. For such situations, the ETL pipeline should create the rowgroups straight into compressed form, or you should use index rebuild/reorganize operations. The Tuple Mover works at a pace that is designed to keep up with a normal rhythm of business trickle INSERTs, restricting itself to minimal resource consumption. Had the Tuple Mover been an aggressive task willing to commit all cores to 5-6 seconds of 100% CPU, rinse-cycle-and-repeat, one could find oneself unexpectedly facing a non-responsive box right when the resources are needed for who knows what important task. In fact, no matter how aggressive or shy the task is, there will always be someone to complain that it is the wrong choice for their particular workload. Ghost Cleanup, anyone?
Running the Tuple Mover on Demand
If you do operations that end up with a massive number of deltastores, the Tuple Mover will only catch up at a pace of 5-6 segments per minute, maybe even less. If your index has hundreds or even thousands of CLOSED deltastores then you should consider an index REBUILD operation. Waiting for the Tuple Mover to catch up would take hours, during which queries against this 'columnstore' will be slow. After all, a columnstore consisting of nothing but hundreds of deltastores will not benefit from any of the columnstore optimizations (no compression, no reading of only the relevant columns, no segment elimination) and the data will have to be pushed up into query execution as 'fake' columnar format, since the deltastores are a storage concept not understood higher in the execution chain.
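To gauge how much work is pending, a sketch like the following (again against a hypothetical dbo.FactSales table) counts the CLOSED deltastores waiting for the Tuple Mover:

-- count CLOSED deltastores waiting for the Tuple Mover, per partition
select partition_number,
    count(*) as closed_row_groups,
    sum(total_rows) as rows_waiting
from sys.column_store_row_groups
where object_id = object_id(N'dbo.FactSales')
    and state_description = N'CLOSED'
group by partition_number;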
ALTER INDEX … REORGANIZE on a columnstore will force a full speed pass of Tuple Mover
So what if you end up with hundreds or thousands of CLOSED deltastores? As I explained, the best option is to rebuild the index, because it will build new global dictionaries, more relevant for the actual data present in the table. The next best thing is REORGANIZE. REORGANIZE on a clustered columnstore index has the semantics of "run the Tuple Mover now on this table, at full speed". When REORGANIZE is run the engine has a clear green light to use all the resources at its disposal to compress the CLOSED segments, and it will proceed accordingly.
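Both options are plain ALTER INDEX operations; as a sketch against the hypothetical dbo.FactSales table and its cci_FactSales index:

-- best: rebuild, which also builds fresh global dictionaries from the current data
alter index cci_FactSales on dbo.FactSales rebuild;

-- next best: reorganize, which runs the Tuple Mover now, at full speed
alter index cci_FactSales on dbo.FactSales reorganize;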
BTW, did you know the columnstore index offline rebuild operations are actually semi-online? They only acquire S locks on the index during the rebuild phase and then escalate to a short SCH-M lock at the end, when they swap the old and new index. Queries can read the old index during the rebuild, but no updates/deletes/inserts are allowed.
Awesome stuff, thank you so much for all these details!
It is worth a whitepaper. Even more – there should be a whitepaper before RTM. You explain here concepts that everybody who considers a clustered columnstore index must know before starting to design an application. The impact here is much stronger than using/not using minimal logging, for example – which received extensive coverage in a white paper (about loading data into SQL Server) and several articles on the Storage Engine team blog.
Regarding the comparison to Ghost Cleanup – there is a huge difference. A big deltastore directly affects user experience – queries take much longer. Ghost Cleanup doesn't directly affect user experience. Of course, if you disable it for good… but who does that? Personally I've used the trace flag for disabling Ghost Cleanup only twice in my life – and in both cases it was due to corruption issues, not performance. Blogged about the most complicated of those cases:
http://sqlblog.com/blogs/michael_zilberstein/archive/2013/03/22/48339.aspx
I would still like to see the Tuple Mover be more aggressive in certain cases. Not always, of course – I agree with you. But when the server is idle, for example. And/or as a result of some trace flag. Conceptually I don't think that forcing the user to REORGANIZE or REBUILD the index as part of some scheduled load task is a good approach. Those operations used to be a DBA task – not the db or SSIS programmer's.