SQL Server Clustered Columnstore Indexes at TechEd 2013June 11th, 2013
Now that the TechEd 2013 presentations are up online on Channel9 you can check out Brian Mitchell’s session What’s New for Columnstore Indexes and Batch Mode Processing. Brian does a great job at presenting the new updatable clustered columnstore indexes and the enhancements done to the vectorized query execution (aka. batch mode). Besides the TechEd presentation there is also another excellent resource available online right now for your education on the topic: the SIGMOD 2013 paper Enhancements to SQL Server Column Stores. Besides the obvious updatability, this paper cites some more improvements that are available in clustered columnstores:
- Improved Index Build
- Sampling Support
- Bookmark Support
- Schema modification support
- Support for short strings
- Mixed Execution Mode
- Hash Join support
- Improvements in Bitmap filters
- Archival support
It should be no surprise to anyone studying columnar storage that the updatable clustered columnstores coming with the next version of SQL Server are based on deltastores. I talked before about the SQL Server 2012 Columnstore internals and I explained why the highly compressed format that makes columnar storage so fast it also makes it basically impossible to update in-place. The technique of having a ‘deltastore’ which stores updates and, during scans, merge the updates with the columnar data is not new and is employed by several of the columnar storage solution vendors.
Deltastores are ordinary B-Trees that store uncompressed row groups of the clustered columnstore
Columnstores introduce a new unit of organization, a row-group. A row-group is a logical unit that groups up to 220 rows (about 1 million rows). In SQL Server 2012 the row-groups where implicit and there was catalog view to show them. As you can see in Brian’s presentation SQL Server 14 adds a new catalog view: sys.column_store_row_groups. This catalog view show the state of each row group for all columnstores (including non-clustered ones). Updatable clustered columnstores can show the row groups in COMPRESSED state or in OPEN/CLOSED state. The OPEN/CLOSED row groups are deltastores (yes, there could be multiple deltastores per columnstore, see the above mentioned SIGMOD 2013 paper). OPEN row groups are ready to accept more inserts while CLOSED row groups have filled up and are awaiting compression. The structure of a deltastore is explained in the SIGMOD paper:
A delta store contains the same columns as the corresponding column store index. The B-tree key is a unique integer row ID generated by the system (column stores do not have unique keys).
If you wonder what a unique integer row ID generated by the system actually is, remember how uniqueifier columns work. Deltastores are managed entirely by the engine, there is no DDL to control the creation and deletion of deltastores. The engine creates a new deltastore whenever it needs one to handle inserts, closes them when full (have 1 million rows) and a background process called the Tuple Mover compresses this closed deltastores into columnar storage format.
When handling deltastores the columnar storage advantages are dimmed. Deltastores are row mode storage so the entire row has to be read, not only the column(s) of interest. Segment elimination does not occur for deltastores since the deltastores do not have metadata about min and max values contained inside each column. Parallel scans will distribute the deltastores among the threads so that multiple deltastores are scanned in parallel, but there is no attempt for parallelism inside a single deltastores. At maximum size of 1 million rows they’re simply too small to justify the engineering complications of handling parallelism inside the deltastores. All this is explained in the SIGMOD paper:
Parallel scans assign each delta store to a single thread of execution. A single delta store is too small to justify scanning in parallel but multiple delta stores can be scanned in parallel. Scanning delta stores is slower than scanning data in columnar format because complete records have to be read and not just the columns needed by the query.
The background Tuple Mover process is responsible to compressing full deltastores. The Tuple Mover is an online operation, it does not prevent data reads from the deltastores being compressed. This is described in the SIGMOD paper:
The Tuple Mover reads one closed delta store at a time and starts building the corresponding compressed segments. During this time scans continue to see and read the delta store. When the Tuple Mover has finished compressing the delta store, the newly created segments are made visible and the delta store is made invisible in a single, atomic operation. New scans will see and scan the compressed format. The Tuple Mover then waits for all scans still operating in the delta store to drain after which the delta store is removed.
Concurrent deletes or updates are blocked while the Tuple Mover compresses a deltastore. Concurrent Inserts are not blocked by the Tuple Mover. As for the ‘background process’ part of the Tuple Mover the closest analogy is the Ghost Cleanup process. Veterans know that Ghost Cleanup is never tuned right for my job, is always either too aggressive for some users or too slow for others. Will Tuple Mover suffer from the same problems? I don’t expect it to, primarily because the unit of work is big. It takes time to accumulate 1 million rows in single row by row inserts.
The removal of compressed deltastores is very efficient as is basically a deallocation (as efficient as a DROP). This is important so that the Tuple Mover does not generate exaggerate logging during compression.
The deleted bitmap is another B-Tree associated with the clustered columnstore. There is only one deleted bitmap for the entire columnstore, it covers all the row-groups (all segments). Only compressed segments use a deleted bitmap. The DELETE operation is in effect an insert into the deleted bitmap, the row-group and tuple number of the deleted row is inserted into the deleted bitmap. Scans (reads) honor the deleted bitmap by filtering out any row (tuple) marked as deleted. It is recommendable that clustered columnstore indexes that have seen a large number of deletes to be rebuild in order to restore the ‘health’ of their segments by removing the deleted rows. UPDATE operations on clustered columnstores are always split updates, meaning the Query Optimizer will create a plan that contains a delete and an insert for any UPDATE.
Delete bitmaps do not cover deltastores. As B-Trees, the deltastores support direct delete so they implement the delete by removing the deleted row.
Large bulk insert operations do not insert rows into delta stores but convert batches of rows directly into columnar format. The operation buffers rows until a sufficient number of rows has accumulated, converts them into columnar format, and writes the resulting segments and dictionaries to disk. This is very efficient; it reduces the IO requirements and immediately produces the columnar format needed for fast scans. The downside is the large memory space needed to buffer rows.
For clustered columnstores bulk insert performance is critical, as is the bread and butter of the data warehousing ETL scenarios. You are going to have to give attention to your ETL pipeline and drive to achieve the optimal directly compressed format. This requires the bulk insert to be able to upload close to 1 million rows per partition in each batch. If the bulk insert uses too small batches then the result will be sub-optimal deltastores instead of the optimized compressed segments. With SSIS you will have to pay attention to the data flow buffer size as the default 10MB is way too small for achieving efficient columnstore bulk inserts. Of course there will be cases when there simply isn’t enough data to upload and in such cases the bulk insert will result in a deltastore instead of compressed columnstores. The deltastores will be left OPEN and subsequent bulk insert operations will reuse them and fill them up, close them and leave them for the Tuple Mover to compress them. While achieving the directly compressed format during bulk insert is desirable, you should not stretch your ETL and business logic out of the way just to achieve it. Going through the intermediate deltastore will remedy itself automatically once sufficient data accumulates.
INSERT … SELECT … targeting clustered columnstore tables is a bulk insert operation
Enjoy! SELECT … INTO has always been a bulk (and minimally logged) operation, but that can only create a row-mode heap. With INSERT … SELECT being now optimized for clustered columnstore indexes to use the bulk insert API, your ETL pipeline can construct the switch-in data directly into columnar storage format in one single pass, w/o having to resort to a rebuild. This is, again, described in the SIGMOD paper.
Trickle inserts are all inserts that do not come through the bulk insert API. INSERT INTO ... VALUES ... statements are trickle inserts. Trickle inserts are handled always by a deltastore and they can never create directly compressed data. UPDATE statements (as well as MERGE) result in split updates (delete and insert) and will insert in trickle mode.
Clustered columnstore index build is smart. It solve several problems in new ways:
- Relevant Dictionaries
- Columnar storage columns use a global dictionary, shared by all segments, and local dictionaries shared by specific segments. The more relevant the global dictionary (the more actual data values are covered by it) the better the data compression achieved, as the secondary dictionaries are smaller or even not needed. In SQL Server 2012 the global dictionary was just the first dictionary built, so it could contain skewed data resulting in poor relevance. With clustered columnstore index build the entire data is first sampled ina first stage, global dictionaries are built for all columns that requires them, and then the build proper starts. This creates much better global dictionaries and result in significant storage (compression) improvement.
- Minimize blocking
- This problem is not specific to clustered columnstore indexes: during an offline index rebuild, there is no reason to block reads. Yes, there is a risk of deadlock at the end of the index rebuild, but there are ways to solve that problem. With clustered columnstore indexes offline rebuild operations are semi-online, meaning reads are allowed but updates are blocked.
- Workload Variation
- Columnstore index build is very memory intensive. Traditional execution model determines the DOP at beginning of execution and then the query executes with the given DOP. With the improved columnstore build the actual execution DOP varies as the build progresses and the build process can actively and voluntarily reduce it’s DOP (by ‘parking’ execution threads) in order to adjust to low memory conditions.
Required by the aforementioned two-stage index build in order to create relevant global dictionaries. Similar to how heap and B-Tree sampling selects entire pages to sample, columnstores can select row groups (segments) to sample.
Required to implement split updates. The heap bookmark is a physical locator (file_id:page_id:slot_id), the B-Tree bookmark is the actual key value and the clustered columnstore bookmark is the (row_group_id:tuple_id) pair. In deltastores the tuple_id is the value of the uniquifier column so the bookmark is located efficiently with a seek operation. Having bookmarks also frees the optimizer to explore additional options like eager spool.
Add column, alter column, drop column are supported by clustered columnstore indexes.
Short strings, like US state abbreviations, are frequent in fact tables and previously only dictionary encoding was available for them, which is inefficient. Now short strings can be encoded by value, w/o requiring a dictionary.
Queries can now execute in a mixture of batch-mode and row-mode stages. Special adapter operators can exchange rows into batches and vice-versa. This gives the optimizer freedom to mix batch-mode with unsupported operators w/o having to resort to reverting the entire query to row-mode.
The hash join is the preferred join of data warehousing workloads, dominated by large data sets and aggregations. The improved batch-mode hash join handles inner, outer, semi and anti-semi joins, ie. the entire spectrum of possible join operators (don’t confuse them with the join syntax). I recommend going over the relevant chapters on the SIGMOD paper for this topic for details.
Simply put, add another layer of compression using XPress8 (a variant of LZ77) over the already compressed segments and dictionaries. Recommended for cold data, can be applied per partition and can give an additional up to 66% reduction in size (YMMV).