Faster & strong: string dictionary compression using sampling and fast vectorized decompression

String dictionaries constitute a large portion of the memory footprint of database applications. While strong string dictionary compression algorithms exist, these come with impractical access and compression times. Therefore, lightweight algorithms such as front coding (PFC) are favored in practice. This paper endeavors to make strong string dictionary compression practical. We focus on Re-Pair Front Coding (RPFC), a grammar-based compression algorithm, since it consistently offers better compression ratios than other algorithms in the literature. To accelerate compression times, we propose block-based RPFC (BRPFC) which consists in independently compressing small blocks of the dictionary. For further accelerated compression times especially on large string dictionaries, we also propose an alternative version of BRPFC that uses sampling to speed up compression. Moreover, to accelerate access times, we devise a vectorized access method, using Intel® Advanced Vector Extensions 512 (Intel® AVX-512). Our experimental evaluation shows that sampled BRPFC offers compression times up to 190 × faster than RPFC, and random string lookups 2.3 × faster than RPFC on average. These results move our modified RPFC into a practical range for use in database systems because the overhead of Re-Pair-based compression for access times can be reduced by 2 ×.


Citation style:
Could not load citation form.


Use and reproduction: