Attribute Vector Coding
Attribute vector coding is the first vector transform method for compressing multidimensional database tables. It is tuple-oriented and semantics-aware. It breaks new ground by capturing and exploiting the innumerable relationships, functional dependencies, and statistical correlations in the data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.
Attribute vector coding recovers an empirical understanding of the processes that created the data through statistical analysis, and then strives to emulate those processes through functions embodied in software and codified in data. To that end, it captures and exploits prior knowledge regardless of whether it can be explicitly identified and defined. That prior knowledge has two parts: intra-tuple correlations, from other fields within the record, and inter-tuple correlations, from fields in other records. From those, structured predictions of field values are computed and expressed, together with differences between them and actual values, through a system of functions and coefficients. Those functions, and the scheme for encoding them into symbols, decorrelate across dimensions and data types simultaneously.
Of course, the advantages of high-level abstraction would all be for nought if attribute vector coding were not cost-effective to use. That is why one other aspect is vital: flexibility. It is what allows attribute vector coding, when used together with wordencoding (described below), to systematically accommodate all table fields regardless of data type, cardinality, skew, sparsity, field order, or field width. That systematization brings to a minimum the number of discrete methods having to be implemented, optimized, and maintained to compress table data, thereby helping to make attribute vector coding, above all, practical.
Wordencoding
Wordencoding is a 0-order (context-independent) variable-to-variable-length algorithm for compressing text strings, hereinafter called words, in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed. It does that by providing an efficient way to maximize effective combined data locality over three areas: the compressed record fields, the lexicons holding the words, and their access data structures.
Wordencoding recognizes that word frequencies will represent great statistical disparity, and it accommodates statistical disparity with algorithmic diversity – the systematic use of multiple techniques. Further, by decomposing words into fragments and compressing each fragment separately, it recognizes that uncommon words are less compressible than common ones, and that they often consist of two more common and therefore more compressible fragments. Doing that deals explicitly with the structure and statistics of the data by recognizing that redundancy in text strings exists at different levels of granularity.
Wordencoding uses two lexicons constructed by sectioning by word length and storing all lexicon words of each length together. The motivation is the observation that the ratio of discrete words to discrete word lengths will always be high. The benefits are twofold: reducing the lengths of the lexicon indexes stored in database table record fields after compression, and allowing access through data structures compact enough to be kept in cache memory. Thus, exploiting word length statistics is significant to the compression and the decompression.
Repopulation
Repopulation is a structural method for compressing the common configuration of access data structures consisting of separate hash and collision tables, and handling collisions – sequences of database record numbers – through external chaining. It populates hash table locations that would otherwise be empty with parts of collision strings that would otherwise occupy memory.
Unlike almost every other compression method, repopulation is not a replacement scheme; it draws on no information-theoretic concepts. Instead, repopulation is mechanistic; it operates like a chess-playing automaton. It is transparent, imposes no new requirements on hash functions, preserves fast real-time random access, and does not degrade collision statistics. Repopulation simultaneously achieves the access speed of a low hash table load factor and the table compactness of a high one, avoiding the historical compromise of speed with size.
Superpopulation
Superpopulation is a variable-to-variable-length algorithm that compresses index tables, lists, arrays, zerotrees, and similar data. It systematically accommodates wide local variations in data statistics. Superpopulation may be used by itself or in conjunction with repopulation.
Superpopulation recognizes that distributions of values in access data structures are often far from random. They can be highly correlated due to the nature of the processes that generated the data, and often have areas of high and low correlation. In a typical index table application, superpopulation decomposes each collision string into a sequence of adjacent substrings, each classified as one of two distinct target types or as an incompressible substring. Each target is then compressed using one of two target type-specific encoding methods.
|