Xtreme Compression's Proprietary Compression Methods

ATTRIBUTE VECTOR CODING

Attribute vector coding is the centerpiece of Xtreme Compression’s proprietary technology. It is a fixed-to-variable-length row-oriented vector transform methodology for compressing database tables with row/column structure. It breaks new ground by capturing and exploiting the innumerable interrelationships, functional dependencies, and statistical correlations in structured data without first having to solve the intractable problem of explicitly identifying and defining them.

Attribute vector coding achieves unequaled compression efficacy because it systematically models data at the highest levels of abstraction, across dimensions, and across data types. That modeling is typically more than an order of magnitude more complex than the subsequent encoding of the modeled data. Because the entropy bound on any lossless coder is defined relative to its model of the data, that richer model makes attribute vector coding far less subject than conventional methods to the compressibility limits that information theory imposes.

Attribute vector coding works by first recovering a functional understanding of the systems and processes that create the source data. Using that understanding, it computes structured predictions of table record field values, and expresses both the predictions and the differences between them and the actual data values through a system of functions and residues. Those functions are designed to exploit interrelationships and statistical dependencies that are often too weak, complex, and/or numerous to be explicitly identified and defined. The functions, together with degenerated canonical Huffman codes for encoding them, decorrelate across dimensions, data types, and levels of abstraction simultaneously in order to exploit both inter-tuple and intra-tuple correlations.
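The prediction-and-residue idea can be illustrated with a deliberately small sketch. It is not Xtreme Compression's method, whose functions are not published here; it assumes a single hypothetical predictor that guesses one column from another, and it uses `None` as the residue meaning "the prediction was right". The names `fit_predictor`, `encode`, `decode`, and `entropy_bits` are invented for this example.

```python
import math
from collections import Counter, defaultdict

def fit_predictor(rows, src, dst):
    """For each value in column `src`, remember the most common value in column `dst`."""
    seen = defaultdict(Counter)
    for row in rows:
        seen[row[src]][row[dst]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in seen.items()}

def encode(rows, src, dst, predictor):
    """Replace column `dst` with a residue: None if predicted correctly, else the actual value.
    Assumption: None never occurs as a real data value."""
    coded = []
    for row in rows:
        residue = None if predictor.get(row[src]) == row[dst] else row[dst]
        coded.append(row[:dst] + (residue,) + row[dst + 1:])
    return coded

def decode(coded, src, dst, predictor):
    """Invert encode(): re-run the prediction wherever the residue is None."""
    rows = []
    for row in coded:
        value = predictor[row[src]] if row[dst] is None else row[dst]
        rows.append(row[:dst] + (value,) + row[dst + 1:])
    return rows

def entropy_bits(values):
    """0-order entropy of a sequence, in bits per value."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical table: (postal code, city). The city is almost a function of the code.
rows = [
    ("10001", "New York"),
    ("10001", "New York"),
    ("94103", "San Francisco"),
    ("94103", "San Francisco"),
    ("94103", "Daly City"),
]
predictor = fit_predictor(rows, src=0, dst=1)
coded = encode(rows, src=0, dst=1, predictor=predictor)
```

In this toy table the residue column is mostly `None`, so its 0-order entropy is lower than that of the original city column, and a simple entropy coder applied afterwards has less work to do. The predictor itself must also be stored or be re-derivable by the decoder, a cost this sketch ignores.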

Of course, theoretical advantages would all be for nought were attribute vector coding not cost-effective to use. That is why one other aspect is vital: flexibility. It is why attribute vector coding can systematically accommodate structured data regardless of data type, cardinality, skew, sparsity, or field width. That systematization minimizes the number of individual compression methods that must be developed, implemented, optimized, tested, and maintained in the delivered product, which keeps attribute vector coding, above all, practical.

REPOPULATION

Repopulation is a structural method for compressing integer sequences in hash tables and similar data structures. It works by populating table locations that would otherwise be unused with subsequences that would otherwise occupy memory.

Repopulation, unlike almost every other lossless compression method, is not a replacement scheme. Instead, it is transpositional and mechanistic, and in operation it resembles a chess-playing automaton. Repopulation combines the access speed of a low fill factor with the table compactness of a high one, thus avoiding the historical compromise between the two.
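The compromise that repopulation is said to avoid can be made concrete with a small measurement. The sketch below does not implement repopulation, whose mechanics are not described here; it only assumes a textbook open-addressing table with linear probing and reports, for a given fill factor, the average lookup cost and the number of slots left unused. The function names and the table size are choices made for this illustration.

```python
import random

def build_table(keys, size):
    """Insert distinct integer keys into an open-addressing table (linear probing)."""
    table = [None] * size
    for key in keys:
        slot = key % size
        while table[slot] is not None:
            slot = (slot + 1) % size
        table[slot] = key
    return table

def average_probes(table, keys):
    """Average number of slots inspected to find each stored key."""
    size = len(table)
    total = 0
    for key in keys:
        slot = key % size
        probes = 1
        while table[slot] != key:
            slot = (slot + 1) % size
            probes += 1
        total += probes
    return total / len(keys)

def measure(fill_factor, size=4096, seed=1):
    """Return (average probes per lookup, number of unused slots) at a fill factor."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), int(size * fill_factor))
    table = build_table(keys, size)
    return average_probes(table, keys), table.count(None)

probes_low, unused_low = measure(0.25)
probes_high, unused_high = measure(0.95)
```

At the low fill factor lookups touch few slots but most of the table is empty; at the high fill factor the table is compact but lookups probe much further. Those empty slots are the "locations that would otherwise be unused" that repopulation is described as putting to work.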

Repopulation incorporates no information-theoretic concepts, but does have a tangential connection to number theory.

SUPERPOPULATION

Superpopulation is a variable-to-variable-length algorithm targeting index tables, lists, arrays, and the like. It systematically accommodates wide local variations in data statistics. Superpopulation is sometimes used by itself but more often together with repopulation.

Superpopulation recognizes that distributions of values in access data structures generally have areas of high and low correlation. It works by classifying each such table area as a particular target type, and then applying a target type-specific encoding method to each table area.
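The classify-then-encode pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual algorithm: the two target types ("stride" and "raw") and the fixed block size are hypothetical simplifications, whereas the real method is described as variable-to-variable-length.

```python
def classify(block):
    """Pick a hypothetical target type for one block of integers."""
    if len(block) >= 2:
        step = block[1] - block[0]
        if all(b - a == step for a, b in zip(block, block[1:])):
            return "stride"  # highly correlated area: constant difference
    return "raw"             # weakly correlated area: keep values verbatim

def encode(values, block_size=4):
    """Split the sequence into blocks and encode each according to its type."""
    coded = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if classify(block) == "stride":
            coded.append(("stride", block[0], block[1] - block[0], len(block)))
        else:
            coded.append(("raw", tuple(block)))
    return coded

def decode(coded):
    """Rebuild the original sequence from the typed blocks."""
    values = []
    for item in coded:
        if item[0] == "stride":
            _, start, step, count = item
            values.extend(start + step * k for k in range(count))
        else:
            values.extend(item[1])
    return values

# Hypothetical index table with areas of high and low correlation.
index = [100, 101, 102, 103, 7, 7, 7, 7, 5, 90, 12, 44, 200, 210]
coded = encode(index)
```

Each highly regular area collapses to a start, a step, and a count, while the irregular area is stored unchanged, so the encoding adapts to local statistics instead of applying one scheme to the whole table.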

WORDENCODING

Wordencoding is a 0-order variable-to-variable-length algorithm for compressing text strings in database table record fields. It achieves compression close to the 0-order entropy of the source data, without sacrificing speed, by maximizing combined data locality over compressed record fields and access data structures.
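To show what "close to the 0-order entropy" means in practice, the following sketch treats each whitespace-separated word as a symbol, computes the 0-order entropy of the word stream, and compares it with the average length of a Huffman code over the same words. The use of Huffman coding and of whitespace tokenization are assumptions made for illustration; the text above does not say which coder or tokenizer Wordencoding uses.

```python
import heapq
import math
from collections import Counter

def zero_order_entropy(tokens):
    """0-order entropy in bits per token (each token treated as independent)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def huffman_code_lengths(tokens):
    """Return {token: code length in bits} for a Huffman code over token counts."""
    counts = Counter(tokens)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tie breaker, {token: depth so far}).
    heap = [(w, i, {tok: 0}) for i, (tok, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every token in them one level deeper.
        merged = {tok: depth + 1 for tok, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def average_code_length(tokens):
    """Average Huffman code length in bits per token."""
    lengths = huffman_code_lengths(tokens)
    counts = Counter(tokens)
    return sum(counts[t] * lengths[t] for t in counts) / len(tokens)

# Hypothetical text field values from a table column.
fields = ["north main street", "south main street", "north oak avenue", "main street"]
words = [w for f in fields for w in f.split()]
entropy = zero_order_entropy(words)
average = average_code_length(words)
```

For any source with at least two distinct symbols, a Huffman code's average length lies between the 0-order entropy and the entropy plus one bit per symbol, which is the sense in which a 0-order coder can approach the entropy without modeling context.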

Wordencoding deals explicitly with the data’s correlational structure by recognizing that redundancy in text strings exists at multiple levels of granularity simultaneously.
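A rough way to see redundancy at several granularities at once is to count how often units repeat when the same fields are viewed as whole strings, as words, and as characters. The sketch below is only a diagnostic under assumed whitespace tokenization, not part of Wordencoding itself, and `repeat_ratio` and `granularity_report` are names invented here.

```python
def repeat_ratio(units):
    """Fraction of units that duplicate another unit (0.0 means all distinct)."""
    units = list(units)
    return 1 - len(set(units)) / len(units)

def granularity_report(fields):
    """Measure repetition at three levels of granularity for the same text fields."""
    words = [w for f in fields for w in f.split()]
    chars = [c for f in fields for c in f]
    return {
        "field": repeat_ratio(fields),
        "word": repeat_ratio(words),
        "char": repeat_ratio(chars),
    }

# Hypothetical column of street names.
report = granularity_report(["main street", "main street", "oak street", "oak avenue"])
```

In this toy column some whole values repeat, more words repeat, and characters repeat most of all, so a method that looks at only one level leaves the redundancy at the others unexploited.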