Xtreme Compression's Proprietary Compression Methods
ATTRIBUTE VECTOR CODING
Attribute vector coding is a tuple-oriented vector transform method for compressing database tables. It breaks new ground by capturing and exploiting the innumerable relationships, functional dependencies, and statistical correlations in data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.
The compression process comprises three distinct functions that are largely independent but operate concurrently. They are the ingestion of source data at the input, the production of compressed data at the output, and the management of the prediction context in between. That separation provides autonomy, a valuable asset during algorithm design and optimization.
Attribute vector coding recovers an empirical understanding of the processes that created the data and then strives to emulate them through functions embodied in software and codified in data. To that end, it captures and exploits prior knowledge regardless of whether that knowledge can be explicitly defined. From that, it computes structured predictions of field values, and expresses them through a system of functions and coefficients. Those functions, together with degenerated canonical Huffman codes that encode them, decorrelate across dimensions and data types simultaneously, thereby exploiting inter-tuple and intra-tuple interrelationships.
Of course, the theoretical advantages would all be for nought were attribute vector coding not cost-effective to use. That is why one other aspect is vital: flexibility. It is why attribute vector coding, when used together with wordencoding, can systematically accommodate table data regardless of data type, cardinality, skew, sparsity, or field width. That systematization minimizes the number of discrete methods needed to be implemented, optimized, and maintained, making attribute vector coding, above all, practical.
Repopulation is a structural method for compressing monotonic integer sequences in hash tables and similar data structures. It populates table locations that would otherwise be unused with subsequences that would otherwise occupy memory.
Repopulation, unlike almost every other lossless compression method, is not a replacement scheme. Instead, it is transpositional and mechanistic, works like a chess-playing automaton, and employs no information-theoretic concepts. Repopulation combines the access speed of a low load factor with the table compactness of a high one, avoiding that historical compromise.
Superpopulation is a variable-to-variable-length algorithm targeting index tables, lists, arrays, and the like. It systematically accommodates wide local variations in data statistics. Superpopulation is used by itself and together with repopulation.
Superpopulation recognizes that distributions of values in access data structures generally have areas of high and low correlation. It works by classifying each such area as a particular target type, and then applies a target type-specific encoding method to each.
Wordencoding is a 0-order variable-to-variable-length algorithm for compressing text strings in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed by maximizing combined data locality over compressed record fields and access data structures. Wordencoding deals explicitly with the data's correlational structure by recognizing that redundancy in text strings exists at multiple levels of granularity.