data compression

Our Proprietary Methods


Attribute Vector Coding

Attribute vector coding is the first vector transform method for compressing multidimensional database tables. It is tuple-oriented and semantic. It breaks new ground by capturing and exploiting the innumerable interrelationships, functional dependencies, and statistical correlations in data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.

Attribute vector coding recovers an empirical understanding of the processes that created the data and then strives to emulate those processes through functions embodied in software and codified in data. To that end, it captures and exploits prior knowledge regardless of whether that knowledge can be explicitly identified and defined. From that, it computes structured predictions of field values, and expresses them through a system of functions and coefficients. Those functions, together with block-decomposable degenerated canonical Huffman codes that encode them into symbols, decorrelate across dimensions and data types simultaneously.
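As a point of reference for the codes mentioned above, the following is a minimal sketch of ordinary canonical Huffman code assignment. It is not the block-decomposable degenerated variant, which is not specified here. The function name `canonical_codes` and the example code lengths are assumptions made for illustration.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords from a {symbol: bit_length} map.

    Assumes the lengths already describe a valid prefix code
    (for example, lengths produced by a Huffman tree).
    """
    if not lengths:
        return {}
    # Sort by (length, symbol) so codes of equal length are consecutive.
    ordered = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes = {}
    code = 0
    prev_len = ordered[0][1]
    for symbol, length in ordered:
        # Moving to a longer length appends zero bits to the running code.
        code <<= (length - prev_len)
        codes[symbol] = f"{code:0{length}b}"
        code += 1
        prev_len = length
    return codes


# Hypothetical code lengths for four symbols.
example = canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3})
print(example)
```

Because the codewords follow from the lengths alone, a decoder in this sketch would need only the length of each symbol's code, not the codewords themselves.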

Beyond Entropy

Beyond entropy is a compound lexical method founded on two new design principles. It can compress length-constrained text fields below the entropy of the original data model.

The technique synthesizes a data model that is representationally equivalent to the original but operates at lower entropy. It builds lexicons from the input symbols, warms one by exploiting latent symbol substructure, and assigns the symbols entropy codes using the revised statistics.

Beyond entropy reveals the presence of a technology gap between the art of data modeling and the science of information theory, the value of new principles that bridge that gap, and the benefit of exploiting latent substructure in complex data symbols. It also provides new insight into how symbols can represent objects, and into how entropy differentials can exist between the processes that generate real-world data and the generative data models that emulate them.

Repopulation

Repopulation is a structural method for compressing a common configuration of access data structures: separate hash and collision tables, with collisions – sequences of database record numbers – handled through external chaining. It populates hash table locations that would otherwise be empty with parts of collision strings that would otherwise occupy memory.

Unlike almost every other compression method, repopulation is not a replacement scheme. Instead, repopulation is permutational and mechanistic; it works like a chess-playing automaton. It draws on no information-theoretic concepts, so it can compress random data.

Repopulation is transparent, imposes no new requirements on hash functions, preserves fast real-time random access, and does not degrade collision statistics. Repopulation simultaneously achieves the access speed of a low hash table load factor and the table compactness of a high one, thus avoiding that historical compromise.
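The baseline configuration that repopulation targets can be sketched as follows. This shows only the uncompressed starting point, a hash table whose slots point into a separate collision table of chained record numbers. It does not show repopulation itself, whose details are not given here. The names `slot_of`, `build_index`, and `chain`, the CRC-32 hash, and the sample keys are all assumptions made for illustration.

```python
import zlib


def slot_of(key, table_size):
    # Deterministic stand-in hash function (an assumption, not the source's).
    return zlib.crc32(key.encode("utf-8")) % table_size


def build_index(keys, table_size):
    """Return (hash_table, collisions) for keys numbered 0..len(keys)-1.

    hash_table[slot] is the index of the newest chain entry, or None if empty.
    Each collision entry is (record_number, index_of_next_entry_or_None).
    """
    hash_table = [None] * table_size
    collisions = []
    for record_number, key in enumerate(keys):
        slot = slot_of(key, table_size)
        collisions.append((record_number, hash_table[slot]))
        hash_table[slot] = len(collisions) - 1
    return hash_table, collisions


def chain(hash_table, collisions, key):
    """Follow the external chain for key's slot, newest record first."""
    records = []
    index = hash_table[slot_of(key, len(hash_table))]
    while index is not None:
        record_number, index = collisions[index]
        records.append(record_number)
    return records


keys = ["ann", "bob", "cat", "dan", "ann"]
hash_table, collisions = build_index(keys, table_size=16)
empty_slots = hash_table.count(None)
print(empty_slots, "of", len(hash_table), "hash table slots are empty")
```

At a low load factor most slots stay empty, which is the unused space the text above says repopulation fills with parts of collision strings.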

Superpopulation

Superpopulation is a variable-to-variable-length algorithm that compresses index tables, lists, arrays, zerotrees, and similar data. It systematically accommodates wide local variations in data statistics. Superpopulation may be used by itself or in conjunction with repopulation.

Superpopulation recognizes that distributions of values in access data structures are often far from random. They can be highly correlated due to the nature of the processes that generated the data, and often have areas of high and low correlation. In a typical index table application, superpopulation decomposes each collision string into a sequence of adjacent substrings, each classified as one of two distinct target types or as an incompressible substring. Each target is then compressed using one of two target type-specific encoding methods.
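The decomposition step described above can be illustrated with a short sketch. The text does not say what the two target types are, so purely for illustration this sketch assumes two hypothetical types: runs of consecutive record numbers and runs with some other constant stride. Everything else is left as an incompressible literal. The function name `segment` and the `min_run` threshold are likewise assumptions, and no encoding step is shown.

```python
def segment(records, min_run=3):
    """Split records into adjacent (kind, values) segments.

    kind is 'consecutive' (stride 1), 'strided' (any other constant
    stride), or 'literal' (left as-is). These kinds are hypothetical.
    """
    segments = []
    i, n = 0, len(records)
    while i < n:
        end = i + 1  # exclusive end of the constant-stride run starting at i
        stride = None
        if i + 1 < n:
            stride = records[i + 1] - records[i]
            end = i + 2
            while end < n and records[end] - records[end - 1] == stride:
                end += 1
        if end - i >= min_run:
            kind = "consecutive" if stride == 1 else "strided"
            segments.append((kind, records[i:end]))
            i = end
        else:
            # Run too short to be a target: keep one value as a literal.
            if segments and segments[-1][0] == "literal":
                segments[-1][1].append(records[i])
            else:
                segments.append(("literal", [records[i]]))
            i += 1
    return segments


collision_string = [5, 6, 7, 8, 20, 31, 40, 50, 60, 99]
for kind, values in segment(collision_string):
    print(kind, values)
```

Concatenating the segment values in order reproduces the original collision string, so the decomposition itself loses nothing.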

Wordencoding

Wordencoding is a 0-order (context-independent) variable-to-variable-length algorithm for compressing text strings, hereinafter called words, in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed. It does that by providing an efficient way to maximize effective combined data locality over three areas: the compressed record fields, the lexicons holding the words, and their access data structures.
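For reference, the 0-order source entropy mentioned above is the standard Shannon entropy of the word frequencies, with each word treated independently of its context. A minimal sketch follows; the function name `zero_order_entropy` and the sample words are assumptions made for illustration.

```python
import math
from collections import Counter


def zero_order_entropy(words):
    """Return the 0-order entropy of words, in bits per word."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical field values: one word occurs twice, two occur once.
sample = ["smith", "smith", "jones", "nguyen"]
print(zero_order_entropy(sample), "bits per word")
```

This value is the lower bound that any context-independent code over whole words can approach on average for that frequency distribution.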

Wordencoding recognizes that word frequencies exhibit great statistical disparity, and it accommodates that disparity with algorithmic diversity – the systematic use of multiple techniques. Further, by decomposing words into fragments and compressing each fragment separately, it recognizes that uncommon words are less compressible than common ones, and that they often consist of two fragments that are more common and therefore more compressible. Doing that deals explicitly with the structure and statistics of the data by recognizing that redundancy in text strings exists in different forms at different levels of granularity.
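The idea of splitting an uncommon word into two more common fragments can be sketched as below. How wordencoding actually chooses fragments is not described here, so this sketch assumes a simple rule: try every two-way split and keep the one whose fragments have the lowest combined ideal code length. The name `best_split`, the fragment counts, and the total are hypothetical.

```python
import math


def best_split(word, fragment_counts, total):
    """Return (cost_bits, prefix, suffix) for the cheapest two-fragment split.

    Cost is the sum of -log2(count / total) over both fragments.
    Returns None if no split uses two known fragments.
    """
    best = None
    for i in range(1, len(word)):
        prefix, suffix = word[:i], word[i:]
        if prefix in fragment_counts and suffix in fragment_counts:
            cost = (-math.log2(fragment_counts[prefix] / total)
                    - math.log2(fragment_counts[suffix] / total))
            if best is None or cost < best[0]:
                best = (cost, prefix, suffix)
    return best


# Hypothetical fragment counts out of 64 observed fragments in total.
counts = {"data": 32, "base": 16, "dat": 1, "abase": 1}
print(best_split("database", counts, total=64))
```

In this made-up example the split into "data" and "base" costs 3 bits in total, while the split into "dat" and "abase" costs 12, so the common fragments win.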

A 21st Century Approach


Beyond the State of the Art

Xtreme Compression's proprietary compression technology delivers performance beyond the reach of conventional methods. Here is why:
  • The exchange principle states that during design, moving function and algorithmic complexity from decompression to compression and from encoding to modeling can simultaneously increase compression efficacy and decrease decompression time. It aligns the otherwise-competing performance goals of compression ratio and decompression speed, eliminating that historical compromise.

  • The sequence-symbol continuum principle states that for every discrete message, a sequence-symbol continuum exists with which the message can be separated into sequences of symbols. At one extreme, each entity is treated as an atomic symbol, so the symbols have no substructure, and the sequence is maximally complex dimensionally and statistically. At the other, a single symbol having maximally complex substructure represents the entire message, so there is no sequence. Lossless compression is generally possible everywhere in the continuum.

  • Some finite sets of transparently decomposable symbols can have multiple generative probabilistic symbol sequence data models that generate identical data but differ in entropy. Recognizing that representational equivalence opens the door to compression below the entropy of the original model.

  • Xtreme Compression's multilevel data modeling exploits our separable redundancy principle, which states that data can simultaneously have redundancy at some levels of abstraction and none at others.

  • From the beginning, database data are treated as database data with regard to table structure, semantics, data type heterogeneity, and prediction directionality.

  • The database designs are unencumbered by any need for symbol correspondence. That makes available a class of techniques that could not otherwise be used, and it allows bidirectional context prediction.

  • Semantic cognizance allows modeling data at higher levels of abstraction than what would otherwise be possible.

  • Xtreme Compression realizes two inviolate truths. The first is that information theory's compressibility limits apply only to the encoding of modeled data, not to the data modeling itself. The second concerns that modeling: no principle, theory, or natural law governs the succinct codification of meaning.

    That is why, for data with sufficient algorithmic complexity, interrelationships, functional dependencies, and statistical correlations, the data model's capabilities are limited solely by the designer's creative ability. It is also why Xtreme Compression's proprietary technology is beyond the state of the art.
