attribute vector coding

Home

About Us

Compare

Case Study

Our Methods

Contact Us

Our Methods


Attribute Vector Coding

Attribute vector coding is the first vector transform method for compressing multidimensional database tables. It is tuple-oriented and semantics-aware. It breaks new ground by capturing and exploiting the innumerable relationships, functional dependencies, and statistical correlations in the data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.

Attribute vector coding recovers an empirical understanding of the processes that created the data through statistical analysis, and then strives to emulate those processes through functions embodied in software and codified in data. To that end, it captures and exploits prior knowledge regardless of whether it can be explicitly identified and defined. That prior knowledge has two parts: intra-tuple correlations, from other fields within the record, and inter-tuple correlations, from fields in other records. From those, structured predictions of field values are computed and expressed, together with differences between them and actual values, through a system of functions and coefficients. Those functions, and the scheme for encoding them into symbols, decorrelate across dimensions and data types simultaneously.

Of course, the advantages of high-level abstraction would all be for nought if attribute vector coding were not cost-effective to use. That is why one other aspect is vital: flexibility. It is what allows attribute vector coding, when used together with wordencoding (described below), to systematically accommodate all table fields regardless of data type, cardinality, skew, sparsity, field order, or field width. That systematization brings to a minimum the number of discrete methods having to be implemented, optimized, and maintained to compress table data, thereby helping to make attribute vector coding, above all, practical.

Wordencoding

Wordencoding is a 0-order (context-independent) variable-to-variable-length algorithm for compressing text strings, hereinafter called words, in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed. It does that by providing an efficient way to maximize effective combined data locality over three areas: the compressed record fields, the lexicons holding the words, and their access data structures.

Wordencoding recognizes that word frequencies will represent great statistical disparity, and it accommodates statistical disparity with algorithmic diversity – the systematic use of multiple techniques. Further, by decomposing words into fragments and compressing each fragment separately, it recognizes that uncommon words are less compressible than common ones, and that they often consist of two more common and therefore more compressible fragments. Doing that deals explicitly with the structure and statistics of the data by recognizing that redundancy in text strings exists at different levels of granularity.

Wordencoding uses two lexicons constructed by sectioning by word length and storing all lexicon words of each length together. The motivation is the observation that the ratio of discrete words to discrete word lengths will always be high. The benefits are twofold: reducing the lengths of the lexicon indexes stored in database table record fields after compression, and allowing access through data structures compact enough to be kept in cache memory. Thus, exploiting word length statistics is significant to the compression and the decompression.

Repopulation

Repopulation is a structural method for compressing the common configuration of access data structures consisting of separate hash and collision tables, and handling collisions – sequences of database record numbers – through external chaining. It populates hash table locations that would otherwise be empty with parts of collision strings that would otherwise occupy memory.

Unlike almost every other compression method, repopulation is not a replacement scheme; it draws on no information-theoretic concepts. Instead, repopulation is mechanistic; it operates like a chess-playing automaton. It is transparent, imposes no new requirements on hash functions, preserves fast real-time random access, and does not degrade collision statistics. Repopulation simultaneously achieves the access speed of a low hash table load factor and the table compactness of a high one, avoiding the historical compromise of speed with size.

Superpopulation

Superpopulation is a variable-to-variable-length algorithm that compresses index tables, lists, arrays, zerotrees, and similar data. It systematically accommodates wide local variations in data statistics. Superpopulation may be used by itself or in conjunction with repopulation.

Superpopulation recognizes that distributions of values in access data structures are often far from random. They can be highly correlated due to the nature of the processes that generated the data, and often have areas of high and low correlation. In a typical index table application, superpopulation decomposes each collision string into a sequence of adjacent substrings, each classified as one of two distinct target types or as an incompressible substring. Each target is then compressed using one of two target type-specific encoding methods.
Superior by Design 


Some reasons for Xtreme Compression's superior performance:
  • From the beginning, database data are treated as database data with regard to dimensional structure, data semantics, heterogeneity of data types, and prediction directionality; that drives all of the engineering design and analysis.

  • The database design is unencumbered by any need for symbol correspondence. That makes available a class of techniques that could not otherwise be used, and it allows predicting bidirectionally.

  • Recognizing semantic content allows understanding and modeling data at higher levels of abstraction than would otherwise be possible, an overwhelming advantage.

  • Unlike our application-specific methods, general-purpose methods invariably compromise performance because of how they are optimized and evaluated.

  • The exchange principle aligns the otherwise-competing performance goals of compression ratio and decompression speed, eliminating historical compromise.

  • Attribute vector coding, the centerpiece of our technology, captures and exploits the innumerable relationships, functional dependencies, and statistical correlations in the data without explicitly identifying and defining them.

  • By design, two of our methods, wordencoding and superpopulation, accommodate statistical disparity in the data through algorithmic diversity.

  • Xtreme Compression realizes two inviolate truths. The first is that the compressibility limits imposed by information theory apply only to the encoding of modeled data, not to the data modeling itself. The second concerns that modeling: no principle, theory, or natural law exists that governs the succinct codification of meaning. That is why, for sufficiently large and algorithmically complex databases, the capabilities of the data model are limited solely by the designer's creative ability. No branch of mathematics, pure or applied, stands in the way.

Copyright 2009. Xtreme Compression, Inc. All Rights Reserved. All product names are trademarks of their respective companies.

Beyond the State of the Art