The technology gap
A technology gap exists between the art of data modeling and the science of information theory. Its cause is the categorical exclusion of the intrinsic properties of data symbols -- substructure -- from information theory’s scope. Because substructure holds latent mutual information, that unfortunate exclusion made the technology gap inevitable from the beginning.
Over the decades, the gap has grown along with the ever-increasing structural and statistical complexity of modern data. It persists because the state of the art in structured data compression has not kept pace with technology.
Obsolescence in information theory
The technology gap is now so wide that parts of information theory are effectively obsolete in a practical sense. Specifically, model entropy calculated the way it has been for decades can now underestimate compressibility. Such situations can arise when designing modern devices, applications, systems, strategies, and databases, and when engineering the complex data on which they operate.
Breaking the link to compressibility
That leaves the designer with the problem of correctly interpreting entropy calculations. The proximal cause of the ambiguity is the presence of latent mutual information carried by symbol substructure. That information breaks the historic link between Shannon entropy and compressibility by invalidating justifications for the assumptions behind the calculations.
Once that happens, those calculations are no longer grounded in the processes that generate the data. They become abstract, and their results underestimate compressibility when misinterpreted.
Exploiting symbol substructure
Once the designer correctly interprets the results of the entropy calculations, however, and recognizes compressibility underestimation, the opportunity to exploit symbol substructure becomes apparent. The benefit, of course, is more compression.
Xtreme Compression introduces two new principles that bridge the technology gap between the art of data modeling and the science of information theory by focusing on the definition and substructure of data symbols. They are discussed in the sidebar.
New principles bridge the gap
Representational equivalence. Some finite sets of transparently decomposable symbols can have multiple models that generate identical data but differ in entropy. Among all representationally equivalent models for the same data, the original may not have the lowest entropy. Only the designer's creative ability limits the number and capabilities of such models.
Separable redundancy. Data having sufficiently complex dimensional and statistical structure can be represented at multiple levels of abstraction that differ in redundancy. Doing so can separate organized redundancy from distributed redundancy.
New methodology makes it happen
Those principles supply design insights that suggest how to best compress discrete structured data:
Recover or synthesize the original data model and compute its entropy.
Define one or more representationally equivalent generative data models that use transparently decomposable symbols.
For each model, devise a set of methods that exploit latent symbol substructure.
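The first two steps above can be illustrated with a toy calculation. The Python below is a minimal sketch under stated assumptions, not Xtreme Compression's actual method. The data set, the field layout (a slowly drifting high nibble and a varying low nibble packed into one 8-bit symbol), and the function names entropy_bits, decompose, and recompose are all hypothetical. Entropy here means zeroth-order empirical entropy, with each stream treated as independent and identically distributed. The sketch compares the entropy of the original symbols treated as atomic with the entropy of a representationally equivalent model that splits each symbol into two fields, then confirms that the second model regenerates identical data.

```python
import math
from collections import Counter


def entropy_bits(seq):
    """Zeroth-order empirical Shannon entropy of seq, in bits per element."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical data: 256 eight-bit symbols. The high nibble steps up once
# every 16 symbols; the low nibble cycles through a fixed permutation.
symbols = [((i // 16) << 4) | ((7 * i) % 16) for i in range(256)]


def decompose(symbols):
    """Split each symbol into (change in high nibble, low nibble)."""
    deltas, lows, prev = [], [], 0
    for s in symbols:
        hi, lo = s >> 4, s & 0x0F
        deltas.append((hi - prev) % 16)  # substructure: hi rarely changes
        lows.append(lo)
        prev = hi
    return deltas, lows


def recompose(deltas, lows):
    """Inverse of decompose: rebuild the original symbols exactly."""
    out, prev = [], 0
    for d, lo in zip(deltas, lows):
        prev = (prev + d) % 16
        out.append((prev << 4) | lo)
    return out


deltas, lows = decompose(symbols)

# Step 1: entropy of the original model (symbols treated as atomic).
h_atomic = entropy_bits(symbols)

# Step 2: entropy of an equivalent model built from decomposed symbols,
# assuming the two field streams are coded independently.
h_decomposed = entropy_bits(deltas) + entropy_bits(lows)

# Both models generate identical data.
assert recompose(deltas, lows) == symbols

print(f"atomic model:     {h_atomic:.3f} bits/symbol")
print(f"decomposed model: {h_decomposed:.3f} bits/symbol")
```

In this toy case every atomic symbol occurs exactly once, so the atomic calculation reports the full 8 bits per symbol and suggests the data is incompressible. The decomposed model reports a lower figure because the high nibble's slow drift becomes visible once the symbol is split. The third step, devising methods that exploit that substructure, is outside the scope of this sketch.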