In natural language processing, semantic compression is a process of compacting a lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity, while maintaining text semantics. As a result, the same ideas can be represented using a smaller set of words.
Semantic compression is a lossy compression, that is, some data is being discarded, and an original document cannot be reconstructed in a reverse process.
Step 1 requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in word hierarchy, a cumulative concept frequency is calculating by adding a sum of hyponyms' frequencies to frequency of their hypernym: where is a hypernym of . Then, a desired number of words with top cumulated frequencies are chosen to build a targed lexicon.
In the second step, compression mapping rules are defined for the remaining words, in order to handle every occurrence of a less frequent hyponym as its hypernym in output text.
The below fragment of text has been processed by the semantic compression. Words in bold have been replaced by their hypernyms.
They are both nest building social insects, but paper wasps and honey bees organize their colonies
in very different ways. In a new study, researchers report that despite their differences, these insects rely on the same network of genes to guide their social behavior.The study appears in the Proceedings of the Royal Society B: Biological Sciences. Honey bees and paper wasps are separated by more than 100 million years ofevolution, and there are striking differences in how they divvy up the work of maintaining a colony.
The procedure outputs the following text:
They are both facility building insect, but insects and honey insects arrange their biological groups
in very different structure. In a new study, researchers report that despite their difference of opinions, these insects act the same network of genes to steer their party demeanor. The study appears in the proceeding of the institution bacteria Biological Sciences. Honey insects and insect are separated by more than hundred million years oforganic processes, and there are impinging differences of opinions in how they divvy up the work of affirming a biological group.
A natural tendency to keep natural language expressions concise can be perceived as a form of implicit semantic compression, by omitting unmeaningful words or redundant meaningful words (especially to avoid pleonasms).
Semantic compression is advantageous in information retrieval tasks, improving their effectiveness (in terms of both precision and recall). This is due to more precise descriptors (reduced effect of language diversity – limited language redundancy, a step towards a controlled dictionary).
As in the example above, it is possible to display the output as natural text (re-applying inflexion, adding stop words).
None of the audio/visual content is hosted on this site. All media is embedded from other sites such as GoogleVideo, Wikipedia, YouTube etc. Therefore, this site has no control over the copyright issues of the streaming media.
All issues concerning copyright violations should be aimed at the sites hosting the material. This site does not host any of the streaming media and the owner has not uploaded any of the material to the video hosting servers. Anyone can find the same content on Google Video or YouTube by themselves.
The owner of this site cannot know which documentaries are in public domain, which has been uploaded to e.g. YouTube by the owner and which has been uploaded without permission. The copyright owner must contact the source if he wants his material off the Internet completely.