Elexis API documentation
Version 2.1.6 as of December 11 2011

com.healthmarketscience.jackcess.scsu
Class SCSU

java.lang.Object
  extended by com.healthmarketscience.jackcess.scsu.SCSU
Direct Known Subclasses:
Compress, Expand

public abstract class SCSU
extends java.lang.Object

Text encoded in Unicode often requires more storage than the same text encoded in an existing 8-bit character set that is limited to the subset of characters actually found in the text. The Unicode Compression Algorithm reduces the necessary storage while retaining the universality of Unicode. A full description of the algorithm can be found at http://www.unicode.org/unicode/reports/tr6.html

Summary

The goal of the Unicode Compression Algorithm is the ability to:

  - Express all code points in Unicode
  - Approximate storage size for traditional character sets
  - Work well for short strings
  - Provide transparency for Latin-1 data
  - Support very simple decoders
  - Support simple as well as sophisticated encoders

If needed, further compression can be achieved by layering standard file or disk-block based compression algorithms on top.

Features

Text in languages that use small alphabets contains runs of characters that are coded close together in Unicode. These runs are interrupted only by punctuation characters, which are themselves coded in proximity to each other in Unicode (usually in the ASCII range). Two basic mechanisms in the compression algorithm account for these two cases: sliding windows and static windows. A window is an area of 128 consecutive characters in Unicode. In the compressed data stream, each character from a sliding window is represented as a byte between 0x80 and 0xFF, while a byte from 0x20 to 0x7F (as well as CR, LF, and TAB) always means an ASCII character (or control).
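
The byte mapping described above can be sketched as follows. This is a minimal illustration, not the code of this class: the class name WindowSketch, both method names, and the window start 0x0400 (the beginning of the Cyrillic block) are assumptions made for the example, and the tag bytes that a real encoder uses to select or move windows are left out.

```java
// Hypothetical sketch of the single-byte mapping with one active sliding window.
// Tag bytes for defining or switching windows are deliberately omitted.
final class WindowSketch {
    // A window covers 128 consecutive Unicode characters.
    static final int WINDOW_SIZE = 128;

    /** Returns the byte value (0-255) for ch, or -1 if ch needs another mechanism. */
    static int encodeByte(char ch, int windowStart) {
        // TAB, LF, CR and 0x20-0x7F always stand for themselves.
        if (ch == '\t' || ch == '\n' || ch == '\r' || (ch >= 0x20 && ch <= 0x7F)) {
            return ch;
        }
        // Characters inside the active window map to 0x80-0xFF.
        int offset = ch - windowStart;
        if (offset >= 0 && offset < WINDOW_SIZE) {
            return 0x80 + offset;
        }
        return -1; // outside both ranges in this simplified sketch
    }

    /** Inverse mapping for a byte value produced by encodeByte. */
    static int decodeByte(int b, int windowStart) {
        return b < 0x80 ? b : windowStart + (b - 0x80);
    }

    public static void main(String[] args) {
        int cyrillicWindow = 0x0400; // assumed window position for this example
        // U+0416 lies 0x16 past the window start, so it maps to 0x80 + 0x16.
        System.out.println(Integer.toHexString(encodeByte('\u0416', cyrillicWindow))); // prints 96
    }
}
```

With the window positioned over the alphabet in use, a run of Cyrillic letters mixed with ASCII punctuation costs one byte per character under this mapping.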

Notes on the Java implementation

A limitation of Java is that its byte data type is always signed. The following workarounds are required:

  - Copying a byte to an integer variable and adding 256 for 'negative' bytes gives an integer in the range 0-255.
  - Values of char are between 0x0000 and 0xFFFF in Java. Arithmetic on char values is unsigned.
  - Extended characters require an int to store them. The sign is not an issue because only 1024*1024 + 65536 code points exist, which is far below the largest positive int.
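
A short sketch of these workarounds, under stated assumptions: the class name UnsignedSketch and both helper names are invented for illustration and are not part of this class. The surrogate-pair arithmetic is the standard UTF-16 formula, shown here only to demonstrate that an extended character fits in a positive int.

```java
// Hypothetical helpers illustrating the signed-byte and extended-character notes.
final class UnsignedSketch {
    /** Widens a signed byte to 0-255 by adding 256 to 'negative' values. */
    static int toUnsigned(byte b) {
        int i = b;                    // sign-extends, so 0x96 becomes -106
        return i < 0 ? i + 256 : i;   // same result as (b & 0xFF)
    }

    /** Combines a UTF-16 surrogate pair into one extended character held in an int. */
    static int toCodePoint(char high, char low) {
        // char arithmetic promotes to int, so the subtraction cannot go negative here.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) 0x96));                              // prints 150
        System.out.println(Integer.toHexString(toCodePoint('\uD83D', '\uDE00'))); // prints 1f600
    }
}
```

The standard library offers the same conversions as Byte.toUnsignedInt (Java 8 and later) and Character.toCodePoint; the explicit arithmetic is kept here because it mirrors the notes above.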


Constructor Summary
SCSU()
           
 
Method Summary
static boolean isCompressible(char ch)
          Returns whether a character is compressible.
 void reset()
          reset is only needed to bail out after an exception and restart with new input
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SCSU

public SCSU()


Method Detail

isCompressible

public static boolean isCompressible(char ch)
Returns whether a character is compressible.


reset

public void reset()
reset is only needed to bail out after an exception and restart with new input


Copyright 2005-2011 by Gerry Weirich, Elexis