jgz - Project Goals


Why?

The main, and almost the only, reason why I implemented jgz is that I wanted to understand the DEFLATE format thoroughly, and implementing it seemed to be the best way to achieve that goal. I eventually decided to publish the source code, because it could help other people learn how DEFLATE works, and also because nothing focuses a programmer more than the prospect of having his source code and comments inspected by other people.

Apart from that, jgz is pretty useless, because the Java standard library already includes DEFLATE support, along with support for the zlib and gzip file formats: the relevant classes are in the java.util.zip package (internally, this support uses zlib). There are a few marginal reasons why you might want to use jgz:

All these reasons are pretty weak. Note that there is a package called jzlib, which is a rather direct translation of the zlib code into Java. That package fulfills the goals stated above; therefore, if you really want to use jgz, you should also consider using jzlib instead. I have not yet benchmarked jgz against jzlib.
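For reference, here is a minimal sketch of the java.util.zip support mentioned above: a round trip through the zlib format using the standard Deflater and Inflater classes. The class name ZlibRoundTrip and its two helper methods are made up for this illustration; they are not part of jgz or jzlib.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, for illustration only.
final class ZlibRoundTrip {

    /** Compresses the input into the zlib format at the default level. */
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(); // default level, zlib header and trailer
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release the native zlib state
        }
    }

    /** Decompresses a complete zlib-format stream. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or dictionary-based stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt zlib stream", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] original = "abcabcabcabcabcabcabcabcabcabc".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        byte[] unpacked = decompress(packed);
        System.out.println(original.length + " -> " + packed.length
                + " bytes, round trip ok: " + Arrays.equals(original, unpacked));
    }
}
```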

Performance

"Performance" is a short word for a big subject. Many issues relate to performance, including processing speed, code size and memory usage. I decided to concentrate on three criteria:

Additional minor criteria are the code footprint and the RAM usage. The latter is related to processing speed: lower RAM usage implies better interaction with the cache memory, hence improved processing speed.

The reference code is, of course, zlib. The Java virtual machine is Sun's JDK-1.6.0. That machine already includes a version of zlib (1.1.3), which is a bit old but fine for interoperability checks; for performance measurements, I use the latest zlib version (1.2.3). The test machines are Linux systems, using AMD x86-compatible processors, both in 32-bit and 64-bit modes.

I do not claim that jgz is a serious competitor to zlib. The idea is that I use the achieved performance to know whether I got the algorithms right. There are various reasons why pure Java code will be slower than C code when it comes to CPU-intensive work; for instance, the Java just-in-time compiler must work fast and is constrained both by the strict Java semantics and by the need for proper interaction with the garbage collector. The two main causes for Java slowdown are the following:

The first cause is not relevant for long-lived processes, e.g. server code or most graphical applications. Java code needs a bit of "warm-up". This must be taken into account when measuring processing speed.

The second cause cannot be avoided. For any array-intensive work, Java code is typically observed to run 2 to 4 times slower than equivalent C code; this is easily demonstrated with cryptographic code such as symmetric encryption or hashing (I implement such algorithms for a living). DEFLATE is array-intensive, and such a slowdown is expected.
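The warm-up effect can be accounted for with a measurement loop along the following lines. This is only a sketch, not the harness used for the figures on this page: the class WarmUpTiming and its methods are hypothetical, and the workload is a pure-Java Adler-32 (the checksum used by the zlib format), chosen here merely because it is a small array-intensive loop.

```java
import java.util.zip.Adler32;

// Hypothetical timing helper, for illustration only.
final class WarmUpTiming {

    // Written once per measurement so the JIT cannot discard the workload.
    static volatile long sink;

    /** Pure-Java Adler-32, used here as an array-intensive workload. */
    static long adler32(byte[] data) {
        final int mod = 65521;
        int a = 1;
        int b = 0;
        for (byte x : data) {
            a = (a + (x & 0xFF)) % mod;
            b = (b + a) % mod;
        }
        return ((long) b << 16) | a;
    }

    /** Runs untimed warm-up rounds, then returns the average time per timed round. */
    static long averageNanos(byte[] data, int warmUpRounds, int timedRounds) {
        if (timedRounds <= 0) {
            throw new IllegalArgumentException("timedRounds must be positive");
        }
        long acc = 0;
        for (int i = 0; i < warmUpRounds; i++) {
            acc += adler32(data); // gives the JIT a chance to compile the loop
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRounds; i++) {
            acc += adler32(data);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return elapsed / timedRounds;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 18];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        // Sanity check against the library implementation.
        Adler32 reference = new Adler32();
        reference.update(data);
        System.out.println("matches java.util.zip.Adler32: "
                + (reference.getValue() == adler32(data)));
        System.out.println("cold: " + averageNanos(data, 0, 1) + " ns per round");
        System.out.println("warm: " + averageNanos(data, 200, 50) + " ns per round");
    }
}
```

The actual timings depend on the virtual machine and the hardware, so no particular output is claimed here; the point is only that the first rounds are excluded from the measurement.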

The compression ratio is not relevant to the inflater code. At the deflater level, there is a trade-off between processing speed and compression ratio. Jgz does not compress data exactly like zlib, because its internals are quite different, and the various heuristics alluded to in RFC 1951 do not apply in exactly the same way. I decided to target the default compression ratio of zlib (that's level 6 compression, where the level ranges from 0 to 9).
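To get a feel for the level trade-off on the zlib side, the following sketch compresses one sample at every level from 0 to 9 using java.util.zip.Deflater, and then at the library's default setting, which in zlib corresponds to level 6. The class LevelSweep and the sample data are invented for this illustration; jgz's own levels are not shown, since its internals differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class, for illustration only.
final class LevelSweep {

    /** Returns the zlib-format output size for the given compression level. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Builds a repetitive sample so that the match search has something to find. */
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("line ").append(i % 37).append(" of some repetitive text\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = sample();
        System.out.println("input: " + input.length + " bytes");
        for (int level = Deflater.NO_COMPRESSION; level <= Deflater.BEST_COMPRESSION; level++) {
            System.out.println("level " + level + ": " + compressedSize(input, level) + " bytes");
        }
        System.out.println("default: "
                + compressedSize(input, Deflater.DEFAULT_COMPRESSION) + " bytes");
    }
}
```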

To sum up, jgz will be deemed successful if:

Results

Compression ratio was measured by compressing all files in the /usr directory of a Linux system (64-bit Ubuntu distribution). Each file was compressed independently, in zlib format (which has much lower overhead than gzip), with both zlib-1.2.3 (with default compression level) and jgz-0.2 (with default compression level). There were 124171 files, for a total size of 2460581541 bytes.
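The framing overhead matters here because it is paid once per file. The zlib format (RFC 1950) adds a 2-byte header and a 4-byte Adler-32 trailer, whereas gzip (RFC 1952) adds at least a 10-byte header and an 8-byte trailer. The sketch below measures the difference with the standard java.util.zip stream classes; the class FramingOverhead is made up for this illustration. On a stock JDK I would expect the difference to be those 12 bytes of extra framing, though this relies on both stream classes using the same DEFLATE settings.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class FramingOverhead {

    /** Size of the input once wrapped in the zlib format (default level). */
    static int zlibSize(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Size of the input once wrapped in the gzip format (default level). */
    static int gzipSize(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        byte[] input = "a small file, as found by the thousand under /usr\n"
                .getBytes(StandardCharsets.UTF_8);
        int zlib = zlibSize(input);
        int gzip = gzipSize(input);
        System.out.println("zlib: " + zlib + " bytes, gzip: " + gzip
                + " bytes, extra framing: " + (gzip - zlib) + " bytes");
    }
}
```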

The 124171 compressed files totaled 996284045 bytes for zlib-1.2.3, vs 994591245 bytes for jgz (hence, on average, jgz compresses very slightly better than the default compression level of zlib, but the difference is close to negligible). I also checked all files for which the output size by jgz differed by more than 2% from the output size by zlib. There were 2482 such files: 769 for which the jgz output size was more than 102% of the zlib output size, and 1713 for which the jgz output was less than 98% of the zlib output size.

The worst case for jgz is a 159-byte file, compressed to 70 bytes by zlib and 74 bytes by jgz; more significant is a 101327-byte SVG file which zlib compressed to 17082 bytes but for which jgz achieved only 17901 bytes (that's about 104.8% of the output size by zlib). In the other direction, there is a highly redundant 618836-byte file (transliteration information for Chinese characters) which zlib compresses to 95038 bytes whereas jgz achieves 68777 bytes. All these are exceptional cases; for about 98% of the files, jgz and zlib provide similar compression ratios. To sum up, jgz, with its default compression level, achieves its stated goal.
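The figures above can be rechecked with a few lines of arithmetic. The sketch below uses only the numbers quoted on this page; the class RatioCheck is a hypothetical name. By my calculation, both outputs come to a little over 40% of the input (roughly 40.5% for zlib and 40.4% for jgz), the total gap is about 0.17% of the zlib output, and the 2482 outlier files are about 2% of the 124171 files.

```java
// Hypothetical helper class, for illustration only; all figures are quoted from the text.
final class RatioCheck {

    static final long FILES = 124171L;
    static final long ORIGINAL_BYTES = 2460581541L;
    static final long ZLIB_BYTES = 996284045L;
    static final long JGZ_BYTES = 994591245L;
    static final long JGZ_LARGER = 769L;   // jgz output above 102% of zlib output
    static final long JGZ_SMALLER = 1713L; // jgz output below 98% of zlib output

    /** Returns part / whole as a percentage. */
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        System.out.printf("zlib output: %.2f%% of the input%n",
                percent(ZLIB_BYTES, ORIGINAL_BYTES));
        System.out.printf("jgz output:  %.2f%% of the input%n",
                percent(JGZ_BYTES, ORIGINAL_BYTES));
        System.out.printf("jgz saves:   %.2f%% of the zlib output%n",
                percent(ZLIB_BYTES - JGZ_BYTES, ZLIB_BYTES));
        System.out.printf("outliers:    %d files, %.2f%% of all files%n",
                JGZ_LARGER + JGZ_SMALLER, percent(JGZ_LARGER + JGZ_SMALLER, FILES));
        // The SVG file: 17901 bytes with jgz against 17082 bytes with zlib.
        System.out.printf("SVG file:    jgz is %.1f%% of zlib%n", percent(17901L, 17082L));
    }
}
```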

Note that the COMPACT compression level for jgz is not good: it regularly yields worse compression than the default level, and it is quite slow. I still have work to do there.

As for speed, the slowdown relative to zlib-1.2.3 is a factor of about 2.2 for deflation, and 2.7 for inflation (this is for a stock JDK-1.6.0, and a stock zlib-1.2.3 as provided with the operating system, which is an Ubuntu Linux running on x86 32-bit hardware). Please note that:

The main conclusion is that jgz lives up to its promises for the default compression level. The other compression levels still need some tweaking, especially the COMPACT level.

Last modified: Saturday, August 18, 2007 10:10:05 AM CEST