jgz - Project Goals
Why?
The main, and almost only, reason why I implemented jgz is that I wanted to thoroughly understand the DEFLATE format, and implementing it seemed to be the best way to achieve that goal. I ultimately decided to publish the source code because it could help other people learn how DEFLATE works, and also because nothing focuses a programmer more than the prospect of having his source code and comments inspected by other people.
Apart from that, jgz is pretty useless, because the Java standard library already includes DEFLATE support, along with support for the zlib and gzip file formats: the relevant classes are in the java.util.zip package (internally, this support uses zlib). There are a few marginal reasons why you might want to use jgz:
- you want DEFLATE support on a platform which lacks the standard Java support (e.g. J2ME with some profiles);
- you need some support of the special zlib flush modes, which the Java API does not provide;
- you are implementing your own JVM, with a proper standard library, and for some reason you cannot or will not use zlib for java.util.zip.
All these reasons are pretty weak. Note that there exists a package called jzlib, which is a rather direct translation of the zlib code into Java. That package fulfills the goals stated above; therefore, if you really want to use jgz, you should also consider using jzlib instead. I have not yet benchmarked jgz against jzlib.
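For reference, here is a minimal sketch of the standard java.util.zip support mentioned above: a round trip through Deflater and Inflater, which produce and consume the zlib format by default. The class and method names (ZlibRoundTrip, compress, decompress) are made up for this illustration; only the java.util.zip calls are part of the standard library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, written for this illustration only.
final class ZlibRoundTrip {

    // Compresses the input into the zlib format at the default compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // releases the native zlib state
        }
    }

    // Decompresses a complete zlib stream; rejects truncated or malformed input.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated stream or preset dictionary required");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] original = "hello hello hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        byte[] unpacked = decompress(packed);
        System.out.println(original.length + " -> " + packed.length
                + " bytes, round trip ok: " + Arrays.equals(original, unpacked));
    }
}
```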
Performance
"Performance" is a short word for a big subject. There are many issues which relate to performance, including processing speed, code size, memory usage... I decided to concentrate on three criteria:
- the processing speed when inflating compressed data;
- the processing speed when deflating uncompressed data;
- the compression ratio.
Additional minor criteria are the code footprint and the RAM usage. The latter is related to processing speed: a smaller RAM usage implies better interactions with the cache memory, hence improved processing speed.
The reference code is, of course, zlib. The Java virtual machine is Sun's JDK-1.6.0. That machine already includes a version of zlib (1.1.3), which is a bit old but fine for interoperability checks; for performance measurements, I use the latest zlib version (1.2.3). The test machines are Linux systems, using AMD x86-compatible processors, both in 32-bit and 64-bit modes.
I do not claim that jgz is a serious competitor to zlib. The idea is that I use the achieved performance to check whether I got the algorithms right. There are various reasons why pure Java code will be slower than C code when it comes to CPU-intensive work; for instance, the Java just-in-time compiler must work fast and is constrained by both the strict Java semantics and the need for proper interaction with the garbage collector. The two main causes for Java slowdown are the following:
- The Java virtual machine has much work to do at startup time: loading and initialization of Java classes, JIT compilation... All this makes Java ill-suited to the simple command-line utilities which are so dear to Unix users.
- All array accesses within Java programs are subject to an array bounds check, which traps buffer overflows and underflows. That check uses the array length, which implies an extra memory access.
The first cause is not relevant for long-lived processes, e.g. server code or most graphical applications. It does mean, however, that Java code needs a bit of "warm-up", and this must be taken into account when measuring processing speed.
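One way to account for that warm-up is to run the code a number of times without timing it, and only then start the clock. The sketch below illustrates the idea; it is not the harness actually used for jgz. It uses java.util.zip.Deflater as a stand-in for the code under test, keeps the data in memory so that I/O does not pollute the timing, and the class name, method names and round counts are arbitrary assumptions.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical timing helper, written for this illustration only.
final class WarmupTiming {

    // Compresses the input once and returns the compressed size in bytes.
    static int compressOnce(byte[] input, byte[] scratch) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(scratch); // output is discarded, only counted
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    // Runs untimed warm-up rounds first, then returns the average
    // time per timed round, in nanoseconds.
    static long averageNanos(byte[] input, int warmupRounds, int timedRounds) {
        byte[] scratch = new byte[64 * 1024];
        long sink = 0; // keeps the results "used" so the work cannot be optimized away
        for (int i = 0; i < warmupRounds; i++) {
            sink += compressOnce(input, scratch);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRounds; i++) {
            sink += compressOnce(input, scratch);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) {
            throw new IllegalStateException("no output produced");
        }
        return elapsed / timedRounds;
    }

    public static void main(String[] args) {
        // Mildly compressible test data with a fixed seed, held entirely in memory.
        byte[] input = new byte[1 << 20];
        Random random = new Random(42);
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) ('a' + random.nextInt(8));
        }
        long nanos = averageNanos(input, 20, 20);
        System.out.println("average per round: " + (nanos / 1000) + " microseconds");
    }
}
```

Since java.util.zip delegates to native zlib, the warm-up effect is much smaller here than it would be for a pure Java implementation; the structure of the measurement is what matters.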
The second cause cannot be avoided. For any array-intensive work, Java code typically runs 2 to 4 times slower than equivalent C code; this is easily demonstrated with cryptographic code such as symmetric encryption or hashing (I implement such algorithms for a living). DEFLATE is array-intensive, and such a slowdown is expected.
The compression ratio is not relevant to the inflater code. At the deflater level, there is a trade-off between processing speed and compression ratio. Jgz does not compress data exactly like zlib, because its internals are quite different, and the various heuristics alluded to in RFC 1951 do not apply in exactly the same way. I decided to target the default compression ratio of zlib (that's level 6 compression, where the level ranges from 0 to 9).
To sum up, jgz will be deemed successful if:
- the inflater code decompresses a file at most three times slower than zlib;
- the deflater code compresses a file at most three times slower than zlib (with its default compression level), while keeping the result size within ±2% of the output size obtained by zlib.
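The two criteria above can be written down as simple checks. The sketch below is only an illustration: the class and method names are made up, "at most three times slower" is read as "takes at most three times as long", and the figures in main are arbitrary examples, not measurements. The size check uses integer arithmetic so that a result at exactly 102% or 98% is not rejected because of floating-point rounding.

```java
// Hypothetical helper class, written for this illustration only.
final class SuccessCriteria {

    // Assumed reading of "at most three times slower than zlib".
    static final double MAX_SLOWDOWN = 3.0;

    // True when the candidate takes at most MAX_SLOWDOWN times as long as the reference.
    static boolean speedAcceptable(double candidateSeconds, double referenceSeconds) {
        return candidateSeconds <= MAX_SLOWDOWN * referenceSeconds;
    }

    // True when the candidate output size is within 2% of the reference size.
    // |candidate - reference| <= reference / 50, rewritten without division.
    static boolean sizeWithinTolerance(long candidateBytes, long referenceBytes) {
        return Math.abs(candidateBytes - referenceBytes) * 50 <= referenceBytes;
    }

    public static void main(String[] args) {
        // Arbitrary example figures, not measurements.
        System.out.println("size ok:  " + sizeWithinTolerance(1015, 1000));
        System.out.println("speed ok: " + speedAcceptable(2.5, 1.0));
    }
}
```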
Results
Compression ratio was measured by compressing all files in the /usr directory of a Linux system (64-bit Ubuntu distribution). Each file was compressed independently, in zlib format (which has much lower overhead than gzip), with both zlib-1.2.3 (with default compression level) and jgz-0.2 (with default compression level). There were 124171 files, for a total size of 2460581541 bytes.
The 124171 compressed files totaled 996284045 bytes for zlib-1.2.3, vs 994591245 bytes for jgz (hence, on average, jgz compresses very slightly better than the default compression level of zlib, but the difference is close to negligible). I also checked all files for which the output size by jgz differed by more than 2% from the output size by zlib. There were 2482 such files: 769 for which the jgz output size was more than 102% of the zlib output size, and 1713 for which the jgz output was less than 98% of the zlib output size.
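A measurement of this kind can be sketched as follows. This is not the script that produced the figures here: it uses java.util.zip.Deflater (that is, the zlib bundled with the JDK) as the only compressor, the class and method names are made up, and it reads each file fully into memory, which is fine for a sketch but not for very large files. Running the same walk with a second compressor and comparing sizes file by file gives the per-file statistics.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.Deflater;

// Hypothetical survey tool, written for this illustration only.
final class TreeCompressionSurvey {

    // Size in bytes of the zlib-format output for one input, at the default level.
    static long compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] scratch = new byte[64 * 1024];
            long total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(scratch); // output is discarded, only counted
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    // Compresses every regular file under root independently.
    // Returns { file count, total input bytes, total compressed bytes }.
    // Note: Files.walk may still fail on directories that cannot be listed.
    static long[] survey(Path root) {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk
                    .filter(p -> Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS))
                    .filter(Files::isReadable)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long count = 0;
        long inputBytes = 0;
        long outputBytes = 0;
        for (Path file : files) {
            byte[] data;
            try {
                data = Files.readAllBytes(file);
            } catch (IOException e) {
                continue; // skip files that vanish or cannot be read
            }
            count++;
            inputBytes += data.length;
            outputBytes += compressedSize(data);
        }
        return new long[] { count, inputBytes, outputBytes };
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long[] result = survey(root);
        System.out.println(result[0] + " files, " + result[1] + " bytes in, "
                + result[2] + " bytes out");
    }
}
```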
The worst case for jgz is a 159-byte file, compressed to 70 bytes by zlib and 74 bytes by jgz; more significant is a 101327-byte SVG file which zlib compressed to 17082 bytes but for which jgz achieved only 17901 bytes (that's about 104.8% of the output size by zlib). In the other direction, there is a highly redundant 618836-byte file (transliteration information for Chinese characters) which zlib compresses to 95038 bytes whereas jgz achieves 68777 bytes. All these are exceptional cases; for about 98% of the files, jgz and zlib provide similar compression ratios. To sum up, jgz, with its default compression level, achieves its stated goal.
Note that the COMPACT compression level for jgz is not good: it regularly yields worse compression than the default level, and it is quite slow. I still have work to do there.
As for speed, jgz is about 2.2 times slower than zlib-1.2.3 for deflation, and about 2.7 times slower for inflation (this is for a stock JDK-1.6.0, and a stock zlib-1.2.3 as provided with the operating system, which is an Ubuntu Linux running on x86 32-bit hardware). Please note that:
- Speed is a bit tricky to measure, due to the "warm-up" for Java, and the I/O overhead.
- Speed ratio varies with the file type.
- Zlib-1.2.3 is noticeably faster than zlib-1.1.3, especially the inflater; jgz achieves a speed ratio of about 2 when compared with the DEFLATE code from the JDK.
The main conclusion is that jgz lives up to its promises for the default compression level. The other compression levels still need some tweaking, especially the COMPACT level.