The Compressed RTF Format

The full specification and algorithm used by the compressed RTF format is undocumented as far as I know. The information in this page is just the result of piecing together bits of information and staring at a couple of hex dumps for a few hours...but it seems to work :-)

An implementation of a decompressing routine can be found in the JTNEF Java package CompressedRTFInputStream class.

What is the Compressed RTF Format?

It is the format used by MAPI to provide the rich text representation of an email body. It can be found in the PR_RTF_COMPRESSED property of a message, as opposed to the PR_BODY property which contains the text-only version of the message. This property is also used in TNEF streams which encapsulate the MAPI message content.

The Format

The PR_RTF_COMPRESSED data is organized according to the definitions in RTFLIB.H:


// header for compressed stream
typedef struct _lzfuhdr
{
 ULONG cbSize;  // total number of bytes following this field
 ULONG cbRawSize; // size before compression
 DWORD dwMagic;  // identifies this as a compressed stream
 DWORD dwCRC;  // CRC-32 of the compressed data for error checking
} LZFUHDR;

// magic number that identifies the stream as a compressed stream
#define dwMagicCompressedRTF 0x75465a4c

// magic number that identifies the stream as a uncompressed stream
#define dwMagicUncompressedRTF 0x414c454d

The example TNEF files I looked at contain this header with the dwMagicCompressedRTF magic number, and the header is directly followed by the compressed data. A dwMagicUncompressedRTF magic number is assumed to mean the rest of the data is the uncompressed data as-is, though I've never seen this used in practice.

From the above header definition and the magic number itself, which is "LZFU", we could guess the algorithm used is some variation of LZ (Lempel-Ziv) compression.

Reading up on LZ/LZ77/LZSS and staring at hex dumps for a few hours, we get the following: The compressed data (after the above header) consists of 8-unit chunks. Each chunk begins with a single flag byte, where each bit is a flag for the corresponding unit in the chunk - 0 means it's a 1-byte literal that should be copied as-is, and 1 means it's a 2 byte reference. The bits within the flag byte are read in LSB order (i.e., the lowest bit flags the first unit, etc.).

A reference unit consists of 2 bytes, containing the offset and length of the reference to be copied. When viewed in the order the 2 bytes appear, the first 3 nibbles represent an offset, and the last nibble is the length. In other words, the first byte shifted left 4 bits plus the second byte shifted right 4 bits give the offset (4K possible values, which should be the LZ buffer size), and the low 4 bits of the second byte is the length. Since references of length 1 or 2 are useless - they can be compressed as literals without wasting any space (a reference is 2 bytes anyway), 2 is added to the length field to provide a bigger maximum length (i.e. the length field represents a length between 2 and 17).

As for the LZ buffer itself, it's a 4096 byte buffer that wraps around as it is filled with the decompressed data - when it's filled up all the way it starts filling up again from the beginning. The reference offsets are offsets into this buffer. For efficiency, such a buffer is not required explicitly, but can be simulated by calculating the offset as an offset straight into the decompressed data. A reference pointing to itself (the current position) marks the end of the data.

The LZ buffer is preloaded with a common RTF header string so that references kick in right away to save space. The prebuffered string was found in RTFLIB32.LIB. In C/Java syntax it looks like this (linebreaks are added here for clarity only - the actual string is a single line with a single space between tokens):

{\\rtf1\\ansi\\mac\\deff0\\deftab720{\\fonttbl;}{\\f0\\fnil \\froman
 \\fswiss \\fmodern \\fscript \\fdecor MS Sans SerifSymbolArialTimes
 New RomanCourier{\\colortbl\\red0\\green0\\blue0\n\r\\par
 \\pard\\plain\\f0\\fs20\\b\\i\\u\\tab\\tx

The decompressed data is copied into the buffer right after this string. Note that this string is not part of the actual output - the output starts right after it.

The CRC32 is similar to the standard CRC32 algorithm which is demonstrated in RFC 1952 (and used in ZIP, Ethernet, and many other applications), with a slight modification: the inversion (or xor with 0xffffffffL) before and after the CRC update is ommited. This inversion is applied in order to avoid a CRC weakness, which is that any number of leading or trailing zero bytes can be added or removed without the CRC detecting the change. For some reason, the compressed RTF CRC32 implementation is the weaker one, without this inversion. It is calculated on the compressed data bytes (excluding the headers, but including padding), and should be ignored when the data is uncompressed.

For a reference implementation of a decompressing routine see the CompressedRTFInputStream class in the JTNEF Java package.