Microsoft's HTML Help (.chm) format
Preface
This is documentation on the .chm format used by Microsoft HTML
Help. This format has been reverse engineered in the past, but as far
as I know this is the first freely available documentation on it. One
Usenet message indicates that these .chm files are actually IStorage
files documented in the Microsoft Platform SDK. However, I have not
been able to locate such documentation.
Note
The word "section" is badly overloaded in this document. Sorry about that.
All numbers are in hexadecimal unless otherwise indicated in the
text. Except in tabular listings, this will be indicated by $ or 0x as
appropriate. All values within the file are Intel byte order (little
endian) unless indicated otherwise.
The overall format of a .chm file
The .chm file begins with a short ($38 byte) initial header. This is
followed by the header section table and the offset to the content.
Collectively, this is the "header".
The header is followed by the header sections. There are two header
sections. One header section is the file directory, the other contains
the file length and some unknown data. Immediately following the header
sections is the content.
The header starts with the initial header, which has the following format:
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and
            following data.
000C: DWORD 1 (unknown)
0010: DWORD A timestamp.
            Considered as a big-endian DWORD, it appears to contain
            seconds (MSB) and fractional seconds (second byte).
            The third and fourth bytes may contain even more fractional
            bits. The 4 least significant bits in the last byte are
            constant.
0014: DWORD Windows Language ID. The two I've seen:
            $0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US
            $0407 = LANG_GERMAN/SUBLANG_GERMAN
0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC}
0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC}
Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
The initial header is followed by the header section table, which has 2 entries; each entry is $10 bytes long and has this format:
0000: QWORD Offset of section from beginning of file
0008: QWORD Length of section
Following the header section table is 8 bytes of additional header
data. In Version 2 files, this data is not there and the content
section starts immediately after the directory.
0000: QWORD Offset within file of content section 0
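The layout above maps directly onto fixed-size little-endian reads. Here is a minimal sketch in Python; the function name parse_itsf_header and the returned dictionary keys are my own, and the check "version >= 3" is an assumption (the text only says that Version 2 files lack the final QWORD).

```python
import struct
import uuid


def parse_itsf_header(data):
    """Parse the initial header, the header section table, and the content offset."""
    sig, version, header_len, _unknown, timestamp, lang_id = struct.unpack_from(
        "<4s5I", data, 0)
    if sig != b"ITSF":
        raise ValueError("missing 'ITSF' signature")
    # A GUID is 1 DWORD, 2 WORDs, and 8 BYTEs; uuid's bytes_le uses that layout.
    guids = [uuid.UUID(bytes_le=bytes(data[off:off + 0x10])) for off in (0x18, 0x28)]
    # Header section table: two entries of (offset, length), each field a QWORD.
    sections = [struct.unpack_from("<QQ", data, 0x38 + 0x10 * i) for i in range(2)]
    # Assumption: only Version 3 and later carry the trailing content-offset QWORD.
    content_offset = None
    if version >= 3:
        (content_offset,) = struct.unpack_from("<Q", data, 0x58)
    return {
        "version": version,
        "header_len": header_len,
        "timestamp": timestamp,
        "lang_id": lang_id,
        "guids": guids,
        "sections": sections,
        "content_offset": content_offset,
    }
```

For a Version 3 file, the first $60 bytes are enough input; sections[0] and sections[1] then locate header section 0 and the directory.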
Header Section 0
This section contains the total size of the file, and not much else
0000: DWORD $01FE (unknown)
0004: DWORD 0 (unknown)
0008: QWORD File Size
0010: DWORD 0 (unknown)
0014: DWORD 0 (unknown)
Header Section 1: The Directory Listing
This is the central part of the .chm file: a directory of the files and information it contains.
Directory header
The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2.
0018: DWORD Depth of the index tree
            1 if there is no index, 2 if there is one level of PMGI
            chunks.
001C: DWORD Chunk number of root index chunk, -1 if there is none
            (though at least one file has 0 despite there being no
            index chunk, probably a bug.)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
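The $54-byte directory header can be read in one call. This is a sketch only; the format string, function name, and key names are mine, and the fields marked unknown above are skipped. The -1 values are read as signed DWORDs.

```python
import struct
import uuid

# 4s signature, 6 unsigned DWORDs, 4 signed DWORDs, 2 unsigned DWORDs,
# a 16-byte GUID, 1 unsigned DWORD, 3 signed DWORDs: $54 bytes in total.
ITSP_FORMAT = "<4s6I4i2I16sI3i"


def parse_itsp_header(data, offset=0):
    """Parse the directory header found at the start of header section 1."""
    f = struct.unpack_from(ITSP_FORMAT, data, offset)
    if f[0] != b"ITSP":
        raise ValueError("missing 'ITSP' signature")
    return {
        "version": f[1],
        "header_len": f[2],
        "chunk_size": f[4],
        "quickref_density": f[5],
        "index_depth": f[6],
        "root_index_chunk": f[7],   # -1 if there is no index chunk
        "first_pmgl": f[8],
        "last_pmgl": f[9],
        "num_chunks": f[11],
        "lang_id": f[12],
        "guid": uuid.UUID(bytes_le=f[13]),
    }
```

Chunk number k then starts at header_len + k * chunk_size from the beginning of the directory section, since the chunks directly follow the header.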
The Listing Chunks
The header is directly followed by the directory chunks. There are
two types of directory chunks -- index chunks, and listing chunks. The
index chunk will be omitted if there is only one listing chunk. A
listing chunk has the following format:
0000: char[4] 'PMGL'
0004: DWORD Length of free space and/or quickref area at end of
            directory chunk
0008: DWORD Always 0.
000C: DWORD Chunk number of previous listing chunk when reading
            directory in sequence (-1 if this is the first listing chunk)
0010: DWORD Chunk number of next listing chunk when reading
            directory in sequence (-1 if this is the last listing chunk)
0014: Directory listing entries (to quickref area). Sorted by
      filename; the sort is case-insensitive.
The quickref area is written backwards from the end of the chunk.
One quickref entry exists for every n entries in the file, where n is
calculated as 1 + (1 << quickref density). So for density = 2, n
= 5.
Chunklen-0002: WORD Number of entries in the chunk
Chunklen-0004: WORD Offset of entry n from entry 0
Chunklen-0008: WORD Offset of entry 2n from entry 0
Chunklen-000C: WORD Offset of entry 3n from entry 0
...
The format of a directory listing entry is as follows
ENCINT: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is
in, after the section has been decompressed (if appropriate). The
length also refers to length of the file in the section after
decompression.
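Putting the chunk header and the entry format together, a listing chunk can be walked from offset $14 up to the start of the free/quickref area. The sketch below is in Python with names of my own choosing; it ignores the quickref area entirely and relies on the ENCINT encoding described under "Encoded Integers".

```python
import struct


def read_encint(buf, pos):
    """Decode one ENCINT starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:          # high bit clear: this was the last byte
            return value, pos


def parse_pmgl_chunk(chunk):
    """Return (previous chunk, next chunk, entries) for one PMGL chunk."""
    sig, free_len, _zero, prev_chunk, next_chunk = struct.unpack_from(
        "<4s2I2i", chunk, 0)
    if sig != b"PMGL":
        raise ValueError("not a listing chunk")
    pos = 0x14
    end = len(chunk) - free_len      # entries stop where the free/quickref area begins
    entries = []
    while pos < end:
        name_len, pos = read_encint(chunk, pos)
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        section, pos = read_encint(chunk, pos)
        offset, pos = read_encint(chunk, pos)
        length, pos = read_encint(chunk, pos)
        entries.append((name, section, offset, length))
    return prev_chunk, next_chunk, entries
```

Following next_chunk from the first PMGL chunk until it is -1 yields the whole directory in sorted order.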
There are two kinds of file represented in the directory: user data
and format related files. The files which are format-related have names
which begin with '::', the user data files have names which begin with
"/".
The Index Chunk
An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL.
The format of a directory index entry is as follows:
ENCINT: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
When higher-level indexes exist (when the depth of the index tree is
3 or higher), presumably the upper-level indexes will contain the
numbers of lower-level index chunks rather than listing chunks.
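Because each index entry names the first file of a listing chunk, a lookup can pick the last entry whose name does not sort after the target. The following Python sketch assumes a single index level (depth 2) and approximates the case-insensitive sort with str.lower(), which may not match Microsoft's comparison for non-ASCII names. All function names are mine.

```python
import struct


def read_encint(buf, pos):
    """Decode one ENCINT starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def parse_pmgi_chunk(chunk):
    """Return the (first name, listing chunk number) pairs of one PMGI chunk."""
    sig, free_len = struct.unpack_from("<4sI", chunk, 0)
    if sig != b"PMGI":
        raise ValueError("not an index chunk")
    pos, end = 8, len(chunk) - free_len
    entries = []
    while pos < end:
        name_len, pos = read_encint(chunk, pos)
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        chunk_no, pos = read_encint(chunk, pos)
        entries.append((name, chunk_no))
    return entries


def find_listing_chunk(entries, name):
    """Pick the listing chunk that could hold name, or None if it sorts first."""
    target = name.lower()            # assumption: lower() stands in for the real collation
    found = None
    for first_name, chunk_no in entries:
        if first_name.lower() > target:
            break
        found = chunk_no
    return found
```

The chosen listing chunk still has to be scanned for an exact match; the index only narrows the search.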
Encoded Integers
An ENCINT is a variable-length integer. The high bit of each byte
indicates "continued to the next byte". Bytes are stored most
significant to least significant. So, for example, $EA $15 is
(((0xEA&0x7F)<<7)|0x15) = 0x3515.
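In code, decoding is a short loop over 7-bit groups. The Python sketch below also includes the matching encoder, which is not described above but follows from the same rules; both function names are mine.

```python
def decode_encint(buf, pos=0):
    """Decode one ENCINT from buf at pos; return (value, next position)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)   # append the low 7 bits
        if not byte & 0x80:                    # high bit clear: last byte
            return value, pos


def encode_encint(value):
    """Encode a non-negative integer as an ENCINT, most significant group first."""
    groups = [value & 0x7F]                    # final byte has the high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)   # earlier bytes carry the continue bit
        value >>= 7
    return bytes(reversed(groups))


print(hex(decode_encint(b"\xEA\x15")[0]))      # 0x3515
```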
The Content
In Version 3, the content typically immediately follows the header
sections, and is at the location indicated by the QWORD following the
header section table. In Version 2, the content immediately follows the
header. All content section 0 locations in the directory are relative
to that point. The other content sections are stored WITHIN content
section 0.
The Namelist file
There exists in content section 0 and in the directory a file called
"::DataSpace/NameList". This file contains the names of all the content
sections. The format is as follows:
0000: WORD Length of file, in words
0002: WORD Number of entries in file
Each entry:
0000: WORD Length of name in words, excluding terminating null
0002: WORD Double-byte characters
xxxx: WORD 0
Yes, the names have a length word AND are null terminated; sort of a
belt-and-suspenders approach. The coding system is likely UTF-16
(little endian).
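A small Python sketch of reading this file follows. It trusts the length word of each name and simply skips the terminating null; the function name is mine, and decoding as UTF-16 little endian follows the guess above rather than any confirmed fact.

```python
import struct


def parse_namelist(data):
    """Return the list of content section names stored in ::DataSpace/NameList."""
    _file_words, count = struct.unpack_from("<2H", data, 0)
    pos = 4
    names = []
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", data, pos)   # length in 16-bit units
        pos += 2
        names.append(data[pos:pos + 2 * name_len].decode("utf-16-le"))
        pos += 2 * name_len + 2                             # skip name and its null
    return names
```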
The section names seen so far are:
- Uncompressed
- MSCompressed
"Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with Microsoft's LZX algorithm.
The Section Data
For each section other than 0, there exists a file called
'::DataSpace/Storage/<Section Name>/Content'. This file contains
the compressed data for the section. So, conceptually, getting a file
from a nonzero section is a multi-step process. First you must get the
content file from section 0. Then you decompress (if appropriate) the
section. Then you get the desired file from your decompressed section.
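The multi-step process can be sketched as one function. Everything here is hypothetical scaffolding: directory is assumed to be a dict mapping names to (section, offset, length) tuples, section0 is the raw bytes of content section 0, section_names is the list from the NameList file (assuming a section's number is its position in that list), and decompress is a caller-supplied function standing in for the real LZX decoder.

```python
def read_chm_file(name, directory, section0, section_names, decompress):
    """Fetch one file's bytes, going through the section's Content file if needed."""
    section, offset, length = directory[name]
    if section == 0:
        # Section 0 is stored as-is; offsets are relative to its start.
        return section0[offset:offset + length]
    # Step 1: get the section's Content file out of section 0.
    content_name = "::DataSpace/Storage/%s/Content" % section_names[section]
    content_section, c_off, c_len = directory[content_name]
    if content_section != 0:
        raise ValueError("Content file is expected to live in section 0")
    compressed = section0[c_off:c_off + c_len]
    # Step 2: decompress the whole section (a real reader would do this lazily).
    plain = decompress(section_names[section], compressed)
    # Step 3: slice the desired file out of the decompressed section.
    return plain[offset:offset + length]
```

Decompressing the entire section for every request is wasteful; the reset table described in the appendix exists to avoid exactly that.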
Other section format-related files
There are several other files associated with the sections:
- ::DataSpace/Storage/<SectionName>/ControlData
  This file contains $20 bytes of information on the compression. The information is partially known:
  0000: DWORD Number of DWORDs following 'LZXC', must be 6 if version is 2
  0004: ASCII 'LZXC' Compression type identifier
  0008: DWORD Version (Must be <=2)
  000C: DWORD The LZX reset interval
  0010: DWORD The window size
  0014: DWORD The cache size
  0018: DWORD 0 (unknown)
  Reset interval, window size, and cache size are in bytes if version is 1, $8000-byte blocks if version is 2.
- ::DataSpace/Storage/<SectionName>/SpanInfo
  This file contains a quadword containing the uncompressed length of the section.
- ::DataSpace/Storage/<SectionName>/Transform/List
  It appears this file was intended to contain a list of GUIDs belonging to
  methods of decompressing (or otherwise transforming) the section.
  However, it actually contains only half of the string representation of
  a GUID, apparently because it was sized for characters but contains
  wide characters.
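The ControlData and SpanInfo files are small enough to parse in a few lines. In this Python sketch the function names and dictionary keys are mine; the sizes are normalised to bytes using the version rule given above, and only the known fields are read.

```python
import struct


def parse_control_data(data):
    """Parse the known part of a ControlData file; sizes are returned in bytes."""
    n_dwords, sig, version, reset_interval, window_size, cache_size = (
        struct.unpack_from("<I4s4I", data, 0))
    if sig != b"LZXC":
        raise ValueError("section is not LZX-compressed")
    if version > 2:
        raise ValueError("unsupported ControlData version")
    # Version 2 counts in $8000-byte blocks, version 1 in bytes.
    unit = 0x8000 if version == 2 else 1
    return {
        "version": version,
        "reset_interval_bytes": reset_interval * unit,
        "window_size_bytes": window_size * unit,
        "cache_size_bytes": cache_size * unit,
    }


def parse_span_info(data):
    """Return the uncompressed length of the section from its SpanInfo file."""
    (uncompressed_len,) = struct.unpack_from("<Q", data, 0)
    return uncompressed_len
```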
Appendix: The Compression
The compressed sections are compressed using LZX, a compression
method Microsoft also uses for its cabinet files. To confirm this, check
the second DWORD of compression info in the ControlData file for the
section; it should be 'LZXC'. To decompress, first read the file
"::DataSpace/Storage/<SectionName>/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable".
This reset table has the following format:
0000: DWORD 2 unknown (possibly a version number)
0004: DWORD Number of entries in reset table
0008: DWORD 8 Size of table entry (bytes)
000C: DWORD $28 Length of table header (area before table entries)
0010: QWORD Uncompressed Length
0018: QWORD Compressed Length
0020: QWORD 0x8000 block size for locations below
0028: QWORD 0 (zeroth entry of table)
0030: QWORD location in compressed data of 1st block boundary in
            uncompressed data
      (Repeat to end of file)
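Reading the reset table is again a matter of fixed-size fields followed by an array of QWORDs. A Python sketch, with names of my own choosing, that uses the header length and entry size stored in the file rather than hard-coding $28 and 8:

```python
import struct


def parse_reset_table(data):
    """Parse a ResetTable file into its header fields and list of entries."""
    (version, count, entry_size, header_len,
     uncompressed_len, compressed_len, block_size) = struct.unpack_from(
        "<4I3Q", data, 0)
    if entry_size != 8:
        raise ValueError("unexpected reset table entry size")
    # Entry i is the location in the compressed data of the i-th block boundary.
    entries = [
        struct.unpack_from("<Q", data, header_len + i * entry_size)[0]
        for i in range(count)
    ]
    return {
        "version": version,
        "uncompressed_len": uncompressed_len,
        "compressed_len": compressed_len,
        "block_size": block_size,
        "entries": entries,
    }
```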
Now you can finally obtain the section (from its Content file). The
window size for the LZX compression is 16 (decimal) on all the files
seen so far. This is specified by the DWORD at $10 in the ControlData
file (but note that this DWORD gives the window size in 0x8000-byte
blocks, not the LZX code for the window size).
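Converting the ControlData value into the LZX window size code is a one-liner. This sketch assumes the LZX code is the base-2 logarithm of the window size in bytes, as in the CAB flavour of LZX; the function name is mine.

```python
def lzx_window_bits(window_size_field, version):
    """Turn the ControlData window size field into an LZX window size code."""
    unit = 0x8000 if version == 2 else 1          # blocks for version 2, bytes for 1
    window_bytes = window_size_field * unit
    if window_bytes <= 0 or window_bytes & (window_bytes - 1):
        raise ValueError("window size is not a power of two")
    return window_bytes.bit_length() - 1          # log2 of the size in bytes


print(lzx_window_bits(2, 2))                      # 16
```

Under that assumption, a version 2 file with a window size code of 16 would carry the value 2 in the DWORD at $10.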
The rule that the input bit-stream is to be re-aligned to a 16-bit
boundary after $8000 output characters have been processed IS in
effect, despite this LZX not being part of a CAB file. The reset table
tells you when this was done, though there is no need for that during
decompression; you can just keep track of the number of output
characters. Furthermore, while this does not appear to be documented in
the LZX format, the uncompressed stream is padded to an $8000 byte
boundary.
There is one change from LZX as defined by Microsoft: After each LZX
reset interval (defined in the ControlData file, but in practice equal
to the window size) of compressed data is processed, the LZX state is
fully reset, as if an entirely new file was being encoded. This allows
semi-random access to the compressed data; you can start reading on any
reset interval boundary using the reset interval size and the reset
table.
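The arithmetic behind this semi-random access can be sketched as follows, assuming the reset interval is a whole multiple of the reset table's block size and that the table has one entry per block. The function and parameter names are mine.

```python
def locate_reset_point(offset, reset_interval_bytes, block_size, reset_entries):
    """Find where to start decoding in order to reach an uncompressed offset.

    Returns (compressed_start, skip): begin decoding with a fresh LZX state at
    compressed_start, then discard skip bytes of output to reach offset.
    """
    blocks_per_reset = reset_interval_bytes // block_size
    block = offset // block_size                      # block holding the offset
    start_block = block - block % blocks_per_reset    # round down to a reset boundary
    compressed_start = reset_entries[start_block]
    skip = offset - start_block * block_size
    return compressed_start, skip
```

With a reset interval of $10000 and a block size of $8000, reaching uncompressed offset $1A000 means starting at reset table entry 2 and discarding the first $A000 bytes of output.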
Note:
Earlier versions of this document stated that the
reset interval only reset the Huffman tables and required outputting
the 1-bit header again. This was erroneous. The Lempel Ziv state is
reset as well. In practice, a decoder works just as well with the
incorrect assumption, but encoding a file with match positions which
refer to a time before the most recent LZX reset causes crashes on
decoding.
Acknowledgements
The following people (in no particular order) have submitted
information which has helped correct and close the gaps in this
document.
- Peter Ferrie (peter_ferrie at hotmail.com)
- Pabs (pabs at zip.to), who also runs the CHM Spec page.
And others I have not been able to reach.
Copyright 2001-2003 Matthew T. Russotto
You may freely copy and distribute unmodified copies of this file,
or copies where the only modification is a change in line endings,
padding after the html end tag, coding system, or any combination
thereof. The original is in ASCII with Unix line endings.
HTML Help (CHM) Tools and Information
HTML Help format
An incomplete description of Microsoft's .CHM format.
ITOL/ITLS format
A description of Microsoft's ITOL/ITLS format, which is used by HTML Help 2.0 among other things.
CHM Tools package
A set of tools for working with CHM files, consisting of a C language
library 'chmlib' and a program called 'chmdump' which dumps out the
files in a CHM file.
Not everything in the document is implemented here, but it is a
start, and an LZX decompression engine (from Stuart Caie's
"cabextract", suitably modified) is included. License is the GPL,
following "cabextract".
I also have a C++ library for reading CHM and ITOL/ITLS formats,
including the ability to use arbitrary transforms in the latter.
LZX compression package (lzxcomp)
An LGPLed LZX compression engine, suitable for creating compressed CHM
files, for use in a CAB-making utility, or for any other purpose LZX
is useful for.
Documentation for the lzxcomp library is included and available on-line.
IMPORTANT
Changed May 3 2100 EST: fixed a really dumb bug introduced last minute.
Also allowed LZ compressor to look into the match buffer, for a significant compression improvement.
- hhm (GPL2):
  hhm (HTML Help Maker) is a program that makes ITS files and in the
  future it will also make Compiled HTML Help (CHM) files. Both types of
  files are a kind of compressed archive format used on Win98, Win2K and
  other Microsoft operating systems to store documentation.
- chmdeco (GPL2):
  chmdeco (CHM decompiler) is a program that converts the internal files
  of CHM files back into the hhp, hhc, hhk etc used to compile the
  documentation.
- chmspec (GPL2):
  chmspec (CHM specification) is an effort to document Microsoft's
  Compiled HTML Help files (CHMs), mainly the internal files, since the
  archive format is documented already.
- istorage (BSD):
  This is just a simple Windows proggie to extract files from those pesky
  MS compound file objects accessible via OLE's StgOpenStorage function
  and the IStorage interface exposed by that function. These compound
  file objects are created by word, excel, & probably other MS progs.
  Also Macromedia Flash source files (*.fla - there are some of these
  available from levitated.net, which is an interesting site) are these
  compound files. These compound file objects can be thought of as the
  equivalent of tar files, but of course MS went & invented some new
  format without even considering .zip, .lha, .tar.gz, .cab, blah blah
  blah. One weird thing about them is that the IStreams inside can & for
  word & excel do have freaky chars in their names, often as the first
  char; for example, in word 2000 docs there are streams named
  "\005SummaryInformation" (that is a 0x05 character - there are also
  ones with 0x01). MSDEV .opt files are also these compound files. I
  updated this in April 2002 to extract Compiled HTML Help (chm) files
  (also known as InfoTech Storage (ITS) files) too. This feature uses the
  same IStorage interface, but uses an ITStorage
  (CLSID=5D02926A-212E-11D0-9DF9-00A0C922E6EC) object (from itss.dll) got
  from CoCreateInstance to open the file. I found out how to do this from
  these two code samples: www.keyworks.net/code.htm & helpware.net/delphi/index.html.
- istorage-make (unicode version) (BSD):
  This is just a simple proggie to create those pesky MS compound files,
  and InfoTech Storage (ITS) files too. It uses the same interfaces as
  istorage (an extractor for these files).
- indychat (public domain):
  This is the code behind chat.indymedia.org. The old page was really
  crappy, so I wrote the new php version and patched and installed the
  new chat software.
- pinballcheat (GPL2):
  This is a little proggy to help you change the High score table for
  Microsoft's 3D Pinball game - specifically SpaceCadet but you (or I on
  request) can easily recompile it for other tables.
- clap (public domain):
  clap is a Windows program that monitors the Win32 clipboard and appends
  each piece of data copied to an internal clipboard, which it then sends
  back to the Win32 clipboard. The effect is that it seems that each copy
  operation appends data to the clipboard. Currently only works with the
  CF_TEXT clipboard format.
- emxwrapper (public domain):
  winmgr.h & kbdscan.h are files from the emx runtime on OS/2. I have
  created a ncurses/panel wrapper for these files, which means that
  programs written specifically for these files can be compiled for
  ncurses/panel on platforms that support these two libraries, such as
  GNU/Linux, various UNIXes & GNU/Cygwin. It was written on GNU/Cygwin
  using ncurses & panel so there may be incompatibility issues. There are
  likely lots of bugs so if it doesn't work try to fix it & send me a
  diff -u or just complain & I'll try to fix it. It is by no means
  complete since the program I developed it for (VBinDiff 1.7, which is
  available at the Hobbes OS/2 Archive) doesn't use all the winmgr.h &
  kbdscan.h functions & constants.
- bezier (public domain):
  Get my bezier include file that I was using in a space animation. Use
  the functions HermiteLP, BezierLP, BezierLPTension to load points into
  the geometry matrix. Use p(u) to get the point, pd(u) to get the
  derivative, pdd(u) to get the second derivative & pddd(u) to get
  the third derivative. Other functions return curvature, the principal
  normal and binormal vectors, torsion (doesn't work quite right yet),
  and some as yet unfinished macros for orienting objects along the
  spline & doing banking & stuff. I think the math is right -
  email me if not.
- df3maker (public domain): a simple proggie to make df3 files for povray. Ported to C++ from the Visual Basic version by Mark James Lewin. His page is down, so it is available here and at archive.org.
- xchat scripts (public domain): Some python/etc scripts for xchat that I find useful.
- ~/bin scripts (public domain): Some scripts I have written for my ~/bin folder that I find useful.
- zwiki to twiki (public domain): a script to extract some data from zwiki and prepare it for importing to twiki.
I have contributed various patches that are not listed here to
various projects.
I've developed some patches for the following packages that are still pending:
mailman, cgiirc, et al.
- scite
  - A half complete 133t local.properties
  - A patch to make the window title be a reversed path (eg
    patches.txt\ideas\pabs\:H). This is useful when you have enough windows
    open for your taskbar to only show part of the filename, but Neil
    refused to put it into the official distribution. I have diffs against 1.46 and 1.49.
- imc_aggregator
- povray
  - particles (same licence as upstream - will relicence if they change licences): This was based on a project by James Neill - see his projects page.
    Chris Huff had integrated the ParticlePatch into his MegaPOV+ (with new
    particle system simulations) and had windows & mac compiles of it
    at his homepage. Mark Gordon had a Linux compile of MegaPOV+ at his homepage. Also the glow patch in MegaPOV (&MP+) is better but has no refraction.
  - some abandoned patches:
    - I was working on a custom version (Win32 version) that keeps the scene in
      memory between renders. I attempted to develop the controls used to
      interface with the scene memory. I have now given up. If you want the
      sources please contact me.
    - The ObjectCameraPatch progressed slowly - few obj types done (torus,
      cone/cyl, bicubic, sphere), then I waited for a bit as there was not
      much interest in the ObjectCamPatch at news.povray.org. After that I
      lost interest in it and moved onto other things.
- frhed:
  I developed a whole heap of patches for frhed, all of which are
  included in v1.1 (the latest version ATM). You can still view the old patch page if you wish.
- chmlib: A coupla simple fixes.
- wine: Removed one FIXME that frhed was showing.
- nsis: Fixed build system on Linux
I'm not a DD, or even in the NM queue yet. I have filed a few ITPs,
found a sponsor for some debian packages, experienced the melting of
NEW and the "flood" of FTBFS/etc bugreports. I plan to ITP some more
packages, a couple of fonts, some of my software and some orphaned
packages at some point, and enter the NM queue after a while of having
sponsored packages in debian. I'm thinking that I will get more
involved in QA work during that period and start sponsoring new
maintainers once I become a DD.
I also maintain debian packages of some of the software I have written, which can be found on mentors.debian.net (this means you must build it yourself).
I maintain a crappy build of xchm for windows that can be found in the
xchm downloads page. Fixing its weird crashyness is at the top of the todo list.
I also maintain win32 packages of some of the software I have written.
These packages will be uploaded to their respective download pages.
CHMLIB 0.37
CHMLIB is a library for dealing with Microsoft ITSS/CHM
format files. Right now, it is a very simple library, but sufficient
for dealing with all of the .chm files I've come across. Due to the
fairly well-designed indexing built into this particular file format,
even a small library is able to gain reasonably good performance
indexing into ITSS archives.
Version 0.37 is primarily a security release. On October 25th, a
security vulnerability was located by Sven Tantau. This release is
primarily to fix this, as well as a broken Makefile.in which didn't
properly install the library for people who did:
./configure; make; make install
If you did this, and were unable to subsequently build the example
programs, this release should fix it for you. 0.37.2 includes yet
another small patch to the Makefile.in. The change in 0.37.2 will be
mainly of importance to packagers who use:
make install DESTDIR=/path/to/sandbox
as DESTDIR had been inadvertently omitted from one of the actions in the "make install" target.
In the continuing Makefile.in saga, 0.37.3 contains yet one more
minor patch to make DESTDIR work properly. The symlinks were being
created pointing to $(DESTDIR)$(libdir)libchm.so.0.0.0. When DESTDIR
was set to a temporary build location for packaging, this meant that
the symlinks were broken. Thanks to Mark Rosenstand for pointing this
out and supplying a patch.
Once more with feeling! 0.37.4 contains yet another fix to the
Makefile.in, from Thomas Klausner. 'make install' was not using libtool
to install the shared library, which is a portability issue. (For
anyone who has had difficulty with 'make install' on non-Linux
platforms, this may be the cause.) Furthermore, exec_prefix was not
being set, so the library itself was being installed in /lib,
regardless of the chosen installation prefix.
Note: UTF-8 support is fairly minimal at present. By this, I mean that
I return the filename verbatim. Filename comparisons are done using
strcasecmp, which is clearly not correct for UTF-8. I'm very interested
in hearing from anyone who has dealt with internationalized filenames
before, and can tell me the "right" way to deal with them. (Hopefully
in a portable way.)
I've set up a sourceforge project to host this library, but I
haven't really had time to move the project over. Maybe someday...
To do:
- An index layer which sits on top of the basic chmlib
  functionality and understands the full-text index and possibly other
  indexing features. (topic structure?)
- More functionality for querying the contents of an archive
- Add write support (maybe?)
Right now this library supports enumerating the contents of the archive, and reading files from the archive.
This code is now being distributed under the LGPL. It incorporates
LZX decompression code from the cabextract project. Thanks to Stuart
Caie for authorizing the relicensing of this code in the context of
chmlib.
Thanks to Stan Tobias for bugfixes and Andrew Hodgetts for bugfixes and portability fixes!
For those interested in the CHM format, a good resource is Pabs' HtmlHelp Maker,
which is a free software project for creating HTML Help files. More
importantly, the author maintains a reverse engineered spec for HTML
Help files, including the structure of the internal files, which
maintain the "topic" structure of the help file, the full-text index,
and other useful things. At the time of writing, the spec was not
available for download; however, the author has plans to publish it on
his site when it is more complete, and an offer to mail out the current
version to anyone who expresses interest.
Another "free software" tool which fulfills approximately the same niche as this library may be downloaded from Matthew T. Russotto's CHM site.
If, for some reason, my library does not meet your needs, try out the
chmtools from this site. Apparently, this site also offers LZX
compression code.