| Cache format specification

For update purposes, HTTrack stores the original (untouched) HTML data,
references to downloaded files, and other meta-data (especially parts of the HTTP headers) in a cache
located in the hts-cache directory. Because local HTML pages are always modified to "fit" the local
filesystem structure, and because meta-data such as the Last-Modified date and ETag cannot be stored
with the associated files, the cache is mandatory for the reprocessing (update/continue) phases.
 
 
The (new) cache.zip format

The 3.31 release of HTTrack introduces a new cache format, more extensible and efficient than the previous one (ndx/dat format).
The main advantages of this cache are:
- The cache is made of ZIP file entries, with one entry per fetched URL (successful or not: errors are also stored)
- One single file for a complete website cache archive
- Standard ZIP format, which can easily be reused on most platforms and languages
- Data compressed with the efficient and open zlib format

For each entry:
- The ZIP file name is the original URL [see notes below]
- The ZIP file contents, if available, are the original data (compressed using the deflate algorithm)
- The ZIP file extra field (in the local file header) contains a list of meta-fields, very similar to the HTTP header fields. See also RFC.
- The ZIP file timestamp follows the "Last-Modified" field given for this URL, if any

Example of cache file:

$ unzip -l hts-cache/new.zip
Archive:  hts-cache/new.zip
HTTrack Website Copier/3.31-ALPHA-4 mirror complete in 3 seconds : 5 links scanned, 
3 files written (16109 bytes overall) [17690 bytes received at 5896 bytes/sec]
(1 errors, 0 warnings, 0 messages)
  Length     Date   Time    Name
 --------    ----   ----    ----
       94  07-18-03 08:59   http://www.httrack.com/robots.txt
     9866  01-17-04 01:09   http://www.httrack.com/html/cache.html
        0  05-11-03 13:31   http://www.httrack.com/html/images/bg_rings.gif
      207  01-19-04 05:49   http://www.httrack.com/html/fade.gif
        0  05-11-03 13:31   http://www.httrack.com/html/images/header_title_4.gif
 --------                   -------
    10167                   5 files
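
The same listing can be obtained programmatically. The following is a minimal sketch, assuming the cache archive opens with the Python standard library's zipfile module; the function names list_cache_entries and read_cached_body are hypothetical helpers chosen for this illustration, not part of HTTrack.

```python
import zipfile


def list_cache_entries(cache_path):
    """Return (url, size, date_time) tuples, one per cached URL.

    The ZIP entry name is the original URL, so no name mapping is needed.
    """
    with zipfile.ZipFile(cache_path) as zf:
        return [(info.filename, info.file_size, info.date_time)
                for info in zf.infolist()]


def read_cached_body(cache_path, url):
    """Return the original (decompressed) data stored for `url`."""
    with zipfile.ZipFile(cache_path) as zf:
        return zf.read(url)
```

Entries are looked up by their full URL, so nothing has to be extracted to disk and the reserved characters in the names never reach the filesystem.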
Example of cache file meta-data:
 
HTTP/1.1 200 OK
X-In-Cache: 1
X-StatusCode: 200
X-StatusMessage: OK
X-Size: 94
Content-Type: text/plain
Last-Modified: Fri, 18 Jul 2003 08:59:11 GMT
Etag: "40ebb5-5e-3f17b6df"
X-Addr: www.httrack.com
X-Fil: /robots.txt
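
Because this meta-data lives in the extra field of the local file header, it has to be read from the local header itself. The sketch below parses the fixed 30-byte part of a standard ZIP local file header and returns the name and extra field that follow it. The helper name read_local_extra is hypothetical, and treating the extra field as raw bytes is an assumption based on the description in this document.

```python
import struct

# Fixed 30-byte part of a ZIP local file header: signature, version, flags,
# method, time, date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length (all little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def read_local_extra(archive, header_offset):
    """Return (name, extra) as bytes for the local header at header_offset.

    `archive` is a binary file object opened on the cache ZIP.
    """
    archive.seek(header_offset)
    fixed = archive.read(LOCAL_HEADER.size)
    if len(fixed) != LOCAL_HEADER.size:
        raise ValueError("truncated local file header")
    (signature, _version, _flags, _method, _time, _date,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack(fixed)
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    name = archive.read(name_len)    # the original URL
    extra = archive.read(extra_len)  # the meta-data text
    return name, extra
```

With Python's zipfile module, the offset of each local header is available as the header_offset attribute of a ZipInfo object.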
There are also specific issues regarding this format:
- The data in the central directory (such as the CD extra field and CD comments) are not used
- The ZIP archive is allowed to contain more than 65535 (2^16 - 1) files; in such a case the total number of entries in the 32-bit central directory is 65535 (0xffff), but the presence of the 64-bit central directory is not mandatory
- The ZIP archive is allowed to contain more than 2^32 bytes (4 GiB); in such a case the 64-bit central directory must be present (not currently supported)

Meta-data stored in the "extra field" of the local file headers

The extra field is composed of text data, and this text data is composed of distinct header lines.
The end of the text, or a double CR/LF, marks the end of this zone.
This makes it possible to optionally store the original HTTP headers just after the "meta-data" headers, for informational use.
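
The line-oriented layout described here can be parsed in a few lines. This is a minimal sketch, assuming the extra field is available as bytes and that Latin-1 is an acceptable decoding (the document does not state an encoding); parse_cache_meta is a hypothetical name.

```python
def parse_cache_meta(extra):
    """Split extra-field bytes into (status_line, fields).

    Parsing stops at the end of the text or at the first blank line,
    so any original HTTP headers stored afterwards are ignored.
    """
    lines = extra.splitlines()  # bytes.splitlines splits on CR, LF and CRLF only
    if not lines:
        raise ValueError("empty extra field")
    status_line = lines[0].decode("latin-1")  # assumption: Latin-1 text
    fields = {}
    for raw in lines[1:]:
        if not raw:  # blank line: end of the meta-data zone
            break
        name, sep, value = raw.decode("latin-1").partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return status_line, fields
```

Only the first ":" separates the name from the value, so values such as dates or quoted entity tags that contain colons are kept whole.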
 
 The status line (the first header line)
 Status-Line = HTTP-Version SP Status-Code SP X-Reason-Phrase CRLF
 
 Other lines:
 
 Specific fields:
 
 
X-In-Cache
    Indicates whether the data is present in the cache, that is, as ZIP data (value=1), or in an external file (value=0).
    This field MUST be the first field.
X-StatusCode
    The status code as modified by HTTrack after processing. A 304 ("Not Modified") status code, for example, is transformed into a "200" code after processing.
X-StatusMessage
    The status message as modified by HTTrack.
X-Size
    The size of the stored data (either in the cache or in an external file).
X-Charset
    The original charset.
X-Addr
    The original URL address part.
X-Fil
    The original URL path part.
X-Save
    The local filename, depending on the user's "build structure" preferences.
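
As an illustration of how these fields fit together, the sketch below converts the header strings of one entry into typed values. It assumes the fields have already been collected into a dictionary of strings, and it treats X-StatusCode and X-Size as always present, which is an assumption of this example rather than a stated rule. The name summarize_entry and the result keys are hypothetical.

```python
def summarize_entry(fields):
    """Turn the raw header strings of one cache entry into typed values."""
    return {
        # True when the data is stored as ZIP data, False when it is external
        "in_cache": fields.get("X-In-Cache") == "1",
        "status": int(fields["X-StatusCode"]),   # assumed always present
        "message": fields.get("X-StatusMessage", ""),
        "size": int(fields["X-Size"]),           # assumed always present
        "charset": fields.get("X-Charset"),
        # Address part followed by path part, exactly as stored
        "location": fields.get("X-Addr", "") + fields.get("X-Fil", ""),
        "save_as": fields.get("X-Save"),
    }
```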
 Standard (RFC 2616) "useful" fields:
 
 
- Content-Type
- Last-Modified
- Etag
- Location
- Content-Disposition

Specific fields in "BNF-like" grammar:
 
 
X-In-Cache          = "X-In-Cache" ":" 1*DIGIT
X-StatusCode        = "X-StatusCode" ":" 1*DIGIT
X-StatusMessage     = "X-StatusMessage" ":" *<TEXT, excluding CR, LF>
X-Size              = "X-Size" ":" 1*DIGIT
X-Charset           = "X-Charset" ":" value
X-Addr              = "X-Addr" ":" scheme ":" "//" authority
X-Fil               = "X-Fil" ":" rel_path
X-Save              = "X-Save" ":" rel_path
RFC standard fields: 
 
Content-Type        = "Content-Type" ":" media-type
Last-Modified       = "Last-Modified" ":" HTTP-date
Etag                = "ETag" ":" entity-tag
Location            = "Location" ":" absoluteURI
Content-Disposition = "Content-Disposition" ":" disposition-type *( ";" disposition-parm )
 And, for your information,
 
X-Reason-Phrase     = *<TEXT, with a maximum of 32 characters, and excluding CR, LF>
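
The status line grammar, including the 32-character limit on the reason phrase, can be checked with a regular expression. This is a sketch only: it assumes the RFC 2616 forms of HTTP-Version (HTTP/ followed by digits, a dot and digits) and Status-Code (three digits), and parse_status_line is a hypothetical name.

```python
import re

# HTTP-Version SP Status-Code SP X-Reason-Phrase, reason limited to 32 characters
STATUS_LINE = re.compile(r"HTTP/(\d+\.\d+) (\d{3}) ([^\r\n]{0,32})")


def parse_status_line(line):
    """Return (version, status_code, reason) or raise ValueError."""
    match = STATUS_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("malformed status line: %r" % line)
    version, code, reason = match.groups()
    return version, int(code), reason
```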
Note: Because the URLs may have an unexpected format, especially with a double "/" inside and other reserved characters ("?", "&", ...),
various ZIP extraction tools may have trouble accessing or decompressing the data.
Libraries should generally handle this peculiar format, however.
 
 |