Creating DjVu documents Linux HOWTO

version 0.2 (12 July 2006)

Copyright (c) 2006 Vladimir Komendantsky
Permission is granted to copy, distribute and/or modify the content of this page under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available at http://www.gnu.org/licenses/fdl.html.

1  Synopsis

This document explains how to use djvulibre, an implementation of DjVu, to create good-quality DjVu documents on Linux. The DjVu format features bitmap document compression and hypertext structure. It is used by numerous web sites around the world for storing and distributing digital documents, including scanned documents and high-resolution pictures. One advantage of DjVu files is that they are notably small, often smaller than PDF or JPEG files with the same content. This makes DjVu a helpful tool for digitizing books and journals, especially scientific ones.

The case considered below is that of a DjVu document created from a number of separate JPEG files, each containing a single page. JPEG is not a limitation here: the examples can be adapted to arbitrary image formats. Conversion from PDF to DjVu is also discussed. Use of scanner software is not explained; refer to the relevant documentation.

Requirements. The packages djvulibre, jpeg and netpbm are required. The packages sane and xpdf are highly recommended.
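
Before going further it may help to check that the command-line tools used by the scripts in this HOWTO are installed. The following is only a sketch: the function name missing_tools and the variable HOWTO_TOOLS are made up for this illustration, and the list contains just the commands that appear in the scripts below.

```shell
#!/bin/bash
# missing_tools TOOL...: print the name of every TOOL that is not found in PATH.
# (hypothetical helper, not part of djvulibre)
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Commands called by the scripts in this HOWTO:
#   cjb2, cpaldjvu, djvm          (djvulibre)
#   jpegtran                      (jpeg)
#   anytopnm, ppmtopgm, pgmtopbm  (netpbm)
#   pdftoppm                      (xpdf)
HOWTO_TOOLS="cjb2 cpaldjvu djvm jpegtran anytopnm ppmtopgm pgmtopbm pdftoppm"
```

Running missing_tools $HOWTO_TOOLS (unquoted, so that the list is split into words) prints nothing when everything is available.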

2  Creating DjVu

2.1  Scanning a book

Suppose the following situation for this section. We have a book that needs to be scanned and stored in a digital format. For simplicity, suppose that all the book contents are black and white (text, formulas, diagrams, etc.) except for the cover, which is printed in colour. What we can normally do is scan it page by page and store the pages separately in some image format, like JPEG or PDF. Personally, I believe JPEG is the best choice. If you find, for instance, compressed TIFF more suitable for your purposes, this HOWTO may still be of some help, although the example scripts will need to be slightly amended. For the time being, let us stick with JPEG.

In our situation we scan the front cover of the book (and the back cover too, if it contains any noticeable text or pictures) to colour JPEG files. Then we scan the rest to black and white JPEGs. This should give the optimal performance. When saving the scanned images, pay attention to the file names. For the purposes of conversion to DjVu, the alphabetical order of the file names must match the order of the pages. For example,

000.jpg, 001.jpg, 002.jpg, ..., 012.jpg
is a correct numbering; and
0.jpg, 1.jpg, 2.jpg, ..., 12.jpg
is a wrong one because 12.jpg sorts before 2.jpg. Once the entire book is scanned, place all the image files into a separate directory.
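
If the pages have already been saved with unpadded numbers, they can be renamed rather than rescanned. The sketch below assumes file names of the form NUMBER.jpg; the helper names pad_name and pad_all are invented for this example, and the default width of 3 digits simply follows the numbering shown above.

```shell
#!/bin/bash
# pad_name NAME [WIDTH]: print NAME with its numeric stem zero-padded to
# WIDTH digits (default 3). Names whose stem is not a number are printed as is.
pad_name() {
  local name width stem ext
  name=$1
  width=${2:-3}
  stem=${name%.*}
  ext=${name##*.}
  case $stem in
    ''|*[!0-9]*) printf '%s\n' "$name"; return 0 ;;
  esac
  # strip leading zeros so that printf does not read e.g. 08 as an octal number
  while [ "${#stem}" -gt 1 ] && [ "${stem#0}" != "$stem" ]; do
    stem=${stem#0}
  done
  printf "%0${width}d.%s\n" "$stem" "$ext"
}

# pad_all [WIDTH]: rename every NUMBER.jpg in the current directory,
# never overwriting a file that already exists.
pad_all() {
  local f new
  for f in [0-9]*.jpg; do
    [ -e "$f" ] || continue
    new=$(pad_name "$f" "${1:-3}")
    if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
      mv -- "$f" "$new"
    fi
  done
}
```

For instance, pad_name 2.jpg prints 002.jpg, and running pad_all in the image directory renames 0.jpg, 1.jpg, ..., 12.jpg to 000.jpg, 001.jpg, ..., 012.jpg.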

Depending on the scanner, the software and the method of scanning, you may need to rotate all or just some of the JPEG images, usually following some simple pattern. The script jpegsrotate below can be quite handy in such a case. For example, run it with the parameter --even to turn the even pages in the current directory upside down. The program jpegtran used in the script can rotate JPEGs only by 90, 180 or 270 degrees clockwise.

#!/bin/bash
#
# jpegsrotate
#

DEFMASK="*.jpg"
DEFEVENMASK="*[02468].jpg"
DEFODDMASK="*[13579].jpg"
DEFDEG=270

# usage is defined first because a function must exist before it is called
function usage() {
  echo
  echo "usage:"
  echo "$0"
  echo "    rotates files with the mask $DEFMASK by $DEFDEG degrees clockwise"
  echo "$0 --even"
  echo "    rotates even files with the mask $DEFEVENMASK by 180 degrees"
  echo "$0 --odd"
  echo "    rotates odd files with the mask $DEFODDMASK by 180 degrees"
  echo "$0 --params \"REGEXP\" (90|180|270)"
  echo "    rotates files with the mask REGEXP by the given angle clockwise"
  echo
}

# check that jpegtran is installed
if ! command -v jpegtran > /dev/null 2>&1; then
  usage
  echo "Error: jpegtran is needed"
  echo
  exit 1
fi

shopt -s extglob

if [ "$1" == "--even" ]; then
  MASK=$DEFEVENMASK
  DEG=180
elif [ "$1" == "--odd" ]; then
  MASK=$DEFODDMASK
  DEG=180
elif [ "$1" == "--params" ]; then
  if [ -n "$2" -a -n "$3" ]; then
    MASK=$2
    DEG=$3
  else
    usage
    exit 1
  fi
elif [ -n "$1" ]; then
  usage
  exit 1
else
  MASK=$DEFMASK
  DEG=$DEFDEG
fi

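# $MASK is deliberately left unquoted so that the shell expands the pattern
for i in $MASK; do
  if [ ! -e "$i" ]; then
    usage
    echo "Error: current directory must contain files with the mask $MASK"
    echo
    exit 1
  fi
  echo "$i"
  # replace the original only if jpegtran succeeded
  jpegtran -rotate "$DEG" "$i" > "$i.rotated" && mv "$i.rotated" "$i"
done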
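
Note that the masks used for --even and --odd look only at the last digit of the file name before .jpg, so "even" and "odd" refer to the number in the file name, not to the printed page number. The following sketch applies the same two patterns to a single name; page_parity is a hypothetical helper written for this illustration and is not part of jpegsrotate.

```shell
#!/bin/bash
# page_parity NAME: print "even", "odd" or "none" according to the
# patterns used by jpegsrotate for --even and --odd.
page_parity() {
  case $1 in
    *[02468].jpg) echo even ;;
    *[13579].jpg) echo odd ;;
    *)            echo none ;;
  esac
}
```

With the numbering from the previous example, page_parity 000.jpg prints even and page_parity 013.jpg prints odd, so if the scan starts at 000.jpg the first scanned image counts as an even page.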

2.2  JPEG to bitonal DjVu

When the images are ready, each of them needs to be converted to a separate page in DjVu format by a DjVu encoder, like cjb2 or cpaldjvu, and then the separate pages are to be bundled in a single DjVu document by djvm. Write the following script called any2djvu-bw somewhere, e.g. to ~/bin/. Run the script in the directory containing the source images to convert separate black and white pages.

#!/bin/bash
#
# any2djvu-bw
#

DEFMASK="*.jpg"
DPI=300
# uncomment the following line to compile a bundled DjVu document
#OUTFILE="#0-bw.djvu"

# usage is defined first because a function must exist before it is called
function usage() {
  echo
  echo "usage:"
  echo
  echo "$0 [\"REGEXP\"]"
  echo "    converts single pages with the default mask $DEFMASK (or REGEXP if provided)"
  echo "    in the current directory to single-page black and white djvu documents"
# uncomment the following line to compile a bundled DjVu document
# echo "    and bundles them as a djvu file $OUTFILE"
  echo
}

# check that every required tool is installed
for tool in anytopnm ppmtopgm pgmtopbm cjb2; do
  if ! command -v "$tool" > /dev/null 2>&1; then
    usage
    echo "Error: anytopnm, ppmtopgm, pgmtopbm and cjb2 are needed"
    echo
    exit 1
  fi
done

shopt -s extglob

if [ -n "$1" ]; then
  MASK=$1
else
  MASK=$DEFMASK
fi

# $MASK is deliberately left unquoted so that the shell expands the pattern
for i in $MASK; do
  if [ ! -e "$i" ]; then
    usage
    echo "Error: current directory must contain files with the mask $MASK"
    echo
    exit 1
  fi
  # skip pages that have already been converted
  if [ ! -e "$i.djvu" ]; then
    echo "$i"
    anytopnm "$i" | ppmtopgm | pgmtopbm -value 0.499 > "$i.pbm"
# in netpbm >= 10.23 the above line can be replaced with the following:
#   anytopnm "$i" | ppmtopgm | pamditherbw -value 0.499 > "$i.pbm"
    cjb2 -dpi "$DPI" "$i.pbm" "$i.djvu"
    rm -f "$i.pbm"
  fi
done

# uncomment the following line to compile a bundled DjVu document
#djvm -c "$OUTFILE" $MASK.djvu

If you run the script as

$ ~/bin/any2djvu-bw

it will take the default action and try to convert all the images *.jpg in the current directory to single-page DjVu files with the extension .jpg.djvu. You can change this behaviour by defining a file mask (the optional parameter). The dithering value 0.499 was obtained experimentally and represents a very good (if not the best) setting for bitonal images. You can also uncomment the indicated lines in any2djvu-bw to compile the final bundled black and white DjVu document in a single run of the script. If you did so and you do not need any colour pages, you may skip the next subsection, which covers conversion of colour images.
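
Because any2djvu-bw skips every image that already has a matching .djvu file, an interrupted run can simply be restarted. To see beforehand which pages are still to be converted, something like the following sketch can be used; pending_pages is a hypothetical helper, not a djvulibre command, and it follows the script's naming convention NAME.jpg.djvu.

```shell
#!/bin/bash
# pending_pages [MASK]: print every file matching MASK (default *.jpg)
# in the current directory that has no corresponding FILE.djvu yet.
pending_pages() {
  local mask i
  mask=${1:-*.jpg}
  # $mask is left unquoted on purpose so that the pattern is expanded
  for i in $mask; do
    [ -e "$i" ] || continue
    if [ ! -e "$i.djvu" ]; then
      printf '%s\n' "$i"
    fi
  done
}
```

An empty output means that every matching image already has its .jpg.djvu counterpart.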

2.3  JPEG to low colour DjVu

Next, we need to convert the colour images taken from the front and back covers of the book. Suppose the front cover is stored in 000.jpg and the back cover in 999.jpg, and that each of them contains no more than, say, 8 colours. The previous run of any2djvu-bw left behind two unwanted DjVu files, namely the black and white versions 000.jpg.djvu and 999.jpg.djvu. Delete these two files. Then convert both 000.jpg and 999.jpg to colour DjVu pages by executing the following command (note that the quotation marks are necessary):

$ ~/bin/any2djvu-low "+(000|999).jpg" 8

where any2djvu-low is the script given below, which must be saved in ~/bin/ for the command to work.

#!/bin/bash
#
# any2djvu-low
#

DEFMASK="*.jpg"
DPI=300
DEFNCOLORS=256
# uncomment the following line to compile a bundled DjVu document
#OUTFILE="#0-low.djvu"

# usage is defined first because a function must exist before it is called
function usage() {
  echo
  echo "usage:"
  echo
  echo "$0 [\"REGEXP\" [INT]]"
  echo "    converts single pages with the default mask $DEFMASK (or REGEXP if provided)"
  echo "    in the current directory to single-page low colour djvu documents with the"
  echo "    number of colours $DEFNCOLORS (default) or INT (if provided)"
# uncomment the following line to compile a bundled DjVu document
# echo "    and bundles them as a djvu file $OUTFILE"
  echo
}

# check that cpaldjvu is installed
if ! command -v cpaldjvu > /dev/null 2>&1; then
  usage
  echo "Error: cpaldjvu is needed"
  echo
  exit 1
fi

shopt -s extglob

if [ -n "$1" ]; then
  MASK=$1
  if [ -n "$2" ]; then
    NCOLORS=$2
  else
    NCOLORS=$DEFNCOLORS
  fi
else
  MASK=$DEFMASK
  NCOLORS=$DEFNCOLORS
fi

# $MASK is deliberately left unquoted so that the shell expands the pattern
for i in $MASK; do
  if [ ! -e "$i" ]; then
    usage
    echo "Error: current directory must contain files with the mask $MASK"
    echo
    exit 1
  fi
  # skip pages that have already been converted
  if [ ! -e "$i.djvu" ]; then
    echo "$i"
    cpaldjvu -dpi "$DPI" -colors "$NCOLORS" "$i" "$i.djvu"
  fi
done

# uncomment the following line to compile a bundled DjVu document
#djvm -c "$OUTFILE" $MASK.djvu

The colour DjVu pages were produced by the low colour encoder cpaldjvu rather than by the bitonal encoder cjb2. Occasionally cpaldjvu with the 2 colour setting produces slightly smaller output files than cjb2. This may happen because black comes out lighter with cpaldjvu. cjb2 is therefore preferable for bitonal images, which usually look better the more intense the black. In addition, converting a JPEG image to a bitonal DjVu with cpaldjvu takes approximately 1.5 times longer than with cjb2.

Be prepared for cpaldjvu (with the default number of colours, 256) to produce output almost the same in size as the initial JPEG file (even a 16M colour one). Reducing the number of colours with the option -colors n of cpaldjvu helps in many cases, but only slowly: for example, reducing n from 256 to 16 may give an output only 4 times smaller.
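
A remark on the mask "+(000|999).jpg" used in the command above: it is a bash extended glob, which the scripts can expand only because they run shopt -s extglob. The quotation marks keep the calling shell from interpreting the parentheses itself. The sketch below shows which names such a mask selects; mask_matches is a hypothetical helper written for this illustration.

```shell
#!/bin/bash
shopt -s extglob   # the same option the scripts in this HOWTO enable

# mask_matches MASK NAME: succeed if NAME matches the extended glob MASK.
mask_matches() {
  # the right-hand side is unquoted so that it is treated as a pattern
  [[ $2 == $1 ]]
}

# report MASK NAME: print whether NAME matches MASK.
report() {
  if mask_matches "$1" "$2"; then
    echo "$2 matches $1"
  else
    echo "$2 does not match $1"
  fi
}
```

Here +(...) means "one or more repetitions", so the mask accepts 000.jpg and 999.jpg but also a name like 000999.jpg. If that matters, @(000|999).jpg, which means "exactly one of", is a stricter alternative.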

2.4  Binding DjVu

The final step is to bind all the separate DjVu pages into a multi-page DjVu document. The following script binddjvu does this.

#!/bin/bash
#
# binddjvu
#

shopt -s extglob

OUTFILE="#0.djvu"
DEFMASK="*.jpg.djvu"

if [ -n "$1" ]; then
  MASK=$1
else
  MASK=$DEFMASK
fi

# $MASK is deliberately left unquoted so that the shell expands the pattern
djvm -c "$OUTFILE" $MASK

The multi-page DjVu file #0.djvu can be given a better, meaningful name (the quotes are needed because an unquoted # at the start of a word begins a shell comment):

$ mv '#0.djvu' thebook.djvu

And we are done with our example.
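
As a variation on binddjvu, the following sketch refuses to call djvm when no page files are given or when one of them is missing or empty, which is what happens when the mask matches nothing and the pattern is passed on literally. The function name bind_pages is made up for this example; the only djvulibre call is djvm -c, exactly as in the script above.

```shell
#!/bin/bash
# bind_pages OUTFILE PAGE...: bundle the given single-page DjVu files
# into OUTFILE, after checking that every page exists and is not empty.
bind_pages() {
  local out p
  out=$1
  if [ "$#" -lt 2 ]; then
    echo "usage: bind_pages OUTFILE PAGE..." >&2
    return 1
  fi
  shift
  for p in "$@"; do
    if [ ! -s "$p" ]; then
      echo "Error: missing or empty page: $p" >&2
      return 1
    fi
  done
  djvm -c "$out" "$@"
}
```

It would be called, for instance, as bind_pages thebook.djvu *.jpg.djvu in the directory holding the converted pages.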

2.5  PDF to DjVu

The PDF format is also used for digitizing documents, e.g. by jstor.org, and at present it is still more widespread than DjVu, mainly because many people have a program for reading PDF and none for reading DjVu. There are several situations in which it makes sense to convert PDF to DjVu, including the following:

  1. The PDF is a scanned document. On scanned documents DjVu generally compresses better than PDF.
  2. We have many (single-page) PDF documents that we want to bind together, for example the pages of a document downloaded from an internet library.
  3. We want to merge several PDFs, single-page or multi-page, into a single DjVu.
  4. Some scanners can scan directly to single-page PDF files, which it is then convenient to bind into a multi-page DjVu.

The following script pdfs2djvu suffices for each of the above cases. By default pdfs2djvu takes all the *.pdf files in the current directory in alphabetical order and produces a single multi-page bitonal DjVu file #0.djvu.

#!/bin/bash
#
# pdfs2djvu
#

# check that every required tool is installed
for tool in pdftoppm cjb2 djvm; do
  if ! command -v "$tool" > /dev/null 2>&1; then
    echo
    echo "Error: pdftoppm, cjb2 and djvm are needed"
    echo
    exit 1
  fi
done

shopt -s extglob

OUTFILE="#0.djvu"
DEFMASK="*.pdf"
DPI=600

if [ -n "$1" ]; then
  MASK=$1
else
  MASK=$DEFMASK
fi

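# $MASK is deliberately left unquoted so that the shell expands the pattern
for PDF in $MASK; do
  if [ ! -e "$PDF" ]; then
    echo
    echo "Error: current directory must contain files with the mask $MASK"
    echo
    exit 1
  fi
  echo "$PDF"
  # render every page of the PDF to a bitonal PBM file at $DPI
  pdftoppm -mono -r "$DPI" -aa yes "$PDF" "$PDF"
  for PBM in "$PDF"*.pbm; do
    # skip the literal pattern if pdftoppm produced no pages
    [ -e "$PBM" ] || continue
    echo "$PBM"
    cjb2 -dpi "$DPI" "$PBM" "$PBM.djvu"
    rm -f "$PBM"
  done
done

djvm -c "$OUTFILE" $MASK*.pbm.djvu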

After a run the script pdfs2djvu leaves DjVu-encoded pages as files *.pbm.djvu in the current directory.
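
If these intermediate pages are no longer needed, they can be removed once the bundled document exists. The following is a cautious sketch of such a cleanup; cleanup_pages is a hypothetical helper, and it assumes that the current directory holds no other *.pbm.djvu files you want to keep.

```shell
#!/bin/bash
# cleanup_pages OUTFILE: remove the *.pbm.djvu pages in the current
# directory, but only if the bundled document OUTFILE exists and is not empty.
cleanup_pages() {
  local out f
  out=$1
  if [ ! -s "$out" ]; then
    echo "Error: $out is missing or empty, pages are kept" >&2
    return 1
  fi
  for f in *.pbm.djvu; do
    if [ -e "$f" ]; then
      rm -f -- "$f"
    fi
  done
  return 0
}
```

With the default output name it would be called as cleanup_pages '#0.djvu', with the quotes, since an unquoted # at the start of a word begins a shell comment.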

3  Concluding remarks

This HOWTO was written not by a developer of DjVu but by one of its users. It may therefore lack some technical details. If you wish to get more technical information on the commands, see the man pages or any other relevant documentation. I would suggest a very instructive

$ man djvu

to anybody beginning to use djvulibre on Linux.



Author: Vladimir Komendantsky <MY_LASTNAME at gmail.com>