PDF/HTML into EPUB
Some things I learned while trying to convert PDFs into EPUBs for use on
PDF is the worst
format from which to get an EPUB. At best, the output is likely to still
show a few oddities; at worst, some parts will simply be unreadable. This is
especially true for more complex layouts that include multi-column text or tables.
The more sophisticated the layout, the worse the output.
But before even bothering to convert to EPUB, check whether your e-reader
already handles PDFs well enough. It might not let you change the font
size, but it might offer other options such as text reflow.
As for turning web pages (HTML) into EPUB, use pandoc.
The AZW/AZW3/KFX formats used on Kindles are Amazon's own formats (the older
AZW derives from MOBI/PRC), and the files are usually DRM-protected. Use Calibre
to turn those into EPUB.
What are EPUBs?
An EPUB is actually a zip file that packs files in HTML, PNG, etc. Just rename
the extension from EPUB to ZIP to check it out.
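Since an EPUB is a zip archive, its contents can also be listed without renaming anything. The following is a minimal sketch using Python's standard zipfile module; the function name describe_epub is made up for illustration.

```python
import io
import zipfile


def describe_epub(source):
    """Return (name, size_in_bytes, is_stored) for each entry of an EPUB.

    `source` is either a path to the .epub file or its raw bytes; the
    archive is read like any other zip file.
    """
    if isinstance(source, bytes):
        source = io.BytesIO(source)
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename,
             info.file_size,
             info.compress_type == zipfile.ZIP_STORED)
            for info in archive.infolist()
        ]
```

Calling describe_epub("book.epub") on a real file should show the HTML, image, and XML entries packed inside it.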
Notes from the "EPUB and KindleGen Tutorial":
- The reason an EPUB consists of multiple HTML files is that "eReading
devices are not known to be the fastest parsers of HTML due to their limited
processing power. If your eBook is one large source file, it will cause serious
lag and readability issues when a reader tries to open your eBook. […] You want
to make sure that your HTML files are less than about 300KB each. You can use
the exact same HTML Head Section for each file."
- eBooks actually have two separate Table of Contents: an NCX (or Meta)
Table of Contents and an HTML Table of Contents. Different eReading devices
utilize these two Tables of Contents in different ways.
- Inside the EPUB package are the following files:
- The HTML content of your eBook (required)
- An XML file called toc.ncx which is the NCX Table of Contents (required)
- An XML file called content.opf which contains exactly how the EPUB
is structured, what files are in the EPUB package, and the eBook's relevant
metadata (required)
- An XML file called container.xml which tells the eReader where the
content.opf file is located in the compressed directory structure (required)
- A text file called mimetype which says that the EPUB file is an EPUB
and ZIP file (required)
- The cover and content images (optional)
- Audio, Video, Fonts and other media (optional)
- One or more CSS files (optional)
- Important Note: All of these files are case-sensitive, which may seem
unusual for Windows users. So, be careful when you are building your EPUB package.
- Please note that there is a very specific way that you have to compress
the files into the EPUB format. Unfortunately for Windows users, compressing
all your files using a GUI-based compression tool like 7-Zip will cause your
EPUB file to fail validation. Per the IDPF specification, it is necessary to
have the mimetype file added first to the zip file, and also to have it “stored”
(i.e. uncompressed). That is why you have to use the command line to build your
EPUB file.
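The packaging rule above (mimetype first, stored uncompressed) and the role of container.xml can be sketched with Python's standard library. This is only an illustration under assumptions: the function names are hypothetical, content.opf is placed at the archive root for simplicity, and the output is not a validated book unless the supplied files are themselves valid.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Minimal container.xml pointing at content.opf in the archive root
# (the location is an assumption made for this sketch).
CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<container version="1.0" xmlns="{CONTAINER_NS}">\n'
    "  <rootfiles>\n"
    '    <rootfile full-path="content.opf"'
    ' media-type="application/oebps-package+xml"/>\n'
    "  </rootfiles>\n"
    "</container>\n"
)


def build_epub(files):
    """Zip `files` (dict of archive name -> bytes) into EPUB bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry must come first and must be stored as-is.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        for name, data in files.items():
            archive.writestr(name, data,
                             compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def find_opf_path(epub_bytes):
    """Read container.xml and return the path of the OPF file."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find(
        f"{{{CONTAINER_NS}}}rootfiles/{{{CONTAINER_NS}}}rootfile")
    return rootfile.get("full-path")
```

Writing the archive from a script sidesteps the GUI problem mentioned in the tutorial, because the entry order and the compression method of each entry are set explicitly.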
Why is converting from PDF so difficult?
"PDF is a page oriented format while EPUB is a reflowable format."
"The main problem here is that PDF is a page oriented format (it describes
where to put glyphs on the page), while epub and mobi are both text-oriented
formats (they leave it to the device to do the layout). So basically, you need
to extract the text from the PDF, intelligently recognize the formatting, express
this formatting in HTML, and then convert it to epub/mobi. By definition, this
can't be "lightweight". And even "heavyweight" applications
might give you bad results, without manual correction." (dirkt, Jan 4 '18)
"Quite simply because there IS no "textual information" in
a PDF document. A PDF document doesn't contain paragraphs, sentences, and words.
All that it contains is drawing instructions of the form "draw this shape
at these coordinates". A PDF document is essentially a series of instructions
for drawing a picture on a sheet of paper. It's not a book." (Source)
"A PDF document is a software program containing instructions written
in a restricted subset of the PostScript document description language, which
is a full blown stack-based programming language. Extracting text from a PDF
document is difficult because it is not stored in specific sections of the file,
but scattered in difficult to predict ways among the instructions that generate
the document layout." (Source)
"Trying to extract a properly formatted document from a PDF is akin
to hoping to recover a full-sized image by "enhancing" a small thumbnail."
"there is no concept of text structure in a PDF file at all, no lines,
no paragraphs, sentences, nothing. All there is in a PDF file is 'this text'
and 'put it here on the page'.
The encoding used for the text may even be custom, and there may be no possible
method (other than OCR) for determining the actual text content (eg the Unicode
Sentences don't even have to be contiguous." (Source: comp.lang.postscript)
You can learn more by reading the archives of the Calibre
> Conversion forum, searching for "EPUB PDF" in the titles.
Note that a PDF/EPUB can look different on the computer using eg. SumatraPDF
and on your e-reader.
How to proceed
- Ideally, get an EPUB file
- If the document is only available as PDF, try to convert it into EPUB with
Calibre (which relies
on pdftohtml provided by the Poppler library)
- If the output from Poppler + post-editing by Calibre isn't to your liking, and as your e-reader
is most likely capable of reading PDFs, just copy the PDF to the reader and see if it's good enough
- If it doesn't look good, run it through k2pdfopt, which will massage
it for use on an e-reader
- Alternatively, investigate how to extract just the problematic pages, convert
them into PNG, turn them back into PDF, and then merge them all
into a hybrid text + picture PDF file
- Yet another alternative, since e-readers usually also support HTML,
is to use MuPDF
(or the deadware Mobipocket Creator) to turn the PDF into HTML pages. Use
the following command to avoid creating one huge HTML file that
your e-reader might have a hard time handling (pics are embedded
as base64): mutool draw -o %d.html in.pdf . You could also turn those web pages
into an EPUB with pandoc
- If the PDF is clean enough, you can also try to run it through an OCR,
and turn the output text into an EPUB.
k2pdfopt
Can it run regexes to remove eg. headers/footers?
k2pdfopt ("Kindle 2 PDF Optimizer") is a cross-platform GUI/CLI application. "The output from k2pdfopt is a new
(optimized) PDF file." It relies on MuPDF to read PDFs, but can be configured
to use Ghostscript instead.
Here is the
list of the commands it supports.
Note: By default, k2pdfopt converts pages into bitmaps, even when the source
file is native PDF (ie. text, not scanned text into bitmaps).
If you don't like the Windows3-looking GUI, there's an alternative: k2pdfopt
GUI (220.127.116.11, .Net application; Requires k2pdfopt) > "Error loading
If your e-reader has it, play with its "Reflow text" option, and
see how PDFs are displayed, whether they're text ("native") PDFs or
bitmaps. Some e-readers are able to reflow text if they use alternative/complementary
firmware from KOReader or Duokan.
To convert a clean (ie. not scanned) PDF: k2pdfopt -mode fw -ls- input.pdf
("fit width" removes the excess borders; -ls- prevents turning the
document on its side). Use the "-p" switch to only include a subset
of pages, eg. "-p 2,4-8,37"
If need be, you can increase the output margins with the -om command-line option, e.g. -om 0.2 will add 0.2 inches of padding around the output pages.
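To make the page-list notation used with -p concrete (eg. "2,4-8,37"), here is a small sketch that expands such a list into individual page numbers. It only covers the simple comma/range form shown above; the function name is made up, and any extra syntax k2pdfopt may accept is not handled.

```python
def parse_page_list(spec):
    """Expand a page list such as "2,4-8,37" into a list of integers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "4-8" includes both ends.
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_page_list("2,4-8,37"))
```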
If a PDF looks OK on the computer but doesn't on the e-reader, an alternative
is to convert the whole file into bitmaps using "k2pdfopt -mode fw -ls-
-n- input.pdf". Obviously, the file will be much bigger than the original.
If a PDF contains scanned pages instead of text, here's how to run it through
k2pdfopt's embedded OCR program and include searchable text along with the bitmaps:
k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang <set language, eg. -ocrlang
fra> -ocrd p input.pdf . In case of a multi-column layout, you can use
the usual CTRL+mouse to select part of the screen.
Notes taken from the web site:
- K2pdfopt converts each page of the input file to a bitmap, scans the
bitmap for viewable areas (rectangular regions), cuts + crops these regions
and assembles them into multiple smaller pages without excess margins so
that the viewing region is maximized. Making use of this method, k2pdfopt
can re-flow text lines, even on scanned documents.
- As of v1.50, k2pdfopt will also embed OCR text into the PDF so that
text can be searched and highlighted, and v1.60 can create output files
with the native PDF instructions from the source file (if the source file
- K2pdfopt has the advantage over other PDF converters in that it fully
preserves the rendered PDF fonts and graphics from the original file, unlike
programs that convert the PDF to an e-book format. Also, because k2pdfopt
is completely independent of language or fonts, it will work equally well
on documents in any language.
- MS Office offers PDF Reflow, where MS Word converts PDF files to Word
documents amazingly well. Once you have your PDF file in MS Word format,
you'll have a lot more capability to manipulate it into other formats and/or
- With the default conversion, which allows text re-flow, every converted
page is a bitmap, so the file size of the converted file is often larger
than the original; however, many e-readers can process PDF files made up
of bitmaps faster and with less memory overhead than the original PDF file,
so you might still prefer this type of conversion. If you still want a smaller
output file size, see my help page on output file size for options that
reduce the output file size, mostly at the expense of the output quality.
If you don't need text re-flow, you might try using a mode which converts
using native PDF output.
- To remove the excess borders on my PDF file, use "-mode fw"
(fw = fit width). If you still want to rasterize the output, use -mode fw
-n-. If you don't want to turn the document on its side, use -mode fw -ls-.
- To crop region and put only that region in the output PDF à la Briss,
use the GUI: Make one of the "Crop Areas" active (check box);
type in the applicable page range for the crop box (e.g. 2-99), then click
the blue Select button and choose your crop region. For the conversion mode,
select Crop (command-line: -mode crop).
- The reason a native PDF output can cause the device to run out of memory,
be very slow, or even crash, is likely because of too many cropped-and-scaled
regions in the output file. Try using a specific conversion mode instead.
Modes are shorthand for setting a collection of options that are best suited
for a specific type of optimization.
- If there are more than one cropped/scaled regions on an output page,
most PDF reading applications will get confused and allow selection of "invisible"
text which is outside cropped regions and which overlaps with displayed
- To see how k2 interprets a PDF file, try using the -sm command-line
option ("sm" from the interactive menu), which will write out
a PDF file that shows the regions found by k2pdfopt.
- As of v1.35, k2pdfopt has a nice debugging option to clearly show you
how it is interpreting your PDF file by marking the regions on it in the
order it chooses to display them. The command-line option -sm (show markings)
does this, or you can select "sm" from the interactive menu. This
will generate a file name ending in "_marked.pdf".
- To use text re-flow, even with tables / equations / figures, try protecting
those regions by drawing boxes around them.
- To prevent images / figures from being split across pages, use -f2p
-1, or select "bp" from the interactive menu and enter -1 for
the "fit-to-page" value.
- To remove the document headers, footers, page numbers and/or other marks
near the edges of the source pages, tell k2pdfopt to ignore an arbitrarily
sized border around your document. See Ignoring Borders/Headers/Footers.
- k2 allows for searching / highlighting the text in the converted PDF
file because it has OCR capability, and as of v1.60, k2pdfopt has options
for native PDF output, much like Cut2Col, SoPDF, and the latest version
- NATIVE PDF OUTPUT = zoomable, and searchable like the original with
no need for hidden OCR page.
- The defaults for the kindle are 560 x 735. Even though the kindle screen
is technically 600 x 800, the useable space for PDF files is 560 x 735.
The other factor that affects the size and quality of the text on the display
- While k2pdfopt is designed to give good results on a 6-inch reader by
default, you may want to fine tune the DPI settings depending on your reader
and your input file. The -idpi and -odpi settings, discussed above, control
the quality (-idpi) and magnification (-odpi) of the k2pdfopt output PDF
- Landscape mode (use -ls from the command line or select option (l) from
the interactive settings menu) can be used to increase the text magnification
at the expense of having more pages.
- If you would like to reduce the output PDF file size, you can use the
-bpc option to reduce the number of bits per color plane. The default is
4 (for 16 graylevels--the same as the kindle can display), but using -bpc
2 will reduce to 4 graylevels and reduce the PDF file size to approximately
- If you want a little extra space around the text on your reading device,
you can use the -om option to set the output margins (or select option (om)
from the interactive settings menu in v1.16+).
- Since v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped
text to native ASCII characters so that the text in the output file can
be searched or copied and pasted into other applications. And in v1.63,
bitmapped text from any language that Tesseract supports (including, for
example, Chinese) is converted to Unicode-16 values and can be copied and
pasted into Unicode-aware applications (e.g. most web browsers and modern
word processing software). See the examples below.
- Make sure you really need to perform OCR first. With k2pdfopt v2.x,
if the source PDF document has searchable or highlightable text (e.g. if
it is computer-generated or scanned but has an OCR layer), then k2pdfopt
output of either type (native PDF or the default re-flowed text mode) should
also have searchable text without having to resort to time-consuming OCR.
OCR should only be necessary if the source document is scanned and does
not already have a text/OCR layer.
- Use the -m option (or select option (m) from the interactive settings menu
in v1.16+) to tell k2pdfopt to ignore a certain amount of margin in the
input file. For this particular example, 0.8 inches is a good value, so
-m 0.8 should be used:
- K2pdfopt has built-in PDF translation (via the MuPDF library) but will
try to use Ghostscript if Ghostscript is available and the internal (MuPDF)
translation fails. Since I fixed a couple bugs with MuPDF in v1.16, I have
found no instances where MuPDF fails to correctly translate a PDF file,
but you can force Ghostscript to be used with the -gs option.
- Forum: https://www.mobileread.com/forums/showthread.php?t=144711
- GETTING STARTED WITH THE WINDOWS GUI https://www.willus.com/k2pdfopt/help/overview.shtml
- INTERACTIVE TEXT MENU https://www.willus.com/k2pdfopt/help/textmenu.shtml
- LIST OF K2PDFOPT COMMAND-LINE OPTIONS https://www.willus.com/k2pdfopt/help/options.shtml
Build a hybrid PDF
As an alternative to turning a PDF into EPUB with Poppler and all its issues,
there's the option of simply converting the few problematic pages
(tables, etc.) into pictures, replacing+merging them back into the main PDF,
and reading the PDF on my e-reader that has no problem handling basic text.
Obviously, while flipping through that kind of mixed PDF, the user can tell
the difference, but IMHO it's a much better solution than the HTML output from
k2 is unable to run once and handle pages differently, turning some pages
into bitmaps (rasterize) while leaving the others as text ("native PDF"):
You'd have to write a loop, and merge those two sets back into a PDF. Likewise,
I haven't found how to use cpdf to crop, maximize, and rasterize pages.
Things that could be improved:
- Maybe Poppler or ImageMagick are better tools than MuPDF for this task?
- Crop PDF pages before/after turning them into PNG to maximize screen
- Convert relevant pages directly in PDF without having to extract to
PNG, convert to PDF, and merge
- Rewrite in Ruby: DOS cmd is hell (input params, arrays, etc.) ; Single
EXE or portable Ruby?
Here's the Windows batch script:
@ECHO OFF
REM Usage: myscript.bat output.pdf input.pdf "1-5,8,25"
REM Note: ~ removes quotes
IF "%~1"=="" GOTO PARAM
IF "%~2"=="" GOTO PARAM
IF "%~3"=="" GOTO PARAM
REM Change those to match your e-reader
SET DPI=213
SET WIDTH=758
SET HEIGHT=1024
IF NOT EXIST mutool.exe (ECHO mutool missing & GOTO END)
SET APP=..\mutool.exe
SET OUTPUT=%1
SET INPUT=..\%2
SET LIST=%~3
SET TMPDIR=TEMP%random%%random%%random%%random%%random%%random%TEMP
REM Create temp dir
IF NOT EXIST %TMPDIR% MD %TMPDIR%
CD %TMPDIR%
REM Convert input PDF into individual PDFs
FOR /F "tokens=* delims=" %%# IN ('%APP% show %INPUT% Root.Pages.Count') DO SET "COUNT=%%#"
ECHO Found %COUNT% pages
FOR /L %%i IN (1,1,%COUNT%) DO (ECHO Handling %%i & %APP% clean -g %INPUT% %%i.pdf %%i)
REM Convert required pages into PNG, and remove matching PDFs
%APP% draw -r %DPI% -w %WIDTH% -h %HEIGHT% -o %%d.png %INPUT% %LIST%
REM Delete matching PDFs
FOR %%A IN (*.png) DO (ECHO Deleting %%~nA.pdf & DEL %%~nA.pdf)
REM Convert PNG files into PDFs, and remove PNG
FOR %%A IN (*.png) DO (%APP% convert -O compress -F pdf -o %%~nA.pdf %%A & ECHO Deleting %%A & DEL %%A)
REM Merge individual PDFs into single PDF
REM Build list in page order (dir /b would sort 10.pdf before 2.pdf)
SETLOCAL EnableDelayedExpansion
SET _filelist=
FOR /L %%i IN (1,1,%COUNT%) DO SET "_filelist=!_filelist!%%i.pdf "
SET LIST=%_filelist%
REM ECHO LIST=%LIST%
ECHO Merging
%APP% merge -o %OUTPUT% -O compress %LIST%
REM Cleaning up
MOVE %OUTPUT% ..
CD ..
RMDIR /S /Q %TMPDIR%
GOTO END
:PARAM
ECHO Usage : %0 output.pdf input.pdf "pages"
ECHO (Use quotes if pattern includes commas, eg. "2,3")
:END
REWRITE AS POWERSHELL OR RUBY
cmd.exe > powershell
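As one possible starting point for a rewrite, here is a sketch in Python rather than PowerShell or Ruby. It only builds the list of mutool commands used by the batch script (clean, draw, convert, merge) so the plan can be inspected before anything runs. The function names are hypothetical, the page count is passed in rather than read with "mutool show", and it assumes, as the batch script does, that "%d" in the draw output name expands to the page number.

```python
import subprocess


def build_commands(input_pdf, output_pdf, page_count, raster_pages,
                   dpi=213, width=758, height=1024, tool="mutool"):
    """Return the mutool command lines for a hybrid text + picture PDF."""
    commands = []
    # 1. Split the input into one PDF per page: 1.pdf, 2.pdf, ...
    for n in range(1, page_count + 1):
        commands.append(
            [tool, "clean", "-g", input_pdf, f"{n}.pdf", str(n)])
    # 2. Render the problematic pages as PNG at the e-reader's size.
    if raster_pages:
        spec = ",".join(str(n) for n in raster_pages)
        commands.append(
            [tool, "draw", "-r", str(dpi), "-w", str(width),
             "-h", str(height), "-o", "%d.png", input_pdf, spec])
    # 3. Turn each PNG back into a PDF, replacing the text version.
    for n in raster_pages:
        commands.append(
            [tool, "convert", "-O", "compress", "-F", "pdf",
             "-o", f"{n}.pdf", f"{n}.png"])
    # 4. Merge everything in numeric page order.
    merged = [f"{n}.pdf" for n in range(1, page_count + 1)]
    commands.append(
        [tool, "merge", "-o", output_pdf, "-O", "compress"] + merged)
    return commands


def run_all(commands, workdir):
    """Run each command inside workdir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=workdir, check=True)
```

Building the file list from the page count keeps the merge in numeric order, and passing arguments as lists avoids the quoting problems of cmd.exe. Input and output paths given to run_all are interpreted relative to workdir.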
PDF to EPUB
Issues that must be manually fixed:
- Wrong linebreaks
- Hyphens (requires a dictionary for the source language to try and fix)
- Lost formatting (italics, bold, etc.)
- Must re-add footnotes
- Tables and graphics (e-readers screens differ in size)
Conversion Tips for e-readers
Text in its own layer can be easily read and saved in a text file, but italics,
notes and other formatting will be lost. pdftotext
is one of the applications available.
Calibre/Poppler don't usually do a very good job turning PDF into text. LibreOffice
opens PDF in Draw, but is unable to export this to Writer. Softmaker TextMaker
can't open PDFs.
Word does a pretty good job thanks to its "PDF Reflow" feature.
Alternatively, even though the PDF already contains a text layer, you can
open the PDF in Abbyy FineReader, copy/paste into LibreOffice Writer or Sigil,
and fix the remaining issues. A CLI
might be available; if not, an AutoIt script is handy to automate the process.
gImageReader says: "PDFs with text. These PDF files already contain
text", and will stop.
Using an OCR
OCRing + EPUBing my first book: Tips?
OCR: gImageReader (GUI to Tesseract), Abbyy FineReader
EPUB editor: LibreOffice Writer, Sigil (last Win32 release: 0.9.14; How
- Open the PDF in a viewer (on Windows, SumatraPDF can read PDF and EPUB),
and make a list of the pages that include anything more sophisticated than
basic, one-column text
- Text that is displayed in multicolumns must be turned into one column
- Insets must be removed, and turned into regular text
- Tables: Rather than trying to rewrite it as HTML, it's easier to
just take a screenshot and save it as JPG/PNG to be inserted in the
EPUB later; Make sure the picture is no bigger than the width+height
of your e-reader, and that the picture is located in the HTML file at
the top+left so that it's correctly displayed
- Use Calibre to generate the EPUB; If need be, play with its settings
in the Conversion dialog, including the Page setup where you can
tell Calibre which e-reader you have
- Open the output in its editor (right click > Edit book, or T), and edit the pages that need it;
Pages can be removed through the Delete key, and new ones added with File > Insert;
To insert an image, use the familiar <img src=""> sequence
- Copy EPUB to e-reader.
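For the table screenshots mentioned in the steps above, the target size can be computed rather than guessed. This is a minimal sketch of the arithmetic only; the function name is made up, and the 758 x 1024 defaults simply mirror the WIDTH and HEIGHT values used in the batch script earlier in these notes, so change them to match your own e-reader.

```python
def fit_within(width, height, max_width=758, max_height=1024):
    """Scale (width, height) down to fit the screen, keeping proportions.

    Images that already fit are returned unchanged (never scaled up).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1516, 1024))
print(fit_within(600, 400))
```

The resulting pair can be given to any image editor or resizing tool before the picture is inserted into the EPUB.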
Editing the PDF
An alternative is to use LibreOffice Draw to modify the PDF, and read it
in your e-reader without bothering with EPUB:
- If the PDF file is big and would make LibreOffice sluggish, use
qpdf to export each page as an individual PDF file:
qpdf --split-pages infile.pdf %d.pdf
- In Draw, open and edit each problematic page to replace all nasty parts
(remove/rewrite insets and multi-column text, replace tables with screenshots)
- Use qpdf to merge all the pages back into a single PDF:
qpdf --empty --pages *.pdf -- out.pdf
A faster way is to simply turn each "problematic" page into pictures:
- Open PDF on computer, and make a list of the pages that contain anything
more than basic, one-column text (eg. multi-columns, tables, insets, etc.)
- Split all the pages of the PDF into individual files
- Convert each PDF with difficult layout into pictures, matching the e-reader's
- Merge all the PDFs back into a single PDF
- Send to e-reader, and test.
Nolim: Supported file formats for books: epub, fb2, html, txt, pdf and
Can qpdf convert PDF into pictures?
Can MuPDF convert PDF into pictures, and merge the files back?
for N in $(seq $(mutool show input.pdf Root.Pages.Count)); do mutool clean
-g input.pdf page$N.pdf $N; done
convert relevant pages into pictures
Investigate ImageMagick's convert
convert in.pdf -crop 50%x0 +repage out.pdf
Try TIFF or PSD vs. JPG/PNG
How to get rid of "side circles" (typographic signs)?
pdftocairo vs. pdftoppm?
How to crop?
pdfseparate.exe: progress bar?
How to compile Poppler for Windows32?
Converting HTML pages into an EPUB
Multiple HTML files can be cleaned up and converted into a single EPUB file.
- With Calibre, you first need to create
- And then, call the command: "c:\Program Files\Calibre2\ebook-convert.exe"
Calibre
Calibre is a well-known cross-platform, open-source GUI application written
by Kovid Goyal to manage and convert e-books (removing DRM is not built in and
requires a third-party plugin). It
can also convert a PDF into EPUB. It does a reasonable job at the latter, but
just like other tools, it has a difficult time with more sophisticated layouts,
tables, and headers/footnotes.
The UI can be changed through the Preferences > Change Calibre behavior
(CTRL+P) > Interface, and the Layout button in the bottom right corner:
To convert a PDF into HTML, Calibre actually just relies on poppler.
What Calibre does is run poppler's pdftohtml to convert each page of the PDF
into HTML, and then work from there to build an EPUB. The settings in Calibre's
Convert dialog let you change the options used for this operation,
but Calibre can only do so much with the input from poppler.
Once Calibre is installed, you can run the PDF-to-EPUB converter through
the command line: ebook-convert.exe input_file output_file [options].
"Calibre is awesome at many things, but PDF conversion isn't one of
its strong points. What I find most annoying is the text unwrapping, and that
certainly is Calibre's fault. The algorithm it uses is quite simplistic, if
a line is less than xx% of the page width, it's considered a paragraph break,
if it's longer, it's not. So in a typical book, you end up with hundreds of
incorrect paragraph breaks - spurious breaks that shouldn't be there, and paragraphs
stuck together that shouldn't be." (Source)
How to work with the Convert dialog
The Debug option lets you see the files at the four steps: input, parsed,
structure, and processed.
Slight editing can be done in the \input directory, which contains the HTML
files generated by poppler. When you're done, zip the files up, add it through
Edit meta information dialog, and proceed with the conversions.
How to make the most of the Convert dialog
Start by converting the PDF to EPUB using Calibre's default settings, and
see what the issues are. When clicking on the Wizard button in the Search &
replace section, if an EPUB isn't available, Calibre will first convert the
PDF into HTML, which explains the pause.
Line Un-Wrapping Factor
Used to unwrap paragraphs. This is a scale used to determine the length at
which a line should be unwrapped. Valid values are a decimal between 0 and 1.
The default is 0.45, just under the median line length. Lower this value to
include more text in the unwrapping. Increase to include less. You can adjust
this value in the conversion settings under PDF Input.
The default setting is 0.45. You can set this lower to make line
unwrapping more 'aggressive', but be aware that doing so may unwrap lines
which shouldn't be unwrapped.
"the unwrap function looks at the median (or average, can't remember)
line length, and only unwraps lines that exceed that length. That works well
for a book with consistent breaks in roughly the same location for every line
(OCR, pdf, many well formatted text files), but it will fail where the hard
breaks are inconsistent/infrequent. Reducing the unwrap factor basically tells
Calibre to look for shorter lines than the median. The fewer or more erratic
the breaks the lower you need to go, sometimes all the way down to 0.05".
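The effect of the factor can be illustrated with a deliberately simplified sketch. This is not Calibre's code: it assumes the factor picks a cut-off from the sorted line lengths (0.5 would be the median), and that a line shorter than the cut-off ends a paragraph while longer lines are joined to the next one. All names are hypothetical.

```python
def unwrap_threshold(lines, factor=0.45):
    """Pick the line length found at position `factor` in the sorted lengths."""
    lengths = sorted(len(line.strip()) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]


def unwrap(lines, factor=0.45):
    """Join hard-wrapped lines into paragraphs using the length cut-off."""
    threshold = unwrap_threshold(lines, factor)
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(text)
        if len(text) < threshold:
            # A short line is treated as the end of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

With this toy model, lowering the factor lowers the cut-off, so fewer lines count as "short" and more text gets joined, which matches the behaviour described above.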
Page setup: Choose a device that matches the screen size of your device
Table of contents
Search & replace
In the Search and replace section, use the wizard to test regexes
Regexes are applied to the HTML as produced by poppler. If an EPUB has already
been generated, Calibre asks you whether to use its HTML or to start again
from the PDF.
Headers and footers must be searched
and removed because they are often part of the document and they can
throw off the paragraph unwrapping.
Use the Wizard in the "Search & replace" section to try regexes.
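A pattern can be tried out in plain Python before pasting it into the wizard. The sketch below removes whole lines that contain a layout-software slug of the form "ffirs_simmons.qxd 5/16/05 4:13 PM Page iii"; the pattern and function name are illustrative assumptions, and real headers and footers will need their own expressions.

```python
import re

# Matches any line containing "<name>.qxd <date> <time> AM/PM Page <n>".
SLUG_LINE = re.compile(
    r"^.*\b\w+\.qxd\s+\d{1,2}/\d{1,2}/\d{2}\s+"
    r"\d{1,2}:\d{2}\s+[AP]M\s+Page\s+\w+.*\n?",
    re.MULTILINE,
)


def strip_slug_lines(html):
    """Remove every line that carries a slug, leaving other lines intact."""
    return SLUG_LINE.sub("", html)


sample = (
    "<p>Some text</p>\n"
    "<p>ffirs_simmons.qxd 5/16/05 4:13 PM Page iii</p>\n"
    "<p>More text</p>\n"
)
print(strip_slug_lines(sample))
```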
"If you are intimidated by regular expressions, many Windows users have
reported that [deadware] Mobipocket Creator is a good alternative to use to
do the initial pdf conversion. Use Mobipocket
Creator to convert the pdf to the .mobi format, and then use Calibre to
convert from mobi to your final desired format."
Any way to tell Calibre to ignore some pages (ToC, tables, etc.)?
? PDF Input > Line un-wrapping factor = 0.45 VS. Heuristic processing
> Line un-wrap factor =0.40
If the EPUB output needs some work, you can use either Calibre's internal
editor (select the book in the list > Edit
Book) or Sigil (80MB; Sigil is just
the editor in Calibre without all the fuss; Must
restart app when changing language for UI.)
How to hide the left-side tree list Authors, Languages, etc.?
Poppler
Like pdftohtml, poppler is also based on xpdf. Confusingly,
poppler kept the original application names such as "pdftohtml", so it's hard to tell
that its pdftohtml is not the original one, whose development was abandoned in 2006.
As of April 2020, the latest stable release is poppler-0.87.0.tar.xz, released
on March 28, 2020. Note that packages for Ubuntu et al. might be out of date.
Poppler includes multiple applications:
- pdfseparate – extract single pages from a PDF
- pdftocairo – convert single pages from a PDF to vector or bitmap formats
- pdftoppm – convert a PDF page to a bitmap
- pdfunite – merge several PDFs
- pdftohtml – convert PDF to HTML format retaining formatting
- pdftotext – extract all text from PDF
- pdfdetach – extract embedded documents from a PDF
- pdfimages – extract all embedded images at native resolution from a PDF
- pdftops – convert PDF to printable PS format
- pdffonts – lists the fonts used in a PDF
- pdfinfo – list all information of a PDF
apt-get install poppler-utils
"-c : This will output in complex mode. You can't use -noframes with
the complex flag."
"-noframes generate no frames. Not supported in complex mode."
"complex mode": One page = one HTML file + one PNG that only includes
some typographical feature to center the output.
Here's how to convert a PDF that was encoded in Latin1: pdftohtml -c -s -enc
Latin1 test.pdf test.html
Windows release is 0.68 while the current release is 0.87 released on March
to date but only available for Win64)
! Source Win32 https://github.com/zotero/cross-poppler
"cross-poppler compiles Poppler PDF tools for macOS (x64), Windows (x86,
x64), Linux (x86, x64). This is only intended to be used for pdfinfo and pdftotext."
poppler-0.39.0-win32.zip 2016-01-07 7.3
"PDFMasher, now long abandoned and unmaintained."
Even worse than poppler to convert PDF to HTML (one line = one <p></p>)
MuPDF
- Artifex, the company behind MuPDF, also develops Ghostscript
- To ask questions, log on to irc.freenode.net,
- Open-source, cross-platform, CLI
- mupdf.exe and mupdf-gl.exe are GUI readers: "For Linux and Windows
there are two viewers. One is a very basic viewer using x11 and win32, respectively.
It has been supplanted by a newer viewer using OpenGL for rendering, which
has more features such as table of contents, unicode search, etc. We keep
the old viewers around for older systems where OpenGL is not available."
- mutool.exe: The command line tools are all gathered into one umbrella
- mutool draw: This is the more customizable tool, but also has a more
difficult set of command line options. It is primarily used for rendering
a document to image files.
- mutool convert: This tool is used for converting documents into other
formats, and is easier to use.
The HTML displays fine in a browser, but is useless to create an EPUB because
it has no notions of lines and paragraphs: Each line of text is just displayed
at coordinates x,y, with no indication that it belongs to a paragraph.
Note: According to Calibre author, mutool outputs "non-reflowable HTML,
it is just as useless as the original PDF file."
mutool draw -F html -o out.%d.html in.pdf
one page = one HTML file (pics embedded)
mutool draw -F html -o out.html in.pdf
single HTML file, with pictures
embedded as "data:image/png;base64"
Remove footer? ffirs_simmons.qxd 5/16/05 4:13 PM Page iii
-> Copied HTML into e-reader: Took ~ one minute to open, and… unreadable
Note: In "mutool convert", N can be used to stand for the last page.
clean vs. draw vs. convert:
- mutool clean reads and writes PDF files
- mutool draw reads any file we can read, and writes out in most formats, but it does it by interpreting the graphics and creating the output file from scratch.
any extra PDF information like bookmarks and links are lost in the process
- mutool convert has a simpler interface than "mutool draw", and also provides more detailed options to many of the output formats. mutool draw is primarily intended for bitmap output.
so both mutool draw and mutool convert do the same thing, but the interfaces are different
Functions offered by mutool:
- draw -- the most commonly used tool, capable of converting/rendering
documents to a range of bitmap and vector formats. It performs a similar
task to the convert utility, using a different set of internal mechanisms;
output format: png, tga, pnm, pam, pbm, pkm, pwg, pcl, ps, svg, pdf, trace,
txt, html, stext; modifies page width/height, rotate, colorspace, etc.
- convert -- performs a similar task to the draw utility, using a different
internal mechanism (the document writer interface); output format (default
inferred from output file name): png, pnm, pgm, ppm, pam, tga, pbm, pkm,
pdf, svg, cbz; modifies page width/height, rotate, colorspace, etc.
- clean -- rewrite pdf file
- create -- create pdf document
- extract -- extract font and image resources
- merge -- merge pages from multiple pdf sources into a new pdf
- portfolio -- manipulate PDF portfolios
- poster -- split large page into many tiles
- info -- show information about pdf resources
- show -- show internal pdf objects
- pages -- show information about pdf pages
cpdf
Written by Coherent Graphics Ltd's John Whitington, author of O'Reilly's
"PDF Explained". Based on an open source library written in Caml.
Notes from cpdfmanual.pdf:
- The cpdf
tool has been available commercially since 2007, and is widely used in industry
and government. Now we're releasing two tools for free, the main program
under a special not-for-commercial-use license, and a lossless PDF squeezer
under the LGPL.
- When measurements are given to cpdf, they are in points (1 point = 1/72 inch).
They may optionally be followed by some letters to change the measurement.
The following are supported: pt Points (72 points per inch). The default.
cm Centimeters, mm Millimeters, in Inches.
- Linearized PDF is a version of the PDF format in which the data is held
in a special manner to allow content to be fetched only when needed. This
means viewing a multipage PDF over a slow connection is more responsive.
This requires the existence of the external program cpdflin which is provided
with commercial versions of cpdf.
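The unit suffixes mentioned in the notes above (pt, cm, mm, in) boil down to simple arithmetic on 72 points per inch. A small Python sketch of that conversion follows; to_points is a made-up helper, not cpdf's own parser, and it only handles plain non-negative numbers.

```python
import re

# Points per unit, derived from 1 inch = 72 points = 2.54 cm = 25.4 mm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}


def to_points(measurement):
    """Convert a string such as '2cm' or '36' to points (no suffix means points)."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(pt|cm|mm|in)?\s*", measurement)
    if match is None:
        raise ValueError(f"cannot parse measurement: {measurement!r}")
    value, unit = match.groups()
    return float(value) * POINTS_PER_UNIT[unit or "pt"]
```

For instance, to_points("1in") gives 72.0, and an A4 page width of 210mm comes out at roughly 595 points.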
Functions offered by cpdf:
- scaling, rotating, etc.
- showing infos (show-boxes, list-fonts, etc.)
- handling bookmarks
- turning a PDF into a PowerPoint-like presentation
- watermark and stamps
- file attachments
- images, fonts
- Written by Jay Berkenbilt; "QPDF was originally created in 2001
and modified periodically between 2001 and 2005 during my employment at
Apex CoVantage. Upon my departure
from Apex, the company graciously allowed me to take ownership of the software
and continue maintaining as an open source project, a decision for which
I am very grateful."
Notes from qpdf-manual.pdf
- qpdf does structural, content-preserving transformations on PDF files.
- In QDF mode, qpdf creates PDF files in what we call QDF form. The purpose
of QDF form is to make it possible to edit PDF files, with some restrictions,
in an ordinary text editor.
- A Python module called pikepdf [https://pypi.org/project/pikepdf/] provides
a clean and highly functional set of Python bindings to the qpdf library.
Using pikepdf, you can work with PDF files in a natural way and combine
qpdf's capabilities with other functionality provided by Python's rich standard
library and available modules.
- the qpdf command-line program can produce a JSON representation of the
non-content data in a PDF file. It includes a dump in JSON format of all
objects in the PDF file excluding the content of streams. This JSON representation
makes it very easy to look in detail at the structure of a given PDF file.
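As a rough illustration of the JSON output mentioned above, the following Python sketch calls "qpdf --json" and lists what the top level of the result contains. It assumes qpdf is on the PATH; the function names are made up, and the exact keys in the output depend on the qpdf version, so none are hard-coded here.

```python
import json
import subprocess


def qpdf_json(pdf_path):
    """Run `qpdf --json` on a file and return the parsed JSON (needs qpdf on PATH)."""
    result = subprocess.run(
        ["qpdf", "--json", pdf_path],
        capture_output=True, text=True, encoding="utf-8",
    )
    # qpdf exits with 0 on success and 3 when it only emitted warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)


def describe(document):
    """Map each top-level key of the parsed JSON to the type of its value."""
    return {key: type(value).__name__ for key, value in document.items()}
```

Calling describe(qpdf_json("input.pdf")) is a quick way to see which sections a given qpdf release emits before digging into individual objects.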
Functions offered by qpdf:
apt-get install pdftk ghostscript
BULLSHIT! PDFtk Free
- Deadware (pdftohtml-0.39, 2006-08-03); forked into poppler (poppler-utils)
- Open-source, Linux
- apt-get install poppler-utils
- pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
- output much worse than mutool
- -c : 'gswin32c' is not recognized as an internal or external command,
operable program or batch file. Error: Failed to launch Ghostscript!
- C:\Program Files\gs\gs9.52\bin\gswin32c.exe
- Deadware: "Sunday, December 11, 2016 Looking for new maintainer"
- "As of 2020, PDFMiner is not actively maintained. The code still
works, but this project is largely dormant. For the active project, check
out its fork pdfminer.six."
pandoc cannot convert PDF to HTML, but can turn HTML into EPUB.
Written in Haskell, it's rather slow and resource-hungry so isn't great on big files.
Incidentally, here's a pair of commands you can use to download a single web page and
its dependencies, and turn it into an EPUB:
- wget -E -H -k -K -p http://www.acme.com/mypage.html
- pandoc -f html -t epub -o output.epub mypage.html
pandoc can also fetch web pages directly: pandoc -f html -t epub -o output.epub
To install on Linux: apt-get install pandoc
- Packages on Linux can be very old. Check you have the
- pandoc also supports creating an EPUB3 file… which your e-reader may
or may not support: "Please note that the EPUB 3.0 specification has been
released as of late 2011. Unfortunately, the eBook stores have been very slow
to adopt this standard. The EPUB3 specification will allow for more complex
eBook designs that include audio/video embedding, footnote support, and even
- Windows doesn't support shell expansion. You need to create a batch
file or run in PowerShell. More info here.
- It's very slow on anything but small files (eg. a 5MB file is a no go)
You can add metadata
pandoc -f html -t epub3 --epub-metadata=metadata.xml -o output.epub input.html
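The file passed to --epub-metadata is expected to hold a series of Dublin Core elements such as <dc:rights> or <dc:language>. Here is a minimal Python sketch that generates one; the helper name epub_metadata and the field values are made up for illustration.

```python
from xml.sax.saxutils import escape


def epub_metadata(fields):
    """Render a dict like {"language": "en"} as Dublin Core elements, one per line."""
    lines = [
        # escape() protects &, < and > inside the values
        f"<dc:{name}>{escape(value)}</dc:{name}>"
        for name, value in fields.items()
    ]
    return "\n".join(lines) + "\n"


# Example with placeholder values; write the result to metadata.xml.
xml_text = epub_metadata({"rights": "Creative Commons", "language": "es-AR"})
```

Saving xml_text as metadata.xml gives a file that can be used with the pandoc command above.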
"Alternatively, you could use pdftotext, save it to text, edit it into
shape as well formatted markdown, and then use pandoc to convert it to epub.....i've
done that several times - after a lot of practice (and some handy vim key mappings),
it takes me about a day or so to convert a book with a few hundred pages."
Note: "pandoc-citeproc originated as a fork of Andrea Rossato's citeproc-hs.
The pandoc-citeproc executable can be used as a filter with pandoc to resolve
and format citations using a bibliography file and a CSL stylesheet."
pandoc -o output.epub *.html
pandoc: *.html: openBinaryFile: invalid
argument (Invalid argument)
It's due to how Windows handles wildcards: cmd.exe passes *.html to the
program unexpanded. Name this batch file pandoc.cmd:
- @echo off
- :: Pandoc wrapper for calling it with wildcard file parameters.
- :: Expands any arguments containing wildcards according to standard
- :: Windows CMD.exe conventions.
- setlocal EnableDelayedExpansion
- :: Call pandoc.exe explicitly so the wrapper does not invoke itself.
- set pandoc_cmd=pandoc.exe
- for %%I in (%*) do set pandoc_cmd=!pandoc_cmd! "%%~I"
- :: Run the command line built above.
- !pandoc_cmd!
Call it thusly: pandoc.cmd -o output.epub *.html
If it still fails, use PowerShell instead of cmd.exe, or write a script in
a richer language such as Python.
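For the Python route, a minimal sketch could look like this: expand the wildcards with the standard glob module, then hand the resulting list to pandoc. The function names are made up, and pandoc is assumed to be on the PATH.

```python
import glob
import subprocess


def expand_wildcards(args):
    """Replace arguments containing wildcards with the sorted list of matching files."""
    expanded = []
    for arg in args:
        has_wildcard = any(char in arg for char in "*?[")
        matches = sorted(glob.glob(arg)) if has_wildcard else []
        # Keep the argument untouched when it has no wildcard or matches nothing.
        expanded.extend(matches or [arg])
    return expanded


def run_pandoc(args):
    """Call pandoc with wildcards expanded; returns pandoc's exit code."""
    return subprocess.call(["pandoc"] + expand_wildcards(args))


# Example (not executed here): run_pandoc(["-o", "output.epub", "*.html"])
```

Sorting the matches keeps chapters in file-name order, which matters when the HTML files are numbered.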
NO! copy /b *.html full.html
pandoc -o full.epub full.html
It's a better idea to keep individual HTML files, and merge them into a single
"The EpubCheck tool is an open-source program written in Java that checks
your EPUB file for errors. Most eBook stores that utilize the EPUB format will
utilize this exact same program to see if the eBook you upload for sale is valid."
Tried 18.104.22.168 with a 10MB single HTML with all pics embedded as base64: As
displayed in SumatraPDF, as crappy as Calibre.
- Turns a 10MB PDF into 70MB EPUB. And pages in the output are images,
PDFelement Standard perpetual license $79
iSkysoft PDF Editor is PDFelement under a different name.
Xilisoft PDF to EPUB Converter
to EPUB Converter $20
FineReader ; Convert
PDFs to e-book formats EPUB, FB2 (Standard, Corporate). 199€; Release 14 ~500-850MB
Converter: PDF to EPUB? $100
A-PDF no converter?
Deadware from Mobipocket;
Final release 4.2 can be found here in the "Tutorial - How to Create a
MobiPocket eBook" thread.
"Home Edition has a simple to use interface and is designed to produce
content for private use. When creating new files from scratch you can use predefined
templates to aid in the creation effort. A user can also use the windows version
of MobiPocket Reader to convert files.
Installing [the Publisher Edition] provides the most power to customize
the output of the file and is required to submit eBooks commercially. if you
are a publisher and intend to sell eBooks through eBookbase, this is the version
of the Mobipocket Creator that you should use. Additional features essential
for publishers include: the encryption level required by eBookbase; an integrated
"deploy" feature to automatically upload or update your books in eBookbase;
the metadata editor to set the price, ISBN, cover image... of your books; PDF
Installed Publisher. After it reads a PDF, it generates an HTML file and
pictures. "Build" creates a .PRC file, which you don't need. Once
you have the HTML file, doctor it in the Calibre editor or Sigil, before turning
it into EPUB.
From a PDF, creates a single HTML + multiple PNGs.
Why are some PDFs non-selectable? Why are some PDFs selectable but not copyable
to the clipboard?
Either the pages are just pictures and not text, or the PDF is configured
to forbid copying: "Denied Permissions: copying text".
qpdf --decrypt input.pdf output.pdf