PDF/HTML into EPUB
Introduction
Some things I learned while trying to convert PDFs into EPUBs for use on
an e-reader.
PDF is possibly the worst
format from which to get an EPUB. At best, the output is likely to still
show a few oddities; at worst, some parts will simply be unreadable. This is
especially true for more complex layouts that include multi-column text, tables,
or insets.
The more sophisticated the layout, the worse the output.
Before even trying to convert a PDF to EPUB, check whether your e-reader already
handles PDFs well enough, especially if it has a larger screen, although smaller
screens can sometimes "reflow" PDFs to fit (check the options). Regardless
of its format (EPUB or PDF), a complex document, eg. with multiple columns, will
never work well on a small screen; for those, a wider e-reader is the realistic
solution. If the stock e-reader software isn't up to the task, you can always
try to massage the PDF with k2pdfopt and see if it works well enough, or even
see if the KOReader application can be installed on your e-reader.
The AZW(3)/KFX formats used on Kindles are Amazon's proprietary formats (AZW and
AZW3 descend from MOBI/PRC), and the files are usually DRM-protected. Use Calibre
to turn those into EPUB.
As for turning web pages (HTML) into EPUB, use pandoc.
How to proceed
- Ideally, get an EPUB file
- If you're stuck with PDF and a bigger e-reader is not an option (although
it's the only realistic way to read PDFs beyond a single-column, no-frills
layout), open the PDF in your e-reader, if necessary playing with its "Reflow Text" and/or
"Crop Margins" options to remove useless space around pages
- If an application can be installed, try alternative/complementary
firmware from KOReader or Duokan
- If it still doesn't look good enough to read, run it through k2pdfopt, which will massage
and create a new PDF for use on smaller e-readers
- Alternatively,
if only some pages are garbage, investigate how to extract just those,
convert them into PNG and then back into PDF, and merge them all
into a hybrid, text + picture PDF file; this is especially important for
tables
- If it's too painful to read, try to convert it into an EPUB with
Calibre, which relies
on pdftohtml provided by the Poppler library to import text. It might be useful to start
by hard-cropping the PDF to remove the useless
headers + footers through redaction annotations instead of relying on Calibre's regex-based Search and
Replace function (tools like "mutool trim" only hide text, they
don't actually remove it from the PDF). Don't expect miracles, though, if the page layout in the PDF
is anything more complicated than a single column
Note: Calibre keeps
basic text formatting (eg. italics) by default, while Abbyy FineReader requires
enabling the "Retain fonts and font sizes" option in the File/Tools
> Options > Format Settings > EPUB document type > Document
layout > Formatted text item
Since Calibre can have a hard time
finding where chapters start, an easier solution is to:
1) split the source PDF into sub-PDFs, one chapter = one PDF (automate the process by first making a list of pages/ranges
and feeding it to a slicer, eg. cpdf.exe input.pdf 45-48 -o pdf_45-48.pdf)
2) run Calibre to convert them into EPUBs (ebook-convert.exe
pdf_45-48.pdf pdf_45-48.epub --enable-heuristics --no-default-epub-cover)
3) finally join them into a single EPUB (calibre-debug.exe --run-plugin
EpubMerge -- -o full.epub --author "John Doe" --title "My book"
--no-titles-in-toc --no-original-toc pdf_*.epub).
Another
solution is to edit the source PDF to add bookmarks and use those to slice
the PDF into sub-PDFs:
a. cpdf.exe -add-bookmarks bookmarks.txt input.pdf
-o input.BOOKMARKS.pdf
b. Use cpdf
to split the input file into multiple PDFs: cpdf.exe -split-bookmarks
1 -utf8 input.BOOKMARKS.pdf -o out%%%.pdf
c. Remove bookmarks from
all PDFs to prevent Calibre from appending "Document Outline"
sections: cpdf.exe -remove-bookmarks out003.pdf -o out003.NO.BOOKMARKS.pdf
d. Run Calibre to turn each PDF into an EPUB: "C:\Program Files\Calibre2\ebook-convert.exe"
"out001.pdf" "out001.epub" --enable-heuristics --no-default-epub-cover
e. Finally, use Calibre's EpubMerge plug-in to join them into a single EPUB
file: "C:\Program Files\Calibre2\calibre-debug.exe" --run-plugin
EpubMerge -- --author "John Doe" --title "My great title" --no-titles-in-toc
--no-original-toc -o full.epub out001.epub out002.epub
Note: cpdf
can prepend a table of contents; add the -toc-no-bookmark option to prevent it from
adding bookmarks:
cpdf -table-of-contents -toc-title
"My Great ToC" -toc-no-bookmark input.pdf -o output.pdf
- Yet another alternative, since e-readers usually also support HTML,
is to use MuPDF
(or the deadware Mobipocket Creator) to turn the PDF into HTML pages, but
use the following to avoid creating one huge HTML file that
your e-reader might have a hard time handling (pics are embedded
as base64): mutool draw -o %d.html in.pdf . You could also turn those web pages
into an EPUB with pandoc.
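The split / convert / merge steps described above can be scripted. Here is a minimal Python sketch that only builds the command lines from a list of page ranges; it reuses the cpdf, ebook-convert and EpubMerge options shown above, and assumes the three executables are on the PATH (the function name build_pipeline and the pdf_<range> file naming are just illustrative choices).

```python
from typing import List


def build_pipeline(input_pdf: str, ranges: List[str],
                   title: str, author: str) -> List[List[str]]:
    """Return the command lines (as argv lists) to split, convert and merge.

    Assumption: cpdf, ebook-convert and calibre-debug are on the PATH, and
    the EpubMerge plug-in is installed in Calibre.
    """
    commands = []
    epubs = []
    for page_range in ranges:
        chunk_pdf = f"pdf_{page_range}.pdf"
        chunk_epub = f"pdf_{page_range}.epub"
        # One chapter = one PDF
        commands.append(["cpdf", input_pdf, page_range, "-o", chunk_pdf])
        # Convert that chapter into its own EPUB
        commands.append(["ebook-convert", chunk_pdf, chunk_epub,
                         "--enable-heuristics", "--no-default-epub-cover"])
        epubs.append(chunk_epub)
    # Join all the chapter EPUBs, in the order the ranges were given
    commands.append(["calibre-debug", "--run-plugin", "EpubMerge", "--",
                     "-o", "full.epub", "--author", author, "--title", title,
                     "--no-titles-in-toc", "--no-original-toc", *epubs])
    return commands


if __name__ == "__main__":
    for argv in build_pipeline("input.pdf", ["45-48", "49-60"],
                               "My book", "John Doe"):
        print(" ".join(argv))
```

To actually run the commands instead of printing them, each argv list can be passed to subprocess.run(argv, check=True).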
What are EPUBs?
An EPUB is actually a zip file that packs files in HTML, PNG, etc. Just rename
the extension from EPUB to ZIP to check it out.
Notes from "EPUB and KindleGen Tutorial"
- http://bbebooksthailand.com/bb-epub-kindlegen-tutorial.html
-
- The reason an EPUB consists in multiple HTML files is that "eReading
devices are not known to be the fastest parsers of HTML due to their limited
processing power. If your eBook is one large source file, it will cause serious
lag and readability issues when a reader tries to open your eBook. […] You want
to make sure that your HTML files are less than about 300KB each. You can use
the exact same HTML Head Section for each file.
-
- eBooks actually have two separate Table of Contents: an NCX (or Meta)
Table of Contents and an HTML Table of Contents. Different eReading devices
utilize these two Tables of Contents in different ways.
-
- Inside the EPUB package are the following files:
- The HTML content of your eBook (required)
- An XML file called toc.ncx which is the NCX Table of Contents (required)
- An XML file called content.opf which contains exactly how the EPUB
is structured, what files are in the EPUB package, and the eBooks relevant
metadata (required)
- An XML file called container.xml which tells the eReader where the
content.opf file is located in the compressed directory structure (required)
- A text file called mimetype which says that the EPUB file is an EPUB
and ZIP file (required)
- The cover and content images (optional)
- Audio, Video, Fonts and other media (optional)
- One or more CSS files (optional)
-
- Important Note: All of these files are case-sensitive, which may seem
unusual for Windows users. So, be careful when you are building your EPUB package.
-
- Please note that there is a very specific way that you have to compress
the files into the EPUB format. Unfortunately for Windows users, compressing
all your files using a GUI-based compression tool like 7-Zip will cause your
EPUB file to fail validation. Per the IDPF specification, it is necessary to
have the mimetype file added first to the zip file, and also to have it “stored”
(i.e. uncompressed). That is why you have to use the command line to build your
EPUB."
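The packaging rule quoted above (mimetype added first, and stored uncompressed) does not strictly need a command-line zip tool. As a rough sketch, Python's standard zipfile module can do it; the write_epub helper and the placeholder file contents below are mine, and the result is not a complete, valid book, only an illustration of the entry order and compression.

```python
import zipfile


def write_epub(path, files):
    """Write an EPUB container.

    files: mapping of archive name -> bytes, NOT including 'mimetype'.
    """
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry must come first and must be stored, not deflated
        zf.writestr(zipfile.ZipInfo("mimetype"), b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed normally
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


if __name__ == "__main__":
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "demo.epub")
    # Placeholder contents: a real book needs proper XML/HTML in these files
    write_epub(path, {
        "META-INF/container.xml": b"<container/>",
        "content.opf": b"<package/>",
        "toc.ncx": b"<ncx/>",
        "chapter1.html": b"<html><body><p>Hello</p></body></html>",
    })
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            print(info.filename, info.compress_type)
```

The same module is handy for the reverse operation: zipfile.ZipFile("book.epub").namelist() lists what is inside an existing EPUB without renaming it to ZIP.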
Why is converting from PDF so difficult?
"PDF is a page oriented format while EPUB is a reflowable format."
"The main problem here is that PDF is a page oriented format (it describes
where to put glyphs on the page), while epub and mobi are both text-oriented
formats (they leave it to the device to do the layout). So basically, you need
to extract the text from the PDF, intelligently recognize the formatting, express
this formatting in HTML, and then convert it to epub/mobi. By definition, this
can't be "lightweight". And even "heavyweight" applications
might give you bad results, without manual correction. – dirkt Jan 4 '18"
"Quite simply because there IS no "textual information" in
a PDF document. A PDF document doesn't contain paragraphs, sentences, and words.
All that it contains is drawing instructions of the form "draw this shape
at these coordinates". A PDF document is essentially a series of instructions
for drawing a picture on a sheet of paper. It's not a book." (Source)
"A PDF document is a software program containing instructions written
in a restricted subset of the PostScript document description language, which
is a full blown stack-based programming language. Extracting text from a PDF
document is difficult because it is not stored in specific sections of the file,
but scattered in difficult to predict ways among the instructions that generate
the document layout." (Source)
"Trying to extract a properly formatted document from a PDF is akin
to hoping to recover a full-sized image by "enhancing" a small thumbnail."
(Source)
"there is no concept of text structure in a PDF file at all, no lines,
no paragraphs, sentences, nothing. All there is in a PDF file is 'this text'
and 'put it here on the page'.
The encoding used for the text may even be custom, and there may be no possible
method (other than OCR) for determining the actual text content (eg the Unicode
values).
Sentences don't even have to be contiguous." (Source: comp.lang.postscript)
You can learn more by reading the archives of the Calibre
> Conversion forum, searching for "EPUB PDF" in the titles.
Note that a PDF/EPUB can look different on the computer (using eg. SumatraPDF)
and on your e-reader.
PDF to EPUB
k2pdfopt
("Kindle 2 PDF Optimizer") is the first thing to try to make the PDF
as readable
as possible on a smaller e-reader. If the result isn't good enough, try Calibre to convert the PDF into EPUB.
Text PDF
If you have a recent version of MS Word, open the PDF and see if its PDF
Reflow feature does the job well enough, before converting the resulting docx file into
an EPUB using Calibre, Writer2ePub,
Pandoc, etc.
Issues that must be manually fixed:
- Wrong linebreaks
- Hyphens (requires a dictionary for the source language to try and fix)
- Lost formatting (italics, bold, etc.)
- Must re-add footnotes
- Tables and graphics (e-readers screens differ in size)
Text in its own layer can be easily read and saved in a text file, but italics,
notes and other formatting will be lost. pdftotext
is one of the applications available.
Calibre/Poppler don't usually do a very good job turning PDF into text. LibreOffice
opens PDF in Draw, but is unable to export this to Writer. Softmaker TextMaker
can't open PDFs.
Alternatively, and even though the PDF already contains a text layer, you can
open the PDF in Abbyy FineReader, copy/paste to LibreOffice Writer or Sigil,
and fix the remaining issues. A CLI
might be available; if not, an AutoIt script is handy to automate the process.
gImageReader says "PDFs with text. These PDF files already contain
text", and stops.
Try k2pdfopt
K2pdfopt
("Kindle 2 PDF Optimizer") is a cross-platform GUI/CLI application to
"optimize the format of PDF (or DJVU) files for viewing on small (e.g.
6-inch) mobile reader and smartphone screens such as the Kindle's. The output from k2pdfopt is a new
(optimized) PDF file." It relies on MuPDF to read PDFs, but can be configured
to use Ghostscript instead. Check PDF
Conversion Tips for e-readers. Unless the output looks weird, the only settings you need to set are the e-reader's
width + height + DPI, and possibly page cropping to reduce useless margins. Here is the
list of the commands it supports. If you don't like the odd-looking GUI, there's an alternative: k2pdfopt
GUI.
Note: By default, k2pdfopt converts pages into bitmaps, even when the source
file is native PDF (ie. text, not scanned text into bitmaps).
To convert a native (ie. not scanned) PDF: k2pdfopt -mode fw -ls- input.pdf
("fit width" removes the excess borders; -ls- prevents turning the
document on its side). Use the "-p" switch to only include a subset
of pages, eg. "-p 2,4-8,37"
If need be, you can increase the output margins with the -om command-line option, e.g. -om 0.2 will add 0.2 inches of padding around the output pages.
If a PDF looks OK on the computer but doesn't on the e-reader, an alternative
is to convert the whole file into bitmaps using "k2pdfopt -mode fw -ls-
-n- input.pdf". Obviously, the file will be much bigger than the
original.
If a PDF contains scanned pages instead of text, here's how to run it through
k2pdfopt's embedded OCR program and include searchable text along with the bitmaps:
k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang <language> -ocrd p input.pdf
(eg. -ocrlang fra for French). In case of a multi-column layout, you can use
the usual CTRL+mouse to select part of the screen.
Notes taken from the web site:
- K2pdfopt converts each page of the input file to a bitmap, scans the
bitmap for viewable areas (rectangular regions), cuts + crops these regions
and assembles them into multiple smaller pages without excess margins so
that the viewing region is maximized. Making use of this method, k2pdfopt
can re-flow text lines, even on scanned documents.
-
- As of v1.50, k2pdfopt will also embed OCR text into the PDF so that
text can be searched and highlighted, and v1.60 can create output files
with the native PDF instructions from the source file (if the source file
is PDF).
-
- K2pdfopt has the advantage over other PDF converters in that it fully
preserves the rendered PDF fonts and graphics from the original file, unlike
programs that convert the PDF to an e-book format. Also, because k2pdfopt
is completely independent of language or fonts, it will work equally well
on documents in any language.
-
- MS Office offers PDF Reflow, where MS Word converts PDF files to Word
documents amazingly well. Once you have your PDF file in MS Word format,
you'll have a lot more capability to manipulate it into other formats and/or
form factors.
-
- With the default conversion, which allows text re-flow, every converted
page is a bitmap, so the file size of the converted file is often larger
than the original; however, many e-readers can process PDF files made up
of bitmaps faster and with less memory overhead than the original PDF file,
so you might still prefer this type of conversion. If you still want a smaller
output file size, see my help page on output file size for options that
reduce the output file size, mostly at the expense of the output quality.
If you don't need text re-flow, you might try using a mode which converts
using native PDF output.
-
- To remove the excess borders on my PDF file, use "-mode fw"
(fw = fit width). If you still want to rasterize the output, use -mode fw
-n-. If you don't want to turn the document on its side, use -mode fw -ls-.
-
- To crop region and put only that region in the output PDF à la Briss,
use the GUI: Make one of the "Crop Areas" active (check box);
type in the applicable page range for the crop box (e.g. 2-99), then click
the blue Select button and choose your crop region. For the conversion mode,
select Crop (command-line: -mode crop).
-
- The reason a native PDF output can cause the device to run out of memory,
be very slow, or even crash, is likely because of too many cropped-and-scaled
regions in the output file. Try using a specific conversion mode instead.
Modes are shorthand for setting a collection of options that are best suited
for a specific type of optimization.
-
- If there are more than one cropped/scaled regions on an output page,
most PDF reading applications will get confused and allow selection of "invisible"
text which is outside cropped regions and which overlaps with displayed
text.
- To see how k2 interprets a PDF file, try using the -sm command-line
option ("sm" from the interactive menu), which will write out
a PDF file that shows the regions found by k2pdfopt.
-
- As of v1.35, k2pdfopt has a nice debugging option to clearly show you
how it is interpreting your PDF file by marking the regions on it in the
order it chooses to display them. The command-line option -sm (show markings)
does this, or you can select "sm" from the interactive menu. This
will generate a file name ending in "_marked.pdf".
-
- To use text re-flow, even with tables / equations / figures, try protecting
those regions by drawing boxes around them.
- To prevent images / figures from being split across pages, use -f2p
-1, or select "bp" from the interactive menu and enter -1 for
the "fit-to-page" value.
-
- To remove the document headers, footers, page numbers and/or other marks
near the edges of the source pages, tell k2pdfopt to ignore an arbitrarily
sized border around your document. See Ignoring Borders/Headers/Footers.
-
- k2 allows for searching / highlighting the text in the converted PDF
file because it has OCR capability, and as of v1.60, k2pdfopt has options
for native PDF output, much like Cut2Col, SoPDF, and the latest version
of PaperCrop.
-
- NATIVE PDF OUTPUT = zoomable, and searchable like the original with
no need for hidden OCR page.
-
- The defaults for the kindle are 560 x 735. Even though the kindle screen
is technically 600 x 800, the useable space for PDF files is 560 x 735.
The other factor that affects the size and quality of the text on the display
is the DPI.
-
- While k2pdfopt is designed to give good results on a 6-inch reader by
default, you may want to fine tune the DPI settings depending on your reader
and your input file. The -idpi and -odpi settings, discussed above, control
the quality (-idpi) and magnification (-odpi) of the k2pdfopt output PDF
file.
-
- Landscape mode (use -ls from the command line or select option (l) from
the interactive settings menu) can be used to increase the text magnification
at the expense of having more pages.
-
- If you would like to reduce the output PDF file size, you can use the
-bpc option to reduce the number of bits per color plane. The default is
4 (for 16 graylevels--the same as the kindle can display), but using -bpc
2 will reduce to 4 graylevels and reduce the PDF file size to approximately
half.
-
- If you want a little extra space around the text on your reading device,
you can use the -om option to set the output margins (or select option (om)
from the interactive settings menu in v1.16+).
-
- Since v1.50, k2pdfopt can use one of two OCR engines to convert bitmapped
text to native ASCII characters so that the text in the output file can
be searched or copied and pasted into other applications. And in v1.63,
bitmapped text from any language that Tesseract supports (including, for
example, Chinese) is converted to Unicode-16 values and can be copied and
pasted into Unicode-aware applications (e.g. most web browsers and modern
word processing software). See the examples below.
-
- Make sure you really need to perform OCR first. With k2pdfopt v2.x,
if the source PDF document has searchable or highlightable text (e.g. if
it is computer-generated or scanned but has an OCR layer), then k2pdfopt
output of either type (native PDF or the default re-flowed text mode) should
also have searchable text without having to resort to time-consuming OCR.
OCR should only be necessary if the source document is scanned and does
not already have a text/OCR layer.
-
- Use the -m option (or select option (m) from the interactive settings menu
in v1.16+) to tell k2pdfopt to ignore a certain amount of margin in the
input file. In the example given on the web site, 0.8 inches is a good value, so
-m 0.8 should be used.
-
- K2pdfopt has built-in PDF translation (via the MuPDF library) but will
try to use Ghostscript if Ghostscript is available and the internal (MuPDF)
translation fails. Since I fixed a couple bugs with MuPDF in v1.16, I have
found no instances where MuPDF fails to correctly translate a PDF file,
but you can force Ghostscript to be used with the -gs option.
-
- Forum: https://www.mobileread.com/forums/showthread.php?t=144711
-
- GETTING STARTED WITH THE WINDOWS GUI https://www.willus.com/k2pdfopt/help/overview.shtml
- INTERACTIVE TEXT MENU https://www.willus.com/k2pdfopt/help/textmenu.shtml
- LIST OF K2PDFOPT COMMAND-LINE OPTIONS https://www.willus.com/k2pdfopt/help/options.shtml
Tips
List of options.
"The modes are really just shortcuts that combine multiple individual
options that, together, are well suited for a particular type of conversion.
You can then tailor things further, if desired, by adding more options after
the -mode command."
"If the entire source page fits your device when you strip away the
margins, try -mode trim. This will trim away any margin areas around
the text and fit it to your device screen to maximize the size of the text."
"If the width of the source material fits either the width or the height
of your device and is comfortably readable, try -mode fitwidth or -mode fw.
If you don't want the output in landscape, add -ls- to force portrait output.
If
there is a common area on every page that you want to select which will then
comfortably fit your device screen, you can use -cbox to specify this
region, or use the MS Windows GUI to graphically select the region (see the
"Crop Areas" part of the GUI). If the entire selected area fits onto
your device with no trimming or text re-flow required to be readable, use -mode
crop."
To remove excess borders: -mode fw (fw = fit width)
To set the device model: -dev kbg (for Kobo Glo)
To set the page height and width: -w 758 -h 1024
If the text only has a single column: -col 1
To set the magnification: -dpi 213
-m* are used on the input, while -om* are used on the output
To ignore headers/footers/borders: -m* or -cbox (Important: As usual, cropped
data is only hidden, not removed from the PDF). The -ml, -mr, -mb, and -mt options
can also be used to more specifically set the left, right, bottom, and top margin-ignoring
widths, respectively.
To add some extra space around the text, use the output margin option: -om
0.3 (or -oml, -omr, -omb, and -omt) https://www.willus.com/k2pdfopt/help/margins.shtml
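The options listed above can be assembled programmatically, which helps when converting several files for the same device. The following is a small sketch that only builds the k2pdfopt command line; it uses the options mentioned in these notes (-mode fw, -ls-, -w, -h, -dpi, -col, -om, -p), while the function name and its parameters are hypothetical.

```python
def k2pdfopt_args(input_pdf, width, height, dpi,
                  single_column=False, output_margin=None, pages=None):
    """Build a k2pdfopt command line (argv list) for a native PDF.

    Assumption: k2pdfopt is on the PATH; the input file goes last, as in
    the examples in these notes.
    """
    # Fit width, stay in portrait, and describe the e-reader's screen
    args = ["k2pdfopt", "-mode", "fw", "-ls-",
            "-w", str(width), "-h", str(height), "-dpi", str(dpi)]
    if single_column:
        args += ["-col", "1"]
    if output_margin is not None:
        # Extra space (in inches) around the text in the output
        args += ["-om", str(output_margin)]
    if pages:
        # Only include a subset of pages, eg. "2,4-8,37"
        args += ["-p", pages]
    args.append(input_pdf)
    return args


if __name__ == "__main__":
    print(" ".join(k2pdfopt_args("input.pdf", 758, 1024, 213,
                                 single_column=True, output_margin=0.3)))
```

The width, height and DPI in the example (758 x 1024 at 213) are the values used elsewhere in these notes; replace them with those of your own e-reader.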
Using an OCR
OCRing + EPUBing my first book: Tips?
https://www.mobileread.com/forums/showthread.php?t=331376
OCR: gImageReader (GUI to Tesseract), Abbyy FineReader
EPUB editor: LibreOffice Writer, Sigil (last Win32 release: 0.9.14; How
to compile)
Help
Q&A
"native PDF output"?
"rather than rendering the output file as a sequence of bitmaps, each
output page is rendered directly using the source PDF file instructions, but
with translation, scaling, and cropping directives to place the source regions
at the appropriate places on the output pages"
In the GUI, how does "native PDF output" differ from "Re-flow
text"?
Native = -n -wrap-, Re-flow text = -wrap+
"can re-flow text even on scanned PDF files"
How does it move text in a scanned page?
My e-reader isn't listed
Use the -w (width) and -h (height) command-line options.
"text re-flow"?
-wrap vs. -wrap+?
"rasterize"?
Turn native text into a bitmap
Why is it hard converting PDF to text (eg. EPUB)?
How does the device setting ("Kobo Glo, Kindle 1-5, etc.") change?
What toolkit was used to write the GUI?
"native/bitmapped PDF"?
How to remove page numbers displayed in the middle of a page?
Per this
tip, in the GUI, try adding the following to the "Additional options"
section, and run a test on just one page where the problem occurs: -m 0.25,0.25,0.25,0.7
Using Calibre
- Open the PDF in a viewer (on Windows, SumatraPDF can read PDF and EPUB),
and make a list of the pages that include anything more sophisticated than
plain text:
- Text that is displayed in multicolumns must be turned into one column
- Insets must be removed, and turned into regular text
- Tables: Rather than trying to rewrite it as HTML, it's easier to
just take a screenshot and save it as JPG/PNG to be inserted in the
EPUB later; Make sure the picture is no bigger than the width+height
of your e-reader, and that the picture is located in the HTML file at
the top+left so that it's correctly displayed
- Use Calibre to generate the EPUB; If need be, play with its settings
in the Conversion dialog, including the Page setup where you can
tell Calibre which e-reader you have
- Open the output in its editor (right click > Edit book, or T), and edit the pages that need it;
Pages can be removed through the Delete key, and new ones added with File > Insert;
To insert an image, use the familiar <img src=""> sequence
- Copy EPUB to e-reader.
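For the table screenshots mentioned above, the target size can be computed rather than guessed. This is a small sketch, assuming a 758 x 1024 screen (the values used elsewhere in these notes); fit_within is a hypothetical helper that only does the arithmetic and leaves the actual resizing to your image editor.

```python
def fit_within(width, height, max_width, max_height):
    """Return (width, height) scaled down to fit the screen.

    The aspect ratio is kept, and pictures that already fit are not enlarged.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A 1516 x 1024 screenshot is twice too wide for a 758 x 1024 screen
    print(fit_within(1516, 1024, 758, 1024))
```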
An easy way to fix issues with the EPUB created by Calibre is to edit the
file in Sigil.
Build a hybrid PDF
As an alternative to turning a PDF into EPUB with Poppler and all its issues,
there's the option of simply converting the few problematic pages
(tables, etc.) into pictures, replacing+merging them back into the main PDF,
and reading the PDF on my e-reader that has no problem handling basic text.
Obviously, while flipping through that kind of mixed PDF, the user can tell
the difference, but IMHO it's a much better solution than the HTML output from
Poppler.
k2 is unable to run once and handle pages differently, turning some pages
into bitmaps (rasterize) while leaving the others as text ("native PDF"):
You'd have to write a loop, and merge those two sets back into a PDF. Likewise,
I haven't found how to use cpdf to crop, maximize, and rasterize pages.
Things that could be improved:
- Maybe Poppler or ImageMagick are better tools than MuPDF for this task?
- Crop PDF pages before/after turning them into PNG to maximize screen
size
- Convert relevant pages directly in PDF without having to extract to
PNG, convert to PDF, and merge
- Rewrite in Ruby: DOS cmd is hell (input params, arrays, etc.) ; Single
EXE or portable Ruby?
To crop:
Here's the Windows batch script:
@ECHO OFF
REM myscript.bat output.pdf input.pdf "1-5,8,25"

REM Note: ~ removes quotes
IF "%~1"=="" GOTO PARAM
IF "%~2"=="" GOTO PARAM
IF "%~3"=="" GOTO PARAM

REM Change those to match your e-reader
SET DPI=213
SET WIDTH=758
SET HEIGHT=1024

REM mutool.exe is expected in the current directory, ie. the parent of the temp dir
IF NOT EXIST mutool.exe (ECHO mutool missing & GOTO END)
SET APP=..\mutool.exe
SET OUTPUT=%1
SET INPUT=..\%2
SET LIST=%~3
SET TMPDIR=TEMP%random%%random%%random%%random%%random%%random%TEMP

REM Create temp dir
IF NOT EXIST %TMPDIR% MD %TMPDIR%
CD %TMPDIR%

REM Convert input PDF into individual PDFs
FOR /F "tokens=* delims=" %%# IN ('%APP% show %INPUT% Root.Pages.Count') DO SET "COUNT=%%#"
ECHO Found %COUNT% pages
FOR /L %%i IN (1,1,%COUNT%) DO (ECHO Handling %%i & %APP% clean -g %INPUT% %%i.pdf %%i)

REM Convert required pages into PNG
%APP% draw -r %DPI% -w %WIDTH% -h %HEIGHT% -o %%d.png %INPUT% %LIST%
REM Delete matching PDFs
FOR %%A IN (*.png) DO (ECHO Deleting %%~nA.pdf & DEL %%~nA.pdf)

REM Convert PNG files into PDFs, and remove PNG
FOR %%A IN (*.png) DO (%APP% convert -O compress -F pdf -o %%~nA.pdf %%A & ECHO Deleting %%A & DEL %%A)

REM Merge individual PDFs into single PDF
REM Build the list in page order (DIR /B would sort 10.pdf before 2.pdf)
SETLOCAL EnableDelayedExpansion
SET _filelist=
FOR /L %%i IN (1,1,%COUNT%) DO SET "_filelist=!_filelist!%%i.pdf "
REM ECHO LIST=%_filelist%
ECHO Merging
%APP% merge -o %OUTPUT% -O compress %_filelist%

REM Cleaning up (MOVE, not MV, on Windows)
MOVE %OUTPUT% ..
CD ..
RMDIR /S /Q %TMPDIR%
GOTO END

:PARAM
ECHO Usage : %0 output.pdf input.pdf "pages" (Use quotes if pattern includes commas, eg. "2,3")
GOTO END

:END
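As a first step away from DOS cmd, here is a rough Python sketch of the page-selection logic of the script above: it expands a page list such as "1-5,8,25" and decides, for each page in page order, whether it should be rasterized or kept as native PDF. It doesn't call mutool; the function names are hypothetical, and the "png"/"pdf" labels simply mirror what the batch script does with each page.

```python
def parse_page_list(spec):
    """Expand a page list such as "1-5,8,25" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def plan_pages(page_count, spec):
    """Return (page, kind) for every page, in numeric page order.

    kind is "png" for pages to rasterize and "pdf" for pages kept as text.
    """
    to_rasterize = set(parse_page_list(spec))
    return [(n, "png" if n in to_rasterize else "pdf")
            for n in range(1, page_count + 1)]


if __name__ == "__main__":
    for page, kind in plan_pages(10, "1-3,8"):
        print(page, kind)
```

Walking the pages from 1 to the page count keeps them in reading order when merging, which an alphabetical file listing (1, 10, 11, 2...) does not.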
REWRITE AS POWERSHELL OR RUBY
https://en.wikipedia.org/wiki/PowerShell
cmd.exe > powershell
$PSVersionTable
Editing the PDF
An alternative is to use LibreOffice Draw to modify the PDF, and read it
in your e-reader without bothering with EPUB:
- If the PDF file is big and would make LibreOffice sluggish, use
qpdf to export each page as an individual PDF file:
qpdf --progress
--split-pages infile.pdf %d.pdf
- In Draw, open and edit each problematic page to replace all nasty parts
(remove/rewrite insets and multi-column text, replace tables with screenshots)
- Use qpdf to merge all the pages back into a single PDF:
qpdf
--empty --pages *.pdf -- out.pdf
A faster way is to simply turn each "problematic" page into pictures:
- Open PDF on computer, and make a list of the pages that contain anything
more than basic, one-column text (eg. multi-columns, tables, insets, etc.)
- Split all the pages of the PDF into individual files
pdfseparate.exe input.pdf %d.pdf
- Convert each PDF with difficult layout into pictures, matching the e-reader's
width+height
- Merge all the PDFs back into a single PDF
- Send to e-reader, and test.
Q&A
Nolim: supported book formats: epub, fb2, html, txt, pdf, and
Adobe DRM
Can qpdf convert PDF into pictures?
No.
Can MuPDF convert PDF into pictures, and merge the files back?
for N in $(seq $(mutool show input.pdf Root.Pages.Count)); do mutool clean
-g input.pdf page$N.pdf $N; done
convert relevant pages into pictures
mutool merge
Investigate ImageMagick's convert
convert in.pdf -crop 50%x0 +repage out.pdf
Try TIFF or PSD vs. JPG/PNG
How to get rid of "side circles" (typographic signs)?
d:\Temp\PDF.to.EPUB\test.PDF.edit\10.jpg-1.jpg
pdftocairo vs. pdftoppm?
How to crop?
pdfseparate.exe: progress bar?
How to compile Poppler for Windows32?
https://sourceforge.net/projects/poppler-win32/
https://www.anaconda.org/conda-forge/poppler/files
https://blog.alivate.com.au/tag/pdftohtml/
https://towardsdatascience.com/poppler-on-windows-179af0e50150
Converting HTML pages into an EPUB
Multiple HTML files can be cleaned up and converted into a single EPUB file.
- With Calibre, you first need to create
a ToC
- And then, call the command: "c:\Program Files\Calibre2\ebook-convert.exe"
ToC.html full.epub
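Writing the ToC file by hand gets tedious with many pages. Below is a minimal Python sketch that generates one from a list of (title, file name) pairs; it assumes that a plain HTML list of links is enough for ebook-convert to follow, and the function name and sample chapter names are made up for illustration.

```python
import html


def build_toc(chapters):
    """Return the text of a ToC.html linking to each chapter file, in order.

    chapters: list of (title, filename) tuples.
    """
    items = "\n".join(
        f'    <li><a href="{html.escape(filename, quote=True)}">'
        f'{html.escape(title)}</a></li>'
        for title, filename in chapters
    )
    return (
        "<html>\n"
        "<head><title>Table of Contents</title></head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical chapter files sitting next to ToC.html
    print(build_toc([("Chapter 1", "ch1.html"), ("Tables & figures", "ch2.html")]))
```

Save the output as ToC.html in the same directory as the chapter files, then run the ebook-convert command shown above.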
Open-source applications
Calibre
Calibre is
a well-known cross-platform, open-source GUI application to manage e-books, written
by Kovid Goyal; removing DRM requires a third-party plug-in. It
can also convert a PDF into EPUB. It does a reasonable job at it, but
just like other tools, it has a difficult time with more sophisticated layouts,
tables, and headers/footnotes.
The UI can be changed through Preferences > Change Calibre behavior
(CTRL+P) > Interface, and the Layout button in the bottom right corner.
Once Calibre is installed, the PDF-to-EPUB converter is also available through
the command line: ebook-convert.exe input_file output_file [options].
To convert a PDF into HTML, Calibre actually relies on poppler:
it runs poppler's pdftohtml to convert each page of the PDF
into HTML, and then works from there to build an EPUB. The settings in Calibre's
Convert dialog let you change the settings that it will use for this operation,
but Calibre can only do so much using the input from poppler.
"Calibre is awesome at many things, but PDF conversion isn't one of
its strong points. What I find most annoying is the text unwrapping, and that
certainly is Calibre's fault. The algorithm it uses is quite simplistic, if
a line is less than xx% of the page width, it's considered a paragraph break,
if it's longer, it's not. So in a typical book, you end up with hundreds of
incorrect paragraph breaks - spurious breaks that shouldn't be there, and paragraphs
stuck together that shouldn't be." (Source)
How to work with the Convert dialog
The Debug option lets you see the files at the four steps: input, parsed,
structure, and processed.
Slight editing can be done in the \input directory, which contains the HTML
files generated by poppler. When you're done, zip the files up, add the archive through
the Edit meta information dialog, and proceed with the conversion.
How to make the most of the Convert dialog
Start by converting the PDF to EPUB using Calibre's default settings, and
see what the issues are. When clicking on the Wizard button in the Search &
replace section, if an EPUB isn't available, Calibre will first convert the
PDF into HTML, which explains the pause.
https://manual.calibre-ebook.com/conversion.html
Heuristic Processing
Line numbers
https://dearauthor.com/ebooks/calibre-pdfs-epub-conversion-tips/
Line Un-Wrapping Factor
Used to unwrap paragraphs. This is a scale used to determine the length at
which a line should be unwrapped. Valid values are decimals between 0 and 1.
The default is 0.45, just under the median line length. Lower this value to
include more text in the unwrapping (more 'aggressive', but lines that
shouldn't be unwrapped may get joined); increase it to include less. You can adjust
this value in the conversion settings under PDF Input.
"the unwrap function looks at the median (or average, can't remember)
line length, and only unwraps lines that exceed that length. That works well
for a book with consistent breaks in roughly the same location for every line
(OCR, pdf, many well formatted text files), but it will fail where the hard
breaks are inconsistent/infrequent. Reducing the unwrap factor basically tells
Calibre to look for shorter lines than the median. The fewer or more erratic
the breaks the lower you need to go, sometimes all the way down to 0.05".
(Source)
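To get a feel for what the factor does, here is a simplified sketch of the heuristic as the quotes describe it. It is not Calibre's actual code: the function name is made up, and it measures lines in characters rather than page width.

```python
from statistics import median


def unwrap_lines(lines, factor=0.45):
    """Join hard-wrapped lines into paragraphs.

    A line shorter than factor * median line length is treated as the
    end of a paragraph; longer lines are joined with the next one.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return []
    threshold = factor * median(lengths)

    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(text)
        if len(text) < threshold:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Lowering factor lowers the threshold, so fewer lines count as paragraph ends and more text gets joined, which is the 'aggressive' behaviour mentioned above.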
Page setup
Page setup: Choose the output profile that matches the screen size of your e-reader
Structure detection
Table of contents
Search & replace
In the "Search & replace" section, use the Wizard to test regexes.
Regexes are applied to the HTML as produced by poppler. If an EPUB has already
been generated, Calibre asks whether to use its HTML or to start again
from the PDF.
Headers and footers must be searched for
and removed because they are often part of the document text and they can
throw off the paragraph unwrapping.
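Since Calibre's regexes use Python syntax, a pattern can be tried out in plain Python before pasting it into the Wizard. The sketch below targets the footer quoted in the MuPDF notes further down (file name, date, time, page number); the pattern and the surrounding HTML are assumptions to adapt to what poppler actually produced for your PDF.

```python
import re

# Matches e.g. "ffirs_simmons.qxd 5/16/05 4:13 PM Page iii", plus a
# trailing <br/> if there is one. Page numbers may be arabic or roman.
FOOTER = re.compile(
    r"[\w.-]+\.qxd\s+\d{1,2}/\d{1,2}/\d{2}\s+\d{1,2}:\d{2}\s+[AP]M"
    r"\s+Page\s+[0-9ivxlc]+\s*(?:<br\s*/?>)?",
    re.IGNORECASE,
)


def strip_footers(html):
    """Remove every footer occurrence from the HTML text."""
    return FOOTER.sub("", html)
```

Once the pattern removes the footers and nothing else, the expression itself (without the Python quoting) goes into the search field, with an empty replacement.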
"If you are intimidated by regular expressions, many Windows users have
reported that [deadware] Mobipocket Creator is a good alternative to use to
do the initial pdf conversion. Use Mobipocket
Creator to convert the pdf to the .mobi format, and then use Calibre to
convert from mobi to your final desired format."
Q&A
Any way to tell Calibre to ignore some pages (ToC, tables, etc.)?
? PDF Input > Line un-wrapping factor = 0.45 VS. Heuristic processing
> Line un-wrap factor =0.40
PDF input
EPUB output
Debug
Post-EPUB editing
If the EPUB output needs some work, you can use either Calibre's internal
editor (select the book in the list > Edit
Book) or Sigil (80MB; much like the editor in Calibre without all the fuss; the app must
be restarted after changing the UI language).
Infos
Q&A
How to hide the left-side tree list Authors, Languages, etc.?
MuPDF (mutool)
Even worse than poppler at converting PDF to HTML (one line = one <p></p>)
- https://www.mupdf.com
- Artifex, its publisher, also maintains Ghostscript
- To ask questions, log on to irc.freenode.net,
channel #mupdf
- Open-source, cross-platform, CLI
- mupdf.exe and mupdf-gl.exe are GUI readers: "For Linux and Windows
there are two viewers. One is a very basic viewer using x11 and win32, respectively.
It has been supplanted by a newer viewer using OpenGL for rendering, which
has more features such as table of contents, unicode search, etc. We keep
the old viewers around for older systems where OpenGL is not available."
- mutool.exe: The command line tools are all gathered into one umbrella
command: mutool
- mutool draw: This is the more customizable tool, but also has a more
difficult set of command line options. It is primarily used for rendering
a document to image files.
- mutool convert: This tool is used for converting documents into other
formats, and is easier to use.
The HTML displays fine in a browser, but is useless for creating an EPUB because
it has no notion of paragraphs: each line of text is just positioned
at x,y coordinates, with no indication that it belongs to a paragraph.
Note: According to Calibre author, mutool outputs "non-reflowable HTML,
it is just as useless as the original PDF file."
https://www.mupdf.com/docs/manual-mutool-draw.html
mutool draw -F html -o out.%d.html in.pdf
one page = one HTML file (pictures embedded as base64)
mutool draw -F html -o out.html in.pdf
single HTML file, with pictures embedded as "data:image/png;base64"
Remove footer? ffirs_simmons.qxd 5/16/05 4:13 PM Page iii
-> Copied HTML into e-reader: Took ~ one minute to open, and… unreadable
(format foobared).
https://artifex.com/support/open-source/
Note: In "mutool convert", N can be used to stand for the last
page
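When trying several formats or page ranges, it can help to wrap the command in a small script. A minimal sketch, assuming mutool is on the PATH and that the page-list syntax (with N for the last page) is accepted by draw as well as convert; the function names are made up for the example.

```python
import shutil
import subprocess


def build_mutool_draw(pdf_path, out_path, fmt="html", pages=None):
    """Return the argument list for: mutool draw -F fmt -o out in.pdf [pages]."""
    cmd = ["mutool", "draw", "-F", fmt, "-o", out_path, pdf_path]
    if pages:
        # Page list such as "1-10" or "5,20-N" (N = last page).
        cmd.append(pages)
    return cmd


def run_mutool_draw(pdf_path, out_path, fmt="html", pages=None):
    """Run mutool draw, failing early if the binary cannot be found."""
    if shutil.which("mutool") is None:
        raise FileNotFoundError("mutool not found on PATH")
    subprocess.run(build_mutool_draw(pdf_path, out_path, fmt, pages), check=True)
```

For instance, run_mutool_draw("in.pdf", "out.%d.html", pages="1-10") would render only the first ten pages, one HTML file per page.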
clean vs. draw vs. convert:
- mutool clean reads and writes PDF files
- mutool draw reads any file MuPDF can read and writes most formats, but it does so by interpreting the graphics and creating the output file from scratch.
Any extra PDF information such as bookmarks and links is lost in the process.
- mutool convert has a simpler interface than "mutool draw", and also provides more detailed options for many of the output formats. mutool draw is primarily intended for bitmap output.
So mutool draw and mutool convert do much the same thing, but through different interfaces.
Functions offered by mutool:
- draw -- the most commonly used tool, capable of converting/rendering
documents to a range of bitmap and vector formats. It performs a similar
task to the convert utility, using a different set of internal mechanisms;
output format: png, tga, pnm, pam, pbm, pkm, pwg, pcl, ps, svg, pdf, trace,
txt, html, stext; modifies page width/height, rotate, colorspace, etc.
- convert -- performs a similar task to the draw utility, using a different
internal mechanism (the document writer interface); output format (default
inferred from output file name): png, pnm, pgm, ppm, pam, tga, pbm, pkm,
pdf, svg, cbz; modifies page width/height, rotate, colorspace, etc.
- clean -- rewrite pdf file
- create -- create pdf document
- extract -- extract font and image resources
- merge -- merge pages from multiple pdf sources into a new pdf
- portfolio -- manipulate PDF portfolios
- poster -- split large page into many tiles
- run -- run javascript
- info -- show information about pdf resources
- show -- show internal pdf objects
- pages -- show information about pdf pages
Some examples:
mutool pages input.pdf 20
#trim introduced in 1.22
mutool trim -b mediabox -o cropped.pdf in.pdf
cpdf
https://github.com/coherentgraphics/cpdf-binaries
https://github.com/coherentgraphics/cpdfsqueeze-binaries
Written by Coherent Graphics Ltd's John Whitington, author of O'Reilly's
"PDF Explained". Based on an open source library written in OCaml.
Notes from cpdfmanual.pdf:
- The cpdf
tool has been available commercially since 2007, and is widely used in industry
and government. Now we're releasing two tools for free, the main program
under a special not-for-commercial-use license, and a lossless PDF squeezer
under the LGPL.
- When measurements are given to cpdf, they are in points (1 point = 1/72 inch).
They may optionally be followed by some letters to change the measurement.
The following are supported: pt Points (72 points per inch). The default.
cm Centimeters, mm Millimeters, in Inches.
- Linearized PDF is a version of the PDF format in which the data is held
in a special manner to allow content to be fetched only when needed. This
means viewing a multipage PDF over a slow connection is more responsive.
This requires the existence of the external program cpdflin which is provided
with commercial versions of cpdf.
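The unit rule in the notes above (1 point = 1/72 inch) is easy to get wrong when working out a mediabox by hand. Here is a small sketch that converts the four suffixes to points and formats a box string; the helper names are made up and this is not part of cpdf.

```python
# Points per unit, from 1 point = 1/72 inch and 1 inch = 2.54 cm.
UNITS_IN_POINTS = {
    "pt": 1.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}


def to_points(measure):
    """Convert '10cm', '25.4mm', '1in', '424pt' or a bare number to points."""
    text = measure.strip().lower()
    for suffix, factor in UNITS_IN_POINTS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # no suffix: already in points


def mediabox(width, height):
    """Format a box anchored at the origin, with sizes in points."""
    return f"0 0 {to_points(width):g}pt {to_points(height):g}pt"
```

mediabox("424pt", "600pt") gives the string "0 0 424pt 600pt", the form used with cpdf -mediabox.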
Functions offered by cpdf:
- merge/-split
- scaling, rotating, etc.
- showing infos (show-boxes, list-fonts, etc.)
- encrypting/decrypting
- compressing/decompressing
- handling bookmarks
- turning a PDF into a PowerPoint-like presentation
- watermark and stamps
- multipage
- annotations
- metadata
- file attachments
- images, fonts
A couple of examples:
cpdf -page-info input.pdf 25
cpdf -mediabox "0 0 424pt 600pt" input.pdf 1,25-50 -o output.pdf
pdfCropMargins
"The pdfCropMargins
program is a command-line application to automatically crop the margins of PDF
files."
pdfcpu
"pdfcpu is a PDF processing
library written in Go supporting encryption. It provides both an API and a CLI."
Examples:
pdfcpu box add -- "media:[0 0 200 600]" input.pdf output.pdf
pdfcpu boxes list -p 20 output.pdf
qpdf
- Written by Jay Berkenbilt; "QPDF was originally created in 2001
and modified periodically between 2001 and 2005 during my employment at
Apex CoVantage. Upon my departure
from Apex, the company graciously allowed me to take ownership of the software
and continue maintaining as an open source project, a decision for which
I am very grateful."
- http://qpdf.sourceforge.net
- https://github.com/qpdf/qpdf
Notes from qpdf-manual.pdf
- qpdf does structural, content-preserving transformations on PDF files.
- In QDF mode, qpdf creates PDF files in what we call QDF form. The purpose
of QDF form is to make it possible to edit PDF files, with some restrictions,
in an ordinary text editor.
- A Python module called pikepdf [https://pypi.org/project/pikepdf/] provides
a clean and highly functional set of Python bindings to the qpdf library.
Using pikepdf, you can work with PDF files in a natural way and combine
qpdf's capabilities with other functionality provided by Python's rich standard
library and available modules.
- the qpdf command-line program can produce a JSON representation of the
non-content data in a PDF file. It includes a dump in JSON format of all
objects in the PDF file excluding the content of streams. This JSON representation
makes it very easy to look in detail at the structure of a given PDF file.
Functions offered by qpdf:
- encryption
- linearization
- rotating
- collating/concatenating
- splitting
- over/underlaying
poppler
Like pdftohtml, poppler is also based on xpdf. Confusingly,
poppler kept the names of applications such as "pdftohtml", so it's hard to tell
that its version is not the original, whose development was abandoned in 2006.
As of April 2020, the latest stable release is poppler-0.87.0.tar.xz, released
on March 28, 2020. Note that packages for Ubuntu et al. might be out of date.
Poppler includes multiple applications:
- pdfseparate – extract single pages from a PDF
- pdftocairo – convert single pages from a PDF to vector or bitmap formats
using cairo
- pdftoppm – convert a PDF page to a bitmap
- pdfunite – merge several PDFs into one
- pdftohtml – convert PDF to HTML format retaining formatting
- pdftotext – extract all text from PDF
- pdfdetach – extract embedded documents from a PDF
- pdfimages – extract all embedded images at native resolution from a
PDF
- pdftops – convert PDF to printable PS format
- pdffonts – lists the fonts used in a PDF
- pdfinfo – list all information of a PDF
https://blog.alivate.com.au/tag/pdftohtml/
https://towardsdatascience.com/poppler-on-windows-179af0e50150
apt-get install poppler-utils
pdftohtml
"-c : This will output in complex mode. You can't use -noframes with
the complex flag."
"-noframes generate no frames. Not supported in complex
output mode."
"complex mode": One page = one HTML file + one PNG that only includes
some typographical feature to center the output.
Here's how to convert a PDF that was encoded in Latin1:
pdftohtml -c -s -enc Latin1 test.pdf test.html
Windows release
https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows
http://blog.alivate.com.au/poppler-windows/
(Last Windows release is 0.68, while the current release is 0.87, released on March
28, 2020)
https://anaconda.org/conda-forge/poppler/files (Up
to date but only available for Win64)
d:\Temp\temp.Archie.SVG\inkscape\libpoppler-73.dll
d:\Temp\temp.Archie.SVG\inkscape\libpoppler-glib-8.dll
! Source Win32 https://github.com/zotero/cross-poppler
"cross-poppler compiles Poppler PDF tools for macOS (x64), Windows (x86,
x64), Linux (x86, x64). This is only intended to be used for pdfinfo and pdftotext."
poppler-0.39.0-win32.zip 2016-01-07 7.3
MB https://sourceforge.net/projects/poppler-win32/
pdfium
pdfium
podofo
PDFMasher
"PDFMasher, now long abandoned and unmaintained."
pdftk
Written by Sid Steward, author of O'Reilly's "PDF Hacks". For some
reason, the CLI binary is called "PDFtk Server".
To uncompress a PDF: pdftk input.pdf output output.pdf uncompress
apt-get install pdftk ghostscript
https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
https://en.wikipedia.org/wiki/PDFtk
pdfminer.six
PDFMiner
pdftohtml
- Deadware (pdftohtml-0.39, 2006-08-03); lives on as a fork inside poppler
- Open-source, Linux
- http://pdftohtml.sourceforge.net
- apt-get install poppler-utils
- pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html
- output much worse than mutool
- -c : 'gswin32c' is not recognized as an internal or external command,
operable program or batch file. Error: Failed to launch Ghostscript!
- C:\Program Files\gs\gs9.52\bin\gswin32c.exe
pdf2htmlEX
- Deadware: "Sunday, December 11, 2016 Looking for new maintainer"
- https://github.com/coolwanglu/pdf2htmlEX
- "As of 2020, PDFMiner is not actively maintained. The code still
works, but this project is largely dormant. For the active project, check
out its fork pdfminer.six."
Pandoc
pandoc cannot convert PDF to HTML, but can turn HTML into EPUB.
Written in Haskell, it's rather slow and resource-hungry so isn't great on big files.
Incidentally, here are two commands you can use to download a single web page and
its dependencies, and turn it into an EPUB:
- wget -E -H -k -K -p http://www.acme.com/mypage.html
- pandoc -f html -t epub -o output.epub mypage.html
pandoc can also fetch web pages directly:
pandoc -f html -t epub -o output.epub https://www.fsf.org
To install on Linux: apt-get install pandoc
Notes:
- Packages on Linux can be very old. Check you have the
latest.
- pandoc also supports creating an EPUB3 file… which your e-reader may
or may not support: "Please note that the EPUB 3.0 specification has been
released as of late 2011. Unfortunately, the eBook stores have been very slow
to adopt this standard. The EPUB3 specification will allow for more complex
eBook designs that include audio/video embedding, footnote support, and even
JavaScript." (Source).
- The Windows command prompt (cmd.exe) doesn't expand wildcards such as *.html. You need to create a batch
file or run in PowerShell. More info here.
- It's very slow on anything but small files (e.g. a 5MB file is a no go)
You can add metadata
for epub:
pandoc -f html -t epub3 --epub-metadata=metadata.xml -o output.epub input.html
"Alternatively, you could use pdftotext, save it to text, edit it into
shape as well formatted markdown, and then use pandoc to convert it to epub.....i've
done that several times - after a lot of practice (and some handy vim key mappings),
it takes me about a day or so to convert a book with a few hundred pages."
Note: "pandoc-citeproc originated as a fork of Andrea Rossato's citeproc-hs.
The pandoc-citeproc executable can be used as a filter with pandoc to resolve
and format citations using a bibliography file and a CSL stylesheet."
pandoc -o output.epub *.html
pandoc: *.html: openBinaryFile: invalid
argument (Invalid argument)
This is because cmd.exe passes *.html to pandoc unexpanded.
As a workaround, save this batch file as pandoc.cmd:
- @echo off
- :: Pandoc wrapper for calling it with wildcard file parameters.
- :: Expands any arguments containing wildcards according to standard
- :: Windows CMD.exe conventions.
- setlocal EnableDelayedExpansion
- set pandoc_cmd=pandoc
- for %%I in (%*) do set pandoc_cmd=!pandoc_cmd! "%%~I"
- !pandoc_cmd!
- endlocal
Call it thusly: echo output.epub | pandoc.cmd *.html -
If it still fails, use PowerShell instead of cmd.exe, or write a script in
a richer language such as Python.
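As an illustration of the script route, here is a minimal Python sketch that expands the wildcards itself, so it does not matter which shell it is launched from. It assumes pandoc is on the PATH; the function names are made up for the example.

```python
import glob
import subprocess


def build_pandoc_command(patterns, output):
    """Expand wildcard patterns and return the pandoc argument list."""
    inputs = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the argument as-is when nothing matches, so pandoc
        # reports the missing file itself.
        inputs.extend(matches if matches else [pattern])
    return ["pandoc", "-o", output, *inputs]


def run_pandoc(patterns, output):
    """Convert all matching files into a single output file."""
    subprocess.run(build_pandoc_command(patterns, output), check=True)
```

run_pandoc(["*.html"], "output.epub") then does what pandoc -o output.epub *.html does on Linux. Files are passed in sorted name order, so number them if the chapter order matters.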
NO! Don't concatenate the files first:
copy /b *.html full.html
pandoc -o full.epub full.html
It's a better idea to keep the individual HTML files and let pandoc merge them into a single
EPUB.
EPUBCheck
https://github.com/w3c/epubcheck
"The EpubCheck tool is an open-source program written in Java that checks
your EPUB file for errors. Most eBook stores that utilize the EPUB format will
utilize this exact same program to see if the eBook you upload for sale is valid."
(Source)
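Before running the full EpubCheck, a quick look at the container can catch the most basic packaging mistakes. An EPUB is a zip archive whose first entry must be an uncompressed file named mimetype containing application/epub+zip, and it must contain META-INF/container.xml. The sketch below checks only that; it is not a substitute for EpubCheck, and the function name is made up.

```python
import zipfile


def quick_epub_check(path):
    """Return a list of basic packaging problems (empty if none found)."""
    problems = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        if not names or names[0] != "mimetype":
            problems.append("mimetype must be the first entry in the archive")
        else:
            info = archive.getinfo("mimetype")
            if info.compress_type != zipfile.ZIP_STORED:
                problems.append("mimetype must be stored uncompressed")
            if archive.read("mimetype") != b"application/epub+zip":
                problems.append("mimetype must contain application/epub+zip")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
    return problems
```

A file that is not a zip archive at all raises zipfile.BadZipFile, which is also a useful answer.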
Closed-source solutions
Multidoc Converter
http://multidoc-converter.com
Tried 1.6.0.0 with a 10MB single HTML file with all pics embedded as base64: as
displayed in SumatraPDF, the result is as crappy as Calibre's.
PDFMate (free/pro)
- PDFMate
- Turns a 10MB PDF into 70MB EPUB. And pages in the output are images,
not text.
PDFelement (Pro)
PDFelement (Pro)
PDFelement Standard perpetual license $79
iSkysoft PDF Editor is PDFelement under a different name.
Xilisoft PDF to EPUB Converter
Xilisoft PDF
to EPUB Converter $20
ABBYY FineReader
FineReader ; Convert
PDFs to e-book formats EPUB, FB2 (Standard, Corporate). 199€; Release 14 ~500-850MB
Solid Converter
Solid
Converter: PDF to EPUB? $100
A-PDF
A-PDF no converter?
MobiPocket Creator
Deadware from Mobipocket;
Final release 4.2 can be found here
in the "Tutorial
- How to Create a MobiPocket eBook" thread.
"Home Edition has a simple to use interface and is designed to produce
content for private use. When creating new files from scratch you can use predefined
templates to aid in the creation effort. A user can also use the windows version
of MobiPocket Reader to convert files.
Installing [the Publisher Edition] provides the most power to customize
the output of the file and is required to submit eBooks commercially. if you
are a publisher and intend to sell eBooks through eBookbase, this is the version
of the Mobipocket Creator that you should use. Additional features essential
for publishers include: the encryption level required by eBookbase; an integrated
"deploy" feature to automatically upload or update your books in eBookbase;
the metadata editor to set the price, ISBN, cover image... of your books; PDF
import".
https://www.mobileread.com/forums/showthread.php?t=17914
https://wiki.mobileread.com/wiki/MobiPocket_Creator
Installed the Publisher Edition. After it reads a PDF, it generates an HTML file and
pictures. "Build" creates a .PRC file, which you don't need. Once
you have the HTML file, doctor it in the Calibre editor or Sigil before turning
it into EPUB.
From a PDF, creates a single HTML + multiple PNGs.
Tools to crop PDF
- mutool trim (use cpdf to first change the PDF's mediabox)
- Briss: Stuck at "Loading
new file - Creating merged previews"
- pdfarranger: "Access denied!"
- PoDoFo requires
ghostscript
- krop: GUI (Python, PyQt, python-poppler-qt5, PyPDF2)
- pdfjam: Unix only
- PDFtk: GUI, commercial; PDFtk Server = CLI
Q&A
Mediabox, cropbox, bleedbox, trimbox, artbox?
https://wiki.scribus.net/canvas/PDF_Boxes_:_mediabox,_cropbox,_bleedbox,_trimbox,_artbox
https://pdfcpu.io/getting_started/box
- media box: boundaries of the physical medium on which the page
is to be printed.
- crop box: region to which the contents of the page shall be clipped
(cropped) when displayed or printed.
- bleed box: region to which the contents of the page shall be clipped
when output in a production environment.
- trim box: intended dimensions of the finished page after trimming.
- art box: extent of the page’s meaningful content as intended by
the page’s creator.
- The media box is mandatory and serves as default for the crop box and
is its parent box. The crop box serves as default for art box, bleed box
and trim box and is their parent box.
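The defaulting rules in the list above can be written down as a few lines of Python. This is only an illustration of the rules as stated; the dictionary layout and function name are made up, and each box is an (x0, y0, x1, y1) tuple in points.

```python
def effective_boxes(page_boxes):
    """Fill in the boxes a page does not define, following the rules above.

    page_boxes maps 'media', 'crop', 'bleed', 'trim', 'art' to
    (x0, y0, x1, y1) tuples; only 'media' is mandatory.
    """
    if "media" not in page_boxes:
        raise ValueError("the media box is mandatory")
    media = page_boxes["media"]
    # The crop box defaults to the media box.
    crop = page_boxes.get("crop", media)
    resolved = {"media": media, "crop": crop}
    # Bleed, trim and art boxes default to the crop box.
    for name in ("bleed", "trim", "art"):
        resolved[name] = page_boxes.get(name, crop)
    return resolved
```

So a page that defines only a media box and a crop box displays, prints and trims to the crop box, which is why changing the crop box is usually enough to remove margins on an e-reader.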
mutool vs. muPDF?
https://www.mankier.com/1/mutool
muPDF is just a PDF viewer, while mutool is a command-line application to
work with PDFs.
Why are some PDFs non-mouse-selectable? Why are some PDFs selectable but not copyable
to the clipboard?
Either the pages are just pictures and not text, or the PDF is configured
to forbid copying: "Denied Permissions: copying text".
In the latter case, remove the restriction with: qpdf --decrypt input.pdf output.pdf
If they're pictures, run the PDF through OCR to add a text layer.
How to edit meta-data (title, etc.)?
Surprisingly, exiftool can also edit PDF meta-data:
exiftool.exe -o output.pdf -Title="My title" -Author="Some author" -Subject="My subject" input.pdf
Resources
Books
Sites