PdfExtractor

PdfExtractor is a library for extracting all the resources of a PDF.

Features

Extract the text and images of a PDF, using PDFBox.
Extract the tables of a PDF, using Tabula.
Choose which parts of the PDF to extract: by bookmarks, by pages, or the whole document.

Download

Download the PdfExtractor jar from our releases page.

Usage

PdfExtractor provides a command line application:

$ java -jar PdfExtractor.jar --help
usage: PDFExtractor [-b <NUMBERS>] [-f] [-h] [-i <input PDF or FOLDER>]
       [-o <output FOLDER>] [-p <NUMBERS>] [-r]
Mised argument
 -b,--bookmark <NUMBERS>            [OPTIONAL] ¡NOT AT SAME THAN -p! By
                                    default, the extractor extract all of
                                    them.
                                    If the PDF has BOOKMARKS, we extract
                                    all content from selected. Using comma
                                    separated or list of ranges to
                                    listExamples: --bookmark 1-3,5-7,
                                    --bookmark 3.
 -f,--fix                           [EXPERIMENTAL] Force PDF to be
                                    extracted adjunting words, deleting
                                    files, deleting footers, .. By
                                    default, disabled
 -h,--help                          Indicate how yo use the program.
 -i,--input <input PDF or FOLDER>   [REQUIRED] Absolute Pdf or folder with
                                    PDF location path. Ex:
                                    /Users/thoqbk/table.pdf
 -o,--output <output FOLDER>        Absolute output file. By default the
                                    folder on i or the parent. Ex:
                                    /Users/thoqbk/results
 -p,--pages <NUMBERS>               [OPTIONAL] ¡NOT AT SAME THAN -p! By
                                    default, the extractor extract all of
                                    them.
                                    Using comma separated or list of
                                    ranges to list to select
                                    pagesExamples: --pages 1-3,5-7,
                                    --pages 3.
 -r,--resources                     Try to extract all resources from PDF
                                    (text, image and tables). By default,
                                    disabled

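The --fix option is experimental. It tries to join the parts of words that were split apart, remove headers and footers, and remove tables from the final text. The --bookmark and --pages options select what to extract and cannot be used together; if neither is given, the whole PDF is extracted.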
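As an illustration of how the options in the help text fit together, here is a minimal sketch of a wrapper function. The function name `extract_pdf`, the `DRY_RUN` variable and the jar path are hypothetical, not part of PdfExtractor, and the sketch assumes that `-r` can be combined with a page or bookmark selection. With `DRY_RUN=1` it only prints the command it would run.

```shell
#!/bin/sh
# Placeholder path to the downloaded jar; adjust to your setup.
JAR="PdfExtractor.jar"

# extract_pdf INPUT OUTPUT SELECTOR SELECTION   (hypothetical helper)
# SELECTOR is -p (pages) or -b (bookmarks); the help text says not to use both.
# SELECTION is a comma-separated list of numbers or ranges, e.g. 1-3,5-7.
extract_pdf() {
  # Build the command line from flags listed in the help output.
  # Assumption: -r may be combined with -p or -b.
  set -- java -jar "$JAR" -i "$1" -o "$2" "$3" "$4" -r
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"   # show the command without running it
  else
    "$@"
  fi
}

# Preview the command for pages 1-3 and 5-7 of one PDF.
DRY_RUN=1 extract_pdf /Users/thoqbk/table.pdf /Users/thoqbk/results -p 1-3,5-7
```

Calling `extract_pdf` without `DRY_RUN=1` runs the jar for real; pass `-b` instead of `-p` to select by bookmark.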

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

The resulting jar is written to the project's target folder.
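The exact file name of the built jar depends on the project's pom.xml, so it is not stated here. As a small sketch, the hypothetical helper `find_built_jar` below prints the most recently modified jar under a project's target folder, assuming the build places at least one `.jar` file there.

```shell
#!/bin/sh
# find_built_jar [PROJECT_DIR]   (hypothetical helper)
# Print the most recently modified .jar under PROJECT_DIR/target,
# or nothing if no jar exists there. PROJECT_DIR defaults to the current directory.
find_built_jar() {
  ls -t "${1:-.}"/target/*.jar 2>/dev/null | head -n 1
}

jar=$(find_built_jar .)
if [ -n "$jar" ]; then
  echo "Built jar: $jar"
else
  echo "No jar found under ./target; run the Maven command first."
fi
```

Once a jar is found, `java -jar "$jar" --help` should print the usage text shown in the Usage section.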
