This project is part of the Knox multiproject and is located in layer one. The goal of this module is to extract and segment information from PDF documents provided by Grundfos.
The module can extract text, titles, images and tables from PDF files and produce a folder containing the extracted information. The module contains three components in addition to a few utilities.
git clone https://git.its.aau.dk/Knox/grundfos-preprocessing.git
conda create -n grundfos-preprocessing python=3.8 pip
conda activate grundfos-preprocessing
cd grundfos-preprocessing
pip install -r requirements.txt
Download the file from
https://drive.google.com/file/d/1Jx2m_2I1d9PYzFRQ4gl82xQa-G7Vsnsl/view?usp=sharing
and place it in the classification folder.

To segment a document, run the segment.py file in the root folder of the repository using the following command:
python segment.py [FLAGS] INPUT_FOLDER OUTPUT_FOLDER
The INPUT_FOLDER must contain all of the PDF files to include in the segmentation. Subfolders are not searched. The OUTPUT_FOLDER must already exist, as the program does not create it. Optional arguments, indicated by [FLAGS], are written before INPUT_FOLDER and OUTPUT_FOLDER.
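The folder requirements above can be checked before starting a run. The following is a minimal sketch, not part of the repository: the helper name `check_folders` is hypothetical, and matching the `.pdf` extension case-insensitively is an assumption about how segment.py selects its input.

```python
from pathlib import Path


def check_folders(input_folder, output_folder):
    """Return the PDF files found directly in input_folder, or raise an error.

    Hypothetical helper: it mirrors the documented requirements, not the
    actual checks performed by segment.py.
    """
    input_path = Path(input_folder)
    output_path = Path(output_folder)

    if not input_path.is_dir():
        raise NotADirectoryError(f"Input folder not found: {input_path}")
    # The output folder is not created by the program, so it must exist already.
    if not output_path.is_dir():
        raise NotADirectoryError(f"Output folder not found: {output_path}")

    # Only direct children are considered; subfolders are not searched.
    # Assumption: the extension is matched case-insensitively.
    pdfs = sorted(
        p for p in input_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"No PDF files directly inside {input_path}")
    return pdfs
```

Calling `check_folders("input_pdfs", "output")` (both folder names are placeholders) either returns the list of PDF paths or raises an error that names the folder causing the problem.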
The flags available for segment.py are:

-h, --help
Provides an overview of the flags and arguments available.

-a A, --accuraccy A
Minimum threshold for the prediction accuracy used by the machine intelligence module. Value between 0 and 1. Default is 0.7.

-m, --machine
Enable the machine intelligence module when running the program.

-t, -temporay
Keep the temporary files created while running the program.

-c, --clean
Clear the output folder before running the program.

-s SCHEMA, --schema SCHEMA
Path to the JSON schema. Default is schema/manuals_v1.2.schema.json.

-d, --download
Download the Grundfos data set before running the program.

The wrapper module uses another module created by the Nordjyske preprocessing group. The module and its documentation can be found at https://git.its.aau.dk/Knox/source-data-io.
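As an illustration of how the segment.py flags described earlier combine into one invocation, the sketch below assembles the command line from Python. It is not part of the repository: the function names are hypothetical, only the short flags `-a`, `-m`, `-c` and `-s` from the list are used, and the script is assumed to be started from the repository root.

```python
import subprocess
import sys


def build_segment_command(input_folder, output_folder, accuracy=None,
                          machine=False, clean=False, schema=None):
    """Build the argument list for segment.py (hypothetical helper).

    Flags are placed before the two folder arguments, as the usage
    line `python segment.py [FLAGS] INPUT_FOLDER OUTPUT_FOLDER` requires.
    """
    if accuracy is not None and not 0 <= accuracy <= 1:
        raise ValueError("accuracy must be between 0 and 1")

    command = [sys.executable, "segment.py"]
    if machine:
        command.append("-m")                 # enable machine intelligence module
    if accuracy is not None:
        command += ["-a", str(accuracy)]     # minimum prediction accuracy
    if clean:
        command.append("-c")                 # clear the output folder first
    if schema is not None:
        command += ["-s", str(schema)]       # path to the JSON schema
    command += [str(input_folder), str(output_folder)]
    return command


def run_segment(command):
    """Run the assembled command and raise if segment.py exits with an error."""
    subprocess.run(command, check=True)
```

For example, `build_segment_command("input_pdfs", "output", accuracy=0.8, machine=True, clean=True)` yields the equivalent of `python segment.py -m -a 0.8 -c input_pdfs output`, where the two folder names are placeholders. Building the arguments as a list rather than a single string avoids quoting problems with folder names that contain spaces.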