Grundfos preprocessing
This project is part of the Knox multiproject and is located in layer one. The goal of this module is to extract and segment information from PDF documents provided by Grundfos.
The module extracts text, titles, images and tables from PDF files and produces a folder containing the extracted information. It contains three components in addition to a few utilities:
- The Text segmenter component recursively scans the document and assigns text to its correct section or subsection.
- The Miner component analyzes the lines in the document to find tables and figures.
- The Inference component uses computer vision to find tables and images (as well as text, lists and titles if desired).
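The recursive grouping performed by the Text segmenter can be pictured with a small sketch. This is not the module's actual implementation: the `Section` class, the `build_sections` function, and the input format (pairs of heading level and text, where level 0 means body text) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """Hypothetical container for one section and its nested subsections."""
    title: str
    level: int
    text: List[str] = field(default_factory=list)
    subsections: List["Section"] = field(default_factory=list)


def build_sections(
    lines: List[Tuple[int, str]], start: int = 0, level: int = 1
) -> Tuple[List[Section], int]:
    """Recursively group (heading_level, text) pairs into nested sections.

    heading_level 0 marks body text, 1 a section title, 2 a subsection title,
    and so on (an assumed input format). Returns the sections found at
    `level` and the index at which scanning stopped.
    """
    sections: List[Section] = []
    i = start
    while i < len(lines):
        heading_level, content = lines[i]
        if heading_level != 0 and heading_level < level:
            # A shallower heading ends this level; the caller handles it.
            break
        if heading_level == level:
            sections.append(Section(title=content, level=level))
            i += 1
            continue
        if not sections:
            # Content that appears before any heading gets an untitled section.
            sections.append(Section(title="", level=level))
        if heading_level == 0:
            # Body text belongs to the most recent section at this level.
            sections[-1].text.append(content)
            i += 1
        else:
            # A deeper heading: recurse and attach the result as subsections.
            children, i = build_sections(lines, i, heading_level)
            sections[-1].subsections.extend(children)
    return sections, i
```

Calling `build_sections` on a flat list of tagged lines yields a tree in which each body line is attached to the nearest preceding heading.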
Installation
- Clone the repository using
git clone https://git.its.aau.dk/Knox/grundfos-preprocessing.git
- Create a new virtual environment using your preferred tool, for example Conda. For Conda:
conda create -n grundfos-preprocessing python=3.8 pip
- Activate the virtual environment. For Conda:
conda activate grundfos-preprocessing
- Change to the root folder of the repository:
cd grundfos-preprocessing
- Install required dependencies with pip:
pip install -r requirements.txt
- To use the machine-intelligence component, download the model from
https://drive.google.com/file/d/1Jx2m_2I1d9PYzFRQ4gl82xQa-G7Vsnsl/view?usp=sharing
and place it in the classification folder.
Usage
To segment a document, run segment.py from the root folder of the repository with the following command:
python segment.py [FLAGS] INPUT_FOLDER OUTPUT_FOLDER
The INPUT_FOLDER must contain all of the PDF files to include in the segmentation. Subfolders are ignored. The OUTPUT_FOLDER must already exist, as it is not created by the program. Optional arguments, indicated by [FLAGS], are written before the INPUT_FOLDER and OUTPUT_FOLDER.
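The folder requirements above can be checked before starting a long run. The following is a minimal sketch, not part of segment.py: the helper name `collect_pdfs` is hypothetical, and it only mirrors the documented rules (PDFs are read from the top level of the input folder, and the output folder must already exist).

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_folder: str, output_folder: str) -> List[Path]:
    """Return the PDF files directly inside input_folder, sorted by name.

    Hypothetical pre-flight check; raises if either folder is missing.
    """
    in_dir, out_dir = Path(input_folder), Path(output_folder)
    if not in_dir.is_dir():
        raise NotADirectoryError(f"Input folder not found: {in_dir}")
    if not out_dir.is_dir():
        # The program does not create the output folder, so fail early.
        raise NotADirectoryError(f"Output folder must already exist: {out_dir}")
    # iterdir() is not recursive, so PDFs inside subfolders are not returned.
    return sorted(
        p for p in in_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

An empty result from such a check is a quick hint that the PDFs may be sitting in a subfolder rather than in the INPUT_FOLDER itself.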
Flags
The flags available for segment.py are:
-h, --help
Provides an overview of the flags and arguments available.
-a A, --accuracy A
Minimum threshold for the prediction accuracy used by the machine intelligence module. Value between 0 and 1. Default is 0.7.
-m, --machine
Enable the machine intelligence module when running the program.
-t, --temporary
Keep the temporary files created while running the program.
-c, --clean
Clear the output folder before running the program.
-s SCHEMA, --schema SCHEMA
Path to the JSON schema. Default is schema/manuals_v1.2.schema.json.
-d, --download
Download the Grundfos data set before running the program.
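For readers who want to see how such a command line fits together, here is a minimal argparse sketch that mirrors the flags listed above. It is an illustration only and not the parser in segment.py: the function names `build_parser` and `unit_interval` are hypothetical, the long option spellings `--accuracy` and `--temporary` are assumed, and the range check on the threshold is an added assumption based on the documented "value between 0 and 1".

```python
import argparse


def unit_interval(value: str) -> float:
    """Parse a float and require it to lie between 0 and 1 (inclusive)."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is not between 0 and 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the documented interface (assumed layout)."""
    parser = argparse.ArgumentParser(prog="segment.py")
    parser.add_argument("-a", "--accuracy", metavar="A", type=unit_interval,
                        default=0.7,
                        help="minimum prediction accuracy threshold (0 to 1)")
    parser.add_argument("-m", "--machine", action="store_true",
                        help="enable the machine intelligence module")
    parser.add_argument("-t", "--temporary", action="store_true",
                        help="keep temporary files")
    parser.add_argument("-c", "--clean", action="store_true",
                        help="clear the output folder before running")
    parser.add_argument("-s", "--schema", metavar="SCHEMA",
                        default="schema/manuals_v1.2.schema.json",
                        help="path to the JSON schema")
    parser.add_argument("-d", "--download", action="store_true",
                        help="download the Grundfos data set first")
    # Positional arguments come after the optional flags.
    parser.add_argument("input_folder", metavar="INPUT_FOLDER")
    parser.add_argument("output_folder", metavar="OUTPUT_FOLDER")
    return parser
```

With this sketch, `build_parser().parse_args(["-m", "-a", "0.9", "pdfs", "out"])` enables the machine intelligence module with a threshold of 0.9, while a value such as 1.5 is rejected with a usage error.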
Acknowledgement
The wrapper module uses another module created by the Nordjyske preprocessing group. The module and its documentation can be found at https://git.its.aau.dk/Knox/source-data-io.