Magnus Mølgaard Lund 098daa829b Update '' 2 years ago

Grundfos preprocessing

This project is part of the Knox multiproject and is located in layer one. The goal of this module is to extract and segment information from PDF documents provided by Grundfos.

The module extracts text, titles, images and tables from PDF files and writes the extracted information to a folder. In addition to a few utilities, the module contains three components:

  • The Text segmenter component recursively scans the document to find text and segment it into its correct sections or subsections.
  • The Miner component analyzes the lines in the document to find tables and figures.
  • The Inference component uses computer vision to find tables and images (as well as text, lists and titles if desired).
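As a rough illustration of how the Inference component's output might be filtered, the sketch below keeps only detections of the wanted kinds whose score reaches a minimum threshold. The README does not describe the model's output format, so the `Detection` class, its fields, and the `keep_confident` function are hypothetical names chosen for this example. The default threshold of 0.7 is the one documented for the -a flag.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one region found by the vision model."""
    label: str    # e.g. "table", "image", "text", "list", "title"
    score: float  # prediction score between 0 and 1
    page: int     # page number the region was found on


def keep_confident(
    detections: Iterable[Detection],
    min_score: float = 0.7,
    labels: FrozenSet[str] = frozenset({"table", "image"}),
) -> List[Detection]:
    """Keep detections with a wanted label and a score of at least min_score."""
    return [d for d in detections if d.label in labels and d.score >= min_score]
```

Passing a larger `labels` set (for example one that also contains "text", "list" and "title") corresponds to the "if desired" case described above.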


  1. Clone the repository using git clone
  2. Create a new virtual environment using your preferred tool, for example Conda. For Conda: conda create -n grundfos-preprocessing python=3.8 pip
  3. Activate the virtual environment. For Conda: conda activate grundfos-preprocessing
  4. Change to the root folder of the repository: cd grundfos-preprocessing
  5. Install required dependencies with pip: pip install -r requirements.txt
  6. To use the machine-intelligence component download the model from and place it in the classification folder.


To segment a document run the file in the root folder of the repository using the following command: python [FLAGS] INPUT_FOLDER OUTPUT_FOLDER

The INPUT_FOLDER must contain all of the PDF files to include in the segmentation. Subfolders are ignored. The OUTPUT_FOLDER must already exist, as the program does not create it. Optional arguments, indicated by [FLAGS], are written before INPUT_FOLDER and OUTPUT_FOLDER.
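The folder rules above can be checked before starting a run. The following is a minimal sketch, not the program's own code: the `collect_pdfs` helper is a hypothetical name, and it simply assumes that "PDF files" means files with a .pdf extension located directly in the input folder.

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_folder: str, output_folder: str) -> List[Path]:
    """Return the PDF files directly inside input_folder, sorted by name.

    Raises NotADirectoryError if either folder is missing, since the
    output folder is expected to exist before the program runs.
    """
    in_dir = Path(input_folder)
    out_dir = Path(output_folder)
    if not in_dir.is_dir():
        raise NotADirectoryError(f"Input folder not found: {in_dir}")
    if not out_dir.is_dir():
        raise NotADirectoryError(f"Output folder must already exist: {out_dir}")
    # iterdir() does not descend into subfolders, so nested PDFs are skipped.
    return sorted(
        p for p in in_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `collect_pdfs("manuals", "segmented")` would list the files a run is expected to pick up, or fail early if the output folder has not been created yet.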


The available flags are:

  • -h, --help Provides an overview of the flags and arguments available.
  • -a A, --accuraccy A Minimum threshold for the prediction accuracy used by the machine intelligence module. Value between 0 and 1. Default is 0.7.
  • -m, --machine Enable the machine intelligence module when running the program.
  • -t, -temporay Keep the temporary files created while running the program.
  • -c, --clean Clear the output folder before running the program.
  • -s SCHEMA, --schema SCHEMA Path to the JSON schema. Default is schema/manuals_v1.2.schema.json.
  • -d, --download Download the Grundfos data set before running the program.
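For readers who want to see how such a command line could be declared, here is a minimal sketch using Python's standard argparse module. It is an assumption about the structure, not the project's actual parser: the `build_parser` and `unit_interval` names and the `dest` attribute names are hypothetical, and the -a and -t options are declared by their short forms only. The defaults are taken from the list above, and argparse adds -h/--help automatically.

```python
import argparse


def unit_interval(value: str) -> float:
    """Parse a float and reject values outside the range 0 to 1."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is not between 0 and 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Segment Grundfos PDF documents.")
    parser.add_argument("-a", dest="accuracy", metavar="A", type=unit_interval,
                        default=0.7, help="minimum prediction accuracy")
    parser.add_argument("-m", "--machine", action="store_true",
                        help="enable the machine intelligence module")
    parser.add_argument("-t", dest="temporary", action="store_true",
                        help="keep temporary files")
    parser.add_argument("-c", "--clean", action="store_true",
                        help="clear the output folder first")
    parser.add_argument("-s", "--schema", metavar="SCHEMA",
                        default="schema/manuals_v1.2.schema.json",
                        help="path to the JSON schema")
    parser.add_argument("-d", "--download", action="store_true",
                        help="download the Grundfos data set first")
    # Positional arguments come after the optional flags on the command line.
    parser.add_argument("input_folder", metavar="INPUT_FOLDER")
    parser.add_argument("output_folder", metavar="OUTPUT_FOLDER")
    return parser
```

With this sketch, `build_parser().parse_args(["-m", "-a", "0.8", "manuals", "segmented"])` enables the machine intelligence module with a threshold of 0.8, while a value such as 1.5 is rejected with a usage error.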


The wrapper module uses another module created by the Nordjyske preprocessing group. The module and its documentation can be found at