Installation of graphic objects recognition server

Description and capabilities

This module is a software package installed on a separate server (virtual machine) alongside StaffCop Enterprise. It contains several engines that extract data from documents and detect graphic and other objects in them.

The following features are currently available:

  • detection of documents with stamps: scanned copies, screenshots, and photos matching seal samples;

  • detection of Russian Federation passports: scanned copies, screenshots, and photos of the main page of the passport;

  • face detection and creation of the corresponding notifications based on the recognition results;

  • efficient recognition of large volumes of text using multi-threaded recognition, suitable for processing a heavy document flow. Text detection works with any image that contains text: scanned copies, screenshots, photos, as well as container files (such as PDF files and ZIP archives).

System requirements for recognition server

  • OS: Ubuntu 18.04

  • Processor: Intel or AMD with AVX instruction support

  • Memory: from 2 GB per core

  • Disk space: from 20 GB to 50 GB
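The AVX requirement can be checked before installation. The following is a minimal sketch, assuming a Linux host that exposes CPU flags in /proc/cpuinfo; the function names are illustrative and are not part of StaffCop.

```python
def has_avx(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only the "flags" line is relevant; other lines may mention avx by chance.
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


def host_has_avx(path: str = "/proc/cpuinfo") -> bool:
    """Read the CPU description of the current host and check it for AVX."""
    with open(path, encoding="utf-8") as handle:
        return has_avx(handle.read())
```

Calling host_has_avx() on the future recognition server should return True before the module is installed.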

Detecting a stamp, a passport, or text (OCR with Tesseract 4) in a FullHD JPG image takes 4 to 6 seconds.

Face detection takes 5 to 60 seconds.

1 core, minutes   2 cores, minutes   3 cores, minutes   4 cores, minutes   Seconds per FullHD image   Number of images   Core multiplier
83.3              41.7               20.8               10.4               5                          1000               1
166.7             83.3               41.7               20.8               5                          2000               2
416.7             208.3              104.2              52.1               5                          5000               4
833.3             666.7              208.3              104.2              5                          10000              8
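The single-core figures above are consistent with multiplying the number of images by 5 seconds and converting to minutes. The sketch below reproduces that arithmetic; dividing by the number of cores is an assumption of ideal linear scaling, so treat multi-core results as rough estimates only. The function name is illustrative.

```python
def estimate_minutes(images: int, seconds_per_image: float = 5.0, cores: int = 1) -> float:
    """Estimate processing time in minutes.

    Assumes every image takes the same time and that work is split
    evenly across cores (ideal scaling, which real servers rarely reach).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return images * seconds_per_image / 60.0 / cores


# 1000 FullHD images at 5 seconds each on one core
print(round(estimate_minutes(1000), 1))
```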

Server-side module installation

StaffCop graphic analyzer requires the following additional packages to be installed:

sudo apt update
sudo apt -y install software-properties-common python3.7 libpoppler-cpp0v5 poppler-utils libsm6 python3.7-venv tesseract-ocr
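After the packages are installed, it can be useful to confirm that the expected commands are on the PATH. This is a minimal sketch: the command names (python3.7, tesseract from tesseract-ocr, pdftotext from poppler-utils) are the ones these packages normally provide, and the helper name is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Commands normally provided by python3.7, tesseract-ocr and poppler-utils.
REQUIRED_COMMANDS = ("python3.7", "tesseract", "pdftotext")


def find_missing(
    commands: Iterable[str] = REQUIRED_COMMANDS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the commands that cannot be found on the PATH."""
    return [name for name in commands if which(name) is None]
```

An empty list from find_missing() means all three commands were found.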

Add the StaffCop repository to apt:

wget -O - http://distr.staffcop.su/stable4.8/staffcop.gpg | sudo apt-key add -
echo "deb http://distr.staffcop.su/stable4.8 stable4.8 non-free" | sudo tee /etc/apt/sources.list.d/staffcop.list
sudo apt update

Then install the recognition server module package:

sudo apt install staffcop-cpservice

Note

The package is about 800 MB, so downloading it may take considerable time over a slow internet connection.

At this point the software installation is complete.

Customization of StaffCop Enterprise server side

Before configuring the «Content analysis» module, enable access to the API server in the StaffCop Enterprise server settings, because this access is disabled by default.

To do this, open the server settings page in the web interface (select the «Control panel» menu item, then select «Server settings»):

The server parameters page opens. Click the «API access is enabled» option.

The page for editing this parameter opens. Set the required value and press the «Save» button. As a result, the value shown next to «API access is enabled» changes to «Yes».

Next, go to the Filters tab and open «Policies -> System policies -> Recognition server: Content processing», then set the following options:

  • API address and port, for example http://cpservice.atom.local 9090, where cpservice.atom.local is the domain name or IP address of the module server and 9090 is the active port specified during module installation.

  • Number of Cores should match the number of cores of the module server.

  • Batch size: the smaller the value (the minimum recommended value is 100), the more often recognized data and triggered policies appear in the interface, and the more often policy processing sends requests to the database. We recommend a value from 1000 to 10000 for normal performance.

To start the recognition server, check the «Policy is activated» checkbox.
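If the policy does not produce results, one thing worth checking is whether the module port is reachable from the StaffCop Enterprise server. The sketch below only tests that a TCP connection can be opened; it says nothing about whether the API itself answers correctly. The host name and port are the example values from this section, and the function name is illustrative.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Example values from this section; replace with your own module server.
    print(is_port_open("cpservice.atom.local", 9090))
```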

../_images/ocr_1.png

The customization of «Content processing» module on recognition server

Open an SSH console on the recognition server with one of the commonly used applications, such as PuTTY.

The «Content processing» module is configured by editing its text configuration file with the nano text editor.

Open the configuration file in nano:

sudo nano -c /etc/staffcop/cpservice-config

PORT = 9090
SERVER_ADDR = 'http://192.168.1.x'
SECRET = 'xxxxxxxxxxxxxxxx'

Set SERVER_ADDR to the address of your StaffCop Enterprise server and SECRET to your API key, which is shown on the «Server settings» page (see above).

After changing these parameters, exit the editor by pressing Ctrl-X, then confirm saving the file by pressing Y and Enter.
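Before restarting the service, the edited values can be sanity-checked. The following sketch assumes the file contains only simple NAME = literal lines like the three shown above; the real file may contain other constructs, and the checks themselves (port range, URL scheme, non-empty secret) are illustrative, not rules taken from StaffCop.

```python
import ast


def parse_config(text: str) -> dict:
    """Parse NAME = literal lines into a dictionary (assumed file format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not an assignment: " + line)
        # literal_eval accepts numbers and quoted strings without running code.
        settings[name.strip()] = ast.literal_eval(value.strip())
    return settings


def check_config(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    port = settings.get("PORT")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("PORT must be an integer between 1 and 65535")
    addr = settings.get("SERVER_ADDR")
    if not isinstance(addr, str) or not addr.startswith(("http://", "https://")):
        problems.append("SERVER_ADDR must start with http:// or https://")
    if not settings.get("SECRET"):
        problems.append("SECRET must not be empty")
    return problems
```

For example, check_config(parse_config(open("/etc/staffcop/cpservice-config").read())) returns an empty list when none of these checks fail.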

All required settings for the recognition server are now specified. The next step is to restart the service with the following command:

sudo service staffcop-cpservice restart

Module configuration is now complete.

Content processing policies

Policies are configured through the StaffCop server web interface.

Text recognition

This policy is located in the Filters tab. Select the menu item «Policies -> System policies -> OCR».

In the recognition settings, select the «Content processing» menu item. Make sure to save the changes.

../_images/ocr_2.png

Recognition of stamps

«Content processing» can detect stamps in images and documents (jpg, png, pdf) according to specified samples. Only round stamps are currently supported; a document can contain several different stamps.

This policy is created through the «Create -> Recognition of Stamps» menu item in the upper left corner.

Stamp samples are required: image fragments containing the target stamps. Surrounding text and signatures do not interfere. Several samples (from 3 to 10) are preferable, as this improves recognition quality. A suitable sample size is 400x400 px.

Note

Ideally, stamp samples should differ in tilt angle, pressure, and other variations, without obvious defects. A stamp sample should not contain fragments of other stamps.

The policy is designed to detect documents bearing a specified stamp. When creating the policy, specify one or several different stamp samples; each sample must be an image in jpg or png format and contain only one stamp.

Add the selected samples to the policy, then make sure the module recognizes the stamps in them: a check mark should appear in the «Status» column for each sample.

Also set up a filter for the policy: for example, specify the content type (pdf, jpg, png) and the event type (intercepted files/screenshots). Without a filter the policy will not function. To activate the policy, check the «Policy is enabled» box. Detection results will be available in the «Triggered filter» tab.

../_images/ocr_3.png

False triggering

This most often happens with similar stamps. The workaround is to decrease the STAMP_RECOGNITION_THRESHOLD value: in settings.py in /etc/config/staffcop, change STAMP_RECOGNITION_THRESHOLD = 0.6 to 0.5 or less.

STAMP_RECOGNITION_THRESHOLD = 0.5

Note

False triggering can happen when a company uses similar stamps, or when stamps differ only in department names or division codes. In this case, we suggest contacting technical support for assistance.

Graphical objects recognition

This policy stands apart from the others because it supports several kinds of data processing (Russian passports, Stamped documents, Faces). Each type is described below.

../_images/ocr_4.png

Russian passport

«Recognition server» can detect spreads of the main pages of RF passports (the page with the issuing information plus the page with the photo of the passport owner) in PDF files and images (png, jpeg, including screenshots).

To activate this functionality, create a policy via «Detection of graphic objects → Passport (Russian Federation)». Select the filter criteria in the «Filter» tab (event type, content type, etc.).

Once the filter detects events with passports, events marked with the name of the policy created earlier appear in the «Triggered filter» tab (for example, in the «Graphical objects recognition» tab).

False triggering

A false positive occurs when an event marked with the policy name appears in the «Triggered filter» tab although it does not contain a passport. This is unlikely, but if it happens, report it to the developers for further analysis. In most cases the cause is an image of a similar document. If false triggering happens repeatedly, increase the PASSPORT_THRESHOLD value by adding the following string to the «Content processor» configuration file /etc/staffcop/cpservice-config:

PASSPORT_THRESHOLD = 0.7

Stamped documents

The «Content processor» module can detect round seals (stamps) in PDF files and images (png, jpeg, including screenshots). To detect stamps, create a corresponding policy via «Detection of graphic objects → Recognition of Stamps», then go to the «Filter» tab and select the filter criteria (event type, content type, etc.).

Once the filter detects events with stamps, events marked with the name of the policy created earlier appear in the «Triggered filters» tab (for example, in the «Graphical objects recognition» tab).

False triggering

False triggering occurs when an event marked with the policy name appears in the «Triggered filter» tab although it does not contain a stamp. This can happen if an image contains round items, such as an avatar or a company logo, which the module may mistake for stamps. It generally happens when screenshots fall under the filter. Such a scenario is rare, but if it happens, contact technical support for assistance. Alternatively, increase the STAMP_THRESHOLD value by adding the following string to the «Content processor» configuration file /etc/staffcop/cpservice-config:

STAMP_THRESHOLD = 0.6

Note

The higher this value, the stricter the detection threshold and the fewer images fall under the filter. If necessary, you can raise the value further; in that case, however, the module will skip some doubtful or ambiguous seals.
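The effect of raising a threshold can be illustrated with a toy example. This is not the module's actual algorithm: how detection scores are computed, and whether the comparison is inclusive, is not documented here, so the scores and the comparison below are purely hypothetical.

```python
from typing import Iterable, List


def filter_detections(scores: Iterable[float], threshold: float) -> List[float]:
    """Keep only the hypothetical detection scores that reach the threshold."""
    return [score for score in scores if score >= threshold]


# Hypothetical confidence scores for four candidate stamps.
candidates = [0.45, 0.55, 0.65, 0.90]
print(filter_detections(candidates, 0.6))  # fewer candidates pass than at 0.5
print(filter_detections(candidates, 0.7))  # a higher threshold passes fewer still
```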

Faces

«Content processor» can detect faces in webcam snapshots. Detection results are recorded as corresponding alerts: No face, Own face, Unknown face, Several faces. Own and unknown faces are distinguished based on images previously assigned in the web console.

Create a new policy via the «Graphical objects recognition» tab → «Faces». Select the filter criteria (event type «Webcam snapshot», content type (jpg, png), etc.) in the «Filter» tab. Both «Content processor» and the corresponding policy must be fully configured and working.

After the first recognition results become available in the web console, select a full-face (facing the camera) webcam snapshot with suitable lighting in the Event table. Then move the mouse cursor inside the frame around the face and click this area. The «This face belongs to the user?» dialog appears; answer «Yes» if the face corresponds to the user account. This setup needs to be done only once for every user account.

Note

If you answer «No», the previously assigned match will be disabled! Use the «Close» button if the dialog was opened by accident or you decided not to make changes.

Changes in face assignments apply only to new events. Make sure that new snapshots are correctly associated with the faces shown in them.

Results can be seen in the «Constructor» tab of the web console in the form of Alerts:

  • No face: faces are not detected.

  • Own face: a face has been detected and identified as the user currently working at the PC.

  • Unknown face: a face has been detected but not identified, i.e. it does not match any of the faces available in the database.

  • Wrong face: the face matches one of the faces in the database, but the detected face is not the current PC user.

  • Several faces: several faces are present in the snapshot, regardless of whom they belong to.

  • No snapshots: snapshot is unreadable, darkness on snapshot, lack of sharpness.
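The alert types above can be summarized as a small decision function. This is an illustrative sketch only: the order of the checks and the input representation (one entry per detected face, holding the matched user name or None) are assumptions, not the module's documented logic.

```python
from typing import Optional, Sequence


def classify_snapshot(
    readable: bool,
    matched_users: Sequence[Optional[str]],
    current_user: str,
) -> str:
    """Map hypothetical detection results to one of the alert names."""
    if not readable:
        return "No snapshots"   # unreadable, dark or blurred snapshot
    if len(matched_users) == 0:
        return "No face"
    if len(matched_users) > 1:
        return "Several faces"  # regardless of whom the faces belong to
    owner = matched_users[0]
    if owner is None:
        return "Unknown face"   # no match in the database
    return "Own face" if owner == current_user else "Wrong face"
```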

../_images/ocr_5.png

False triggering

Keep in mind that a face will not be recognized if the head is turned, the face is at an awkward angle, or it is covered with a hand. In some rare cases a face mismatch is possible. To improve recognition results, decrease the FACE_DETECT_THRESHOLD value by adding the following string to /etc/staffcop/config:

FACE_DETECT_THRESHOLD = 0.5

Note

If necessary, you can decrease this value further, down to 0.4; note, however, that the lower the value, the fewer faces will be recognized.

Recognition of images, other than camera snapshots

This functionality was originally designed for webcams, but other images and intercepted files can be recognized as well. To enable this, go to the «Face recognition» policy filter and select the appropriate criteria. Add the following entry to /etc/staffcop/config:

After that, make sure to restart StaffCop:

sudo service staffcop restart

Tracking of Content processor in log files

On the module side, view the log /var/log/staffcop-cpservice.err. A request with its options is logged as follows:

2020-09-10 12:19:39,065 [DEBUG] cp_server:112 Request for 2020_09_10/ae4cd000abaecdaf46eec3d3ac90750d327e688a.jpe : text_extraction face_detection

where text_extraction and face_detection are the options requested for this file (text extraction and face detection).

Processing results:

2020-09-10 12:24:20,125 [DEBUG] cp_server:127 Response for 2020_09_10/9ade404783b02bff8741ed1632ffbf63d883c64e.jpe done in 0:01:04.814513: "document_class": undetected, "face": {'size': {'width': 640, 'height': 480}, 'bounds': [{'top': 306, 'right': 381, 'bottom': 476, 'left': 211}], 'vectors': '...'}, "extracted_text": "

The response shows the document type, sample face detection and text extraction results, and the processing time.
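Response entries like the one above can be scanned to find slow files. The sketch below extracts the file path and processing time with a regular expression; the pattern is derived only from the single sample line in this section, so other log variants may not match. The function name is illustrative.

```python
import re
from datetime import timedelta
from typing import Optional, Tuple

RESPONSE_PATTERN = re.compile(
    r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3} \[\w+\] \S+ "
    r"Response for (?P<path>\S+) done in "
    r"(?P<hours>\d+):(?P<minutes>\d\d):(?P<seconds>\d\d(?:\.\d+)?)"
)


def parse_response(line: str) -> Optional[Tuple[str, timedelta]]:
    """Return (file path, processing time) or None if the line does not match."""
    match = RESPONSE_PATTERN.match(line)
    if match is None:
        return None
    duration = timedelta(
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=float(match.group("seconds")),
    )
    return match.group("path"), duration
```

Applied to the sample entry, this yields the .jpe path and a duration of just under 65 seconds.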

On the StaffCop side, the log /var/log/staffcop/process.log shows the start and end time of event processing:

2020-09-10 12:19:38,877 [INFO] graphic_objects_detector:152 Graphic objects: process range 53793 - 53840
...
2020-09-10 12:26:14,148 [INFO] graphic_objects_detector:212 Graphic objects: finished at 53840. Time: 0:06
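The time spent on a range of events can be computed from the timestamps of the start and finish entries. This is a minimal sketch, assuming every log line begins with a 23-character timestamp in the format shown above; the function names are illustrative.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def elapsed(start_line: str, end_line: str) -> timedelta:
    """Return the time between two log entries."""
    return log_timestamp(end_line) - log_timestamp(start_line)


start = "2020-09-10 12:19:38,877 [INFO] graphic_objects_detector:152 Graphic objects: process range 53793 - 53840"
end = "2020-09-10 12:26:14,148 [INFO] graphic_objects_detector:212 Graphic objects: finished at 53840. Time: 0:06"
print(elapsed(start, end))
```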

Add-on for face detection

Based on the results of internal testing, the following enhancements have been introduced:

  • A check for a «muted camera» has been added: it detects a closed or covered webcam lens, a camera with poor sharpness, and so on. Problems can occur with images that are hard for the computer to make out, dark images, and the like. If «No snapshot» occurs constantly, contact the development team.

  • For correct face detection, the face area should be at least 2.5% of the image area (distant faces will not be detected).

  • When several faces are detected, their sizes must not differ by more than a factor of two, so that distant faces are not picked up as well. Faces that differ in size by more than a factor of two are disregarded.

  • Faces should not overlap. The center of the smaller face should not lie inside the frame of the bigger one; otherwise the smaller face is disregarded.

  • Identical faces assigned to one user account are detected only once. Redundant detections are not processed.

  • The face detection algorithm has been switched to HOG, which is faster and more efficient. Previously the CNN algorithm was used.
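Two of the geometric rules above (the 2.5% minimum face area and the overlap rule) can be sketched as follows. The code uses the bounds format seen in the module log ('top', 'right', 'bottom', 'left', plus an image 'size'); it illustrates the rules as described and is not the module's actual implementation. The function names are hypothetical.

```python
def face_area_fraction(bounds: dict, size: dict) -> float:
    """Return the share of the image area covered by the face frame."""
    width = bounds["right"] - bounds["left"]
    height = bounds["bottom"] - bounds["top"]
    return (width * height) / (size["width"] * size["height"])


def is_large_enough(bounds: dict, size: dict, minimum: float = 0.025) -> bool:
    """Apply the 2.5% minimum face area rule."""
    return face_area_fraction(bounds, size) >= minimum


def center_inside(inner: dict, outer: dict) -> bool:
    """Return True if the center of the inner frame lies inside the outer frame."""
    center_x = (inner["left"] + inner["right"]) / 2
    center_y = (inner["top"] + inner["bottom"]) / 2
    return (outer["left"] <= center_x <= outer["right"]
            and outer["top"] <= center_y <= outer["bottom"])


# Values taken from the sample log entry: a 170x170 frame in a 640x480 image.
size = {"width": 640, "height": 480}
face = {"top": 306, "right": 381, "bottom": 476, "left": 211}
print(is_large_enough(face, size))
```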