Why we built a replacement for legacy document search systems

Since the late 2000s we have been automating processes for the security departments of large companies. In almost every company, one of the security department's key tasks is vetting potential clients and partners for reliability. Vetting involves regularly searching for information about companies and people across a vast array of textual data. This array comprised (and still comprises) tens of millions of documents in different formats and from different sources: certificates, reports, and statements in pdf, doc, xls, and txt, sometimes also scans in pdf, tiff, etc. In short, quickly finding information about any company or person in this dataset is critically important for the business.


We have come a long way from using dtSearch to building our own solution. In this article I want to share our experience.


To automate the vetting process we used software of our own, but for full-text search over documents we relied on dtSearch. A few words about that choice (made in 2010; it stayed with us until autumn 2016):


  • We compared Cross, Copernic, Archivarius, dtSearch, and several exotic solutions
  • Query-speed comparisons on large volumes of data showed a clear winner: dtSearch
  • dtSearch had the most advanced query syntax at the time, which let us implement all the fine details of information retrieval
  • The dtSearch API is a C# library, which we used to integrate the engine into our system; not the most convenient option, but the most acceptable one at the time

What happened


As the years passed, our system evolved, and dtSearch gradually became a bottleneck and a source of problems:


  • The volume of information kept growing while search speed kept falling; by the end of 2016 some queries took 5 minutes, an absolutely unacceptable figure
  • dtSearch cannot recognize scanned documents (no OCR), and such documents were becoming more and more common, so we were losing a lot of information
  • dtSearch incorrectly indexed files in the CP866 encoding
  • dtSearch does not always tokenize phrases, numbers, dates, and words correctly, which can lose information, for example when searching for compound names or phone numbers
  • Our system gradually moved from an ASP.NET MVC/C#/MSSQL stack to a more modern React/Node.js/Python/ElasticSearch/MongoDB one, while dtSearch can be integrated only through its C++ or C# API, forcing us to build complex glue code (we wanted REST)
  • The dtSearch indexer required a full Windows Server
  • dtSearch cannot run as a cluster, which matters at large volumes; we had to keep a very beefy machine just for dtSearch

The list goes on, but everything else is trivial compared with the problems above.
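To make the tokenization problem concrete: a word-oriented tokenizer splits a phone number into meaningless fragments, so a search for the full number finds nothing. The sketch below is illustrative, not dtSearch's or our actual code; it contrasts a naive tokenizer with a hypothetical pattern-aware one that keeps phone-number-like sequences whole.

```python
import re

def naive_tokenize(text):
    # Split on anything that is not a letter or digit -- roughly what a
    # word-oriented tokenizer does. This shreds phone numbers into pieces.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def pattern_aware_tokenize(text):
    # Hypothetical sketch: recognize phone-number-like runs first, emit each
    # as a single normalized token, then fall back to plain words.
    phone = r"\+?\d[\d\-\s()]{6,}\d"
    tokens, pos = [], 0
    for m in re.finditer(phone, text):
        tokens += naive_tokenize(text[pos:m.start()])
        tokens.append(re.sub(r"\D", "", m.group()))  # keep digits only
        pos = m.end()
    tokens += naive_tokenize(text[pos:])
    return tokens

text = "Call +7 (495) 123-45-67"
print(naive_tokenize(text))          # ['Call', '7', '495', '123', '45', '67']
print(pattern_aware_tokenize(text))  # ['Call', '74951234567']
```

With the naive scheme a query for the whole number has to match six separate fragments in order; with the pattern-aware scheme it is a single exact token.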


At some point we realized we could not go on like this and needed either to find an alternative or to build our own solution. Unfortunately, the search for alternatives turned up nothing sensible: the products that had existed in 2010 had not advanced much, and the newcomers (LucidWorks Fusion, SearchInform, etc.) did not impress us.


Next we considered building a full-text search module for our system on top of Apache Tika plus ElasticSearch or Apache Solr, which would have solved the problem in general. However, we kept being nagged by the thought that the market still had no good solution combining fast search, OCR, and an intuitive interface.
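The Tika + ElasticSearch combination works roughly like this: Tika extracts plain text from a document, ElasticSearch indexes it, and queries go over REST. A minimal sketch of the query side, assuming a hypothetical `documents` index with a `text` field; `fuzziness: AUTO` is the standard Elasticsearch knob for typo tolerance, which matters when the indexed text comes from OCR (the HTTP call itself is shown but not executed):

```python
import json

def fuzzy_search_body(term, field="text"):
    # Build an Elasticsearch "match" query with fuzziness enabled, so that
    # OCR noise and typos still produce hits.
    return {
        "query": {
            "match": {
                field: {"query": term, "fuzziness": "AUTO"}
            }
        }
    }

body = fuzzy_search_body("Acme Holdings")
print(json.dumps(body, indent=2))

# Against a running cluster it would be posted like this (not executed here):
# import requests
# r = requests.post("http://localhost:9200/documents/_search", json=body)
# hits = r.json()["hits"]["hits"]
```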


So, without further hesitation, we decided to create our own open-source solution that would make everyone's life easier. That is how Ambar was born.


Ambar: a full-text document search system


The Ambar interface


The release took place in January 2017. Soon after, we launched Ambar for our first major client.


The main things worth knowing about our system:


  • Super-fast search with language-aware features such as fuzzy search; a query takes about a hundred milliseconds over more than ten million files
  • A simple and intuitive interface for search and administration
  • Support for all common (and not so common) file formats, plus data deduplication
  • The best pdf parsing on the market, with smart page-type detection (scan vs. text)
  • Advanced OCR
  • An advanced full-text parser, so you no longer lose information to incorrect tokenization of dates, phone numbers, etc.
  • A simple REST API, making integration with anything easy
  • The option of using the cloud version or installing on your own hardware
  • When installed on your own hardware, it can run as a cluster and scale to petabytes of data
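To illustrate what a REST API buys you compared with C++/C#-only bindings: any language with an HTTP client can talk to the engine. The endpoint path and parameter names below are hypothetical, for illustration only, and are not Ambar's actual API; consult the project documentation for the real routes.

```python
# Hypothetical REST search call; the endpoint and parameters are illustrative.
from urllib.parse import urlencode

def build_search_url(base_url, query, size=10):
    # Compose a plain GET URL -- no native bindings or SDK required,
    # which is exactly the integration story we wanted from dtSearch.
    params = urlencode({"query": query, "size": size})
    return f"{base_url}/api/search?{params}"

url = build_search_url("http://localhost:8080", "John Smith")
print(url)  # http://localhost:8080/api/search?query=John+Smith&size=10
# Fetch it with any HTTP client, e.g.: requests.get(url).json()
```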

In the near future we plan to add the ability to read and index email contents, and to start developing the analytical side of the system by adding named entity recognition (names, addresses, document numbers, identification numbers, phone numbers).


Project description and contacts


Project page on GitHub


Our blog, where we share interesting facts and achievements


Thank you!

Article based on information from habrahabr.ru
