Why we built a replacement for legacy document search systems

Since the late 2000s we have been automating processes for the security departments of large companies. In almost every company, one of the security department's key tasks is vetting potential clients and partners for reliability. Vetting involves regularly searching for information about companies and people across a vast array of textual data. This array comprised (and still comprises) tens of millions of documents in different formats and from different sources: certificates, reports, and statements in pdf, doc, xls, and txt, sometimes also scans in pdf, tiff, etc. In short, quickly finding information about any company or person in this dataset is critically important for the business.


We have come a long way from using dtSearch to building our own solution. In this article I want to share our experience.


To automate the vetting process we used software of our own, but for full-text search over documents we relied on dtSearch. A few words about that choice (made in 2010; it stayed with us until autumn 2016):


  • We compared Cross, Copernic, Archivarius, dtSearch, and several exotic solutions
  • Query-speed comparisons on large volumes of data showed a clear winner: dtSearch
  • dtSearch had the most advanced query syntax at the time, which let us implement all the fine details of information retrieval
  • The dtSearch API is a C# library, which we used to integrate the engine into our system; not the most convenient option, but the most acceptable one at the time

What happened


As the years passed, our system evolved, and dtSearch gradually became a bottleneck and a source of problems:


  • The volume of information kept growing while search speed kept falling; by the end of 2016 some queries took 5 minutes, an absolutely unacceptable figure
  • dtSearch cannot recognize scanned documents (no OCR), and such documents were becoming more and more common, so we were losing a lot of information
  • dtSearch incorrectly indexed files in the CP866 encoding
  • dtSearch does not always tokenize phrases, numbers, dates, and words correctly, which can lose information, for example when searching for compound names or phone numbers
  • Our system gradually moved from an ASP.NET MVC/C#/MSSQL stack to a more modern React/Node.js/Python/ElasticSearch/MongoDB one, while dtSearch can be integrated only through its C++ or C# API, forcing us to build complex glue code (we wanted REST)
  • The dtSearch indexer required a full Windows Server
  • dtSearch cannot run as a cluster, which matters at large volumes; we had to keep a very beefy machine just for dtSearch

The list goes on, but everything else is trivial compared with the problems above.
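To make the tokenization problem concrete: a word-oriented tokenizer splits a phone number into meaningless fragments, so a search for the full number finds nothing. The sketch below is illustrative, not dtSearch's or our actual code; it contrasts a naive tokenizer with a hypothetical pattern-aware one that keeps phone-number-like sequences whole.

```python
import re

def naive_tokenize(text):
    # Split on anything that is not a letter or digit -- roughly what a
    # word-oriented tokenizer does. This shreds phone numbers into pieces.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def pattern_aware_tokenize(text):
    # Hypothetical sketch: recognize phone-number-like runs first, emit each
    # as a single normalized token, then fall back to plain words.
    phone = r"\+?\d[\d\-\s()]{6,}\d"
    tokens, pos = [], 0
    for m in re.finditer(phone, text):
        tokens += naive_tokenize(text[pos:m.start()])
        tokens.append(re.sub(r"\D", "", m.group()))  # keep digits only
        pos = m.end()
    tokens += naive_tokenize(text[pos:])
    return tokens

text = "Call +7 (495) 123-45-67"
print(naive_tokenize(text))          # ['Call', '7', '495', '123', '45', '67']
print(pattern_aware_tokenize(text))  # ['Call', '74951234567']
```

With the naive scheme a query for the whole number has to match six separate fragments in order; with the pattern-aware scheme it is a single exact token.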


At some point we realized we could not go on like this and needed either to find an alternative or to build our own solution. Unfortunately, the search for alternatives turned up nothing sensible: the products that had existed in 2010 had not advanced much, and the newcomers (LucidWorks Fusion, SearchInform, etc.) did not impress us.


Next we considered building a full-text search module for our system on top of Apache Tika plus ElasticSearch or Apache Solr, which would have solved the problem in general. However, we kept being nagged by the thought that the market still had no good solution combining fast search, OCR, and an intuitive interface.
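The Tika + ElasticSearch combination works roughly like this: Tika extracts plain text from a document, ElasticSearch indexes it, and queries go over REST. A minimal sketch of the query side, assuming a hypothetical `documents` index with a `text` field; `fuzziness: AUTO` is the standard Elasticsearch knob for typo tolerance, which matters when the indexed text comes from OCR (the HTTP call itself is shown but not executed):

```python
import json

def fuzzy_search_body(term, field="text"):
    # Build an Elasticsearch "match" query with fuzziness enabled, so that
    # OCR noise and typos still produce hits.
    return {
        "query": {
            "match": {
                field: {"query": term, "fuzziness": "AUTO"}
            }
        }
    }

body = fuzzy_search_body("Acme Holdings")
print(json.dumps(body, indent=2))

# Against a running cluster it would be posted like this (not executed here):
# import requests
# r = requests.post("http://localhost:9200/documents/_search", json=body)
# hits = r.json()["hits"]["hits"]
```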


So, without further hesitation, we decided to create our own open-source solution that would make everyone's life easier. That is how Ambar was born.


Ambar: a full-text document search system


The Ambar interface


The release took place in January 2017. Soon after, we launched Ambar for our first major client.


The main things worth knowing about our system:


  • Super-fast search with language-aware features such as fuzzy search; a query takes about a hundred milliseconds over more than ten million files
  • A simple and intuitive interface for search and administration
  • Support for all common (and not so common) file formats, plus data deduplication
  • The best pdf parsing on the market, with smart page-type detection (scan vs. text)
  • Advanced OCR
  • An advanced full-text parser, so you no longer lose information to incorrect tokenization of dates, phone numbers, etc.
  • A simple REST API, making integration with anything easy
  • The option of using the cloud version or installing on your own hardware
  • When installed on your own hardware, it can run as a cluster and scale to petabytes of data
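To illustrate what a REST API buys you compared with C++/C#-only bindings: any language with an HTTP client can talk to the engine. The endpoint path and parameter names below are hypothetical, for illustration only, and are not Ambar's actual API; consult the project documentation for the real routes.

```python
# Hypothetical REST search call; the endpoint and parameters are illustrative.
from urllib.parse import urlencode

def build_search_url(base_url, query, size=10):
    # Compose a plain GET URL -- no native bindings or SDK required,
    # which is exactly the integration story we wanted from dtSearch.
    params = urlencode({"query": query, "size": size})
    return f"{base_url}/api/search?{params}"

url = build_search_url("http://localhost:8080", "John Smith")
print(url)  # http://localhost:8080/api/search?query=John+Smith&size=10
# Fetch it with any HTTP client, e.g.: requests.get(url).json()
```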

In the near future we plan to add the ability to read and index email contents, and to start developing the analytical side of the system by adding named entity recognition (names, addresses, document numbers, identification numbers, phone numbers).


Project description and contacts


Project page on GitHub


Our blog, where we share interesting facts and achievements


Thank you!

Article based on information from habrahabr.ru
