The search on the site with his own hands

Probably many ever wondered how to do a search on the website? Of course, for large sites with lots of content search is simply irreplaceable thing. In most cases, the user first visit Your website in search of something important, will be to understand the navigation panels, dropdown menus and other navigation elements, and in haste will try to find something similar to the search string. And if that luxury on the website will not appear, or it will fail with the search request, the visitor will simply close the tab. But the article is not about the value of site search and not about the psychology of the visitors. I'll tell you how to implement a small algorithm full-text search, which will save beginners a headache.

The reader may raise the question: why write everything from scratch, if everything is already written? Yes, major search engines have APIs, there are cool projects like Sphinx and Apache Solr. But each of these solutions has its advantages and disadvantages. Using the services of search engines such as Google and Yandex, You will get many buns, such as a powerful morphological analysis, correction of typos and errors in the request, the recognition of the wrong keyboard layout, but the fly in the ointment here will not do. First, the search is not integrated in the site structure — it is external, and You will not be able to tell it what data is most important and which are not. Second, the contents of the website are only indexed with a certain interval, which depends on the search engine, so if the website anything will be updated, have to wait for the moment when these changes will be in the index and will be available in the search. From Sphinx and Apache Solr case with the integration and indexing much better, but not every web hosting will allow you to run.

Nothing prevents to write a search engine yourself. It is assumed that the site is running on PHP in conjunction with some database server such as MySQL. Let's first define what is required from the search on the site?

the

the language of morphology. regardless of the case, the end and
other virtues of the great and mighty language search should find what you need
user. In other words, "apples", "Apple", "apples" are forms of a single
the word "Apple" that need to be considered in the search algorithm. One way
achieve this goal is to bring each word of the search query and words
the content of the site to the base form.

option to specify the search context. That is, the ability to choose
the content of the website within which you want to run the search algorithm, and to determine
the significance for each of the limits. For example, consider an online store. It is assumed,
what is the search query most often will contain the name of the desired product, so a search for
the names of the items will have the highest priority. As a next priority can be
choose search by properties of the goods, and then search by description.br>

Indexing site content. Imagine the situation: about 30 people
to perform search queries. The server accepts each connection, flow control,
is passed to the PHP interpreter. With each request re-initializes the search
engine re-perebivaetsya the contents of the site... it is Difficult to say how much time and
resources required to handle all these requests. It is in order to not
to do the same job a hundred times, was invented by the indexing technology.
Indexing occurs only when changing or adding the content of the site,
and the search is already on the index and not by content.

Mechanism of ranking. the Ranking of search results — sorting search results, that is based on the assessment of the significance of data found. For example, in some blog runs the search query "space". This word contains two articles: the first 16 times, second 5 times. Most likely, the first article will be of greater value to the initiator of search. Also every variety of website content when indexing is set to a specific ratio, which will affect its position in search results.

Now a few words about what we have to do:
the

morphological analyzer
the ranking algorithm,
indexing algorithm,
the search algorithm.

At the end of the article shows a sample implementation of search the example of a simple online store. Those who are too lazy to learn it and just want a finished search engine, you can safely pick up the engine from GitHub FireWind.

Principle of operation

From the backend search works like this:
the

contents of the website is indexed
the user sends the request
from the query specifically excludes the official part of speech,
the resulting string is split into an array of translated words in the base form,

search each word of the resulting array is in the index, the

search results are ranked, sorted and given to the user.

Preparation

The task is set, you can now get down to business. I use Linux as a working OS, but will try to use its more exotic features, the Windows was able to "collect" a search engine by analogy. All You need is a basic knowledge of PHP and ability to deal with MySQL. Go!

Our project will consist of a core, where will be gathered all the vital functions and the module of morphological analysis and word processing. Let's create the root folder of the project firewind and it will create the file core.php it will be the core.
the

$ mkdir firewind
$ cd firewind
$ touch core.php

Now armed with your favorite text editor and prepare the frame:
the

<?php
class firewind {
public $VERSION = "1.0.0";

function __construct() {
// Initializer //
}
}
?>

Here we have created a main class which can be used on Your sites. At this preparatory part is over, it's time to move on.

Morphological analyzer

Russian language is a pretty complicated thing, which is very diverse and shocking aliens designs, such as "Yes-no-maybe". To teach the machine to understand it, and any other language is not an easy task. The most successful in terms of search companies such as Google and Yandex, who constantly improve their algorithms and keep them secret. We'll have to do something different, easier. Fortunately, the wheel is to invent it is not necessary — everything is already done for us. Meet phpMorphy morphological analyzer that supports Russian, English and German. More detailed information can be obtained here, however, we are only interested in two possibilities: lemmatization, which is the basic form of the word, and receiving grammatical information about the word (gender, number, case, part of speech, etc.).

Need a library and a dictionary for her. All this stuff can be found here. The library is located in the same folder "phpmorphy", dictionaries are located in the "phpmorphy-dictionaries". Download the latest version in the project's root folder and unpack:
the

# Unpack the library
$ unzip phpmorphy-0.3.7.zip
$ mv phpmorphy-0.3.7 phpmorphy

# Unpack the dictionary in the phpmorphy/dicts
$ unzip morphy-0.3.x-ru_RU-withjo-utf-8.zip -d phpmorphy/dicts/

# Delete original archives
$ rm phpmorphy-0.3.7.zip morphy-0.3.x-ru_RU-withjo-utf-8.zip

Great! The library is ready to use. It's time to write a "wrapper" that hides the work with phpMorphy. To do this, create another file morphyus.php in the root directory:
the

<?php
require_once __DIR__.'/phpmorphy/src/common.php';

class morphyus {
private $phpmorphy = null;
private $regexp_word = '/([a-zа-H0-9]+)/ui';
private $regexp_entity = '/&([a-zA-Z0-9]+);/';

function __construct() {
$directory = __DIR__.'/phpmorphy/dicts';
$language = 'ru_RU';
$options[ 'storage' ] = PHPMORPHY_STORAGE_FILE;

// Initialize libraries //
$this- > phpmorphy = new phpMorphy( $directory, $language, $options );
}

/**
* Divides the text into array of words
*
* @param {string} content the Original text to highlight the words
* @param {boolean} filter enables the filtering of HTML tags and entities

*/
public function get_words( $content, $filter=true ) {
// Filter HTML tags and HTML entities //
if ( $filter ) {
$content = strip_tags( $content );
$content = preg_replace( $this->regexp_entity, ' ', $content );
}

// Uppercase //
$content = mb_strtoupper( $content, 'UTF-8' );

// Replace e by e //
$content = str_ireplace( 'E', 'E', $content );

// Selection of words from context //
preg_match_all( $this->regexp_word, $content, $words_src );
return $words_src[ 1 ];
}

/**
* Finds the Lemma of the word
* 
* @param {string} word the Original word
* @param {array|boolean} Array of possible lemmas of the word or false
*/
public function lemmatize( $word ) {
// Get base form of word //
$lemmas = $this- > phpmorphy- > lemmatize( $word );
return $lemmas;
}
}
?>

While implemented only two methods. get_words divides the text into array of words, while filtering HTML tags and entity of type " ". Lemmatize method returns an array of lemmas of the words, or false if none found.

the Mechanism of ranking on the level of morphology

Let's focus on a unit of language, as a suggestion. The most important part of the sentence is a subject and/or predicate. Often the subject is expressed by noun and the predicate verb. Secondary members mostly used to clarify the meaning of the base. In different sentences the same parts of speech sometimes have completely different meanings, and the most accurate assessment of this value in the context of the text today. However, programmatically estimate the value of some words is still possible, though not exactly. The ranking algorithm should be based on the so-called profile of the text, which is determined by the author. The profile is an associative array whose keys are the parts of speech and the values are respectively the rank (or weight) of each of them. An example of the profile I show in the conclusion, but for now let's try to translate these reflections into language PHP, adding another method to the class morphyus:
the

<?php
require_once __DIR__.'/phpmorphy/src/common.php';

class morphyus {
private $phpmorphy = null;
private $regexp_word = '/([a-zа-H0-9]+)/ui';
private $regexp_entity = '/&([a-zA-Z0-9]+);/';

// ... //

/**
* Appreciates the significance of words
* 
* @param {string} word the Original word
* @param {array} profile Profile text
* @return {integer} the Rating of importance from 0 to 5
*/
public function weigh( $word, $profile=false ) {
// Attempt to determine the part of speech //
$partsOfSpeech = $this- > phpmorphy- > getPartOfSpeech( $word );

// Default profile //
if ( !$profile ) {
$profile = [
// Call the parts of speech //
'B' => 0,
'UNION' => 0,
'INT' => 0,
'WODN' => 0,
'COMMON' => 0,
'MS' => 0,

// Most important part of speech //
'From' => 5,
'G' => 5,
'P' => 3,
'N' => 3,

// Other parts of speech //
'DEFAULT' => 1
];
}

// If unable to determine the possible parts of speech //
if ( !$partsOfSpeech ) {
return $profile[ 'DEFAULT' ];
}

// Define rank //
for ( $i = 0; $i < count( $partsOfSpeech ); $i++ ) {
if ( isset( $profile[ $partsOfSpeech[ $i ] ] ) ) {
$range[] = $profile[ $partsOfSpeech[ $i ] ];
} else {
$range[] = $profile[ 'DEFAULT' ];
}
}

return max( $range );
}
}
?>

Indexing site content

As mentioned above, indexing significantly speeds up the search query, because search engine does not need to process the content every time a search is performed on the index. But what happens when indexing? If in order, then:

the

First text formed from an array of words, and this is done using the method get_words.

According to the profile, the text is discarded irrelevant parts of speech. the

Significant are assessed on a five-point scale, using the weigh method.
For each owl searches for lemmas, in other words underlying forms.
Calculates the number of occurrences of each word and the total rank.
All data is written to the object in JSON is stored in the database.

The result is an object with the following format:
the

{
"range" : "<the weighting of the indexed data>",
"words" : [
// One word //
{
"source" : "<basic version>",
"range" : "<summary rank>",
"count" : "<the number of repetitions of the word in the text>",

"basic" : [
// Versions of the lemmas of words //
]
}
]
}

Write the initializer and the first method core search engine:
the

<?php
require_once 'morphyus.php';

class firewind {
public $VERSION = "1.0.0";
private $morphyus;

function __construct() {
$this->morphyus = new morphyus;
}

/**
* Performs text indexing
*
* @param {string} content Text to index
* @param {integer} [range] a weighting of the indexed data
* @return {object} the result of the indexing
*/
public function make_index( $content, $range=1 ) {
$index = new stdClass;
$index- > range = $range;
$index->words = [];

// Highlight words from the text //
$words = $this->morphyus- > get_words( $content );

foreach ( $words as $word ) {
// Estimation of the importance of words //
$weight = $this->morphyus->weigh( $word );

if ( $weight > 0 ) {
// The number of words in the index //
$length = count( $index- > words );

// Check for the existence of the original word in the index //
for ( $i = 0; $i < $length; $i++ ) {
if ( $index->words[ $i ]->source === $word ) {
// The original word is already in the index //
$index->words[ $i ]->count++;
$index->words[ $i ]->range = 
$range * $index->words[ $i ]->count * $index->words[ $i]- > weight;

// Processing the next word //
continue 2;
}
}

// If the source words not yet in the index //
$lemma = $this->morphyus- > lemmatize( $word );

if ( $lemma ) {
// Check for the presence of lemmas in the index //
for ( $i = 0; $i < $length; $i++ ) {
// If the compared words are lemmas //
if ( $index->words[ $i ]->basic ) {
$difference = count(
array_diff( $lemma, $index->words[ $i ]->basic )
);

// Compare if the word has at least two distinct lemmas //
if ( $difference === 0 ) {
$index->words[ $i ]->count++;
$index->words[ $i ]->range = 
$range * $index->words[ $i ]->count * $index->words[ $i]- > weight;

// Processing the next word //
continue 2;
}
}
}
}

// If the index has no lemmas, no original word, //
// it is time to add it //
$node = new stdClass;
$node->source = $word;
$node->count = 1;
$node->range = $range * $weight;
$node->weight = $weight;
$node->basic = $lemma;

$index->words[] = $node;
}
}

return $index;
}
}
?>

Now when you add or modify data in tables is enough just to call this function to index them, but not necessarily: the index may be postponed. The first argument of the method make_index is the original text, the second is the weighting of the indexed data. Rank of each word, by the way, is calculated according to the formula:

the

<?php
$range = <weighting> * <rank based on parts of speech> * <reps>;

// In your code it looks like this: //
$index->words[ $i]- > range = $range * $index->words[ $i ]->count * $index->words[ $i]- > weight;
?>

Storing indexed data

Obviously, the index needs a place to store and bind to the source data. The most suitable place for them to be database. If you are indexing the content of files, you can create a separate table in the database that will contain the index name of each file, and for content that is already stored in the database, you can add another field type in the table structure. This approach will allow you to share content types when searching, for example, the names and description of entries from the blog.

Unresolved there was only a question of the format of the indexed content, because make_index returns an object, and so simply to a database or file it can not record. You can use JSON and store it in fields of type LONGTEXT, BSON or CBOR, using the LONGBLOB data type. The last two format allows to provide data in a more compact form than the first.

As they say, "the boss", so to decide where and how it will be stored You.

Benchmark

Let's see what we've got. I took a text his favorite article "the Dark matter of the Internet", namely the contents of the node #content html_format and saved it in a separate file.

the

<?php
require_once '../src/core.php';
$activities = new activities;

// Read source text //
$source = file_get_contents( './source.html' );

// Note the start time //
$begin_time = microtime( true );
echo "Indexing started: $begin_time\n";

// Indexing //
$index = $firewind- > make_index( $source );

// Note the time of the end //
$finish_time = microtime( true );
echo "Indexing finished: $finish_time\n";

// Results //
$total_time = $finish_time - $begin_time;
echo "Total time: $total_time\n";
?>

On my machine with configuration:
CPU: Intel Core i7-4510U @ 2.00 GHz, 4M Cache
RAM: 2x4096 Mb
OS: Ubuntu 14.04.1 LTS, x64
PHP: 5.5.9-1ubuntu4.5

Indexing took about seconds:
the

$ php benchmark.php
Indexing started: 1417343592.3094
Indexing finished: 1417343593.5604
Total time: 1.2510349750519

I think quite a good result.

Implementing search

I was the last and most important method, search method. In the first argument of method accepts index of a search query, as the second — content index where the search is performed. In the execution result returns the total grade, calculated on the basis of rank of words found, or 0 if nothing was found. This will allow you to sort search results.

the

<?php
require_once 'morphyus.php';

class firewind {
public $VERSION = "1.0.0";
private $morphyus;

// ... //

/**
* Searches for the index words of one object in another
*
* @param {object} target the Sought data
* @param {object} source the Data to be searched
* @return {integer} the Total grade on the basis of the found data
*/
public function search( $target, $index ) {
$total_range = 0;

// Iterate over the query //
foreach ( $target->words as $target_word ) {
// Iterates through the words in the index //
foreach ( $index- > words as $index_word ) {
if ( $index_word->source === $target_word->source ) {
$total_range += $index_word->range;
} else if ( $index_word->basic && $target_word->basic ) {
// If the searched and indexed words of the Lemma //
$index_count = count( $index_word ->basic );
$target_count = count( $target_word ->basic );

for ( $i = 0; $i < $target_count; $i++ ) {
for ( $j = 0; $j < $index_count; $j++ ) {
if ( $index_word->basic[ $j ] === $target_word->basic[ $i ] ) {
$total_range += $index_word->range;
continue 2;
}
}
}
}
}
}

return $total_range;
}
}
?>

All! Search engine ready to use. But... actually it's not a gin a magician, and just throwing it on your website You will not get anything. It should be integrated, and this process largely depends on the architecture of Your website. Let's consider this process on the example of a small online store.

implementation of a search example online store

For example, information about the sold products are stored in a table production:
the

CREATE TABLE `production` (
`uid` INT NOT NULL AUTO_INCREMENT, -- Unique identifier
`name` VARCHAR(45) NOT NULL, -- product Name
`manufacturer` VARCHAR(45) NOT NULL, -- Manufacturer
`price` INT NOT NULL, -- the value of the product
`keywords` TEXT NULL, -- Index of keywords
PRIMARY KEY ( `uid` )
);

SHOW COLUMNS FROM `production`;
+--------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| uid | int(11) | NO | PRI | NULL | |
| name | varchar(45) | NO | | NULL | |
| manufacturer | varchar(45) | NO | | NULL | |
| price | int(11) | NO | | NULL | |
| keywords | text | YES | | NULL | |
+--------------+-------------+------+-----+---------+-------+

And the description in the table description:

the

CREATE TABLE `description` (
`uid` INT NOT NULL AUTO_INCREMENT, -- Unique identifier
`fid` INT NOT NULL, -- Foreign key to bind the description to the product
`description` LONGTEXT NOT NULL, -- the Very description
the `index` TEXT NULL, -- Indexed description
PRIMARY KEY ( `uid` )
);

SHOW COLUMNS FROM `description`;
+-------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+-------+
| uid | int(11) | NO | PRI | NULL | |
| fid | int(11) | NO | | NULL | |
| description | longtext | NO | | NULL | |
| index | text | YES | | NULL | |
+-------------+----------+------+-----+---------+-------+

Field production.keywords will contain the index of the key words of the product description.index will contain an indexed description. And all this will be stored in JSON format.

Here is an example function add new product:
the

<?php
require_once 'firewind/core.php';
$activities = new activities;

$connection = new mysqli( 'host', 'user', 'password', 'database' );
if ( $connection->connect_error ) {
die( 'Cannot connect to database.' );
}

$connection->set_charset( 'UTF8' );

function add_product( $name, $manufacturer, $price, $description, $keywords ) {
global $firewind, $connection;

// The index of the description of the product //
$description_index = $firewind- > make_index( $description );
$description_index = json_encode( $description_index );

// Index of key words //
$keywords_index = $firewind- > make_index( $keywords, 2 );


// Prepare the queries //
$production_query = $connection->prepare(
"INSERT INTO `production` ( `name`, `manufacturer`, `price`, `keywords` )
VALUES ( ?, ?, ?, ? )"
);

$description_query = $connection->prepare(
"INSERT INTO `description` ( `fid`, `description`, `index` )
VALUES ( LAST_INSERT_ID(), ?, ? )"
);

if ( !$production_query || !$description_query ) {
die( "Cannot prepare requests!\n" );
}

if (
// Binding the parameters //
$production_query -> bind_param( 'ssis', $name, $manufacturer, $price, $keywords_index ) &&
$description_query -> bind_param( 'ss', $description, $description_index ) &&

// Execute query //
$production_query -> execute() &&
$description_query -> execute()
) {
// If the query executed successfully //
echo( "Product successfully added!\n" );

// End queries //
$production_query -> close();
$description_query -> close();

return true;
} else {
die( "An error occurred while executing query...\n" );
}
}
?>

Here the search engine has been integrated into the function of adding a new store product. And now the handler of search queries:
the

<?php
require_once '../src/core.php';
$activities = new activities;

$connection = new mysqli( 'host', 'user', 'password', 'database' );
if ( $connection->connect_error ) {
die( 'Cannot connect to database.' );
}

$connection->set_charset( 'UTF8' );

// Search query //
$query = isset( $_GET[ 'query' ] ) ? trim( $_GET[ 'query' ] ) : false;

if ( $query ) {
// Search query //
$query_index = $firewind- > make_index( $query );

// Receive data //
$production = $connection->query("
SELECT p.`uid`, p.`name`, p.`keywords`, d.`index` 
FROM `production` p, `description` d 
WHERE p.`uid` = d.`uid`
");

if ( !$production ) {
die( "Cannot get production info.\n" );
}

// Execute the search //
while ( $product = $production->fetch_assoc() ) {
// Unpack the index //
$keywords = json_decode( $product[ 'keywords' ] );
$index = json_decode( $product[ 'index' ] );

$range = $firewind->search( $query_index, $keywords );
$range += $firewind->search( $query_index, $index );

if ( $range > 0 ) {
$result[ $product[ 'uid' ] ] = $range; 
}
}

// If anything was found //
if ( isset( $result ) ) {
// Sort descending //
arsort( $result );

// Output results //
$i = 1;

foreach ( $result as $uid = > $range ) {
printf(
"#%d. Found product with id %d and range %d.\n",
$i++,
$uid,
$range
);
}
} else {
echo( "Sorry, no results found.\n" );
}
} else {
echo( "Query cannot be empty. Try again.\n" );
}
?>

This script takes a search query as a GET parameter query and performs the search. The system then shows the found products store.

Opinion

The article describes one of the variants of implementation of the site search. This is the first version, so I'll be glad to know Your comments, opinions and suggestions. Join my project on Github: https://github.com/axilirator/firewind. Plans to add a whole bunch of other features, like caching search query suggestions as you type a search query and the algorithm for letter-by-letter comparison, which will help to deal with the typos.

Thank you all for your attention and, with the day of information security!

Article based on information from habrahabr.ru

Search This Blog

computer express