How we wrote a click-fraud protection system

[Image: screenshot of the system]

Our company works in online advertising. About two years ago we finally gave up on the click-fraud protection built into contextual advertising networks and decided to write our own, at that point still for internal use only.

Under the cut are plenty of technical details about the system, along with the challenges we ran into along the way and how we solved them. If you just want to see the system, the main image is clickable.

The first task we had to solve was identifying unique users.
That is, we need to recognize a user even if he switches browsers or clears his cookies.
After a little thought and some experimentation, we began writing an identifier not only to cookies but also to the storage of every browser plugin that provides one, plus other things: third-party cookies and various JS storages.
As a result, in most cases we not only identify the user but also build a digital replica of his PC (OS, screen resolution, color depth, the presence or absence of certain plugins, browser support for various JS storages and third-party cookies), which lets us identify him with high probability even if he manages to wipe everything we stored.
No problems worth writing about came up at this stage.
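For illustration, here is a minimal sketch of the client-side part of this scheme. The storage keys are hypothetical, and the real script also writes to plugin storages (Flash LSOs and the like) that this sketch omits:

    // Redundantly persist one identifier and collect the environment
    // attributes that make up the "digital replica" of the PC.
    function collectFingerprint() {
      function readCookie(name) {
        var m = document.cookie.match(new RegExp('(?:^|; )' + name + '=([^;]*)'));
        return m ? m[1] : null;
      }
      return {
        // copies of the same identifier in every storage available
        cookieId:  readCookie('uid'),
        localId:   window.localStorage   ? localStorage.getItem('uid')   : null,
        sessionId: window.sessionStorage ? sessionStorage.getItem('uid') : null,
        // attributes that survive a full cookie/storage wipe
        ua:      navigator.userAgent,
        screen:  screen.width + 'x' + screen.height + 'x' + screen.colorDepth,
        tz:      new Date().getTimezoneOffset(),
        plugins: Array.prototype.map.call(navigator.plugins || [],
                   function (p) { return p.name; }).join(','),
        hasLocalStorage:   !!window.localStorage,
        hasSessionStorage: !!window.sessionStorage
      };
    }

Matching on the environment attributes is what lets us re-identify a user whose explicit identifiers have all been wiped.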

The second task: getting all of this user data to our server.
For the most complete picture we use two scripts: a server-side one (PHP/Python/ASP.NET) and a client-side one in JS. This lets us obtain information even about users who close the page without waiting for it to load fully and, consequently, without the client-side JS ever running. Such clicks on a teaser usually amount to at least 30%, and we have found no other system that takes them into account. As a result we collect significantly more data than the likes of Metrika, Analytics, and other JS-counter statistics systems.
The data is sent to the server via cross-domain AJAX or, when the browser does not support it, via an iframe. Sending happens on page load and on a number of JS events. This allows us to analyze user behavior on the site and to tell bots from real users both by behavioral patterns and by imperfect imitation of certain JavaScript events. For example, many bots simulate the onClick event but do not generate the accompanying onMouseDown and onMouseUp events.
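A simplified sketch of the client-side beacon, combining the cross-domain send with the iframe fallback; the collector URL and parameter names are made up, and a real script would report many more signals (and throttle mousemove):

    var COLLECTOR = 'https://collector.example.com/hit'; // hypothetical endpoint

    function send(eventName) {
      var payload = 'e=' + encodeURIComponent(eventName) + '&t=' + new Date().getTime();
      var xhr = window.XMLHttpRequest ? new XMLHttpRequest() : null;
      if (xhr && 'withCredentials' in xhr) {
        // The browser supports cross-domain AJAX (CORS)
        xhr.open('POST', COLLECTOR, true);
        xhr.send(payload);
      } else {
        // Fallback: carry the data out through a hidden iframe
        var ifr = document.createElement('iframe');
        ifr.style.display = 'none';
        ifr.src = COLLECTOR + '?' + payload;
        document.body.appendChild(ifr);
      }
    }

    // Report page load plus the low-level events a real user generates.
    // A genuine click is preceded by mousedown and mouseup; a bot that
    // fires only a synthetic click stands out on the server side.
    send('load');
    ['mousedown', 'mouseup', 'click', 'mousemove', 'scroll'].forEach(function (n) {
      document.addEventListener(n, function () { send(n); }, true);
    });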

And so we gradually came to the third task: choosing the hardware.
Architecturally, the system currently consists of four segments:

  • Frontend
  • Data collection and processing
  • Landing page indexing
  • Storage of usernames/passwords for third-party services
All domains are served through Amazon Route 53 with a TTL of 60 seconds, so that in case of any server problems we can quickly switch over to a backup.
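The switch itself can be driven through the Route 53 API. A sketch assuming the AWS SDK for JavaScript; the zone ID, domain, and IP address are hypothetical:

    var AWS = require('aws-sdk');
    var route53 = new AWS.Route53();

    // Repoint the collector's A record at the backup server; with a
    // 60-second TTL, resolvers pick up the change within about a minute.
    function pointToBackup(callback) {
      route53.changeResourceRecordSets({
        HostedZoneId: 'Z0000000000000',          // hypothetical zone
        ChangeBatch: {
          Changes: [{
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: 'collector.example.com.',    // hypothetical domain
              Type: 'A',
              TTL: 60,
              ResourceRecords: [{ Value: '203.0.113.2' }] // backup IP
            }
          }]
        }
      }, callback);
    }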
About the frontend there is not much to say. The load on it is small; almost any VPS can handle it.
Data collection and processing is more complicated, since it has to work with large volumes of data. Today we handle about 200 requests per second.
With the right initial choice of hardware and software, a single server copes with this volume.
As for the hardware: an 8-core AMD, RAID10 over SAS drives, 16 GB of RAM.
Collection is done by a tuned nginx + php-fpm + mysql stack, processing by a script written in C++.
Initially we ran into intense CPU consumption by the data collection script. The solution turned up quite unexpectedly: having replaced all of PHP's ereg_ functions with their preg_ counterparts, we cut CPU consumption roughly eightfold, which surprised us quite a bit.
In case of problems with the current server, or if we need to scale, a second server of similar configuration waits in the wings in another DC and can be brought into service within an hour.
The indexing node is not duplicated; however, replacing or extending it takes less than a day and does not affect the quality of traffic analysis. In the worst case, a few advertising campaigns will not be paused if a client's site goes down or our code disappears from it.
Storing usernames and passwords for third-party services is a rather peculiar business and, generally speaking, not good for security.
However, most ad networks' APIs do not provide all the functionality we need, and parsing their web interfaces is hard to do without a password. For example, in Google AdWords banning IP addresses is only possible through the web interface. As a bonus, users can jump with one click from our system's interface into their ad network accounts.
Hence the fourth task: keeping this data secure while storing it in the clear (that is, in recoverable form).
For maximally secure storage we came up with the following scheme (a code sketch follows the list):
  • If the password was obtained by us through the web interface:
    • We put it in the frontend database, symmetrically encrypted with the client's password for our service.
    • We also put it in the admin database, asymmetrically encrypted with the storage server's public key.
    • The storage server periodically queries the admin database, takes the encrypted ad network passwords, decrypts them with its private key, and puts them into its own database.
  • If the password was generated by us on the storage server:
    • We put it in the storage database.
    • At the user's next login, his service password is put into the admin database, asymmetrically encrypted with the storage server's public key.
    • The storage server periodically queries the admin database, takes the encrypted password, and decrypts it with its private key.
    • The storage server then symmetrically encrypts the ad network passwords from its database with the user's password and pushes them, in encrypted form, to the frontend.
  • When a user logs in to our service, his password is captured in a certain way and used on the client side to decrypt the ad network passwords and log in to the networks by clicking them.
  • Access to the storage server is allowed only from a set of IP addresses that we control.
  • The storage server's IP is kept secret; it accepts no incoming requests and only makes outgoing ones.
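To make the data flow concrete, here is a minimal sketch of the two cryptographic layers, assuming Node.js's built-in crypto module, RSA key pairs, and AES-256-GCM; the article does not name the actual stack or algorithms, and all names here are illustrative:

    var crypto = require('crypto');

    // Frontend/admin side: only the storage server's PUBLIC key is known here.
    function encryptForStorage(storagePublicKeyPem, secret) {
      return crypto.publicEncrypt(storagePublicKeyPem, Buffer.from(secret, 'utf8'));
    }

    // Storage side: the PRIVATE key never leaves this host.
    function decryptOnStorage(storagePrivateKeyPem, blob) {
      return crypto.privateDecrypt(storagePrivateKeyPem, blob).toString('utf8');
    }

    // Symmetric layer: the ad network password is encrypted with a key
    // derived from the user's own service password, so the frontend can
    // hold it while only the logged-in user can decrypt it client-side.
    function encryptWithUserPassword(userPassword, adNetworkPassword) {
      var salt = crypto.randomBytes(16);
      var key = crypto.scryptSync(userPassword, salt, 32);
      var iv = crypto.randomBytes(12);
      var cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
      var ct = Buffer.concat([cipher.update(adNetworkPassword, 'utf8'), cipher.final()]);
      return { salt: salt, iv: iv, ct: ct, tag: cipher.getAuthTag() };
    }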

Since we have not yet managed to parse some web interfaces without fully emulating a browser, the storage server is demanding on RAM and CPU. In another DC, in case of unforeseen circumstances, there is also a backup storage server ready to go to work within an hour.

The fifth and final task was integration with the advertising networks for automatic banning of "bad" IPs and sites.
With the relatively small networks, such as Begun, there are no problems: all communication works through the API, and if some methods are missing, they quickly add them for us.
But with Yandex.Direct, and especially AdWords, there are problems. Getting AdWords API access turns into a quest. First you spend a month obtaining it; then it turns out half the functions are missing and you still have to parse the web interface. Then it turns out that even the functions that do exist are strictly limited by units, of which the Basic API gets few and which you cannot simply buy more of. And a new quest begins to obtain the next access level, with an expanded number of units. As you can see, the search giants do everything they can to complicate advertisers' efforts to optimize their ad spend. Nevertheless, at this point we successfully analyze their traffic and clean it in automatic mode.

The bottom line: at the moment our system has no real competitors with similar capabilities and, most importantly, comparable quality of low-quality traffic detection. In some cases we have seen 40-45% more traffic than other analytics systems report.
A traffic audit costs, on average, about 100 times less than the advertising being purchased, and for individual advertising systems the service is not free. The savings amount to 10 to 50% of the advertising budget, sometimes up to 90%.

Join us!
