How to get to the top on Kaggle, or MatrixNet at home

I want to share my experience of taking part in a Kaggle competition and of the machine learning algorithms that earned 18th place out of 1604 teams in the Avazu competition on predicting the CTR (click-through rate) of mobile advertising. Along the way I tried to recreate the original MatrixNet algorithm, tested several variants of logistic regression and worked on the features. All of this is described below, and the complete code is included so you can see how it works.

The story is divided into the following sections:
1. Competition conditions;
2. Creating new features;
3. Logistic regression and the charms of the adaptive gradient;
4. MatrixNet – reconstructing the full algorithm;
5. Attempts to speed up machine learning in Python.

1. Competition conditions


The data given:
  • 40.4 million records for training (10 days of Avazu ad impressions);
  • 4.5 million records for testing (1 day).

The data itself can be downloaded here.

The evaluation criterion was the negative log-likelihood loss (NLL):

NLL = -(1/N) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

where N is the number of records, y_i is the value of the click variable, and p_i is the predicted probability that the event was a click ("1").

An important property of this error function is that if p is produced by the sigmoid function, its partial derivative (the gradient) reduces to (p - y).
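A quick way to convince yourself of this (a standalone numerical check, not part of the competition code; the toy feature vector and weights are made up): for p = sigmoid(w·x), the gradient of NLL with respect to each weight w_j is (p - y)·x_j.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p):
    # negative log-likelihood of a single observation
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([1.0, 0.5, -2.0])   # toy feature vector
w = np.array([0.1, -0.3, 0.2])   # toy weights
y = 1.0                          # the observed click

p = sigmoid(w.dot(x))
analytic = (p - y) * x           # the (p - y) * x form of the gradient

# numerical gradient for comparison
eps = 1e-6
numeric = np.array([
    (nll(y, sigmoid(w.dot(x) + eps * x[j])) - nll(y, p)) / eps
    for j in range(len(w))
])
print(analytic, numeric)         # the two vectors should be close
```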



2. Creating new features


To start, let's look at the raw data we have to work with:
  • click – "0" – no click, "1" – a click; this is the target of the prediction;
  • hour – the date and time of the ad impression;
  • banner_pos – the position of the banner (presumably first screen, second screen, and so on);
  • site* and app* features – information about the place where the ad is displayed;
  • device* features – information about the device on which the ad is displayed;
  • C1, C14-C21 – encrypted features (presumably including data on the location of the impression, time zone and perhaps other information).

Sample data


This is not much to work with, since we have no historical data on the users and, most importantly, we know nothing about the ad that is shown each time. And that is the important part (you don't click on every ad in a row, do you?).

Let's create new ones (a short pandas sketch of features 2 and 3 follows below):
  1. Polynomial features of the 2nd degree (the 3rd degree makes the learning process too slow);
  2. user_id. I tested several options; what works best is device_id + device_ip + device_model + C14 (presumably geography at the city/region level). And yes, device_id is not equal to user_id;
  3. Frequency of contact with the advertising. Users who see an ad 100 times a day react to it differently from those seeing it for the 1st time, so we count the frequency of impressions for each user_id.

Example of a user id
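A minimal pandas sketch of how features 2 and 3 can be built (the column names are those of the Avazu dataset; counting daily impressions per user_id is one way to express the frequency feature, and the actual code used in the solution is in the repository linked at the end of the article):

```python
import pandas as pd

# read a sample of the Avazu training file; keep all columns as strings
df = pd.read_csv("train.csv", dtype=str, nrows=1_000_000)

# 2. user_id: concatenation of the device columns and C14
df["user_id"] = (df["device_id"] + "_" + df["device_ip"] + "_"
                 + df["device_model"] + "_" + df["C14"])

# 3. frequency of contact: how many impressions this user_id has within a day
df["day"] = df["hour"].str[:6]    # hour comes as YYMMDDHH, e.g. 14102100
df["user_day_freq"] = df.groupby(["user_id", "day"])["id"].transform("count")
```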


Of the ideas I tried, these gave the best score. They were shaped mainly by my experience in advertising and by what could be pulled out of the Avazu data.

I also performed a few data conversions/transformations, aimed primarily at getting rid of duplicated information (a sketch of these steps follows below):
  • hour – keep the hour, throw away the day;
  • C1 – assuming a time zone is hidden behind it, its last 2 digits are merged with the hour column;
  • C15 and C16 – combined, since the banner's width and height are easily guessed behind them, and there is no point in keeping redundant features;
  • site* and app* – transformed into placement*, since a banner is shown either on a site or in an app, and the other set of values is simply an encrypted NULL that carries no additional information.
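A sketch of these transformations, continuing the pandas example above (the C1 step follows my reading of the description, and the value that marks an "encrypted NULL" in site_id is an assumption that should be checked against the data):

```python
# hour: keep only the hour of day, drop the date
df["hour_of_day"] = df["hour"].str[-2:]

# C1: keep its last 2 digits and merge them with the hour column
df["c1_hour"] = df["C1"].str[-2:] + "_" + df["hour_of_day"]

# C15 and C16: collapse width and height into a single banner-size feature
df["banner_size"] = df["C15"] + "x" + df["C16"]

# site_* and app_*: whichever side is real becomes placement_*
null_site = df["site_id"] == "85f751fd"   # assumed placeholder for "no site"
df["placement_id"] = df["site_id"].where(~null_site, df["app_id"])
df["placement_domain"] = df["site_domain"].where(~null_site, df["app_domain"])
df["placement_category"] = df["site_category"].where(~null_site, df["app_category"])
```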

All changes were tested with logistic regression: they either improved its result or sped up the algorithm without hurting the results.

3. Logistic regression – the charms of the adaptive gradient


Logistic regression is a popular classification algorithm. There are 2 main reasons for this popularity:
1. The algorithm is easy to understand and implement;



2. Speed and simplicity of computing predictions on big data thanks to stochastic gradient descent (SGD).
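To make reason 2 concrete, the per-record SGD update for logistic regression takes only a few lines (a dense NumPy sketch; the competition code works with sparse hashed features instead):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(X, y, w, alpha=0.1):
    """One stochastic-gradient pass over the data, record by record."""
    for x_i, y_i in zip(X, y):
        p = sigmoid(w.dot(x_i))
        w -= alpha * (p - y_i) * x_i   # the gradient of NLL is (p - y) * x
    return w
```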



Here, for example, on the Avazu data you can see that because of the stochastic gradient we do not always "walk" straight toward the minimum:



3.1. Adaptive gradient

However, if the learning rate of the algorithm is reduced over time, we will arrive at a more accurate solution, since the gradient stops reacting so strongly to atypical data. The authors of the adaptive gradient (Adaptive Gradient, AdaGrad) propose using the sum of all previous gradients to consistently reduce the learning rate:

w_(t+1) = w_t - alpha * g_t / sqrt( g_1^2 + g_2^2 + ... + g_t^2 )

Thus we obtain useful properties of the algorithm:
  • A smoother descent to the minimum (the learning rate decreases over time);
  • Alpha is calculated individually for each feature (which is very important for sparse data, where most features occur very rarely);
  • The calculation of alpha takes into account how much the parameter (w) of the feature has already changed: the more it changed earlier, the less it will change in the future.
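In code the difference from plain SGD is only in the step size: each weight accumulates the sum of its squared gradients and divides the base learning rate by its square root (a sketch in the same style as the SGD function above):

```python
import numpy as np

def adagrad_epoch(X, y, w, g_sq_sum, alpha=0.1, eps=1e-6):
    """One AdaGrad pass; g_sq_sum holds the accumulated squared gradients per feature."""
    for x_i, y_i in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-w.dot(x_i)))      # sigmoid
        g = (p - y_i) * x_i
        g_sq_sum += g ** 2
        # rarely seen (or little changed) features keep a larger effective step
        w -= alpha * g / (np.sqrt(g_sq_sum) + eps)
    return w, g_sq_sum
```

The combination described a bit further down (5 regular iterations and then one with AdaGrad) would then simply be five calls to sgd_epoch followed by one call to adagrad_epoch.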

The adaptive gradient starts learning in the same way as regular logistic regression, but then descends to a lower minimum:



In fact, if ordinary stochastic gradient descent is repeated many times over the same data, it can get close to AdaGrad, but it takes more iterations. In my model I used a combination of these techniques: 5 iterations with the conventional algorithm and then one with AdaGrad. Here are the results:



3.2. Transformation of data for logistic regression

For the logistic regression algorithm to work with the data (which comes as text values), it has to be converted to scalar values. I used one-hot encoding: text features are turned into an NxM matrix of "0"s and "1"s, where N is the number of records and M is the number of unique values of the feature. The main reasons are that this preserves the maximum of information, and feature hashing makes it possible to work quickly with spaces of millions of features within logistic regression.

Example one-hot encoding
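To make this concrete, here is a simplified illustration of hashed one-hot encoding for a single record (the feature-space size D, the hashing scheme and the sample values are illustrative assumptions, not necessarily what the solution uses):

```python
D = 2 ** 24  # size of the hashed feature space

def one_hot_indices(record):
    """Map a {column: text value} record to the indices of its '1' entries."""
    return [hash(col + "_" + val) % D for col, val in record.items()]

row = {"banner_pos": "1", "device_model": "8a4875bd", "placement_id": "1fbe01fe"}
idx = one_hot_indices(row)
# the record becomes a sparse 0/1 vector of length D with ones at `idx`,
# so the dot product w.dot(x) is simply sum(w[i] for i in idx)
```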



4. MatrixNet – assembling it at home


After getting good results with logistic regression, I had to keep improving. I was interested in understanding how Yandex's MatrixNet works; besides, I expected a good result from it (after all, one of its tasks is precisely predicting the CTR of advertising inside Yandex search). If you collect all the publicly available information about MatrixNet (see the full list of links at the end of the article), you can recreate the algorithm. I do not claim that this is the form in which the algorithm runs inside Yandex (I don't know), but my code and this article use all the "tricks" and hints about it that I could find.

Let's go through what MatrixNet consists of, in order:
  1. The basic element – Classification and Regression Tree (CART);
  2. The main algorithm – Gradient Boosting Machine (GBM);
  3. The update of the main algorithm – Stochastic Gradient Boosting Machine (SGBM).

4.1. Classification and Regression Tree (CART)

This is one of the classical decision tree algorithms, which has already been written about on Habr (for example, here and here). So I will not go into details; I will just recall the principle of operation and the basic terms.



Thus, a decision tree has the following parameters that define the algorithm:
  • The conditions for splitting into leaves (x_1 ≥ 0.5);
  • The height of the tree (the number of levels with conditions; in the example above it is 2);
  • The prediction rule p (in the example above, the mathematical expectation is used).

4.1.1. How to choose the feature and the split condition

At each level, we need to choose the feature and the split condition that divide the plane in such a way that we make more accurate predictions.



So we go through all the features and all possible splits and choose the point where the NLL value after the split is lowest (in the example above, it is of course x2). Other functions are usually used to determine the split, such as information gain and Gini impurity, but in our case we choose NLL, since that is what the task asks us to minimize (see the task description in section 1).
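A toy version of this search for a single node, assuming for simplicity that a leaf predicts the mean click rate of the records that fall into it (regularization is added in the next subsection):

```python
import numpy as np

def leaf_nll(y):
    """NLL of a leaf that predicts the mean of its targets."""
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def best_split(X, y):
    """Try every feature and threshold, return the split with the lowest total NLL."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            right = X[:, j] >= t
            if right.all() or not right.any():
                continue                      # this split does not separate anything
            score = leaf_nll(y[right]) + leaf_nll(y[~right])
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score
```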

4.1.2. Regularization in CART

In the case of CART, regularization is usually needed so that the algorithm does not make overly confident predictions where we are not really sure. Yandex proposes correcting the prediction formula as follows:



where N is the number of values in the leaf and lambda is the regularization parameter (the MatrixNet experts recommend 100, but you need to test it for each task separately). So the fewer values we have in the leaf, the less confident our algorithm is about the predicted value.
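I do not know the exact form of the formula inside MatrixNet, but the description above corresponds to the usual shrinkage in which lambda is added to the leaf size, something like this (an assumption on my part):

```python
def leaf_prediction(y_leaf, lam=100):
    # the fewer records in the leaf, the more the prediction is pulled toward zero
    return y_leaf.sum() / (len(y_leaf) + lam)
```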

4.1.3. Forgetful trees (Oblivious Trees)

In MatrixNet, instead of the classical approach (where, after a split on x1, the next level of the tree does not have to split the data on it again), so-called oblivious ("forgetful") trees are used, which can split values on the same feature at several levels (as if "forgetting" that this split has already been made).



The use of this type of tree is, in my opinion, justified above all in cases where the data contain 1-2 features for which narrower splits give better results than splits on features that have not yet been used.
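A handy consequence for the implementation is that an oblivious tree applies the same condition to every node of a level, so a record's leaf is simply the bit pattern of its answers (a sketch; splits is a hypothetical list with one (feature, threshold) pair per level):

```python
def oblivious_leaf_index(x, splits):
    """splits: one (feature_index, threshold) pair per level of the tree."""
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | int(x[feature] >= threshold)
    return index  # a number from 0 to 2**len(splits) - 1, i.e. the leaf id
```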

4.2. Gradient Boosting Machine

The Gradient Boosting Machine (GBM) is the use of a forest of short trees, where each successive tree does not try to guess the target value itself, but tries to improve the forecast of the previous trees. Let's consider a simple example with regression and trees of height 1 (each tree can make only 1 split).



In essence, each tree here helps to optimize the quadratic error function:



The main advantage of GBM compared to CART is a lower chance of overfitting, since we give predictions based on leaves with a larger number of values. In MatrixNet's GBM the "height" of a tree depends on the current iteration: it starts at 1, increases by 1 every 10 iterations, but never exceeds 6. This approach significantly speeds up the algorithm without badly hurting the result on the first iterations. I tested this option, but settled on a variant where the transition to the next level happens after the possibilities of the previous one are exhausted.
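In code, one boosting round for the regression case boils down to fitting the next short tree to the residuals of the current forecast. A minimal sketch follows; sklearn's DecisionTreeRegressor stands in for the oblivious trees just to keep the example runnable, it is not what the solution uses:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, learning_rate=0.1, depth=1):
    """Plain gradient boosting for squared error: each tree fits the residuals."""
    prediction = np.zeros(len(y))
    forest = []
    for _ in range(n_trees):
        residuals = y - prediction                 # what the forest still gets wrong
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        forest.append(tree)
    return forest

def gbm_predict(forest, X, learning_rate=0.1):
    return learning_rate * sum(tree.predict(X) for tree in forest)
```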

4.2.1. GBM for classification

When dealing with classification, each tree should improve the prediction of the previous ones, but this has to be done in such a way that the trees work on a single classification task with the sigmoid function. In this case it is convenient to state the optimization problem in the same form as in logistic regression, namely:


An interesting solution by Yandex is that the predictions of the logistic regression are used as the initial forecast p0, and the product of the weights and features (wTx) is used as one of the features.
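The classification case differs only in what the trees chase: the model keeps a raw score on the logit scale, each tree is fitted to (y - p), and the output of the logistic regression can be fed in as the initial score (the same kind of sketch as above, again with sklearn's regression tree as a stand-in):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit_classification(X, y, initial_score, n_trees=100, learning_rate=0.1, depth=1):
    """Boosting on the logit scale: trees are fitted to the gradient (y - p)."""
    score = initial_score.copy()          # e.g. the w.T x output of logistic regression
    forest = []
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-score))  # current probability forecast
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, y - p)
        score += learning_rate * tree.predict(X)
        forest.append(tree)
    return forest
```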

4.3. Stochastic GBM

Stochastic GBM differs from ordinary GBM in that each tree is built on a sample of the data (10-50%). This kills two birds with one stone: it improves performance (otherwise we would have to compute the result for all 40.4 million training records), and it largely gets rid of the overfitting problem.
The final result:



As you can see, the biggest improvement still comes from the data rather than from the algorithms themselves, although with ordinary logistic regression it would have been hard to climb above 30th place, where every ten-thousandth counts.

5. Attempts to speed up machine learning in Python


This was my first project implementing machine learning algorithms myself: in the code that makes the predictions, no ready-made third-party solutions are used, and all the algorithms and data manipulations happen in the open, which allowed me to better understand the tasks and the principles behind these tools. Although I did rely on existing work: the logistic regression draws to a significant degree on Abhishek's code on Kaggle, and MatrixNet borrows a small part of the CART code by Stephen Marsland for the book Machine Learning: An Algorithmic Perspective.

As for the implementation, I started experimenting with the task in R, but quickly abandoned it, since it is practically impossible to work with big data in it. Python was chosen for the simplicity of its syntax and the large number of machine learning libraries.

The main problem with CPython is that it is VERY slow (though still much faster than R). However, there are many ways to speed it up, and in the end I used the following two:
  • PyPy – a JIT compiler that speeds up standard CPython by about 20x; the main problem is that there are practically no libraries for numerical computing (NumPyPy is still in development), so everything has to be written without them. It is a perfect fit for implementing logistic regression with stochastic gradient descent, as in our case;
  • Numba – its jit decorators pre-compile selected functions (the principle is the same as in PyPy), while the rest of the code can still use all the advantages of CPython libraries. A big plus: for precompiled functions you can release the GIL (Global Interpreter Lock) and use several CPU cores, which is what I used to speed up MatrixNet. Numba's problem is the same as PyPy's, there are no libraries; the main difference is that Numba implements some of the NumPy functions. See the short sketch after this list.
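For illustration, here is a minimal example of the Numba approach (not taken from the MatrixNet code): a function compiled with nogil=True can then be called from several threads in parallel, which plain CPython functions cannot do because of the GIL.

```python
import numpy as np
from numba import jit

@jit(nopython=True, nogil=True)
def score_record(indices, weights):
    """Sum of the weights at the active one-hot indices; compiled, GIL released."""
    total = 0.0
    for i in indices:
        total += weights[i]
    return total

weights = np.zeros(2 ** 24)
print(score_record(np.array([3, 17, 42]), weights))
```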

I never got as far as Cython, since I was trying to speed things up with minimal effort, but I think next time it would be easier to go straight to GPGPU with Theano / Numba.

The complete code for all of the transformations and learning algorithms is here. Disclaimer: the code is not optimized and is intended only for studying the algorithms themselves.

If you find any inaccuracies or errors in the article or in the code, please message me privately.

Links to the sources on the algorithms used for the article:
Article based on information from habrahabr.ru
