A little about Pivot tables in PostgreSQL and Python

Hello.

Working at the Institute, I have to deal with a large amount of semi-structured information. The prefix "semi" means that, in general, all the data is similar, but it is usually scattered across local folders on employees' computers in .xls, .txt, or binary formats. The information consists of data obtained from different devices (level sensors, temperature, current velocity, atmospheric pressure, humidity, and so on, up to 20-30 different parameters). Every device exports data in its own format, either ASCII or binary, which is then processed, and the output is again ASCII. In general, everything is as usual; you can picture the chaos.

I wanted to cram all of this into one common database, so that I would not have to hunt for the correct version of a file in the correct folder, which takes a lot of time. I do have experience developing various systems (mostly GIS), but everything done before contained only processed data, and those systems were all built for customers. Nothing had ever been automated for ourselves.

Processing this whole mess is a standard affair, nothing new or interesting: verifying time series integrity (interpolating where needed), building a heap of different charts, running different models on the data, processing the model output (again, lots of graphs), and producing statistics. The last part is what I will discuss in this article.


Statistics


It would not be so sad if I needed only the ordinary statistics (minimum, maximum, average, deviation, etc.), but I also need summary (pivot) tables of various kinds.

Simple

We have speed data and need to build a table of the form:

Velocity gradation, cm/s    Occurrence, %
0-10                        60%
10-20                       30%
20-30                       10%

The source is a series of about 10K speed values. I am not an Excel guru; apart from VBA, I do not know a simple way to do this (there is, of course, the option of dragging formulas down, then sorting, and so on, or building a pivot table right in Excel, but doing that every time for every device is too much).
In the database (I use PostgreSQL), the problem looks like this:

CREATE TABLE tbldata
(
  dataid serial NOT NULL,
  measurementdate timestamp without time zone,
  velocity double precision,
  direction double precision
);

SELECT 100 * (count(velocity) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10;


As many gradations, so many queries; this is easy to automate and works quickly. You can write a stored procedure that accepts "from", "step", and "number of steps" values and returns a SETOF record. In general, the problem is solved.
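If you would rather not keep this logic in a stored procedure, the same per-gradation percentages can be generated from the Python side with psycopg2. This is only a sketch: the connection parameters and the half-open bin boundaries are my assumptions, not the original function.

import psycopg2

conn = psycopg2.connect("host='localhost' dbname='test' user='postgres'")
cur = conn.cursor()
start, step, nsteps = 0, 10, 10   # "from", "step", "number of steps"
for i in range(nsteps):
    lo, hi = start + i * step, start + (i + 1) * step
    # one query per gradation; parameters are passed safely by psycopg2
    cur.execute(
        "SELECT 100 * (count(*) / (SELECT count(*) FROM tbldata)::float) "
        "FROM tbldata WHERE velocity >= %s AND velocity < %s",
        (lo, hi))
    print("%s-%s: %.1f%%" % (lo, hi, cur.fetchone()[0]))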

More complicated

The problems start when you need a table of the form:
Velocity gradation, cm/s    Occurrence by rhumb, %
                            N     NE    E     SE    S     SW    W     NW
0-10                        10.8  8.2   1.3   1.3   2.1   10.1  6.9   25.4
10-20                       4.0   0.1   0.1   1.6   3.3   0.6   0.1   10.9
20-30                       1.8   0.0   0.0   1.2   3.4   0.1   0.0   2.2
30-40                       0.7   0.0   0.0   0.8   1.2   0.0   0.0   0.3
40-50                       0.1   0.0   0.0   0.3   0.2   0.0   0.0   0.1
50-60                       0.1   0.0   0.0   0.2   0.1   0.0   0.0   0.0
60-70                       0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0
70-80                       0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0

That is, we need to calculate the percentage for a gradation of two variables at once:

SELECT 100 * (count(*) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10
  AND (direction > 337.5 OR direction < 22.5);

Then you can take advantage of the excellent crosstab function from the tablefunc extension. To do this, modify the SQL so that it also returns the ordinal number of the velocity gradation and the ordinal number of the direction gradation:

SELECT 1 AS velgrid, 1 AS dirgrid,
       100 * (count(*) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10
  AND (direction > 337.5 OR direction < 22.5);

Then wrap it all in a function that accepts the start, step, and number of steps for both speed and direction and returns the UNION ALL of all these queries.
As a result, our query will look like this:

SELECT *
FROM crosstab('SELECT * FROM fnMakePercentageByTwoVariables(0,10,10,22.5,45,8)')
     AS tbldata(velgrid int, dirgrid1 double precision, ..., dirgrid8 double precision);

Really bad

Velocity gradation, cm/s    Occurrence by rhumb, %                              Sum     Security, %
                            N     NE    E     SE    S     SW    W     NW
0-10                        10.8  8.2   1.3   1.3   2.1   10.1  6.9   25.4      66.1    100.0
10-20                       4.0   0.1   0.1   1.6   3.3   0.6   0.1   10.9      20.6    33.9
20-30                       1.8   0.0   0.0   1.2   3.4   0.1   0.0   2.2       8.8     13.3
30-40                       0.7   0.0   0.0   0.8   1.2   0.0   0.0   0.3       3.1     4.5
40-50                       0.1   0.0   0.0   0.3   0.2   0.0   0.0   0.1       0.7     1.4
50-60                       0.1   0.0   0.0   0.2   0.1   0.0   0.0   0.0       0.4     0.7
60-70                       0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0       0.2     0.3
70-80                       0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0       0.1     0.1
Sum                         17.6  8.3   1.4   5.5   10.3  10.9  7.1   39.0      100.0
Average speed, cm/s         10.7  2.6   2.9   22.6  19.9  4.3   3.3   9.2       9.7
Max speed, cm/s             76.2  57.8  50.2  78.3  61.1  48.8  42.9  62.5      78.3


To be honest, my patience ran out here: to the previous query you need to add three UNIONs and one JOIN. And what if you also want the standard deviation and something else? Another UNION, and so on until you run out of imagination. Security (which is, in fact, the cumulative percentage from the bottom up) I still only know how to compute by hand. In general, if the table never changes, this approach is workable.

Worse

Then there is the task of comparing readings from two devices. It turns out that every column also has to be split in two:

Velocity gradation, cm/s    Occurrence by rhumb, %
                            N                           ...   NW
                            Device 1     Device 2       ...   Device 1     Device 2
0-10                        10.8         8.2            ...   1.3          1.3
...                         ...          ...            ...   ...          ...


Is it interesting to compare the readings of both devices side by side? For me, yes. When I thought about building such a comparison, I became very sad. There really was no option faster than exporting the data to SPSS, recoding everything, and producing exactly what was needed without tons of hate. There is also a table where, besides the velocity gradations, the cumulative percentage is needed, which I also had to do in Excel by hand.

The solution


My main tool for work is Python, and I use Django to develop applications. Eventually I decided: to hell with it all, enough pain and suffering. Stored procedures are fast and fine, but maintaining them is hard for me. Besides, speed is not a factor here at all: the people who dealt with this before me, God forbid, counted everything by hand. So waiting a minute or two instead of shoveling piles of data for a couple of hours is pure heaven.

And so salvation arrived in the form of pandas, which at the time I had not seen mentioned on Habr even once. The main appeal of pandas is its data types. The one-dimensional one is the Series. A Series has many of the things I was missing in a dict; out of ignorance of pandas I had built plenty of bicycles just to extend the functionality of a dictionary.
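For example (a small illustration of my own, not taken from the workflow below), a Series behaves like a labelled dict but supports vectorized operations:

import pandas as pd

s = pd.Series([1251, 592, 97], index=['N', 'NE', 'E'])  # made-up counts per direction
print(s['NE'])              # label-based access, just like a dict
print(100 * s / s.sum())    # counts turned into percentages in one line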

So, back to my task: I need the table from the "Really bad" section above.

To begin with, connect to Postgres and get the query result:

import psycopg2
import pandas as pd
import numpy as np
from pandas.io.sql import frame_query

conn_string = "host='localhost' dbname='test' user='postgres' password='**********'"
conn = psycopg2.connect(conn_string)
# frame_query runs the SQL query and returns a DataFrame
df = frame_query('select velocity, direction from tbldata', con=conn)
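If your pandas version no longer has frame_query, the same thing can be done with pd.read_sql (a quick sketch, reusing the same connection):

df = pd.read_sql('select velocity, direction from tbldata', con=conn)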


Create the gradations and labels for them:

def makeVelocityLabels(c):
    # turn interval labels like "(0, 10]" into strings like "0-10"
    ret = dict()
    for levidx in xrange(len(c.levels)):
        lev = c.levels[levidx]
        ret[levidx] = str(lev).replace('(', "").replace(", ", "-").replace(']', "")
    return ret

velGrid = np.linspace(0, 100, 11)
dirGrid = np.linspace(-22.5, 337.5, 9)
dirLabels = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
velLabels = makeVelocityLabels(pd.cut(velGrid, velGrid)).values()

A curious eye will have noticed the cut function. It turns a continuous series into a discrete one, effectively converting a variable into a categorical one. Before, I had always used numpy.where, piling up something like:

for i in xrange(len(velGrid) - 1):
    velocity[np.where((velocity > velGrid[i]) & (velocity < velGrid[i + 1]))] = velGrid[i]

Now I simply do this:

pd.cut(velocity, velGrid)
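To see what cut returns (a tiny made-up illustration, not data from the article): every value is assigned to its interval, and the result can carry readable labels.

vals = pd.Series([3.0, 17.5, 42.0])
print(pd.cut(vals, [0, 10, 20, 30, 40, 50]))
# -> (0, 10], (10, 20], (40, 50]
print(pd.cut(vals, [0, 10, 20, 30, 40, 50], labels=['0-10', '10-20', '20-30', '30-40', '40-50']))
# -> 0-10, 10-20, 40-50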


Next, create a new DataFrame and add the categorical variables to it:

resultDf = pd.DataFrame(index=df.index, columns=df.columns)
# df[df.direction > 337.5].direction -= 360 would not work: it modifies a copy.
# A little crutch so that near-north directions fall into the first bin.
df.direction[df.direction > 337.5] -= 360
resultDf.velocity = pd.cut(df.velocity, velGrid, labels=velLabels, include_lowest=True)
resultDf.direction = pd.cut(df.direction, dirGrid, labels=dirLabels, include_lowest=True)


Then the fun begins. Since the directions end up in the table header (in essence, I am crossing speed with direction), to get the statistics (minimum, maximum, and so on) I have to compute them per direction as well.
Let's see how many measurements we have for each direction gradation:

totalDir = resultDf.groupby("direction").size()

direction
E       97
W      503
N     1251
NE     592
NW    2773
S      736
SE     388
SW     773

Instead of size() I would recommend count(), which counts only the values that are not missing.
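A quick illustration of the difference (a sketch on made-up data, not the article's):

demo = pd.DataFrame({"direction": ["N", "N", "E"],
                     "velocity": [5.0, np.nan, 12.0]})
print(demo.groupby("direction").size())               # all rows per group: E -> 1, N -> 2
print(demo.groupby("direction")["velocity"].count())  # non-missing only:   E -> 1, N -> 1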
Next we need to calculate the statistics by direction (the mean and maximum speed), the totals, and build the pivot table itself.

dfDGr = df.copy()
dfDGr.direction = resultDf.direction

meanVel = dfDGr.groupby("direction").mean()
maxVel = dfDGr.groupby("direction").max()
totalVel = resultDf.groupby("velocity").size()
# frequency (%) of each velocity/direction combination
pivot = 100 * pd.pivot_table(resultDf, rows='velocity', cols='direction', aggfunc=len, fill_value=0) / totalDir.sum()

direction         E         W          N        NE         NW         S        SE         SW
velocity
0-10       1.279348  6.916913  10.839308  8.224378  25.432307  2.066639  1.265289  10.066076
10-20      0.056235  0.126529   3.992689  0.070294  10.881485  3.303810  1.560523   0.632644
20-30      0.000000  0.014059   1.813581  0.014059   2.235344  3.388163  1.223113   0.112470
30-40      0.014059  0.000000   0.000000  0.674821   0.323352  1.237171  0.801350   0.014059
40-50      0.000000  0.000000   0.014059  0.140588   0.056235  0.224940  0.253058   0.042176
50-60      0.014059  0.000000   0.084353  0.014059   0.042176  0.112470  0.168705   0.000000
60-70      0.028118  0.000000   0.000000  0.000000   0.000000  0.014059  0.014059   0.126529
70-80      0.014059  0.000000   0.000000  0.000000   0.000000  0.000000  0.000000   0.056235
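For readers on a newer pandas version: the rows= and cols= arguments of pivot_table were later renamed to index= and columns=, and a plain frequency table like this one can also be built with pd.crosstab. A sketch of the equivalent call:

pivot = 100 * pd.crosstab(resultDf.velocity, resultDf.direction) / totalDir.sum()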



It remains only to attach the statistics to the table, round everything, and put the columns in the desired order.

def getCumulate(s):
    # cumulative sum from the bottom of the series up
    ns = s.copy()
    for val in xrange(s.size):
        ns[ns.index[val]] = np.sum(s[s.index[val]:s.index[s.size - 1]])
    return ns
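# The same bottom-up cumulative sum can also be written with pandas built-ins,
# for example: cumulative = s[::-1].cumsum()[::-1]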

pivot["Sum"] = 100 * np.round(totalVel / float(np.sum(totalVel)), 3)
pivot['Security%'] = getCumulate(pivot["Sum"])

totalDirFr = pd.DataFrame(100 * totalDir / float(np.sum(totalDir))).T
totalDirFr.index = ["Sum"]
pivot = pivot.append(totalDirFr)
meanVelFr = meanVel.T
meanVelFr.index = ["Average speed"]
pivot = pivot.append(meanVelFr)
maxVelFr = maxVel.T
maxVelFr.index = ["Maximum speed"]
pivot = pivot.append(maxVelFr)
# A forgotten column simply will not appear here; a non-existent one added
# by mistake will show up filled with NA everywhere.
pivot = pivot.reindex(columns=np.append(dirLabels, ["Sum", "Security%"]))
pivot = np.round(pivot, 1).fillna(0)

                 N     NE    E     SE    S     SW    W     NW     Sum    Security%
0-10             10.8  8.2   1.3   1.3   2.1   10.1  6.9   25.4   66.1   100.0
10-20            4.0   0.1   0.1   1.6   3.3   0.6   0.1   10.9   20.6   33.9
20-30            1.8   0.0   0.0   1.2   3.4   0.1   0.0   2.2    8.8    13.3
30-40            0.7   0.0   0.0   0.8   1.2   0.0   0.0   0.3    3.1    4.5
40-50            0.1   0.0   0.0   0.3   0.2   0.0   0.0   0.1    0.7    1.4
50-60            0.1   0.0   0.0   0.2   0.1   0.0   0.0   0.0    0.4    0.7
60-70            0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0    0.2    0.3
70-80            0.0   0.0   0.0   0.1   0.0   0.0   0.0   0.0    0.1    0.1
Sum              17.6  8.3   1.4   5.5   10.3  10.9  7.1   39.0   0.0    0.0
Average speed    10.7  2.6   2.9   22.6  19.9  4.3   3.3   9.2    0.0    0.0
Maximum speed    76.2  57.8  50.2  78.3  61.1  48.8  42.9  62.5   0.0    0.0


Summary:

The result is exactly what was needed. Less blood? Yes. Easier to modify? Also yes.
In fact, the whole pain is that the variables do not arrive as categories; if they did, everything would be much simpler. On my weak laptop the entire script runs in about 0.2 seconds for 7113 values.

Thank you for your attention.
Article based on information from habrahabr.ru
