Monday 3 December 2012

It's just the way I Roll...

This is data from Quebec on the number of children born every day. As you can see it's quite messy; there's a lot of data in that plot, over 5,000 days' worth of figures.

By applying a 30 day rolling average a clear seasonal pattern emerges.

A 365 day rolling average produces a much clearer long-term trend than either the daily or 30 day rolling figures.
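If you want to reproduce the smoothing, here's a minimal sketch using rollmean from the zoo package (assuming the daily counts are already in date order in a numeric vector I'm calling births; the names are mine, not from the original analysis):

> require(zoo)
> monthly <- rollmean(births, k = 30, fill = NA)    # 30 day rolling average
> yearly <- rollmean(births, k = 365, fill = NA)    # 365 day rolling average
> plot(births, type = "l", col = "grey")            # the messy daily figures
> lines(monthly, col = "blue")                      # seasonal pattern emerges
> lines(yearly, col = "red")                        # long term trend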
 
Data: Number of daily births in Quebec, Jan. 01, 1977 to Dec. 31, 1990 from here.


Saturday 1 December 2012

ChRistmas with extRa R - XML & xtable

I like the R blog is.R(). They're doing an Advent CalendaR this Christmas, looking at an R package every day up until Christmas Eve. I thought I'd play along. Instead of doing exactly the same as them and using the US presidential election results, I'll do a variation on a theme and scrape the last UK general election results from Wikipedia.

> require(XML)
> myURL <- "http://en.wikipedia.org/wiki/United_Kingdom_general_election,_2010"
> allTables <- readHTMLTable(myURL)
> str(allTables)
# List of 29; I just want the 11th
> stateTable <- allTables[[11]]
> head(stateTable)
# used the data editor in RGui to tidy it up a little
> fix(stateTable)
# Need to remove the first column, as I'm not bothered about the colours,
# and adjust some of the column names
> stateTable <- stateTable[,-1]
> names(stateTable)
> colnames(stateTable)[7] <- 'Net Change in Seats'
> colnames(stateTable)[10] <- 'Change in % of Votes'
> require(xtable)
> resultsTable <- xtable(stateTable)
> print(resultsTable, type="html")


  | Political Party | Candidates | Number of Votes | Elected | Seats Gained | Seats Lost | Net Change in Seats | % of Seats | % of Votes | Change in % of Votes
1 | Conservative | 631 | 10,703,654 | 306 | 100 | 3 | +97 | 47.1 | 36.1 | +3.7
2 | Labour | 631 | 8,606,517 | 258 | 3 | 94 | -91 | 39.7 | 29.0 | -6.2
3 | Liberal Democrat | 631 | 6,836,248 | 57 | 8 | 13 | -5 | 8.8 | 23.0 | +1.0
4 | UKIP | 572 | 919,471 | 0 | 0 | 0 | 0 | 0 | 3.1 | +0.9
5 | BNP | 338 | 564,321 | 0 | 0 | 0 | 0 | 0 | 1.9 | +1.2
6 | SNP | 59 | 491,386 | 6 | 0 | 0 | 0 | 0.9 | 1.7 | +0.1
7 | Green | 310 | 265,243 | 1 | 1 | 0 | +1 | 0.2 | 0.9 | -0.2
8 | Sinn Féin | 17 | 171,942 | 5 | 0 | 0 | 0 | 0.8 | 0.6 | -0.1
9 | Democratic Unionist | 16 | 168,216 | 8 | 0 | 1 | -1 | 1.2 | 0.6 | -0.3
10 | Plaid Cymru | 40 | 165,394 | 3 | 1 | 0 | +1 | 0.5 | 0.6 | -0.1
11 | SDLP | 18 | 110,970 | 3 | 0 | 0 | 0 | 0.5 | 0.4 | -0.1
12 | Conservatives and Unionists | 17 | 102,361 | 0 | 0 | 1 | -1 | 0 | 0.3 | -0.1
13 | English Democrats | 107 | 64,826 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0.2
14 | Alliance | 18 | 42,762 | 1 | 1 | 0 | +1 | 0.2 | 0.1 | 0.0
15 | Respect | 11 | 33,251 | 0 | 0 | 1 | -1 | 0 | 0.1 | -0.1
16 | Traditional Unionist Voice | 10 | 26,300 | 0 | 0 | 0 | 0 | 0 | 0.1 | N/A
17 | Speaker | 1 | 22,860 | 1 | 0 | 0 | 0 | 0.2 | 0.1 | 0.0
18 | Independent - Rodney Connor | 1 | 21,300 | 0 | 0 | 0 | 0 | 0 | 0.1 | N/A
19 | Independent - Sylvia Hermon | 1 | 21,181 | 1 | 1 | 0 | +1 | 0.2 | 0.1 | N/A
20 | Christian | 71 | 18,623 | 0 | 0 | 0 | 0 | 0 | 0.1 | +0.1
21 | Green | 20 | 16,827 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.0
22 | Health Concern | 1 | 16,150 | 0 | 0 | 1 | -1 | 0 | 0.1 | 0.0
23 | Trade Unionist & Socialist | 42 | 12,275 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
24 | Independent - Bob Spink | 1 | 12,174 | 0 | 0 | 1 | -1 | 0 | 0.0 | N/A
25 | National Front | 17 | 10,784 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
26 | Buckinghamshire Campaign for Democracy | 1 | 10,331 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
27 | Monster Raving Loony | 27 | 7,510 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
28 | Socialist Labour | 23 | 7,219 | 0 | 0 | 0 | 0 | 0 | 0.0 | -0.1
29 | Liberal | 5 | 6,781 | 0 | 0 | 0 | 0 | 0 | 0.0 | -0.1
30 | Blaenau Gwent People's Voice | 1 | 6,458 | 0 | 0 | 1 | -1 | 0 | 0.0 | -0.1
31 | Christian Peoples | 17 | 6,276 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
32 | Mebyon Kernow | 6 | 5,379 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
33 | Lincolnshire Independents | 3 | 5,311 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
34 | Mansfield Independent Forum | 1 | 4,339 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
35 | Green (NI) | 4 | 3,542 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
36 | Socialist Alternative | 3 | 3,298 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
37 | Trust | 2 | 3,233 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
38 | Scottish Socialist | 10 | 3,157 | 0 | 0 | 0 | 0 | 0 | 0.0 | -0.1
39 | People Before Profit | 1 | 2,936 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
40 | Local Liberals People Before Politics | 1 | 1,964 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
41 | Independent - Esther Rantzen | 1 | 1,872 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
42 | Alliance for Green Socialism | 6 | 1,581 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
43 | Social Democrat | 2 | 1,551 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
44 | Pirate | 9 | 1,340 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
45 | Communist | 6 | 947 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
46 | Democratic Labour | 1 | 842 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
47 | Democratic Nationalist Party | 2 | 753 | 0 | 0 | 0 | 0 | 0 | 0.0 | N/A
48 | Workers Revolutionary | 7 | 738 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
49 | Peace | 3 | 737 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
50 | New Millennium Bean Party | 1 | 558 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0
51 | Total | - | 29,687,604 | 650 | - | - | - | - | 65.1 (turnout) | -
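
Incidentally, if you'd rather have the HTML in a file than echoed to the console, print.xtable also takes a file argument; a one-line sketch (results.html is just my example filename):

> print(resultsTable, type="html", file="results.html")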

Wednesday 28 November 2012

A Tutorial for ggplot2

Here is a quick ggplot2 tutorial from Isomorphismes, from which I've completed the plots below.


These two lines of code produce the same plot but show the differences between qplot and ggplot.

> require(ggplot2)
> qplot(clarity, data=diamonds, fill=cut, geom="bar")
> ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()



Data displayed with a continuous scale (top) and discrete scale (bottom) 

> qplot(wt, mpg, data=mtcars, colour=cyl)
> qplot(wt, mpg, data=mtcars, colour=factor(cyl))

I think it works better with different colours for the factors but you can change shapes too. 

> qplot(wt, mpg, data=mtcars, shape=factor(cyl))



Dodge is probably better for comparing data but, let's face it, fill is prettier.

> qplot(clarity, data=diamonds, geom="bar", fill=cut, position="dodge")
> qplot(clarity, data=diamonds, geom="bar", fill=cut, position="fill")


Not with this data, but this is a great plot to use for comparisons over a time series.

> qplot(clarity, data=diamonds, geom="freqpoly", group=cut, colour=cut, position="identity")





Changed this one to get better smoothers. More info on that here.

> qplot(wt, mpg, data=mtcars, colour=factor(cyl), geom=c("smooth", "point"), method=glm)





When dealing with lots of data points, overplotting is a common problem, as you can see from the first plot above.


> t.df <- data.frame(x=rnorm(4000), y=rnorm(4000))
> p.norm <- ggplot(t.df, aes(x,y))
> p.norm + geom_point()


There are three easy ways to deal with it: make the points more transparent, reduce their size, or make them hollow.

> p.norm + geom_point(alpha=.15)

> p.norm + geom_point(shape=".")

> p.norm + geom_point(shape=1)

This is also helpful for saving plots:


> jpeg('rplot.jpg')
> plot(t.df$x, t.df$y)
> dev.off()
# Don't forget to turn the on-screen device back on again
> dev.new()
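
If it's a ggplot2 plot, ggsave() is a handy alternative that writes the plot to file without opening and closing devices by hand; a minimal sketch reusing the overplotting example above (rplot.jpg is just my example filename):

> ggsave('rplot.jpg', p.norm + geom_point(alpha=.15))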



Wednesday 14 November 2012

Data Munging in R


Whatever fancy analysis or visualization is carried out afterwards, work in R practically always starts with some data munging: load the data, get rid of some columns you don't need, rename some of the others, check the data structure and make sure the data is in the format you want. This is pretty standard, but it can also be pretty baffling when you're starting out on R's notoriously steep learning curve.

So to help out I've posted some of my code with notes. This was for data in CSV format, and local was what I named the dataframe, as the data referred to local councils. Remember, if you don't quite know how to use a function, e.g. gsub, then try ?gsub and that'll bring up the help file, which often contains helpful examples. So why not find some data of your own and try this out yourself.


local <- read.table(file.choose(), header=TRUE, sep=",")
head(local)
# tidy the dataframe up a bit, removing unnecessary columns etc.
names(local)
local <- subset(local, select = -c(Old.ONS.code, ONS.code, Party.code))
# check they have been removed
names(local)
# rename some other columns by column number
names(local)[1] <- "Name"
names(local)[13] <- "CutPerHead"
names(local)[14] <- "Benefit"
names(local)[15] <- "YouthBenefit"
names(local)[16] <- "DeprivationRanking"
names(local)[17] <- "PublicSector"
names(local)[18] <- "ChildPoverty"
# check they have been amended
names(local)
# check the structure of the dataframe
str(local)
# Notice that the £ sign has got CutPerHead read in as a factor, which we don't want.
# As all the numbers are negative we can simply remove all the non-numeric characters
local$CutPerHead <- gsub("[^0-9]", "", local$CutPerHead)
# Let's check that went OK
head(local$CutPerHead)
# Oops, we've removed the decimal point too. Let's put it back in.
local$CutPerHead <- as.numeric(local$CutPerHead)
local$CutPerHead <- local$CutPerHead/100
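
As an aside, you can avoid the decimal point detour by keeping the dot in the first pass; a one-step alternative sketch, assuming the £ sign and the minus are the only non-numeric characters in the column:

# keep digits and the decimal point, strip everything else, then convert
local$CutPerHead <- as.numeric(gsub("[^0-9.]", "", local$CutPerHead))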

Wednesday 18 April 2012

The 11th London Mayoral Twitter Poll







Click on pictures to enlarge

Poll findings
1) Oh Ken, what are Labour going to do with you? You were recovering nicely from your tax troubles and then what do you do? Make the point that you thought a mass murdering terrorist who has already been shot shouldn't have been shot. I'm guessing Ken missed the class at professional politician school on message discipline. It's cost him a bad day on Twitter, and he really doesn't have a lot of time to be having too many of those. I think I'm going back to predicting the race leans Boris.

2) Boris didn't have a brilliant day. His negatives went up, but with his main rival less popular than him, Boris will be the one taking comfort from today's poll. If there are as many people who love Boris as there are who want to do terrible things to him because their transport is late, then he could be a lot worse off.

3) No real change in Brian and Jenny's figures today. I suggest one or both of them have a stand-up blazing row with Boris, with swearing and passion and loads of press conveniently there to capture the moment for posterity. At least people would be talking about them. Will it help them win? Probably not, but what have they got to lose?

4) Benita wins the Mary Poppins award AGAIN for the best positive rating and the lowest negative. Good coverage on Twitter. I noticed her followers had gone up by over 4,000. Then I thought: to get 5% in this race Benita will need about 200,000 votes. 200,000 votes for a poxy 5%! Yes, I know that's rather a lot even if you're getting coverage in the national press and have lots of Twitter activity. Not impossible, but that just illustrates the reality of this massive election.


Results


    Candidate Pos11 Neut11 Neg11 Tot11 Pospercent11 Negpercent11
1    BorisCON   118    305   247   670           18           37
2      KenLAB   154    403   421   978           16           43
3 BrianLIBDEM    53    103    52   208           25           25
4  JennyGREEN    84    150    95   329           26           29
5  SiobhanIND   170    198    70   438           39           16
6   CarlosBNP     2     15     8    25            8           32
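
Incidentally, the percentage columns are just the positive and negative counts as a share of each candidate's total volume, rounded to the nearest whole number; a sketch of the calculation, assuming the table above sits in a data frame I'm calling polls:

> polls$Pospercent11 <- round(100 * polls$Pos11 / polls$Tot11)
> polls$Negpercent11 <- round(100 * polls$Neg11 / polls$Tot11)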

Tuesday 17 April 2012

The 10th London Mayoral Twitter Poll




Click on pictures to enlarge

Poll findings
1) The sentiment ratings aren't being very helpful today, as all the candidates really aren't that far apart. But look at the negative poll graph: what aren't you seeing that has been a solid feature since I started polling? Yes, that massive Ken Livingstone column in the negative poll. In fact, for the last few days his negative ratings have been quite respectable. Not a clear election-winning advantage, but Ken'll just be glad he's not in a position where the sentiment analysis clearly puts Boris in the lead. I think the EMA stuff helped Ken today. No, I can't believe he's top of the positive ratings either.

2) Boris is treading along nicely. He'll not be too happy that Livingstone's volume went up a lot more than his today. I think the election is still a toss-up. It's leaned more to Boris over time, but Ken's better ratings over the last few days may give the Boris campaign cause for concern that they haven't yet sealed the deal.

3) Jenny and Siobhan are both down a little on volume, so that late surge to lift them off the 2% rating they got from YouGov has been delayed somewhat, it seems. Jenny will be happy her negatives have gone down too. Not a bad day for Brian at all: his positives not far off doubled and his volume was up a touch as well.

4) UKIP and the BNP failed to make the cut. AGAIN.


Results

   Candidate Pos10 Neut10 Neg10 Tot10 Pospercent10 Negpercent10
1    BorisCON   195    349   176   720           27           24
2      KenLAB   468    448   304  1220           38           25
3 BrianLIBDEM    86    122    70   278           31           25
4  JennyGREEN    49     89    42   180           27           23
5  SiobhanIND    92    151    53   296           31           18

Twitter vs YouGov



I've never said that polling Twitter is going to be a total replacement for traditional polling, far from it, but I do think it could be a useful addition in certain circumstances. So I was interested to see how yesterday's Twitter poll compared with the YouGov poll on the London Mayoral election that came out yesterday. I tested the correlation between the two and it came back at 89%, quite a bit higher than I was expecting.


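# a: the YouGov percentages; b: the Twitter poll figures for the same candidates (my labelling - the post doesn't name them)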
> a
[1] 45 40  7  2  2  1  3
> b
[1] 875 544 235 203 348  52   0
> cor(a,b)
[1] 0.8948858