June 9, 2016

EU Referendum from the Linguistic Perspective

The United Kingdom European Union membership referendum, a.k.a. the EU referendum, is scheduled for 23 June 2016. I have the pleasure of being in the UK for all the news broadcasts, endless debates, and friends arguing about whether the UK is better off without the EU.

Obviously, as a European I do have an opinion about the referendum, but there are already more opinions about it than UK citizens. Hence, I decided to take a closer look at the debates, especially at two politicians who constantly appear on television warning of the dangers of either outcome. On the one side, there is the current PM David Cameron, the leader of the Conservative Party. On the other side, there is the former mayor of London Boris Johnson, also a member of the Conservative Party. Both belong to the same party and went to the same school, but their opinions could not be more different. I asked myself: how does their political agenda change their language? Does their use of language reflect their background or rather the campaign they are supporting?

I searched the Internet for speeches they have given about the EU referendum. It was much easier to find transcripts of Cameron's speeches than of Johnson's. I ended up adding edited interviews and essays, which may make some of the conclusions a bit shaky. The total corpus comprises 28,960 words from Cameron and 12,300 words from Johnson. The corpus is not huge, but it might still shed some light on the debates and help readers deal with the facts.

Word Usage

Let’s start by having a look at the word usage of both politicians. The following two word clouds show the most frequent words they use. Obviously, both talk a lot about Europe and Britain. A closer look reveals that Boris Johnson quite often mentions words like elite, control, and money. Cameron more often mentions words like trade, poverty, and war.


A word cloud is a great tool for getting an overview of the underlying text, but in the next few days I want to dig deeper and analyse how these politicians talk and which entities they talk about.
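For readers who want to try this at home, the counting step behind a word cloud can be sketched in a few lines of Python. This is only an illustration and not the tooling used for the graphics above: the stop-word list is a tiny, made-up one, the function name `top_words` is hypothetical, and the sample sentence is invented rather than quoted from either politician.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real analysis needs a fuller one).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we", "our", "that", "it", "for"}

def top_words(text, n=10):
    """Return the n most frequent non-stop-words in text as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample sentence, not taken from the corpus.
sample = "We want control of our money and control of our borders."
print(top_words(sample, 3))
```

A word cloud then simply scales each word by its count, so the choice of stop words has a large influence on what ends up looking important.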


A standard way of comparing two texts is to compute their readability. There are plenty of readability measures available and each of them has its strengths and weaknesses. Hence, I decided to compare the two politicians according to a range of five readability measures. Keep in mind that readability depends heavily on the type of text. Written text typically has a much higher complexity than transcripts of speeches. Since I mixed different text types, the results are not entirely reliable.

Readability of politicians

The radar graphic shows that the complexity of Johnson’s speeches is a bit higher than that of Cameron’s across all measures. The thinner line in the center shows the readability of Shakespeare’s “Romeo and Juliet” for comparison. In a previous version of the graphic, I added Thomas More’s “Utopia”, but it was basically off the scale.

The conclusion regarding readability is that the two politicians are not too far apart. We still need to dig deeper and analyse whom they are talking about.
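To give an idea of what such a measure looks like, here is a minimal sketch of one well-known readability score, the Flesch Reading Ease. Whether it is among the five measures in the graphic is an assumption on my part for the sake of the example. The syllable counter is a crude vowel-group heuristic, and the function names are hypothetical. Higher scores mean easier text.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015 * words/sentences - 84.6 * syllables/words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Invented sample texts, not taken from the corpus.
simple = "The cat sat on the mat. It was warm."
dense = "Constitutional sovereignty necessitates comprehensive parliamentary deliberation."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```

Because the score only looks at sentence length and syllables per word, it says nothing about content, which is one reason to compare several measures rather than trust a single one.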

About whom are they talking

So far, the analysis was based on the number of sentences, words and syllables. It’s time to get a better understanding of what entities (people, locations, organizations, …) both politicians talk about. Ambiverse provides an API to extract all entities from any text. The following table shows the top-20 most frequently used entities by David Cameron and Boris Johnson. Evidently, both mention the European Union (2nd most) and several countries. Many of the mentioned countries are European ones; some within the EU (Germany and France), others outside the EU (Norway and Switzerland). Interestingly, David Cameron accounts for about one in every 128 entities Johnson mentions, while the other way around, Johnson accounts for only roughly one in 1,200 of the entities Cameron mentions.

Rank, Cameron, Johnson
1, United Kingdom (22.82%), Germany (16.57%)
2, European Union (18.62%), European Union (14.23%)
3, Europe (8.89%), United Kingdom (11.89%)
4, Germany (8.39%), Europe (11.31%)
5, United States (3.36%), Brussels (3.51%)
6, England (2.10%), France (3.51%)
7, NATO (1.43%), United States (2.53%)
8, Syria (1.17%), European Economic Community (1.95%)
9, Norway (0.84%), Greece (1.75%)
10, Brussels (0.84%), London (1.56%)
11, Labour Party (0.67%), Robert Schuman (1.56%)
12, India (0.67%), Jean Monnet (1.17%)
13, China (0.67%), China (1.17%)
14, Iran (0.67%), India (1.17%)
15, Conservative Party (0.67%), European Court of Justice (1.17%)
16, France (0.50%), European Parliament (0.97%)
17, European Parliament (0.50%), Trade Descriptions Act 1968 (0.78%)
18, Parliament of the United Kingdom (0.50%), National Health Service (0.78%)
19, Switzerland (0.50%), NATO (0.78%)
20, European Council (0.50%), David Cameron (0.78%)
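The percentages in the table are shares of all entity mentions per politician. Assuming the entity extraction step has already produced a flat list of entity names (the list below is a toy example, and `entity_shares` is a hypothetical helper, not part of any API), the shares could be computed roughly like this:

```python
from collections import Counter

def entity_shares(mentions):
    """Map each entity to its percentage of all mentions, most frequent first."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return [(entity, 100.0 * count / total) for entity, count in counts.most_common()]

# Toy list of extracted entity mentions (invented for illustration).
mentions = ["United Kingdom", "European Union", "United Kingdom", "Germany"]
for entity, share in entity_shares(mentions):
    print(f"{entity}: {share:.2f}%")

# A share of 0.78% corresponds to about one mention in every 100 / 0.78 entities.
print(round(100 / 0.78))  # 128
```

The last line shows where the "one in every 128 entities" figure for David Cameron in Johnson's texts comes from.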

As most of the mentioned entities are actually countries, it might be worth having a closer look at them. The following table shows the top-5 mentioned countries, each with its share of all country mentions. Surprisingly, the country with the highest frequency in Johnson’s speeches is Germany and not the United Kingdom. To be fair, the frequency does not account for coreference. It may thus be that Boris Johnson uses other expressions instead of saying “United Kingdom”.

Country, Cameron, Johnson
United Kingdom, Rank 1 (35.10%), Rank 2 (20.40%)
Germany, Rank 3 (12.90%), Rank 1 (28.43%)
Europe, Rank 2 (13.68%), Rank 3 (19.40%)
United States, Rank 4 (5.16%), Rank 5 (4.35%)
France, Rank 10 (1.34%), Rank 4 (6.02%)
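The two tables are related by a simple renormalisation: a country's share among countries is its share among all entities divided by the fraction of entity mentions that are countries. As a back-of-the-envelope check, and not a figure reported anywhere above, that fraction can be estimated from Johnson's numbers in both tables:

```python
# (share of all entities in %, share among countries in %) for Johnson, from the two tables.
johnson = {
    "Germany": (16.57, 28.43),
    "United Kingdom": (11.89, 20.40),
    "Europe": (11.31, 19.40),
}

# Each ratio estimates the fraction of Johnson's entity mentions that are countries.
ratios = {name: all_pct / country_pct for name, (all_pct, country_pct) in johnson.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

All three ratios come out at roughly 0.58, which suggests that a bit under 60% of the entities in Johnson's texts are counted as countries.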

A look at where the mentioned countries are located shows that most of them are indeed in Europe. The following two graphics highlight the mentioned countries. The stronger the colour, the more often a country was mentioned:

The world map of Cameron looks as follows:

Johnson’s world map does not look much different:

As discussed before, Germany takes a darker colour on Johnson’s map, but overall it seems that they both share a similar opinion about which countries to mention in their speeches.

Both alike in dignity?

Although I extended the statistical analysis by looking at the mentioned entities, there is not much of a difference between Cameron’s and Johnson’s use of language. It might be that a bigger corpus would reveal more about their specific styles. So far, it seems that both politicians have quite a lot in common, not only their education.

I wonder whether there will be more differences in speeches from opposing politicians in other countries. Aren’t there elections coming up next in the United States?


The software used for computing the presented statistics will soon be published as an open-source project on GitHub. It is based on DKPro Core, Kumo from Kenny Cason, and the Natural Language Understanding API from Ambiverse.
