The presidential debates are supposed to give the audience an understanding of the most controversial issues and of how each candidate plans to deal with them as president. When Bill Clinton ran for president in 1992, his campaign coined the phrase “It’s the economy, stupid.” Twenty-four years later, the format of presidential debates hasn’t changed much, but the tone has changed dramatically. Some might proudly say the debates have become politically incorrect; others find the level of aggression disturbing. In this post, I analyze the transcripts of the three presidential debates and the vice presidential debate and show that they are no longer about the controversial issues but mainly about the candidates in the race.
The first debate between Donald Trump and Hillary Clinton covered, among other issues, jobs, transparency, and race in America. The following analysis covers all debates; please contact me for a detailed analysis of each one. To better assess how much each candidate said, I divided the debates into speech events: stretches in which a candidate spoke without being interrupted. A dialog consists of several speech events following each other. The following table shows length statistics for both candidates, with length measured in number of characters.
The time both candidates spoke was more or less equal, so it seems that Trump talks a bit faster than Clinton, although other factors might play a role as well. The mean lengths are almost identical, but Trump had 37% more of the so-called speech events.
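The split into speech events can be sketched in a few lines of Python. This is only an illustration, not the DKPro Core pipeline used for the post: it assumes a hypothetical transcript format in which every turn starts with an upper-case speaker label followed by a colon, and the names `speech_events` and `length_stats` are made up for this sketch.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed (hypothetical) transcript format: a turn starts with "SPEAKER: text".
TURN = re.compile(r"^([A-Z]+):\s*(.*)$")

def speech_events(transcript):
    """Group a transcript into uninterrupted turns per speaker."""
    events = defaultdict(list)
    speaker = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN.match(line)
        if match:
            # A speaker label opens a new speech event.
            speaker, text = match.groups()
            events[speaker].append(text)
        elif speaker is not None:
            # Unlabelled lines continue the current speech event.
            events[speaker][-1] = (events[speaker][-1] + " " + line).strip()
    return dict(events)

def length_stats(events):
    """Length statistics per speaker, measured in characters."""
    stats = {}
    for speaker, turns in events.items():
        lengths = [len(turn) for turn in turns]
        stats[speaker] = {
            "events": len(lengths),
            "total": sum(lengths),
            "mean": mean(lengths),
            "max": max(lengths),
        }
    return stats

# Made-up sample, not taken from the debate transcripts.
sample = "A: First point.\nB: I disagree.\nStill talking here.\nA: Second point."
print(length_stats(speech_events(sample)))
```

Comparing the `events` and `mean` values per speaker gives the kind of numbers discussed above: how often a candidate spoke and how long each uninterrupted stretch was.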
The following table gives insights about the lexical diversity of the candidates’ speech events and the average length of words and sentences to compare both candidates’ speeches on a more linguistic level.
| Measure | Trump | Clinton | Pence | Kaine |
|---|---|---|---|---|
| Number of words | 26,616 | 21,801 | 9,186 | 9,234 |
| Average word length | 3.66 | 3.90 | 4.00 | 3.98 |
| Average sentence length | 13.5 | 18.8 | 19.59 | 16.58 |
Donald Trump said more words than Hillary Clinton, as he had more speech events. However, his words and sentences were shorter than Clinton’s. On average, he uses about five fewer words per sentence (13.5 versus 18.8), which is a good indicator of less complex language. Lexical diversity measures how many different words a speaker uses relative to the total number of words. A lexical diversity of 1 means that every word is used only once; a lower lexical diversity means that words are reused more often. There is a significant difference between the lexical diversity of Trump’s and Clinton’s transcripts: Clinton has a higher coverage of words and repeats fewer words. In the next section, readability measures are used to further analyze how complex each candidate’s language is.
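For readers who want to reproduce these measures, here is a minimal Python sketch. It assumes lexical diversity is the type-token ratio (distinct words divided by total words), which matches the description above, and it uses a naive regular-expression tokenizer rather than the DKPro Core tokenizer used for the post, so the numbers will not match the table exactly. The function name `lexical_stats` is hypothetical.

```python
import re

WORD = re.compile(r"[A-Za-z']+")      # naive word tokenizer (assumption)
SENTENCE_END = re.compile(r"[.!?]+")  # naive sentence splitter (assumption)

def lexical_stats(text):
    """Word count, type-token ratio, and average word and sentence length."""
    words = [w.lower() for w in WORD.findall(text)]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "words": len(words),
        # 1.0 means every word occurs exactly once; lower means more reuse.
        "lexical_diversity": len(set(words)) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }

# Made-up sample text, not a quote from the debates.
print(lexical_stats("We win. We win big."))
```

Note that the type-token ratio tends to drop as a text gets longer, so it is best compared between transcripts of similar size.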
A standard way to compare two texts is to compute their readability. There are plenty of readability measures available, and each of them has its strengths and weaknesses. Hence, I decided to compare the two candidates according to a range of five readability measures. Keep in mind that readability highly depends on the type of text: transcripts of speeches usually score lower than written text. The higher the value, the more complicated the text is. Some map the value of a readability measure to a school grade level. This mapping is not entirely correct but makes the value much easier to interpret.
The radar graphic shows that the complexity of Clinton’s speeches is a bit higher than that of Trump’s across all measures. The thinner line in the center shows the readability of Shakespeare’s “Romeo and Juliet” for comparison. In a previous version of the graphic, I added Thomas More’s “Utopia”, but it was basically off the scale. This finding supports the analysis in the previous section that Trump’s speech is indeed simpler than Clinton’s.
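To give an idea of how such a measure is computed, the sketch below implements the Automated Readability Index (ARI), one widely used formula. Whether ARI is among the five measures in the radar graphic is an assumption; it is chosen here only because it needs nothing but character, word, and sentence counts. The tokenization is deliberately naive, and the function name is hypothetical.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text contains no words")
    # Count only letters and digits as characters.
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

# Made-up samples: short words and sentences score lower than long ones.
simple = "We win. We win big."
dense = "Comprehensive immigration legislation requires bipartisan congressional cooperation."
print(automated_readability_index(simple) < automated_readability_index(dense))
```

As with the other measures, a higher value indicates more complex text, and very short samples like the ones above can produce values far outside the usual grade-level range.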
It’s all just words
Let’s have a closer look at what both candidates said. To be exact: which words do they use? It’s common knowledge that everybody has specific words they use very often, and that the distribution of word frequencies is unique and rather stable for every person. In the following analysis, I show which words are used very frequently. It’s not the first time this has been done for the debate; NBC News has published another article about it. The numbers presented here might differ because I used a different list of stopwords (which were not considered) and DKPro Core for tokenizing the text. The following table shows the ranked list of the most frequent words for each candidate:
| Rank | Trump | Clinton | Pence | Kaine |
|---|---|---|---|---|
| 1 | I (622) | I (538) | I (181) | I (134) |
| 2 | we (163) | people (112) | Clinton (61) | Trump (83) |
| 3 | people (136) | think (110) | Hillary (51) | Donald (72) |
| 4 | but (58) | Donald (91) | Trump (50) | Hillary (45) |
| 5 | country (117) | we (82) | Donald (47) | Governor (38) |
| 6 | look (102) | going (75) | going (43) | but (36) |
| 7 | know (95) | want (77) | Senator (45) | Clinton (35) |
| 8 | you (93) | know (73) | we (40) | Pence (33) |
| 9 | think (86) | but (71) | American (40) | he (31) |
| 10 | say (82) | well (71) | people (38) | want (26) |
The table shows the word frequencies after removing all stopwords. The absolute values are not entirely comparable as Trump spoke roughly 20% more words (10,690 to 8,936). There are some words both candidates like to use frequently, e.g. “people”: Trump uses it 136 times and Clinton 112 times. This is fairly normal for political debates since candidates want to address the nation. Both candidates use “I” most often; however, Donald Trump uses it much more often than Hillary Clinton. This is in line with the image of the successful businessman he wants to project. Interestingly, “Donald” is Clinton’s fourth most frequent non-stopword. When I first watched the debate, I was surprised that Clinton called her opponent “Donald”, while Trump called her “Secretary Clinton”. Thus, I counted frequencies of specific words, shown in the following table:
Obviously, neither candidate uses their own name very often. Hillary Clinton avoids using Trump’s last name. I’d be happy to hear an explanation for this. I can think of several reasons: (i) avoiding the strong brand name “Trump”, (ii) being more friendly and colloquial, or (iii) establishing a kind of hierarchy. What do you think?
There is another imbalance in the usage of the word “wrong”. It has been noticed that Trump interrupted Clinton several times, very often to show his disagreement. This explains his frequent usage of the word “wrong”. I could not resist analyzing two more words: “tremendous”, which I had rarely heard before the rise of Trump, and the recurring notion that something is “great”. Trump seems to be rather fond of both words, as he used “great” 70 times and “tremendous” 32 times. Clinton, on the other hand, never used “tremendous” and used “great” only 27 times.
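The counts in this section boil down to tokenizing, dropping stopwords, and counting. The Python sketch below shows the idea with a tiny illustrative stopword list and a regular-expression tokenizer; both are assumptions, since the post uses DKPro Core and a stopword list that is not reproduced here. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the list used for the post).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_counts(text, stopwords=STOPWORDS):
    """Count lower-cased tokens, ignoring stopwords."""
    tokens = (t.lower() for t in re.findall(r"[A-Za-z']+", text))
    return Counter(t for t in tokens if t not in stopwords)

def top_words(text, n=10, stopwords=STOPWORDS):
    """Ranked list of the n most frequent non-stopwords."""
    return word_counts(text, stopwords).most_common(n)

# Made-up sample text, not a quote from the debates.
sample = "The people want jobs and the people want answers. It is great, really great."
print(top_words(sample, n=3))
print({w: word_counts(sample)[w] for w in ("great", "tremendous", "wrong")})
```

A `Counter` returns 0 for words that never occur, which is convenient for comparisons such as “tremendous” between two speakers.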
Next level of analytics: entities and categories
Let’s take the analysis to the next level and analyze what both candidates have been talking about. Usually, analyzing entities requires named entity recognition. I use the Natural Language Understanding API by Ambiverse to not only detect entities, but also disambiguate them. Both the phrases “Secretary Clinton” and “Hillary” will be identified as the entity “Hillary Clinton”, which avoids counting synonyms separately. Ambiverse also offers an API client for Java, which makes it extremely easy to integrate it into a language processing pipeline.
The word cloud is generated by extracting entities with the Natural Language Understanding API and visualizing them with Kumo. The blue-colored words are those entities mainly used by Clinton and the red-colored words are those mainly used by Trump.
Counting entities 2.0
One of the major issues when counting words is that two different words can refer to essentially the same entity. Two candidates talking about Barack Obama could say President Obama, the current President, or just Barack. With Ambiverse technology, it is possible to disambiguate the entire text and arrive at more accurate numbers about which entities are mentioned. The following table lists the top-10 entities mentioned by the candidates.
| Rank | Trump | Clinton | Pence | Kaine |
|---|---|---|---|---|
| 1 | Hillary Clinton (57) | United States (92) | United States (90) | Donald Trump (82) |
| 2 | United States (52) | Donald Trump (91) | Hillary Clinton (48) | Hillary Clinton (48) |
| 3 | Barack Obama (39) | ISIS (19) | Donald Trump (47) | United States (32) |
| 4 | Russia (34) | Syria (19) | Russia (20) | Mike Pence (28) |
| 5 | Iran (27) | Iraq (17) | Tim Kaine (17) | Russia (20) |
| 6 | Mosul (26) | Supreme Court (16) | Iran (16) | Vladimir Putin (19) |
| 7 | ISIS (25) | Russia (15) | Syria (14) | Elaine Chao (10) |
| 8 | Iraq (20) | Islam (14) | Barack Obama (12) | Virginia (8) |
| 9 | Bill Clinton (19) | Iran (14) | Indiana (10) | China (7) |
| 10 | China (15) | Barack Obama (13) | Clinton Foundation (8) | Roe v. Wade (7) |
For all four candidates, the United States is among the top-3 most frequently mentioned entities. Obviously, they are all very patriotic, but they might not be as positive about their opponents, who are among the two most frequent entities. For both Trump and Clinton, one of the major topics in this election is the conflict with ISIS in Iraq and Syria. Russia and China also seem to play a major role in this election, so let’s have a closer look at the geography.
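The difference between counting words and counting entities can be illustrated with a small Python sketch. It does not call the Ambiverse API; instead it uses a hand-written alias table (`ALIASES`, a hypothetical stand-in) that maps surface forms to one canonical entity. A real disambiguation service also uses context, for example to decide whether “Clinton” refers to Hillary or Bill Clinton, which this sketch cannot do.

```python
import re
from collections import Counter

# Hand-written alias table (assumption); a stand-in for real entity disambiguation.
ALIASES = {
    "Hillary Clinton": ["Secretary Clinton", "Hillary Clinton", "Hillary"],
    "Barack Obama": ["President Obama", "Barack Obama", "Obama"],
}

def count_entities(text, aliases=ALIASES):
    """Count mentions per canonical entity, merging all known surface forms."""
    surface_to_entity = {
        surface: entity for entity, forms in aliases.items() for surface in forms
    }
    # Try longer surface forms first so "President Obama" is matched as a whole
    # and not counted a second time as "Obama".
    ordered = sorted(surface_to_entity, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(s) for s in ordered) + r")\b")
    return Counter(surface_to_entity[m.group(1)] for m in pattern.finditer(text))

# Made-up sample text, not a quote from the debates.
sample = "Secretary Clinton spoke. Hillary agreed with President Obama, and Obama nodded."
print(count_entities(sample))
```

A plain word count would report “Hillary”, “Clinton”, and “Obama” separately; merging them per entity is what makes the numbers in the table above comparable across candidates.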
The US presidential race is not only about America. It heavily affects countries all over the world. There have been rather strong reactions in the stock market whenever results from a poll were published. Obviously, it makes a difference for Mexico whether a wall will be built, but the election also has a strong effect on countries in Europe and Asia since they are currently discussing trade agreements. In the following graphic, the countries of the world are colored based on how frequently they were mentioned by any of the four candidates. (For a visualization of each candidate’s mentioned countries check here.)
As expected, the United States is mentioned most frequently, followed by Russia, Iran, Syria, and Iraq. China and Mexico are still mentioned quite often but in the end don’t seem to play such a critical role. Africa and Europe seem to play a rather unimportant role in this election and, rather surprisingly, apart from Mexico no country in Latin America was mentioned even once.
Sources on GitHub
All the code used for this analysis is provided open-source (ASL) on GitHub. Feel free to fork it, share it, or use it for other debates.
The software used for computing the presented statistics will soon be published as an open-source project on GitHub. It is based on DKPro Core, Kumo by Kenny Cason, and the Natural Language Understanding API by Ambiverse.
Featured image CC BY 2.0 by DonkeyHotey.