SQL Text Analysis with Donald Trump’s Tweets

Want to make data analysis fast for everyone?

Join Us!

There were many unique aspects of the 2016 U.S. presidential election, with one of the most interesting being Donald Trump’s use of Twitter. As luck would have it, tweets happen to be very easy to extract and put into relational database format. Being the data junkies that we are, we couldn’t resist downloading President Trump’s entire tweet history and loading it into Periscope Data. It’s an excellent opportunity to slice, analyze, and quantify presidential communications while also exploring methods for analyzing text in SQL at the same time.

So let’s get started.

The Data Set

First, we formatted Trump’s tweets into 5 columns of data:

  • Tweet Text
  • Tweet Date
  • Favorites
  • Likes
  • Tweet ID

With these columns it was easy to do some quick exploration. @realDonaldTrump tweeted for the first time on May 4, 2009 with a tweet he almost certainly didn’t write.

select
   tweet_text as first_tweet
from
  trump_tweets
order by tweet_date 
    limit 1

Around July of 2011 Trump’s tweet volume quickly escalated. It’s not clear what triggered this increase in activity, but based on his first tweet a reasonable guess is that Trump began tweeting himself, either in addition to or lieu of staff. In total, Donald Trump has tweeted over 34,000 times. At his peak he was tweeting over 30 times per day. A remarkable number by any measure.

A look at Trump’s tweet frequency over time looks like this:

select
 month(tweet_date) as mnth
  , count(*)
from
  trump_tweets
group by 1

Trumps Monthly Tweet Frequency

Keyword Frequency: A History of MAGA

Trump first used the phrase ‘Make America Great Again’ long before his announcement to run for President in the 2016 election as a Republican candidate. He filed papers in Iowa in 2012 for the “Make America Great Again” Party to begin a run as an independent candidate. His first use of the phrase on Twitter:

His use of the phrase has obviously increased since. If we sum the number of tweets containing the words ‘Make America Great Again’ or ‘MAGA’, and divide that number by the count of all tweets, we can easily define Trump’s MAGA rate. With a simple query we can look at the MAGA rate over time. Trump’s MAGA rate has a slightly bimodal distribution centered first around the announcement of his candidacy in 2015, then more strongly after his victory in in the presidential election. Trump held an astounding peak monthly MAGA rate of 14% in September 2016.

-- Note the use of ilike and like. Ilike compares all strings
-- as lowercase, while like is case-sensitive. I don’t want to
-- count magazines.
-- Here we will use a float conversion to avoid integer division.
-- This is Redshift specific. Adjust accordingly.
select
  month(tweet_date) mnth
  , sum(
     case when tweet_text ilike '%make america great again%' or
     tweet_text like '%MAGA%' then 1
          else 0
     end
  )
  / count(*)::float as maga_rate
from
  trump_tweets
group by
  1

Trump's Monthly MAGA Rate

Sentiment: The Anger and Joy of Donald J. Trump

Given Trump’s prolific use of Twitter leading up to and throughout the campaign, we can apply a quantitative approach to measuring the tone and sentiment of his campaign with text analysis.

SQL is not the first tool people think of when doing text analytics, but it has some interesting advantages. It’s fast, well integrated with existing ETL processes, and easy to build into regular reports. We will use a simple but effective approach, by relating individual words to sentiment scores to get a quick read on the emotional state of DJT.

This technique we’re using was explored in the paper “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Using Twitter as a foundation, Finn Årup Nielsen from the Technical University of Denmark (DTU) scored 2,477 words from -5 to 5 based on their perceived sentiment, with -5 being a word linked to a strongly negative emotional content. By joining the words in Trump’s tweets to this list, we can quickly get a per-tweet sentiment score. The data set for this topic is published by the department of Informatics and Mathematical Modelling at DTU.

The data is in the form:

Scored Sentiment Words

In order to begin this analysis, the tweets must be broken down into individuals words to join to our sentiment table. The method to do this is outlined in our blog post about NPS and survey analytics.

Once we have a resulting table of individual words, we can begin joining our tweets to our sentiment scores.

Scored words per Tweet

With a few simple queries we can find Trump’s most celebratory and antagonistic tweets. First we sum the sentiment scores by tweet.

His most positive:

with
  sentiment as (
    select
     tweet_text
      , tweet_date
      , sum(sentiment_score) 
      as sentiment_sum
      , count(*)::float as words
    from trump_words
    group by 1,2
  )

  select
    tweet_text
    , tweet_date
    , sentiment_sum
  from
    sentiment
  where
    sentiment_sum is not null
  order by sentiment_sum desc
  Limit 10

Trump's Most Positive Tweets

And his most negative:

Trump's Most Negative Tweets

These are Trump’s most emotionally charged tweets, but since we have summed the sentiment scores we are biased towards longer phrases. Simple declarations such as “Great!” or “Sad!” convey a lot of emotional content without summing to large numbers. As we dig a little deeper we will divide the summed scores by the number of words in the Tweet to gauge the net sentiment.

Let’s look at Trump’s sentiment over time. This has been normalized to his average sentiment over the data history so that it is centered on 0. This is done because the sentiment data has little absolute meaning without a reference point.

with
  monthly_sentiment as (
    select
      month(tweet_date) as mnth
      , tweet_text
      , sum(sentiment_score) / max(num_words)::float
      as sentiment_score
    from
      trump_words
    group by 1, 2
  )

, avg_sentiment as (
    select
       avg(sentiment_score) as sentiment_avg
    from
      trump_words
  )

select
  mnth
  , avg(sentiment_score) - sentiment_avg as sentiment 
from
  monthly_sentiment
-- I want the average sentiment available for every row. 
-- You could use a window function as well. 
left join 
  avg_sentiment on true

Trump's Monthly Sentiment

As can be seen, Trump had some serious highs and lows before 2013. The highs are often associated with the Miss USA and Miss Universe pageants, or were tweets written by people (we suspect) other than Trump. The very low scores in 2011 - 2012 are mostly political in nature. Let’s take a closer look at some of these:

Tweets from Trump's Angry Years.

Most worrying though, Trump has only become more negative over the past 4 years, especially from 2015 and onwards. Zooming in on the period of the election we can see that Trump generally trended downwards, exhibiting increasing negativity over the course of his Presidential campaign. There were some high points, such as the week of his victory in November, but the overall sentiment of Trump’s tweets continues to decline. We can easily summarize Trump’s tweets with a single word: Sad!

Trump's Weekly Sentiment over 2016 Presidential Campaign

Sentiment by Time of Day: Angry, Early Morning Tweets

Another interesting perspective is Trump’s sentiment as a function of the hour of day. Much has been made of his proclivity for early morning tweeting and what the ramifications of this might be. With a bit of SQL, we can look at this quickly and easily. We use the same query as above, continuing to limit our timeframe to the presidential campaign, but we use a 3 hour rolling average and use an extract, like so:

with
  monthly_sentiment as (
    select
      extract(hour from (tweet_date)::timestamp) as hr
      , tweet_text
      , sum(sentiment_score) / max(num_words)::float
      as sentiment_score
    from
      Trump_words
    where date >= '2015-06-16'
    group by 1, 2
  )

, avg_sentiment as (
    select
       avg(sentiment_score) as sentiment_avg
    from
      trump_words
  )

select
  hr
  , count(*) as tweets
  , avg(avg(sentiment_score)) over(order by hr rows between 1 
  preceding and 1 following) - sentiment_avg as sentiment
from
  monthly_sentiment
left join 
  avg_sentiment on true
group by 
  1, sentiment_avg

Trump's Sentiment by Time of Day

He’s at his most negative in the morning. On average, between the hours of 4 am and 7 am, Trump unleashes a high volume of negative tweets. Our data does reveal a counterbalance to this finding. Trump’s tweets become more positive throughout the day and this positive sentiment peaks at night, along with his overall tweet volume.

Wrap Up

Many companies have an abundance of text data, but don’t have a strategy for fully making use of it. After a few simple joins in SQL, keyword frequency and sentiment analysis are two excellent methods for gaining context and insight into text data. Both are broadly applicable beyond tweets. Survey data, comments, competitive information, or web copy can all be explored with the methods discussed above.

We hope our advice conducting text analysis in SQL was helpful. If you’d like a CSV file containing the tweet or sentiment dataset we used for this post, please email us at hello@periscopedata.com and we’ll send it your way.

Happy Periscoping!

Thank you