Net Promoter Score and Survey Analysis in SQL


Introduced by Fred Reichheld in 2003, Net Promoter Score (NPS) is a simple method for measuring the likelihood your customers will recommend your product or service. It requires that you survey respondents and ask one simple question:

“How likely are you to recommend [Company/Product/Service] to a friend or colleague?”

Typically respondents are presented with a scale from 0 to 10 to answer this question. Here’s how the NPS question might appear on one of our own customer surveys:

[Image: the NPS question, answered on a 0-10 scale]

After collecting results, calculating NPS is easy. A response of 0 through 6 categorizes a respondent as a ‘Detractor’, a 7 or 8 as a ‘Passive’, and a 9 or 10 as a ‘Promoter’. Your Net Promoter Score measures the balance between Promoters and Detractors: subtract the number of Detractors from the number of Promoters, divide the result by the total number of survey responses, and multiply by 100. NPS has a range of -100 to 100. For example, 50 Promoters and 20 Detractors out of 100 responses give an NPS of (50 - 20) / 100 * 100 = 30.

That’s a little wordy; here’s how we like to think about it:

(promoters - detractors) / count(*) * 100 as nps

This gives us a clear quantitative measure of the likelihood to recommend, but what about the “why?” Most NPS surveys also collect qualitative data to pair with the quantitative measure, typically with an open-ended follow-up along the lines of “What is the most important reason for your score?”

The majority of the Fortune 500 measure NPS for a very good reason: moving people from Detractors and Passives to Promoters improves business outcomes. It provides a simple mechanism for companies to bring the voice of the customer into product and business decisions, leading to reduced churn and increased customer satisfaction.

In this post we’re going to talk about how you can analyze numeric and text NPS results using SQL. The latter can be especially difficult - text analytics software is expensive, often difficult to implement in existing ETL processes, and can be challenging to understand. We’ll provide clear approaches to successfully tackling both.

NPS at the Hotel California

We collected 2,700 hotel reviews for the Hotel California from TripAdvisor for our sample dataset. Each review rates the hotel between 1 and 5. We map these reviews into NPS terminology by treating a 5 as a Promoter, a 4 as a Passive, and a 1 through 3 as a Detractor. The scale is a bit different from above, but the same principles apply.

The table we will work from is called hotel_reviews and has 4 columns: name, date, verbatim (the free-text review), and review (the 1-5 score).
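For reference, here is a minimal sketch of the table’s shape; the column types are assumptions on our part, not taken from the original dataset:

-- Assumed shape of the hotel_reviews table (types are guesses).
create table hotel_reviews (
  name     varchar(256),   -- reviewer name
  date     date,           -- review date
  verbatim varchar(65535), -- free-text review
  review   integer         -- 1-5 score
);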

Simple Charts

The first thing you need to do is group your data by review score and take a look. All of our further analysis will be based on these segments, but we’re analysts and we like raw data.

select
  review
  , count(*) as reviews
from
  hotel_reviews
group by 1
order by 1

And voila:

[Chart: count of reviews at each score]

We can see that our hotel has pretty good scores! Our NPS will certainly be high, but to obtain the final score we’ll need to bucket our data into the NPS segments - Promoters, Passives, and Detractors. We can see our segments with a case statement:

select
  case
    when review = 5 then 'Promoter'
    when review = 4 then 'Passive'
    when review in (1,2,3) then 'Detractor'
    else 'Error'
  end as np_segment
  , count(*) as reviews
from
  hotel_reviews
group by 1

Calculating our NPS requires a little math, all of it simple to do in SQL.

First, bucket the surveys and assign a value of 1 to Promoters, 0 to Passives, and -1 to Detractors. This weighting can seem odd, but treating Detractors and Promoters as equal and opposite is inherent to the metric. We can now calculate NPS on the fly like so:

with nps_segments as (
  select
    case
      when review = 5 then 1
      when review = 4 then 0
      when review in (1,2,3) then -1
      else null
    end as np_score
  from
    hotel_reviews)

-- Here we use a float conversion to avoid integer division.
-- This is Redshift specific. Adjust accordingly.
select
  round(sum(np_score) / count(*)::float * 100, 0) as nps
from
  nps_segments

We can now see our hotel’s NPS is a respectable 48. NPS should be compared carefully across verticals, but a quick look at the NPS of some common companies can provide some frame of reference. Apple has an NPS of 72 and Comcast an NPS of -3. Those are reasonable boundaries for our expectations.
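As an aside, since the score is just the average of those 1 / 0 / -1 values scaled by 100, the whole calculation can be written in a single statement. A minimal sketch, assuming the same hotel_reviews table:

-- One-step equivalent: NPS is the mean segment score scaled to -100..100.
-- The ::float cast avoids integer math in the average.
select
  round(avg(case
              when review = 5 then 1
              when review = 4 then 0
              else -1
            end::float) * 100, 0) as nps
from
  hotel_reviews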

If we add a date column to our query we can examine our scores over time. We have also added a count column so we can see how many survey responses we have been collecting. As you can see, NPS appears to be trending slightly upwards, although 2005 and 2006 were rough years!

with nps_segments as (
  select
    date
    , case
        when review = 5 then 1
        when review = 4 then 0
        when review in (1,2,3) then -1
        else 0
      end as np_score
  from
    hotel_reviews)

-- year() is not available in Redshift; extract the year instead.
select
  extract(year from date) as dte
  , count(*) as surveys
  , round(sum(np_score) / count(*)::float * 100, 2) as nps
from
  nps_segments
group by 1
order by 1

A similar query can show you how your relative proportions of Promoters and Detractors are changing over time. NPS changes as people change segments. You can improve NPS by moving your customers from Detractors to Passives, or Passives to Promoters. Both have value and understanding the mix-shift can help craft a complete NPS strategy.
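Here’s a sketch of that mix-shift query, assuming the same hotel_reviews table; each segment’s share is its percentage of that year’s responses:

with segments as (
  select
    date
    , case
        when review = 5 then 'Promoter'
        when review = 4 then 'Passive'
        else 'Detractor'
      end as np_segment
  from
    hotel_reviews)

select
  extract(year from date) as dte
  , np_segment
  , count(*) as surveys
  -- Each segment's share of that year's responses.
  , round(count(*)::float
      / sum(count(*)) over (partition by extract(year from date))
      * 100, 1) as pct_of_responses
from
  segments
group by 1, 2
order by 1, 2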

We know that our score is reasonably good, and that it seems to be increasing. Everyone has patted themselves on the back, but now it’s time to focus on next steps. Luckily, Hotel California has asked customers why they gave the scores they did.

Text Analysis

Now that we have a grasp on our NPS numbers, it’s time to examine the wealth of text data we have collected. This is much more challenging. It requires more data preparation and more human interpretation. Survey response volume also matters - the more survey responses the better.

First, it’s important to clean your text responses. For our purposes this means:

  1. Making the text lower case. This will reduce work and computation later on by allowing simpler joins across tables.

  2. Removing punctuation and numbers. These characters provide context for humans, but in our analysis they cause mismatches. For instance, we want ‘food’ and ‘food!’ to be treated in the same fashion. This will allow us to produce accurate counts of word frequency in our next step.

  3. Correcting spacing, so that multiple consecutive spaces become a single space. We will use spaces to count words below. Wait for it... it’s clever!

We used Redshift’s regex functionality to achieve each of these goals. The inner regexp_replace makes the verbatim lower case and replaces every character that isn’t a letter or a space with a space. The outer one reduces runs of consecutive spaces to a single space. If you would like a refresher on regex syntax, we have a blog post to help you. The code used for this exercise was:

regexp_replace(
  regexp_replace(lower(verbatim), '[^a-zA-Z ]', ' '),
  '( ){1,}', ' ')

The iterative process of refining your cleanup is made much easier by SQL Snippets in Periscope Data, which let you maintain one standardized piece of code and improve it over time.

I put this into a SQL snippet called [verbatim_clean], so I could reuse it quickly and easily. The original verbatim is easier for us to read, but the cleaned version is easier to work with in SQL.
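To see the pre- and post-cleanup difference, you can run the expression against a literal string. A quick example with a made-up review:

-- A made-up verbatim, run through the cleanup expression.
select regexp_replace(
         regexp_replace(lower('Great food!! The staff were 10/10.'), '[^a-zA-Z ]', ' '),
         '( ){1,}', ' ') as verbatim_clean
-- Returns 'great food the staff were ' (note the trailing space).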

The first thing you can do is add the ability to filter based on words. Filters can be applied to the clean verbatim, and the original verbatim can be displayed. In the incredibly flexible Periscope Data Filters, this looks like:

-- This is a Snippet inside a Filter.
-- Look how nicely the snippet works.
-- You can even add parameters.
select
  date
  , verbatim
  , review
from hotel_reviews
where [[verbatim_clean]=Words]

This will be done differently in different visualization systems, but the core principle remains the same. By filtering based on a cleaned verbatim set, you can return comprehensive results to the user.
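If you’re not using Periscope’s filter syntax, a minimal plain-SQL sketch of the same idea, with a single hard-coded search word (‘food’) standing in for the filter, might look like:

-- Filter on the cleaned verbatim, display the original.
-- Padding with spaces makes the LIKE match whole words only.
select
  date
  , verbatim
  , review
from
  hotel_reviews
where ' ' || regexp_replace(
        regexp_replace(lower(verbatim), '[^a-zA-Z ]', ' '),
        '( ){1,}', ' ') || ' ' like '% food %'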

Once you have these cleaned verbatims, you can break them down into an even more interesting table, with one row per word per verbatim. We can now perform operations based on individual words rather than on sentences. This can help create powerful visualizations and can even be used for semantic analysis and categorization, topics we plan to cover in a future post.

The resulting table has one row per word: the verbatim ‘great staff’, for example, becomes two rows, one with the word ‘great’ and one with the word ‘staff’, each keeping the original name, date, and review score.

The SQL to perform this operation looks like so, and will be further explained below:

-- Here we use the length function to count the number of words in each
-- text response. After our regex cleanup step, each text response has a
-- word count equal to its number of spaces plus one. For example,
-- 'live long and prosper' has 3 spaces and 4 words.

with word_count as (
  select
    name
    , date
    , verbatim
    , review
    , [verbatim_clean] as verbatim_clean
    , length([verbatim_clean]) -
        length(regexp_replace([verbatim_clean], '( ){1,}', '')) + 1
        as num_words
  from
    hotel_reviews
)

-- By using our num_words column in a range join, we can split_part
-- each text response into its individual words.
, word_parse as (
  select
    name
    , date
    , review
    , split_part(verbatim_clean, ' ', nmbr::integer) as word
  from
    word_count
    join numbers on
      nmbr <= num_words and nmbr <> 0
  order by 1, 2
)

-- The stop words table is important! See point 3 below. We also drop
-- any empty strings left behind by leading or trailing spaces.
select * from word_parse
where word not in (select * from stopwords)
  and word <> ''

You’ll note a few interesting things about this query.

  1. In word_count we count the number of words in each text response. We subtract the length of the cleaned verbatim with its spaces removed from the length of the cleaned verbatim with spaces, then add one. This gives us the number of words in the sentence. For this to work, it is important to use regex to collapse all whitespace down to single spaces, as double spaces would lead to inaccurate word counts.

  2. We join to a table called ‘numbers’. This is a sequence of all integers from 1 to 1 million, useful in cases like this: it lets us split each verbatim into as many rows as it has words. Make sure this table is properly indexed, or sub-select a much smaller set of numbers! If you don’t have a sequence table, see the sketch after this list.

  3. In the final part of the query I remove every word defined as a ‘stop word’. A stop word is any word that is frequent but provides little value to semantic analysis - they are essentially grammatical filler. A list can be found here: http://www.lextek.com/manuals/onix/stopwords1.html
    This list should be kept up-to-date with words related to your own business. For instance, your company name is a stop word: it will appear in every verbatim but provides no information.
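If your warehouse doesn’t already have a sequence table, here is one way to build a small one. This sketch simply numbers the rows of any sufficiently large existing table (we reuse hotel_reviews):

-- Build a small sequence table by numbering rows of an existing table.
-- 1,000 numbers is plenty here; verbatims are far shorter than that.
create table numbers as
select
  row_number() over (order by 1) as nmbr
from
  hotel_reviews
limit 1000;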

Now that you have all your text broken into individual words, you can create more interesting visualizations. For example, the following query ranks words by frequency of appearance among people who gave you different scores:

-- hotel_words is the one-row-per-word table we built above, saved off.
with word_count as (
  select
    review
    , word
    , count(*) as cnt
  from
    hotel_words
  group by
    1, 2)

, ranked_words as (
  select
    *
    , rank() over(partition by review order by cnt desc)
        as word_rank
  from
    word_count
)

-- I define pain and delight points below to limit the number of
-- lines returned. I am only interested in the top ten words
-- among the 1s and the 5s to begin. I will dig in more later.
select
  *
from
  ranked_words
  left join (
    select
      word
      , min(word_rank) as min_rank
      , max(
          case when review in (1,5) and word_rank <= 10
            then 1
            else 0
          end) as pain_and_delight_points
    from
      ranked_words
    group by 1
  ) important_words on ranked_words.word = important_words.word
where pain_and_delight_points = 1

Keeping in mind that a low rank means a high frequency of occurrence, this chart can quickly surface insights and tell you where to focus attention. The chart plots frequency rank on the Y axis against review score on the X axis. We can see that ‘food’ is highly ranked among low scores, whereas ‘staff’ is the opposite. The staff are doing great, but it’s time to begin working on the kitchen!

Wrap Up

In a few minutes we calculated Hotel California’s Net Promoter Score, examined their history, and began to explore some of the deeper reasons why they aren’t beating the Ritz. This is valuable insight, and will help drive the goals of potential Circle-back Projects, Rewards Programs, Tactical Churn Groups, or Voice of Consumer Product Development Task Forces, some of which are certain to follow.

Bring this together and you will have an awesome, filterable dashboard to explore your survey data. Have fun, and Happy Periscoping!
