properties of Google two-factor authentication codes

Google sends me text messages all the time. It’s because (i) I set cookies to clear every time I close my browser set session cookies and (ii) I use Google’s two-factor authentication (2FA) when I log into my email account. So every time I open a browser to load my Gmail, Google has to send me an SMS-encoded text message. Usually I get two or three messages a day, each telling me something like “Your Google verification code is 853438”.

Here’s the thing: after several months of staring at those messages and dutifully typing each new six-digit code to complete my login, I came to notice certain patterns in the numbers. Or at least I thought I did. The six-digit numbers seemed to contain repeated digits far more often than I’d expected. For example, the “853438” code above has a repeated “8” in the first and sixth digits, and a repeated “3” in the third and fifth digits. Coincidence? Maybe, I thought, but maybe Google generates 2FA codes with lots of repeated digits for mnemonic convenience. I sought out to quantify the frequency of these repeats using R, convinced I was going to cleverly reveal that Google’s algorithm for generating two-factor authentication (2FA) codes was not completely random.

For this exercise, I used 36 SMS messages (I used to have hundreds but accidentally forgot about this project when reformatting my phone a few weeks ago — oops). Step one was converting the SMS data to .csv for import into R. A very simple, free Android app creatively named “SMS to Text” did just that. I manually removed the non-Google related SMS messages and imported the remaining data to R.

dat <- read.table('sms_20141205_google2fa.csv', sep = ',', header = T)

dat <- dat[,c('Date', 'Message')]

Regular expressions make it easy to extract the 6-digit numbers whose properties interested me. And the stringr library makes everything involving strings in R better. One small trick was the use of the positive lookbehind regular expression in R. These operators are only supported when using perl-like regex parsing. So the regular expression needs to be wrapped in the perl() function.

library(stringr)
dat$code <- str_extract(dat$Message,
perl('(?<=Google verification code is )[0-9]*')
)

The 2FA codes are now a character column in the dat data frame. I want to keep only unique codes (sometimes Google sends you multiple SMSs within a very short time window that contain the same 2FA code, and these repeats should not be counted and independent data points). It’s easy with duplicated().

dat <- dat[!duplicated(dat$code),]

My strategy for counting repeated digits is to use the colsplit() function from the reshape2 library to split each character to its own column. The function numRepeats() below acts on a dataframe and counts unique entries in each row by substracting the length of the unique() entries from each from from the row length. Overall this approach is not very good : splitting columns using '' results in an extra NA column at the beginning of each string that I am forced to manually remove with vector indexing (the [,2:7] in the code below), and it’s very slow for large datasets, as we’ll see below. There must be a better way (please set me straight in the comments…)

numRepeats <- function(x, N)
{
apply(x, 1, function(y){N - length(unique(y))})
}

dat$digits <- colsplit(dat$code, '',
names = c('NA', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6'))[,2:7]
dat$numRepeats <- factor(numRepeats(dat$digits, 6), levels = 0:(6-1))

I store the number of repeated digits as a factor to make the default plot labels look easier (see below) but storing as an integer vector would also be fine, probably better for many applications.

We now know the distribution of repeated digits in 2FA codes from Google but we need something to compare it to. For 6-digit decimals, there are only one million possible codes, so enumerating them all is feasible. This is the best possible comparison. Those are easily enumerated in R with the formatC() function.

all6mer <- formatC(0:999999, width = 6, format = 'd', flag = '0')
allDF <- colsplit(all6mer, pattern = '', names = c('NA', paste('d', 1:6, sep = '')))[2:7]
allDF$numRepeats <- factor(numRepeats(allDF, 6), levels = 0:(6-1))

This is where the slowness of the colsplit() approach for counting gets annoying — it took more than a few minutes on my laptop. But it did work. With counts for repeated digits from both 2FA codes and for comparison from all 6-digit decimal numbers, we can now plot the distribution of repeats in both datasets.

allDF$type <- 'all 6-digit numbers 000000 - 999999'
dat$type <- 'Google 2FA codes'
plotdf <- data.frame(numRepeats = as.factor(c(allDF$numRepeats, dat$numRepeats)-1),
type = c(allDF$type, dat$type)
)

quartz(height = 4, width = 6)
ggplot(data = plotdf, aes(x = numRepeats, y = ..count../sum(..count..),
group = type, fill = type, color = type, width = type)) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==1,],
position = 'identity', alpha = 0.5, width = 0.8) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==2,], position = 'identity',
alpha = 0.5, width = 0.5) +
theme_bw() +
ylab('Frequency') +
xlab('Number of repeated digits') +
scale_fill_manual(values = c('pink', 'light blue'), name = '') +
scale_color_manual(values = c('red', 'blue'), name = '') +
theme(legend.position = c(0.7, 0.85))
quartz.save('histogram_of_repeats_Google2FA_vs_random.pdf', type = 'pdf')

The main challenge in making this plot was (i) clearly overlaying the frequency distribution for each dataset to allow easy visual comparison while (ii) still being able to rely on the nice automatically generated legends created by ggplot(). Overlay instead of something with facet_grid() was something that made much more aesthetic sense to me, but opinions likely vary on that point.

Some approaches for the graph that didn’t work:

  • using y = ..density.. in the ggplot calls. I wanted to have different bar widths for each data set for better visualization. But ..density.. is affected by the width parameter in bar charts because it really isn’t designed for integer or categorical data like these. A brief discussion from 2010 explains a bit more.
  • using multiple layers to overlay entirely separate plots of each data frame separately. This works except it makes the legend very difficult to generate automatically. My solution above sets the aesthetic for the entire combined dataset plotdf which allows for easy legend generation while still allowing for different bar widths and frequency counts for each dataset.
  • a single geom_bar() call on the whole dataset, i.e. not broken down by the plotdf[as.numeric(plotdf$type)==x subsetting I use above. The problem here is that ..count.. operates on the entire data frame. I have only 33 unique 2FA codes but one million 6-digit numbers. For example, the frequency of 0 repeats in the 33 unique 2FA codes is 10/33, but with only a single geom_bar() call the frequency is computed as 10 / 1,000,033 and becomes invisibly small on the chart. This issue is frequently raised on stackoverflow, for example see here.

Anyway, the code above circumvents all those issues and generates the (to my mind) nice plot below.
histogram_of_repeats_Google2FA_vs_random
The question is: are these distributions really different? This is easy to test with the dgof library, designed for testing goodness-of-fit between empirical data and theoretical distributions with discrete support.

library(dgof)
ks.htest <- ks.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))))

cvm.htest <- cvm.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))),
type = 'W2')

That results in:

> ks.htest

One-sample Kolmogorov-Smirnov test

data: as.numeric(as.character(dat$numRepeats))
D = 0.1518, p-value = 0.4322
alternative hypothesis: two-sided

> cvm.htest

Cramer-von Mises - W2

data: as.numeric(as.character(dat$numRepeats))
W2 = 0.4324, p-value = 0.05355
alternative hypothesis: Two.sided

The results — both simple observation of the graph and of the Kolmogorov-Smirnov and Cramer-von Mises tests suggest that the distributions *are* different, but certainly not in the way I’d expected: Google 2FA codes have fewer repeats than random 6-digit decimal numbers! Just further proof that subjective impressions can be dangerous. The p-values aren’t terribly low, however, primarily because I don’t have a large sample of 2FA codes here. So I’m not declaring significance yet. Maybe I’ll have to revisit this issue when I collect a few hundred 2FA codes.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s