Tag Archives: R-blogging

CODATA values for physical constants in [R]

The R language (i) is widely used by physical, chemical, and biological researchers and (ii) comes with lots of interesting and easily loaded data sets. Unfortunately and surprisingly, a data set of commonly encountered physical constants is not one of them. If you’re writing R code to transform data based on physical equations, chances are these equations will involve physical constants such as Planck’s constant or the Newtonian gravitational constant.

Examples of data bundled in R include mtcars and other toy data that pop up in examples and tutorials everywhere. They can be loaded via commands like data(mtcars). It seems natural to have a data set for physical constants that could be loaded similarly, but as far as I know there isn’t one. And it isn’t just me that’s been wondering; the issue has come up on StackOverflow more than three years ago.

The best source for values of physical constants in NIST, whose Committee on Data for Science & Technology (CODATA) routinely aggregates data from experiments which measure a constant and adjusts all the aggregated values so the overall data set is self consistent. As of today it seems like the most recent update was from 2010. NIST data for the physical constants is browsable on their web page; a flat ASCII file is here.

Here’s some R code for transforming this ASCII table into an R data frame.

#parsing NIST CODATA data for physical constants 
require(stringr)	#simplifies regex usage syntax in R

#web page with data
link <- 'http://physics.nist.gov/cuu/Constants/Table/allascii.txt'

allstr <- readLines('link')
codata.str   <- allstr[11:345]  #data starts at line 11

#at least 3 spaces separate columns
codata.str2  <- str_replace_all(codata.str, '[ ]{3,100}', '\t') 

#four cols in source table
codata.mat 	 <- str_split_fixed(codata.str2, '\t', n=4) 

#eliminate spaces separating every three decimal digits 
codata.mat[,c(2,3)] <- str_replace_all(codata.mat[,c(2,3)], " ", "") 

codata <- data.frame(quantity 	 = codata.mat[,1],
		     value	 = as.numeric(codata.mat[,2]),
		     uncertainty = as.numeric(codata.mat[,3]),
		     units 	 = codata.mat[,4]

And that’s it. Now it’s easy to search — in R — the data to find the constants you want, and assign constants that you will use to variable names, without having to manually type them.

> codata[str_detect(codata$quantity, 'electron mass'),]
                                  quantity        value uncertainty units
2       alpha particle-electron mass ratio 7.294300e+03     2.9e-06      
63            deuteron-electron mass ratio 3.670483e+03     1.5e-06      
89                           electron mass 9.109383e-31     4.0e-38    kg
90         electron mass energy equivalent 8.187105e-14     3.6e-21     J
91  electron mass energy equivalent in MeV 5.109989e-01     1.1e-08   MeV
92                      electron mass in u 5.485799e-04     2.2e-13     u
130             helion-electron mass ratio 5.495885e+03     5.0e-06      
195               muon-electron mass ratio 2.067683e+02     5.2e-06      
223            neutron-electron mass ratio 1.838684e+03     1.1e-06      
264             proton-electron mass ratio 1.836153e+03     7.5e-07      
310                tau-electron mass ratio 3.477150e+03     3.1e-01      
320             triton-electron mass ratio 5.496922e+03     5.0e-06 

> me.u < codata[str_detect(codata$quantity, 'electron mass in u'),'value']
> me.u
[1] 0.0005485799

I like the dataframe format because it’s what I (and I think many other R users) am used to, but there are other useful solutions at StackOverflow.

I wonder how to get the core R development team to include this listing or a similar one in base R for easy loading via a command like data(codata)


properties of Google two-factor authentication codes

Google sends me text messages all the time. It’s because (i) I set cookies to clear every time I close my browser set session cookies and (ii) I use Google’s two-factor authentication (2FA) when I log into my email account. So every time I open a browser to load my Gmail, Google has to send me an SMS-encoded text message. Usually I get two or three messages a day, each telling me something like “Your Google verification code is 853438”.

Here’s the thing: after several months of staring at those messages and dutifully typing each new six-digit code to complete my login, I came to notice certain patterns in the numbers. Or at least I thought I did. The six-digit numbers seemed to contain repeated digits far more often than I’d expected. For example, the “853438” code above has a repeated “8” in the first and sixth digits, and a repeated “3” in the third and fifth digits. Coincidence? Maybe, I thought, but maybe Google generates 2FA codes with lots of repeated digits for mnemonic convenience. I sought out to quantify the frequency of these repeats using R, convinced I was going to cleverly reveal that Google’s algorithm for generating two-factor authentication (2FA) codes was not completely random.

For this exercise, I used 36 SMS messages (I used to have hundreds but accidentally forgot about this project when reformatting my phone a few weeks ago — oops). Step one was converting the SMS data to .csv for import into R. A very simple, free Android app creatively named “SMS to Text” did just that. I manually removed the non-Google related SMS messages and imported the remaining data to R.

dat <- read.table('sms_20141205_google2fa.csv', sep = ',', header = T)

dat <- dat[,c('Date', 'Message')]

Regular expressions make it easy to extract the 6-digit numbers whose properties interested me. And the stringr library makes everything involving strings in R better. One small trick was the use of the positive lookbehind regular expression in R. These operators are only supported when using perl-like regex parsing. So the regular expression needs to be wrapped in the perl() function.

dat$code <- str_extract(dat$Message,
perl('(?<=Google verification code is )[0-9]*')

The 2FA codes are now a character column in the dat data frame. I want to keep only unique codes (sometimes Google sends you multiple SMSs within a very short time window that contain the same 2FA code, and these repeats should not be counted and independent data points). It’s easy with duplicated().

dat <- dat[!duplicated(dat$code),]

My strategy for counting repeated digits is to use the colsplit() function from the reshape2 library to split each character to its own column. The function numRepeats() below acts on a dataframe and counts unique entries in each row by substracting the length of the unique() entries from each from from the row length. Overall this approach is not very good : splitting columns using '' results in an extra NA column at the beginning of each string that I am forced to manually remove with vector indexing (the [,2:7] in the code below), and it’s very slow for large datasets, as we’ll see below. There must be a better way (please set me straight in the comments…)

numRepeats <- function(x, N)
apply(x, 1, function(y){N - length(unique(y))})

dat$digits <- colsplit(dat$code, '',
names = c('NA', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6'))[,2:7]
dat$numRepeats <- factor(numRepeats(dat$digits, 6), levels = 0:(6-1))

I store the number of repeated digits as a factor to make the default plot labels look easier (see below) but storing as an integer vector would also be fine, probably better for many applications.

We now know the distribution of repeated digits in 2FA codes from Google but we need something to compare it to. For 6-digit decimals, there are only one million possible codes, so enumerating them all is feasible. This is the best possible comparison. Those are easily enumerated in R with the formatC() function.

all6mer <- formatC(0:999999, width = 6, format = 'd', flag = '0')
allDF <- colsplit(all6mer, pattern = '', names = c('NA', paste('d', 1:6, sep = '')))[2:7]
allDF$numRepeats <- factor(numRepeats(allDF, 6), levels = 0:(6-1))

This is where the slowness of the colsplit() approach for counting gets annoying — it took more than a few minutes on my laptop. But it did work. With counts for repeated digits from both 2FA codes and for comparison from all 6-digit decimal numbers, we can now plot the distribution of repeats in both datasets.

allDF$type <- 'all 6-digit numbers 000000 - 999999'
dat$type <- 'Google 2FA codes'
plotdf <- data.frame(numRepeats = as.factor(c(allDF$numRepeats, dat$numRepeats)-1),
type = c(allDF$type, dat$type)

quartz(height = 4, width = 6)
ggplot(data = plotdf, aes(x = numRepeats, y = ..count../sum(..count..),
group = type, fill = type, color = type, width = type)) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==1,],
position = 'identity', alpha = 0.5, width = 0.8) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==2,], position = 'identity',
alpha = 0.5, width = 0.5) +
theme_bw() +
ylab('Frequency') +
xlab('Number of repeated digits') +
scale_fill_manual(values = c('pink', 'light blue'), name = '') +
scale_color_manual(values = c('red', 'blue'), name = '') +
theme(legend.position = c(0.7, 0.85))
quartz.save('histogram_of_repeats_Google2FA_vs_random.pdf', type = 'pdf')

The main challenge in making this plot was (i) clearly overlaying the frequency distribution for each dataset to allow easy visual comparison while (ii) still being able to rely on the nice automatically generated legends created by ggplot(). Overlay instead of something with facet_grid() was something that made much more aesthetic sense to me, but opinions likely vary on that point.

Some approaches for the graph that didn’t work:

  • using y = ..density.. in the ggplot calls. I wanted to have different bar widths for each data set for better visualization. But ..density.. is affected by the width parameter in bar charts because it really isn’t designed for integer or categorical data like these. A brief discussion from 2010 explains a bit more.
  • using multiple layers to overlay entirely separate plots of each data frame separately. This works except it makes the legend very difficult to generate automatically. My solution above sets the aesthetic for the entire combined dataset plotdf which allows for easy legend generation while still allowing for different bar widths and frequency counts for each dataset.
  • a single geom_bar() call on the whole dataset, i.e. not broken down by the plotdf[as.numeric(plotdf$type)==x subsetting I use above. The problem here is that ..count.. operates on the entire data frame. I have only 33 unique 2FA codes but one million 6-digit numbers. For example, the frequency of 0 repeats in the 33 unique 2FA codes is 10/33, but with only a single geom_bar() call the frequency is computed as 10 / 1,000,033 and becomes invisibly small on the chart. This issue is frequently raised on stackoverflow, for example see here.

Anyway, the code above circumvents all those issues and generates the (to my mind) nice plot below.
The question is: are these distributions really different? This is easy to test with the dgof library, designed for testing goodness-of-fit between empirical data and theoretical distributions with discrete support.

ks.htest <- ks.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))))

cvm.htest <- cvm.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))),
type = 'W2')

That results in:

> ks.htest

One-sample Kolmogorov-Smirnov test

data: as.numeric(as.character(dat$numRepeats))
D = 0.1518, p-value = 0.4322
alternative hypothesis: two-sided

> cvm.htest

Cramer-von Mises - W2

data: as.numeric(as.character(dat$numRepeats))
W2 = 0.4324, p-value = 0.05355
alternative hypothesis: Two.sided

The results — both simple observation of the graph and of the Kolmogorov-Smirnov and Cramer-von Mises tests suggest that the distributions *are* different, but certainly not in the way I’d expected: Google 2FA codes have fewer repeats than random 6-digit decimal numbers! Just further proof that subjective impressions can be dangerous. The p-values aren’t terribly low, however, primarily because I don’t have a large sample of 2FA codes here. So I’m not declaring significance yet. Maybe I’ll have to revisit this issue when I collect a few hundred 2FA codes.