Monthly Archives: December 2014

Three new things about enzymes

Digital StillCamera
       From nerovivo on flickr, CC Share-alike license.

Enzymes are nature’s best catalysts. The how and why of it are the central questions of enzymology. Over the decades, enzymologists have come a long way in understanding and even designing enzyme function. But three new results are highlighting previously unappreciated complexities of enzyme function, complexities that are often neglected by enzyme designers.
Continue reading

Assorted links

  1. Fish at more than 8000 meters deep.
  2. The difficulties of assessing “normal” and abnormal snowpack extent in the Western US. (Nice bonus: disagreement among climate scientists is highlighted without trotting out the D-word.)
  3. Cool paper on using hyperpolarization to get quantitative NMRs at analyte concentrations of ~1μM.
  4. Headlines predicted by no one:Ibuprofen blocks tryptophan importers, extends life.

Assorted links from last week

  1. Synthetic biology is the 2014 Breakthrough of the Year. Here is the original Science writeup from May.
  2. The war over bisphenol A continues.
  3. Syngas fermentation company LanzaTech adds $60 million more to its coffers. Chicagoland is a growing hub for syngas fermentation with LanzaTech, Coskata, and INEOS Bio all in the area.
  4. Solidia, which consumes CO2 to manufacture concrete, trumpets its newest successful fundraise.

Villain scientists

Dr. Evil: A Science Villain

All scientists have heroes. Some of our heroes are internationally renowned, like Galileo, Darwin, or Curie. Other heroes are personal, whether they be an inspiring high school teacher, an always-helpful senior colleague, or a wise, generous professor or boss. But what about villains? Do scientists have villains too?
Continue reading

Microbial knockout

For decades the knockout has been an indispensable tool for geneticists and molecular biologists. Removing something — and for geneticists, the something is usually a gene or the protein synthesized from it — and then watching for any resulting effects is a time-honored approach to biology, used on organisms from E. coli to mice.

Extension of the knockout idea to ecosystems instead of organisms is conceptually straightforward: instead of removing a gene from the organism, you remove an organism from the ecosystem. And then see what happens. Unfortunately for science, the the most interesting ecosystems are large-scale, complex, and can’t be replicated in a laboratory. While theoretically one could go out into the wild and remove particular species from real ecosystems just to see what happens, in practice that would be (a) technically difficult and (b) a horribly unethical for many reasons. Lucky for us, science has restrained itself from adopting this approach.

But the gedanken version of the experiment is always available, and that’s the route Jack Gilbert and Josh Neufeld take:

What if all prokaryotic microorganisms on Earth disappeared suddenly? If someone were to wave an antimicrobial wand and eliminate all bacterial and archaeal life on the planet, what would happen? The usual rhetoric is that life as we know it would end, human societies would collapse, and eukaryotic life would cease to exist. Is all of this true?

[…] Plants require fixed nitrogen, and bacteria play an essential biological role in the fixation process. […] Without help from humans, most global photosynthesis would cease within a year. Humans could potentially increase synthetic fertilizer production via the Haber-Bosch process and initiate a massive global fertilization scheme, alleviating some of the enormous losses of life. Such human intervention would be facilitated, to some extent, by the absence of bacterial denitrification and anaerobic ammonia oxidation, which would otherwise deplete fixed nitrogen. Ultimately, nitrogen would begin to accumulate in the global oceans. One possibility is that life would distribute along the coasts, where N-rich fish could be harvested and fixed nitrogen scavenged from seawater when atmospheric nitrogen depletion, due to the Haber-Bosch process, exhausted atmospheric reserves.

Unfortunately, the inevitable increase in atmospheric CO2 concentration due to animal respiration and human fossil fuel use would lead to rapid global warming through the greenhouse effect. Lipson points out that the process would require hundreds of years to eliminate life on the planet—ample time to find a carbon capture solution? In this way, some degree of agricultural food production and marine photosynthesis could continue indefinitely, supporting a subset of humans. Nonetheless, the world’s oceans and soils would likely begin a process of stagnation due to the myriad absent contributions to global biogeochemistry.

…and that’s only the scenario of intermediate horror. The rest gets worse.

Hydrology is more complicated than it seems

Time for another installment of Hydrology is More Complicated Than it Seems.

  1. California’s estimated rainfall in last 10 days: 10 trillion gallons
  2. California needs 11 trillion gallons of water to end its drought, and it isn’t going to get it.

properties of Google two-factor authentication codes

Google sends me text messages all the time. It’s because (i) I set cookies to clear every time I close my browser set session cookies and (ii) I use Google’s two-factor authentication (2FA) when I log into my email account. So every time I open a browser to load my Gmail, Google has to send me an SMS-encoded text message. Usually I get two or three messages a day, each telling me something like “Your Google verification code is 853438”.

Here’s the thing: after several months of staring at those messages and dutifully typing each new six-digit code to complete my login, I came to notice certain patterns in the numbers. Or at least I thought I did. The six-digit numbers seemed to contain repeated digits far more often than I’d expected. For example, the “853438” code above has a repeated “8” in the first and sixth digits, and a repeated “3” in the third and fifth digits. Coincidence? Maybe, I thought, but maybe Google generates 2FA codes with lots of repeated digits for mnemonic convenience. I sought out to quantify the frequency of these repeats using R, convinced I was going to cleverly reveal that Google’s algorithm for generating two-factor authentication (2FA) codes was not completely random.

For this exercise, I used 36 SMS messages (I used to have hundreds but accidentally forgot about this project when reformatting my phone a few weeks ago — oops). Step one was converting the SMS data to .csv for import into R. A very simple, free Android app creatively named “SMS to Text” did just that. I manually removed the non-Google related SMS messages and imported the remaining data to R.

dat <- read.table('sms_20141205_google2fa.csv', sep = ',', header = T)

dat <- dat[,c('Date', 'Message')]

Regular expressions make it easy to extract the 6-digit numbers whose properties interested me. And the stringr library makes everything involving strings in R better. One small trick was the use of the positive lookbehind regular expression in R. These operators are only supported when using perl-like regex parsing. So the regular expression needs to be wrapped in the perl() function.

dat$code <- str_extract(dat$Message,
perl('(?<=Google verification code is )[0-9]*')

The 2FA codes are now a character column in the dat data frame. I want to keep only unique codes (sometimes Google sends you multiple SMSs within a very short time window that contain the same 2FA code, and these repeats should not be counted and independent data points). It’s easy with duplicated().

dat <- dat[!duplicated(dat$code),]

My strategy for counting repeated digits is to use the colsplit() function from the reshape2 library to split each character to its own column. The function numRepeats() below acts on a dataframe and counts unique entries in each row by substracting the length of the unique() entries from each from from the row length. Overall this approach is not very good : splitting columns using '' results in an extra NA column at the beginning of each string that I am forced to manually remove with vector indexing (the [,2:7] in the code below), and it’s very slow for large datasets, as we’ll see below. There must be a better way (please set me straight in the comments…)

numRepeats <- function(x, N)
apply(x, 1, function(y){N - length(unique(y))})

dat$digits <- colsplit(dat$code, '',
names = c('NA', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6'))[,2:7]
dat$numRepeats <- factor(numRepeats(dat$digits, 6), levels = 0:(6-1))

I store the number of repeated digits as a factor to make the default plot labels look easier (see below) but storing as an integer vector would also be fine, probably better for many applications.

We now know the distribution of repeated digits in 2FA codes from Google but we need something to compare it to. For 6-digit decimals, there are only one million possible codes, so enumerating them all is feasible. This is the best possible comparison. Those are easily enumerated in R with the formatC() function.

all6mer <- formatC(0:999999, width = 6, format = 'd', flag = '0')
allDF <- colsplit(all6mer, pattern = '', names = c('NA', paste('d', 1:6, sep = '')))[2:7]
allDF$numRepeats <- factor(numRepeats(allDF, 6), levels = 0:(6-1))

This is where the slowness of the colsplit() approach for counting gets annoying — it took more than a few minutes on my laptop. But it did work. With counts for repeated digits from both 2FA codes and for comparison from all 6-digit decimal numbers, we can now plot the distribution of repeats in both datasets.

allDF$type <- 'all 6-digit numbers 000000 - 999999'
dat$type <- 'Google 2FA codes'
plotdf <- data.frame(numRepeats = as.factor(c(allDF$numRepeats, dat$numRepeats)-1),
type = c(allDF$type, dat$type)

quartz(height = 4, width = 6)
ggplot(data = plotdf, aes(x = numRepeats, y = ..count../sum(..count..),
group = type, fill = type, color = type, width = type)) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==1,],
position = 'identity', alpha = 0.5, width = 0.8) +
geom_bar(data = plotdf[as.numeric(plotdf$type)==2,], position = 'identity',
alpha = 0.5, width = 0.5) +
theme_bw() +
ylab('Frequency') +
xlab('Number of repeated digits') +
scale_fill_manual(values = c('pink', 'light blue'), name = '') +
scale_color_manual(values = c('red', 'blue'), name = '') +
theme(legend.position = c(0.7, 0.85))'histogram_of_repeats_Google2FA_vs_random.pdf', type = 'pdf')

The main challenge in making this plot was (i) clearly overlaying the frequency distribution for each dataset to allow easy visual comparison while (ii) still being able to rely on the nice automatically generated legends created by ggplot(). Overlay instead of something with facet_grid() was something that made much more aesthetic sense to me, but opinions likely vary on that point.

Some approaches for the graph that didn’t work:

  • using y = ..density.. in the ggplot calls. I wanted to have different bar widths for each data set for better visualization. But ..density.. is affected by the width parameter in bar charts because it really isn’t designed for integer or categorical data like these. A brief discussion from 2010 explains a bit more.
  • using multiple layers to overlay entirely separate plots of each data frame separately. This works except it makes the legend very difficult to generate automatically. My solution above sets the aesthetic for the entire combined dataset plotdf which allows for easy legend generation while still allowing for different bar widths and frequency counts for each dataset.
  • a single geom_bar() call on the whole dataset, i.e. not broken down by the plotdf[as.numeric(plotdf$type)==x subsetting I use above. The problem here is that ..count.. operates on the entire data frame. I have only 33 unique 2FA codes but one million 6-digit numbers. For example, the frequency of 0 repeats in the 33 unique 2FA codes is 10/33, but with only a single geom_bar() call the frequency is computed as 10 / 1,000,033 and becomes invisibly small on the chart. This issue is frequently raised on stackoverflow, for example see here.

Anyway, the code above circumvents all those issues and generates the (to my mind) nice plot below.
The question is: are these distributions really different? This is easy to test with the dgof library, designed for testing goodness-of-fit between empirical data and theoretical distributions with discrete support.

ks.htest <- ks.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))))

cvm.htest <- cvm.test(x = as.numeric(as.character(dat$numRepeats)),
y = ecdf(as.numeric(as.character(allDF$numRepeats))),
type = 'W2')

That results in:

> ks.htest

One-sample Kolmogorov-Smirnov test

data: as.numeric(as.character(dat$numRepeats))
D = 0.1518, p-value = 0.4322
alternative hypothesis: two-sided

> cvm.htest

Cramer-von Mises - W2

data: as.numeric(as.character(dat$numRepeats))
W2 = 0.4324, p-value = 0.05355
alternative hypothesis: Two.sided

The results — both simple observation of the graph and of the Kolmogorov-Smirnov and Cramer-von Mises tests suggest that the distributions *are* different, but certainly not in the way I’d expected: Google 2FA codes have fewer repeats than random 6-digit decimal numbers! Just further proof that subjective impressions can be dangerous. The p-values aren’t terribly low, however, primarily because I don’t have a large sample of 2FA codes here. So I’m not declaring significance yet. Maybe I’ll have to revisit this issue when I collect a few hundred 2FA codes.