I Googled Myself (Part 2)

In my last post, I set up an A/B test through Google Optimize and learned Google Tag Manager (GTM), Google Analytics (GA), and Google Data Studio (GDS) along the way. When I was done, I wanted to learn how to integrate Enhanced E-commerce and AdWords into my mock site, so I set that as my next little project.

As the name suggests, Enhanced E-commerce works best with an e-commerce site—which I don’t quite have. Fortunately, I was able to find a bunch of mock e-commerce source code repositories on GitHub that I could use to bootstrap my own. After some false starts, I found one that worked well for my purposes, based on this repository that builds a mock e-commerce site using the “MEAN” stack (MongoDB, Express.js, AngularJS, and Node.js).

Forking this repository gave me an opportunity to learn a bit more about modern front-end/back-end web technologies, which was probably overdue. It was also a chance to brush up on my JavaScript skills. Tackling this new material would have been much more difficult without WebStorm, the JavaScript IDE from the makers of my favorite Python IDE, PyCharm.

Properly implementing Enhanced E-commerce does require some back-end development—specifically, rendering static values on a page that can then be passed to GTM (and ultimately to GA) via the dataLayer. In the source code I inherited, this was done through the nunjucks templating library, which was well suited to the task.

Once again, I used Selenium to simulate traffic to the site. I wanted semi-realistic traffic to test the GA pipes, so I modeled consumer preferences on a beta distribution with \alpha = 2.03 and \beta = 4.67. That looks something like this:


The x value of the beta distribution is normally constrained to the (0,1) interval, but I multiplied it by the number of items in my store so that each draw maps to an item, simulating my customers’ preferences. So in the graph, the 6th item (according to an arbitrary indexing of the store items) is the most popular, while the 22nd and 23rd items are the least popular.

For the customer basket size, I drew from a Poisson distribution with \lambda = 3. That looks like this:


Although the two distributions look quite similar, they are actually rather different. For one thing, the Poisson distribution is discrete while the beta distribution is continuous—though I do end up dropping the decimal part when drawing samples from the beta distribution, since the items are also discrete. More importantly, the two distributions serve different purposes in the simulation: the x axis of the beta distribution represents an arbitrary item index, while for the Poisson distribution it represents the number of items in a customer’s basket.

So putting everything together, the simulation process goes like this: for every customer, we first draw from the Poisson distribution with \lambda = 3 to determine q, i.e. how many items that customer will purchase. Then we draw q times from the beta distribution to see which items the customer will buy. Finally, using Selenium, these items are added to the customer’s basket and the purchase is executed, sending the Enhanced E-commerce data to GA via GTM and the dataLayer.
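Sketched in Python, a single simulated customer’s basket might be drawn like this (the store size here is a made-up placeholder, and I assume every customer buys at least one item):

```python
import numpy as np

rng = np.random.default_rng(42)

N_ITEMS = 25               # hypothetical store size
ALPHA, BETA = 2.03, 4.67   # consumer-preference shape parameters
LAM = 3                    # mean basket size

def simulate_customer():
    """Draw one customer's basket: a Poisson basket size q, then q
    beta-distributed picks scaled up to item indices (decimals dropped)."""
    q = max(1, rng.poisson(LAM))           # assume at least one item per purchase
    picks = rng.beta(ALPHA, BETA, size=q)  # draws on (0, 1)
    return np.floor(picks * N_ITEMS).astype(int).tolist()

basket = simulate_customer()
```

Each index in `basket` then tells Selenium which product to add to the cart before checking out.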

When it came to implementing AdWords, my plan had been to bid on uber-obscure keywords that would be super cheap to bid on (think “idle giraffe” or “bellicose baby”), but unfortunately Google requires that your ad links be live, properly hosted websites. Since my website is running on localhost, AdWords wouldn’t let me create a campaign with my mock e-commerce website 😦

As a workaround, I created a mock search engine results page that my users would navigate to before going to my mock e-commerce site’s homepage. On that page, 20% of users would click on my ‘AdWords ad’ for hoody sweatshirts (that’s one of the things my store sells, BTW). The ad link was encoded with the same UTM parameters that Google AdWords would use, to make sure the ad click gets attributed to the correct source, medium, and campaign in GA. After imposing a 40% bounce probability on these users, the remaining ones buy a hoody.
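Encoding the ad link is just a matter of appending the standard GA campaign parameters to the URL; a minimal sketch (the base URL and campaign name are made up for illustration):

```python
from urllib.parse import urlencode

def tag_ad_link(base_url, source, medium, campaign):
    """Append UTM query parameters so GA attributes the click correctly."""
    params = urlencode({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    return f"{base_url}?{params}"

ad_link = tag_ad_link("http://localhost/index.html",
                      source="google", medium="cpc", campaign="hoody_promo")
```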

It seemed like I might as well use this project as another opportunity to work with GDS, so I went ahead and made another dashboard for my e-commerce website (live link):


If you notice that the big bar graph in the dashboard above looks a little like the beta distribution from before, that’s not an accident. Seeing the Hoody Promo Conv. Rate (implemented as a Goal in GA) hover around 60% was another sign things were working as expected.

In my second go-around with GDS, however, I did come up against a few more frustrating limitations. One thing I really wanted to do was create a scorecard element that would tell you the name of the most popular item in the store, but GDS won’t let you do that.

I also wanted to make a histogram, but that is also not supported in GDS. Using my own log data, I did manage to generate the histogram I wanted—of the average order value.


I’m pretty sure we’re seeing evidence of the Central Limit Theorem kicking in here. The CLT says that the distribution of sample means—even when drawn from a distribution that is not normal—will tend towards normality as the sample size gets larger.

A few things have me wondering here, however. In this simulation, the sample size is itself a random variable which is never that big. The rule of thumb says that 30 counts as a large sample size, but if you look at the Poisson graph above you’ll see the sample size rarely goes above 8. I’m wondering whether this is mitigated by a large number of samples (i.e. simulated users); the histogram above is based on 50,000 simulated users. Also, because average order values can never be negative, we can only have at best a truncated normal distribution, so unfortunately we cannot graphically verify the symmetry typical of the normal distribution in this case.
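As a quick sanity check on the CLT intuition (separate from the store simulation itself), you can watch the skewness of sample means from the same beta distribution shrink as the per-sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

def mean_skew(n, reps=50_000):
    """Skewness of the distribution of means of n beta(2.03, 4.67) draws."""
    draws = rng.beta(2.03, 4.67, size=(reps, n))
    return skewness(draws.mean(axis=1))

skew_1, skew_8, skew_30 = mean_skew(1), mean_skew(8), mean_skew(30)
```

Even at n = 8 (roughly the biggest basket sizes the Poisson produces), the skew is already a fraction of the raw distribution’s, which squares with the roughly bell-shaped histogram above.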

But anyway, that’s just me trying to inject a bit of probability/stats into an otherwise implementation-heavy analytics project. Next I might try to re-implement the mock e-commerce site through something like Shopify or WordPress. We’ll see.



I Googled Myself

As a huge enthusiast of A/B testing, I have been wanting to learn how to run A/B tests through Google Optimize for some time. However, it’s hard to do this without being familiar with all the different parts of the Google product eco-system. So I decided it was time to take the plunge and finally Google myself. This post will cover my adventures with several products in the Google product suite, including Google Analytics (GA), Google Tag Manager (GTM), Google Optimize (GO), and Google Data Studio (GDS).

Of course, in order to do A/B testing, you have to have A) something to test, and B) sufficient traffic to drive significant results. Early on I counted out trying to A/B test this blog—not because I don’t have sufficient traffic—I got tons of it, believe me . . . (said in my best Trump voice). The main reason I didn’t try to do it with my blog is that I don’t host it, WordPress does, so I can’t easily access or manipulate the source code to implement an A/B test. It’s much easier if I host the website myself (which I can do locally using MAMP).

But how do I send traffic to a website I’m hosting locally? By simulating it, of course. Using a nifty python library called Selenium, I can be as popular as I want! I can also simulate any kind of behavior I want, and that gives me maximum control. Since I can set the expected outcomes ahead of time, I can more easily troubleshoot/debug whenever the results don’t square with expectations.

My Mini “Conversion Funnel”

When it came to designing my first A/B test, I wanted to keep things relatively simple while still mimicking the general flow of an e-commerce conversion funnel. I designed a basic website with two different landing page variants—one with a green button and one with a red button. I arbitrarily decided that users would be 80% likely to click on the button when it’s green and 95% likely to click on the button when it’s red (these conversion rates are unrealistically high, I know). Users who didn’t click on the button would bounce, while those who did would advance to the “Purchase Page”.


To make things a little more complicated, I decided to have 20% of ‘green’ users bounce after reaching the purchase page. The main reason for this was to test out GA’s funnel visualizations to see if they would faithfully reproduce the graphic above (they did). After the purchase page, users would reach a final “Thank You” page with a button to claim their gift. There would be no further attrition at this point; all users who arrived on this page would click the “Claim Your Gift” button. This final action was the conversion (or ‘Goal’ in GA-speak) that I set as the objective for the A/B test.
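The whole funnel is just a chain of Bernoulli draws, so a quick Monte Carlo sketch (using the probabilities above) confirms the expected end-to-end conversion rates: 0.80 × 0.80 = 64% for green and 0.95 × 1.00 = 95% for red.

```python
import random

random.seed(1)

CLICK_P = {"green": 0.80, "red": 0.95}          # landing-page click-through
PURCHASE_BOUNCE = {"green": 0.20, "red": 0.00}  # bounce on the purchase page

def converts(variant):
    """One simulated user: click off the landing page, survive the
    purchase page, then always claim the gift (the GA Goal)."""
    if random.random() > CLICK_P[variant]:
        return False            # bounced on the landing page
    if random.random() < PURCHASE_BOUNCE[variant]:
        return False            # bounced on the purchase page
    return True                 # clicked "Claim Your Gift"

n = 100_000
green_rate = sum(converts("green") for _ in range(n)) / n
red_rate = sum(converts("red") for _ in range(n)) / n
```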

Google Analytics

With GA, I jumped straight into the deep end, adding gtag.js snippets to all the pages of my site. Then I implemented a few custom events and dimensions via JavaScript. In retrospect, I would have taken the courses offered by Google first (Google Analytics for Beginners & Advanced Google Analytics). These courses give you a really good lay of the land of what GA is capable of, and it’s really impressive. If you have a website, I don’t see how you can get away with not having it plugged into GA.

In terms of features, the real time event tracking is a fantastic resource for debugging GA implementations. However, the one feature I wasn’t expecting GA to have was the benchmarking feature. It allows you to compare the traffic on your site with websites in similar verticals. This is really great because even if you’re totally out of ideas on what to analyze (which you shouldn’t be given the rest of the features in GA), you can use the benchmarking feature as a starting point for figuring out the weak points in your site.

The other great thing about the two courses I mentioned is that they’re free, and at the end you can take the GA Individual Qualification exam to certify your knowledge about GA (which I did). If you’re gonna put in the time to learn the platform, it’s nice to have a little endorsement at the end.

Google Tag Manager

After implementing everything in gtag.js, I did it all again using GTM. I can definitely see the appeal of GTM as a way to deploy GA; it abstracts away all of that messy JavaScript and replaces it with a clean user interface and a handy debug tool. The one drawback of GTM seems to be that it doesn’t send events to GA quite as reliably as gtag.js. Specifically, in my GA reports for the ‘red button’ variant of my A/B test, I saw more conversions for the “Claim Your Gift” button than for the initial click off the landing page. Given the attrition rates I defined, that’s impossible. I tried configuring the tag to wait until the event was sent to GA before the next page loaded, but some data meant for GA still seemed to get lost in the mix.

Google Optimize

Before trying out GO, I implemented my little A/B test through Google’s legacy system, Content Experiments. I can definitely see why GO is the way of the future. There’s a nifty tool that lets you edit visual DOM elements right in the browser while you’re defining your variants. In Content Experiments, you have to either serve two separate pages (A and B) yourself or implement the expected changes in your own code. It’s a nice thing not to have to worry about, especially if you’re not a pro front-end developer.

Also, it’s clear that GO has more powerful decision features. For one thing, it has Bayesian decision logic, which is more comprehensible for business stakeholders and is gaining steam in online A/B testing. It also has the ability to do multivariate testing, which is a great addition, though I didn’t use that functionality for this test.

The one thing that was a bit irritating with GO was setting it up to run on localhost. It took a few hours of yak shaving to get the different variants to actually show up on my computer. It boiled down to 1) editing my /etc/hosts file with an extra line in accordance with this post on the Google Advertiser Community forum and 2) making sure the Selenium driver navigated to localhost.domain instead of just localhost.
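Concretely, the extra /etc/hosts line just maps the domain-style alias to the loopback address (a minimal sketch of the fix described above):

```
# /etc/hosts
127.0.0.1    localhost.domain
```

With that alias in place, GO treats the locally hosted site like a real domain.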

Google Data Studio

Nothing is worth doing unless you can make a dashboard at the end of it, right? While GA has some amazingly powerful report-generating capabilities, it can feel somewhat rigid in terms of customizability. GDS is a relatively new product that gives you way more options to visualize the data sitting in GA. But while GDS has an advantage over GA, it does have some frustrating limitations which I hope they resolve soon. In particular, I hope they’ll soon let you show percent differences between two scorecards. As someone who’s done a lot of A/B test reports, I know that the thing stakeholders are most interested in seeing is the % difference, or lift, caused by one variant versus another.

Here is a screenshot of the ultimate dashboard (or a link if you want to see it live):


The dashboard was also a good way to do a quick check to make sure everything in the test was working as expected. For example, the expected end-to-end conversion rate for the “Claim Your Gift” button was 64% (80% × 80%) for green versus 95% for red, and we see more or less those numbers in the first bar chart on the left. The conditional conversion rate (the conversion rate of users conditioned on clicking off the landing page) is also close to what was expected: 80% vs. 100%.

Notes about Selenium

So I really like Selenium, and after this project I have a little personal library for automated tests that I can apply to any website in the future—not just this little dinky one I ran locally on my machine.

One thing I’ve realized when writing code that deals with Selenium is that it’s important to make it highly fault tolerant. Anything that depends on the internet has countless failure modes—the wifi in the cafe you’re in might go down, or resources might randomly fail to load. But if you’ve written fault-tolerant code, hitting one of these snags won’t cause your program to stop running.

Along with fault-tolerant code, it’s a good idea to write good logs. When stuff does go wrong, this helps you figure out what it was. In this particular case, logs also served as a good source of ground truth to compare against the numbers I was seeing in GA.
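Both ideas combine naturally in a small retry wrapper that logs every failure; this is a generic sketch rather than the exact helper in my library:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("simulator")

def fault_tolerant(retries=3, delay=0.0):
    """Retry a flaky browser action, logging each failure instead of crashing."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    log.warning("%s failed (attempt %d/%d): %s",
                                fn.__name__, attempt, retries, exc)
                    time.sleep(delay)
            log.error("%s gave up after %d attempts", fn.__name__, retries)
            return None
        return wrapper
    return decorator

# e.g. wrapping a (hypothetical) Selenium action:
# @fault_tolerant(retries=3, delay=2.0)
# def click_buy_button(driver):
#     driver.find_element(By.ID, "buy").click()
```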

The End! (for now…)

I think I’ll be back soon with another post about AdWords and Advanced E-Commerce in GA…


A Possible Explanation why America does Nothing about Gun Control

Ever since the Las Vegas mass shooting last October, I’ve wanted to blog about gun control. But I also wanted to wait—to see whether that mass shooting, though the deadliest to date in U.S. history, would quickly slip into the dull recesses of the American public’s subconscious just like all the rest. It did, and once again we find ourselves in the same sorry cycle of inaction that by this point is painfully familiar to everyone.

I also recently came across a 2016 study, by Kalesan, Weinberg, and Galea, which found that on average, Americans are 99% likely to know someone either killed or injured by gun violence over the course of their lifetime. That made me wonder: how can it possibly be that Americans remain so paralyzed on this issue if it affects pretty much everyone?

It could be that the ubiquity of gun violence is actually the thing that causes the paralysis. That is, gun violence affects almost everyone just as Kalesan et al argue, but the reactions Americans have to the experience are diametrically opposed to one another. These reactions result in hardened views that inform people’s voting choices, and since these choices more or less divide the country in half along partisan lines, the result is an equilibrium where nothing can ever get done on gun control. So on this reading, it’s not so much a paralysis of inaction as a tense political stalemate.

But it could also be something else. Kalesan et al calculate the likelihood of knowing someone killed or injured by general gun violence over the course of a lifetime, but they don’t focus on mass shootings in particular. Their methodology is based on basic principles of probability and some social network theory that posits people have an effective social network numbering a little fewer than 300 people. If you look at the Kalesan et al paper, it becomes clear that their methodology can also be used to calculate the likelihood of knowing someone killed or injured in a mass shooting. It’s just a matter of substituting the rate of mass shootings for the rate of general gun violence in their probability calculation.

It turns out that the probability of knowing someone killed/injured in a mass shooting is much, much lower than for gun violence more generally. Even with a relatively generous definition of what counts as a mass shooting (four or more people injured/killed not including the shooter, according to the Gun Violence Archive), this probability is about 10%. When you only include incidents that have received major national news media attention—based on a list compiled by Mother Jones—that probability drops to about 0.36%.
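The back-of-the-envelope version of this kind of calculation: if each of your n contacts independently has lifetime victimization probability p, then the chance you know at least one victim is 1 − (1 − p)^n. Inverting that recovers the per-person rates implied by the probabilities above (the network size of 291 is my stand-in for the paper’s “a little fewer than 300”; the arithmetic is my own illustration, not the authors’):

```python
def p_know_victim(per_person_rate, network_size=291):
    """P(at least one of n contacts is a victim), assuming independence."""
    return 1 - (1 - per_person_rate) ** network_size

def implied_rate(p_know, network_size=291):
    """Invert: the per-person rate implied by a 'know a victim' probability."""
    return 1 - (1 - p_know) ** (1 / network_size)

# Per-person lifetime rates implied by the three probabilities in the text:
rate_any_gun_violence = implied_rate(0.99)    # roughly 1.6% per person
rate_mass_shooting = implied_rate(0.10)       # roughly 0.04%
rate_high_profile = implied_rate(0.0036)      # roughly 0.001%
```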

So, it’s possible the reason Americans continue to drag their feet on gun control is that the problem just doesn’t personally affect enough people. Curiously, the even lower likelihood of knowing someone killed or injured in a terrorist attack doesn’t seem to hinder politicians from working aggressively to prevent further terrorist attacks. Still, if more people were personally affected by mass shootings, more might change their minds on gun control like Caleb Keeter, the Josh Abbot band guitarist who survived the Las Vegas shooting.

Front Page Clues

They say the best place to hide a dead body is on page two of the Google search results. I’d argue that a similar rule applies to reading the news, especially online. If a story is not on the landing page of whatever news site I’m looking at, chances are I’m not gonna find it. All this is to say: news outlets wield considerable power to direct our attention where they want it simply by virtue of how they organize content on their sites.

During presidential elections, the media is often criticized for giving priority to the political horse race between dueling candidates, preoccupying us with pageantry over policy. But to what extent is this true? And if it is true, which specific policy issues suffer during election cycles? Do some suffer more than others? What are we missing out on because we are too busy keeping up with the horse race instead?

If you want to go straight to the answers to these questions (and some other interesting stuff), skip down to the Findings section of the post. For those interested in the technical details, the next two sections are for you.


Lucky for us (or maybe just me), the New York Times generously makes a ton of its data available online for free, easily retrievable via calls to a REST API (specifically, their Archive API). Just a few dozen calls and I was in business. This amazing resource not only has information going back to 1851 (!!), it also includes keywords from each article as part of its metadata. Even better, since 2006, they have ranked the keywords in each article by their importance. This means that for any article that is keyword-ranked, you can easily extract its main topic—whatever person, place, or subject it might be.
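Each of those calls is just a GET against the Archive API’s per-month endpoint; a minimal sketch (the API key is a placeholder you’d get from the NYT developer portal):

```python
import json
from urllib.request import urlopen

def archive_url(year, month, api_key):
    """Endpoint for one month of NYT article metadata (month is 1-12)."""
    return (f"https://api.nytimes.com/svc/archive/v1/"
            f"{year}/{month}.json?api-key={api_key}")

def fetch_month(year, month, api_key):
    """Download and parse one month's article metadata (network call)."""
    with urlopen(archive_url(year, month, api_key)) as resp:
        return json.load(resp)["response"]["docs"]

url = archive_url(2016, 11, "YOUR_KEY")
```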

Having ranked keywords makes this analysis much easier. For one thing, we don’t have to sift through words from mountains of articles in order to surmise what each article is about using fuzzy or inexact NLP methods. And since they’ve been ranking keywords since 2006, this gives us three presidential elections to include as part of our analysis (2008, 2012, and 2016).

The other crucial dimension included in the NYT article metadata is the print page. Personally, I don’t ever read the NYT on paper anymore (or any newspaper, for that matter—they’re just too unwieldy), so you might argue that the print page is irrelevant. Possibly, but unfortunately we don’t have data about placement on the NYT’s website. And moreover, I would argue that the print page is a good proxy for this. It gets at the essence of what we’re trying to measure, which is the importance NYT editors place on a particular topic over others.


logit(\pi_t) = \log\left(\frac{\pi_t}{1-\pi_t}\right) = \alpha + \sum_{k=1}^{K} \beta_{k} \, Desk_{k} + \beta_{E} \, is\_election

A logistic regression model underpins the analysis here. The log-odds that a topic \textit{t} will appear on the front page of the NYT is modeled as a function of the other articles appearing on the front page (the Desk variables, more on those below), as well as a dummy variable indicating whether or not the paper was published during an election cycle.

Modeling the other articles on the front page is essential since they have obvious influence over whether topic \textit{t} will make the front page on a given day. But in modeling these other articles, a choice is made to abstract from the topics of the articles to the news desk from which they originated. Using the topics themselves unfortunately leads to two problems: sparsity and singularity. Singularity is a problem that arises when your data has too many variables and too few observations. Fortunately, there are statistical methods to overcome this issue—namely penalized regression. Penalized regression is often applied to machine learning problems, but recent developments in statistics have extended the methodology of significance testing to penalized models like ridge regression. This is great since we are actually concerned with interpreting our model rather than just pure prediction—the more common aim in machine learning applications.

Ultimately though, penalized methods do not overcome the sparsity problem. Simply put, there are too many other topics that might appear (and appear too infrequently) on the front page to get a good read on the situation. Therefore as an alternative, we aggregate the other articles on the front page according to the news desk they came from (things like Foreign, Style, Arts & Culture, etc). Doing so allows our model to be readily interpretable while retaining information about the kinds of articles that might be crowding out topic \textit{t}.
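Building the regressors for one day’s paper then amounts to counting the other front-page articles by news desk and flagging whether topic t made page one. A simplified sketch (the real NYT metadata has ranked keyword lists and a news_desk field; the pair representation and desk names here are illustrative):

```python
from collections import Counter

def build_row(front_page, topic, is_election):
    """One observation: desk counts for the *other* page-one articles,
    plus the election dummy and the outcome (did the topic make page one?).

    front_page: list of (lead_keyword, news_desk) pairs for page-one articles.
    """
    on_page_one = any(kw == topic for kw, _ in front_page)
    desk_counts = Counter(desk for kw, desk in front_page if kw != topic)
    return {"y": int(on_page_one),
            "is_election": int(is_election),
            **desk_counts}

row = build_row(
    [("Health Care", "National"), ("Syria", "Foreign"),
     ("Playoffs", "Sports"), ("Elections", "National")],
    topic="Health Care", is_election=True)
```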

The  is\_election variable is a dummy variable indicating whether or not the paper was published in an election season. This is determined via a critical threshold illustrated by the red line in the graph below. The same threshold was applied across all three elections.


In some specifications, the is\_election variable might be broken into separate indicators, one for each election. In other specifications, these indicators might be interacted with one or several news desk variables—though only when the interactions add explanatory value to the overall model as determined by an analysis of deviance.

Two other modeling notes. First, for some topics, the model might suffer from quasi or complete separation. This occurs when for example, all instances of topic \textit{t} appearing on page one occur when there are also less than two Sports desk articles appearing on page one. Separation can mess up logistic regression coefficient estimates, but fortunately, a guy named Firth (not Colin, le sigh) came up with a clever workaround, which is known as Firth Regression. In cases where separation is an issue, I switch out the standard logit model for Firth’s alternative. This is easily done using R’s logistf package, and reinforces why I favor R over python when it comes to doing serious stats.

Second, it should be pointed out that our model does run somewhat afoul of one of the basic assumptions of logistic regression—namely, independence. In regression models, it is regularly assumed that the observations are independent of (i.e. don’t influence) each other. That is probably not true in this case, since news cycles can stretch over the span of several newspaper editions. And whether a story makes front page news is likely influenced by whether it was on the front page the day before.

Model-wise, this is a tough nut to crack since the data is not steadily periodic—as is the case with regular time series data. It might be one, two, or sixty days between appearances of a given topic. In the absence of a completely different approach, I test the robustness of my findings by including an additional variable in my specification—a dummy indicating whether or not topic \textit{t} appeared on the front page the day before.


For this post, I focused on several topics I believe are consistently relevant to national debate, but which I suspected might get less attention during a presidential election cycle. It appears the 2016 election cycle was particularly rough on healthcare coverage. The model finds a statistically significant effect (\beta = -4.519; p = 0.003), which means that for the average newspaper, the probability that a healthcare article made the front page dropped by 60% during the 2016 election season—from a probability of 0.181 to 0.071. This calculation is made by comparing the predicted values with and without the 2016 indicator activated—while holding all other variables fixed at their average levels during the 2016 election season.
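The quoted percent drop is just the relative change between the two predicted probabilities; plugging in the numbers above:

```python
def relative_drop(p_with, p_without):
    """Percent drop in predicted probability when the dummy switches on."""
    return (p_without - p_with) / p_without

# Healthcare on page one: 0.181 outside the 2016 election season,
# 0.071 during it (all other covariates held at their averages).
drop = relative_drop(p_with=0.071, p_without=0.181)   # ~0.61, i.e. ~60%
```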

Another interesting finding is the significant coefficient (p = 0.045) found on the interaction term between the 2016 election and articles from the NYT’s National desk, which is actually positive (\beta = 0.244).  Given that the National desk is one of the top-five story-generating news desks at the New York Times, you would think that more National stories would come at the expense of just about any other story. And this is indeed the case outside of the 2016 election season, where the probability healthcare will make the front page drops 45% when an additional National article is run on the front page of the average newspaper. During the 2016 election season, however, the probability actually increases by 25%.  These findings were robust to whether or not healthcare was front page news the day before.

The flip on the effect of National coverage here is curious, and raises the question as to why it might be happening. Perhaps NYT editors had periodic misgivings about the adequacy of their National coverage during the 2016 election and decided to make a few big pushes to give more attention to domestic issues including healthcare. Finding the answer requires more digging. In the end though, even if healthcare coverage was buoyed by other National desk articles, it still suffered overall during the 2016 election.

The other topic strongly associated with election cycles is gun control (\beta = -3.822; p = 0.007). Articles about gun control are 33% less likely to be run on the front page of the average newspaper during an election cycle. One thing that occurred to me about gun control, however, is that it generally receives major coverage boosts in the wake of mass shootings. It’s possible that the association here is being driven by a dearth of mass shootings during presidential elections, but I haven’t looked more closely to see whether a drop-off in mass shootings during election cycles actually exists.

Surprisingly, coverage about the U.S. economy is not significantly impacted by election cycles, which ran against my expectations. However, coverage about the economy was positively associated with coverage about sports, which raises yet more interesting questions. For example, does our attention naturally turn to sports when the economic going is good?

Unsurprisingly, elections don’t make a difference to coverage about terrorism. However, when covering stories about terrorism in foreign countries, other articles from the Foreign desk significantly influence whether the story will make the front page cut (\beta = -1.268; p = 8.49e-12). Starting with zero Foreign stories on page one, just one other Foreign article will lower the chances that an article about foreign terrorism will appear on page one by 40%. In contrast, no news desk has any systematic influence on whether stories about domestic terrorism make it on to page one.

Finally, while elections don’t make a difference to front page coverage about police brutality and misconduct, interestingly, articles from the NYT Culture desk do. There is a significant and negative effect (\beta = -0.364; p = 0.033), which for the average newspaper, means a roughly 17% drop in the probability that a police misconduct article will make the front page with an additional Culture desk article present. Not to knock the Culture desk or nothing, but this prioritization strikes me as somewhat problematic.

In closing, while I have managed to unearth several insights in this blog post, many more may be surfaced using this rich data source from the New York Times. Even if some of these findings raise questions about how the NYT does its job, it is a testament to the paper as an institution that they are willing to open themselves to meta-analyses like this. Such transparency enables an important critical discussion about the way we consume our news. More informed debate—backed by hard numbers—can hopefully serve the public good in an era when facts in the media are often under attack.


Movies! (Now with More AI!!)

Earlier, I played around with topic modeling/recommendation engines in Apache Spark. Since then, I’ve been curious to see if I could make any gains by adopting another text processing approach in place of topic modeling—word2vec. For those who don’t know word2vec, it takes individual words and maps them into a vector space where the vector weights are determined by a neural network that trains on a corpus of text documents.

I won’t go into major depth on neural networks here (a gentle introduction for those who are interested), except to say that they are considered by many to be the bleeding edge of artificial intelligence. Personally, I like word2vec because you don’t necessarily have to train the vectors yourself. Google has pre-trained vectors derived from a massive corpus of news documents they’ve indexed. These vectors are rich in semantic meaning, so it’s pretty cool that you can leverage their value with no extra work. All you have to do is download the (admittedly large, 1.5 GB) file onto your computer and you’re good to go.

Almost. Originally, I had wanted to do this on top of my earlier spark project, using the same pseudo-distributed docker cluster on my old-ass laptop. But when I tried to load the pre-trained Google word vectors into memory, I got a big fat MemoryError, which I actually thought was pretty generous because it was nice enough to tell me exactly what it was.

I had three options: commandeer some computers in the cloud on Amazon, try to finagle spark’s configuration like I did last time, or finally, try running Spark in local mode. Since I am still operating on the cheap, I wasn’t gonna go with option one. And since futzing around with Spark’s configuration put me in a dead end last time, I decided to ditch the pseudo-cluster and try running Spark in local mode.

Although local mode was way slower on some tasks, it could still load Google’s pre-trained word2vec model, so I was in business. Similar to my approach with topic modeling, I created a representative vector (or ‘profile’) for each user in the Movielens dataset. But whereas in the topic model, I created a profile vector by taking the max value in each topic across a user’s top-rated movies, here I instead averaged the vectors I derived from each movie (which were themselves averages of word vectors).

Let’s make this a bit more clear. First you take a plot summary scraped from Wikipedia, and then you remove common stop words (‘the’, ‘a’, ‘my’, etc.). Then you pass those words through the pre-trained word2vec model. This maps each word to a vector of length 300 (a word vector can in principle be of any length, but Google’s are of length 300). Now you have D vectors of length 300, where D is the number of words in a plot summary. If you average the values in those D vectors, you arrive at a single vector that represents one movie’s plot summary.
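With tiny made-up stand-in vectors in place of Google’s 300-dimensional pre-trained ones, the plot-to-vector step looks like this:

```python
import numpy as np

# Toy stand-in for the pre-trained Google News vectors (normally 300-d,
# loaded via gensim); here a made-up 3-d vocabulary for illustration.
word_vectors = {
    "spy": np.array([0.9, 0.1, 0.0]),
    "chase": np.array([0.8, 0.2, 0.1]),
    "romance": np.array([0.0, 0.9, 0.3]),
}
STOP_WORDS = {"the", "a", "my", "is", "of"}

def movie_vector(plot_summary):
    """Average the word vectors of a plot's non-stopword, in-vocab words."""
    words = [w for w in plot_summary.lower().split()
             if w not in STOP_WORDS and w in word_vectors]
    return np.mean([word_vectors[w] for w in words], axis=0)

vec = movie_vector("The spy is in a chase")
```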

Note: there are other ways of aggregating word vectors into a single document representation (including doc2vec), but I proceeded with averages because I was curious to see whether I could make any gains by using the most dead simple approach.

Once you have an average vector for each movie, you can get a profile vector for each user by averaging (again) across a user’s top-rated movies. At this point, recommendations can be made by ranking the cosine similarity between a user’s profile and the average vectors for each movie. This could power a recommendation engine on its own—or supplement explicit ratings for (user, movie) pairs that aren’t observed in the training data.
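Here is a minimal sketch of the profile-and-rank step, again with invented two-dimensional “movie vectors” rather than real averaged word2vec output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def profile(movie_vectors):
    """User profile = element-wise mean of top-rated movie vectors."""
    n = len(movie_vectors)
    return [sum(col) / n for col in zip(*movie_vectors)]

# hypothetical 2-d 'average plot vectors', invented for illustration
movies = {"bourne": [0.8, 0.2], "ipcress": [0.7, 0.3], "notebook": [0.1, 0.9]}

user = profile([movies["bourne"], movies["ipcress"]])  # a spy-movie fan
ranked = sorted(movies, key=lambda m: cosine(user, movies[m]), reverse=True)
# the romance lands at the bottom of this user's ranking
```

Ranking every movie against every user profile this way is embarrassingly parallel, which is part of why Spark (even in local mode) was a reasonable fit.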

Cognizant of the hardware limitations I ran up against last time, I opted for the same approach I adopted then, which was to pretend I knew less about users and their preferences than I really did. My main goal was to see whether word2vec could beat out the topic modeling approach, and in fact it did. With 25% of the data covered up, the two algorithms performed roughly the same against the covered-up data. But with 75% of the data covered up, word2vec resulted in an 8% performance boost (as compared with 3% gained from topic modeling).

So with very little extra work (simple averaging and pre-trained word vectors), word2vec has pretty encouraging out of the box performance. It definitely makes me eager to use word2vec in the future.

Also a point in word2vec’s favor: when I sanity checked the cosine similarity scores of word2vec’s average vectors across different movies, The Ipcress File shot to the top of the list of movies most similar to The Bourne Ultimatum. Still don’t know what The Ipcress File is? Then I don’t feel bad re-using the same joke as a meme sign-off.


Which Countries are the Most Addicted to Coffee?

This is my last blog post about coffee (I promise). Ever since stumbling upon this Atlantic article about which countries consume the most coffee per capita, I’ve pondered a deeper question—not just who drinks the most coffee, but who’s most addicted to the stuff. You might argue that these questions are one and the same, but when it comes to studying addiction, it actually makes more sense to look at it through the lens of elasticity rather than gross consumption.

You might recall elasticity from an earlier blog post. Generally speaking, elasticity is the economic concept that measures how sensitive consumption is to changes in another variable (in my earlier post, that variable was income). When it comes to studying addiction, economists focus on price elasticity—i.e. % change in quantity divided by the % change in price. And it makes sense. If you want to know how addicted people are to something, why not see how badly they need it after you jack up the price? If they don’t react very much, you can surmise that they are more addicted than if they do. Focusing on elasticity rather than gross consumption allows for a richer understanding of addiction. That’s why economists regularly employ this type of analysis when designing public policy around cigarette taxes.

I would never want to tax coffee, but I am interested in applying the same approach to calculate the price elasticity of coffee across different countries. While price elasticity is not a super complicated idea to grasp, in practice it is actually quite difficult to calculate. In the rest of this blog post, I’ll discuss these challenges in detail.

Gettin’ Da Data

The first problem for any data analysis is locating suitable data; in this case, the most important data is information for the two variables that make up the definition of elasticity: price and quantity. Thanks to the International Coffee Organization (ICO), retail coffee price data was surprisingly easy to find for 26 different countries, reaching (in some cases) back to 1990.

Although price data was remarkably easy to find, there were still a few wrinkles to deal with. First, for a few countries (e.g. the U.S.), there was missing data in some years. To remedy this, I used the handy R package imputeTS to generate reasonable values for the gaps in the affected time series.
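The imputation itself happened in R with imputeTS, but the core idea (linear interpolation across the gaps) can be sketched in a few lines of Python; this stand-in assumes the series starts and ends with observed values, which was true of my price series:

```python
def fill_gaps(values):
    """Linearly interpolate runs of None; assumes the first and last
    entries are observed values (a rough stand-in for imputeTS in R)."""
    out = list(values)
    for i, v in enumerate(out):
        if v is not None:
            continue
        lo = i - 1                 # last observed point before the gap
        hi = i
        while out[hi] is None:     # first observed point after the gap
            hi += 1
        step = (out[hi] - out[lo]) / (hi - lo)
        for j in range(lo + 1, hi):
            out[j] = out[lo] + step * (j - lo)
    return out

# two missing years of (made-up) prices get evenly spaced fill-ins
filled = fill_gaps([1.00, None, None, 1.45])
```

imputeTS offers fancier options (Kalman smoothing, seasonal decomposition), but for short gaps in an annual series, simple interpolation is usually defensible.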

The other wrinkle related to inflation. I searched around ICO’s website to see if their prices were nominal or whether they controlled for the effects of inflation. Since I couldn’t find any mention of inflation, I assumed the prices were nominal. Thus, I had to make a quick stop at the World Bank to grab inflation data so that I could deflate nominal prices to real prices.

While I was at the Bank, I also grabbed data on population and real GDP. The former is needed to get variables on a per capita basis (where appropriate), while the latter is needed as a control in our final model. Why? If you want to see how people react to changes in the price of coffee, it is important to hold constant any changes in their income. We’ve already seen how positively associated coffee consumption is with income, so this is definitely a variable you want to control for.

Getting price data might have been pretty easy, but quantity data posed more of a challenge. In fact, I wasn’t able to find any publicly available data about country-level coffee consumption. What I did find, however, was data about coffee production, imports and exports (thanks to the UN’s Food and Agriculture Organization). So, using a basic accounting logic (i.e. production – exports + imports), I was able to back into a net quantity of coffee left in a country in a given year.
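The accounting identity is trivial, but for concreteness (figures below are made up for one hypothetical country-year):

```python
def net_quantity(production, imports, exports):
    """Apparent consumption proxy: production - exports + imports,
    all in the same units (e.g. tonnes of green coffee)."""
    return production - exports + imports

# made-up figures for a single country-year
net_quantity(100_000, 30_000, 20_000)  # -> 110_000 tonnes left in-country
```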

There are obvious problems with this approach. For one thing, it assumes that all coffee left in a country after accounting for imports and exports is actually consumed. Although coffee is a perishable good, it is likely that at least in some years, quantity is carried over from one year to another. And unfortunately, the UN’s data gives me no way to account for this. The best I can hope for is that this net quantity I calculate is at least correlated with consumption. Since elasticity is chiefly concerned with the changes in consumption rather than absolute levels, if they are correlated, then my elasticity estimates shouldn’t be too severely biased. In all, the situation is not ideal. But if I couldn’t figure out some sort of workaround for the lack of publicly available coffee consumption data, I would’ve had to call it a day.

Dogged by Endogeneity

Once you have your data ducks in a row, the next step is to estimate your statistical model. There are several ways to model addiction, but the simplest is the following:

log(cof\_cons\_pc) = \alpha + \beta_{1} * log(price) + \beta_{2} * log(real\_gdp\_pc) + \beta_{3} * year + \varepsilon

The above linear regression equation models per capita coffee consumption as a function of price while controlling for the effects of income and time. The regression is taken in logs mostly as a matter of convenience, since logged models allow the coefficient on price, \beta_1, to be your estimate of elasticity. But there is a major problem with estimating the above equation, as economists will tell you, and that is the issue of endogeneity.

Endogeneity can mean several different things, but in this context, it refers to the fact that you can’t isolate the effect of price on quantity because the data you have is the sum total of all the shocks that shift both supply and demand over the course of a year. Shocks can be anything that affects supply/demand apart from price itself—from changing consumer tastes to a freak frost that wipes out half the annual Colombian coffee crop. These shocks are for the most part unobserved, but all together they define the market dynamics that jointly determine the equilibrium quantity and price.

To isolate the effect of price, you have to locate the variation in price that is not also correlated with the unobserved shocks in a given year. That way, the corresponding change in quantity can safely be attributed to the change in price alone. This strategy is known as using an instrumental variable (IV). In the World Bank Tobacco Toolkit, one suggested IV is lagged price (of cigarettes in their case, though the justification is the same for coffee). The rationale is that shocks from one year are not likely to carry over to the next, while at the same time, lagged price remains a good predictor of price in the current period. This idea has its critics (which I’ll mention later), but has obvious appeal since it doesn’t require any additional data.

To implement the IV strategy, you first run the model:


price_t = \alpha + \beta_{1} * price_{t-1} + \beta_{2} * real\_gdp\_pc_t + \beta_{3} * year_t + \varepsilon_t

You then use predicted values of price_t from this model as the values for price in the original elasticity model I outlined above. This is commonly known as Two Stage Least Squares regression.
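Here is what the two stages look like in a stripped-down sketch with a single regressor. The numbers are invented to make the arithmetic clean; the real models also include GDP and a year trend, and real data would never fit this perfectly:

```python
def ols(x, y):
    """Slope and intercept from simple one-variable least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# invented data, constructed so the arithmetic works out exactly
lagged_price = [1.0, 1.2, 1.1, 1.4, 1.3]   # the instrument
price        = [1.1, 1.3, 1.2, 1.5, 1.4]   # endogenous regressor
log_qty      = [2.0, 1.8, 1.9, 1.6, 1.7]   # log consumption

# Stage 1: regress price on the instrument, keep the fitted values
b1, a1 = ols(lagged_price, price)
price_hat = [a1 + b1 * z for z in lagged_price]

# Stage 2: regress quantity on the *predicted* price
elasticity, _ = ols(price_hat, log_qty)    # -> -1.0 for this toy data
```

In practice you would use a package that does 2SLS in one shot (e.g. ivreg in R), not least because running the second stage by hand gives you the wrong standard errors.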

Next Station: Non-Stationarity

Endogeneity is a big pain, but it’s not the only issue that makes it difficult to calculate elasticity. Since we’re relying on time series data (i.e. repeated observations of country level data) as our source of variation, we open ourselves to the various problems that often pester inference in time series regression as well.

Perhaps the most severe problem posed by time series data is the threat of non-stationarity. What is non-stationarity? Well, when a variable is stationary, its mean and variance remain constant over time. Stationarity matters in linear regression because non-stationary variables can render the coefficients you estimate spurious—i.e. meaningless. Thus, finding a way to make sure your variables are stationary is rather important.

This is all made more complicated by the fact that there are several different flavors of stationarity. A series might be trend stationary, which means that it’s stationary around a trend line. Or it might be difference stationary, meaning it’s stationary after you difference the series, i.e. subtract away the previous year’s value so you’re just left with the change from year to year. A series could also have structural breaks, like an outlier or a lasting shift in the mean (or multiple shifts). And finally, if two or more series are non-stationary but related to one another by way of something called co-integration, then you have to apply a whole different analytical approach.
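The differencing transformation itself is trivial to sketch (the series below is made up; deciding whether the result is actually stationary is the hard part, which is where unit-root tests come in):

```python
def difference(series):
    """First difference: y_t minus y_{t-1} (the series loses one point)."""
    return [b - a for a, b in zip(series, series[1:])]

# a trending (non-stationary) series becomes flat after differencing
trend = [10, 12, 14, 16, 18]
difference(trend)  # -> [2, 2, 2, 2]
```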

At this point, a concrete example might help to illustrate the type of adjustments that need to be made to ensure stationarity of a series. Take a look at this log-scaled time series of coffee consumption in The Netherlands:


It seems like overall there is a slightly downward trend from 1990 through 2007 (with a brief interruption in 1998/99). Then, in 2008 through 2010, there was a mean shift downward, followed by another shift down in 2011. But starting in 2011, there seems to be a strong upward trend. All of these quirks in the series are accounted for in my elasticity model for The Netherlands using dummy variables—interacted with year when appropriate to allow for different slopes in different epochs described above.

This kind of fine-grained analysis had to be done on three variables per model—across twenty-six. different. models. . . Blech. Originally, I had hoped to automate much of this stage of the analysis, but the idiosyncrasies of each series made this impossible. The biggest issue was the structural breaks, which easily throw off the Augmented Dickey-Fuller test, the workhorse for detecting statistically whether or not a series is stationary.

This part of the project definitely took the longest to get done. It also involved a fair amount of judgment calls—when to de-trend a series, when to difference it, how to separate the epochs when structural breaks were present. All of this lends credence to the critique that time series analysis can often be more art than science. The work was tedious, but at the very least, it gave me confidence that it might be a while before artificial intelligence replaces humans for this particular task. In prior work, I actually implemented a structural break detection algorithm I once found in a paper, but I wasn’t impressed with its performance, so I wasn’t going to go down that rabbit hole again (for this project, at least).

Other Complications

Even after you’ve dealt with stationarity, there are still other potential problem areas. Serial correlation is one of them. What is serial correlation? Well, one of the assumptions in linear regression is that the error term, or \varepsilon, as it appears in the elasticity model above, is independent across different observations. Since you observe the same entity multiple times in time series data, the observations in your model are by definition dependent, or correlated. A little serial correlation isn’t a big deal, but a lot of serial correlation can cause your standard errors to become biased, and you need those for fun stuff like statistical inference (confidence intervals/hypothesis testing).

Another problem that can plague your \varepsilon‘s is heteroskedasticity, which is a complicated word that means the variance of your errors is not constant over time. Fortunately, both heteroskedasticity and serial correlation can be controlled for using robust covariance calculations known as Newey-West estimators. These methods are easily accessible in R via the sandwich package, which I used whenever I observed heteroskedasticity or serial correlation in my models.
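As a sketch of the symptom that Newey-West corrects for, here is a quick lag-1 autocorrelation check on regression residuals. The residuals below are made up, and the actual correction in my analysis came from R’s sandwich package, not this snippet:

```python
def lag1_autocorr(resid):
    """Lag-1 autocorrelation of regression residuals: values near 0 are
    fine; values well away from 0 signal serial correlation."""
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in resid)
    return num / den

# made-up residuals that drift in runs -> clearly serially correlated
rho = lag1_autocorr([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # -> 0.5
```

This is essentially the quantity behind the Durbin-Watson statistic, which is the usual formal test for this problem.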


A final issue is the problem of multicollinearity. Multicollinearity is not strictly a time series related issue; it occurs whenever the covariates in your model are highly correlated with one another. When this happens, your beta estimates become highly unreliable and unstable. This occurred in the models for Belgium and Luxembourg between the IV price variable and GDP. There are not many good options when your model suffers from multicollinearity. Keep the troublesome covariate, and you’re left with some really weird coefficient values. Throw it out, and your model could suffer from omitted variable bias. In the end, I excluded GDP from these models because the estimated coefficients looked less strange.


In the end, only two of the elasticities I estimated ended up being statistically significant—or reliable—estimates of elasticity (for Lithuania and Poland). The rest were statistically insignificant (at \alpha = 0.05), which means that positive values of elasticity are among the plausible values for a majority of the estimates. From an economic theory standpoint this makes little sense, since it is a violation of the law of demand. Higher prices should lead to a fall in demand, not a rise. Economists have a name for goods that violate the law of demand—Giffen goods—but they are very rare to encounter in the real world (some say that rice in China is one of them). I’m pretty sure coffee is a normal, well-behaved good.

Whenever you wind up with insignificant results, there is always a question of how to present them (if at all). Since the elasticity models produce actual point estimates for each country, I could have just put those point estimates in descending order and spit out that list as a ranking of the countries most addicted to coffee. But that would be misleading. Instead, I think it’s better to display entire confidence intervals—particularly to demonstrate the variance in certainty (i.e. width of the interval) across the different country estimates. The graphic below ranks countries from top to bottom in descending order by width of confidence interval. The vertical line at -1.0% is a reference for the threshold between goods considered price elastic ( \epsilon < -1.0% ) versus price inelastic ( -1.0% < \epsilon < 0.0% ).


When looking at the graphic above, it is important to bear in mind that apart from perhaps a few cases, it is not possible to draw conclusions about the differences in elasticities between individual countries. You cannot, for example, conclude that coffee is more elastic in the United States relative to Spain. To generate a ranking of elasticities across all countries (and arrive at an answer to the question posed by the post title), we would need to perform a battery of pairwise comparisons between all the different countries ([26*25]/2 = 325 in total). Based on the graphic above, however, I am not convinced this would be worth the effort. Given the degree of overlap across confidence intervals—and the fact that the significance-level correction to account for multiple comparisons would only make this problem worse—I think the analysis would just wind up being largely inconclusive.

In the end, I’m left wondering what might be causing the unreliable estimates. In some cases, it could just be a lack of data; perhaps with access to more years—or more granular data taken at monthly or quarterly intervals—confidence intervals would shrink toward significance. In other cases, I might have gotten unlucky in terms of the variation of a given sample. But I am also not supremely confident in the fidelity of my two main variables, quantity and price, since both variables have artificial qualities to them. Quantity is based on values I synthetically backed into rather than coming from a concrete, vetted source, and price is derived from IV estimation. Although I trusted the World Bank when it said lagged price was a valid IV, I subsequently read some literature that said it may not solve the endogeneity issue after all. Specifically, it argues the assumption that shocks are not serially correlated is problematic.

If lagged price is not a valid IV, then another variable must be found that is correlated with price, but not with shocks to demand. Upon another round of Googling, I managed to find data with global supply-side prices through the years. It would be interesting to compare the results using these two different IVs. But then again, I did promise that this would be my last article about coffee… Does that mean the pot is empty?





Do Hangovers Make Us Drink More Coffee?


After finishing my last blog post, I grew curious about the relationship between coffee and another beverage I’ve noticed is quite popular amongst backpacker folk: alcohol. Are late-night ragers (and their accompanying brutal hangovers) associated with greater levels of coffee consumption? Or is the idea about as dumb as another Hangover sequel?

When you look at a simple scatter plot associating per capita alcohol and coffee consumption on a national level, you might think that yes, alcohol does fuel coffee consumption (based on the apparent positive correlation).


But does this apparent relationship hold up to closer scrutiny? In my last article, we discovered that variables like country wealth could explain away much of the observed variation in coffee consumption. Could the same thing be happening here as well? That is, do richer countries generally consume more coffee and alcohol just because they can afford to do so?

The answer seems to be: not as much as I would have thought. The thing to notice in the graphs above is how much less sensitive alcohol consumption is to income compared to coffee. In other words, it don’t matter how much money is in your wallet, you gonna get your alcohol on no matter what. But for coffee, things are different. This is evident from the shapes of the data clouds in the respective graphs. That bunch of data points in the top left of the alcohol graph? You don’t see a similar shape in the coffee chart. And that means that consumption of alcohol depends much less on income, relative to coffee.

It’s not too surprising that alcohol consumption is less sensitive to income changes than coffee, but it’s always cool to see intuition borne out in real-life data that you randomly pull off the internet. By looking at what economists call income elasticity of demand—the % change in consumption of a good divided by the % change in income—we can more thoroughly quantify what we’re seeing. Using a log-log model, standard linear regression can be used to get a rough estimate* of the income elasticity of demand. In these models, the beta coefficient on log(income) ends up being the elasticity estimate.

When the elasticity of a good is less than 1, it is considered an inelastic good, i.e. a good that is not very sensitive to changes in income. Inelastic goods are also sometimes referred to as necessity goods. By contrast, goods with elasticity greater than 1 are considered elastic goods, or luxury goods. Sure enough, when you fit a log-log model to the data, the estimated elasticity for coffee is greater than 1 (1.08), while the estimated elasticity for alcohol is less than 1 (0.54). Hmm, so in the end, alcohol is more of a ‘necessity’ than coffee. Perhaps this settles any debate over which beverage is more addictive.
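The log-log slope calculation can be sketched in a few lines; the data below are made up, constructed so that consumption grows as the square root of income and the true elasticity is therefore exactly 0.5:

```python
import math

def income_elasticity(income, consumption):
    """Slope of log(consumption) on log(income) in a simple log-log fit."""
    lx = [math.log(v) for v in income]
    ly = [math.log(v) for v in consumption]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

# invented data where consumption = income ** 0.5,
# so the true elasticity is 0.5: an inelastic, 'necessity' good
income = [10.0, 20.0, 40.0, 80.0]
consumption = [v ** 0.5 for v in income]
income_elasticity(income, consumption)  # ~= 0.5
```

A slope above 1 on real data would flip the label from necessity to luxury, which is exactly the coffee-versus-alcohol contrast in the estimates above.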

When it comes to drinking however (either coffee or alcohol), one cannot ignore the role that culture plays in driving the consumption of both beverages. Perhaps cultures that drink a lot of one drink a lot of the other, too. Or perhaps a culture has a taboo against alcohol, like we find in predominantly Muslim countries. To control for this, I included region-of-the-world controls in my final model relating alcohol and coffee consumption.

Unfortunately for my initial hypothesis, once you account for culture, any statistically significant relationship between alcohol and coffee consumption vanishes. To be sure that controlling for culture in my model was the right call, I performed a nested model analysis—a statistical method that basically helps make sure you’re not over-complicating things. The nested model analysis concluded that yes, culture does add value to the overall model, so I can’t just ignore it.

Echoing my last article, this is not the final word on the subject, as again, more granular data (at the individual level) could show a significant link between the two. Instead what this analysis says is that if a relationship does exist, it is not potent enough to show up in data at the national level. Oh well, it was worth a shot. Whether or not alcohol and coffee consumption are legit linked to one another, one fact remains indisputably true: hangovers suck.



Data Links:

  • Alcohol – World Health Organization – link
  • Economic Data – Conference Board – link
  • Coffee – Euromonitor (via The Atlantic) – link

* “rough” because ideally, you would look at changes in income in a single country to estimate elasticity rather than look at differences in income across different countries.