Friday, April 13, 2018

Science, Open data, and your Privacy

If you are a regular reader of this blog, chances are you are either a fellow (social) psychologist or somebody with an above-average interest in social psychology. In both cases, you have likely heard of the “replication crisis” or, as I have recently heard it called on a podcast dedicated almost entirely to the topic, the “transparency revolution” (episode 57). The second is my favorite by far, for reasons that will be clear by the end of this post.

Data safe from hackers, but certainly not "open access": punch-card storage from the 1950s

If you haven’t heard of it, this post will hopefully still make sense, but if you would like to read up on roughly what is going on, I recommend these two summaries of the situation. A very brief description of the situation (and all you need to know to understand this post) is that we, as a scientific community, realized that the way we do science, i.e. how we hypothesize, set up experiments, test participants, analyze data and publish results, often did not lead to reliable, robust results (Open Science Collaboration, 2015; Klein et al., 2017). In the long term, this is not very productive for either our scientific field or those who pay for our research (most likely you, the taxpayer). If by now you are worried about your favorite psychological effect, you can check whether it still holds up here.
As the problem cannot be tracked down to one single culprit, neither a specific person nor one single step of the scientific process, there are many different attempts to continually improve the way we conduct our work. Calls to address publication bias, improve theoretical reasoning (link to pdf), pre-register studies to avoid digging for results and justifying them only after they were found (known as p-hacking and HARKing), recruit more participants per experiment and run more replications, use better or entirely new statistical methods, adopt new tools designed for a transparent research process like the Open Science Framework… the list goes on.
One of the puzzle pieces championed by scientists who want to see better, more reliable research practices is open access to anonymised experimental and survey data. This practice is called “open data”, and yes, you can get a badge for it:
Open access badge, so every reader knows instantly that data for this study is freely accessible

There are several reasons for providing open access to our data: others can check whether I did my analysis correctly, re-analyze the data, combine it with other data collected for similar purposes to conduct a meta-analysis, or investigate new questions.

The advantage of providing the data directly on a public data-sharing site is simple: it is accessible for anyone, forever, without having to pass through me, the researcher. To illustrate why this is important: I recently tried to track down a questionnaire, translated for several publications within the last 10 years. One scientist was retired, one had unfortunately passed away, one never had the questionnaire themselves despite being the corresponding author of the paper, and the co-author had left academia. This is without considering those who never responded, and the multiple broken email addresses. In short, having data (or any material used in a study, for that matter) accessible independently of a human can be extremely useful. Luckily, there is hope, and more and more effort is going into improving the situation.
And here, I need to stop and make two clear statements:
  1. I am - very much - in favor of open data.
  2. I am also German.
No big surprise at point one. But you might wonder about the second one. Germans are notoriously concerned about their data protection. Yes, we post everything you never wanted to know on social media, but not in the way citizens of other (democratic) countries do. I know exactly one German who uses their real name on Facebook, and a recent court ruling supports us in this behaviour. Nobody I know spells out their kids’ names or posts pictures of their (recognizable) child. And yes, most people know how to turn off location services. Don’t trust my anecdata? The German love for privacy goes so far that when Google Street View was first introduced in Germany, we had the possibility to opt out and have our house or apartment building blurred. Major social media companies (yes, even Facebook/WhatsApp; a detailed account of the clear-name ruling is here) have had to adapt their privacy policies to the German market, and Germany is one of the few countries actually prepared for the new General Data Protection Regulation (GDPR) of the EU (to my delight, GDPR translates into one single word in German: “Datenschutzgrundverordnung”).
All this to say: I value data protection very, very highly.
And therefore, I have conflicted feelings about open data. Yes, we can take out birth dates, names, email and IP addresses before we upload a data set to the platform of choice. But no, this by no means makes the data truly anonymous. Could somebody figure out who you are without knowing you personally? Likely not. But what about people who do know you, know you participated in a particular study, and now have access to the data set? If this sounds far-fetched, consider the case of the Netflix Prize dataset (released back in 2006; newer data is, to my knowledge, not publicly available): computer scientists (Narayanan & Shmatikov, 2006) managed to reverse-engineer the identity of individuals in the data set with astonishing accuracy by combining different sources of information, in this case the Netflix data set and movie ratings posted publicly elsewhere online. The only additional information needed to identify someone’s Netflix rental history (yes, at the time, the movies were still sent as physical copies via mail) is a few movies watched in a two-week time span (Narayanan & Shmatikov, 2006). If you think finding out personal information like this is difficult, I dare you not to talk about the last series you binge-watched at any social gathering! With the increase of different data sets available, each storing slightly different information about us, re-identification can get much easier.
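The linkage idea behind this kind of re-identification can be sketched in a few lines of Python. This is a toy illustration, not the method used by Narayanan & Shmatikov: the records, the name "Alex", the helper functions and the matching rule (same title, dates at most three days apart) are all made-up assumptions chosen for simplicity.

```python
from datetime import date

# Hypothetical "anonymised" release: IDs instead of names.
anonymised_release = {
    "id_017": [("Movie A", date(2005, 3, 1)),
               ("Movie B", date(2005, 3, 4)),
               ("Movie C", date(2005, 3, 9))],
    "id_042": [("Movie A", date(2005, 6, 20)),
               ("Movie D", date(2005, 6, 22))],
    "id_108": [("Movie B", date(2005, 3, 5)),
               ("Movie D", date(2005, 3, 6))],
}

# Hypothetical outside knowledge, e.g. what "Alex" mentioned over coffee.
known_about_alex = [("Movie A", date(2005, 3, 2)),
                    ("Movie C", date(2005, 3, 10))]


def fits(known_item, record, tolerance_days):
    """True if the record contains the same title within the date tolerance."""
    title, when = known_item
    return any(t == title and abs((d - when).days) <= tolerance_days
               for t, d in record)


def candidate_ids(known, release, tolerance_days=3):
    """Return every ID whose record is compatible with all known items."""
    return [pid for pid, record in release.items()
            if all(fits(item, record, tolerance_days) for item in known)]


candidates = candidate_ids(known_about_alex, anonymised_release)
print(candidates)  # ['id_017']
```

Two casually shared facts are enough to narrow three records down to one in this toy example. Every additional variable left in a shared data set gives such a filter more to work with.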
While this might not be troubling right now, contrary to some “evidence”*, we cannot predict the future. We do not know whether a person who seems trustworthy now will still be so a few years from now. As I wrote above, you might reveal information that helps to re-identify your data simply by making small talk, not knowing that the information you volunteer can be instrumental in reducing your data privacy.
One fictional example for the case of research: imagine you contribute to research by answering online questionnaires. How do we, the researchers, know that leaving the date and time of your participation in the data set won’t cause problems for you? Maybe you were doing the study during work hours, and a suspicious, tech-savvy superior took note and compared the time with a few open access data sets published shortly after.
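One way to reduce this particular risk is to coarsen timestamps before sharing. The following is a minimal Python sketch, assuming the raw export stores ISO-formatted timestamps; the function name and the choice of monthly granularity are illustrative assumptions, not a recommendation for any particular study.

```python
from datetime import datetime


def coarsen_timestamp(iso_timestamp: str) -> str:
    """Keep only the year and month of an ISO 8601 timestamp."""
    moment = datetime.fromisoformat(iso_timestamp)
    return moment.strftime("%Y-%m")


# Made-up participation times as they might appear in a raw export.
raw = ["2018-04-13T10:42:07", "2018-04-13T15:03:51", "2018-05-02T09:15:00"]
shared = [coarsen_timestamp(t) for t in raw]
print(shared)  # ['2018-04', '2018-04', '2018-05']
```

If the time of day matters for the analysis (for example, morning versus afternoon sessions), a coarse category could be kept instead of the exact time.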
More generally, we do not know whether currently harmless data will still be benign in the future. Ten years ago, I did not expect that my online shopping behavior would have the potential to predict my mental health status (Katikalapudi, Chellappan, Montgomery, Wunsch, & Lutzen, 2012, link to pdf). I could not imagine that the use of “absolutist language”, or writing in black-and-white terms online, might earn me a “diagnosis” of depression, regardless of the actual content of my writing (Al-Mosaiwi & Johnstone, 2018). So who knows what we can do tomorrow with data collected today.
As technology changes, so do the means necessary to protect data. This is a German key card for punch-cards, the precursor of today's computer storage systems

All these cases presume legal access to, and use of, data. What is possible using fraud, dishonesty (just think “Cambridge Analytica”) or data obtained through hacking** is an entirely different story.
So what can we do?
For the scientists, our first question should not be “how can I protect the data”, but “do I need (all) the data”. Much like refining the theoretical basis of an experiment before starting the data collection, we need to think about which questions are included “because we always ask them”, or because they might make for a fun exploratory research question, and whether we could do without those. Added bonus: this means less work cleaning data, and fewer opportunities for messing up the statistics. If we ask a question that serves to exclude participants later on, it can also be the end-point of the study for those participants who won’t be included. This protects their privacy AND saves both parties time.
For sensitive data necessary for the research project, such as questions on alcohol and drug consumption (sometimes needed in cognitive psychology) or sexual preferences (useful in relationship research), we can assess the possibility of storing this data in a separate file that we won’t include in the open access data set, even if anonymized. The Datenschutzgrundverordnung (GDPR) mentioned above also applies (with adaptations) to researchers, and provides clear definitions of what constitutes anonymous data. One of my supervisors conveniently mentioned to me his collaborative pre-print (a not-yet-officially published scientific article) on transparency in scientific research, including pointers on how to prepare a data set so it is both compliant with the regulation and accessible for others. Truly anonymous data does not fall under the GDPR (but see above how difficult this can be). You can find more information from the UCL legal service here (link to pdf). Sometimes, however, a data set might simply not be suitable for unrestricted access. But even then, with some restrictions, it is possible to simulate a data set with the same characteristics as the study data, which can then be shared.***
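Keeping sensitive answers out of the shared file can be as simple as splitting the columns before upload. Here is a minimal Python sketch using only built-in data structures; the column names, the example rows and the `split_rows` helper are hypothetical, and a real project would also need to decide where and how the restricted part is stored.

```python
# Hypothetical names of the columns that should not be shared openly.
SENSITIVE = {"alcohol_units_per_week", "drug_use"}

# Made-up rows as they might come out of the survey software.
rows = [
    {"participant": "p01", "reaction_time_ms": 512,
     "alcohol_units_per_week": 4, "drug_use": "no"},
    {"participant": "p02", "reaction_time_ms": 478,
     "alcohol_units_per_week": 0, "drug_use": "no"},
]


def split_rows(rows, sensitive, key="participant"):
    """Split each row into an open part and a restricted part.

    The restricted part keeps the key column so that the research team
    can still link the two files internally.
    """
    open_part, restricted_part = [], []
    for row in rows:
        open_part.append({k: v for k, v in row.items() if k not in sensitive})
        restricted_part.append(
            {k: v for k, v in row.items() if k == key or k in sensitive})
    return open_part, restricted_part


open_part, restricted_part = split_rows(rows, SENSITIVE)
print(open_part[0])        # no sensitive columns left
print(restricted_part[0])  # key plus sensitive columns only
```

Only `open_part` would go into the open access data set; `restricted_part` stays with the research team.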
Finally, we need to be transparent with you, the participant and tax-paying financer of our research: you have a right to know who stands behind this study, why we ask these questions, what we intend to do with the data, and how we will protect your privacy. This way, you can make an informed decision about your participation in the study.
In the end, this is a matter of trust. We trust you, the participant, to honestly answer our questions. And our work is only possible as long as you trust us, the scientists, to be responsible with the information you provided. The transparency revolution is not just for scientists, but also for you: Open science allows you to hold us accountable, to understand what we are doing and why, and how your contribution is used, now and in the future.

* incidentally, this study is a good case for open science, and one of the root causes of the transparency revolution. 
** fun fact: there is a search engine designed to detect vulnerable devices connected to the internet. It is called “Shodan”. And yes, this is legal, because of course, it will only be used by good guys trying to expose security weaknesses. 
*** One example of how this can be done is the R package simPop. It is intended to be used on large data sets, but from my (admittedly limited) point of view, it should work with our usual sample sizes as well. Unfortunately, this limits the use of the data to cases replicating the same or very similar statistical analyses (e.g., running the same model with fewer covariates) as the original authors, especially in smaller data sets. It would, however, give other researchers the possibility to check whether a method has been applied correctly, or whether leaving out some variables would drastically change results.
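To make the idea of a simulated data set concrete, here is a deliberately simple Python sketch. It is not what simPop does: it only draws two variables from a bivariate normal distribution that matches the means, standard deviations and correlation of the original values. The variable names and numbers are made up for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den


def simulate_pairs(xs, ys, n, seed=0):
    """Draw n synthetic (x, y) pairs from a bivariate normal distribution
    with the same means, standard deviations and correlation as the input."""
    rng = random.Random(seed)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    r = pearson(xs, ys)
    residual = math.sqrt(max(0.0, 1 - r * r))
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((mx + sx * z1, my + sy * (r * z1 + residual * z2)))
    return pairs


# Made-up "original" study data.
age = [19, 22, 25, 31, 38, 44, 52, 60]
score = [12, 15, 14, 20, 22, 27, 25, 33]

synthetic = simulate_pairs(age, score, n=5000)
syn_age = [a for a, _ in synthetic]
syn_score = [s for _, s in synthetic]
print(round(pearson(age, score), 2), round(pearson(syn_age, syn_score), 2))
```

No synthetic row corresponds to a real participant, but anything the model does not capture (here, everything beyond means, spreads and one linear correlation) is lost, which is the limitation described in the footnote above.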

Julia Eberlen: I am a PhD student at the CeSCuP, interested in stereotype learning, especially in the social context typically known as "networks". If you use Twitter, you can find me under @JulCharlotte, where I tweet mostly about psychology and occasionally about cycling in Brussels, knitting and (currently) astronauts.


  • Al-Mosaiwi, M., & Johnstone, T. (2018). In an Absolute State: Elevated Use of Absolutist Words Is a Marker Specific to Anxiety, Depression, and Suicidal Ideation. Clinical Psychological Science, 2167702617747074.
  • Katikalapudi, R., Chellappan, S., Montgomery, F., Wunsch, D., & Lutzen, K. (2012). Associating Internet Usage with Depressive Behavior Among College Students. IEEE Technology and Society Magazine, 31(4), 73–80. 
  • Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., … Frank, M. C. (2018, March 25). A practical guide for transparency in psychological science.
  • Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2017, October 19). Investigating Variation in Replicability: A “Many Labs” Replication Project.
  • Narayanan, A., & Shmatikov, V. (2006). How To Break Anonymity of the Netflix Prize Dataset. ArXiv:Cs/0610105.
  • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Image sources:   

Open data badge: David Mellor
Punch card storage: By NARA, Public Domain
German punch card key: By LvZJaY -,_Schlitz-Hessen.jpeg



  1. Thanks for a wonderful post. Just one, basic question - when we share open data with a research project, would we commonly share all of the data for the project, or just the data associated with the particular analysis?

  2. Thank you for your question!
    For me, this depends on your reasons for sharing the data: Do you want to make it possible for others to replicate your analysis? Then the variables you used for the analysis should be sufficient. I would recommend including all participants (even those you excluded) so it becomes clear whom you excluded, when, and why. The easiest way to do this is to share your analysis code alongside the data set; R notebooks, for example, are a great way to do this without having to create several different files and documents! However, you have to be aware that this approach might be problematic if somebody wants to verify that you didn't exclude a variable because it didn't fit your hypothesis, so be prepared to include a section on why you didn't include specific variables in the data set.
    If you want to avoid this situation, or enable others to use your data for their own (scientific) projects, for example to validate a computational model, or test a hypothesis before collecting new, similar data, then you need to include all variables. Here, it might be more difficult to render the data set truly anonymous, first, because there are simply more variables to go by for re-identification, and second, because you might not have thought about potential identification risks as carefully.
    The third option (and, in my opinion, the best) is to think about sharing the data before starting data collection. This way, you can collect the data in the way you want to share it, no modification needed! Unfortunately, this is not always possible: if you want to prevent participants from taking part in your study twice, for example, or if you want to make it possible to return to a questionnaire later in the day, your software will very likely need to collect identifying information like IP addresses. I am hopeful that there are workarounds for this, depending on the tools you use, but it requires additional work and planning. In return, it makes sharing later on much easier, safer for your participants and very transparent!
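    One possible workaround can be sketched in Python, assuming that repeat participation only needs to be detected within a single study: store a keyed hash of the identifier instead of the identifier itself, and delete the key once data collection ends. The function and variable names are hypothetical, and the addresses come from a range reserved for documentation. A plain, unkeyed hash would not be enough here, because the space of possible IP addresses is small enough to try them all.

```python
import hashlib
import hmac
import secrets

# Random secret for this study only; to be deleted after data collection.
study_key = secrets.token_bytes(32)


def pseudonym(identifier: str, key: bytes) -> str:
    """Return a keyed SHA-256 hash (HMAC) of the identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def is_repeat(identifier: str, key: bytes, seen: set) -> bool:
    """Record the pseudonym and report whether it was seen before."""
    token = pseudonym(identifier, key)
    if token in seen:
        return True
    seen.add(token)
    return False


seen_tokens = set()
first = is_repeat("192.0.2.10", study_key, seen_tokens)   # new participant
second = is_repeat("192.0.2.10", study_key, seen_tokens)  # same address again
other = is_repeat("192.0.2.77", study_key, seen_tokens)   # different address
print(first, second, other)  # False True False
```

    As long as the key exists, this is pseudonymisation rather than anonymisation, so the stored tokens should still be treated as personal data and left out of the shared data set.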
