The environment around research data management and open data has become increasingly complex, and its evolution shows no sign of slowing down. At the core of many of today’s challenges are machine learning, natural language processing (NLP), and predictive analytics: methods for processing very large quantities of data for a wide range of purposes.
Every day, the news is full of stories about companies and government agencies mining massive, internally collected datasets for all sorts of purposes. Technology is making it easier for organizations to respond proactively to patterns in their data. For example, early alert systems now let universities identify students who might be on the cusp of dropping out, early enough for an advisor to intervene. Companies want to mine their customer data to increase profitability and their inventory data to forecast product demand in a timely manner.
But these methods aren’t restricted to closed data. In fact, one of the advantages of open data is that it allows disparate datasets to be combined: re-mixing or merging many “small data” sets turns them into “big data.” From the perspective of funding agencies, this type of re-use is one of the intended benefits of open data. If datasets use common variables, include well-structured and organized data elements, are deposited in interoperable repositories that harvesters can find, and carry a Creative Commons Attribution (CC-BY) license (or a similar license that allows re-use), other researchers are encouraged to find, access, and re-use them without restrictions.
Re-use without restrictions is what sparks fear in many researchers. Once your data has been published and is out in the world, you lose control over how it is used. It can be used, combined, and repurposed in all sorts of ways, including ways you never considered, ways that could put someone else in harm’s way, and ways that serve morally ambiguous purposes.
Although we’re proponents of open data, it’s useful to know about some incidents where open data has led to problems.