OKF and iCommons comment on proposed UK exception for information mining
Comments responding to the UK Intellectual Property Office consultation “Consultation on proposals to change the UK’s copyright system” submitted by iCommons in collaboration with the Open Knowledge Foundation’s Open Data in Science Working Group on 21 March 2012.
Response of the Open Knowledge Foundation’s Open Data in Science Working Group and of iCommons Ltd to BIS0312: Exception for copying of works for use by text and data analytics.
The Open Data in Science Working Group at the Open Knowledge Foundation and iCommons Ltd strongly support the Government’s response to the Hargreaves Review of Intellectual Property and Growth and encourage it to follow through on its many excellent proposals. As scientists and scholars, we are both creators and users of intellectual property. Our creations, however, only have their full value when they are shared with other researchers. Our data becomes exponentially more useful when combined with the data of others. The intention of copyright law is to support public dissemination and enable the appropriate and effective recombination of work. Unfortunately, in the area of science, current copyright actually delays or blocks the effective re-use of research results in the current digital environment. We encourage adaptation of the law to benefit the progress of science and its attendant economic advantages. In our professional experience we have found that the ability to freely use research data benefits science as well as creators of scientific work products.
In particular, we strongly urge implementation of Recommendation 5 to allow specific exceptions in copyright law for data and text mining.
Information mining is the way that modern technology locates digital information. The sheer number of publications and data sets means that thorough and accurate searching can no longer be done by the hand and eye of an individual researcher. Not only is it a deductive tool to analyze research data, it is how search engines operate to allow discovery of content. To prevent mining is therefore to force UK scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there. Because digitized scientific information comes from hundreds of thousands of different sources in today’s globally connected scientific community, and because current data sets can be measured in terabytes, it is no longer possible to simply read a scholarly summary in order to make scientifically significant use of such information.
Hence, one must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. It would require extraordinarily time-consuming efforts to secure permission to mine each and every relevant article from hundreds, even thousands, of sources. A recent report by the Joint Information Systems Committee (JISC) on the issues raised in this consultation demonstrates the value of time lost under the current system. By their estimates, a single researcher obtaining permission to mine PubMedCentral articles mentioning malaria could lose over 60% of their working year contacting the 1024 journals necessary to obtain access to the complete corpus of literature. A blanket exception is the only way to ensure that UK scientists can truly stay abreast of scientific progress.
Therefore, we agree with the Government’s opinion that it is inappropriate for “certain activities of public benefit such as medical research obtained through text mining to be in effect subject to veto by the owners of copyrights in the reports of such research, where access to the reports was obtained lawfully.” Restricting such transformative use is not in the UK’s overall scientific, much less economic interests. It is also not in the interest of Europe as a whole. The Ghent Declaration states “European researchers, while remaining in touch with the whole world, could also benefit from tools that would allow for better collaboration within Europe. In particular, such goals can be assisted by computers using techniques ranging from data mining to the semantic web. However, machine-based inference techniques work well only if documents are freely accessible…”.
We also applaud the recent draft policy statement from the UK Research Councils asserting their support for all research they fund becoming open access and their emphasis on a CC-BY or similar license which allows “unrestricted use of manual and automated text and data mining tools.” While this will lead to positive developments going forward, access to the pre-existing literature is equally vital.
Response to Specific Consultation Questions
77. Would an exception for text and data mining that is limited to non-commercial research be capable of delivering the intended benefits? Can you provide evidence of the costs and benefits of this measure? Are there any alternative solutions that could support the growth of text and data mining technologies and access to them?
A non-commercial limitation is not helpful and would reduce the benefits delivered to the economy. Researchers in both academia and industry are often reliant on the same information, e.g. libraries of chemical structures. Therefore, to restrict access to mined text/data to non-commercial users would result in duplicated time, effort and expense to obtain the information. Reducing dissemination to SMEs and other commercial organisations that could use it to generate useful and value-added products with economic returns would appear counterproductive. More details on the economic value of chemical information can be found in the submission to this consultation from Dr. Peter Murray-Rust.
In addition to the potential loss of both scientific opportunity and economic returns, implementing a non-commercial clause is non-trivial. There is no clear local or global definition of “non-commercial use”, particularly when one recognizes that scientific data is merged across many jurisdictional boundaries. If the term is to be employed, it needs to be clearly defined, but this is difficult given that the full range of non-commercial opportunities cannot be foreseen. The difficulties of defining non-commercial were laid out in a detailed report by Creative Commons and are a topic of ongoing discussion as they review their licenses, which are among the most commonly used for open access scientific literature.
Even if clearly and strictly defined, non-commercial clauses could lead to data sharing problems within collaborations between academia and industry. Certainly, any prohibition against commercial use would hinder the widest possible sharing of data. It would probably also increase the cost to consumers of products based on such data and reduce the potential economic return by discouraging commercial organisations from creating value added products based on the results of text and data mining.
Disallowing downstream uses also complicates the publication and licensing of results from information mining. Having to lock or watermark the initial results of non-commercial mining in order to prevent their later commercial use would quickly prove cumbersome and problematic. Would each piece of mined information need to be labeled as restricted? Or only the fully compiled set of results from a mining operation? Would researchers have to track all subsequent uses and re-uses of mining results simply to protect themselves from the threat of infringement litigation? By reducing interoperability and complicating the licensing situation, such restrictions would significantly increase the labour and associated costs of information mining.
For these reasons, this group does not favor restricting the exception to non-commercial uses, but rather supports an exception for all mining purposes. At a minimum, any subscriber to closed access journals regardless of their commercial status should be able to mine the information for which they have paid subscription fees.
Evidence of costs and benefits
The recent JISC study, The Value and Benefits of Text Mining, includes many examples of the economic benefits and costs of text mining. Rather than repeat their excellent work, we wish to add the observations of one member of our working group, Dr. Peter Murray-Rust of the University of Cambridge, who has been working in text- and data-mining for 30 years. During that time the relevant technology has developed dramatically to the point where he can, for example, extract 10 million chemical reactions from the published text literature. However, after two years of negotiations with a major publisher, the Murray-Rust research group was told that it could mine the publisher’s corpus only if all of the results belonged to the publisher and were not published.
Similarly, Murray-Rust developed methods for crystallographic data mining which have become accepted in the crystallographic community. Some such data is published openly alongside conventional articles so a dataset of 250,000 compounds has already been processed. The results of this analysis can be highly valuable for drug design but extension of the dataset is hampered by restrictive conditions of reuse. For example, other data are deposited with a non-profit organisation that makes individual data sets freely available to the scientific community, but under terms that may limit re-use of the mined data. In terms of text mining, factual data embedded in text is a grey area. For example, the factual statement “the melting point of X is 30 degrees” is usually hidden behind a pay wall even if related data is available freely in supplementary files.
Further information is available in Peter Murray-Rust’s personal submission to this consultation.
Although not an alternative, one step that might encourage data owners to permit information mining would be to provide a type of blanket immunity that would protect them from claims based on a user’s reliance on data that later proves to be inaccurate or fraudulent. A major cost and disincentive to data providers who allow public access to data may thereby be reduced whether those providers are individual researchers, their employers or journals.
However, to the extent that journals wish to retain copyright control over information mining in order to protect their own opportunities to extract revenue from such use, this immunity would not be enough to relieve researchers of the burden of securing permission from individual publishers. It would therefore function only as an adjunct to copyright exceptions, addressing one of publishers’ concerns about a requirement to allow mining and reuse.
103. What are the advantages and disadvantages of allowing copyright exceptions to be overridden by contracts? Can you provide evidence of the costs or benefits of introducing a contract-override clause of the type described above?
To permit contractual override of an information mining exception is to risk voiding the exception. Text and data mining in academia are usually prohibited by contract at present, on top of any copyright. In some cases automatic systems will shut down access to institutions if they believe that the contract has been violated (including on occasions when it has not). A survey of licence agreements with institutions across 11 major publishers revealed that 7 out of 11 explicitly ban text and data mining or automated indexing by web crawlers. A more comprehensive survey of licenses and contractual agreements reveals a similar pattern across a wide array of publishers. Some of these organisations provide proprietary systems, often charged for, which enable some limited text-mining functionality. There is no reason to expect that such policies would change voluntarily.
Unfortunately, the majority of publishers when contacted directly with text mining requests have been “extremely unhelpful” if not unresponsive. Hence, a blanket exception is the only way to assure access and re-use.
We respectfully ask that you take these comments into consideration in making your recommendations.
For OKF Open Data in Science Working Group:
Jenny Molloy, Coordinator
For iCommons Ltd:
UK Co. Reg. No. 5398065 UK Charity Reg. No. 1111577 · Registered Office: Churchill Court, 36 Merrivale Square, Oxford OX2 6QX UK
Government’s response to Hargreaves report at http://www.bis.gov.uk/assets/biscore/innovation/docs/g/11-1199-government-response-to-hargreaves-review.
 Hargreaves report at http://www.ipo.gov.uk/ipreview.htm
 The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012, Section 3.3.8 at http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx, citing P.J.Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study,” Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person-years to perform manually.
 See MEDLINE® Citation Counts by Year of Publication, at http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5 at http://www.nsf.gov/statistics/seind10/c5/c5h.htm asserting that the annual growth in the volume of scientific journal articles published is on the order of 2.5%.
 Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 center, CERN-LCG-PEB-2004-21, 09 June 2004, at http://lcg.web.cern.ch/lcg/planning/phase2_resources/SizingandcostingoftheCERNT0center.pdf.
 Van Noorden, Richard; Trouble at the text mine, Nature, V. 483, Issue 7388, 08 March 2012. See also JISC, op.cit., Section 4.2 “Consequently, in this example, a researcher would need to contact 1,024 journals at a transaction cost (in terms of time spent) of £18,630; 62.1% of a working year.”
 Ghent Declaration, February 2011, OpenAIRE, at http://www.openaire.eu/en/component/content/article/223-seizing-the-opportunity-for-open-access-to-european-research-ghent-declaration-published.
 RCUK Proposed Policy on Access to Research Outputs, RCUK, March 2012 at http://www.openscholarship.org/upload/docs/application/pdf/2012-03/rcuk_proposed_policy_on_access_to_research_outputs.pdf.
 Creative Commons, Defining “Noncommercial” A Study of How the Online Population Understands “Noncommercial Use,” September 2009 at http://wiki.creativecommons.org/Defining_Noncommercial.
 Dr Murray-Rust, Peter; ‘Information mining and Hargreaves: I set out the absolute rights for readers. Non-negotiable’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/04/information-mining-and-hargreaves-i-set-out-the-absolute-rights-for-readers-non-negotiable/.
 JISC, op.cit., Chapters 4 and 5.
 Creative Commons, ‘Journal Licensing Agreements’, Spreadsheet of Results, 8 March at https://docs.google.com/spreadsheet/ccc?key=0AtV3tIqIu0UZdGVMNTAtejhBUlFySGk4QWdrVHJNdkE&authkey=CKC-_LQP&hl=en_US#gid=0. [Available on request.]
 Max Haeussler, CBSE, UC Santa Cruz, 2012, tracking data titled Current coverage of Pubmed, Requests for permission sent to publishers, at http://text.soe.ucsc.edu/progress.html .
 Murray-Rust, Peter; cited in ‘Journal Article Mining’, Publishing Research Consortium, Amsterdam, May 2011 at http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf. See also, Murray-Rust, Peter; ‘Wiley: Cambridge scientist require to text-mine content in Wiley journals: please switch off the lawyers and the robots’ at http://blogs.ch.cam.ac.uk/pmr/2012/03/07/wiley-cambridge-scientist-require-to-text-mine-content-in-wiley-journals-please-switch-off-the-lawyers-and-the-robots/.