Dr Matthew Addis, Co-founder and CTO of Arkivum, discusses the company's most recent Horizon 2020 project, ARCHIVER, and its role in securing and protecting valuable scientific data.
Last month a multinational scientific buyer group led by CERN – home of the Large Hadron Collider – launched the design phase of ARCHIVER. This three-year, €4.8m project, funded by the European Union’s Horizon 2020 research and innovation programme, will radically improve the archiving and digital preservation of petabyte-scale, data-intensive research. Covering the full research lifecycle, ARCHIVER addresses issues such as extreme data scaling, network connectivity, service interoperability, and business models, all in a hybrid cloud environment. The multidisciplinary project will produce a range of archiving solutions, to be made available through the European Open Science Cloud (EOSC) and beyond.
The European Open Science Cloud is conceived as a trusted, virtual, federated environment to store, share and re-use research data across borders and scientific disciplines. As an interoperable data commons that ensures data are Findable, Accessible, Interoperable and Reusable (FAIR), it will enable researchers to find, share, and reuse publications, data, and software. The EOSC’s aims are to engender new insights and innovations, higher productivity in research, and improved reproducibility in science. Ultimately, by giving scientists and other experts ready access to reliable data, its declared intention is to benefit mankind.
Tackling large volumes of scientific data
“The vast volumes of data that are now being produced around the world, in so many areas of endeavour, hold major long-term opportunities for producing substantial economic, scientific and societal benefits,” says Dr Matthew Addis, Co-founder and CTO of Arkivum, a UK-headquartered company spun out of the University of Southampton some ten years ago. Arkivum specialises in the long-term archiving and digital preservation of high-value digital content for major institutions and commercial organisations in a diversity of sectors. In partnership with Google Cloud, the company has been selected to participate in the initial, four-month design phase of the ARCHIVER project.
“ARCHIVER recognises that, in order to address the specific complex data requirements of many scientific disciplines, commercial services for digital preservation now need to be reliably and certifiably scaled to the petabyte level and beyond,” continues Addis. “As long ago as 2013, a McKinsey report found that open data could help create $3trn a year of value across seven areas of the global economy. Clearly, long-term access to reliable data that has been properly stewarded and preserved is essential for an undertaking like the EOSC. It highlights the increasing need for new Trusted Digital Repository solutions.”
As an article in Nature recently affirmed: ‘To make data FAIR whilst preserving them over time requires trustworthy digital repositories (TDRs) with sustainable governance and organisational frameworks, reliable infrastructure, and comprehensive policies supporting community-agreed practices.’ Those values are embodied in the TRUST principles, namely that stewards of research data should consistently consider Transparency, Responsibility, User Focus, Sustainability and Technology.
Increasing the capacity of researchers
By making research data open and accessible, projects such as the EOSC and ARCHIVER stand to accelerate discovery and the development of new products and services, and so increase research’s capacity to address societal challenges. These principles have achieved dramatic prominence in the response to the current COVID-19 pandemic and in the efforts of the international community to share data rapidly and in an open and trustworthy way. The Research Data Alliance COVID-19 working group is at the forefront of establishing guidelines to ensure that open data on COVID-19 gives rise to maximum benefits today and in the future.
But these benefits can only be fully realised if research data is FAIR. FAIR encourages and supports high-quality research that follows good practice and produces results that are repeatable, verifiable, and re-usable. Only under these circumstances can research data be used with confidence and exploited to its full potential. “Crucially, FAIR data needs to be made reliably available for the long term, often many decades,” emphasises Matthew Addis. “Only then will the value of open access be fully realised.”
Arkivum’s business is based on a fully managed service (SaaS) for the entire lifecycle of data, providing a flexible solution for the long-term safeguarding, digital preservation and access of digital content, while enabling client organisations to achieve and apply a range of standards and good practice, including OAIS, TDR, CTS and DPC RAM.
Arkivum’s role in ARCHIVER
For the ARCHIVER project, Arkivum’s stack of software services – which can be run on-premises or on cloud infrastructure – will be deployed and extended on the Google Cloud Platform. As content comes into the archive it will be ingested, validated, and organised. It will then go through appropriate preservation and safeguarding processes to ensure it is properly protected and remains usable. Finally, it will be made searchable, discoverable, and accessible for current and future users.
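The ingest-and-validate step described above typically rests on fixity checking: recording a cryptographic checksum when content enters the archive, then re-verifying it over time to detect corruption. The sketch below is an illustrative simplification, not Arkivum's actual service code; the function and index names are assumptions for the example.

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Compute a file's SHA-256 fixity checksum in streamed chunks,
    so even very large files never need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def ingest(path: Path, archive_index: dict) -> dict:
    """On ingest, record the file's name, size and checksum in a
    simple in-memory index (a stand-in for a real catalogue)."""
    record = {
        "name": path.name,
        "size": path.stat().st_size,
        "sha256": sha256sum(path),
    }
    archive_index[path.name] = record
    return record

def verify(path: Path, archive_index: dict) -> bool:
    """Later fixity check: does the stored file still match the
    checksum captured at ingest time?"""
    return sha256sum(path) == archive_index[path.name]["sha256"]
```

Running `verify` periodically against the ingest-time checksum is what lets an archive demonstrate, rather than merely assert, that content has remained intact.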
Importantly, Arkivum makes extensive use of open standards, open source, and open specifications. Facilitating interoperability, migration, portability and exit strategies, this open approach is particularly useful in long-term data archiving and preservation, since data is likely to outlive the specific systems and environments used to store and provide access to it. For the ARCHIVER project, further open specifications and standards will be integrated into Arkivum’s service to accommodate the data management needs of the scientific community.
In this context the Google Cloud Platform, which offers the rapid network connectivity required for vast volumes of data, will provide an ideal basis for long-term digital preservation. Its multiple storage options, already proven for multi-petabyte datasets, range from fast-access storage for frequently accessed data to deep-archive storage for infrequently accessed data; this favours optimisation that successfully balances different imperatives, such as access speed, storage cost, data safety and retention periods. The platform is also compatible with a multiplicity of software and scientific applications, while its existing infrastructure is geared to ensuring that data worthy of long-term digital preservation remains well managed, carefully monitored and secure, today and in the years to come.
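The optimisation trade-off described here — balancing storage cost against retrieval cost and access frequency — can be sketched as a simple cost model. The tier names and per-GB prices below are purely illustrative assumptions, not Google Cloud's actual storage classes or pricing:

```python
# Hypothetical tiers: monthly storage cost per GB and retrieval cost
# per GB read back. These figures are illustrative, not real pricing.
TIERS = {
    "fast-access":  {"storage_per_gb": 0.020, "retrieval_per_gb": 0.00},
    "nearline":     {"storage_per_gb": 0.010, "retrieval_per_gb": 0.01},
    "deep-archive": {"storage_per_gb": 0.001, "retrieval_per_gb": 0.05},
}

def monthly_cost(tier: str, size_gb: float, reads_per_month: float) -> float:
    """Expected monthly cost of keeping size_gb in a tier, given how
    many times the whole dataset is read back per month."""
    t = TIERS[tier]
    return size_gb * (t["storage_per_gb"] + reads_per_month * t["retrieval_per_gb"])

def cheapest_tier(size_gb: float, reads_per_month: float) -> str:
    """Pick the tier that minimises expected monthly cost."""
    return min(TIERS, key=lambda t: monthly_cost(t, size_gb, reads_per_month))
```

Under this toy model, data that is rarely read lands in the deep-archive tier, while frequently accessed data justifies the higher storage price of the fast tier — exactly the balance a long-term preservation service has to strike at petabyte scale.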
As Arkivum’s Matthew Addis says: “The ARCHIVER project will be crucial in laying the foundations for ensuring the longevity and long-term accessibility of some of the world’s most valuable scientific data.”
Dr Matthew Addis
Guest author
Co-founder and CTO
Arkivum