British Library To 'Harvest' The Web For Future Historians

The British Library will begin to preserve the digital age for future generations when new regulations come into force.

It aims to "harvest" the entire UK web domain to document current events and record the country's burgeoning collection of online cultural and intellectual works.

Billions of web pages, blogs and e-books will now be amassed along with the books, magazines and newspapers which have been stored for several centuries.

The library could eventually collect copies of every public Tweet or Facebook page in the British web domain.

Lucie Burgess, leading the project at the British Library, said the unprecedented operation would provide a complete snapshot of life in the 21st century, which increasingly plays out online.

"If you want a picture of what life is like today in the UK you have to look at the web," she said.

"We have already lost a lot of material, particularly around events such as the 7/7 London bombings or the 2008 financial crisis.

"That material has fallen into the digital black hole of the 21st century because we haven't been able to capture it.

"Most of that material has already been lost or taken down. The social media reaction has gone."

The operation to "capture the digital universe" will begin with an automatic "web harvest" of an initial 4.8 million websites - or one billion web pages - from the UK domain, she said.

This will start tomorrow and is expected to take three months.

It will then take another two months to process the data.

"We will have to distinguish between content published in the UK and elsewhere but in principle we will be able to archive the publicly available tweets of any individual, company or organisation," Ms Burgess said.

Until now the British Library could preserve only a relatively small number of websites.

The 2003 Legal Deposit Libraries Act paved the way for the information to be stored but copyright laws forced the library to seek permission each time it wanted to collect web content.

Under the new regulations - which extend to the Bodleian Library in Oxford, Cambridge University Library, the National Library of Scotland, the National Library of Wales and Trinity College Library in Dublin - it has the right to receive a copy of every UK electronic publication.

Roly Keating, chief executive of the British Library, said: "Ten years ago, there was a very real danger of a black hole opening up and swallowing our digital heritage, with millions of web pages, e-publications and other non-print items falling through the cracks of a system that was devised primarily to capture ink and paper.

"The Legal Deposit Libraries Act established in 2003 the principle that legal deposit needed to evolve to reflect the massive shift to digital forms of publishing.

"The regulations now coming into force make digital legal deposit a reality, and ensure that the Legal Deposit Libraries themselves are able to evolve - collecting, preserving and providing long-term access to the profusion of cultural and intellectual content appearing online or in other digital formats."

The British Library, which has invested £3 million in the project during the past two years, plans to collect the material by conducting an "annual trawl" of the UK web domain.

It will "harvest" information from another 200 sites - such as online newspapers or journals - on a more regular basis.

Access to the material, including archived websites, will be offered in reading rooms at each of the legal deposit libraries.

Culture Minister Ed Vaizey said: "Legal deposit arrangements remain vitally important. Preserving and maintaining a record of everything that has been published provides a priceless resource for the researchers of today and the future.

"So it's right that these long-standing arrangements have now been brought up to date for the 21st century, covering the UK's digital publications for the first time."
