We might be accidentally creating a “black hole” in history unless we rethink how to archive the web, one internet pioneer has warned.
Vint Cerf, who is recognised for ‘co-founding’ the Internet with Robert Kahn, said “we stand to lose an awful lot of our history” if big changes are not implemented.
Cerf is now a vice-president at Google, and won the AM Turing Award in 2004 for his “pioneering work on internetworking, including the design and implementation of the Internet’s basic communications protocols”.
But he now says that, despite many projects to archive online data for future generations, we might end up unable to use any of it.
Cerf told the American Association for the Advancement of Science that “if we want people in the future to be able to recreate what we are doing now, we are going to have to build the concept of preservation into the internet”.
Cerf says that a “digital vellum” must be developed which can maintain the state of hardware and software, as well as raw data, so that the web as it appears today can be experienced in decades to come.
“When you think about the quantity of documentation from our daily lives that is captured in digital form, like our interactions by email, people’s tweets, and all of the world wide web, it’s clear that we stand to lose an awful lot of our history,” he said, according to the Guardian.
“We don’t want our digital lives to fade away. If we want to preserve them, we need to make sure that the digital objects we create today can still be rendered far into the future.”
The size of the internet is difficult to estimate, but one commonly cited, if dubious, statistic says that Google’s index accounts for about 200 terabytes of data, or 0.004% of the internet.
Cerf says that archive is too valuable to future historians for us to simply abandon:
“Let us imagine that there’s a 22nd-century Doris Kearns Goodwin and she decides to write about the beginning of the 21st century and seeks to reproduce the conversations of the time. She discovers that there’s an awful lot of digital content that either has evaporated because nobody saved it, or it’s around but it’s not interpretable because it was created by software that’s 100 years old.”
Cerf is helping to create a system designed to address that problem at Carnegie Mellon University, with the help of IBM. The system (‘Olive’, or Open Library of Images for Virtualised Execution) can capture and manage slices of the internet, including all kinds of data along with technical information about the systems that used it.
“Data storage is getting so cheap that I don’t worry about [storing data],” he said at the conference, according to the FT.
“I worry about how to find something in it.”