Thursday 17 December 2009

More activity on semantic publishing

If you saw tweets from @cardcc today, you might realise I’ve been very interested in a couple of recent developments in semantic publishing. I wrote earlier about linking data to journal articles, including David Shotton’s adventures in semantic publishing. David’s work was one of those included in the review article in the Biochemical Journal by Attwood, Kell, McDermott et al. (2009). The article ranged over the place of ontologies and databases, science blogs, and various experiments. These included:

  • RSC and Project Prospect,
  • The ChemSpider Journal of Chemistry,
  • The FEBS Letters experiment,
  • PubMed Central and BioLit,
  • Shotton’s PLoS experiment,
  • Elsevier Grand Challenge,
  • Liquid Publications,
  • The semantic Biochemical Journal experiment.

The latter was the real focus of the article, which is available as a PDF but which, when read through a special reader called Utopia Documents, displayed some active capabilities. These included the ability to visualise and rotate 3-d images of proteins, to see tables represented as graphs (or vice versa), and to link to entries in nucleic acid databases. The capabilities were perhaps a bit awkward to spot and to manipulate, but still interesting. This article is (gold) open access. Other articles in the issue have also been instrumented in this way.

It’s clearly early days for Utopia, and I wasn’t wholly impressed with it as a PDF reader, but I was certainly very excited at some of what I read and saw.

I also read today a very different article (I think not available on open access), by Ruthensteiner and Hess (2008). They describe the process of making 3-d models of biological specimens and presenting them in PDF, readable by a standard Acrobat Reader. The 3-d capability was at least as good as, if not better than, the Utopia results.

Because it’s getting late, I’ll end with my last tweet:

“My head is spinning with semantic article possibilities. I hope some get picked up in new #jiscmrd proposals, see http://www.jisc.ac.uk/fundingopportunities/funding_calls/2009/12/1409researchdata.aspx”


Attwood, T. K., Kell, D. B., McDermott, P., Marsh, J., Pettifer, S. R., Thorne, D., et al. (2009). Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal, 424(3), 317–333. doi: 10.1042/BJ20091474.

Ruthensteiner, B., & Hess, M. (2008). Embedding 3D models of biological specimens in PDF publications. Microscopy Research and Technique, 71(11), 778–786. doi: 10.1002/jemt.20618. PubMed abstract.

Tuesday 15 December 2009

Linked Statistics & other data

Someone pointed me to the blog Jeni's Musings, written by Jeni Tennison. I don't know who Jeni is, but there's some really interesting stuff here, with some obvious links to UK Government Data activity. Among other things, there's a post about expressing statistics with RDF (it looks pretty horribly verbose, but it's the first attempt I've seen to address some real data that could be relevant to science research), a thoughtful post about the provenance of linked data, and a series of 5 posts (from here to here) on creating linked data.
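To give a flavour of what expressing a statistical observation in RDF involves, here is a minimal sketch using Python and rdflib. It is not Jeni's vocabulary (nor any standard one); every namespace, property and value below is invented for illustration, but it does show why the approach can feel verbose: even a single data point becomes a handful of triples.

```python
# A minimal, illustrative sketch of one statistical observation as RDF.
# All namespaces, property names and the value itself are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/stats/")

g = Graph()
g.bind("ex", EX)

obs = EX["obs-leeds-population-2008"]
g.add((obs, RDF.type, EX.Observation))                            # the data point itself
g.add((obs, EX.area, EX["area/leeds"]))                           # what it describes
g.add((obs, EX.period, Literal("2008", datatype=XSD.gYear)))      # when it applies
g.add((obs, EX.measure, EX.population))                           # what was measured
g.add((obs, EX.value, Literal(770800, datatype=XSD.integer)))     # the (made-up) value

print(g.serialize(format="turtle"))
```

Five triples for one number: verbose, but every part of the observation (area, period, measure, value) becomes linkable data in its own right.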

Wednesday 9 December 2009

Director, The Digital Curation Centre (DCC)

So, time to come fully out into the open, after various coy hints over the past week or so. I'm planning to retire around the start of DCC Phase 3. Adverts are starting to appear with the following text:
"We wish to appoint a new Director to take the DCC forward into an exciting third phase, from March 2010. You must be a persuasive advocate for better management of research data on a national and international scale. Able to listen to and engage with researchers and with research management, publishers and research funders, you will build a strong, shared vision of the changes needed, working with and through the community. You should have a sound knowledge of all aspects of digital curation and preservation, an understanding of higher education structures and processes, and appropriate management skills to be the guiding force in the DCC’s progress as an effective and enduring organisation with an international reputation.

"This post is fixed term for three years. [I would suggest: in the first instance...]
Closing date: Friday 15th January 2010."
This is a great job, and I think an important one. I have been bending the ears of many people in the last couple of weeks to ask them to think of appropriate people to point this advert at (yes, I know that's rotten English; it's been a long week).

Further details will be on the University of Edinburgh's jobs web site, http://www.jobs.ed.ac.uk/ (they weren't there when I checked a few minutes ago, maybe tomorrow).

Tuesday 8 December 2009

Leadership opportunities

Those interested in leadership in Digital and Data Curation should keep an eye on the relevant UK press and lists over the next week or so for anything of interest...

Last volume 4 issue of IJDC just published

On Monday this week, we published volume 4, issue 3 of IJDC. In one respect, this was a miracle of speed publishing, as 7 of the peer-reviewed articles had just been delivered the previous week as part of the International Digital Curation Conference. But we also included an independent article, plus 1 peer-reviewed paper and 3 articles with a rather longer gestation, originating in papers at iPres 2008! There are good and bad reasons for that over-lengthy delay.

I wrote in the editorial that I would reproduce part of it for this blog, to attract comment, so here that part is.
"But first, some comments on changes, now and in the near future, that are needed. One major change is that Richard Waller, our indefatigable Managing Editor, has decided to concentrate his energies on Ariadne. Richard has done a grand job for us over the past few years, in his supportive relationships with authors, his detailed and careful editing, and in commissioning general articles. To quote one author: “I note that the standard of Richard’s reviewing is much better than [a leading publisher's]; they let an article of mine through with very bad mistakes in the references without flagging them for review, and were not so careful about flagging where they had changed my text, not always for the better”. The success of IJDC is in no small way a result of Richard’s sterling efforts over the years. I am very grateful to him, and wish him well for the future: Ariadne authors are very lucky!
"Looking to the future of IJDC, we will have Shirley Keane as Production Editor, working with Bridget Robinson who provides a vital link to the International Digital Curation Conference, and several other members of the DCC community. We are seeking to work more closely with the Editorial Board in the commissioning role and to draw on the significant expertise of this group.

“In parallel, we have been reviewing how IJDC works, and are proposing some changes to enhance our business processes; I shall be writing to the Editorial Board shortly. For example, we expect to include articles in HTML as well as PDF format, to introduce changes to reduce publishing lead times, and possibly to add a new section with a particular practitioner orientation. As part of reducing publishing lead times, we are considering releasing articles once they have been edited after review, leading to a staggered issue which is “closed” once complete. I’m planning to repeat this part of the editorial in the Digital Curation Blog [here], perhaps with other suggestions, and comments [here] would be very welcome."
Oh, we then did a little unashamed puffery...
"We are, of course, very interested in who is reading IJDC, and the level of impact it is having on the community. In order to find out, Alex Ball from UKOLN/DCC has been trying several different approaches in order to get as full a picture as possible.
One approach we have used is to examine the server log for the IJDC website. The statistics for the period December 2008 to June 2009 show that around 100 people visit the site each day, resulting in about 3,000 papers and articles being downloaded each month. It was pleasing to discover we have a truly global readership; while it is true that a third of our readers are in the US and the UK, our content is being seen in around 140 countries worldwide, from Finland to Australia and from Argentina to Zimbabwe. As one would expect, we principally attract readers from universities and colleges, but we also receive visits from government departments, the armed forces and people browsing at home.

"The Journal is also having a noticeable impact on academic work. We have used Google Scholar to collect instances of journal papers, conference papers and reports citing the IJDC. In 2008, there were 44 citations to the 33 papers and articles published in the Journal in 2006 and 2007, excluding self-citations, giving an average of 1.33 citations per paper. Overall, three papers have citation counts in double figures. One of our papers (“Graduate Curriculum for Biological Information Specialists: A Key to Integration of Scale in Biology” by Palmer, Heidorn, Wright and Cragin, from Volume 2, Issue 2) has even been cited by a paper in Nature, which gives us hope that digital curation matters are coming to the attention of the academic mainstream."
OK, so we're not Nature! Nevertheless, we believe there is a valuable role for IJDC, and we'd like your help in making it better. Suggestions please...

(I made this plea at our conference, and someone approached me immediately to say our RSS feed was broken. It seems to work, at least from the title page. So if it still seems broken, please get in touch and explain how. Thanks)

IDCC 09 Delegate Interview: Melissa Cragin and Allen Renear

Melissa Cragin and Allen Renear [corrected; apologies] from the University of Illinois, who will be chairing IDCC next year in Chicago, give their reactions to IDCC 09 and a hint of their preparations for next year's event in this final, double-act, video interview....



[click here to view this video at Vimeo]

IDCC 09 Delegate Interview: Neil Grindley

JISC Programme Manager and session chair Neil Grindley gives us his response to IDCC 09 in this quick video interview as the event draws to a close...



[click here to view this video at Vimeo]

IDCC 09 Keynote: Prof. Ed Seidel

In a content-packed keynote talk, Ed Seidel set out to give us a preview of the types of project that are driving the National Science Foundation's need to think about data, along with the cyberinfrastructure questions, the policy questions and the cultural issues surrounding data that are deeply rooted in the scientific community.

To illustrate this initially, Seidel gave the example of some visualisation work on colliding black holes that he had conducted whilst working in Germany, with data collected in Illinois. To achieve this he had to do a lot of work on remote visualisation and high-performance networking – but moving the data by network to create the visualisations was not practical, so the team had to fly to Illinois to do the visualisations, then bring the data back. He also cited projects that are already expecting to generate an exabyte of data – vastly more than is currently being produced – so the problem of moving data is only going to get bigger.

Seidel looked first to the cultural issues that influence scientific methods when it comes to the growing data problem. He described the 400-year-old scientific model of collecting data in small groups or as individuals, writing things down in notebooks and using amounts of data that could be measured in kilobytes in modern terms, with calculations carried out by hand. This did not change from Galileo and Newton through to Stephen Hawking in the 1970s. However, within the last 20-30 years the way of doing science has changed, with teams of people working on projects using high performance computers to create visualisations of much larger amounts of data. This is a big culture shift, and Seidel pointed out that many senior scientists were still trained in the old method. You now need larger collaborative teams to solve problems and manage the data volumes to do true data-driven science. He used the example of the Large Hadron Collider, where scientists are looking at generating tens of petabytes of data, which need to be distributed globally to be analysed – with around 15,000 scientists working on around six experiments.

Seidel then went on to discuss how he sees this trend of data sharing developing, using the example of the recent challenge of predicting the route of a hurricane. This involved sharing data between several communities to achieve all the necessary modelling to respond to the problem in a short space of time. Seidel calls the groups solving these complex problems “grand challenge communities”. The scientists involved will have three or four days to share data and create models and simulations to solve these problems, but will not know each other! The old modality of sharing data with people you know will not work, so these communities will have to find ways to come together dynamically to share data if they are going to solve these sorts of problems. Seidel predicted that these issues are going to drive both technical development and policy change.

To illustrate the types of changes already in the pipeline, Seidel cited colleagues who are playing with the use of social networking technologies to help scientists to collaborate – particularly Twitter and Facebook. Specifically, they have set up a system whereby their simulation code tweets its status, and have also been uploading the visualisation images directly into Facebook in order to share them.
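As a rough illustration of that pattern (and nothing more – I have no knowledge of the actual system Seidel described), a long-running simulation might post periodic status updates along these lines; the post_status function is a hypothetical stand-in for whatever micro-blogging client is really used:

```python
# Sketch of a simulation that reports its own progress as short status updates.
# post_status() is a placeholder, not a real Twitter client.
import time

def post_status(message: str) -> None:
    # A real system would call a micro-blogging API here; we just print.
    print(f"[status update] {message}")

def run_simulation(total_steps: int, report_every: int = 100) -> None:
    for step in range(1, total_steps + 1):
        # ... advance the simulation by one step here ...
        time.sleep(0.01)  # stand-in for the real computation
        if step % report_every == 0:
            post_status(f"Simulation at step {step}/{total_steps} "
                        f"({100 * step // total_steps}% complete)")

if __name__ == "__main__":
    run_simulation(total_steps=500)
```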

Seidel noted that high-dimensional, collaborative environments and tremendous amounts of bandwidth are needed, so technical work will be required. The optical networks often don't exist – universities tend to view such systems as plumbing, and funding bodies are not looking to support the upgrade of such infrastructure. Seidel argued that we need to find ways to catalyse this sort of investment.

To summarise, Seidel highlighted two big challenges in science trends at the moment: multi-skilled collaborations and the dominance of data, which are tightly linked. He explained that he had calculated that compute, data and networks have grown by 9-12 orders of magnitude in 20-30 years, after 400 years essentially unchanged, which shows the scale of the change and the cultural shift it represents.
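As a back-of-the-envelope check of my own (not Seidel's figures): nine orders of magnitude over 30 years corresponds to roughly one doubling per year, and twelve orders over 20 years to roughly one doubling every six months.

```python
# My own rough arithmetic on what "9-12 orders of magnitude in 20-30 years" implies.
import math

doublings_9 = math.log2(10 ** 9)    # about 29.9 doublings for 9 orders of magnitude
doublings_12 = math.log2(10 ** 12)  # about 39.9 doublings for 12 orders of magnitude

print(30 / doublings_9)   # ~1.0 years per doubling
print(20 / doublings_12)  # ~0.5 years per doubling
```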

The NSF has a vision document which highlights four main areas – virtual organisations for distributed communities, high performance computing, data visualisation, and learning and work practices. Focusing on the “Data and Visualisation” section, Seidel quoted their dream for data to be routinely deposited in a well-documented form, regularly and easily consulted and analysed by people, and openly accessible, protected and preserved. He admitted this is a dream that is nowhere near being realised yet. He recognised that there need to be incentives for the changes, and new tools to deal with the data deluge. The NSF is looking to develop a national data framework, but Seidel emphasised that the scientific community really needs to take the issues to heart.

Taking the role of the scientist, Seidel took us through some of the questions and concerns which a research scientist may raise in the face of this cultural shift. These included concerns about the replication of results – which Seidel noted could be a particular problem when services come together in an ad hoc way, but which needs to be addressed if the data produced is to be credible.

Seidel moved on to discuss the types of data that need to be considered, in which he included software. He stressed that software needs to be considered as a type of data and therefore needs to be given the same kind of care in terms of archiving and maintenance as traditional scientific collection or observation data. He also included publications as data, as many of these are now in electronic form.

In discussing the hazards faced, Seidel noted that we are now producing more data each year than we have done in the entirety of human history up to this point – which demonstrates a definite phase change in the amount of data being produced.

The next issue of concern Seidel highlighted was that of provenance – particularly how we collect the metadata related to the data that we are considering how to move around. He admitted that we simply don't know how to do this annotation at the moment, but it is being worked on.

Having identified these driving factors, Seidel explained the observations and workgroup structures that the NSF has in place to think more deeply and investigate solutions to these problems, which include the DataNet programme. $100 million is being invested in five different projects as part of this programme. Seidel hopes that this investment will help catalyse the development of a data-intensive science culture. He made some very “apple-pie” idealistic statements about how the NSF sees data, and then used these to explain why the issues are so hard, emphasising the need to engage the library community, who have been curating data for centuries, and the need to consider how to enforce data being made available post-award.

Discussions at the NSF suggest that each individual project should have a data management policy which is then peer-reviewed. They don't currently have consistency, but this is the goal.

In conclusion, Seidel emphasised that many more difficult cases are coming... However, the benefits of making data available and searchable – potentially with the help of searchable journals and electronic access to data – are great for the progress of science, and the requirement to make many more things available than before is percolating down from the US Government to the funding bodies. Open access to information online is a desirable priority, and clarification of policy will be coming soon.

IDCC 09 Keynote: Timo Hannay

Timo Hannay presented a talk entitled “From Web 2.0 to the Global Database”, providing a publishing perspective on the need for cultural change in scientific communication.

Hannay took a step back to take a bigger-picture view. He began by giving an overview of his work at Nature, noting that the majority of their business is through the web – although not everyone reads the work electronically, they do access the content through the web. He then explained how journals are becoming more structured, with links to supplementary information. He admitted that this information is not yet structured enough, but it is there – making journals more like databases.

Hannay moved on to explain that Nature is getting involved in database publishing. They help to curate and peer-review database content and commission additional articles to give context to the data. This is a very different way of being a science publisher – so the change is not just for those doing the science!

After taking us through Jim Gray's four scientific paradigms, Hannay asked us to think back to a talk by Clay Shirky in 2001, which led to the idea that the defining characteristic of the computer age is not the devices, but the connections. If a device is not connected to the network, it hardly seems like a computer at all. This led Tim O'Reilly to develop the idea of the Internet Operating System, which morphed into the name “Web 2.0”. O'Reilly looked at the companies that survived and thrived after the dot-com bubble and created a list of features which defined Web 2.0 companies, including the Long Tail, software as a service, peer-to-peer technologies, trust systems and emergent data, tagging and folksonomies, and “Data as the new 'Intel Inside'” – the idea that you can derive business benefit from the data powering services behind the scenes.

Whilst we have seen Web 2.0 affect science, science blogging hasn't really taken off as much as it could have done – particularly in the natural sciences – and is still not a mainstream activity. However, Hannay did note some of the long-term changes we are seeing as a result of the web and the tools it brings: increasing specialisation, more information sharing, a smaller 'minimum publishable unit', better attribution, the merging of journals and databases – with journals providing more structure to databases – and new roles for librarians, publishers and others. Hannay asserted that these changes are leading, gradually, to a speeding up of discovery.

Hannay took us through some of the resources that are available on the web, from Wikipedia to PubChem and ChemSpider, where the data is structured and annotated through crowdsourcing to make the databases searchable and useable.

He asserted that we are moving away from the cottage-industry model of science, with one person doing all the work in the process from designing the experiment to writing the paper. We are now seeing whole teams with specialisms collaborating across time and space in a more industrial-scale science. Different areas of science are at different stages with this.

Hannay referred to Chris Anderson's claim in Wired magazine that we no longer need theory. He rejected this, but did agree that more is different, so we will be seeing changes. He gave the example of Google, which didn't develop earlier in the history of the web simply because it was not needed until the web reached the scale at which it became useful.

As publishers, Hannay believes they have a role to play in helping to capture, structure and preserve data. Journals are there to make information more readable for human beings, but they need to think about how they present information to help both humans and computers search and access it, as both are now equally important.

All human knowledge is interconnected and the associations between facts are just as important as the facts themselves. As we reach that point when a computer not connected to the network is not really a computer, Hannay hopes we will reach a point where a fact not connected to other facts in a meaningful way will hardly be considered a fact. One link, one tag, one ID at a time, we are building a global database. This may be vast and messy and confusing, but it will be hugely valuable – like the internet itself as the global computer operating system.

IDCC 09: Richard Cable Discusses BBC Lab UK and Citizen Science

Richard Cable, the Editor of BBC Lab UK, opened his presentation about the BBC Lab UK initiative – designed to involve the public with science – with the traditional representations of science on television (“Science is Fun” vs “blow things up”).

Cable used this comparison to illustrate that most “citizen science” seems to involve the mass engagement of people with science, whereas Lab UK is aimed at mass participation in science. The project is about new learning: creating scientifically useful surveys and experiments with the BBC audience, online.

He discussed the motives of the audience and how this forms a fundamental part of both the design of the experiment and the types of experiment that they can usefully conduct. For the audience, the experiment has to be a bit of a “voyage of self-discovery” where they learn something about themselves as well as contributing data to the wider experiment – a more altruistic motive. Cable emphasised that they work with real scientists, properly designed methodologies, ethics approval and peer-review systems so that the experiments are built on solid science and therefore make a useful contribution to scientific knowledge, rather than just entertainment for the audience.

To illustrate, Cable took us through the history of BBC online mass participation experiments which have led to the development of the new Lab UK brand. This included their Disgust experiment, involving showing users images and asking them to judge whether they would touch the item in the image. This was driven by a television programme, which directed the audience to the website after the show. He also discussed Sex ID, which worked the opposite way round – with the results of the experiment feeding the content of the programme. 250,000 participants got involved over 3 months to take a series of short flash tests which identified the sex of their brains. This exemplified his point about giving the audience a motive – with them learning something about themselves as a result of participation.

In continuing this back-story, Cable briefly introduced Stress, which was the prototype launch for the Lab UK brand itself and was linked into the BBC's Headroom initiative. He noted that the general public would rather have something that gives some lifestyle feedback than something purely sciency. This experiment – a series of flash tasks and uncomfortable questions – has since been taken down.

The more recent Brain Test Britain was a higher-profile experiment launched by the programme Bang Goes The Theory. It was the first longitudinal experiment, with the audience asked to revisit the site over a period of 6 weeks to participate, rather than the one-off site-visit, survey model used in the previous examples. Given the issue of retention, they were expecting 60,000 participants to help establish whether brain training actually works. This was a proper clinical trial with academic sponsors from the Alzheimer's Society – the results of which will be announced in a programme later next year.

The fourth experiment Cable described was The Big Personality Test, linked with the Child Of Our Time series following children born in the year 2000. They used standard accepted models for measuring personality, to give detailed feedback to participants. They were seeking to answer the question: “Does personality shape your life or does life shape your personality?”. They attracted 100,000 participants in 3 days, which was vastly more uptake than expected. The volume of data they have collected already is becoming unmanageable, which means they are having to re-evaluate the duration of the experiment.

In the future, they are hoping to take their experiments social using Facebook and Twitter as part of the method.

Cable summed up these experiments by highlighting the rules they have found they need to apply when designing such experiments. These include a low barrier to entry, a clear motive for participation, a genuine mass participation requirement, a sound scientific methodology and an aim that will contribute to new knowledge.

Cable went on to discuss the practicalities of how experiments are designed, from conception to commissioning. This involves selecting sponsor scientists, who help to design the experiment and analyse the results. He explained the selection process, which entails finding respected scientists who are flexible and adaptable to this experiment format. The role of this “sponsor academic” is to collaborate on experiment design, advise on the ethics processes, interpret the results, and then write the peer-reviewed paper resulting from the data and publish their findings.

The data collected from these experiments comes in two forms: personally identifiable data and anonymisable data. This means that the scientists cannot trace individual participants, but the BBC (or rather, three people within the BBC) can, in the event that they need to manage the database and delete entries if requested. Cable also explained that the data they ask for is driven by the science, not by editorial decisions made by television programme makers, and uses standard measures where possible.
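The split described here is essentially keyed pseudonymisation. As a minimal sketch of how such a scheme can work in principle (assuming nothing about the BBC's actual implementation – the key, IDs and fields below are all invented), the data controller keeps a secret key that lets it regenerate a participant's opaque code, while researchers only ever see the code:

```python
# Minimal illustration of keyed pseudonymisation, not the BBC's actual system.
# Researchers receive only the opaque participant code; the data controller,
# who holds the secret key, can regenerate the code from a known identity
# if an entry ever needs to be found and deleted.
import hashlib
import hmac

SECRET_KEY = b"held-only-by-the-data-controller"  # illustrative placeholder

def pseudonym(user_id: str) -> str:
    """Derive a stable, opaque participant code from an internal user ID."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# What the scientists see: results keyed only by the opaque code.
record = {"participant": pseudonym("bbc-id-12345"), "extraversion": 27, "openness": 41}

# What the data controller can do on a deletion request: recompute the code
# for the known ID and remove the matching rows.
code_to_delete = pseudonym("bbc-id-12345")
assert record["participant"] == code_to_delete
```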

Finally, Cable discussed the actual data and the curation issues surrounding it. All the data from Lab UK is stored in one place, connected by the BBC ID system, which enables them to start doing secondary analysis of the data where participants have taken part in multiple experiments. The sponsor academics have a period of exclusivity before the data becomes available for academic and educational purposes only. However, they are still grappling with issues of data visualisation, so that they can make this data comprehensible to the general public, and with data storage issues – as the BBC does not do long-term data storage. There are precedents – including the People's War project, where people's memories of World War II were collected and hosted online; this data has now been passed to the British Museum and forms part of their collection. He also noted that there may be demands from the ethics committee on how long they can keep the data before it must be destroyed.

IDCC 09 Interview: Chris Rusbridge

Chris Rusbridge, Director of the DCC, gives us his views of IDCC 09 and his thoughts about the future of the DCC as it moves towards Phase 3 in this video interview:



[click here to view this video at Vimeo]

Monday 7 December 2009

IDCC 09 Opening Keynote Recording Now Available

After a few problems with the recording following issues with the wireless network at the IDCC 09 venue, I am pleased to say that Professor Douglas Kell's opening keynote is now available online by clicking here.

Complete listings of the recordings from all the main IDCC 09 sessions are available at the event NetVibes page.

IDCC 09: Prof. Carole Palmer - "The Data Conservancy: A Digital Resource & Curation Virtual Organisation"

Professor Carole Palmer introduced the Data Conservancy, which was “cooked up” at the IDCC when it was held in Glasgow. The Data Conservancy asserts that research libraries are a core part of the emerging distributed network of data collections and services.

Palmer noted that there is not really an adequate analogy for data services yet (are data sets the new library stacks? Or the new special collections?) but emphasised that data collections and services are consistent with the research library mission.

The Data Conservancy is a diverse group of domain and data scientists, enterprise experts, librarians and engineers. Palmer introduced the range of partners involved in the project, and then moved on to discuss how they intend to move forward in a very “non-rigid” way, learning principles of navigation and how large the solution space actually is – with technical solutions being only a small part of that. She also noted that an NSF report on how successful infrastructure evolves has inspired the group.

Their goals align with the original programme call for DataNet. They are going to collect, organise, validate and preserve data as part of a data curation strategy, as required by the call. They are also going to examine how to bring data together to address the grand research challenges that society is currently facing. However, the strategy is to connect systems to infrastructure and to be highly informed by user-centred design. They found it was very important to build on existing exemplar projects and to engage with communities that already have deep involvement with scientists.

Palmer took us through diagrams showing who they intend to support and how responsibilities are divided among the teams within the project. They are trying to strike a balance between research and implementation – which is a requirement of DataNet.

The Data Conservancy believes in a flexible architecture, but this has to support a wide range of requirements, data and uses that they have across their constituencies. As a research library project, they are committed to bringing data in, but Palmer noted that not all research libraries can or should do this.

A big part of their project has to do with building a data framework, so they are thinking a lot about the notion of the “scientific observation” as a common concept across scientific disciplines. They will be examining existing models and building on this work. In particular, Palmer talked us through an ORE resource map, noted the need to link data to literature, and explained that, as libraries, they are well positioned to work in this area and improve upon such models.

The launch pad for the project is looking at data from astronomy – specifically the Sloan Digital Sky Survey, which is almost 3 times bigger than the total data held at Johns Hopkins University, presenting a big initial problem in terms of scale. They will then be taking what they learn from working with this core community and applying it to other areas, including Life Sciences, Earth Science and Social Science, after a deep study of the history of astronomical research processes.


As part of her presentation, Palmer gave us an overview of the types of projects they are involved with and how they intend to start interfacing between these projects. She also explained further about their work at Illinois, as a number of her colleagues from Illinois were present at the conference. They have noted that it is not just big, instrument-driven science that will drive this forward, but also smaller science projects. They are also working to understand how they can determine, early on, the long-term potential of data. The IDCC will be hosted at Illinois in 2010, which will be followed by a partner summit for DataNet projects as they move forward.

To conclude, Palmer discussed the education element of their work, which includes a data curation specialisation in the Master of Science, with the third class running this semester, involving 31 students. The Data Conservancy is expected to infuse teaching practices and help to educate a more diverse range of students. She showed us a slide demonstrating the strategy for building the new workforce at Illinois, with the Data Conservancy working across the various areas.

There are lots of connections between the Data Conservancy and other research groups, so Palmer is looking forward to sharing results, work practices and ideas as they move forward with DataNet.

Catch Up On IDCC 09 Day 2

The recordings of the main sessions from Day 2 of IDCC 09 are now available to view on Vimeo.

Please select from the links below to watch any sessions of interest, or any that you may have missed from the live stream of the event....

Best Peer-Reviewed Paper “Multi-Scale Data Sharing in the Life Sciences - Some Lessons for Policy Makers” – Graham Pryor, University of Edinburgh



Keynote Address by Professor Ed Seidel, Associate Director, Directorate of Mathematical and Physical Sciences, National Science Foundation



Closing Keynote Address: Timo Hannay – Publishing Director, Nature.com, Nature Publishing Group



Closing Remarks: Chris Rusbridge - Director of the DCC



You can also find the complete list of session recordings, together with links, at the IDCC 09 NetVibes page where we are still gathering #idcc09 tweets and other feeds about the event. If you are blogging about the event, please remember to use the #idcc09 tag!

Saturday 5 December 2009

IDCC 09 Delegate Interview: Kevin Ashley

Kevin Ashley, who manages the Digital Archives Department for the University of London Computer Centre, gives us his response to IDCC 09 over lunch on the final day.



[click here to view this video at Vimeo]

IDCC 09 Poster Presenter Interview: John Kunze

Poster presenter John Kunze from California Digital Library discusses his poster and his observations from IDCC 09...



[click here to view this video at Vimeo]

IDCC 09 Demonstrator Interview: Terri Mitton

Terri Mitton was one of those demonstrating in the Community Space at IDCC 09. Here she tells us what she has found interesting about the event...



[click here to view this video at Vimeo]

Friday 4 December 2009

IDCC 09 Best Peer-Reviewed Paper: Graham Pryor

Presenter of the Best Peer-Reviewed Paper, Graham Pryor of the University of Edinburgh, gives a context to his paper: “Multi-Scale Data Sharing in the Life Sciences - Some Lessons for Policy Makers” in a video interview prior to his presentation on Day 2 of IDCC 09.



[click here to view this video at Vimeo]

IDCC 09 Delegate Interview: William Kilbride

William Kilbride from the Digital Preservation Coalition tells us what he has found interesting on day one of IDCC 09 in this quick video interview over coffee....



[click here to view this video at Vimeo]

IDCC 09 Delegate Interview:

Duncan Dickinson from the University of Southern Queensland discusses his experience of IDCC 09 in this video interview.



[click here to view this video at Vimeo.]

IDCC 09 Demonstrator Interview: Heather Bowden

Heather Bowden from the University of North Carolina introduces their Digital Curation Exchange, which was demonstrated during the Community Space session at IDCC 09



[click here to view this video at Vimeo]

IDCC 09 Peer-Review Paper: Andrew Treloar

Dr Andrew Treloar of the Australian National Data Service discusses his paper: “Designing for Discovery and Re-Use: the 'ANDS Data Sharing Verbs' Approach to Service Decomposition” and his tomato plants in this short video interview at IDCC 09.



[click here to view this video in Vimeo]

IDCC 09 Peer-Review Paper: Tyler Walters

In the first of a series of informal interviews from IDCC 09, Tyler Walters from the Georgia Institute of Technology gives us a summary of his peer-reviewed paper "Data Curation Program Development in US Universities: The Georgia Institute of Technology Example", presented during the parallel sessions at the International Digital Curation Conference 2009 on Friday 4th December.



[click here to view this video at Vimeo]

Catch Up On IDCC 09

If you missed any of the International Digital Curation Conference Sessions as they were live streamed yesterday, but have read the blog summary here and wish you had seen the presentation - never fear! The recordings are now available online via the video sharing site Vimeo.

Just click on the link for the relevant session below to be taken direct to the recording. Unfortunately, there was a glitch with the recording for Prof. Douglas Kell's opening keynote, but we hope to get this up online in the early part of next week.

IDCC 09 Day One Plenaries:

Prof Carole Palmer: The Data Conservancy: A Digital Resource & Curation Virtual Organisation



Dr William Michener: DataONE: A Virtual Data Center for Biology, Ecology, and the Environmental Sciences



Prof Anne Trefethen: NeuroHub: The information environment for the neuroscientist



Mark Birkin: National e-Infrastructure for Social Simulation (NeISS)



Panel Discussion: UK/US Perspectives



DCC Symposium: Citizen Science: Data Challenges

Introduction /Chair: Dr Liz Lyon, Associate Director, DCC

Presentation: Richard Cable, BBC Lab UK




Summing Up: Cliff Lynch, Executive Director, Coalition for Networked Information

Thursday 3 December 2009

IDCC 09: Cliff Lynch Sums Up Day One

In summing up day one, Cliff Lynch observed how the focus of the discussion at these conferences has shifted over the last five years, harking back to the first meetings, when there was more discussion about preservation than curation. Lynch noted that we are now beginning to understand that preservation has to be a supporting structure for curation, which is a more complex process – more deeply involved in the research process.

One of the other trends he observed emerging is that of “re-use” of data. We are no longer just interested in preserving, but evaluating the prospects of re-use for data and improving those prospects, where possible, to derive greater value from our data.

Lynch noted that the tools and workflows that researchers use are becoming increasingly linked, so data curation needs to be increasingly integrated into them; this will help solve the problems of metadata, provenance and so on, making curation more effective.

Lynch was very happy to hear mention of the notion that we need to get the scientific equipment developers and vendors involved. This could help feed curation into the workflow more effectively – he gave the example of cultural heritage researchers who found their cameras “knew” a lot of the metadata that they had to laboriously enter to fulfil their curation needs, and so could use the equipment to aid in the curation of the data it produced.
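As a small illustration of that point, a camera's EXIF metadata can be read programmatically; the sketch below uses the Pillow library, with "photo.jpg" as a placeholder file, and which tags come back depends entirely on what the camera recorded:

```python
# Digital cameras already record much of the descriptive metadata that
# researchers otherwise re-enter by hand. This sketch reads EXIF tags from
# an image file using Pillow; "photo.jpg" is a placeholder.
from PIL import Image, ExifTags

def camera_metadata(path: str) -> dict:
    with Image.open(path) as img:
        exif = img.getexif()
    # Map numeric EXIF tag IDs to human-readable names where known.
    return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

if __name__ == "__main__":
    for name, value in camera_metadata("photo.jpg").items():
        print(f"{name}: {value}")  # e.g. DateTime, Make, Model
```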

Lynch took a lot of heart from the focus on education to give us a generation of data preservers and data curators. He was also heartened by comments that funding agencies were taking data curation seriously as part of the grant proposal and review processes. He also suggested it would be great if we could actually track the progress of this type of cultural shift.

In concluding, Lynch looked at the more speculative elements of the day's discussion, including the Citizen Science debate – referring to Liz Lyon's paper on the topic. However, he wanted us to recall that there is a whole range of computational and observational citizen science tasks, not just the survey-based BBC Lab UK model. He also reminded us that this is not just applicable to science... we are seeing the emergence of Citizen Humanities and amateur study in other areas, revitalised by the web. What we need to look to is building data support for citizen scholarship as a whole.

Finally, Lynch made a speculation involving the measure “scientific papers per minute”, which underscores how badly out of control scientific communication is and creates a huge problem for propagating and curating knowledge. It seems to Lynch that one of the things we need to recognise is that many of these papers don't need to be papers, but could be database submissions. This would be a better way to do things if we are going to manage the data – without the emphasis on the traditional individual-voice analysis paper. So what we need is a hard conversation about traditional forms of scientific communication and data curation, to determine how data curation fits into scholarly communication and how scholarly communication may need to change to help us manage the sheer volume of the output.

I caught up with Cliff just after his summary of day one to ask what he is looking forward to most from day two of IDCC 09...

“I am looking forward to hearing from Ed Seidel. Most of us in the States know that there are three more DataNet awards in their final stages, so we would love to know who has got the inside track on those... although I suspect he will say that he can't comment on that!

Following on from my summary, I would like to know what people think about how we can track the uptake on data curation in funding bids.

Having been involved in the paper review process, I know that the best peer-review paper is very good, and there are some other great papers being presented tomorrow, so I am very much looking forward to it!”

IDCC 09: Panel Discussion - UK/US Perspectives

To contrast and conclude the morning plenary sessions, the four speakers formed a panel to accept questions from the audience.

Q: Anne Trefethen was asked to explain more about Blog3.
A: Anne's colleagues have been using it and it will be launched next week at the All Hands meeting.

Q: How is user-centric design being used in other areas? (aside from Neuro Hub)
A: Carole Palmer explained that their work does involve requirements-based work, whilst William Michener explained that DataONE engaged users from different research centres from the beginning, each of which also works within its own centre to establish the needs of its users. Mark Birkin explained how they have identified three different types of user – emphasising how diverse the groups of users can be – and highlighted the use of social networking tools to harvest user views directly.

Q: What perspectives do the panel have as to whether data curation is still a pioneering activity and what level of maturity is there among researchers?
A: Trefethen noted that whilst some researchers are mature in their understanding of data management, there are groups who are surprised by the requirement for curation commitments in funding bids. She explained that an understanding needs to be nurtured across disciplines, not just within individual disciplines. Palmer explained that in terms of preservation they see people lining up at the door, whilst the data sharing side is not so well practised (although there are people who are very keen philosophically); bad experiences have fed into this. Michener noted that a disservice has been done by failing to educate young scientists in good data practice as part of doing science, so there is a lot of re-educating to be done. Birkin has a different perspective, as he is doing secondary analysis of well-preserved primary data sources; however, there is not the same level of practice around the secondary analysis of data in his field, which can lead to researchers having to reinvent methods.

Q: Are there plans to be able to cite the data that's being used?
A: Michener is looking at a data citation model that will rely on digital object identifiers to give scientists as much credit as possible for not just their publications, but also their data. It is key to cite the data as a specific object, as the data can lead to multiple publications. The other three panellists agreed that this is part of their projects, at different levels of priority.
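To make the "data as a citable object" idea concrete, a dataset citation built around a DOI might look like the sketch below. The dataset, authors and DOI are invented; only the use of the https://doi.org/ resolver reflects real practice:

```python
# Illustrative only: a citation string for a data set identified by a DOI.
# The dataset, authors, publisher and DOI are all invented for the example.
def cite_dataset(authors: str, year: int, title: str, publisher: str, doi: str) -> str:
    return f"{authors} ({year}). {title} [Data set]. {publisher}. https://doi.org/{doi}"

print(cite_dataset(
    authors="Smith, J. and Jones, A.",
    year=2009,
    title="Long-term vegetation plot survey",
    publisher="Example Data Archive",
    doi="10.0000/example.1234",
))
```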

IDCC 09: Mark Birkin Presents NeISS

To open, Birkin gave us an introduction to GIS (Geographic Information Systems), which display data sets as map graphics. He demonstrated some of the applications of this type of system, which have spawned quite a large industry, with 20,000 people in the US who claim to use GIS each day in their work.

So they have this transformational technology in geography, which enables one to manage and integrate spatial and attribute data and has widespread applications for demographics, climate research, land-use, health, business, crime etc. Birkin admitted that his first reaction when introduced to these maps was: “So what?! What can we do with these systems to make decisions or provide insights into the kind of phenomena we are studying?”

He gave examples of how intelligence can be added to GIS data: spatial analysis (which can automatically identify burglary hotspots, and has been used to inform policing decisions), mathematical models drawn from GIS, simulation and dynamic simulation.

Birkin went on to give us more detail about how he is trying to create a social simulation of the city of Leeds by combining data sources, and how this can inform policy makers. This includes generating “synthetic individuals” to build a complete model.
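The "synthetic individuals" idea can be sketched very crudely: draw individuals whose attributes match published marginal distributions for an area. Real spatial microsimulation uses more careful methods (iterative proportional fitting and the like), and every number below is invented, but this conveys the flavour:

```python
# Toy sketch of synthetic population generation; all proportions are invented.
# Real microsimulation fits joint distributions (e.g. via iterative
# proportional fitting) rather than sampling attributes independently.
import random

random.seed(42)

# Illustrative marginals for one small area (not real data).
age_bands = {"0-15": 0.19, "16-39": 0.36, "40-64": 0.30, "65+": 0.15}
tenure = {"owner-occupied": 0.62, "social rent": 0.21, "private rent": 0.17}

def sample(dist: dict) -> str:
    return random.choices(list(dist), weights=dist.values(), k=1)[0]

def synthesise(n_individuals: int) -> list[dict]:
    # Sampling each attribute independently ignores correlations between them,
    # which is exactly what the more careful methods are there to fix.
    return [{"age_band": sample(age_bands), "tenure": sample(tenure)}
            for _ in range(n_individuals)]

population = synthesise(1000)
print(population[:3])
```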

For a researcher looking to create simulations and analyse issues using geographical information, there are loads of data sources. You would download this information and go away and create the simulations independently. The point of the NeISS project is to create a framework for sharing the value-adding work of creating the simulations from the different data sets. They started in the spring by building portals to bring together technologies and create an infrastructure capable of adding value to all the data that is available.

IDCC 09: Anne Trefethen Presents NeuroHub

Anne Trefethen introduced NeuroHub, which is a much smaller project than the US projects presented earlier in the day. Their aim is to work with neuroscientists to enable them to share their data – work which they hope will fit into the wider picture and generate some useful tools.

The project involves Oxford, Reading and Southampton universities, together with the STFC e-Science Group. She explained how late trains brought two main players in the project together in conversation, which helped to develop the platform for collaboration which evolved between the universities.

Trefethen then focused in on the science involved. She noted that their work will only be of value if it helps to deliver the science. She explained the specific projects each university group is working on – including studying the way the neural networks of insects work to enable them to move their limbs. She asked us to take note of the types of images on her slides – demonstrating the types of data that the scientists want to share with their colleagues, in raw form. She drew our attention to the note that some of the diagrams were stored with Spike 2 software – exemplifying the need to be aware of the wide range of tools in use when storing and sharing data. The data does not just include images, but also video. She explained that one apparently small but significant consideration is that the scientists do not want to have to use a USB key in order to share their data.

She emphasised that it is very important to identify what the data we are using actually are. Experiments do not necessarily create the metadata needed to make it easy to find and share the results later. She also noted the complex range of software products being used in different ways to collect, process and publish the data.

To overcome this, the NeuroHub project involves embedding the developers in the neuroscience labs in the early stages to gain insight, combined with structured and unstructured interviews to establish how all of these issues mesh together.

Trefethen then moved on to look at the challenges they are facing – a list that she admitted could have been taken from Douglas Kell's opening keynote: the variety of interdisciplinary teams, and differences in expectations, cultures, requirements and understanding of shared terms, all of which can obstruct data sharing. They have been using an agile development process to try to resolve some of these challenges and ensure that they develop tools that actually work for the scientists in practice.

She explained that their aim is “jam today, jam tomorrow” i.e. doing simple things that can make a big difference. This can include things like format conversions and proper annotation to help facilitate data sharing.

Trefethen then introduced some of the related projects that they are interacting with – including myExperiment (“Facebook for scientists” - socialising the data and providing annotation) and CARMEN, which is larger than NeuroHub, but more focussed on one area and works in the same community – promoting standards. There is a lot out there that they can integrate into NeuroHub.

In explaining the environment architecture of the project, Trefethen emphasised that they did not want to develop a large, monolithic system, but rather something that is in their workspace and creates an environment that empowers the researchers.

IDCC 09: William Michener Presents DataONE

Michener began his presentation introducing DataONE by asking “Why?” He explained that his team is focussed on the environmental challenges that are of increasing concern to us all.

He described their approach of building a knowledge pyramid and the role of citizen science in contributing more site-based data. Michener also highlighted the problem of data loss, not just due to fire, other physical factors and unstable data storage systems such as tapes, but also due to an inability to archive all the data – predicting that next year they will lose as much data as gets collected in the course of the year.

Data that gets collected loses value over time. Scientists are most familiar with their data set at the time of publication, so without comprehensive metadata we lose a lot of the details that make the data re-useable and relevant.

DataONE is designed to address some of these challenges. They aim to provide universal access to data about life on earth. They are developing a cyberinfrastructure with distributed nodes: member nodes, where data are stored at existing sites; co-ordinating nodes, which do not physically store data but keep the metadata catalogues; and an investigator toolkit to help students and scientists access and use this data.
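To picture the member-node/co-ordinating-node split as described (this is purely my own illustration, not DataONE's actual interfaces or identifiers), a co-ordinating node answers "where does this data set live?" from its metadata catalogue, and the investigator toolkit then fetches the data from one of the member nodes that hold it:

```python
# Illustration of the architecture as described, not DataONE's actual API.
# Member nodes hold the data; co-ordinating nodes hold only the metadata
# catalogue; the toolkit resolves a data set before fetching it.
# All identifiers and URLs are invented.

CATALOGUE = {
    # dataset id -> (descriptive metadata, member nodes holding copies)
    "doi:10.0000/veg-plots-2008": (
        {"title": "Vegetation plot survey 2008", "theme": "ecology"},
        ["https://member-node.example.edu", "https://mirror.example.org"],
    ),
}

def resolve(dataset_id: str) -> dict:
    """Co-ordinating-node role: return metadata and locations, not the data itself."""
    metadata, holders = CATALOGUE[dataset_id]
    return {"id": dataset_id, "metadata": metadata, "locations": holders}

def fetch(dataset_id: str) -> str:
    """Investigator-toolkit role: pick a member node and retrieve the data from it."""
    location = resolve(dataset_id)["locations"][0]
    return f"GET {location}/object/{dataset_id}"  # stand-in for a real HTTP request

print(fetch("doi:10.0000/veg-plots-2008"))
```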

Michener then took us through how they plan to support the data lifecycle – providing examples of the systems and tools to support data contributors and data users.

DataONE is initially focussing on biological and environmental data, but they recognise that many of the issues require using other data sources – including social science. They already have a range of data sources and a diverse array of partners, including libraries, academic institutions and research networks. They plan to leverage existing structures, but also to expand the network throughout the project. He also explained the link between DataONE and the Data Conservancy described by Carole Palmer in the previous presentation.

Michener showed how they have already identified a number of member nodes across the globe representing a number of research networks – but there is no common data management standard between them. “The problem with standards is that there are so many of them,” he noted, so there is no one-size fits all approach.

He demonstrated some of the investigator tools that they are developing to help search and filter the data available, which also enable you to bring up the metadata as well as the original data. Finally, he also introduced the EVA working group, which looks at exploring, visualising and analysing data, and gave a practical example of how they have combined data and created visualisations mapping vegetation and bird migration.

IDCC 09 Keynote Address: Douglas Kell

Professor Douglas Kell provided us with a BBSRC perspective on digital data curation and his own perspective as an academic.

He began with the philosophy of data science: data curation is not an end in itself, but rather a means to an end. He showed a diagram that he referred to as an arc of knowledge, which demonstrated that one would normally start with an idea or hypothesis, then do an experiment that would produce some data that would either be consistent with the hypothesis or not. The other side of the arc addresses where those ideas come from – the data, which is the starting point for many in the audience at IDCC 09. This is data-driven science, where hypotheses are generated from the data, as opposed to hypothesis-driven science, which he admitted the biological community is quite reliant on. As we move towards more data-driven science we encounter the problem of storing not just the data itself, so we can make use of it, but also the knowledge generated from the data – a theme that would be picked up by Mark Birkin in his later talk.

The digital availability of all sorts of stuff changes the entire epistemology of how we do science, making all sorts of things possible.

Historically, physics was seen as the high-data science, whilst now biology is being recognised as a high-data science too. Biology is a short, fat data model, with less data but more people using it, whilst physics could be described as a long, thin data model. He did note that there is a lot of video and image data related to biological studies that is not being shared because people don't yet know how to handle it.

Kell made the point that if you can access and re-use scientific data you will gain a huge advantage over those who do not, both academically and potentially commercially, which helps to drive funding for this type of work. He used the example of genomics as just one of the fields generating huge data sets which could be used for data-driven science, but which need to be integrated.

Relative to the cost of creating the data, the cost of storing and curating it is minimal, so it is therefore a good idea to store it effectively. But the issue is not just storage. There is also the cost of moving it. Kell asserted that we need to move to a model where we don't take the data to the computing, but rather take the computing to the data, which will change the way we approach storage and sharing.

Kell moved on to explain that we will need to have a new breed of curators and tools to deal with the challenges, particularly making that data useful. Having the data does not always help, as generally biologists do not have the tools to deal with big data. He expects the type of software to evolve that does not sit locally on the machine, but sits somewhere else and gets changed and updated, but is useable by the scientist without specialist computing knowledge.

Kell pointed out that things do not evolve just from databases getting bigger, but rather from the tools to deal with the data and the curation methods evolving too. To illustrate, he pointed out that five scientific papers per minute are published, so we need a lot of tools to make this vast amount of literature (and the associated data) useful, so it does not just end up in a data tomb.

Some areas of science have better strategies than others, but BBSRC is now looking at making data curation and sharing part of the funding process, while making sure that data-driven projects are not competing with hypothesis-driven bids. He noted that journalists are keen, funding bodies are keen, and the culture will soon change so that NOT managing and sharing data will become distinctly uncool, like smoking.

Finally, Kell emphasised the need to integrate the data and the metadata. He noted that the digital availability of all data has the potential to stop the balkanisation of scientific data, and it is the responsibility of people within the room to ensure this.

Wednesday 2 December 2009

IDCC 09 Gets Underway...

The 5th International Digital Curation Conference got underway stylishly in London this evening as Dr Liz Lyon, Associate Director of the DCC, welcomed delegates to the opening drinks reception at the Natural History Museum, which was specially lit in DCC orange and red for the occasion.


Liz gave us some background to the Natural History Museum and some of its exhibits – including tales of “Dippy” (the iconic skeletal Diplodocus) and an 8-foot squid housed in a tank produced by Damien Hirst's suppliers. These impressive displays aside, Liz noted that the museum has its own range of digital curation challenges, which are explored in an interview with Neil Thompson at the DCC website. Liz then introduced us to Lee Dirks of Microsoft Research, who kindly sponsored the event.

Lee expressed his team's honour at having the opportunity to sponsor the evening drinks reception and noted that the IDCC was one of the more fascinating conferences in the field and the one that he has prioritised getting to for the last three years. He explained very briefly that Microsoft Research has been doing a lot of work in e-Science and e-Research areas and they feel very strongly that data curation is a critical activity that needs to receive more attention. He proposed a toast to Chris Rusbridge and all the organisers for putting on the event and reminded the speakers that those of us in the audience will be looking to them for all the answers over the coming days.

With that in mind, I will be blogging here over the next two days to summarise all of the hopefully illuminating presentations packed into this year's programme, together with a range of interviews with speakers and delegates. Please feel free to comment, and if you are attending the conference and blogging yourself, please include a link to your post.

Five-Minute Interview at IDCC 09: William Michener

In the first of our interviews from IDCC 09, William Michener gives us a sneak preview of his plenary talk, which will be live streamed at 10:15 tomorrow.

Introduction:

I am William Michener from the University of New Mexico, and I am a professor with the University Libraries System.

What will you be talking about in your presentation tomorrow?

I am introducing DataONE, which stands for Data Observation Network for Earth: a virtual data centre for the biological, ecological and environmental sciences. It is unique in that it will be highly distributed, with a worldwide presence, and it will support the curation and preservation of data from universities, individual scientists, research networks and other organisations. In addition, it will host an investigators' toolkit that will provide data exploration, data management, and analysis and visualisation tools. This will be for scientists, students and citizens.

What's next for DataONE?

This is a long-term project to set up, essentially, a virtual data centre that will last decades to centuries, so we hope to build new partnerships and expand existing partnerships with a large number of European organisations.

What are you looking for in Phase 3 from the DCC?

We hope that we could explore collaborations with respect to developing educational materials as well as providing digital curation and other tools for scientists and students.

Tuesday 1 December 2009

Live Streaming at IDCC 09

If you are planning to follow the 5th International Digital Curation Conference online, you can now watch the main sessions broadcast from the event via a live video stream.

You can see the live stream at the new event NetVibes page at www.netvibes.com/idcc2009. NetVibes is a tool for collecting resources and feeds from all over the web, which we hope will enable you to get everything you need to stay up to date in one place, if you are participating in the event remotely.



[screen shot of IDCC 2009 NetVibes page]


You will find recordings of any sessions that you may have missed, updates from this blog and the official @idcclive Twitter commentary, and all the opinions and comments tagged with the #idcc09 hash tag - all brought together at one page.

The main plenary sessions, including the DCC symposium, will be live streamed on Thursday 3rd December and Friday 4th December, subject to consent from the individual speakers. Parallel sessions will not be covered.

If you cannot access NetVibes for any reason, then you can also view the live video stream by clicking here.

Wednesday 25 November 2009

IDCC 2009 Amplified!

As Chris has recently announced, the annual International Digital Curation Conference is almost upon us. This year's event will be amplified using a range of online social media tools to help include those who can't make it to London on 3rd and 4th December, and to capture the online conversation surrounding the event for future reference.

This blog will form the centre point of the coverage. There will be summaries of each of the sessions, video interviews with speakers and delegates, and much more. So, if you are reading via the RSS feed, expect a flurry of updates throughout the conference! If you're not subscribed to the RSS feed, make sure you check back regularly during the event to see what's been covered.

You will also be able to follow the official live commentary of each of the plenary sessions on Twitter by following @idcclive, and take part in the conversation using the event hash tag #idcc09. If you have a question for a speaker, simply tweet your question to @idcclive and it will be relayed to the speaker for you at an appropriate point.

We look forward to seeing you at IDCC 09 – whether in person or online!
_________________


The amplification of IDCC 09 will be co-ordinated by Kirsty McGill. Kirsty is the Creative Director of communications and training firm TConsult Ltd.

Wednesday 18 November 2009

Workshops prior to the International Digital Curation Conference

Pre-conference workshops can be very useful and interesting; they can be a good part of the justification for attending a conference, giving an extended opportunity to focus on a single topic, followed by a broader (but shallower) look at many topics at the conference itself. This time it is quite frustrating, as I would very much like to go to all the workshops! There is still time to register for your choice, and for the IDCC conference itself.

Disciplinary Dimensions of Digital Curation: New Perspectives on Research Data


Our SCARP Project case studies have explored data curation practice across a variety of clinical, life, social, humanities, physical and engineering research communities. This workshop is the final event in SCARP, and will present the reports and synthesis.

See the full programme [PDF]

Digital Curation 101 Lite Training


Research councils and funding bodies are increasingly requiring evidence of adequate and appropriate provisions for data management and curation in new grant funding applications. This one-day training workshop is aimed at researchers and those who support researchers and want to learn more about how to develop sound data management and curation plans.

See the full programme [PDF]

Citability of Research Data

Goal: Handling research datasets as unique, independent, citable research objects offers a wide variety of opportunities.

The goal of the new DataCite consortium is to establish a not-for-profit agency that enables organisations to register research datasets and assign persistent identifiers to them.

Citable datasets are accessible and can be integrated into existing catalogues and infrastructures. Making datasets citable also rewards scientists for the extra work of storing and quality-controlling their data, by granting scientific reputation through citation counts. The workshop will examine the different methods for enabling citable datasets and discuss common best practices and challenges for the future.

See the full programme [PDF]

Repository Preservation Infrastructure (REPRISE)
(co-organised by the OGF Repositories Group, OGF-Europe, D-Grid/WissGrid)


Following on from the successful Repository Curation Service Environments (RECURSE) Workshop at IDCC 2008, this workshop discusses digital repositories and their specific requirements for/as preservation infrastructure, as well as their role within a preservation environment.

Data and the journal article

I recently had a discussion (billed as a presentation, but it was on such an (ahem) intimate scale that it became a discussion) at Ithaka, the organisation in New York that runs JSTOR, ArtSTOR and Portico. We talked about some of the issues surrounding supporting journal articles better with data. Both research funders and some journals are starting to require researchers/authors to keep and to make available the data that supports the conclusions in their articles. How can they best do this?

It seems to me that there are 4 ways of associating data with an article. The first is through the time-honoured (but not very satisfactory) Supplementary Materials, the second is through citations and references to external data, the third is through databases that are in some way integrated with the article, and the fourth is through data encoded within the article text.

My expectation was that most supplementary materials that included data would actually be in Excel spreadsheets, a few would be in CSV files, and even fewer in domain-specific, science-related encodings. I was quite shocked, after a little research, to find that (at least for the Nature journals I looked at) nearly all supplementary data were in PDF files, while a few were in Word tables. I don't think I found any in Excel, let alone CSV. This doesn't do much for data re-usability! As things stand, data in a PDF document (eg in tables) will probably need to be extracted by hand, possibly by cut and paste followed by extensive manual clean-up.
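To make the re-usability contrast concrete, here is a minimal sketch (the filename and column name are hypothetical, not taken from any of the articles I looked at) of loading a CSV supplementary table in Python; a table locked inside a PDF offers no equivalent route without manual extraction first.

    import csv

    # Hypothetical supplementary file; a CSV table can be parsed directly.
    with open("supplementary_table_1.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Immediately re-usable: for example, average a (hypothetical) column.
    values = [float(row["concentration_uM"]) for row in rows]
    print("mean concentration:", sum(values) / len(values))

    # The same table published only as PDF gives no such route: the numbers
    # have to be copied out by hand before any computation can start.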

I would expect that looking away from the generalist journals towards domain-specific titles would reveal more appropriate formats. However, a ridiculously quick check of Chem-Comm, a Royal Society of Chemistry title, showed supplementary data in PDF even for an "electronically enhanced" article (eg experimental procedures, spectra and characterization data; perhaps not openly accessible...).

There’s a bit of concern in some quarters about journals managing data, particularly that data would disappear behind the pay wall, limiting opportunities for re-use.

What would be ideal? I guess data that are encoded in domain-specific, standardised formats (perhaps supported by ontologies, well-known schemas, and/or open software applications) would be pretty useful. I’ve also got a vague sense of unease about the lack of any standardised approach to describing context, experimental conditions, instrument calibrations, or other critical metadata needed to interpret the data properly. This is a tough area, as we want to reduce the disincentives to deposit as well as increase the chances of successful re-use.
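As a purely illustrative sketch (every field name below is an assumption, not drawn from any standard), even a small structured record deposited alongside a dataset would capture some of that missing context:

    import json

    # A minimal, hypothetical metadata record accompanying a deposited dataset.
    # Field names are invented for illustration; a real record should follow a
    # community schema or ontology where one exists.
    record = {
        "title": "Example assay results",
        "creator": "A. Researcher",
        "created": "2009-11-18",
        "instrument": "Instrument make and model",
        "calibration": "Reference standard and date of last calibration",
        "conditions": {"temperature_C": 25, "buffer": "PBS, pH 7.4"},
        "units": {"concentration": "uM"},
    }

    print(json.dumps(record, indent=2))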

Clearly there are many cases where the data are not appropriate for inclusion as supplementary materials, and should be available by external reference. Such would be the case for genomics data, for example, which must have been deposited in an appropriate database (the journal should demand deposit/accession data before publication).
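As an illustration of why accession numbers make external data genuinely re-usable, the sketch below fetches a nucleotide record from NCBI's public E-utilities service; the accession shown is just a placeholder, and in practice you would use the one cited in the article.

    import urllib.parse
    import urllib.request

    # Sketch: retrieve a sequence record by accession number from NCBI.
    # The accession below is a placeholder, not one cited in any article here.
    params = urllib.parse.urlencode({
        "db": "nucleotide",
        "id": "NM_000546",   # placeholder accession
        "rettype": "fasta",
        "retmode": "text",
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + params
    with urllib.request.urlopen(url) as response:
        print(response.read().decode()[:500])  # first few hundred characters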

External data will be fine as long as they are on an accessible (not necessarily open) and reasonably permanent database, data centre or repository somewhere. I do worry that many external datasets will be held on personal web sites. Yes, these can be web-accessible, and Google-indexed, but researchers move, researchers die, and departments reorganise their web presence, which means those links will fail, and the data will disappear (see the nice Book of Trogool article "... and then what?").

Sometimes such external data can be simply linked, eg a parenthetical or foot-noted web link, but I would certainly like to encourage increasing use of proper citations for data. Citations are the currency of academia, and the sooner they accrue for good data, the sooner researchers will start to regard their re-usable data as valuable parts of their output! It’s interesting to see the launch of the DataCite initiative coming up soon in London.

There is this interesting idea of the overlay data journal, which rather turns my last paragraph on its head; the data are the focus and the articles describe the data. Nucleic Acids Research Database Issue articles would be prime examples here in existing practice, although they tend to describe the dataset as a persistent context, rather than as the focus for some discovery. The OJIMS project described a proposed overlay journal in Meteorology; they produced a sample issue and a business analysis, but I’m not sure what happened then.

The best (and possibly only) example I know of the database-as-integral-part-of-article approach is Internet Archaeology, set up in 1996 (by the eLib programme!) as an exemplar for true internet-enabled publishing. 13 years later it's still going strong, but has rarely been emulated. Maybe what it provides does not give real advantages? Maybe it's too risky? Maybe it’s too hard to create such articles? Maybe scholarly publishing is just too blindly conservative? I don't know, but it would be good to explore in new areas.

Peter Murray-Rust has argued eloquently about the tragedy of data trapped and rendered useless in the text, tables and figures of articles. We would like to see articles semantically enriched so that these data can be extracted and processed. The encoded-data approach points us to a few examples, such as the Shotton enhanced article described in Shotton et al (2009), and also the Murray-Rust/Sefton TheoREM-ICE approach (although that was designed for theses, I think). I think the key obstacle here is the lack of authoring tools. It is still rather difficult to actually do this stuff, eg to write an article that contains meaningful semantic content. The Shotton target article was marked up by hand, with support from one of the authors of the W3C SKOS standard, ie an expert! The chemists have been working on tools for their community, including the ICE example, and also MS Chem4Word, maybe ChemMantis, etc.
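To give a flavour of what machine-readable semantic content might look like behind the scenes, here is a small sketch using the rdflib library to tag an article with a SKOS concept as its subject. The URIs and the concept are invented placeholders; this is not the markup used in the Shotton exemplar, which made its own vocabulary choices.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF, SKOS

    # Placeholder namespaces; a real enhancement would use the publisher's
    # own article URIs and an established domain vocabulary.
    EX = Namespace("http://example.org/article/")
    VOCAB = Namespace("http://example.org/vocab/")

    g = Graph()
    g.bind("skos", SKOS)
    g.bind("dcterms", DCTERMS)

    # Declare a concept with a human-readable preferred label...
    g.add((VOCAB.DataCuration, RDF.type, SKOS.Concept))
    g.add((VOCAB.DataCuration, SKOS.prefLabel, Literal("data curation", lang="en")))

    # ...and state that the article has that concept as a subject.
    g.add((EX.article123, DCTERMS.subject, VOCAB.DataCuration))

    print(g.serialize(format="turtle"))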

This last paragraph also points us towards the thesis area; I think this is one that Librarians really ought to be interested in tackling. What is the acceptable modern equivalent to the old (but never really acceptable) practice of tucking disks into a pocket inside the back cover of a thesis? Many universities are now accepting theses in digital form; we need some good practice in how to deal with their associated data.

So, we seem to be quite a way from universal good practice in associating data with our research articles.

Shotton, D., Portwin, K., Klyne, G., & Miles, A. (2009). Adventures in semantic publishing: exemplar semantic enhancements of a research article. PLoS Computational Biology, 5(4). doi: 10.1371/journal.pcbi.1000361.