Update on technical development of our DSpace submission tool

Just uploaded a new technical report by Ian Wellaway to our repository: http://hdl.handle.net/10036/3847

This report reviews the approaches that were identified earlier this year as possible solutions to the ‘big data’ upload issue: using the default DSpace upload tool; using third-party software and tools; developing a bespoke solution for Exeter.

Ian outlines the development work that has been done in these areas and the outcomes. For a time we have been developing two prototypes concurrently – one that could, ideally, be easily reused by other HEIs, and a more bespoke tool catering for Exeter’s specific needs but with less cross-institutional transferability.

Various tools and applications are evaluated and discussed: SWORD, sworduploader, EasyDeposit and Globus FTP.

Hope this will be of interest to other MRD projects and more widely.

Posted under Big Data, Reports, Technical development

This post was written by Jill Evans on October 2, 2012


PGR feedback on data upload

Last week we asked our group of PGRs to test upload of data to the Exeter Data Archive. I was particularly interested in seeing how they would respond to the interface and the metadata web form.

The following are some of the comments that we received – some of these relate specifically to how DSpace works but some are of general interest:

• Add a sentence to the current licence making it clear that depositors can ask to remove their data/outputs.

• It’s important to be able to see inside a zip file.

• How can multiple files be uploaded?

• It would be used more if it were possible to upload from your own drive – drag and drop rather than entering metadata through the web interface.

• A ‘wizard’ like process would be really helpful.

• Would like a template structure for storing previously entered metadata; this could be selected later for further related deposits.

• Keywords – need intuitive text to appear in boxes otherwise will get an inconsistent and inaccurate list of keywords.

• Upload speed – varied between PGRs; Mac users found it much quicker – a 100 MB audio file uploaded in about 30 seconds; a 700 MB file took 20 minutes to upload with a Mac.

• The Submit button needs to be much clearer

• Do you need to login before you upload or could you choose to upload and then have to login – which is better?

• Metadata – people will cut corners if it’s too onerous.

• Would be good to be able to add projects to the hierarchy (i.e., DSpace Communities structure)

• DPA – is it contravening DPA if even an administrator can see sensitive data?

• Data could be encrypted as well as being stored in a ‘dark archive’.

• An upload manager would be a really useful feature – you could queue files for upload and then just leave them.

• Important to add contact details of depositor (PI, etc.), especially email address.

• Clearer help and guidance; make mandatory fields clearer. Title – more specific guidance needed: is this the title of the deposit or of the depositor?

• Would be useful to have a dropdown list of your previous submissions, you could then choose to link things together (e.g., paper & data), and make the process easier.

• Confused about the difference between date of publication and date of creation – publication is the date it becomes publicly available and is needed by DataCite – but DSpace doesn’t automatically assign this detail to the ‘publication’ field.

• Need a more comprehensive list of data types than default Dublin Core list.
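
To put the upload timings reported above in comparable terms, here is a rough back-of-the-envelope calculation. It is only a sketch: the function name is my own, the figures are the approximate ones the testers gave us, and it says nothing about why the larger file was slower.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in megabytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


# Approximate figures reported by the PGR testers.
small = throughput_mb_per_s(100, 30)       # 100 MB in about 30 seconds
large = throughput_mb_per_s(700, 20 * 60)  # 700 MB in about 20 minutes

print(f"100 MB file: {small:.2f} MB/s")        # 3.33 MB/s
print(f"700 MB file: {large:.2f} MB/s")        # 0.58 MB/s
print(f"Rate difference: {small / large:.1f}x")  # 5.7x
```

On these numbers the larger file averaged well under a fifth of the rate of the smaller one, so file size alone does not account for the 20 minutes.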

Posted under Big Data, Metadata, Technical development

This post was written by Jill Evans on May 31, 2012


Case study – The Cricket-Tracking Project

Other JISC MRD projects or those working with ‘big data’ may be interested in a case study that has been written for Open Exeter by Dr Jacq Christmas (http://hdl.handle.net/10036/3556).

The case study documents the process of reviewing, preparing, uploading and describing multiple large video files. The project that generated the files is investigating the behaviour of crickets through analysis of thousands of hours of motion-triggered video.

The project is interesting to us for a number of reasons:

• It is a cross-disciplinary/cross-departmental project – these sorts of projects are becoming increasingly common at Exeter and do throw up interesting questions around the area of ‘ownership’
• Huge amounts of data have been and continue to be produced
• Storage is a problem due to the number and size of files – most files are stored on external hard drives held in various places
• As there is no central storage system, secure backup can be a problem
• Ditto secure sharing
• The first batch of video is in a proprietary format that requires specific software in order to be viewable

The case study sets out quite clearly the thought that should be given to selecting and preparing files for upload to a repository. We are looking at how the procedures described can be adapted as templates to guide researchers from other disciplines through the deposit process, some aspects of which will always be generic, for example:

• Listing and explaining the various file formats and how they are related
• Selecting a set of metadata fields to describe the files
• Thinking about the structure of the data in the repository and how it links to related resources, projects and collections

One issue that has arisen from this case study, that we were already well aware of, is the preference to deposit research in a project or research group collection rather than a generic departmental or College collection. In many cases the sense of belonging to or affinity with a group is stronger than departmental ties. This is a tricky one for us: DSpace structure centres on a hierarchy of communities, sub-communities and collections; once these have been set up and start to be populated, it is difficult to make significant changes. Add to that the fact that our CRIS, Symplectic, has been painstakingly mapped across to all our existing communities and collections and any structural changes become even more problematic. For the moment we are looking at a possible metadata solution (dc****.research group ??). I’d be interested to hear how others deal with the research project/group requirement.

We’re about to start a similar test case study with Astrophysics and later in the year with an AHRC-funded project based in Classics and Ancient History. It will be interesting to see if the approaches taken in these areas are significantly different, or given different emphasis.

I won’t say that our first case study has allowed us to resolve the many issues raised yet but we are at least more aware of what is important to researchers and can start to take steps to find solutions.

Posted under Big Data, Case studies

This post was written by Jill Evans on May 28, 2012


OR2012

Good news for Open Exeter – we heard that our paper on archiving PGR data has been accepted for OR2012 in Edinburgh. We are all planning on attending so hope to catch up with other MRD02 projects in July.

Posted under News

This post was written by Jill Evans on April 30, 2012


Archiving PGR research data?

As we finish the third week of our investigations into RDM practice around the University, we’re a little surprised by a common factor that is starting to emerge from interviews: concern about what happens to PGRs’ data when they leave the University at the end of their studies.

We had some idea from conversations with PGRs that they themselves have questions about what happens to student data when someone leaves. The most consistent comment is that since there are no policies or guidelines of any sort, data will probably sit on a hard drive or external drive in an office somewhere until either the device fails or no-one can figure out how to access the files again.

For PGRs this is a problem for two main reasons:
• Students would like to receive recognition for their work and feel it is being valued and reused to contribute to building knowledge in their academic field. If the data is more accessible, it will have greater impact and enhance their career development.
• Typically this research data is unavailable for incoming students to build on; they will be aware that the research has taken place but due to the lack of policy on recording and storing PGR data, they (and their supervisors) have no way of locating it.

For researchers, where PGR research has been incorporated into project/research group activities, continuing access to raw data is critical.

Researchers may be aware that previous research is relevant to the students they currently supervise but, again, cannot access the original data. This can lead to duplication of effort.

Additionally, it can be useful to have access to restrictions-free raw data as a tool to teach research skills and methodologies to incoming students.

Until this point, we hadn’t really considered that there might be a role for the project in providing continuing access to PGR data. However, there is clearly a (relatively) quick win opportunity for us here: we already mandate thesis deposit to our research outputs repository, ERIC, which we are looking at integrating with our data archive; we already allow deposit of supplementary files, such as video and audio when they’re an integral part of the thesis. It’s only a comparatively small next step to then permit (or even mandate?) deposit of underlying data. It’s an aim we will certainly incorporate into our scheme of work over the next few months.

Are other projects coming across a similar situation?

Posted under Follow the Data

This post was written by Jill Evans on March 2, 2012
