Using SWORD v2 to upload large data files

Our project uses DSpace as its online repository, with ATMOS storing the actual files. This gives us a very large storage area for researchers to submit their research data to.

However, while the DSpace submission interface works well with small files, it struggles a bit with larger ones, or with multiple files.

So, to combat this we’ve been looking at creating a new SWORD-based submission tool. We started by exploring a Python script created by QMUL, ‘sworduploader’.

This works well for small files and multiple files since it effectively zips up the files, submits them to DSpace using the SWORD service document and then unzips them. However, as we developed the script to work for larger data we started to uncover some issues.

1) Python’s default zip engine is 32-bit, so it started to fail when attempting to zip up anything outside of the 32-bit range (4GB). We solved this by forcing Python to use the ZIP64 extensions, giving us much more headroom on file size.

2) The DSpace XMLUI Cocoon module (in core.properties) contains an integer max upload filesize. This is currently set to just over 2GB and cannot be increased. We’re working on this.

http://permalink.gmane.org/gmane.comp.db.dspace.user/15901

And we look forward to…

3) HTTP is not a very trustworthy protocol for uploading data across the web. What happens if the submission of a large file is interrupted?
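To illustrate point 1, here is a minimal sketch of zipping files with ZIP64 enabled. It is not the actual sworduploader code: it assumes the fix goes through the `allowZip64` flag of Python’s standard `zipfile` module, and the function name `zip_files` is made up for this example.

```python
import os
import zipfile


def zip_files(paths, archive_path):
    """Bundle the given files into one zip archive, allowing archives over 4GB."""
    # allowZip64=True switches on the ZIP64 extensions. It was off by default
    # in Python 2 and has been on by default since Python 3.4.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for path in paths:
            # Store each file under its base name so the archive unpacks flat.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

Passing the flag explicitly does no harm on newer Python versions and makes the intent obvious to anyone reading the script.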

We are also working on a bespoke Java-based tool for submitting files to the repository. This uses the ‘batch import’ script of DSpace and seems to work well. However, we’d like to use SWORD, and we don’t want to have to create separate submission tools for different kinds of data, since this would confuse users. We’d like to make it all work with one interface.

So, we persevere.

Posted under Technical development

This post was written by Ian Wellaway on August 28, 2012

Collecting data that captures human emotions

Another case study by an Exeter PGR is now available: http://hdl.handle.net/10036/3697

This report, by Mrunal Chavda, a Drama PhD student, presents some fairly unusual data management challenges.

Basically, Mrunal is attempting to capture and document emotional responses to dramatic situations using human subjects.

This immediately throws up obvious issues around ethics, confidentiality and the correct use and storage of information covered by the Data Protection Act.

Given the unusual nature of the study there are some unique challenges around data collection – how to ensure the emotions captured are unselfconscious and genuine, and what technologies can be used. Devices must be reliable and robust but also as unobtrusive as possible.

All this is carried out with the purpose of developing a new analytical model based on Rasa aesthetics.

Interesting reading!

For a basic definition of Rasa:
http://m.eb.com/topic/491635

URI: http://hdl.handle.net/10036/3697

Posted under Case studies

This post was written by Jill Evans on August 20, 2012

PGR “audits”

For the first six months of the year we asked our PGR students to complete a research data management “audit” every week. This task has now ended and I am working on analysing the results so a fuller report will follow.

The audit consisted of 17 questions, such as: What file formats are the data you created this week? Please state both electronic and paper; Where was this week’s data created? i.e. Home, office, field trip etc.; and Does any of the data you created this week need to be shared? Please give details. Follow this link for a template of the questions we asked: Weekly_audit_form_template (Excel file – blog post updated 20th August 2012).

A quick analysis of the audits has thrown up a number of similarities amongst our students:

  1. They all seem to work in phases i.e. there will be a data collection phase, a writing up phase, a literature review phase etc. Although there is obviously some overlap between these phases and the length of the phase differs between all our students, the general principle does seem to hold true across all the different disciplines.
  2. All of our students create and/or analyse their data both at home and in the office on campus. A number have also been on field trips to collect data. This supports the findings of our DAF survey where research data was shown to be collected and analysed both on and off campus.
  3. Similar issues are faced by students of different disciplines. One that has shown up in the audits is the potential size of image files and the adequate filing and storage of hundreds (or even thousands) of such files so that particular images are easily found.
  4. Although our students use different file formats to each other (with the exception of Word, Excel, PowerPoint and PDF, which are common to each to a greater or lesser extent), they each use only a comparatively small number of formats/file types (the most used by any one student is eight).

The audit has proved to be a mine of useful information for the project, and the regular meetings I have been holding with our students have allowed me to check details and abbreviations that I didn’t recognise. Further analysis of the results will, I am sure, provide further useful information.

Posted under Follow the Data, PGR students, Research

This post was written by Gareth Cole on August 17, 2012

The pitfalls of using copyrighted materials in theses

Some of you may be interested in a short case study written by one of our PGRs, Duncan Wright of Archaeology, on trying to deal retrospectively with the issue of obtaining permission to use copyrighted visual material in his thesis: 

http://hdl.handle.net/10036/3690

Like many PhD students, Duncan only became aware of copyright restrictions on reuse of materials towards the end of his research and now has the problem of trying to negotiate permission with multiple copyright holders for content that has become an integral part of his thesis.

He has the option of removing the offending material and submitting a redacted version to our repository (thesis deposit is mandatory) but clearly this will have an impact on how the intellectual value of the study is perceived and is therefore not ideal.

The case study is short enough to give to new or 2nd year PGRs as a warning, so please feel free to reuse!

Posted under Case studies, Copyright, Research

This post was written by Jill Evans on August 13, 2012

A new hand on deck

This week saw the beginning of a new era for the good ship Open Exeter as James Beeson joined the team. He comes into the role of Project Administrator in place of Tutti, who was last seen rowing for shore. He comes from a project background and hopes are high that he will be able to apply his charisma, knowledge and professional experience in full measure. James will be with the team on Mondays and Thursdays. He currently spends Tuesday and Wednesday working in Student Funding.

Outside of this working life he enjoys baseball, beer and writing about himself in the third person.

Posted under News

This post was written by James Nathanael Beeson on August 9, 2012

Open Exeter DAF survey results

We will post more on our survey results shortly but I wanted to get the report out quickly as I think there may be interest in the findings, particularly amongst JISC MRD projects.

You can access the report from ERIC, our repository:
http://hdl.handle.net/10036/3689

It would be great to get some feedback or comments.  Equally, we are happy to answer any queries arising from the report.

Posted under Online survey, Reports

This post was written by Jill Evans on August 8, 2012

DCC Workshops

On the 18th and 19th June 2012, Joy Davidson and Sarah Jones from the Digital Curation Centre delivered training sessions on Research Data Management (RDM) for University of Exeter PGR students and Professional Services Staff.

12 PGRs from diverse disciplines and 21 Professional Services staff, from Research and Knowledge Transfer, the Library and academic colleges and disciplines, attended the events which aimed to give an overview of RDM throughout the research lifecycle.

For many attendees this was the first time that they had received RDM training, and the Open Exeter team were interested in observing the event and receiving feedback from those who attended in order to develop our own Exeter-specific training. For PGR students, this type of general RDM training will sit in the Researcher Development Programme and we hope to be able to offer discipline-specific advice and training through the Colleges and disciplines as well. As one PGR stated, “It would be very useful to receive at least some of this training at a very early stage of the research to ensure that data is appropriately stored and backed up – maybe then being given a reminder and/or more detailed information at a later stage.”

Feedback about the training was positive, and one comment that was reiterated was about the usefulness of a mixed group of attendees from across the University. For PGR students, this led to the discovery that many faced similar RDM issues despite the fact that their areas of study were disparate, for example, how to store their data securely. For Professional Services staff, it was interesting to note how support for researchers comes from various areas of the University, especially in the break-out exercise about who has responsibility for different RDM activities. This is also important as ideally researchers should receive consistent advice on RDM no matter who they ask.

Both PGR students and Professional Services staff have decided to follow up this session with actions of their own. For example, one PGR will try to organise their data better as they go so that it will still make sense at a later date, and various Professional Services staff will feed back to colleagues about the growing importance of good practice in RDM, especially from the point of view of the funding bodies.

Open Exeter will run a follow-up event which will look at University of Exeter RDM solutions and give practical advice to attendees.

We would like to thank the DCC for providing this training. You can find the presentations and materials used in the workshops here and there are many more useful RDM resources on the DCC website.

Posted under Training

This post was written by Hannah Lloyd-Jones on August 7, 2012