Open Research Exeter Launch!

This week is a very busy one for us! Our Open Access Research and Research Data Management Policy for researchers goes before Senate on Thursday, and on Friday we are celebrating the achievements of the Open Exeter project and launching our newly rebranded repository, Open Research Exeter (ORE), for research papers, research data and theses.

You may remember the competition to rename the institutional repository which was part of our Open Access Week 2012 (see poster). We received a total of 57 entries from staff and students from all areas of the University and eventually decided on Open Research Exeter (ORE). Our competition winner, Katie Kelsey, a Temporary Research Fellow, suggested One Research Exeter; we adapted her suggestion slightly to incorporate the concept of making Exeter’s research open and available to the public, including researchers across the globe. Congratulations, Katie – we hope you are enjoying your Kindle!

On Friday 22nd March, the Open Exeter team will be in the Forum Street on Streatham Campus from 10:00 – 12:00 to answer questions about the project and any other research data management and open access queries you may have. You will have the chance to see an ORE demo and talk to those who developed the repository, as well as other staff from RKT, Exeter IT and the Library who support research at Exeter. Come along and talk to us and you may get a free fairy cake! You can even join the event on Facebook!

This will be followed by a celebratory lunch for stakeholders and others from around the University who have supported the Open Exeter project since its start in October 2011. As we draw towards the end of the project, we hope to continue to work closely with researchers and others from around the University to ensure that our project outputs are sustainable in the long term. In the meantime we will be making sure all is ready for ORE’s launch on Friday!

If you have any questions about ORE, please contact the Open Exeter team.


Posted under Exeter Data Archive, News, Open Access, Research, Technical development

This post was written by Hannah Lloyd-Jones on March 19, 2013

DSpace submission using Globus and SWORD2 – Update

We’ve made huge progress on our submission tool recently. We now have a prototype web app that collects metadata from users and uses Globus to transfer the files to our Atmos storage facility before submitting them to DSpace.

I demonstrated this at the IDCC (International Digital Curation Conference) in Amsterdam last week and found that many other delegates were either interested in using Globus or already had it at their organisation, but didn’t have it hooked into DSpace. Globus allows the user to create ‘endpoints’: data locations, such as your laptop or PC, that you can then transfer files to and from. The transfer happens asynchronously, and as long as the endpoint hardware is on with Globus running, it will eventually complete and the entry will be submitted to DSpace.

All of this is driven from the web app via an API, with the deposit to DSpace made as an Atom request via SWORDv2. We have also integrated our Single Sign-On (SSO) service.
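As a flavour of the SWORDv2 leg, here is a minimal sketch using the python-sword2 client library. The URL, credentials and metadata are all illustrative, and remember that in our setup the data files themselves travel via Globus, so only the Atom metadata entry goes over SWORD:

    from sword2 import Connection, Entry

    # Fetch the service document to discover the collections we may deposit into
    conn = Connection("https://repository.example.ac.uk/swordv2/servicedocument",
                      user_name="depositor", user_pass="secret")
    conn.get_service_document()

    # Atom entry carrying the metadata the web app collected
    entry = Entry(title="Example dataset",
                  dcterms_abstract="Files transferred to Atmos via Globus")

    # Deposit into the first collection of the first workspace (illustration only)
    col_iri = conn.sd.workspaces[0][1][0].href
    receipt = conn.create(col_iri=col_iri, metadata_entry=entry)
    print(receipt.location)  # where the new item lives in the repository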

We hope to have a finished prototype next month and aim to share our DSpace development with the wider community.


Posted under Technical development

This post was written by Ian Wellaway on January 25, 2013

DSpace authentication with Single Sign-On

We’ve made a breakthrough recently by implementing a custom SSO authentication module in one of our test DSpace repositories.

Essentially, this means that users at the University of Exeter will be able to log in to DSpace using the institutional SSO service. This can be done either at the DSpace website or beforehand on another application hosted by the University that also uses the SSO service.

Once logged in via SSO, DSpace picks up the username that is passed through automatically and logs the user in. So if a user has already logged in to the MyExeter portal and then goes to the DSpace repository, they will already be logged in and needn’t type in their details again.

On the technical side, this is done in DSpace by creating a new custom authentication module and adding it to the authentication stack. We took the current LDAP code and amended it to ignore anything passed in and instead grab the authenticated user ID from the HttpServletRequest object in the Java code using request.getRemoteUser(). However, this only works if you first add tomcatAuthentication=false to the AJP connector in Tomcat’s server.xml config file.
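For reference, this is roughly what the amended AJP connector looks like in Tomcat’s conf/server.xml. The port, protocol and redirectPort values shown are Tomcat’s defaults; tomcatAuthentication="false" is the one attribute added, and it is what makes the web server’s authenticated user visible to request.getRemoteUser():

    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443"
               tomcatAuthentication="false" />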

Once we have a finished module (it still needs a little bit of work) I’ll submit it to DSpace itself so the developers can use it as a starting point for their own DSpace SSO authentication.

Posted under Technical development

This post was written by Ian Wellaway on December 3, 2012

Big data submission tool demo at the 2013 IDCC in Amsterdam

Open Exeter hopes to demonstrate its submission prototype at the International Digital Curation Conference in Amsterdam in January 2013:


Managing Research Data

Submitting BIG data to a DSpace repository

Open Exeter project, University of Exeter, UK

DSpace comes readily equipped with its own ‘out of the box’ submission tool, which works well with small files and small numbers of files – but how do researchers upload their precious large datasets?

The UK JISC-funded Open Exeter project set out to understand how researchers at the University of Exeter manage their data.

As part of the project, researchers were surveyed about the amount of data they stored and how they stored it, particularly once a project was finished. It was found that some research projects produced huge numbers of files, some with massive file sizes, and that these were often archived on local hard disks and external drives. Submitting these datasets to our DSpace-based institutional repository is not practical using the out-of-the-box DSpace submission tool, since it limits the user to uploading one file at a time over HTTP while the user waits. In addition, transferring large files via such methods can be slow and prone to failure. DSpace does also support command-line batch import of files, provided they can be transferred successfully via some other means. SWORD provides a standardised way of interfacing with repositories, including DSpace, but currently remains limited in its ability to transfer large files reliably.

To overcome these limitations, Open Exeter is developing its own submission tool using elements of the SWORD protocol combined with the leading research data transfer service Globus. SWORD allows us to query the repository to determine which collection the user is allowed to submit to and what sort of metadata is needed. The tool then gathers the metadata and data locations from the user before scheduling the transfer to the repository. This method works irrespective of the volume of data and its location, while remaining secure, fast and resilient: if a transfer fails, it can be restarted automatically from the point of failure. Globus provides a unique reference number to track progress and determine completion, allowing subsequent submission via batch import to the repository.

Using these technologies, Open Exeter is working towards a solution that will allow researchers to upload their data quickly and securely, and will be giving a demonstration of its prototype.
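To make the shape of that workflow concrete, here is a deliberately simplified Python sketch of the pipeline the abstract describes. Every name, path and identifier is illustrative, and the asynchronous Globus tracking is collapsed into a blocking call to the GridFTP command-line client for brevity:

    import subprocess

    def submit_dataset(source_url, dest_url, import_dir):
        # 1. Transfer the data with GridFTP; -rst retries a failed
        #    transfer rather than abandoning it
        subprocess.check_call(["globus-url-copy", "-rst", source_url, dest_url])

        # 2. Once the transfer is known to be complete, batch-import the
        #    item into DSpace (eperson, collection handle and paths made up)
        subprocess.check_call(["/dspace/bin/dspace", "import", "-a",
                               "-e", "depositor@example.ac.uk",
                               "-c", "123456789/1",
                               "-s", import_dir, "-m", "mapfile"])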

Posted under Technical development

This post was written by Ian Wellaway on October 19, 2012

Research Data Backup with CrashPlan

We started an evaluation/trial of CrashPlan Enterprise as a backup solution for research data earlier in the year.

This trial finished a few months back, but I just wanted to document some key results from this work.

  • CrashPlan is a solution for personal data backup across a range of platforms – Mac, Linux and Windows. It consists of a Java-based client and, for the Enterprise version, a Java-based server product.
  • During the evaluation I believe CrashPlan performed very well in terms of installation simplicity, configurability and ongoing administration. CrashPlan allows responsibility for backups to be devolved to each college, with a fully developed administrative role model.
  • The trial enabled us to evaluate it fully as a potential research data backup solution for use with our EMC Atmos. In our trial, the CrashPlan server software was installed on all the Atmos IFS servers, with a management GUI installed on a separate server. The Atmos IFS servers mount the Atmos object space as a file system, and this allowed us to offer Atmos space to CrashPlan as a normal file system.
  • Initial results were good: backups occurred at a reasonable speed. However, after a period, backup failures started to occur. It became apparent that this is because a CrashPlan server must maintain its backup files periodically, and during this time a backup cannot take place. This maintenance consists of many small I/O operations on the backup files, and because those files are actually stored on Atmos as objects, the high per-operation latency makes the maintenance operation too slow to sustain a reliable backup service (see the illustrative sums after this list). The CrashPlan backup files were remaining in maintenance mode for extremely long periods (days/weeks).
  • There is no workaround for this fundamental characteristic of high start-up latency for Atmos I/O.
  • Consequently I cannot recommend CrashPlan, with Atmos as the back end, as a solution for research data backup.
  • I would, however, highly recommend CrashPlan as a very good solution for general research data backup if we could provide a central backup storage file system with low-latency I/O.
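To give a feel for why the maintenance behaviour was fatal, here is a purely illustrative back-of-the-envelope sum (both numbers are assumptions for illustration, not measurements from the trial): 1,000,000 small maintenance I/Os at ~50ms of start-up latency per Atmos object access comes to around 50,000 seconds, or roughly 14 hours, for a single maintenance pass. With backups blocked for the duration, passes stretching to days or weeks leave no usable backup window.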


Posted under Technical development

This post was written by Peter Leggett on October 10, 2012

Update on technical development of our DSpace submission tool

Just uploaded a new technical report by Ian Wellaway to our repository: http://hdl.handle.net/10036/3847

This report reviews the approaches identified earlier this year as possible solutions to the ‘big data’ upload issue: using the default DSpace upload tool; using third-party software and tools; and developing a bespoke solution for Exeter.

Ian outlines the development work that has been done in these areas and the outcomes. For a time we have been developing two prototypes concurrently – one that could, ideally, be easily reused by other HEIs, and a more bespoke tool catering for Exeter’s specific needs but with less cross-institutional transferability.

Various tools and applications are evaluated and discussed: SWORD, sworduploader, EasyDeposit and Globus FTP.

We hope this will be of interest to other MRD projects and more widely.

Posted under Big Data, Reports, Technical development

This post was written by Jill Evans on October 2, 2012


Prototype No. 2 – SWORD, Globus FTP and importing by reference

In parallel with our development using SWORD to try to solve the issue of uploading large research data to DSpace, we’ve been working on another prototype which uses a bit of SWORD along with an FTP client (Globus) and DSpace’s own command-line script for importing data into the archive.

The process works like this:

1) The user uploads their research data from their own device (PC, iPad, etc.) to Atmos (remember we have this storage facility) using the Globus FTP client. We’ve played about with Globus and have initially created a script to sort out the certificate setup that Globus requires to transfer files between locations. Globus has many advantages here: in the event of any stoppage, the upload should resume rather than start again.

2) Once the upload is complete, the user uses a GUI that is built from the input-forms.xml file on the DSpace server to enter metadata and choose the files on the server that they have just uploaded. They can then submit to the archive.

Behind the scenes, DSpace is creating the set of files required for the command-line ‘import’ script to submit the files by reference to the archive. In this way, the final stage works much more quickly.
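For anyone unfamiliar with the import script, it consumes DSpace’s Simple Archive Format: one directory per item. A minimal sketch of what gets generated behind the scenes – all paths and the assetstore number are illustrative, and the -r/-s/-f line uses DSpace’s bitstream registration syntax, which is what lets an item reference files already sitting in the assetstore instead of re-uploading them:

    archive/item_000/
        dublin_core.xml   (the metadata entered in the GUI)
        contents          (one line per file, e.g.  -r -s 1 -f data/big_dataset.dat)

    /dspace/bin/dspace import -a -e depositor@example.ac.uk -c 123456789/1 -s archive -m mapfile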

As you can see, this is a two-stage process: the user uploads the files and then submits them to the repository. However, we have ideas to allow the user to do everything at the upload stage, and if the file sizes are not so large, there is a part of the GUI for simply uploading files straight to DSpace as in a normal submission.

And where does SWORD come in? We use it to get the service document.

Testing continues…


Posted under Technical development

This post was written by Ian Wellaway on September 14, 2012

Using SWORD v2 to upload large data files

Our project uses DSpace as its online repository, with Atmos storing the actual files. This gives us a very large storage area for researchers to submit their research data to.

However, while the DSpace submission interface works well with small files, it struggles a bit with larger ones, or with multiple files.

So, to combat this, we’ve been looking at creating a new SWORD-based submission tool. We started by exploring a Python script created by QMUL, ‘sworduploader’.

This works well for small files and multiple files, since it effectively zips up the files, submits them to DSpace using the SWORD service document and then unzips them. However, as we developed the script to work with larger data we started to uncover some issues.

1) Python’s default zip handling uses 32-bit sizes, so it started to fail when attempting to zip up anything outside the 32-bit range (4GB). We solved this by forcing Python to use the ZIP64 extension, giving us a much larger range of file sizes.
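For anyone hitting the same wall, the fix is essentially a one-liner with the standard library’s zipfile module (filenames here are illustrative):

    import zipfile

    # Without allowZip64=True, zipfile fails on archives or members that
    # exceed the 32-bit size fields of the original ZIP format
    with zipfile.ZipFile("deposit.zip", "w",
                         compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        zf.write("large_dataset.dat")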

2) The DSpace XMLUI Cocoon module (in core.properties) contains an integer maximum upload file size. This is currently set to just over 2GB – presumably a signed 32-bit limit (2^31 − 1 = 2,147,483,647 bytes) – and cannot be increased. We’re working on this.

http://permalink.gmane.org/gmane.comp.db.dspace.user/15901

And we look forward to…

3) HTTP is not a very trustworthy protocol for uploading data across the web. What happens if the submission of a large file is interrupted?

We are also working on a bespoke Java-based tool for submitting files to the repository. This uses DSpace’s ‘batch import’ script and seems to work well. However, we’d like to use SWORD, and we don’t want to have to create separate submission tools for different kinds of data, since this would confuse the user; we’d like to make it all work with one interface.

So, we persevere.

Posted under Technical development

This post was written by Ian Wellaway on August 28, 2012

Technical Update

In the technical part of the project, we have now upgraded our live and test DSpace installations to the latest version (1.8.2) and switched to Oracle 11g (our previous version of Postgres needed to be upgraded, and support at the University is much better for Oracle). This has let us move on to developing a submission tool that can cope with some quite heavy research data loads.

The most important scenarios from our perspective are:

  1. How to get large data into DSpace (gigabytes and possibly even terabytes)
  2. How to submit data composed of many different files (without zipping them up first)

Feedback from researchers gathered by our colleagues in the library has shown these issues to be very important – and, critically, the ‘out of the box’ DSpace submission feature does not handle them very well.

To combat these, we are looking at developing two different prototype submission tools:

  1. A SWORD-based submission tool using Python
  2. A submission tool using the SWORD service document but then submitting via the DSpace command-line import script

By developing each tool in parallel we hope to determine which works best. So far we have found that while the DSpace command-line tool can submit large files (we successfully submitted a 6GB piece of data), the SWORD tool has hit upon some issues.

However, we hope eventually to have at least one fully working solution, if not two, that can be used to submit data of any shape or size.

Posted under Big Data, Technical development

This post was written by Ian Wellaway on July 31, 2012

PGR feedback on data upload

Last week we asked our group of PGRs to test uploading data to the Exeter Data Archive. I was particularly interested to see how they would respond to the interface and the metadata web form.

The following are some of the comments that we received – some of these relate specifically to how DSpace works but some are of general interest:

• Add a sentence to the current licence making it clear that depositors can ask to remove their data/outputs.

• It’s important to be able to see inside a zip file.

• How can multiple files be uploaded?

• It would be used more if it were possible to upload from your own drive – drag and drop rather than entering metadata through the web interface.

• A ‘wizard’ like process would be really helpful.

• Would like a template structure for storing previously entered metadata; this could be selected later for further related deposits.

• Keywords – intuitive text needs to appear in the boxes, otherwise the result will be an inconsistent and inaccurate list of keywords.

• Upload speed – varied between PGRs; Mac users found it much quicker – a 100MB audio file uploaded in about 30 seconds; 700MB took 20 mins to upload with a Mac.

• The Submit button needs to be much clearer.

• Do you need to log in before you upload, or could you choose to upload and then log in afterwards – which is better?

• Metadata – people will cut corners if it’s too onerous.

• Would be good to be able to add projects to the hierarchy (i.e., DSpace Communities structure)

• DPA – would it contravene the Data Protection Act if even an administrator can see sensitive data?

• Data could be encrypted as well as being stored in a ‘dark archive’.

• An upload manager would be a really useful feature – you could queue files for upload and then just leave them.

• Important to add contact details of depositor (PI, etc.), especially email address.

• Clearer help and guidance; make mandatory fields clearer. Title – more specific guidance: is this the title of the deposit or of the depositor?

• Would be useful to have a dropdown list of your previous submissions, you could then choose to link things together (e.g., paper & data), and make the process easier.

• Confused about the difference between date of publication and date of creation – publication is the date it becomes publicly available and is needed by DataCite – but DSpace doesn’t automatically assign this detail to the ‘publication’ field.

• Need a more comprehensive list of data types than the default Dublin Core list.

Posted under Big Data, Metadata, Technical development

This post was written by Jill Evans on May 31, 2012
