Update on technical development of our DSpace submission tool

Just uploaded a new technical report by Ian Wellaway to our repository: http://hdl.handle.net/10036/3847

This report reviews the approaches that were identified earlier this year as possible solutions to the ‘big data’ upload issue: using the default DSpace upload tool; using third-party software and tools; developing a bespoke solution for Exeter.

Ian outlines the development work that has been done in these areas and the outcomes. For a time we have been developing two prototypes concurrently – one that could, ideally, be easily reused by other HEIs, and a more bespoke tool catering for Exeter’s specific needs but with less cross-institutional transferability.

Various tools and applications are evaluated and discussed: SWORD, sworduploader, EasyDeposit and Globus FTP.

We hope this will be of interest to other MRD projects and the wider community.

Posted under Big Data, Reports, Technical development

This post was written by Jill Evans on October 2, 2012


Zen Archiving: an Open Exeter Case Study in Astrophysics

Posting this on behalf of Tom Haworth. Tom is a second-year postgraduate in Astrophysics and has been commissioned by us to write a case study documenting the process of transferring large amounts of data (TBs) from an HPC system (zen) to the Exeter Data Archive.

We are interested in the whole process – from deciding what to keep and what to delete to data bundling and metadata entry. The Astrophysics Group is using the process to develop policy and guidelines on use of zen to store and manage data.

The following are some initial thoughts on how to kick off the process:


Summary:

– The archiving process will have to take place from the command line (or a GUI) on zen-viz.
– Tom Haworth will develop a script that takes user-entered metadata, potentially compresses the file, and sends both directly to the archiving server.
– The Open Exeter IT team has sufficient information to perform the archiving server-end work. They are also considering command line retrieval of data.
– The kind of data that we expect to archive is completed models. Necessary software to view the data should be included too.
– Email and WIKI entries are all that will be required for training.

Where is the data?
Data will be stored on zen in one of /archive/, /scratch/ or /data/. /archive/ and /scratch/ are not under warranty.

What kind of data needs to be archived?
There will be a range of data in different file formats, some not seen outside the astrophysics community. These can be collected and compressed, if not by the user then potentially by the submission script at run-time. Compression is not always worth doing, so a list of compression-worthy extensions could be stored.
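One way that stored list of compression-worthy extensions might work, sketched in Python (the extension set here is purely illustrative, not an agreed choice):

```python
import os

# Illustrative list only; the real set of compression-worthy extensions
# would be decided by the Astrophysics Group.
COMPRESS_WORTHY = {".dat", ".txt", ".csv", ".fits"}

def worth_compressing(filename):
    """Return True if the file's extension is on the compress-worthy list."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in COMPRESS_WORTHY
```

Already-compressed formats (.gz, .zip and the like) would simply be left off the list.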

The data will probably be archived on a model-by-model basis rather than a publication-by-publication basis, but publication details will be included in the metadata. This will probably be governed by the size of the files.

Data to be archived should be completed models.

What will happen to the data on zen?
This will probably be determined on a case-by-case basis depending on how frequently (if at all) the data is required. Data that has no imminent further use should be removed.

For example, I would be archiving some finished models but may also need them for my thesis.

How might extraction from the archive work from the command line?
– searching could still take place on the web
– extraction would rely on direct communication with the archiving server
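As a sketch of how command-line extraction might look, assuming archived items remain addressable by a persistent handle URL (as DSpace items are); the URL pattern and wget flags here are assumptions, not a settled interface:

```python
# Hypothetical: build a shell command that fetches an archived item by its
# handle. Assumes the archive serves item content via handle URLs, which is
# still to be confirmed with the Open Exeter IT team.

def retrieval_command(handle, dest):
    """Return a wget command (as an argument list) for one archived item."""
    url = "http://hdl.handle.net/" + handle
    return ["wget", "--content-disposition", "-P", dest, url]
```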

Policy for archiving
We should avoid letting any user on zen archive absolutely anything and everything. We need:
– guidelines on what should be archived
– tracking of how much each user has been archiving, so that we can contact anyone who appears to be abusing the system

Metadata verification for senior users is not required. PhD students could have their submission metadata verified by their supervisor.

Metadata
Metadata is required to ensure that the data is properly referenced and can be found easily.
Entries are Title, Author, Publisher, Date Issued, URL, Abstract, Keywords, Type etc.

In HPC astrophysics there will likely be additional entries of use such as the code used to generate the data. I suggest using an “Additional Comments” field.

This information will be requested at the command line when archiving.

The archiving procedure on zen
It will be completely impractical to archive the data through the web interface. It will also be impractical to download the data onto a local machine and then archive it (local machines probably will not even have the capacity to store the data). The ideal situation will be one in which data can be archived straight from zen, communicating directly with the storage server and sending the appropriate metadata in addition.

This should happen from the zen visualization node, so as not to grind the login node to a halt.

A simple command line script would be all that is required.

Basic archive script
Read in the name of the thing to archive
Check the size of the thing to archive
Communicate with the archiving server to check whether the quota would be exceeded
If quota not exceeded
    Get metadata from the user (some could be stored in a .config file for each user)
    Check if the file extension is in the list of those worth compressing
    Compress if worthwhile
    Copy the metadata and dataToArchive across to the archiving server
Else
    Tell the user to contact the person responsible for updating quota sizes
End
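To make the outline concrete, here is a runnable Python sketch of the same flow. The quota check and the copy to the archiving server are passed in as stubs (quota_ok, send) because that server interface is still being designed, and the extension list is illustrative:

```python
import gzip
import os
import shutil

# Illustrative list of compression-worthy extensions; not a project decision.
COMPRESS_WORTHY = {".dat", ".txt", ".csv"}

def archive(path, metadata, quota_ok, send):
    """Archive one file: check quota, compress if worthwhile, then send.

    quota_ok(size) and send(path, metadata) stand in for the archiving
    server interface, which is still to be defined.
    """
    size = os.path.getsize(path)
    if not quota_ok(size):
        raise RuntimeError("Quota would be exceeded; contact the person "
                           "responsible for updating quota sizes.")
    ext = os.path.splitext(path)[1].lower()
    if ext in COMPRESS_WORTHY:
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path = gz_path
    send(path, metadata)
    return path
```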

A GUI version could also be implemented if desired, but would definitely not be necessary for zen.

At present Tom Haworth is going to develop this script and test the procedure on existing data. Pete Leggett of Open Exeter will develop the server-side component.

Training

For zen users, essentially no training will be required. An email to the zen mailing list telling them what they need to do is standard procedure. They can also contact the zen manager if they have trouble. We can also add a section to the zen component of the astrophysics WIKI so that there is some permanent documentation.

Posted under Big Data, Case studies

This post was written by Jill Evans on May 31, 2012
