Zen Archiving: an Open Exeter Case Study in Astrophysics

Posting this on behalf of Tom Haworth. Tom is a 2nd year Postgraduate in Astrophysics and has been commissioned by us to write a case study documenting the process of transferring large amounts of data (TBs) from an HPC system (zen) to the Exeter Data Archive.

We are interested in the whole process – from deciding what to keep and what to delete to data bundling and metadata entry. The Astrophysics Group is using the process to develop policy and guidelines on use of zen to store and manage data.

The following are some initial thoughts on how to kick off the process:

– The archiving process will have to take place from the command line (or a GUI) on zen-viz.
– Tom Haworth will develop a script that takes user-entered metadata, potentially compresses the file, and sends both directly to the archiving server.
– The Open Exeter IT team has sufficient information to perform the archiving server-end work. They are also considering command line retrieval of data.
– The kind of data that we expect to archive is completed models. Necessary software to view the data should be included too.
– An email and wiki entries are all that will be required for training.

Where is the data
Data will be stored on zen in one of /archive/, /scratch/ or /data/. The /archive/ and /scratch/ areas are not under warranty.

What kind of data needs to be archived
There will be a range of data in different file formats, some not seen outside of the astrophysics community. These can be collected and compressed, if not by the user then potentially by the submission script at run-time. Compression is not always worthwhile, so a list of compression-worthy extensions could be stored.
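As a rough illustration of the extension list idea, here is a minimal Python sketch. The extensions in the set are placeholders I have made up for the example; the real list would need to be agreed with zen users, and the function name is hypothetical.

```python
from pathlib import Path

# Hypothetical list of extensions assumed to compress well.
# These entries are illustrative only, not an agreed policy.
COMPRESSION_WORTHY = {".dat", ".txt", ".csv", ".log"}


def worth_compressing(filename):
    """Return True if the file's extension is on the compression-worthy list."""
    return Path(filename).suffix.lower() in COMPRESSION_WORTHY


if __name__ == "__main__":
    for name in ["model_run.DAT", "snapshot.tar.gz", "README"]:
        print(name, worth_compressing(name))
```

Files that are already compressed (a .gz, for instance) are simply left off the list, so the script would send them as they are.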

The data to archive will probably be organised on a model-by-model basis rather than by publication, but publication details will be included in the metadata. This will probably be governed by the size of the files.

Data to be archived should be completed models.

What will happen to the data on zen
This will probably be determined on a case-by-case basis depending on how frequently (if at all) the data is required. Data that has no imminent further use should be removed.

For example, I would be archiving some finished models but may also need them for my thesis.

How might extraction from the archive work from the command line?
– searching could still take place on the web
– extraction would rely on direct communication with the archiving server

Policy for archiving
We should avoid letting any user on zen archive absolutely anything and everything. We need:
– guidelines on what should be archived
– a way to track how much people have been archiving, so that we can communicate with them if it looks like they are abusing it.

Metadata verification for senior users is not required. PhD students could have their submission metadata verified by their supervisor.

Metadata is required to ensure that the data is properly referenced and can be found easily.
Entries are Title, Author, Publisher, Date Issued, URL, Abstract, Keywords, Type etc.

In HPC astrophysics there will likely be additional entries of use such as the code used to generate the data. I suggest using an “Additional Comments” field.

This information will be requested at the command line when archiving.
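A minimal sketch of how the command-line prompting might look, assuming Python is available on zen-viz. The field list follows the entries named above plus the suggested "Additional Comments" field; the function name and the idea of passing per-user defaults as a dictionary are my own assumptions, not a settled design.

```python
# Field names taken from the list above; "Additional Comments" is the
# suggested extra field for HPC-specific details such as the code used.
FIELDS = ["Title", "Author", "Publisher", "Date Issued", "URL",
          "Abstract", "Keywords", "Type", "Additional Comments"]


def collect_metadata(defaults=None, ask=input):
    """Prompt for each field, falling back to a stored default on empty input.

    `defaults` is a hypothetical per-user dictionary (for example, loaded
    from a config file); `ask` can be swapped out for testing.
    """
    defaults = defaults or {}
    metadata = {}
    for field in FIELDS:
        default = defaults.get(field, "")
        prompt = f"{field} [{default}]: " if default else f"{field}: "
        answer = ask(prompt).strip()
        metadata[field] = answer or default
    return metadata
```

Pressing return at a prompt keeps the stored default, which should save retyping things like Author and Publisher for every deposit.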

The archiving procedure on zen
It will be completely impractical to archive the data through the web interface. It will also be impractical to download the data onto a local machine and then archive it (local machines probably will not even have the capacity to store the data). The ideal situation will be one in which data can be archived straight from zen, communicating directly with the storage server and sending the appropriate metadata in addition.

This should happen from the zen visualization node, so as not to grind the login node to a halt.

A simple command line script would be all that is required.

Basic archive script
Read in name of thing to archive
Check the size of the thing to archive
Communicate with the archiving server to check if the quota will be exceeded
If quota not exceeded
    Get metadata from user (some could be stored in a .config file for each user)
    Check if the file extension is in the list of those that are worth compressing
    Compress if worthwhile
    Copy metadata and dataToArchive across to the archiving server
Else
    Tell the user to contact the person responsible for updating quota sizes.
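The outline above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the server end has not been built yet, so the quota check and the transfer are passed in as placeholder functions (quota_remaining and send are hypothetical names), and the extension list is illustrative.

```python
import tarfile
from pathlib import Path

# Hypothetical list of extensions assumed to be worth compressing.
COMPRESSION_WORTHY = {".dat", ".txt", ".csv"}


def size_in_bytes(path):
    """Size of a single file, or the total size of all files under a directory."""
    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def archive(path, get_metadata, quota_remaining, send, workdir):
    """Follow the outline: check quota, get metadata, compress, send.

    get_metadata, quota_remaining and send are placeholders for parts that
    do not exist yet (user prompting and the archiving server interface).
    Returns True if the data was sent, False if the quota would be exceeded.
    """
    path = Path(path)
    if size_in_bytes(path) > quota_remaining():
        print("Quota would be exceeded: please contact the person "
              "responsible for updating quota sizes.")
        return False

    metadata = get_metadata()

    payload = path
    if path.is_file() and path.suffix.lower() in COMPRESSION_WORTHY:
        # Write a gzip-compressed tarball next to other temporary files.
        payload = Path(workdir) / (path.name + ".tar.gz")
        with tarfile.open(payload, "w:gz") as tar:
            tar.add(path, arcname=path.name)

    send(payload, metadata)
    return True
```

Keeping the server calls as arguments means the zen side can be tested on existing data before the server end is ready.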

A GUI version could also be implemented if desired, but would definitely not be necessary for zen.

At present Tom Haworth is going to develop this script and test the procedure on existing data. Pete Leggett of Open Exeter will develop the server-end work.


For zen users, essentially no training will be required. An email to the zen mailing list telling them what they need to do is standard procedure. They can also contact the zen manager if they have trouble. We can also add a section to the zen component of the astrophysics wiki so that there is some permanent documentation.

PGR feedback on data upload

Last week we asked our group of PGRs to test upload of data to the Exeter Data Archive. I was particularly interested in seeing how they would respond to the interface and the metadata web form.

The following are some of the comments that we received – some of these relate specifically to how DSpace works but some are of general interest:

• Add a sentence to the current licence making it clear that depositors can ask to remove their data/outputs.

• It’s important to be able to see inside a zip file.

• How can multiple files be uploaded?

• It would be used more if it were possible to upload from your own drive – drag and drop rather than entering metadata through the web interface.

• A ‘wizard’ like process would be really helpful.

• Would like a template structure for storing previously entered metadata, this could be selected later for further related deposits.

• Keywords – need intuitive text to appear in boxes otherwise will get an inconsistent and inaccurate list of keywords.

• Upload speed – varied between PGRs. Mac users found it much quicker: a 100 MB audio file uploaded in about 30 seconds; a 700 MB file took 20 minutes to upload with a Mac.

• The Submit button needs to be much clearer

• Do you need to login before you upload or could you choose to upload and then have to login – which is better?

• Metadata – people will cut corners if it’s too onerous.

• Would be good to be able to add projects to the hierarchy (i.e., DSpace Communities structure)

• DPA – is it contravening DPA if even an administrator can see sensitive data?

• Data could be encrypted as well as being stored in a ‘dark archive’.

• An upload manager would be a really useful feature – you could queue files for upload and then just leave them.

• Important to add contact details of depositor (PI, etc.), especially email address.

• Clearer help and guidance; make mandatory fields clearer. Title – more specific guidance: is this the title of the deposit or of the depositor?

• Would be useful to have a dropdown list of your previous submissions, you could then choose to link things together (e.g., paper & data), and make the process easier.

• Confused about the difference between date of publication and date of creation – publication is the date it becomes publicly available and is needed by DataCite – but DSpace doesn’t automatically assign this detail to the ‘publication’ field.

• Need a more comprehensive list of data types than default Dublin Core list.
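To put the two upload timings reported above side by side, here is a small worked calculation in Python. It only uses the figures quoted in the feedback and assumes "mb" meant megabytes; the function name is made up for the example.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in megabytes per second."""
    return size_mb / seconds


fast = throughput_mb_per_s(100, 30)        # 100 MB in about 30 seconds
slow = throughput_mb_per_s(700, 20 * 60)   # 700 MB in 20 minutes

print(f"{fast:.2f} MB/s vs {slow:.2f} MB/s, ratio {fast / slow:.1f}x")
# prints "3.33 MB/s vs 0.58 MB/s, ratio 5.7x"
```

So the two reported uploads differ by a factor of nearly six in average speed, which supports the comment that upload speed varied considerably between testers.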

Case study – The Cricket-Tracking Project

Other JISC MRD projects or those working with ‘big data’ may be interested in a case study that has been written for Open Exeter by Dr Jacq Christmas (http://hdl.handle.net/10036/3556).

The case study documents the process of reviewing, preparing, uploading and describing multiple large video files. The project that generated the files is investigating the behaviour of crickets through analysis of thousands of hours of motion-triggered video.

The project is interesting to us for a number of reasons:

• It is a cross-disciplinary/cross-departmental project – these sorts of projects are becoming increasingly common at Exeter and do throw up interesting questions around the area of ‘ownership’
• Huge amounts of data have been and continue to be produced
• Storage is a problem due to the number and size of files – most files are stored on external hard drives held in various places
• As there is no central storage system, secure backup can be a problem
• Ditto secure sharing
• The first batch of video is in a proprietary format that requires specific software in order to be viewable

The case study sets out quite clearly the thought that should be given to selecting and preparing files for upload to a repository. We are looking at how the procedures described can be adapted as templates to guide researchers from other disciplines through the deposit process, some aspects of which will always be generic, for example:

• Listing and explaining the various file formats and how they are related
• Selecting a set of metadata fields to describe the files
• Thinking about the structure of the data in the repository and how it links to related resources, projects and collections

One issue that has arisen from this case study, that we were already well aware of, is the preference to deposit research in a project or research group collection rather than a generic departmental or College collection. In many cases the sense of belonging to or affinity with a group is stronger than departmental ties. This is a tricky one for us: DSpace structure centres on a hierarchy of communities, sub-communities and collections; once these have been set up and start to be populated, it is difficult to make significant changes. Add to that the fact that our CRIS, Symplectic, has been painstakingly mapped across to all our existing communities and collections and any structural changes become even more problematic. For the moment we are looking at a possible metadata solution (dc****.research group ??). I’d be interested to hear how others deal with the research project/group requirement.

We’re about to start a similar test case study with Astrophysics and later in the year with an AHRC-funded project based in Classics and Ancient History. It will be interesting to see if the approaches taken in these areas are significantly different, or given different emphasis.

I won’t say that our first case study has allowed us to resolve the many issues raised yet but we are at least more aware of what is important to researchers and can start to take steps to find solutions.

DSpace – our repository software

We’ve chosen DSpace as the repository to hold our research data. Much of the work to date has revolved around the issue of submitting large datasets to the repository.

We’re looking at using SWORD, and possibly SWORD 2.0. We’ve also taken the opportunity to update our current DSpace installations to the latest version of DSpace (1.8.2) and to switch from a Postgres database to Oracle. This gives us 24/7 support and allows us to use the latest version of SWORD, which only works on DSpace 1.8. We would also have had to upgrade our version of Postgres to use the latest version of DSpace, which contributed to our decision to move to Oracle.

For you techies out there, this process of updating has not been straightforward and is full of pitfalls. There is currently no easy way to clone a Postgres database and create an Oracle version, so it all has to be done by hand to preserve database integrity. However, once the database is switched over, upgrading from DSpace 1.6.2 to 1.7.2 is quite straightforward.

BUT DSpace 1.8.2 has some important differences from 1.7, most notably the splitting of the main dspace.cfg file into several smaller config files.

So, a long-winded process, but we are near the end now. The test DSpace install is fully functional and so we now have the capability to upgrade the live version.

Ian – Technical Developer

Data: An Engineer’s Perspective

This month has been a month of thesis writing. As a disciple of engineering, I have not really considered the output of my hours of writing, producing an accessible front end to my research, as data. However, I am told that it is. Personally, I see data as the pages and pages of numbers produced either through physical experimentation or through a computer program; in a format that generally means nothing to anyone apart from its creator. You could extend this definition to describe the figures, graphs etc. that you derive from these pages of nonsense. But what about the findings, the conclusions, the evaluation…is that also data? Well, I suppose they are. Without these how does anyone know what your data has been used for, or its relevance to current research?

But then you have to also consider the framing: The metadata. How did you get your results? What assumptions did you make? What formulas did you use? All of these give validity to your data and provide valuable information and a platform for its future use. Without this platform, the data is arguably worthless to anyone else.

In the past these thoughts about data have been insignificant to me. But now that I am a PG researcher, I have a responsibility to store and share my data on a professional level. The Open Exeter project has helped me to think about data; how I define it, store it, save it, protect it and, in the end, share it. This level of consideration is being demanded by funding bodies and institutions nationally, who expect that the output of taxpayers’ money is available to all. And as professionals, it is our duty to meet these expectations. And don’t forget, it benefits us in the long run by providing a whole new platform for developing new research using the data provided by others…without all the red tape.

Congratulations Elif, our Kindle winner!

We would like to thank everyone who participated in our online research data management survey and announce that Elif Gozler, a PhD student in the Institute of Arab and Islamic Studies, was the lucky winner of the Kindle in our prize draw.

Elif was randomly picked from the nearly 300 participants in our research data management survey and was presented with the prize today by Afzal Hasan, Subject Librarian for the IAIS and Politics, and Open Exeter’s Holistic Librarian. Elif also enjoyed a cream tea with the Open Exeter team.

Congratulations Elif!


Encryption and synching

As is common for researchers dealing with human research subjects, security of sensitive data is one of my main concerns. While developing my data management plan, it became clear that I would need to encrypt my data storage devices, including my computer and backup external hard drive.
Encryption, put simply, is the conversion of data into a format that requires a key in order to be deciphered. If you lose your laptop, external hard drive or USB key, it would at least make it very hard for the person who finds it to access your data in an intelligible format.
I have been making enquiries about encryption solutions, and found out that the university’s IT department uses TrueCrypt, which is free open-source software compatible with various operating systems, including Windows 7, Vista, XP and Mac OS X. More information and guidelines on TrueCrypt, offered by the University of Exeter’s IT department, can be found at:

Note that if you use a university-loaned laptop, you may have to request approval before encrypting the device. Laptops on loan are not necessarily encrypted, but I was informed that this may change in the future. Very importantly, make sure to remember your key and keep it safe!
Another concern of mine is to ensure that the data temporarily stored on my iPad – which I use as a data collection tool (notes, audio, fieldwork photos) – is secure. General opinion seems to be that an iPad is password protected and therefore relatively secure; I would be interested to hear what others think about this. I have also noticed an option to wipe the data on an iPad/iPhone should the authentication attempts fail 10 times in a row.
I am also currently wondering about the safety of cloud storage services such as Dropbox and iCloud. I have been assured by an IT technician who has read a lot about Dropbox that it can be considered safe. I do not have a problem with putting my own documents in there… But what about sensitive research data? I would be interested to know what others make of cloud storage.

I am a bit of a research data hoarder. I am so scared of losing my preciously gathered data that I tend to save it everywhere, with different backups. That is fine if your devices are properly protected, but less so if, like me, you perpetually have to manually update documents stored in multiple locations. Aside from the time-consuming nature of this process, I have experienced a few instances in the past of forgetting to update a backup file, and thus losing track of the most up-to-date version of my document. With my current project, I have decided to investigate ways of automating the synching process, to make sure my files are saved and backed up, saving me time and avoiding missed updates. There is a simple solution that has been discussed by another member of the Open Exeter project in his private blog, which is SyncToy. This same free software solution was also recommended by a technician from the University’s IT department. All I need to do now is go on a data/folder cleaning spree, properly plan out my storage destinations and process, and get the synching going. It will take a bit of time, but as with many things, investing the energy now might save me much more in the future.
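For anyone curious what an automated one-way sync boils down to, here is a minimal Python sketch. This is not how SyncToy works internally (I do not know that); it is just an illustration of the general idea, with a made-up function name, that copies any file which is missing from the backup or newer than the backup copy.

```python
import shutil
from pathlib import Path


def sync_one_way(source, backup):
    """Copy files that are missing from, or newer than, their copy in backup.

    Illustrative only: nothing is ever deleted from the backup, and the
    backup folder is assumed to live outside the source folder.
    """
    source, backup = Path(source), Path(backup)
    copied = []
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        dest = backup / src.relative_to(source)
        if not dest.exists() or src.stat().st_mtime > dest.stat().st_mtime:
            dest.parent.mkdir(parents=True, exist_ok=True)
            # copy2 also copies the modification time, so an unchanged
            # file should not be copied again on the next run.
            shutil.copy2(src, dest)
            copied.append(dest)
    return copied
```

Running something like this on a schedule would remove the manual step that caused me to lose track of versions, although a dedicated tool is likely to handle edge cases (renames, deletions, conflicts) that this sketch ignores.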

Gadgets for Research: Tech Review

The Open Exeter project team recently introduced us PGRs to an interesting gadget that might start appearing on researchers’ wish lists. It’s called the Livescribe Echo Smartpen and I’ve been given one to test drive for a few weeks.

It’s really a very simple concept; whatever you write with the smartpen is converted to typed text. Sounds perfect for fieldwork when it might be difficult to carry a laptop with you. All you do is make fairly legible handwritten notes in a standard-sized notebook, which comes with the pen. This is then stored on the pen and can be downloaded and converted to text on your computer later. You can even record audio notes on the pen to go along with what you’re writing.
I’ve been using the pen for the past few days, which has meant I can work outside and take advantage of the great weather. The verdict so far is 10/10. It’s not much bigger than a standard pen but this amazing gadget can hold pages of handwritten notes for conversion to text later and copes remarkably well with my less-than-ideal handwriting.

All in all, the Echo Smartpen is a convenient way of creating research data such as field notes, which could save researchers a considerable amount of time and effort especially if they need to digitise their notes for archiving. However, at around £100 it might not be the top priority for researchers with a limited expenses budget.

Check back later for more Gadgets for Research blog posts including a review of dictation software and iPad apps.

Discuss Debate Disseminate

The Open Exeter project is pleased to invite all UoE researchers to Discuss Debate Disseminate: A discussion of the issues around the management of your research materials and data and an opportunity to network with other researchers. PhD students and early career researchers from all disciplines welcome.

The event will take place on 22nd June 09:00 – 12:30 in the Upper Lounge of Reed Hall on the Streatham campus.


09:00 – 09:15: Arrival coffee/tea

09:15 – 09:30: Welcome

09:30 – 10:30: Session 1: Delete, Keep or Share?: Each researcher brings one example of research material or data (this could be, for example, in electronic or paper format).  In groups you will describe your research material or data briefly before discussing whether you would delete it, keep it, or share it, and why.

10:30-10:45: Coffee/tea break

10:45-11:30: Session 2: “Speed data dating”: Meet and get to know other researchers and the issues that they face with their research materials. Are there any common problems or solutions?

11:30-12:15: Session 3: PhD student panel session: Open Exeter PhD students answer your research materials management questions.

12:15-12:30: Feedback and Close

Please register for the event via email to 

For event details see: https://www.facebook.com/events/407590612590904/

