Archival Data Quality - Asbjoedt/CLISC GitHub Wiki

Introduction

Archiving preserves data for prosterity. The goal of archiving documents is to take a snapshot of the data as they were at the time of archiving. Then, the documents may at a later time be made accessible to a user, either the original data producer, a researcher or anyone interested, who have legitimate access rights to the documents. Ideally, the document should be as similar as possible to the original document, they were copied from, both in content and structure, but technical changes may be performed by the archive in order to ensure persistent accessibility to the document throughout changing technological landscapes. This may result in data loss, but if so it must be reduced to the minimum and it must be documented.

Documents created as spreadsheets have specific properties and use cases. To ensure the minimum loss of data and a maximum data quality for future reuse, it is necessary to analyze the significant properties of spreadsheets, and from this, compile a list of requirements that determines archival data quality for the preservation of spreadsheets. The list of requirements must be validated against each spreadsheet, which an archive ingests. This section describes the validation needs and archival requirements, as used by CLISC.

Validation of standard

A spreadsheet may be created or changed using software frameworks or by manually editing the XML. This may result in errors related to how the XML are formally allowed to be constructed according to the standard and its official schema files. The schema files guarantee any valid spreadsheet can be opened by a standard spreadsheet reader. Therefore, each spreadsheet must be validated against the file format schemas and if the spreadsheet has any errors, the errors must be corrected.

The two dominant spreadsheet standards, Office Open XML SpreadsheetML and OpenDocument Spreadsheet, should therefore be implemented with validators, and these should preferably be open-source and maintained by a strong consortium of stakeholders.

Archival requirements

Here follows a list of requirements, which CLISC checks for and performs actions to the archived spreadsheet in order to make sure the spreadsheet meets the necessary data quality level for long-term storage. If archiving is not selected when using CLISC, the list of requirements will not be followed.

Must contain data

Spreadsheets containing zero data have no value for later reuse and hence no value for archiving.

Risk: There's two risks for spreadsheets, either having no sheets or no cells with data. Fortunately, Excel does not allow user to save a spreadsheet without any sheets and a COM exception is also thrown when using Excel Interop.

System.Runtime.InteropServices.COMException: 'The project folder must have at least one visible worksheet.

However, other ways of deleting the sheets without using Excel might exist, and secondly in Excel saving a spreadsheet without any cell values is allowed.

Solution: Check for number of sheets (minimum one) and check for presense of cell value in any sheet in the spreadsheet (minimum one).

No password or write protection

Spreadsheets may be password protected at file, workbook, sheet or cell levels. This protection may relate to both read and write or just write. Spreadsheet may be write protected without password at file property level.

Risk: Spreadsheet cannot be opened, if password is lost over time. Spreadsheet cannot be validated or archival data requirements be performed.

Solution: Alert user if spreadsheet cannot be read because of password protection. Spreadsheet should be manually unprotected.

Data connections

Spreadsheets may have data connections to external sources such as other spreadsheets, CSV or XML files or servers.

Risk: The data connections will allow for future updating of the affected cell values, which will compromise the snapshot of data, which is taken at the time of archiving. Documenting the existence of a data connection as metadata is not considered significant for the authentic preservation.

Solution: Remove any data connections from the spreadsheet.

Embedded objects

Spreadsheets may contain other spreadsheets, Word documents, PowerPoint presentations, 3D objects or images and these objects may be in a format, which are not longterm sustainable.

Risk: Embedded objects may be in formats, which an archive does not accept for longterm preservation.

Solution: Alert user of existence of embedded objects, so that the user can manually extract the data from the spreadsheet and store the data separately.

External Relationships

External relationships include linked OLE objects (they are not embedded) and linked cell values fetched from another spreadsheet. Both external relationships relate to files in your local directory, and which relationship will be broken, if the files or spreadsheet is removed from the local directory.

Risk: Data from the context of external relationships will be lost, if a spreadsheet is archived, which typically means removing the spreadsheet from it's local environment parameters if the archive is pursuing a data migration strategy.

Example of data loss

Solution: Remove any external relationships. In the case of linked cell values, keep the actual cell values as the snapshot. In the case of unembedded OLE objects, they should be handled manually and the object should ideally be archived separately and the relationship documented in metadata. Alternative solution is to embed the OLE object.

RTD functions

Microsoft Office Excel provides a worksheet function, RealTimeData (RTD). This function enables you to call a Component Object Model (COM) Automation server to retrieve data in real time.

The RTD function uses the following syntax =RTD(RealTimeServerProgID,ServerName,Topic1,[Topic2], ...)

Source

Risk: When opening an archived spreadsheet many years from now, the server connection might still be available and the values will auto-update. This intervenes with the archival goal of taking snapshots of data at a given time.

Solution: Remove any RTD formula functions in cells in any sheet in a spreadsheet, but keep the actual cell values as the snapshot.

Printer settings

Office Open XML spreadsheets may contain encrypted printer settings. Archived data should not contain any encrypted information. Embedded printer settings are not necessary for printing the spreadsheet.

Risk: The encrypted printer settings may be broken and criminals can get access to confidential printer addresses, which may compromise the network.

Solution: Remove printer settings from spreadsheet.

Absolute path

Office Open XML spreadsheets may contain absolute path to local directory saved in workbook.xml.

Risk: The local directory path may contain private folder names.

Solution: Remove absolute path from workbook.xml.

VBA projects and macros

Spreadsheets may contain Visual Basic for Application code and macros, which makes it possible to run code that changes the content of the spreadsheet or the data stored on the user's computer.

Risk: VBA projects (macros) can have executable code that may cause harm to the user's computer or change the data in the spreadsheet in ways that can break the snapshot of data.

Solution: Convert any spreadsheets saved as .xlsm and .xltx to .xlsx, and thereby removing any VBA projects (macros). For OpenDocument Spreadsheets, it will be necesssary to check each spreadsheet for macros and remove any occurence.

Archival requirements under consideration

Hyperlinks

It is not considered necessary to remove any hyperlinks or automatically convert any hyperlinks to corresponding URLs found in Internet Archive's Wayback Machine, however hyperlinks do constitute an external link to data or metadata, which may be lost over time, if the referenced source is no longer available. It may be possible to reestablish this link, if it is preserved in the Wayback Machine or other internet archives or the data loss may be considered within acceptable ranges in consideration of the amount of work, it may require to preserve the linked data alongside the spreadsheet. Guiding any future user to search for any hyperlinks in public internet archives, is deemed an acceptable approach to solve this issue.

Solution: Alert user of detected hyperlinks, but perform no other action.

File property information

Information such as author, title, keywords, last modified by, superior etc., which is found in the spreadsheet's file property information, may be misleading if the spreadsheet has been reused as a template or extensively overwritten over time by multiple people. It may therefore be relevant to remove this information from the spreadsheet, and the information may be saved to a metadata file, if the archivist may have concerns regarding the documentation of authenticity, or the information may simply be left in place and any future use could involve supplying a guideline on what to be aware of when reusing an archived spreadsheet. Which approach is the most appropriate should be evaluated.

Solution: Do nothing.

Embedded objects

Alternative approaches to the manual conversion of embedded objects could be to leave them in place, remove them or create code, which may automatically convert the embedded objects to static renderings in accepted file formats. Which approach is the most appropriate should be evaluated.

Solution: Alert user of detected embedded objects, but perform no other action.