Checksum Verification

Strategy

The fastest way to verify that all of the files from Scholarsphere 3 were correctly migrated is to compare the etag calculated by Amazon's S3 service with the original md5 checksum that Fits calculated when the file was added to Scholarsphere 3. For files uploaded to S3 in a single part, the etag is simply the file's md5 hex digest, so the two values can be compared directly; multipart uploads are handled separately below.

While the etag verification process can be used for other purposes, it will not be the only checksum verification method in Scholarsphere. We will need to create separate checksums, such as sha256, and store those with the file's metadata for future reference.
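As a rough sketch of how that could work, Ruby's standard Digest library can stream a sha256 from the file on disk. The sha256_for helper below is hypothetical; it reuses the FileSetDiskLocation class that appears later on this page:

require 'digest'

# Hypothetical helper: stream a sha256 digest from the file set's file on disk
def sha256_for(file_set)
  location = FileSetDiskLocation.new(file_set)
  Digest::SHA256.file(location.path).hexdigest
end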

Getting Fits Checksums

Find Deleted Files

# Mark file sets whose Fedora objects are gone so later passes skip them
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    FileSet.find(resource.pid)
  rescue Ldp::Gone
    resource.update(exception: 'Ldp::Gone', error: nil)
  end
end

Find Blank Files

These are files that were uploaded via Box but were never ingested due to timeout issues.

# Flag file sets that have no attached binary content
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    resource.update(exception: 'ArgumentError', error: 'original_file is nil') if FileSet.find(resource.pid).original_file.nil?
  rescue StandardError => e
    puts "Failed to update #{resource.pid}: #{e.message}"
  end
end

MD5 Checksums

# Print the md5 that Fits recorded for each remaining file set
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  puts FileSet.find(resource.pid).original_checksum
end

Get Etag

require 'json'
require 'net/http'
require 'uri'

class EtagError < StandardError; end

# Query the Scholarsphere 4 GraphQL API for the file's etag and store the raw response
def update_file_set(resource)
  url = URI('https://scholarsphere.psu.edu/api/public')

  https = Net::HTTP.new(url.host, url.port)
  https.use_ssl = true

  request = Net::HTTP::Post.new(url)
  request['x-api-key'] = ENV['SS4_API_KEY']
  request['Content-Type'] = 'application/json'
  request.body = {
    query: "{ file(pid: \"#{resource.pid}\") { etag } }",
    variables: {}
  }.to_json

  result = https.request(request)

  resource.update(client_status: result.code, client_message: result.read_body)
rescue StandardError => e
  resource.update(exception: 'EtagError', error: e.message)
end

Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL and client_status IS NULL').each do |resource|
  update_file_set(resource)
end
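When the request succeeds, client_status is "200" and the stored response, exposed as a parsed hash via resource.message in the comparison step below, looks something like this (the etag value is purely illustrative):

resource.message
#=> {"data"=>{"file"=>{"etag"=>"9b2cf535f27731c974343645a3985328"}}}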

Compare Checksums

See https://teppen.io/2018/06/23/aws_s3_etags/

# Compare the md5 recorded by Fits with the etag returned from the Scholarsphere 4 API
def check_md5(pid:, etag:)
  file_set = FileSet.find(pid)

  if file_set.original_checksum.first == etag
    "passed"
  else
    "failed"
  end
rescue StandardError
  "unknown"
end

report = {}

Scholarsphere::Migration::Resource.where(model: 'FileSet').each do |resource|
  if resource.client_status == "200"
    etag = resource.message.dig('data', 'file', 'etag')

    if etag.nil?
      report[resource.pid] = "Etag is missing. It's been removed from Scholarsphere 4"
    elsif etag.match?('-')
      # Etags containing a dash are composite multipart checksums; these are handled separately below
      report[resource.pid] = "Skipped"
    else
      report[resource.pid] = check_md5(pid: resource.pid, etag: etag)
    end
  else
    report[resource.pid] = "#{resource.exception}: #{resource.error}"
  end
end
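To get a quick summary of where things stand, the outcomes can be tallied (Enumerable#tally requires Ruby 2.7 or later):

# Counts of each outcome, e.g. {"passed"=>..., "failed"=>..., "Skipped"=>...}
report.values.tally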

Re-calculate failed checksums

When the md5 checksum from the Fits report does not match the checksum in Scholarsphere 4, we re-calculate the checksum from the existing file in Scholarsphere 3. The mismatch most likely occurred because a newer version of the file was uploaded, but Fits was never re-run to record the updated checksum.

require 'digest'

report.select { |_pid, result| result == "failed" }.keys.each do |pid|
  resource = Scholarsphere::Migration::Resource.find_by(pid: pid)
  etag = resource.message.dig('data', 'file', 'etag')
  file_set = FileSet.find(pid)
  location = FileSetDiskLocation.new(file_set)
  # Digest::MD5.file streams the file rather than loading it all into memory
  md5 = Digest::MD5.file(location.path).hexdigest

  if etag == md5
    report[resource.pid] = "passed"
  else
    report[resource.pid] = "failed"
  end
end

Calculate S3 Etags for large files

When a file is uploaded to S3 in multiple parts, the etag is not the md5 of the whole file; it is a composite checksum built from the md5 digests of the individual parts. In this case, we calculate the composite etag locally and compare it to the original from Amazon.

See https://github.com/antespi/s3md5 for the script that calculates the S3 etag.
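Per the teppen.io post above, the composite etag is the md5 of the concatenated binary md5 digests of each part, followed by a dash and the part count. A minimal Ruby sketch of the same calculation, assuming 50 MB parts to match the s3md5 invocation below:

require 'digest'

def multipart_etag(path, part_size: 50 * 1024 * 1024)
  digests = []
  File.open(path, 'rb') do |file|
    while (chunk = file.read(part_size))
      digests << Digest::MD5.digest(chunk)
    end
  end
  # md5 of the concatenated part digests, plus the number of parts
  "#{Digest::MD5.hexdigest(digests.join)}-#{digests.count}"
end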

Note: This step was originally set to look for "Skipped" values, but an earlier iteration incorrectly marked those results "failed", so we went back and re-checked only the remaining failed checks.

require 'open3'

failed = report.select { |_pid, result| result == "failed" }.keys
total = failed.count
counter = 1
failed.each do |pid|
  resource = Scholarsphere::Migration::Resource.find_by(pid: pid)
  etag = resource.message.dig('data', 'file', 'etag')
  file_set = FileSet.find(pid)
  location = FileSetDiskLocation.new(file_set)
  # s3md5 recomputes the multipart etag with 50 MB parts and exits non-zero on a mismatch
  command = "./s3md5 --etag #{etag} 50 #{location.path}"
  print "Checking #{counter}/#{total}..."
  _stdout, _stderr, status = Open3.capture3(command)

  if status.success?
    report[resource.pid] = "passed"
  else
    report[resource.pid] = "failed"
  end
  puts "done!"
  counter += 1
end

Outcomes

There are three scenarios:

Initial md5 calculated by Fits in SS3 matches the md5 Etag in S3
Outcome: nothing to do ✅

Initial md5 calculated by Fits in SS3 does NOT match the md5 Etag in S3
Outcome: Re-calculate the md5 in Scholarsphere 3 ✅
Notes: This likely happened because a newer version of the file was uploaded to Scholarsphere 3, and Fits was never re-run to capture the updated md5.

The Etag in S3 is not an md5
Outcome: Calculate the S3 Etag locally in Scholarsphere 3 and compare ✅

See https://github.com/psu-libraries/scholarsphere/issues/1006