Useful ActiveRecord queries - sul-dlss/preservation_catalog GitHub Wiki

Some useful ActiveRecord queries

  • useful for running from console for now, similar to the sorts of info that might be exposed via REST calls as development proceeds
  • things below related to status will change once status becomes a Rails enum on MoabRecord (see #228)

note about sql sanitization

If running any of the following queries raises an error like:

ActiveRecord::UnknownAttributeReference: Query method called with non-attribute argument(s): "SUM(zip_parts.size) - max(moab_records.size) AS zmv_size_diff"
from /opt/app/pres/preservation_catalog/shared/bundle/ruby/2.7.0/gems/activerecord-6.1.3.2/lib/active_record/sanitization.rb:145:in `disallow_raw_sql!'

You may have to wrap the offending raw SQL in an Arel.sql call, e.g. Arel.sql('SUM(zip_parts.size) - max(moab_records.size) AS zmv_size_diff').

per https://api.rubyonrails.org/classes/ActiveRecord/UnknownAttributeReference.html

When working around this exception, caution should be taken to avoid SQL injection vulnerabilities when passing user-provided values to query methods. Known-safe values can be passed to query methods by wrapping them in Arel.sql. Again, such a workaround should not be used when passing user-provided values, such as request parameters or model attributes to query methods.

which disk is a specific druid stored on?

# example AR query
druid = 'yk449xx2836'
CompleteMoab.joins(:preserved_object).find_by('preserved_objects.druid = ?',druid).moab_storage_root

which objects aren't in a good state?

# example AR query
[1] pry(main)> MoabRecord.joins(:preserved_object, :moab_storage_root).where.not(status: 'ok').order(status: :asc, storage_location: :asc).pluck(:status, :storage_location, :druid)
generated SQL
-- example sql produced by above AR query
SELECT "moab_records"."status", "storage_location", "druid" FROM "moab_records"
INNER JOIN "preserved_objects" ON "preserved_objects"."id" = "moab_records"."preserved_object_id"
INNER JOIN "moab_storage_roots" ON "moab_storage_roots"."id" = "moab_records"."moab_storage_root_id"
WHERE ("moab_records"."status" != $1)
ORDER BY moab_records.status ASC, moab_storage_roots.storage_location ASC
# example result, one bad object on checksum_root01
[["invalid_checksum", "spec/fixtures/checksum_root01/sdr2objects", "zz102hs9687"]]

catalog seeding just ran for the first time. how long did it take to crawl each storage root, how many moabs does each have, what's the average moab size?

# example AR query
[2] pry(main)> MoabStorageRoot.joins(:moab_records).group(:name).order('name asc').pluck(:name, Arel.sql('min(moab_records.created_at)'), Arel.sql('max(moab_records.created_at)'), Arel.sql('(max(moab_records.created_at)-min(moab_records.created_at))'), Arel.sql('count(moab_records.id)'), Arel.sql('round(avg(moab_records.size))'))
generated SQL
-- example sql produced by above AR query
SELECT "moab_storage_roots"."name", min(moab_records.created_at), max(moab_records.created_at), (max(moab_records.created_at)-min(moab_records.created_at)), count(moab_records.id), round(avg(moab_records.size)) FROM "moab_storage_roots"
INNER JOIN "moab_records" ON "moab_records"."moab_storage_root_id" = "moab_storage_roots"."id"
GROUP BY "moab_storage_roots"."name"
ORDER BY name asc
# example result when there's one storage root configured, would automatically list all if there were multiple
[["storage_root2", 2017-11-18 05:49:54 UTC, 2017-11-18 06:06:50 UTC, "00:16:55.845987", 9122, 0.3132092573e10]]

what's the breakdown of objects among the storage roots?

# example AR query
[30] pry(main)> MoabRecord.joins(:moab_storage_root).group(:status, :name, :storage_location).order(status: :asc, storage_location: :asc).pluck(:status, :name, :storage_location, 'count(moab_records.id)')
generated SQL
-- example sql produced by above AR query
SELECT "moab_records"."status", "name", "storage_location", count(moab_records.id) FROM "moab_records"
INNER JOIN "moab_storage_roots" ON "moab_storage_roots"."id" = "moab_records"."moab_storage_root_id"
GROUP BY "moab_records"."status", "name", "storage_location"
ORDER BY "moab_records"."status" ASC, "storage_location" ASC
# example result from stage
[["ok", "new-store-fixtures-sr1-last", "/services-disk-stage/new-store-fixtures-sr1/sdr2objects", 1],
 ["ok", "service-disk-stage100", "/services-disk-stage/store-100/sdr2objects", 3404],
 ["ok", "services-disk-stage2", "/services-disk-stage/store2/sdr2objects", 11229],
 ["ok", "fixture_sr2", "spec/fixtures/storage_root02/sdr2objects", 1],
 ["invalid_checksum", "fixture_sr3", "spec/fixtures/checksum_root01/sdr2objects", 10],
 ["invalid_checksum", "fixture_sr1", "spec/fixtures/storage_root01/sdr2objects", 1],
 ["invalid_checksum", "fixture_sr2", "spec/fixtures/storage_root02/sdr2objects", 2],
 ["validity_unknown", "fixture_sr1", "spec/fixtures/storage_root01/sdr2objects", 3]]
how many moabs on each storage root are status != 'ok'?
# example AR query
[12] pry(main)> MoabRecord.joins(:moab_storage_root).where.not(status: 'ok').group(:status, :name, :storage_location).order(status: :asc, storage_location: :asc).pluck(:status, :name, :storage_location, 'count(moab_records.id)')
generated SQL
-- example sql produced by above AR query
SELECT "moab_records"."status", "name", "storage_location", count(moab_records.id) FROM "moab_records"
INNER JOIN "moab_storage_roots" ON "moab_storage_roots"."id" = "moab_records"."moab_storage_root_id"
WHERE "moab_records"."status" != 0
GROUP BY "moab_records"."status", "name", "storage_location"
ORDER BY "moab_records"."status" ASC, "storage_location" ASC
# example result from stage
[["invalid_checksum", "fixture_sr3", "spec/fixtures/checksum_root01/sdr2objects", 10],
 ["invalid_checksum", "fixture_sr1", "spec/fixtures/storage_root01/sdr2objects", 1],
 ["invalid_checksum", "fixture_sr2", "spec/fixtures/storage_root02/sdr2objects", 2],
 ["validity_unknown", "fixture_sr1", "spec/fixtures/storage_root01/sdr2objects", 3]]

view the druids on a given endpoint

Warning: will return tons of results on prod, query should probably be filtered further

# example AR query
input> MoabRecord.joins(:preserved_object, :moab_storage_root).where(moab_storage_roots: {name: :fixture_sr1}).pluck('preserved_objects.druid')
generated SQL
-- example sql produced by above AR query
 SELECT preserved_objects.druid FROM "moab_records"
 INNER JOIN "preserved_objects" ON "preserved_objects"."id" = "moab_records"."preserved_object_id"
 INNER JOIN "moab_storage_roots" ON "moab_storage_roots"."id" = "moab_records"."moab_storage_root_id"
 WHERE "moab_storage_roots"."name" = $1  [["name", "fixture_sr1"]]
# example result
["bp628nk4868", "dc048cw1328", "yy000yy0000"]

view the zip parts for a given druid list

# example AR query
input> druids = ["fj470sr8766", "ys375hy4998", "vx705zc3288"]
input> ZipPart.joins(
          zipped_moab_version: [:preserved_object, :zip_endpoint]
        ).where(
          preserved_objects: { druid: druids }
        ).order(
          :druid, :zip_version, :endpoint_name, :suffix
        ).pluck(
          :druid, 'current_version AS highest_version', 'zipped_moab_versions.version AS zip_version', :endpoint_name, :status, :suffix, :parts_count, :size, :created_at, :updated_at
        )
generated SQL
-- example sql produced by above AR query
SELECT "druid", current_version AS highest_version, zipped_moab_versions.version AS zip_version,
  "endpoint_name", "zip_parts"."status", "zip_parts"."suffix", "zip_parts"."parts_count", "zip_parts"."size",
  "zip_parts"."created_at", "zip_parts"."updated_at"
FROM "zip_parts"
  INNER JOIN "zipped_moab_versions" ON "zipped_moab_versions"."id" = "zip_parts"."zipped_moab_version_id"
  INNER JOIN "preserved_objects" ON "preserved_objects"."id" = "zipped_moab_versions"."preserved_object_id"
  INNER JOIN "zip_endpoints" ON "zip_endpoints"."id" = "zipped_moab_versions"."zip_endpoint_id"
WHERE "preserved_objects"."druid" IN ('hz813rj6000', 'yg036yy3673', 'tb692bc8422', 'bg731fx0119', 'zx893sy6257', 'nx552zw5634', 'cd763sd8878', 'qc430zq6785', 'tg286cg3956', 'qt215kg7513')
ORDER BY "druid" ASC, "zip_version" ASC, "endpoint_name" ASC, "zip_parts"."suffix" ASC
# example result
[["fj470sr8766", 4, 1, "aws_s3_east_1", "ok", ".zip", 1, 55793674, Sun, 06 Oct 2019 21:22:22 UTC +00:00, Wed, 15 Jul 2020 08:21:00 UTC +00:00],
 ["fj470sr8766", 4, 1, "aws_s3_west_2", "ok", ".zip", 1, 55793674, Sat, 11 Aug 2018 05:29:45 UTC +00:00, Wed, 15 Jul 2020 07:49:58 UTC +00:00],
 ["fj470sr8766", 4, 1, "ibm_us_south", "ok", ".zip", 1, 55793674, Sat, 09 Mar 2019 02:51:39 UTC +00:00, Wed, 15 Jul 2020 08:03:19 UTC +00:00],
 ["fj470sr8766", 4, 2, "aws_s3_east_1", "ok", ".zip", 1, 27729, Sun, 06 Oct 2019 21:22:17 UTC +00:00, Wed, 15 Jul 2020 08:21:01 UTC +00:00],
 ["fj470sr8766", 4, 2, "aws_s3_west_2", "ok", ".zip", 1, 27729, Sat, 11 Aug 2018 05:29:45 UTC +00:00, Wed, 15 Jul 2020 07:49:58 UTC +00:00],
 ["fj470sr8766", 4, 2, "ibm_us_south", "ok", ".zip", 1, 27729, Sat, 09 Mar 2019 02:51:38 UTC +00:00, Wed, 15 Jul 2020 08:03:19 UTC +00:00],
 ["fj470sr8766", 4, 3, "aws_s3_east_1", "ok", ".zip", 1, 35404, Sun, 06 Oct 2019 21:22:17 UTC +00:00, Wed, 15 Jul 2020 08:21:01 UTC +00:00],
 ["fj470sr8766", 4, 3, "aws_s3_west_2", "ok", ".zip", 1, 35404, Sat, 11 Aug 2018 05:29:45 UTC +00:00, Wed, 15 Jul 2020 07:49:58 UTC +00:00],
 ["fj470sr8766", 4, 3, "ibm_us_south", "ok", ".zip", 1, 35404, Sat, 09 Mar 2019 02:51:38 UTC +00:00, Wed, 15 Jul 2020 08:03:19 UTC +00:00],
 ["fj470sr8766", 4, 4, "aws_s3_east_1", "ok", ".zip", 1, 54801, Sun, 06 Oct 2019 21:22:22 UTC +00:00, Wed, 15 Jul 2020 08:21:01 UTC +00:00],
 ["fj470sr8766", 4, 4, "aws_s3_west_2", "ok", ".zip", 1, 54801, Thu, 09 May 2019 22:34:03 UTC +00:00, Wed, 15 Jul 2020 07:49:59 UTC +00:00],
 ["fj470sr8766", 4, 4, "ibm_us_south", "ok", ".zip", 1, 54801, Thu, 09 May 2019 22:34:03 UTC +00:00, Wed, 15 Jul 2020 08:03:19 UTC +00:00],
 ["vx705zc3288", 6, 1, "aws_s3_east_1", "ok", ".zip", 1, 103251, Mon, 18 Nov 2019 23:31:16 UTC +00:00, Fri, 17 Apr 2020 01:04:26 UTC +00:00],
 ["vx705zc3288", 6, 1, "aws_s3_west_2", "ok", ".zip", 1, 103251, Sun, 26 Aug 2018 05:36:26 UTC +00:00, Thu, 16 Apr 2020 11:07:52 UTC +00:00],
 ["vx705zc3288", 6, 1, "ibm_us_south", "ok", ".zip", 1, 103251, Fri, 08 Mar 2019 18:46:00 UTC +00:00, Thu, 16 Apr 2020 16:11:13 UTC +00:00],
 ["vx705zc3288", 6, 2, "aws_s3_east_1", "ok", ".zip", 1, 32635, Mon, 18 Nov 2019 23:31:16 UTC +00:00, Fri, 17 Apr 2020 01:04:26 UTC +00:00],
 ["vx705zc3288", 6, 2, "aws_s3_west_2", "ok", ".zip", 1, 32635, Sun, 26 Aug 2018 05:36:26 UTC +00:00, Thu, 16 Apr 2020 11:07:52 UTC +00:00],
 ["vx705zc3288", 6, 2, "ibm_us_south", "ok", ".zip", 1, 32635, Fri, 08 Mar 2019 18:46:00 UTC +00:00, Thu, 16 Apr 2020 16:11:13 UTC +00:00],
 ["vx705zc3288", 6, 3, "aws_s3_east_1", "ok", ".zip", 1, 40159, Mon, 18 Nov 2019 23:31:16 UTC +00:00, Fri, 17 Apr 2020 01:04:26 UTC +00:00],
 ["vx705zc3288", 6, 3, "aws_s3_west_2", "ok", ".zip", 1, 40159, Sun, 26 Aug 2018 05:36:26 UTC +00:00, Thu, 16 Apr 2020 11:07:53 UTC +00:00],
 ["vx705zc3288", 6, 3, "ibm_us_south", "ok", ".zip", 1, 40159, Fri, 08 Mar 2019 18:46:00 UTC +00:00, Thu, 16 Apr 2020 16:11:13 UTC +00:00],
 ["vx705zc3288", 6, 4, "aws_s3_east_1", "ok", ".zip", 1, 50144, Wed, 18 Mar 2020 05:00:55 UTC +00:00, Fri, 17 Apr 2020 01:04:26 UTC +00:00],
 ["vx705zc3288", 6, 4, "aws_s3_west_2", "ok", ".zip", 1, 50144, Wed, 18 Mar 2020 05:00:55 UTC +00:00, Thu, 16 Apr 2020 11:07:53 UTC +00:00],
 ["vx705zc3288", 6, 4, "ibm_us_south", "ok", ".zip", 1, 50144, Wed, 18 Mar 2020 05:00:55 UTC +00:00, Thu, 16 Apr 2020 16:11:14 UTC +00:00],
 ["vx705zc3288", 6, 5, "aws_s3_east_1", "ok", ".zip", 1, 53760, Tue, 19 May 2020 21:39:24 UTC +00:00, Tue, 19 May 2020 21:39:26 UTC +00:00],
 ["vx705zc3288", 6, 5, "aws_s3_west_2", "ok", ".zip", 1, 53760, Tue, 19 May 2020 21:39:24 UTC +00:00, Tue, 19 May 2020 21:39:26 UTC +00:00],
 ["vx705zc3288", 6, 5, "ibm_us_south", "ok", ".zip", 1, 53760, Tue, 19 May 2020 21:39:24 UTC +00:00, Tue, 19 May 2020 21:39:26 UTC +00:00],
 ["vx705zc3288", 6, 6, "aws_s3_east_1", "ok", ".zip", 1, 59396, Fri, 12 Jun 2020 19:08:52 UTC +00:00, Fri, 12 Jun 2020 19:08:56 UTC +00:00],
 ["vx705zc3288", 6, 6, "aws_s3_west_2", "ok", ".zip", 1, 59396, Fri, 12 Jun 2020 19:08:52 UTC +00:00, Fri, 12 Jun 2020 19:08:56 UTC +00:00],
 ["vx705zc3288", 6, 6, "ibm_us_south", "ok", ".zip", 1, 59396, Fri, 12 Jun 2020 19:08:52 UTC +00:00, Fri, 12 Jun 2020 19:08:56 UTC +00:00],
 ["ys375hy4998", 1, 1, "aws_s3_east_1", "ok", ".zip", 1, 21120255, Sat, 30 Nov 2019 03:52:55 UTC +00:00, Thu, 16 Apr 2020 16:19:12 UTC +00:00],
 ["ys375hy4998", 1, 1, "aws_s3_west_2", "ok", ".zip", 1, 21120255, Fri, 24 Aug 2018 08:56:27 UTC +00:00, Thu, 16 Apr 2020 05:09:29 UTC +00:00],
 ["ys375hy4998", 1, 1, "ibm_us_south", "ok", ".zip", 1, 21120255, Sun, 10 Mar 2019 08:09:04 UTC +00:00, Thu, 16 Apr 2020 09:19:47 UTC +00:00]]

which druids don't have a zipped moab version for each of their versions on each of their target endpoints?

which have some of the expected versions, but not all?

note: might take a few minutes to run on prod

# example AR query
[48] pry(main)> PreservedObject.left_outer_joins(zipped_moab_versions: [:zip_endpoint]).where('zip_endpoints.id = zipped_moab_versions.zip_endpoint_id').group(:druid, :current_version, :endpoint_name).having('count(zipped_moab_versions.id) < current_version').pluck(:druid, :current_version, :endpoint_name, 'count(zipped_moab_versions.id)')
generated SQL
-- example sql produced by above AR query
SELECT "preserved_objects".*
FROM "preserved_objects"
  LEFT OUTER JOIN "zipped_moab_versions" ON "zipped_moab_versions"."preserved_object_id" = "preserved_objects"."id"
  LEFT OUTER JOIN "zip_endpoints" ON "zip_endpoints"."id" = "zipped_moab_versions"."zip_endpoint_id"
WHERE (zip_endpoints.id = zipped_moab_versions.zip_endpoint_id)
GROUP BY "preserved_objects"."druid", "preserved_objects"."current_version", "endpoint_name"
HAVING (count(zipped_moab_versions.id) < current_version)
# example result, expect to have 4 versions on each endpoint, but only 3 have replicated to each so far
[["vc647pg5260", 4, "aws_s3_east_1", 3],
 ["vc647pg5260", 4, "aws_s3_west_2", 3],
 ["vc647pg5260", 4, "ibm_us_south", 3]]
which have none of the expected versions?

note: might take a few minutes to run on prod

TODO: Figure out how to replace this.

[53] PreservedObject.joins(preservation_policy: [:zip_endpoints]).left_outer_joins(:zipped_moab_versions).group(:druid, :current_version, :endpoint_name).having('count(zipped_moab_versions.id) = 0').pluck(:druid, :current_version, :endpoint_name, 'count(zipped_moab_versions.id)')
generated SQL
-- example sql produced by above AR query
SELECT "preserved_objects"."druid", "preserved_objects"."current_version", "endpoint_name", count(zipped_moab_versions.id)
FROM "preserved_objects"
  INNER JOIN "preservation_policies" ON "preservation_policies"."id" = "preserved_objects"."preservation_policy_id"
  INNER JOIN "preservation_policies_zip_endpoints" ON "preservation_policies_zip_endpoints"."preservation_policy_id" = "preservation_policies"."id"
  INNER JOIN "zip_endpoints" ON "zip_endpoints"."id" = "preservation_policies_zip_endpoints"."zip_endpoint_id"
  LEFT OUTER JOIN "zipped_moab_versions" ON "zipped_moab_versions"."preserved_object_id" = "preserved_objects"."id"
GROUP BY "preserved_objects"."druid", "preserved_objects"."current_version", "endpoint_name"
HAVING (count(zipped_moab_versions.id) = 0)
# example result
[["bp628nk4868", 1, "aws_s3_east_1", 0],
 ["bp628nk4868", 1, "aws_s3_west_2", 0],
 ["bp628nk4868", 1, "ibm_us_south", 0],
 ["dc048cw1328", 2, "aws_s3_east_1", 0],
 ["dc048cw1328", 2, "aws_s3_west_2", 0],
 ["dc048cw1328", 2, "ibm_us_south", 0]]

look for non-ok zip parts, and other replication irregularities for druids that have started replication

druids_with_replication_issues =
  ZipPart.
    joins(zipped_moab_version: [{ preserved_object: [:moab_records] }, :zip_endpoint]).
    where.not(zip_parts: { status: 'ok' }).
    group(:druid, :endpoint_name).
    pluck(:druid,
          Arel.sql('SUM(zip_parts.size) AS zmv_size'),
          Arel.sql('max(moab_records.size) AS moab_size'), # max(moab_records.size) -- heuristic is that biggest moab is usually latest version is usually most up to date, since things mostly work
          Arel.sql('SUM(zip_parts.size) - max(moab_records.size) AS zmv_size_diff'),
          :endpoint_name,
          Arel.sql('COUNT(zip_parts.id) parts_count'),
          Arel.sql('ARRAY_AGG(zip_parts.status) part_statuses'))

include ActionView::Helpers::NumberHelper

# make more readable
druids_with_replication_issues_human = druids_with_replication_issues.map do |result|
  [
    result[0], # druid
    number_to_human_size(result[1]), # zmv_size
    number_to_human_size(result[2]), # moab_size
    (result[1] >= result[2]) ? 'zmv_size >= moab_size' : 'MoabRecord LARGER than ZipParts!', # warn if zipped moab's parts don't add up to larger than the moab size
    number_to_human_size(result[3]), # zmv_size_diff
    result[4], # endpoint_name
    result[5], # zip part count
    # result[6].map { |status_int| ZipPart.statuses.key(status_int) } # status for all parts on EP
    result[6].map { |status_int| ZipPart.statuses.key(status_int) }.uniq
  ]
end

Look for orphaned zipped moab versions (i.e. replication was started and a ZippedMoabVersion was created, but no ZipPart records were created and nothing was replicated to the cloud)

note: might take a few minutes to run on prod

zmvs_with_no_zip_parts = ZippedMoabVersion.joins(:preserved_object, :zip_endpoint).left_outer_joins(:zip_parts).group(:druid, :endpoint_name, :version).having('count(zip_parts.id) = 0').order(:druid, :version, :endpoint_name).pluck(:druid, :version, :endpoint_name) ; nil
generated SQL
-- example sql produced by above AR query
SELECT "zipped_moab_versions".* FROM "zipped_moab_versions"
  INNER JOIN "preserved_objects" ON "preserved_objects"."id" = "zipped_moab_versions"."preserved_object_id"
  INNER JOIN "zip_endpoints" ON "zip_endpoints"."id" = "zipped_moab_versions"."zip_endpoint_id"
  LEFT OUTER JOIN "zip_parts" ON "zip_parts"."zipped_moab_version_id" = "zipped_moab_versions"."id"
GROUP BY "druid", "endpoint_name", "zipped_moab_versions"."version"
HAVING (count(zip_parts.id) = 0)
ORDER BY "druid" ASC, "zipped_moab_versions"."version" ASC, "endpoint_name" ASC
# example result
[["bb003ch5585", 1, "aws_s3_east_1"],
 ["bb003ch5585", 1, "aws_s3_west_2"],
 ["zz804sm0716", 5, "ibm_us_south"],
 ["zz950qq8379", 1, "aws_s3_east_1"],
 ["zz950qq8379", 1, "aws_s3_west_2"],
 ["zz950qq8379", 1, "ibm_us_south"]]

N random druids

handy for spot checking things

n = 10
druids = PreservedObject.order('random()').limit(n).pluck(:druid)
generated SQL
SELECT "preserved_objects"."druid" FROM "preserved_objects" ORDER BY random() LIMIT 10
["sj628zj5844", "zy518mc9127", "zp387py2392", "pf656db9702", "qq722pw7781", "bt107qk5531", "zg182hf5648", "hk056fz7050", "fk930bv5254", "cx100kv5813"]
⚠️ **GitHub.com Fallback** ⚠️