Cookbook: MARC queries - folio-org/folio-analytics GitHub Wiki

MARC queries

Here are a few sample queries that can help you get started working with MARC data in LDP1 databases. With small edits (e.g., schema and table names), the same structure of these queries should work for Metadb and LDLite.

Extract content by position: Records that are suppressed but where status is not deleted

This query extracts the 'record status' from the leader. The query also joins MARC data from Source Record Storage to instance data from Inventory to bring in the "discovery suppress" data element. Then it filters to records where record status is not 'd' ('Deleted') and where discovery suppress is TRUE (that is, suppressed records where status is not deleted).

A note about positions in MARC fields like the leader and 008

MARC positions start with '00'. When you want to extract just part of a text value in SQL, though, the first character is said to be at position number 1. When you write your SQL, you will use the substring() function, but you have to add 1 to the normal MARC position numbers to get the correct starting value for your substring. In the example below, you use the number 6 in the SQL code for the '05' position of the leader.

SELECT
    sm.srs_id,
    substring(sm."content", 6, 1) AS ldr_d,
    ie.discovery_suppress
FROM
    srs_marctab sm
    LEFT JOIN folio_reporting.instance_ext AS ie ON sm.instance_hrid = ie.instance_hrid
WHERE
    sm.field = '000'
AND
    substring(sm."content", 6, 1) != 'd'
AND
    ie.discovery_suppress IS TRUE;

Example output:

srs_id ldr_d discovery_suppress
e9835ccc-e3f2-4547-968a-98337ad46c8a n true
43258c6c-3873-4da5-a119-8aa2e62974bb true
82632b3f-f106-4554-85b7-a23e9afe32c5 c true

Extract content by position: Records where language is 'eng'

This query pulls MARC records where the field is 008 and positions 35-37 (which indicate language) match 'eng'. This query uses LIMIT 100 to get only the first 100 records, but you can remove that if you need all of them. Just be cautious with how large of a query that might be!

A note about positions in MARC fields like the leader and 008

MARC positions start with '00'. When you want to extract just part of a text value in SQL, though, the first character is said to be at position number 1. When you write your SQL, you will use the substring() function, but you have to add 1 to the normal MARC position numbers to get the correct starting value for your substring.

In the example below, the number 36 in SQL corresponds to MARC position 35. After inserting this in the substring() function, you can then optionally specify the number of characters you want to extract (in this case, 3, for the three-letter language code).

SELECT
    sm.instance_id,
    sm.field,
    substring(sm.content, 36,3) AS language
FROM
    srs_marctab sm
WHERE
    (sm.field = '008' AND substring(sm.content, 36, 3) IN ('eng'))
    LIMIT 100
;

Example output:

instance_id field language
51e38f83-c7ae-46bb-9078-756fba51c3ee 008 eng
d58c08dd-2bb4-4391-8a24-3caab27f0488 008 eng
7482b39d-b062-4449-86d6-f60cf808ca4a 008 eng

Concatenate across subfields that are split into multiple rows

This query pulls the subfields and contents for MARC fields 336, 337, and 338. Instead of getting a separate row for each subfield, though, the query returns one row per field (336, 337, or 338). In that row, there will be a column that aggregates each subfield and the associated content into a list, using '$' to separate between subfields.

SELECT
    sm.instance_hrid,
    sm.field,
    string_agg('$'::varchar || sm.sf || sm.content, '') AS sf_content
FROM
    srs_marctab as sm
WHERE
    sm.field IN ('336','337','338')
GROUP BY
    sm.instance_hrid,
    sm.field

Example output:

instance_hrid field sf_content
in00000000001 336 $atext$btxt$2rdacontent
in00000000001 337 $2rdamedia$aunmediated$bn
in00000000001 338 $bnc$avolume$2rdacarrier

Identify records where a particular MARC subfield is missing

This query returns identifiers for MARC records where there is a missing value for field 245, subfield 'c'.

Note: this is likely to be very slow, since it has to compare the full list of the instances to a list of instances that do have a particular subfield. If the subfield you're looking for is already mapped to an Inventory data element, it's likely faster to use the Inventory tables or derived tables instead.

SELECT
    DISTINCT sm1.instance_id,
    sm1.instance_hrid
FROM
    srs_marctab AS sm1
WHERE NOT EXISTS	
    (SELECT 1 FROM srs_marctab AS sm2 WHERE sm1.instance_id = sm2.instance_id AND sm2.field ='245' AND sm2.sf = 'c')

Example output:

instance_id instance_hrid
15c927dd-2a82-4153-9f7b-f029dfe6e982 in00000000001
2933879f-dae8-4a93-8ee1-e49799fba81c in00000000002
b070e6d8-6ef3-42cf-8ad1-5aeb6fe195c6 in00000000003