Cookbook: MARC queries - folio-org/folio-analytics GitHub Wiki
MARC queries
Here are a few sample queries that can help you get started working with MARC data in LDP1 databases. With small edits (e.g., schema and table names), the same structure of these queries should work for Metadb and LDLite.
Extract content by position: Records that are suppressed but where status is not deleted
This query extracts the 'record status' from the leader. The query also joins MARC data from Source Record Storage to instance data from Inventory to bring in the "discovery suppress" data element. Then it filters to records where record status is not 'd' ('Deleted') and where discovery suppress is TRUE (that is, suppressed records where status is not deleted).
A note about positions in MARC fields like the leader and 008
MARC positions start with '00'. When you want to extract just part of a text value in SQL, though, the first character is said to be at position number 1. When you write your SQL, you will use the substring()
function, but you have to add 1 to the normal MARC position numbers to get the correct starting value for your substring. In the example below, you use the number 6 in the SQL code for the '05' position of the leader.
SELECT
sm.srs_id,
substring(sm."content", 6, 1) AS ldr_d,
ie.discovery_suppress
FROM
srs_marctab sm
LEFT JOIN folio_reporting.instance_ext AS ie ON sm.instance_hrid = ie.instance_hrid
WHERE
sm.field = '000'
AND
substring(sm."content", 6, 1) != 'd'
AND
ie.discovery_suppress IS TRUE;
Example output:
srs_id | ldr_d | discovery_suppress |
---|---|---|
e9835ccc-e3f2-4547-968a-98337ad46c8a | n | true |
43258c6c-3873-4da5-a119-8aa2e62974bb | true | |
82632b3f-f106-4554-85b7-a23e9afe32c5 | c | true |
Extract content by position: Records where language is 'eng'
This query pulls MARC records where the field is 008 and positions 35-37 (which indicate language) match 'eng'. This query uses LIMIT 100
to get only the first 100 records, but you can remove that if you need all of them. Just be cautious with how large of a query that might be!
A note about positions in MARC fields like the leader and 008
MARC positions start with '00'. When you want to extract just part of a text value in SQL, though, the first character is said to be at position number 1. When you write your SQL, you will use the substring()
function, but you have to add 1 to the normal MARC position numbers to get the correct starting value for your substring.
In the example below, the number 36 in SQL corresponds to MARC position 35. After inserting this in the substring()
function, you can then optionally specify the number of characters you want to extract (in this case, 3, for the three-letter language code).
SELECT
sm.instance_id,
sm.field,
substring(sm.content, 36,3) AS language
FROM
srs_marctab sm
WHERE
(sm.field = '008' AND substring(sm.content, 36, 3) IN ('eng'))
LIMIT 100
;
Example output:
instance_id | field | language |
---|---|---|
51e38f83-c7ae-46bb-9078-756fba51c3ee | 008 | eng |
d58c08dd-2bb4-4391-8a24-3caab27f0488 | 008 | eng |
7482b39d-b062-4449-86d6-f60cf808ca4a | 008 | eng |
Concatenate across subfields that are split into multiple rows
This query pulls the subfields and contents for MARC fields 336, 337, and 338. Instead of getting a separate row for each subfield, though, the query returns one row per field (336, 337, or 338). In that row, there will be a column that aggregates each subfield and the associated content into a list, using '$' to separate between subfields.
SELECT
sm.instance_hrid,
sm.field,
string_agg('$'::varchar || sm.sf || sm.content, '') AS sf_content
FROM
srs_marctab as sm
WHERE
sm.field IN ('336','337','338')
GROUP BY
sm.instance_hrid,
sm.field
Example output:
instance_hrid | field | sf_content |
---|---|---|
in00000000001 | 336 | $atext$btxt$2rdacontent |
in00000000001 | 337 | $2rdamedia$aunmediated$bn |
in00000000001 | 338 | $bnc$avolume$2rdacarrier |
Identify records where a particular MARC subfield is missing
This query returns identifiers for MARC records where there is a missing value for field 245, subfield 'c'.
Note: this is likely to be very slow, since it has to compare the full list of the instances to a list of instances that do have a particular subfield. If the subfield you're looking for is already mapped to an Inventory data element, it's likely faster to use the Inventory tables or derived tables instead.
SELECT
DISTINCT sm1.instance_id,
sm1.instance_hrid
FROM
srs_marctab AS sm1
WHERE NOT EXISTS
(SELECT 1 FROM srs_marctab AS sm2 WHERE sm1.instance_id = sm2.instance_id AND sm2.field ='245' AND sm2.sf = 'c')
Example output:
instance_id | instance_hrid |
---|---|
15c927dd-2a82-4153-9f7b-f029dfe6e982 | in00000000001 |
2933879f-dae8-4a93-8ee1-e49799fba81c | in00000000002 |
b070e6d8-6ef3-42cf-8ad1-5aeb6fe195c6 | in00000000003 |