Transform Meeting Running Notes - uwlib-cams/MARC2RDA GitHub Wiki
Present: Cypress, Ying-Hsiang, Sara, Tynan
Notes: Cypress
We will be changing meetings from weekly to as needed.
- Doreen: still working on digesting the code for 6xx and going through the aggregates document in order to start coding for 800, also working on the code for 342
- Cypress: Finished 008 aside from outstanding questions, also finished 006. Onboarded Sara, will onboard Tynan tomorrow
- Ying-Hsiang: Working on setting up Wikibase cloud. Running Wikibase cloud instance and tool from NLG Greece. Troubleshooting this. Reaching out to Wikibase cloud team, hoping Crystal can contact project manager to get some more help.
Present: Cypress, Ying-Hsiang, Doreen, Penny, Tynan, Sara, Crystal
Notes: Cypress
- Cypress - still working on 008 :) 006 is reviewed and she will start on that next if it doesn't need reproduction conditions. Tested punctuation function that accounts for abbreviations.
- Ying-Hsiang - Waiting on Deborah for aggregates work, which is paused for now. Working on setting up Wikibase instances.
- Doreen - Reproduction conditions mostly done! 245 will be added today. Working on updating code for some fields.
- Penny - Learning coding and transformation! Has time for other tasks.
Notes: Cypress
- Cypress: almost finished coding 008 save the questions in the issue. This should make reviewing and coding 006 easier.
- Penny: Penny has finished 1XX, 7XX, and 8XX Google sheets and can take on more work.
- Ying-Hsiang: Aggregate code, Java extension for determining URI types.
- Doreen: 008 reproduction conditions are done! Trying to follow up on 300 to see if it is ready. Reviewing and transforming fields.
- Issues with determining URI types:
- We need to discuss what to do with URIs that cannot be determined to map to an RDA entity
- Maybe if there's an 040 e = RDA can we use VIAF etc. - we should bring this to the group
- What about original RDA vs official RDA? Is this a problem?
Present: Cypress, Doreen, Ying-Hsiang, Penny, Crystal
Notes: Cypress
- Cypress: Still working on 245, classification fields with Gordon and Penny. Finished coding fields in RFT and reviewed some more fields.
- Doreen: Laura and Doreen almost done with reproductions for 008! Also working on reviewing fields and coding.
- Penny: Created new URI table for approved URI and have asked Adam and Gordon to review. Created a heading field attribute mapping table. We can review and revise as we code.
- Ying-Hsiang: Majority of working time has been aggregate code, which is on track. Working on parallel aggregates with Deborah.
- Crystal: Working on datatype URIs in Wikidata
- In some cases, XSLT is not efficient, so we have extensions
- Determining IRI type is one useful case
- code dereferences IRI and downloads XML format
- looks for rdf:type element that matches a list in an XML file (this is where Penny comes in!)
- README for extension
- RdfPredicateExtractor extension requires JDK and Maven
- Follow instructions for setting up in command line
- Run through Java, there is a Java file to run
- Next step would be updating/creating XML lists of approved IRIs
- Oxygen HE is not an option for extensions if we want this to be open source - not able to run a Java extension
- Can we run the Java extension in Oxygen for testing? No, it has to be run from the command line.
- Penny created a Google Sheet with just URIs
- What happens if an IRI does not return an XML file? We will have a default case
- Is this a reasonable ask for users to run XSLT through Java? By the end of phase 1, we want some code that executes 'all the things'. Let's do whatever it takes to make it work. Users will need to install Java runtime etc. This is a reasonable ask. We will provide a list of dependencies. Checking to see what is behind an IRI is amazing :)
- Is this something we need to currently add into the workflow? We need to make sure the XML lists are ready before implementing. Reach out to Ying-Hsiang if we have additional questions.
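The IRI-type check discussed above (read the RDF/XML behind an IRI, collect its rdf:type values, compare them with an approved list, fall back to a default) can be sketched roughly as follows. This is a small Python illustration only, not the project's Java extension: the function names, the sample RDF/XML, and the `APPROVED` mapping are all hypothetical, and the HTTP fetch is left out so the sketch takes the RDF/XML as a string.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def declared_types(rdf_xml, subject_iri):
    """Collect the rdf:type IRIs declared for subject_iri in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    types = set()
    for node in root:
        if node.get(f"{{{RDF_NS}}}about") != subject_iri:
            continue
        # A typed node element (anything other than rdf:Description) declares a type itself.
        if node.tag != f"{{{RDF_NS}}}Description" and node.tag.startswith("{"):
            namespace, _, local_name = node.tag[1:].partition("}")
            types.add(namespace + local_name)
        for child in node.findall(f"{{{RDF_NS}}}type"):
            resource = child.get(f"{{{RDF_NS}}}resource")
            if resource:
                types.add(resource)
    return types


def rda_entity_for(rdf_xml, subject_iri, approved, default="unknown"):
    """Return the entity mapped to the first approved type, or the default case."""
    try:
        found = declared_types(rdf_xml, subject_iri)
    except ET.ParseError:
        # The IRI did not return parseable XML: use the default case.
        return default
    for type_iri in sorted(found):
        if type_iri in approved:
            return approved[type_iri]
    return default


# Hypothetical sample data; example.org IRIs are placeholders, not approved IRIs.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.org/agent/1">
    <rdf:type rdf:resource="http://example.org/types/Person"/>
  </rdf:Description>
</rdf:RDF>"""
APPROVED = {"http://example.org/types/Person": "Person"}

print(rda_entity_for(SAMPLE, "http://example.org/agent/1", APPROVED))
```

In this sketch the approved list is a plain dictionary; in the workflow described in the notes it would come from the XML lists of approved IRIs once those are ready.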
Present: Cypress, Doreen, Penny, Ying-Hsiang
Notes: Doreen
Cypress:
- Coded reproduction conditions for 264 and test data are now available to view.
- Worked on Misc. fields. Working on 245
- Set up input parameter in m2r for base IRI.
Doreen:
- Reproduction conditions added for 264, 250, 260. Currently working on adding conditions to 008.
- Finished transform for 520 and started on 505.
Ying-Hsiang:
- Working on aggregates with Deborah. Deborah has been making some changes so he will commit these changes in code along the way.
- Finished transform for 518.
Penny:
- Finished the table for entity type. Waiting on Gordon to review.
- Cypress: once Gordon is done reviewing, we can put it in xml format.
- Ying-Hsiang: Working on documentation for java extension to determine IRI type.
Laura:
- Dropped in to explain special reproduction conditions for 008. Doreen will add that to the spreadsheet.
- Purpose: Let other institutions use their own IRI.
- Yellow warnings might still occur, but generally fine unless there is a loop.
Present: Cypress, Doreen, Penny, Ying-Hsiang
Notes: Cypress
- Cypress - Cypress has been working on field 245. She has also finished up some code re-check/code on hold issues and is transforming classification fields.
- Doreen - 008 vocabs done, replacing old links. Laura added comments to fields that will need reproduction conditions. They will work on 264 together this afternoon.
- Penny - Still working on entity types, hoping to finish this week and continue with 1XX and 7XX indicator and subfield rows. We looked at the 100 Google Sheet that Penny is updating.
- Ying-Hsiang - Mapped 334. Working and communicating with Deborah and has converted patterns that Deborah has verified. For performance testing - start with the largest file in the Google Drive Test Data folder. Will return to code for 518.
- Laura - Laura dropped in! Laura is working on reproduction conditions.
- We updated the 518 mapping together!
- Cypress is going to shorten the meeting time since we do not usually take 1.5 hours. The meeting will be scheduled for 1 hour.
- Penny will let Cypress know when attributes for agents have been mapped and added to the Google Sheets.
- Cypress will let Laura know when 245 is ready for reproduction conditions.
Present: Cypress, Doreen, Ying-Hsiang, Penny, Sita, Laura, Crystal
Notes: Crystal
- Cypress has been working on 245, resolving code on hold issues, related discussions
- Doreen has been finishing up 008 vocabularies project and working on reproductions
- Ying-Hsiang has been working on aggregates code performance, 3xx code, waiting on big dataset from Crystal
- Multiple $2's (generally multiple non-repeatable fields that are repeated)
- decision: take the first occurrence and don't assume that non-repeatable fields won't be repeated in error
- when external lookup documents are renamed, moved, no longer exist, xslt doesn't have a graceful way to fail. a java extension could be useful here to prevent the code from failing in the middle of a big transformation
- see reproductions guidelines area of Wiki (reviewed during meeting)
- feedback from coders: this is easy to follow and will be easy to implement for coders (yay!)
- if any tags are changed that have already been closed, re-open them and add the "code re-check" label once they have been moved back to the "ready for transform" workflow phase.
- if tags have already been coded or partially coded, add the "code re-check" label when moving them to the "ready for transform" workflow phase
- Demo of what the spreadsheets will look like with reproductions conditions implemented
- serials are out of scope
- are they ready to transform?
- yes, they haven't been updated since february so if they need to be updated again let's do it
- NLG emailed Crystal and Cypress their code. We need official permission to use it. Crystal will ask them to also email it to Ying-Hsiang
- we will set up a wikibase cloud test instance
Present: Cypress, Doreen, Sita, Ying-Hsiang, Penny
Notes: Cypress
- More test data output available, any comments on it can go in this discussion
- Transform documentation folder is set up in the Google Drive
- Reminder that Cypress is gone until Tuesday
- Cypress and Ying-Hsiang met to discuss the aggregate code shortly before this meeting
- Cypress has been working with Gordon and Penny on subject heading fields. Also has 245 to work on amongst other things.
- Penny is comparing entity types with RDA entity types and is determining whether they are supertypes, subtypes, or equivalent. She showed us the Google Sheet she is working from, everything looks great! Hopefully Gordon and Adam can review these comparisons.
- Ying-Hsiang is working on aggregates code!
- Doreen had a reproduction meeting with Laura. This is at the beginning stage but is in the works! She finished coding 521 and has assigned herself to 520. Is mostly focused on vocabularies for 006-008.
Present: Cypress, Deborah, Ying-Hsiang, Sita, Gordon, Crystal, Doreen, Penny
Notes: Crystal
- Cypress almost done with subject headings, been working on relationships, Crystal just gave her sample records with 7XX fields and she should be able to run those through today
- Ying-Hsiang just submitted latest code on XSLT extension. Not running smoothly in Oxygen XML editor yet. We can postpone this part for now and look at another part of the code, the XML version for now
- Doreen's work is going well
- Penny hasn't started working on the transform, has outstanding questions on her mapping work that need to be resolved
- Multiple people are working on the same code.
- Let's create a Google Drive folder that we can all access
- We can move current slides into this folder easily
- Cypress will do this
- Comments in code are established practice. Keep doing this!
- Multiple languages in access point: what language tag?
- ISBD has an example in manifestations: you can just insert a plus sign.
- Nothing in RDA about construction of access points--communities decide
- They've been mapped, not coded.
- Will need to look at 041 $a as well, and remember that $a is repeatable and can also include multiple language codes strung together
- MARC Code List for Languages
- Also might be taken from ISO list
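Since 041 $a is repeatable and a single $a can hold several codes strung together, a transform would need to split the values before choosing a language tag. A minimal Python sketch of that splitting step follows; it assumes three-character codes as in the MARC Code List for Languages, and the function name and the handling of malformed values are illustrative choices, not project decisions.

```python
def split_language_codes(subfield_values):
    """Split repeatable 041 $a values into three-character codes.

    Returns (codes, rejected): codes in first-seen order without duplicates,
    and any values whose length is not a multiple of three.
    """
    codes = []
    rejected = []
    for raw in subfield_values:
        value = raw.strip().lower()
        if not value or len(value) % 3 != 0:
            # Cannot be cut cleanly into three-character codes.
            rejected.append(raw)
            continue
        for start in range(0, len(value), 3):
            code = value[start:start + 3]
            if code not in codes:
                codes.append(code)
    return codes, rejected


print(split_language_codes(["engfre", "ger", "eng"]))
```

The sketch only splits and de-duplicates; it does not check the codes against the MARC or ISO lists.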
- Aggregate Marker Project-DRAFT
- Split file of MARC records into collection aggregates, parallel aggregates, augmentation aggregates, and single expression manifestations
- Writing code now for single expressions
- Once we do start running aggregates, we will need a way to tell the code "hey this is an aggregate"
- Split into separate files?
- Code has different modes: look for all the work properties, expression properties, etc. With aggregates, we're going to have multiple of these classes. The transform needs to know when to run the modes for each class. They're going to be separate functions.
- Category of work/category of manifestation could be applied to aggregates to tip off the transform
- Could use extensions or another program aside from XSLT so we could handle this situation from a different file
- Currently, we are regulated by XSLT. We could use Java extension to do whatever we want with aggregates
- We will explore this further
- AggregateMarkers-DRAFT
- Very complex pattern matching that XSLT probably can't handle
- First step: split out collection works: done
- Second step: split out diachronic works: done
- Deborah proposes that we use MARC Report to split input files prior to transform
- Run, save review, until what you have left is "single expression manifestations" then look at your results
- Conventional collective titles and 6xx $a genre/form terms need to be examined: what determines that something is definitely an aggregate? It would be useful for someone besides Deborah to take a look at these.
- It makes sense to do review in order, check details in a particular order by type. You can sort review by type and names they are given.
- Questions about processing speed and ease of implementation: can we do this in XSLT and add this to the code? Will it add to the code? Can we add it to a Javascript extension? Ying-Hsiang can write an extension and will re-prioritize his workload to make sure it happens by the end of phase I.
- Turning table Adam had into something usable in XML.
- Involve research: can these convert into IRIs?
- If we can extract multiple types from the target, that shouldn't be an issue: does one of those types map to an RDA entity type? We should be able to dereference and determine what type is declared
- We would like to be able to do that with an extension. We haven't been able to do that yet. So right now we need a table
- Penny can take this on, Cypress can show her how to do this
Present: Cypress, Doreen, Deborah, Ying-Hsiang, Gordon, Sita, Penny
Notes: Penny
Cypress is working on
- removing inverse properties (going back up from item, nomen, metadata and agent)
- Code on hold issues such as 538, 043, 257
Ying-Hsiang is working on xml extension to check the semantics of URI of 518
Why and how does the extension work?
- Static uriInMARC.xml table is not enough to track the RDF types
- Use XSLT extension to access and retrieve RDF type of any uri during runtime
- Create a new MARC2RDA extensions project that can be invoked in Oxygen XML editor to fetch and check RDF types
Can we use it with open source tools? Yes
Deborah went over main and added entry relationships document
- Even if there is no $5, $e like the former owner still needs to be mapped to item relationship
- Discussed $e and $4 in 7XX again (previously discussed at the July 10 Group Meeting)
- Sita: ignore them, name and title used as an AP as a whole
- Gordon: ignore them
- Family names should be treated differently
- Gordon: no, just appellation strings
- Deborah: $4, $e, $i are all unreliable
- Agents in 8XX should be discussed in main meeting
Gordon: boiling all this down, in general an added entry is related to the primary WEMI stack through high-level relationships depending on the field. Entity is determined by indicator 1, not by heading
- If 6XX: subject entity related to
- If 7XX: related entity
- If 4XX and 8XX: issue of
Relationship between name portion and title portion in added entry field: related entity (can only be the high-level)
Gordon: A whole record with basic core fields should be transformed as soon as possible so that we can start feedback loops.
Sita: we should focus on the structure first, not details
Cypress showed the coding progress
- As-needed? Regular meetings?
- Who attends? We always pester Adam on Slack. Gordon has offered to help with coding. Penny? Crystal's practical usefulness is limited so just Crystal and Cypress seems cruel for Cypress.
- Cypress updates & concerns
- Any chance to look at $6?
- Relator table implementation test is here
- How did Theo generate the xml from the table? were the column names changed manually?
- How are we handling $0s and $1s that don't have a match in the table?
- Implemented a function for $2 that can be replaced once decisions are made - meaning $2 won't hold back field by field transform
- Re: minting uris and avoiding duplication of concept entities:
- embedding all or part of value is the best way to go
- I'm not even sure hash tables can be done in xslt, which is not a functional programming language
- Cypress concerns
- $6 in 561
- complete coding on metadata work (regarding 561 also?) not pushed?
- X00 solution
- what about what DF sent about aggregates (ideas for preprocessing)?
- There was a suggestion to help us avoid duplication of concept entities:
- mint uri using an algorithm that embeds part or all of value in uri
- val=383.6984, iri could be something like 10.6069/uwrda.class.383.6984
- ark identifiers suggested; probably could use DOIs but that's a lot of registration!
- Only if we can get DOI-registration API working in mass production
- Still, probably need another IRI solution
- that's for classification numbers; same could apply for subject headings (and other headings)
- turn strings into hash codes?
- how feasible is it for agents? They're not as uniform as class numbers or thesaurus headings
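The two minting ideas above (embed the value in the IRI when it is uniform, hash the string otherwise) can be sketched in Python as below. This is only an illustration under assumptions: the base IRI, the `kind` path segment, the rule for which values count as "safe" to embed, and the choice of SHA-256 truncated to 16 hex characters are all hypothetical. The point is that the same input always yields the same IRI, so repeated runs do not mint duplicates for one concept.

```python
import hashlib
import re

# Hypothetical base IRI; the project has not settled on a real one in these notes.
BASE = "http://example.org/"


def mint_iri(base, kind, value):
    """Mint a deterministic IRI from a value such as a class number or heading."""
    # Collapse runs of whitespace so trivially different strings match.
    normalized = " ".join(value.split())
    if re.fullmatch(r"[A-Za-z0-9.]+", normalized):
        # Uniform values (e.g. class numbers) are embedded as-is.
        local = normalized
    else:
        # Other strings are replaced by a stable hash code.
        local = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{base}{kind}/{local}"


print(mint_iri(BASE, "class", "383.6984"))
print(mint_iri(BASE, "subject", "Cats -- Fiction"))
```

Hashing only de-duplicates strings that are identical after normalization, which is why the feasibility question for agents remains open: variant forms of one name would still produce different IRIs.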
- CP updates:
- The $6 issue answered my transform questions I think.
- Transform code for $6 in item-related fields
- Figured out reciprocal properties for metadata work
- Scope, or, who does what now:
- Cypress:
- pull fields from project board -- BSR -- working on 380
- continue wiki transform how-to
- Theo
- Look at $6 solution
- see reciprocal props for md work (fields 583, 526)
- $0 and $1: how are we flipping loc.gov in media types etc? Finish coding. Output examples.
- Prep meeting w Deb &co about relators--prioritize--bring examples--1XX/6XX/7XX
- Aggregates--email!
- grab stuff from board
- Cypress:
- Anything Cypress wants to discuss?
- Looked at metadata work; reciprocal properties? Actually it's easier to have only in md W.
- This is because item is created, code goes to template that generates item
- TG: let's make sure when we randomly generate id, we don't produce different IRIs every time
- Probably a phase 2 problem?
- Let's start narrowing the scope
- project board rft and rip, maybe ar
- BSR only?
- example of roles-->RDA properties
- can it be incorporated into main transform? (Theo)
- Theo can do just 100, then have that meeting and get started on 1XX/6XX etc, see below
- Should have a meeting with Deborah/Cypress after we accomplish square one
- Some kind of start to preprocessing
- aggregates
- what we're trying to do: weed out collection aggregates
- what resources do we have?
- ask DF! We just need the basic set of markers for collection aggregates; I need a succinct list
- send email to DF
- If we need more, talk to Crystal, she'll work with Laura on it
- where do we start?
- Where is Cypress in the aggregates discourse?
- Dialogue included:
- cec: I think we can deal with 700 12.
- df: if analytic entry present, it requires more agg thinking, so put aside.
- start a model for 1XX/6XX/7XX/8XX transformations
- Some kind of code output around Feb 22?
- Did we resolve this: when is the 880 "in play"?
- Diana says OCLC doesn't display 880; ask Adam; also ask Cynthia Whittaker at OCLC
- NOT RESOLVED! Agenda item next week
- Let's write down our specific questions (Cypress)
- Make sure Cypress hears this: We need alt serializations, especially ntriples, as that's all that will display in RIMFF.
- We've never handled the problem of correct IRIs for the output RDA/RDF.
- Still not added to documentation:
- This is too restrictive in the transformation decisions, it should be changed to allow more frequent committing: III.C.4.b. Do not commit until the coding is complete. 2022-07-28
- Change this in the transformation decisions: "III.C.5.a. Remove the transformation-related tags and close the issue for the field. This can be done in using a commit message (see UNDECIDED items below). 2022-09-23." Specifically, do not remove all the transformation-related tags; it is wise to retain the "coded rft" tag.
- Add to transformation decisions: when selecting a field to code, assign yourself the issue in GitHub.
- Cypress issues:
- metadata works:
- triggered by "private" indicator, so we reference the md work from an item
- we seem to use both metadataDescriptionOfItem and ItemDescribedWithMetadataBy
- no md W or E
- Currently each item in a record has its own IRI; we never assume any item is equivalent to another item described in a MARC record
- We need to establish a scope for February. We would do well to establish a "map" for the remainder of phase one.
- regular fields
- Is it clear how to code those?
- relator terms/codes and RDA elements
- Can we use the current table to code relationships in MARC records, especially $e and $4, so they map to the appropriate RDA property?
- Yes/no answer needed
- If yes, how shall we get started?
- What is the official location for this table? Is it the latest version?
- Do we need a separate meeting with Deborah? Can we get started without that?
- Aggregates
- How are we going to process aggregates?
- How should we get started?
- Where is Cypress on the aggregates discussion?
- What tools do we have?
- 880 field: when is the 880 "in play"? Always? We need to know all fields where there may be an accompanying 880.
- Add to documentation:
- create a Wednesday agenda item: coordinate with mappers: if "ar" or "rip" are coded, ask them to make a note in the issue when they move to "rft." -- DONE (tg)
- This is too restrictive in the transformation decisions, it should be changed to allow more frequent committing: III.C.4.b. Do not commit until the coding is complete. 2022-07-28
- Change this in the transformation decisions: "III.C.5.a. Remove the transformation-related tags and close the issue for the field. This can be done in using a commit message (see UNDECIDED items below). 2022-09-23." Specifically, do not remove all the transformation-related tags; it is wise to retain the "coded rft" tag.
- Add to transformation decisions: when selecting a field to code, assign yourself the issue in GitHub.
- We are expected to have some sort of logic or model for 1XX/6XX/7XX/8XX transformations. Any thoughts on coding that?
- Coding of MARC 533 has been highly anticipated by the project. No need to discuss today, but let's get that on the radar. Laura wants us to know the info in the spreadsheet is quite incomplete, and will require changes to other fields/spreadsheets (like 008) to be complete.
- We have been asked to devise a solution for including the full MARC record in the output RDA/RDF. It should probably travel with the manifestation. There is an element like rdam:P30254 "is manifestation described by" that can be used. Or the unconstrained one: rdau:P60215 "is described by".
- We've never handled the problem of correct IRIs for the output RDA/RDF.
- We need alt serializations, especially ntriples, as that's all that will display in RIMFF.
- stray messy notes not-ready-for-prime-time:
- if analytic entry present, it requires more agg thinking, so put aside. cec: I think we can deal with 700 12. It is not a part. CEC: but we know what to do with those. What is URI? How create E or W without info? What about authorities? Many are controlled. Agg W: [uh oh] part work: lord of the rings is hasPart Coll of short stories is not. Aggd E? Agg w? Thing is embodied in M. IRI of triple: what's subject? What are E attributes? MARC records describes it all; which line up with this 700? If there's auth record, then maybe attributes Lots of things no auth, so mint uri DF See: it's complicated: augmented and parallels : it doesn't apply, although there's languages in parallels. It's going to take more thought: phase 2. LA if too many records fall out, the transform will be useless. prefers something less perfect label/identify as hybrids better to be inaccurate cec transform critical mass don't output something that doesn't make sense no aggs yet; run non agg on parallel and aug, not collection
present:
- Theo has two things:
- make better use of Github issues going forward
- need label for this?
- best way to record needs that arose after data review(s)
- Should go into decision index
- Review our section of the decisions index and update as needed.
- Anything Zhuo wants to address?
- Lexical aliases
- Can output RDF/XML with labels and with lexical aliases
- Anything new with identifiers-withLabels.rdf? Anything further to say about identifiers?
- oXygen has an embedded rdfxml schema (relaxNG compact syntax)
- BF identifiers create a bnode.
- nomens for name (authority) control; record qualifiers, name info; but the intention is to use nomen-things aligned with things with names; identifiers are for identification.
- what do we do now: access points in GLAM.
- Zhuo last day one week from today
- Anything he wants to do this week?
- will wrap up things not finished but started, esp mappings; some Sinopia, esp guidance; transform: everything rft is possible, but nothing outside that.
- Anything over the break, June 10-July 31?
- plan to do casual professional development
- would participate in Wed meetings
- could help with some transform
- sinopia: test creating data; if any questions, happy to participate or even create data; would also like to attend meetings
- May not return? Maybe return.
present: TG ZP
- Zhuo's sample ISBN data
- as literal
- as literal with prefix
- as nomen
- [how about as typed literal?]
- [where's the code? How did you set up the nomen?]
- [Theo is thinking: this is good enough for Wednesday]
- Anything Zhuo wants to talk about?
- Not much code discussion
- Time ends June 9
- Zhuo will code the rft MARC fields
- Maybe some awaiting reviews will be coded
- Maybe some mapping
- Obviously Theo has some highly detailed stuff for MARC 245
- What's all that activity in the repo? What's the board look like?
- Action items
- Theo will set up kickoff meeting for admin metadata
- for RSC
- for some kind of publication
- for "design patterns"
Present: TG ZP
First: Thank you, Zhuo, for the last minute work on the transform. Adding the MARC data was very helpful. Adding dataset 2 for review was super helpful: showed some stuff dataset 1 did not.
So far, here are some things to attend-to after RDA data review; note data review is still ongoing:
- Remove MARC100-->fake:rdawP10065 (Theo)
- Alter MARC 020-->rdamd:P30004 for ISBNs (Zhuo)
- POSTPONE THIS TRANSFORM EDIT UNTIL A DECISION IS MADE
- sounds like Nomens are favored; see 18 below
- no hyphens needed in ISBN
- option is just the alphanumeric ISBN string
- Alter ((MARC 245-->rdamd:P30134) + (MARC 245-->rdamd:P30156)) so that both do not output (Theo)
- MARC subfields should never appear in RDA values; includes:
- MARC 264-->rdamd:P30111 (Theo)
- If value is non-ISBD, a semicolon is best between subfield values.
- Alter MARC 337-->rdam:P3002 so that both $a and $b (the code) do not output. Consider not outputting the meaningless-in-RDA code at all. (Theo)
However, consider this full process; although we said we would not reconcile yet with vocabularies:
- If $a and $b exist, suppress $b, output $a as IRI (from RDA Vocab).
- If $a or $b only:
- Match code or string with string in original vocabulary, somehow extract-and-insert IRI from RDA vocab.
- Create mapping between RDA and ID.LOC.GOV vocabulary.
- Send mapping to RSC TWG and ask them to publish.
- DO NOT alter from MARC 504-->rdamd:P30455 to MARC 504-->rdamd:P30137; the P30455 property was deemed fine. (Zhuo).
- MARC 245 ok output for man but not for wor: work does not output the full complexity a/n/p/s etc. Investigate and repair. (Theo).
- Additional mapping MARC 502-->rdawd:P10209.
- already have MARC 502-->rdawd:P10077 and -->rdawd:P10006.
- Do not output ISBD square brackets (is this for specified fields only or always?):
- MARC F264 (Theo)
- Repair MARC 245-->rdamd:P30105 sor relating to title proper; there's other inaccuracy there too, see http://fakeIRI2.edu/1302865607man and #390.
- (Theo and all going forward): Unknown placeOfPublication, dateOfPublication, NameOfPublisherAndDistributionManufactureAndProduction: although PCC-PS favor square brackets, eliminate square brackets, and use as value of noteOnManifestation. Values look like this: rdamd:P30088[Place of publication not identified].
- Option 2, presented in GD's comments to dataset 2, comment 9: only output the "statement" with sq brackets intact; do not map to P30088, P30176, etc.
- TG: just do what's easiest.
- However when sq brackets surround a value believed to be correct, output to appropriate field and strip sq brackets.
- Repair MARC 264 _4 $c c2014 --> rdam:P30280 to -->rdam:P30007 and strip all symbols
- The copyright symbol, the phonogram symbol, the string "(c)", the string "(p)", the string "copyright", the string "phonogram copyright", the letter "c", or the letter "p" should be stripped from the value
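The stripping rule above can be sketched with a single regular expression. This Python version is an illustration, not the transform's code: the function name is hypothetical, and it adds one assumption that the notes do not state, namely that a bare "c" or "p" is only stripped when a digit follows, so that ordinary words are left alone.

```python
import re

# Leading copyright/phonogram markers: the two strings, "(c)", "(p)",
# the symbols U+00A9 and U+2117, or a bare c/p directly before a digit.
_COPYRIGHT_PREFIX = re.compile(
    r"^\s*(?:phonogram\s+copyright|copyright|\(c\)|\(p\)"
    r"|[\u00a9\u2117]|[cp](?=\s*\d))\s*",
    re.IGNORECASE,
)


def strip_copyright_marker(value):
    """Remove one leading copyright or phonogram marker from a date value."""
    return _COPYRIGHT_PREFIX.sub("", value, count=1).strip()


for sample in ["c2014", "\u00a92014", "(p) 1999", "phonogram copyright 2001", "2014"]:
    print(repr(sample), "->", repr(strip_copyright_marker(sample)))
```

Longer alternatives are listed before shorter ones so that "copyright" is matched as a whole word rather than as a bare "c".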
- DO NOT change MARC 382$v-->rdaed:P20215 so that it does not use square brackets but, rather, parentheses. Gordon made the suggestion as he felt sq brackets carried too much meaning; Zhuo got this as an MLA recommendation. It's what's expected by the community that consumes this data. (Zhuo)
- MARC 264 has distinct square bracket requirements for RDA output:
- retain square brackets in "statement" elements like rdamd:P30108 (Theo; should be ok as-is)
- eliminate sq brackets in Place and Timespan elements like rdamd:P30085 (Theo; needs attention).
- Add MARC 490-->rdamd:P30106 hasSeriesStatement to output standard "statement" (Theo)
- remove the contents of subfields l (LC call number), y (invalid ISSN), and z (cancelled ISSN)
- [treat $3, $7 as per general decision ...].
- Retain the punctuation; remove the subfield encoding.
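The 490 series statement rule above (drop $l, $y, and $z, keep the punctuation, remove the subfield encoding) can be sketched as follows. This is a Python illustration only; the representation of a field as a list of (code, value) pairs and the function name are assumptions, and $3 and $7 are not handled because the general decision for them is not recorded here.

```python
def series_statement(subfields, drop=("l", "y", "z")):
    """Join 490 subfield values into one statement, leaving out dropped codes.

    subfields is a list of (code, value) pairs in field order. Punctuation
    inside the values is kept; only the subfield codes disappear.
    """
    parts = []
    for code, value in subfields:
        text = value.strip()
        if code in drop or not text:
            continue
        parts.append(text)
    return " ".join(parts)


# Hypothetical sample field.
sample_490 = [
    ("a", "Library science series,"),
    ("x", "1234-5678 ;"),
    ("v", "v. 3"),
    ("l", "(Z693)"),
]
print(series_statement(sample_490))
```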
- Repair MARC 245 $a with / and . in title (not before sor): those are getting stripped; see http://fakeIRI2.edu/904019193wor. Make sure it's ok for WEMI for output titles.
- Where else will we find data review information:
- Meeting Notes (not checked)
- anywhere else?
- What about setting up Nomens? Will Zhuo be working on that?
- MARC 336 337 338
- Use RDA vocabulary values; IRIs are practical
- do not output code, string and IRI; just one is enough.
- Create mapping RDA-->LC vocab in id.loc.gov for selected vocabularies
- NLG has some of these done already; SZ will send to GD and GD will format
- send to RSC TWG for RDA publication.
- Is that all that needs to be done for this?
- MARC data in input: send entire marc record as one long text string to man. We can also output MARC to RDA-RDFXML, but this is probably best just for data review, eliminated for final output.
- Alternative: plain output, plain output with labels, plain output with labels and MARC individual fields (as it is now).
Present: TG ZP
Theo finally got started "blitzing"
- Finished 264 (RDA "statements" were not processed in field order; $3 only accounted-for in "statements"; $6 accounted-for; repeating subfields and parallel statements should not be resolved).
- Almost finished 245; still need to account for "=" and double-check to make sure all possible punctuation is accounted-for
- started on outputting labels near opaque identifiers for properties; put it aside and never returned to it.
How about Zhuo?
- identifiers revisited. Specifically ISBNs. Put qualifiers before number previously. Wants to attempt to mint nomens
Present: TG ZP
Agenda:
- No specific agenda. Just an open discussion. Discussion included:
- Let's move back data review in group one week. Theo will get it on the agenda.
- Comments will remain informal. Enter as needed.
- On the other hand, template names should be structured using common formats. For example: F264-x1-a_b_c means field 264 with any indicator 1 and indicator 2 = 1 will have subfields a b and c processed individually in the template.
- ZP doing 502 field. LC vs OCLC documentation regarding ending period.
- 880 with 502: what happens with identifier? Are there any identifiers in 880? Or does that go in primary field only? What if there's a non-Latin identifier? What do we do with that?
- ZP planning on doing another 5XX: 585.
- Theo: review 264, 260, 245, 490, 336, 340. Get them corrected if needed. Do not seek perfection.
- Theo wants a function for checking ISBD punctuation.
- ZP wants fcn to look up $5
- Question for next time: are we going to perform lookups (mostly for "schemes") by matching a locally stored file or go over http.
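The template naming format mentioned in this discussion (for example F264-x1-a_b_c) is regular enough to be checked mechanically. The Python sketch below parses such a name; it is an illustration under assumptions, since the notes give only one example: it assumes a three-digit tag, "x" for "any indicator", digits for specific indicator values, and single-character subfield codes joined by underscores. How a blank indicator would be written is not covered.

```python
import re

# Assumed shape: F<tag>-<ind1><ind2>-<subfield>_<subfield>...
TEMPLATE_NAME = re.compile(
    r"^F(?P<tag>\d{3})-(?P<ind1>[0-9x])(?P<ind2>[0-9x])"
    r"-(?P<subfields>[a-z0-9](?:_[a-z0-9])*)$"
)


def parse_template_name(name):
    """Split a template name into tag, indicators (None means any), and subfields."""
    match = TEMPLATE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized template name: {name!r}")
    return {
        "tag": match["tag"],
        "ind1": None if match["ind1"] == "x" else match["ind1"],
        "ind2": None if match["ind2"] == "x" else match["ind2"],
        "subfields": match["subfields"].split("_"),
    }


print(parse_template_name("F264-x1-a_b_c"))
```

A check like this could be run over template names to catch ones that drift from the agreed format.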
No agenda.
Meeting Notes:
- Zhuo working on:
- some changes to where things are:
- folders for test (test input and test output)
- new "lookup" folder (for $5, $2)
- RDA vocabularies
- We should map to IRIs, not literals
- IRIs are usually values of canonical properties, not object properties
- A lot of 33X fields don't even have object properties
- We need some way to indicate that people doing transform edited the spreadsheets (we'll clearly be making corrections)
- Maybe something in Decision Index about how we transform RDA Vocabularies
- 380 field; new function for handling concept in $0/$1
- Noted: 340 field and its current function for $0/$1 uses object properties.
- We have to account for $0 and $1 IRIs that represent RDA entities, as they will be treated differently.
- Mostly this will be agents in MARC data; however, as we anticipate more and more RDA entity IRIs in MARC, we should broaden our effort here.
- Custodial history/private metadata broke due to 880. Still working on it.
- Proposal: we should plan now to prioritize the transform.
- Let's set dates for work and make sure we comply.
- How much time?
- Theo timeline:
- When are the best days/weeks to focus on this?
- March 20 - April 14: write code and prep dataset for review
- April 19-May 3: Lay low...
- May 3-June 9: edit code to correct errors, oversights, etc.
- Note: Theo just asked today (Thurs., Mar. 2) if we can think about post-completion OPT (may change Zhuo's timeline)
- ZP can work for multiple employers during OPT
- Zhuo timeline:
- UW Spring quarter ends June 9
- When are there academic requirements, big projects, etc.?
- When is last day of work? Around June 9
- When are the best days/weeks to focus on this?
- Start around March 13
- Produce some code for group to review before Zhuo leaves
- Good day to present data at a meeting: April 14
- Good meeting day to complete review at meeting: April 19 and 26 and May 3
- gives them two weeks to review
- Looking at everything above, let's set dates and goals:
- coding blitz: not sure; maybe Zhuo start on the 13th; Maybe between quarters do extra time; Theo start on the 20th and will do at least 3 weeks.
- date to have code ready for review: April 14
- Related to-do:
- Inform Crystal; make it an agenda item for main meeting
- Put on agenda -- Theo should do it.
- Code fields as they are ready to code
- we can start coding anything "Awaiting Review" and later
- lots of stuff can be coded now!
- MARC 585 is "Ready for Transform"
- Refine the function for $0 and $1 -- Theo should do this --
- Finish OMR vocabs
- Is coding done, OMR-->RDF-XML? YES, IT IS.
- Hire student to review current RDF-XML assertions inherited from OMR.
- Resolve IRI problem -- no. it's resolved!
- Use DOIs for this project
- ZP will figure data-cite metadata, do a sample, etc.
- consider separately whether we use W3C identifiers or something else
- Establish UWL Guidelines for UWLSWD vocabularies, esp concept schemes
- Make sure RDF-XML accords with UWL guidelines
- Resolve on how to publish
- Get the names of the vocabularies correct
- versioning: use releases -- how?
- does it accord with W3C BPs for publishing data on the web? --Theo was working on that
- Publish
- Insert associated values in spreadsheets
- Code the 008; also the 006, 007, if ready
- Theo: finalize 245 (see 2023-01-05 meeting notes)
- Code to output human-readable IRIs instead of opaque -- Theo will work on this, hopefully before the 20th
- MARC 500 template processing with 880 (resolves $6, right?)
- this work is ongoing; every field may have a different solution
- ZP ran into 561 issues; supposed to mint IRI for item; private data issues, etc.
- $6 is on board as "ready for transform"
- MARC 880 is on board as "ready for transform"
- Have we decided on $3 and $5?
- $5 is on board as "ready for transform"
- these are in RFT because ZP put them there; they can stay
- Theo check 336, 490, and the field SZ worked on
- Metadata WEMIs. Particularly a topic for MARC 561
- ZP will code the 561; he can transform a corpus with that field; present to group; we'll need to create fake private 561s;
- much of it is coded except for 880 and privacy complications
- this has included metadata works for private assertions
- Devise methods for weeding-out aggregates
- there's a discussion 354 where it's embedded in a larger discussion
- we should make a new issue specifically to help us code
- Change meeting time for next quarter; not Thursday afternoon
- Theo will produce a to-do list before the 13th
- Zhuo created a MARC 500 template that includes corresponding 880 processing
- He created a template matching the union (|) of 500 with 880 with $6 that starts-with the string 500.
- This will process all records with a 500, as well as all records with a 500/880$6-that-starts-with-500 combination.
- The group noted how this processing with | differs from using AND (in the latter, both conditions would have to exist in any given record for the record to be processed).
- This will make the 880s easy to process!
--> Next meeting: let's take a brief but detailed look at the function for processing $0 and $1.
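The difference the group noted between matching with the union operator (|) and matching with AND can be illustrated outside XSLT. The Python sketch below is only an illustration, assuming a simplified record model (a list of dicts with "tag" and "subfields" keys) and hypothetical function names; it is not the project's template code.

```python
def is_500(field):
    """True for a MARC 500 field."""
    return field["tag"] == "500"


def is_880_for_500(field):
    """True for an 880 whose $6 starts with the string 500."""
    return field["tag"] == "880" and field["subfields"].get("6", "").startswith("500")


def select_fields_union(record):
    """Union: a field is selected if EITHER condition holds for it."""
    return [f for f in record if is_500(f) or is_880_for_500(f)]


def record_matches_union(record):
    """A record is processed if it has a 500 OR a linked 880 (or both)."""
    return any(is_500(f) for f in record) or any(is_880_for_500(f) for f in record)


def record_matches_and(record):
    """With AND, both a 500 and a linked 880 must be present in the record."""
    return any(is_500(f) for f in record) and any(is_880_for_500(f) for f in record)
```

A record containing only a 500 satisfies the union test but not the AND test, which is why the union form picks up plain 500s as well as 500/880 pairs.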
- Anything Zhuo wants to discuss? No
- Theo hasn't progressed beyond last meeting
- There was some discussion about the OMR-->UW vocabularies project
- Anything from Zhuo?
- nothing in particular
- Review current state of m2r.xsl -- reviewed and made some minor changes to apply-templates with mode=ite
- Theo will "finish" work on 245 next
- there currently seem to be errors in the xsl:when conditions
- some punctuation still not accounted-for
- Theo will comb through and search for other errors; will create a 245 "dummy.xml"
- currently transform claims these are the fields not yet accounted-for:
- $3 : no $3 in 245
- $6 : should we process at 880 or at 245 (i.e. XXX)?
- if at the XXX field with $6, we can situate in the applicable template
- if at the 880, we'll likely reference every template that applies, not all of which will be named
- Theo thinks we need to code at XXX, not 880
- Zhuo agrees; we'll go forward with this approach to 880
- $7
- issue 358 is empty ; a little content in issue 380, specifically that OCLC has not yet accounted-for $7 so we can punt; however, now (2022-01-05) OCLC has in fact listed $7
- TG's proposal: let's continue to punt; when the group makes a decision on $7, we'll do a sweep through all fields with a $7 (ugh!)
- also note: in most spreadsheets, $7 is not even there; somebody will have to enter in spreadsheets
- data provenance will require reification in RDA and will be a difficult solution for us!
- Zhuo agrees we should postpone coding the $7
- $8 (We will not map $8 until a use case is provided. 2022-07-14)
- Anything else?
- minimum description of a metadata work: need generated ID and link to exp; exp with exp ID and link to man (the rdf file in the original description set) for which we should mint an IRI. How? This is needed in 561.
- The metadata work in Zhuo's example is reification of a statement describing a particular item, enabling him to say something about that statement. The problem, as Theo understood, is that the metadata expression needs to be linked to a metadata manifestation that actually exists. This was all addressed in the github repo in issue 225 for field 561. Theo will start reviewing and see if he can imagine some XSLT ways to resolve the problem.
- Zhuo will not be working on 561 this week so the metadata work/exp/man problem will not be resolved this week
- the sinopia templates project also is struggling with an implementation of RDA reification; it may be good to see what's going on there
- Theo says this is a new problem we are tackling and that we should write an article of some type describing the problem and our solution.
- Anything from Zhuo?
- Small project: record directions for $3 handling. Proposal:
- write transformation code for a few $3's
- record what we did in Discussion 353
- pull it all together and create a $3 decision in the decisions index
- timeline: get this done before the end of January
- what's good about this: we can encounter a few $3's and record how we processed them based on what's in each spreadsheet and Discussion 353; we can record our field-specific processing of $3 in Discussion 353 so that anything unwise can be discussed by the overall group
- what's not so good: the delay in deciding will result in varied approaches in the spreadsheets.
- Anything to add to agenda?
- What has Zhuo been working on? Anything of interest while doing that work?
- 500, focus on $5; temp solution for $5 | $3
- produced code to process every $5 in every MARC record the same way for items
- $5 and $3 together will be resolved at next m2R meeting
- What has Theo been working on?
- 336
- expression information; but when it has a $3, we add note on expression that applies to manifestation!
- This should be described as a problem in the ISSUE (not just in the spreadsheet)
- $2 temporary solution involved
- 245
- terminal punctuation elimination using replace()
- process a sibling field (in this case the Leader/18) using substring()= and the appropriate axis, in this case preceding-sibling::
- straightforward field to code
- 490
- used grouping/group-starting-with to handle repeating $a $x $v
- $3 easy to code
- output MARC field value including marc subfields as a string
- starting 340
- General observations (Theo)
- Theo still skipping over $6
- Summary of what's been coded on m2r-xxx.xsl file using comments
- Entering notes in spreadsheet for rows coded
- THEO SHOULD STOP DOING THIS; instead, every commit should reference the issue#
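Two of the techniques mentioned in Theo's notes above, removing terminal punctuation with a replace and grouping repeating $a $x $v with group-starting-with, can be sketched in Python for readers less familiar with XSLT. This is an illustration under assumptions: the punctuation set, the function names, and the (code, value) tuple model are hypothetical, and the punctuation function does not account for abbreviations that legitimately end in a period.

```python
import re
from typing import List, Tuple

Subfield = Tuple[str, str]  # (subfield code, value)


def strip_terminal_punctuation(value: str) -> str:
    """Drop trailing ISBD-style separators such as ' /', ' ;', ',' or '.'."""
    return re.sub(r"\s*[.,;:/=]+\s*$", "", value)


def group_starting_with(subfields: List[Subfield], start_code: str = "a") -> List[List[Subfield]]:
    """Start a new group at every subfield whose code equals start_code.

    Like group-starting-with in XSLT, the first subfield always opens a
    group, even when it does not match start_code.
    """
    groups: List[List[Subfield]] = []
    for code, value in subfields:
        if code == start_code or not groups:
            groups.append([])
        groups[-1].append((code, value))
    return groups
```

For a 490 with two series statements, each returned group holds one $a together with the $x and $v that follow it, so each statement can be processed as a unit.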
- Anything to add to the agenda?
- $3 AND $5 ISSUES. Mint an IRI for each $5. When $3 and $5 both appear: if there is a $3, data in $a is mapped to man (note on man) with $3 appended to the end (i.e. "applies to"); the item then has no description.
- ACTION ITEM: ADD TO AGENDA IN WEDNESDAY MEETING
- Approaching November 28 (SWIB)
- Is what we need to do clear?
- Main task: code fields on the board (Theo and Zhuo)
- Run code and review data; use Crystal's MARC data set; JUST DO THIS AS WE CODE FIELD BY FIELD; WE CAN RUN TESTS LATER
- Write some code to output labels rather than opaque identifiers in RDA output (Theo)
- what do we want it to look like?
- proposed: just do it separately and add both transforms to an XProc 1.0 pipeline
- If test data set has aggregates or diachronic works, we'll have to filter them out (or eliminate them from the set)
- If there are no aggregates/diachronics, maybe add some to demonstrate how we'll weed them out
- we do not have the criteria for weeding out these resources
- Let's not worry about those BSR placeholders
- let's not do them all; only the "obvious" ones; do it at-the-last-minute
- meetings
- Option 1: Just meet on Thursdays; if more discussion is required, either use Teams or email.
- Is Zhuo OK to use Teams/marc2rda?
- Option 2: schedule more meetings; do some work at meetings
- Theo asked Zhuo if there was anything he wanted to discuss. He replied, "it's all in the agenda."
- Notes were not added for previous meeting. What we did: we looked over the $5 work.
- Theo pointed out to Zhuo a possible division of labor; who will do what?
- Placeholders for BSR elements (Zhuo?)
- Fields ready to code on board (Theo?)
- run code and produce sample data (Zhuo?)
- Crystal loaded MARC records today in Github
- Re-useable code to output labels in identifiers rather than opaque identifiers (Theo or Zhuo)
- Anything else? Theo said no, nothing else. Meeting terminated at 3:10 PM.
- Theo working on 264.
- Parallel 264$a, $b, $c statements are not limited to two. Current code only accounts for entry to the left of the '=' and the entry to the right; however, there may be more than one equal sign. We should tokenize() using the '='.
- RDA properties for the parallel statements are soft deprecated (see https://www.rdaregistry.info/Aligns/alignSoft2Rec.html which displays 115 soft-deprecated properties). Current code uses soft-deprecated properties based on a mapping (i.e. the 264 spreadsheet) completed before we were aware (in MARC-to-RDA meetings) that these properties were soft-deprecated. Code will be rewritten using the "RecommendedLabel" rather than the "RedundantLabel." An item will be added to the next Wednesday meeting agenda to open a discussion.
- Theo will continue the 264; perhaps start the 245 now ready for transform; Zhuo will continue the "preprocessing" for $5.
- Some XSLT 3.0 instruments were introduced into the code; specifically text value templates were combined with use of the XPath 3.1 operator => ; when using, don't forget to use @expand-text! Both Theo and Zhuo agree that this operator improves readability compared to the usual approach of "layered" XSLT functions.
- Anything Zhuo would like to talk about? (still working on $5 preprocessing)
- Naming conventions for m2r-xxx-named.xsl (a) record decision in a README.md file in //Working Documents/Transformation Code (in the git repo). (b) move decisions into the README. (c) Theo will create the README. (d) the convention for named templates: @name="F264-x2-abc".
- $5 update (Done in #1 above). (a) No progress made on $5 coding in the m2r transforms.
- Timeline considerations. (a) We want to have an MVP by mid-November.
- workflow considerations. (a) when done, commit with a message that references the issue; if issue 32, reference it with "#32" (see decision index III.c). Team seems to be on the same page on workflow and selecting fields to work on; how to complete them is not entirely resolved.
- Upcoming week: Theo will work on 264 and the README. Zhuo will work on $5 preprocessing; if he finishes that, will select something to code from the project board.
- Zhuo will continue working on the "preprocessing" i.e. the "external dataset for organizations."
- Zhuo and Theo will code fields from the board as needed.
- We will not pursue accessing the Code List for Cultural Heritage Organizations over http this year; we will use the bulk download. This means we will miss updates to the LC data, so we should pursue this next year for certain. For now, we want to demonstrate how the mapping can guide a transform by the end of November.
- Theo will be away from work until the week of September 5.
- Next meeting Thursday, September 8
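The 264 note above proposes tokenizing on '=' because a parallel statement may contain more than one equals sign. A small Python sketch of that idea follows; it is not the project's XSLT, and the function name and sample strings are hypothetical.

```python
from typing import List


def split_parallel_statements(value: str) -> List[str]:
    """Split a subfield value on every '=' and trim each part.

    Handles any number of parallel statements, not just one to the left
    and one to the right of a single equals sign. Empty parts are dropped.
    """
    return [part.strip() for part in value.split("=") if part.strip()]
```

A value with two equals signs, for example "Wien = Vienna = Vienne", would produce three statements rather than two, and a value with no equals sign comes back as a single-item list.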
- Review Zhuo's 030--Code--code vs. spreadsheet--issue tracking--can we run it?
- The code looks perfect.
- code v spreadsheet looks fine
- issue tracking has correct label
- no attempt to run; will attempt outside meeting.
- $5--collection module-->assign to Zhuo!
- The output of this module will look something like the following:
- <www.marc2rda.edu/ColWor/aealjj> rdf:type rdac:Work ;
- hasMan <www.marc2rda.edu/ColMan/aealjj> ;
- hasNameOrWhatever “Collection of [lookup label in "Organizations scheme in MADSRDF format serialized as XML."]”
- hasAgentEtc <www.marc2rda.edu/agent/aealjj>;
- moreProperties moreValues . #if applicable
- <www.marc2rda.edu/agent/aealjj> properties values ;
- hasAppellationEtc http://id.loc.gov/vocabulary/organizations/aealjj .
- <www.marc2rda.edu/ColMan/aealjj> rdamd:hasAppellofMan “Collection of {aealjj-Label}” ;
- manOfWork <www.marc2rda.edu/ColWor/aealjj> ;
- moreProperties moreValues . #if applicable
- Zhuo will first attempt to download the data.
- Then Zhuo will try to access the data over http.
- At the meeting it was established that http GET could retrieve the RDF/XML but only with the header accept: application/rdf+xml.
- Theo isn't sure how to incorporate headers into document requests using a URL. Zhuo will experiment.
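On the open question of how to send headers with a document request: outside XSLT, an Accept header is set on the request object rather than in the URL. The Python sketch below uses only the standard library's urllib.request; the function names are hypothetical, and whether the same can be done from within the XSLT processor is not addressed here.

```python
import urllib.request


def build_rdfxml_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks the server for RDF/XML."""
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


def fetch_rdfxml(url: str, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_rdfxml_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_rdfxml("http://id.loc.gov/vocabulary/organizations/aealjj")` would request the organization entry used in the example output above; no response is shown here because it was not captured in these notes.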
2022-08-04 Present: Theo, Zhuo Regrets: Benjamin
Agenda and Notes
- Sita volunteered to help -- do we need help right now? Not now; perhaps in time.
- Approval of Workflow decisions Note: there's a new category on the main project workboard: "Almost Done"
- Anything on Zhuo's mind? It's good.
- How can we record issues specific to the transform? Create a new issue. Name the issue TXXX. Apply two labels: "XXX" and "Transform"
- Theo introduced pipeline design
- Review of this week's coding
- 6a. Include discussion of $5. Concerning $5: we probably shouldn't mint IRIs for ALL 500 fields with a $5 for all institutions. Probably shouldn't mint IRIs for other institutions' items. DEFINITELY should not assign the same IRI to different items.