MGM Entity Extraction
Contents:
- Category description and use cases
- Output standard
- Recommended tool(s)
- Other evaluated tools
- Evaluation summary
Entity extraction, or named entity recognition (NER), is a type of natural language processing (NLP) that attempts to identify and classify entities or concepts, like people, places, organizations, products, and topics, in unstructured text. These extracted entities could then be reviewed (and possibly normalized) and added to item or collection descriptions as access points or tags. Another possible use case is reviewing extracted entities as a way to assess the accuracy of a speech-to-text transcription. If a collection manager is familiar with a collection, unusual entities may serve as a flag for poor transcription.
Audio is passed through a segmenter MGM to label speech, silence, and music. If necessary, the audio file is split into segments of speech. A new file composed of only the speech segments is sent through a speech-to-text MGM to generate transcripts. If necessary, timestamps are adjusted to restore the original segments of silence and music. The transcript is converted to plain text and sent through an entity extraction MGM to extract entity types of interest to the user. Output can be used to generate lists of terms for users to review, timed-text transcripts with entity annotations (JSON), or entities with time-offset annotations (JSON).
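As a rough sketch in Python, the workflow chains these steps; every function name below is a hypothetical placeholder for an MGM, not an actual AMP interface:
# Hypothetical sketch of the workflow above; each function stands in
# for an MGM and is a placeholder name, not an actual AMP interface.
def extract_entities_from_audio(audio_path):
    segments = segment_audio(audio_path)                   # label speech, silence, music
    speech_only = concatenate_speech(segments)             # file of speech segments only
    transcript = speech_to_text(speech_only)               # timecoded transcript
    transcript = restore_timestamps(transcript, segments)  # re-insert silence/music offsets
    text = transcript_to_plain_text(transcript)
    return entity_extraction(text)                         # entities JSON (see output standard)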
Parameters a user may want to control (see the sketch after this list):
- Score threshold?
- Entity types to use from output?
- Language
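A minimal sketch of how a score threshold and entity-type selection could be applied to output in the standard format defined below (field names follow that standard):
# Sketch: filter standard entity extraction output by user parameters.
def filter_entities(output, allowed_types, threshold=0.5):
    kept = []
    for entity in output["entities"]:
        if entity["type"] not in allowed_types:
            continue
        score = entity.get("score", {}).get("scoreValue")
        if score is not None and score < threshold:
            continue  # drop low-confidence/low-relevance entities
        kept.append(entity)
    return {"media": output["media"], "entities": kept}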
Summary:
| Element | Datatype | Obligation | Definition |
| --- | --- | --- | --- |
| media | object | required | Wrapper for metadata about the source media file. |
| media.filename | string | required | Filename of the source file. |
| media.characters | integer | required | Number of text characters from the file document evaluated by the NLP tool. |
| entities | array of objects | required | Wrapper for entities extracted from the document. |
| entities[*] | object | required | An entity extracted from the document. |
| entities[*].text | string | required | The entity text extracted from the document. |
| entities[*].type | string | required | The type of entity as classified by the tool or service. |
| entities[*].beginOffset | integer | required | The start character within the document of the extracted entity. |
| entities[*].endOffset | integer | required | The end of the entity string within the document (i.e., the offset of the character immediately after the last character of the entity). |
| entities[*].start | string (s.fff) | optional | The start time of the entity string within the media, referenced from the timecoded transcript or video OCR, in seconds. |
| entities[*].end | string (s.fff) | optional | The end time of the entity string within the media, referenced from the timecoded transcript or video OCR, in seconds. |
| entities[*].subtype | array of strings | optional | A list of subtypes of the entity type. (Provided by some NLP services, e.g. IBM Watson.) |
| entities[*].nounType | string (common \| proper) | optional | Whether the entity name is common or proper. (Used by Google NLP.) |
| entities[*].score | object | optional | A confidence or relevance score for the entity. |
| entities[*].score.type | string (confidence \| relevance) | required | The type of score, confidence or relevance. Confidence indicates the NLP service's confidence of correctly detecting the type of entity, while relevance indicates the importance of an entity to the document. (Of the candidates we tested, AWS Comprehend "score" would map to "confidence", while IBM Watson's "relevance" and Google NLP's "salience" would map to "relevance".) |
| entities[*].score.scoreValue | number | required | The score value, typically a float in the range 0-1. |
| entities[*].normalizedForm | object | optional | A normalized form of the entity. Some services group mentions of similar terms and label them with a normalized term that may or may not correspond to an entity from an external knowledge base or graph. |
| entities[*].normalizedForm.text | string | required | The normalized text form of the entity. |
| entities[*].normalizedForm.externalEntities | array | optional | A list of corresponding entity ids and/or URLs from external knowledge bases or graphs. |
| entities[*].normalizedForm.externalEntities[*] | object | required | A corresponding entity id and/or URL from an external knowledge base or graph. |
| entities[*].normalizedForm.externalEntities[*].source | string | required | The source of the external entity, e.g. "Wikipedia". |
| entities[*].normalizedForm.externalEntities[*].id | string | optional | An id for a corresponding external entity. |
| entities[*].normalizedForm.externalEntities[*].url | string | optional | A URL for a corresponding external entity. |
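The optional start and end times are taken from the timecoded transcript. A minimal sketch of the alignment, assuming each transcript word carries its character offset into the plain-text document along with start/end times in seconds (a hypothetical word shape, not an AMP format):
# Sketch: derive entity start/end times by aligning character offsets
# against a timecoded transcript; the word dict shape is hypothetical.
def add_times(entity, words):
    for word in words:
        word_end_offset = word["offset"] + len(word["text"])
        if word["offset"] <= entity["beginOffset"] < word_end_offset:
            entity["start"] = "%.3f" % word["start"]  # s.fff string per the standard
        if word["offset"] < entity["endOffset"] <= word_end_offset:
            entity["end"] = "%.3f" % word["end"]
    return entity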
Schema
{
"$schema": "http://json-schema.org/schema#",
"type": "object",
"title": "Entity Extraction Schema",
"required": [
"media",
"entities"
],
"properties": {
"media": {
"type": "object",
"title": "Media",
"description": "Wrapper for metadata about the source media file.",
"required": [
"filename",
"characters"
],
"properties": {
"filename": {
"type": "string",
"title": "Filename",
"description": "Filename of the source file.",
"default": "",
"examples": ["myfile.txt"]
},
"characters": {
"type": "integer",
"title": "Characters",
"description": "Number of text characters from the file document evaluated by the NLP tool.",
"default": "",
"examples": [47026]
}
}
},
"entities": {
"type": "array",
"title": "Entities",
"description": "Wrapper for entities extracted from the document.",
"items": {
"type": "object",
"required": [
"text",
"type",
"beginOffset",
"endOffset"],
"properties": {
"text": {
"type": "string",
"title": "Text",
"description": "The entity text extracted from the document.",
"default": "",
"examples": ["New York"]
},
"type": {
"type": "string",
"title": "Type",
"description": "The type of entity as classified by the tool or service.",
"default": "",
"examples": ["PERSON", "COMMERCIAL_ITEM"]
},
"beginOffset": {
"type": "integer",
"title": "Begin offset",
"description": "The start character within the document of the extracted entity.",
"default": 0,
"examples": [4637]
},
"endOffset": {
"type": "integer",
"title": "End offset",
"description": "The end of the entity string within the document, (i.e. the offset of the character immediately after the last character of the entity).",
"default": 0,
"examples": [4645]
},
"start": {
"type": "string",
"title": "Start",
"description": "The start time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.",
"default": "",
"examples": ["837.834"]
},
"end": {
"type": "string",
"title": "End",
"description": "The end time of entity string within the media, referenced from the timecoded transcript or video OCR, in seconds.",
"default": "",
"examples": ["838.79"]
},
"subtype": {
"type": "array",
"title": "Subtype",
"description": "A list of subtypes of the entity type (string). (Provided by some NLP services, ex. IBM Watson.)",
"items": {
"type": "string"
}
},
"nounType": {
"type": "string",
"title": "Noun type",
"description": "Whether the entity name is common or proper. (Used by Google NLP.)",
"enum": [
"proper",
"common"
]
},
"score": {
"type": "object",
"title": "Score",
"description": "A confidence or relevance score for the entity.",
"required": [
"type",
"scoreValue"
],
"properties": {
"type": {
"type": "string",
"title": "Type",
"description": "The type of score, confidence or relevance. Confidence indicates the NLP service’s confidence of correctly detecting the type of entity while relevance indicates the importance of an entity to the document. (Of the candidates we tested, AWS Comprehend “score” would map to “confidence” while IBM Watson’s “relevance” and Google NLP’s “salience” would map to “relevance”.",
"enum": [
"confidence",
"relevance"
]
},
"scoreValue": {
"type": "number",
"title": "Score value",
"description": "The score value, typically a float in the range of 0-1.",
"default": 0,
"examples": [0.437197]
}
}
},
"normalizedForm": {
"type": "object",
"title": "Normalized form",
"description": "A normalized form of the entity. Some services group mentions of similar terms and label them with a normalized term that may or may not correspond to an entity from an external knowledge base or graph.",
"required": ["text"],
"properties": {
"text": {
"type": "string",
"title": "Text",
"description": "The normalized text form of the entity.",
"default": "",
"examples": ["New York City"]
},
"externalEntities": {
"type": "array",
"title": "External entities",
"description": "A list of corresponding entity ids and/or urls from external knowledge bases or graphs.",
"items": {
"type": "object",
"required": ["source"],
"anyOf": [
{
"properties": {
"source": {
"type": "string",
"title": "Source",
"description": "The source of the external entity, ex. “Wikipedia”.",
"default": "",
"examples": ["Google Knowledge Graph"]
},
"id": {
"type": "string",
"title": "Id",
"description": "An id for a corresponding external entity.",
"default": "",
"examples": ["/m/09c7w0"]
}
}
},
{
"properties": {
"source": {
"type": "string",
"title": "Source",
"description": "The source of the external entity, ex. “Wikipedia”.",
"default": "",
"examples": ["Dbpedia"]
},
"url": {
"type": "string",
"title": "url",
"description": "A URL for a corresponding external entity.",
"default": "",
"examples": ["http://dbpedia.org/resource/New_York_City"]
}
}
},
{
"properties": {
"source": {
"type": "string",
"title": "Source",
"description": "The source of the external entity, ex. “Wikipedia”.",
"default": "",
"examples": ["Wikipedia"]
},
"id": {
"type": "string",
"title": "Id",
"description": "An id for a corresponding external entity.",
"default": "",
"examples": ["New_York_City"]
},
"url": {
"type": "string",
"title": "url",
"description": "A URL for a corresponding external entity.",
"default": "",
"examples": ["https://en.wikipedia.org/wiki/New_York_City"]
}
}
}
]
}
}
}
}
}
}
}
}
}
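Instances can be checked against this schema with any JSON Schema validator; for example, with the Python jsonschema package (the filenames here are placeholders):
import json
import jsonschema  # pip install jsonschema

# Validate an entity extraction output file against the schema above;
# the filenames are placeholders.
with open("entity_extraction_schema.json") as f:
    schema = json.load(f)
with open("entity_extraction_output.json") as f:
    instance = json.load(f)
jsonschema.validate(instance=instance, schema=schema)  # raises ValidationError on failure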
Sample Output - minimum
{
"media": {
"filename": "myfile.txt",
"characters": 47582
},
"entities": [
{
"text": "John Dewey",
"type": "Person",
"beginOffset": 14,
"endOffset": 24,
},
{
"text": "student success",
"type": "Concept",
"beginOffset": 56,
"endOffset": 81,
},
{
"text": "Bloomington",
"type": "Location",
"beginOffset": 93,
"endOffset": 104,
},
{
"text": "New York",
"type": "Location",
"beginOffset": 155,
"endOffset": 163,
}
]
}
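Because endOffset points just past the last character, beginOffset and endOffset behave like Python's half-open slice bounds; slicing the source document should reproduce the entity text (illustrative, assuming the sample's myfile.txt):
# endOffset is exclusive, so offsets work as half-open slice bounds.
document = open("myfile.txt").read()
entity = {"text": "John Dewey", "beginOffset": 14, "endOffset": 24}
assert document[entity["beginOffset"]:entity["endOffset"]] == entity["text"]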
Sample Output - full
{
"media": {
"filename": "myfile.txt",
"characters": 47582
},
"entities": [
{
"text": "John Dewey",
"type": "Person",
"beginOffset": 14,
"endOffset": 24,
"start": "22.888",
"end": "23.555",
"nounType": "proper",
"score": {
"type": "relevance",
"scoreValue": 0.0001677875
},
"normalizedForm": {
"text": "John Dewey",
"externalEntities": [
{
"source": "Google Knowledge Graph",
"id": "/m/04411"
},
{
"source": "Wikipedia",
"url": "https://en.wikipedia.org/wiki/John_Dewey"
}]
}
},
{
"text": "student success",
"type": "Concept",
"beginOffset": 56,
"endOffset": 81,
"start": "32.888",
"end": "33.555",
"nounType": "common",
"score": {
"type": "relevance",
"scoreValue": 0.00011485724
}
},
{
"text": "Bloomington",
"type": "Location",
"beginOffset": 93,
"endOffset": 104,
"start": "402.788",
"end": "403.955",
"nounType": "proper",
"subtype": ["City"],
"score": {
"type": "relevance",
"scoreValue": 0.0001677875
},
"normalizedForm": {
"text": "Bloomington",
"externalEntities": [
{
"source": "Google Knowledge Graph",
"id": "/m/0snty"
},
{
"source": "Wikipedia",
"url": "https://en.wikipedia.org/wiki/Bloomington,_Indiana"
}]
}
},
{
"text": "New York",
"type": "Location",
"beginOffset": 155,
"endOffset": 163,
"start": "837.834",
"end": "838.455",
"nounType": "proper",
"subtype": [
"PoliticalDistrict",
"GovernmentalJurisdiction",
"PlaceWithNeighborhoods",
"WineRegion",
"FilmScreeningVenue",
"City"],
"score": {
"type": "relevance",
"scoreValue": 0.433819
},
"normalizedForm": {
"text": "New York City",
"externalEntities": [
{
"source": "Google Knowledge Graph",
"id": "/m/02_286"
},
{
"source": "Wikipedia",
"url": "https://en.wikipedia.org/wiki/New_York_City"
}]
}
}
]
}
AWS Comprehend
Official documentation: https://aws.amazon.com/comprehend/
Language: 100 languages supported (https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html)
Description:
Cost: Requests for Entity Recognition, Sentiment Analysis, Syntax Analysis, Key Phrase Extraction, and Language Detection are measured in units of 100 characters, with a 3-unit (300-character) minimum charge per request, at $0.0001/unit (up to 10M units monthly).
Social impact:
Notes:
AWS Comprehend is run via the AWS Console or the AWS Comprehend API. The API can be called through the AWS Command Line Interface (CLI) or by invoking scripts with AWS Lambda functions. AWS offers SDKs in a variety of programming languages. For testing, the AWS CLI was used.
Each plain text file is uploaded to an S3 bucket, then referenced in a call to the API using the StartEntitiesDetectionJob method. (This method is used for texts over 5,000 characters; for texts under 5,000 characters, the DetectEntities method can be used.):
aws comprehend start-entities-detection-job --data-access-role-arn=arn:aws:iam::[access_role] --language-code=en --input-data-config S3Uri=s3://[input_bucket]/myfile.txt,InputFormat=ONE_DOC_PER_FILE --output-data-config S3Uri=s3://[output_bucket]/
This should return a job id and job status. Example:
{"JobId": "4ee1548fec685f96f76361276e588eba", "JobStatus": "SUBMITTED"}
To check on the status of a job, use the DescribeEntitiesDetectionJob method:
aws comprehend describe-entities-detection-job --job-id=1337d49aa78c092d510ef8394b545a13
When the job is complete the output is sent to the S3 bucket listed in the OutputDataConfig parameter from the initial request.
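The same two calls can also be made from Python with boto3 (a sketch; the role ARN and bucket names are placeholders, as above):
import boto3

# Sketch of the CLI calls above using boto3; bracketed values are placeholders.
comprehend = boto3.client("comprehend")
job = comprehend.start_entities_detection_job(
    DataAccessRoleArn="arn:aws:iam::[access_role]",
    LanguageCode="en",
    InputDataConfig={"S3Uri": "s3://[input_bucket]/myfile.txt",
                     "InputFormat": "ONE_DOC_PER_FILE"},
    OutputDataConfig={"S3Uri": "s3://[output_bucket]/"})
status = comprehend.describe_entities_detection_job(JobId=job["JobId"])
print(status["EntitiesDetectionJobProperties"]["JobStatus"])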
Full list of parameters: https://docs.aws.amazon.com/comprehend/latest/dg/API
- LanguageCode: default is English
Input formats: plain text
| Type | Description |
| --- | --- |
| COMMERCIAL_ITEM | A branded product |
| DATE | A full date (for example, 11/25/2017), day (Tuesday), month (May), or time (8:30 a.m.) |
| EVENT | An event, such as a festival, concert, election, etc. |
| LOCATION | A specific location, such as a country, city, lake, building, etc. |
| ORGANIZATION | Large organizations, such as a government, company, religion, sports team, etc. |
| OTHER | Entities that don't fit into any of the other entity categories |
| PERSON | Individuals, groups of people, nicknames, fictional characters |
| QUANTITY | A quantified amount, such as currency, percentages, numbers, bytes, etc. |
| TITLE | An official name given to any creation or creative work, such as movies, books, songs, etc. |
AWS Comprehend entity types mapped to common types used for testing:
{'COMMERCIAL_ITEM':'concept',
'DATE':'do not use',
'EVENT':'event',
'LOCATION':'location',
'ORGANIZATION':'organization',
'OTHER':'concept',
'PERSON':'person',
'QUANTITY':'do not use',
'TITLE':'concept'}
AWS Comprehend Example
aws comprehend start-entities-detection-job --data-access-role-arn=arn:aws:iam::[access_role] --language-code=en --input-data-config S3Uri=s3://[input_bucket]/myfile.txt,InputFormat=ONE_DOC_PER_FILE --output-data-config S3Uri=s3://[output_bucket]/
AWS Comprehend Output
{
"Entities": [
{
"BeginOffset": 16,
"EndOffset": 20,
"Score": 0.930534839630127,
"Text": "17th",
"Type": "DATE"
},
{
"BeginOffset": 22,
"EndOffset": 34,
"Score": 0.9784671664237976,
"Text": "This morning",
"Type": "DATE"
},
{
"BeginOffset": 72,
"EndOffset": 101,
"Score": 0.7616077661514282,
"Text": "Bryan Administration Building",
"Type": "ORGANIZATION"
},
{
"BeginOffset": 156,
"EndOffset": 166,
"Score": 0.6579800844192505,
"Text": "Ballantine",
"Type": "LOCATION"
},
{
"BeginOffset": 171,
"EndOffset": 183,
"Score": 0.9638428092002869,
"Text": "Rawles Halls",
"Type": "LOCATION"
},
{
"BeginOffset": 185,
"EndOffset": 198,
"Score": 0.9551753997802734,
"Text": "Five students",
"Type": "QUANTITY"
},
{
"BeginOffset": 234,
"EndOffset": 246,
"Score": 0.8644406795501709,
"Text": "this morning",
"Type": "DATE"
}
]
}
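Converting this output to the output standard is mostly field renaming plus the type mapping above; the project's actual conversion scripts are on the project GitHub, but a sketch might look like this (AWS_TYPE_MAP is an assumed name for the mapping dict shown earlier):
# Sketch: convert AWS Comprehend output to the AMP output standard.
# AWS_TYPE_MAP is an assumed name for the type mapping dict shown earlier.
def comprehend_to_amp(aws_output, filename, characters):
    entities = []
    for e in aws_output["Entities"]:
        mapped_type = AWS_TYPE_MAP[e["Type"]]
        if mapped_type == "do not use":
            continue  # e.g. DATE, QUANTITY
        entities.append({"text": e["Text"],
                         "type": mapped_type,
                         "beginOffset": e["BeginOffset"],
                         "endOffset": e["EndOffset"],
                         # Per the output standard, AWS "Score" is a confidence-type score.
                         "score": {"type": "confidence",
                                   "scoreValue": e["Score"]}})
    return {"media": {"filename": filename, "characters": characters},
            "entities": entities}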
SpaCy
Official documentation: https://spacy.io
Language: English
Description:
Cost: Free (open source)
Social impact:
Notes:
Install SpaCy as a Python library using pip or another preferred method.
Download models: https://spacy.io/usage/models (We used en_core_web_lg for testing.)
Example:
python -m spacy download en_core_web_lg
Model: pass the model name as an argument when loading the SpaCy pipeline.
Example:
nlp = spacy.load("en_core_web_lg")
Input formats: plain text
Full list at https://spacy.io/api/annotation#named-entities
| Type | Description |
| --- | --- |
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type. |
SpaCy types mapped to common types used for testing:
{'PERSON':'person',
'NORP':'concept',
'FAC':'concept',
'ORG':'organization',
'GPE':'location',
'LOC':'location',
'PRODUCT':'concept',
'EVENT':'event',
'WORK_OF_ART':'concept',
'LAW':'concept',
'LANGUAGE':'concept',
'DATE':'do not use',
'TIME':'do not use',
'PERCENT':'do not use',
'MONEY':'do not use',
'QUANTITY':'do not use',
'ORDINAL':'do not use',
'CARDINAL':'do not use'}
SpaCy Example
import spacy

# Load a downloaded model (en_core_web_lg was used for testing;
# en_core_web_sm is shown here).
nlp = spacy.load("en_core_web_sm")
text = u"Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
media = {"filename": "myfile.txt", "characters": len(text)}
entities = []
for ent in doc.ents:
    entity = {}
    entity['text'] = ent.text
    entity['type'] = ent.label_
    entity['beginOffset'] = ent.start_char
    entity['endOffset'] = ent.end_char
    entities.append(entity)
result = {"media": media, "entities": entities}
SpaCy Output
{
"media": {
"filename": "myfile.txt",
"characters": 54
},
"entities": [
{
"text": "Apple",
"type": "ORG",
"beginOffset": 0,
"endOffset": 5
},
{
"text": "U.K.",
"type": "GPE",
"beginOffset": 27,
"endOffset": 31
},
{
"text": "$1 billion",
"type": "MONEY",
"beginOffset": 44,
"endOffset": 54
}
]
}
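Continuing the example above, the SpaCy-to-common type mapping can be applied as a post-processing step (SPACY_TYPE_MAP is an assumed name for the mapping dict shown earlier; here ORG becomes organization, GPE becomes location, and MONEY is dropped):
# Sketch: apply the SpaCy type mapping to the result built above;
# SPACY_TYPE_MAP is an assumed name for the mapping dict shown earlier.
mapped = []
for entity in result["entities"]:
    common_type = SPACY_TYPE_MAP[entity["type"]]
    if common_type == "do not use":
        continue  # e.g. MONEY, DATE, CARDINAL
    mapped.append(dict(entity, type=common_type))
result["entities"] = mapped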
Stanford NER
Official documentation: https://nlp.stanford.edu/software/index.shtml
Language: Java, with bindings or translations for Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages
Cost: Free (open source)
Social impact:
Notes: Comes with three models: a 3-class model (Location, Person, Organization), a 4-class model (Location, Person, Organization, Misc), and a 7-class model (Location, Person, Organization, Money, Percent, Date, Time). Languages: Arabic, Chinese, English, French, German, Spanish; third-party models exist for Russian and Swedish.
https://stanfordnlp.github.io/CoreNLP/
Input formats: plain text
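Stanford NER is distributed as a Java jar and can be run from the command line; a typical invocation (jar and classifier paths are placeholders for wherever the distribution is unpacked) looks like:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile myfile.txt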
IBM Watson Natural Language Understanding
Official documentation: https://cloud.ibm.com/catalog/services/natural-language-understanding
Language: web service
Cost: Lite plan: 30,000 NLU items/month free. (An NLU item is based on the number of data units enriched and the number of enrichment features applied; a data unit is 10,000 characters or less. For example, extracting Entities and Sentiment from 15,000 characters of text is 2 data units × 2 enrichment features = 4 NLU items.)
Social impact: From their website: "By default, all Watson services log requests and their results. Logging is done only to improve the services for future users. The logged data is not shared or made public. To prevent IBM from accessing your data for general service improvements, please visit this site: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#data-collection"
Notes: Entities include Anatomy, Award, Broadcaster, Company, Crime, Drug, EmailAddress, Facility, GeographicFeature, HealthCondition, Hashtag, IPAddress, JobTitle, Location, Movie, MusicGroup, NaturalEvent, Organization, Person, PrintMedia, Quantity, Sport, SportingEvent, TelevisionShow, TwitterHandle, Vehicle, and subtypes of each of these. Results include relevance, subtypes, DBpedia names and links, and counts. Languages: Arabic, Chinese, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish.
Input formats: plain text, raw HTML, public URL (50,000-character limit)
Google Cloud Natural Language
Official documentation: https://cloud.google.com/natural-language/
Language: web service
Cost: Calculated in terms of "units": each document sent to the API for analysis is at least one unit, with one unit per 1,000 characters. The first 5,000 units per month are free, then $1 per 1,000 units. (For example, a 47,000-character transcript analyzed for entities is 47 units.)
Social impact:
Notes: Categories: Person, Organization, Event, Location, Consumer good, Work of art, Quantity, and Other. Output includes:
- salience (relevance to the document)
- Wikipedia URL
- Google Knowledge Graph ID
- a proper/common noun distinction
Language support varies by feature, but generally: English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish
Input formats: plain text, HTML
Scripts for converting MGM output formats and comparing results are on the project GitHub.
Analysis of custom vocabulary usage with SpaCy is in the project Google Drive.
Precision, Recall, and F1 scores for ground truth testing are in the project Google Drive.