Understanding EMRFS Consistent View - qyjohn/AWS_Tutorials GitHub Wiki
emrfs create-metadata
Looking into the DynamoDB table, you will see the following item in it.
hashKey | rangeKey | deletionTTL | lastModified | payload |
---|---|---|---|---|
MultiKeyStoreTag | TableRole | 0 | 1585695395067 | AA== |
bucket="331982-syd"
aws s3 cp test.txt s3://$bucket/
hdfs dfs -cat s3://$bucket/test.txt
Looking into the DynamoDB table, you will see the content remains the same. That is: (a) using the AWS CLI (which is outside of EMRFS) to create files on S3 does not create entries in EMRFS metada; and (b) S3 objects read with HDFS commands (using EMRFS) are not automatically added to the metadata.
hashKey | rangeKey | deletionTTL | lastModified | payload |
---|---|---|---|---|
MultiKeyStoreTag | TableRole | 0 | 1585695395067 | AA== |
emrfs sync s3://$bucket/
Looking into the DynamoDB table, now you have the following items.
hashKey | rangeKey | deletionTTL | lastModified | payload | eTag |
---|---|---|---|---|---|
MultiKeyStoreTag | TableRole | 0 | 1585695395067 | AA== | |
/331982-syd | test.txt | 0 | 1585696510965 | CAEQABgA | d577273ff885c3f84dadb8578bb41399 |
$ aws s3 rm s3://$bucket/test.txt
delete: s3://331982-syd/test.txt
$ hdfs dfs -ls s3://$bucket/
20/03/31 23:23:35 ERROR consistency.ConsistencyCheckerS3FileSystem: No s3 object for metadata item /331982-syd/test.txt
20/03/31 23:23:35 INFO retry.RetryInvocationHandler: Exception while invoking ConsistencyCheckerS3FileSystem.listStatus over null. Retrying after sleeping for 16000ms.
com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: 1 items inconsistent (no s3 object for associated metadata item). First object: /331982-syd/test.txt
As you can see, deleting the object from S3 using AWS CLI does not delete the metadata. This results in the EMRFS inconsistency issue shown in the above-mentioned example.
Now we have some inconsistency, let's try a diff and sync.
$ emrfs diff s3://$bucket
20/03/31 23:40:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
EmrFsApplication.scala(91): dynamoDB endPoint = dynamodb.ap-southeast-2.amazonaws.com
EmrFsApplication.scala(99): s3 endPoint = s3.ap-southeast-2.amazonaws.com
EmrFsApplication.scala(107): sqs endPoint = sqs.ap-southeast-2.amazonaws.com
BOTH | MANIFEST ONLY | S3 ONLY
$ emrfs sync s3://$bucket
20/03/31 23:40:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
EmrFsApplication.scala(91): dynamoDB endPoint = dynamodb.ap-southeast-2.amazonaws.com
EmrFsApplication.scala(99): s3 endPoint = s3.ap-southeast-2.amazonaws.com
EmrFsApplication.scala(107): sqs endPoint = sqs.ap-southeast-2.amazonaws.com
Done syncing s3://331982-syd 0 added | 0 updated | 0 removed | 0 unchanged
creating 0 folder key(s)
[hadoop@ip-172-32-13-198 ~]$
Looking into the metadata table, we still have the same two items in the table. This is the interesting part. No difference is found when doing "emrfs diff", although the object has been deleted from S3 and there is a record for the object in metadata. Doing an "emrfs sync" does not fix the issue either.
In /etc/hadoop/conf/log4j.properties, enable DEBUG logging:
hadoop.root.logger=DEBUG,console
Run the "emrfs diff" command again, saw the following steps in the logs:
$ emrfs diff s3://331982-syd
- (1) DynamoDB DescribeTable
20/04/01 00:05:17 DEBUG amazonaws.request: Sending Request: POST https://dynamodb.ap-southeast-2.amazonaws.com / Headers: (amz-sdk-invocation-id: 9ef2ced0-b07c-70a4-0424-bc5c6c0bd06e, Content-Length: 29, Content-Type: application/x-amz-json-1.0, User-Agent: ElasticMapReduce/1.0.0 emrfs/s3c {}, aws-sdk-java/1.11.682 Linux/4.14.154-99.181.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.242-b08 java/1.8.0_242 scala/2.10.5 vendor/Oracle_Corporation, X-Amz-Target: DynamoDB_20120810.DescribeTable, )
- (2) DynamoDB GetItem
20/04/01 00:05:17 DEBUG amazonaws.request: Sending Request: POST https://dynamodb.ap-southeast-2.amazonaws.com / Headers: (amz-sdk-invocation-id: 505d26f1-f090-948e-6069-82fab0fc0652, Content-Length: 156, Content-Type: application/x-amz-json-1.0, User-Agent: ElasticMapReduce/1.0.0 emrfs/s3c {}, aws-sdk-java/1.11.682 Linux/4.14.154-99.181.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.242-b08 java/1.8.0_242 scala/2.10.5 vendor/Oracle_Corporation, X-Amz-Target: DynamoDB_20120810.GetItem, )
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "{"TableName":"EmrFSMetadata","Key":{"rangeKey":{"S":"TableRole"},"hashKey":{"S":"MultiKeyStoreTag"}},"ConsistentRead":true,"ReturnConsumedCapacity":"TOTAL"}"
- (3) DynamoDB Query - saw the deleted item in the table
20/04/01 00:05:17 DEBUG amazonaws.request: Sending Request: POST https://dynamodb.ap-southeast-2.amazonaws.com / Headers: (amz-sdk-invocation-id: 537cb27d-52fb-419a-2614-d9a9c9e5545c, Content-Length: 226, Content-Type: application/x-amz-json-1.0, User-Agent: ElasticMapReduce/1.0.0 emrfs/s3c {}, aws-sdk-java/1.11.682 Linux/4.14.154-99.181.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.242-b08 java/1.8.0_242 scala/2.10.5 vendor/Oracle_Corporation, X-Amz-Target: DynamoDB_20120810.Query, )
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "{"TableName":"EmrFSMetadata","Limit":2147483647,"ConsistentRead":true,"KeyConditions":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd"}],"ComparisonOperator":"EQ"}},"ScanIndexForward":true,"ReturnConsumedCapacity":"TOTAL"}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 << "{"ConsumedCapacity":{"CapacityUnits":1.0,"TableName":"EmrFSMetadata"},"Count":1,"Items":[{"deletionTTL":{"N":"0"},"payload":{"B":"CAEQABgA"},"eTag":{"S":"d577273ff885c3f84dadb8578bb41399"},"hashKey":{"S":"/331982-syd"},"rangeKey":{"S":"test.txt"},"lastModified":{"N":"1585696510965"}}],"ScannedCount":1}"
- (4) S3 ListObjectsV2 - correctly saw no objects in the bucket
20/04/01 00:05:17 DEBUG amazonaws.request: Sending Request: GET https://331982-syd.s3.ap-southeast-2.amazonaws.com / Parameters: ({"list-type":["2"],"delimiter":["/"],"prefix":[""],"fetch-owner":["false"]}Headers: (amz-sdk-invocation-id: 7fe2369b-a48f-db13-7ac4-9a4494c062ea, Content-Type: application/octet-stream, User-Agent: ElasticMapReduce/1.0.0 emrfs/s3c {}, aws-sdk-java/1.11.682 Linux/4.14.154-99.181.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.242-b08 java/1.8.0_242 scala/2.10.5 vendor/Oracle_Corporation, )
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-1 << "<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>331982-syd</Name><Prefix></Prefix><KeyCount>0</KeyCount><MaxKeys>1000</MaxKeys><Delimiter>/</Delimiter><IsTruncated>false</IsTruncated></ListBucketResult>[\r][\n]"
- (5) DynamoDB Parallel Scan and Query
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "X-Amz-Target: DynamoDB_20120810.Scan[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "{"TableName":"EmrFSMetadata","ScanFilter":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd/"}],"ComparisonOperator":"BEGINS_WITH"},"payload":{"ComparisonOperator":"NOT_NULL"}},"ReturnConsumedCapacity":"TOTAL","TotalSegments":5,"Segment":0}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 << "{"ConsumedCapacity":{"CapacityUnits":0.5,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":0}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-2 >> "X-Amz-Target: DynamoDB_20120810.Query[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-2 >> "{"TableName":"EmrFSMetadata","Limit":2147483647,"ConsistentRead":true,"KeyConditions":{"rangeKey":{"AttributeValueList":[{"S":"_$folder$"}],"ComparisonOperator":"BEGINS_WITH"},"hashKey":{"AttributeValueList":[{"S":"/331982-syd"}],"ComparisonOperator":"EQ"}},"ScanIndexForward":true,"ReturnConsumedCapacity":"TOTAL"}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "X-Amz-Target: DynamoDB_20120810.Scan[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 >> "{"TableName":"EmrFSMetadata","ScanFilter":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd/"}],"ComparisonOperator":"BEGINS_WITH"},"payload":{"ComparisonOperator":"NOT_NULL"}},"ReturnConsumedCapacity":"TOTAL","TotalSegments":5,"Segment":4}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-3 >> "X-Amz-Target: DynamoDB_20120810.Scan[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-3 >> "{"TableName":"EmrFSMetadata","ScanFilter":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd/"}],"ComparisonOperator":"BEGINS_WITH"},"payload":{"ComparisonOperator":"NOT_NULL"}},"ReturnConsumedCapacity":"TOTAL","TotalSegments":5,"Segment":2}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-0 << "{"ConsumedCapacity":{"CapacityUnits":0.5,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":0}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-4 >> "X-Amz-Target: DynamoDB_20120810.Scan[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-4 >> "{"TableName":"EmrFSMetadata","ScanFilter":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd/"}],"ComparisonOperator":"BEGINS_WITH"},"payload":{"ComparisonOperator":"NOT_NULL"}},"ReturnConsumedCapacity":"TOTAL","TotalSegments":5,"Segment":1}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-5 >> "X-Amz-Target: DynamoDB_20120810.Scan[\r][\n]"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-5 >> "{"TableName":"EmrFSMetadata","ScanFilter":{"hashKey":{"AttributeValueList":[{"S":"/331982-syd/"}],"ComparisonOperator":"BEGINS_WITH"},"payload":{"ComparisonOperator":"NOT_NULL"}},"ReturnConsumedCapacity":"TOTAL","TotalSegments":5,"Segment":3}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-2 << "{"ConsumedCapacity":{"CapacityUnits":1.0,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":0}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-3 << "{"ConsumedCapacity":{"CapacityUnits":0.5,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":1}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-4 << "{"ConsumedCapacity":{"CapacityUnits":0.5,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":0}"
20/04/01 00:05:17 DEBUG http.wire: http-outgoing-5 << "{"ConsumedCapacity":{"CapacityUnits":0.5,"TableName":"EmrFSMetadata"},"Count":0,"Items":[],"ScannedCount":1}"
So, the issue seems to be in the final Scan and Query steps, the Scan and Query returns nothing, although the item exists in the table. That is, the wrong ScanFilter is used in the Scan, or KeyConditions is used in the Query.
Will look further.