GSoC 2013 Proposal - Nakull/GSoC-XML2JSON GitHub Wiki
Title :
Short description: The aim of this project is to design and implement a library that converts XML data sets to JSON according to user-defined mappings. Round-trip conversion (XML -> JSON -> XML) will also be supported. The motivation is that the open source tools currently available for this conversion do not allow specific attributes and nodes to be mapped: they transform the complete XML document into JSON without taking user specifications into account. In most cases only a small portion of the XML needs to be transformed (into specific JSON fields), and that portion can be specified easily with XPath and JSON Schema, hence the need for this utility. The library will be easy to use and will have a simple, extensible interface. The mappings file will be a JSON Schema file in which the description tag of each element contains the XPath expression for that element. The user lists in the JSON Schema only those elements that need to be converted from XML. Some limitations may arise, because XPath and JSON Schema alone can leave ambiguities and anomalies in the conversion. I will document these and try to reduce them. The features of the library are described in more detail later in this proposal.
BASICS
Name: Nakull Gupta
E-mail: [email protected]
RESUME (link)
**1. Have you contacted the mentor about this project proposal? What are your expectations from the mentor?**
Yes. I am in touch with Mr. Yusuf Sagar regarding my interest in this project.
I am an ex-GSoC participant, so I have a good understanding of how this program works. I would ideally like to work independently and meet (online) with my mentor about twice a week. If something major crops up and I hit a roadblock, I will mail him immediately. Otherwise a weekly schedule can be followed in which I ask my queries and report my progress.
**2. What languages, libraries, toolkits and testing strategy will you use?**
The library will be written in Java and I will be using Eclipse as the IDE. Java packages such as javax.xml.xpath will be used for XPath parsing and evaluation. json-lib will be used to parse the JSON Schema and to create a JSON file from the XML.
All the code will be available on GitHub in a public repository. I will be making periodic check-ins. I will also create a wiki page for the project.
I will design the interfaces and classes to be extensible and easy to use. I will also write unit tests for each class and run the library on as large a dataset as possible.
I will be using Linux (Ubuntu 13.04) as my primary OS.
Other features of the library are covered at the end - in Features of the library.
- What would be the deliverables?
The deliverables are very concrete in this case. The library will support XML 2 JSON and JSON 2 XML transformation based on the mappings provided by the user. These mappings can be provided directly in a JSON Schema file. The JSON Schema will serve a dual purpose: it provides the mappings information and it can also be used to validate the generated JSON file.
The mappings for JSON -> XML can be provided in an XSD with JSONPath notations.
- What are your qualifications for this project?
I am an ex-GSoC student (2012); I worked for GDCM (Medical Imaging). Presently I am an intern at Microsoft Research India in the Advanced Development Group (finishing on 10th June).
In my previous GSoC project I built a tool for converting DICOM infosets to XML and vice versa. I also built an XSD and a RELAX NG schema to standardize the format of the XML file. As you might be aware, DICOM is a binary infoset with groups and elements (in the form of tags). Each particular type of element had to be mapped in a different way. I provided mappings for elements in all DICOM-supported datatypes. The library was exposed through command line utilities: dcm2xml and xml2dcm.
[ The wiki page for the project - http://gdcm.sourceforge.net/wiki/index.php/GSoC_2012_Export_From-To_XML and the code for the same - https://www.github.com/Nakull/GDCM ]
I have a strong grasp of XML, XSDs and XPath notations (used substantially in my last project). I have also used JSON in my prior projects, so I am familiar with it. I have done a number of projects in Java, so I am comfortable coding in it.
- How much time/week do you plan to spend on this project?
I plan to work 40+ hours per week, and more if required.
- Do you have any other commitments during the summer, such as finals, coursework, assistantship etc? This does not disqualify you from participating but you have to be upfront about how much time you'll be able to spend on your GSoC project.
My internship at Microsoft ends on 10th June (and GSoC coding starts on 17th June), so there will be no clash in priorities. This project will be my primary focus for the summer. I also have no finals or coursework during the GSoC timeline.
Features of the Library
Technologies I intend to use
javax.xml.xpath
DOM parser, SAX parser (Java packages)
[ All of these are Java packages, so they add no external dependencies. ]
I will be using JSON Schema (http://json-schema.org) as the mappings file.
As the Schema is itself JSON, we can easily parse it into a Java object. Two libraries that I am looking into are http://json-lib.sourceforge.net/ and https://code.google.com/p/google-gson/
To validate the JSON (against the mappings schema), I can use https://github.com/fge/json-schema-validator.
For the reverse transformation, mappings can be provided in an XSD and JSONPath notations can be used. To evaluate JSONPath I will use the https://code.google.com/p/json-path/ library.
Performance
The entire document is not converted, which leads to lower memory requirements and faster execution. The list of XPath expressions can be parsed up front so that the required nodes are loaded into memory once, instead of being looked up from the XML file again and again.
More details are below in the Initial Approach heading.
Features
Allow mappings to be supplied as a JSON Schema (listed below), which can easily be unmarshalled into a JSON object. This will be very intuitive, as the Schema is easy to read and understand.
Validation of the JSON against a schema (the mappings file can be used for this).
A great deal of effort will go into converting XPath results correctly to the JSON object. Multiple node sets returned will also be mapped correctly to their associated objects in JSON.
The JSON file can in turn be converted to an XML file. Initially I will provide a basic utility that converts the file as a whole to XML, without removing any objects or fields. This will provide a complete round-trip conversion: XML -> JSON -> XML.
I will also work on a similar reverse conversion utility that uses a mappings schema (an XSD) to match each XML element to its associated JSONPath expression. For this, a complete DOM model of the XML dataset will have to be created and populated by means of the JSONPath expressions.
Ease of use / Learning curve for a developer
Samples showing how to use the library will be provided. The user needs to know only the details of the interface (whose reference can point to different implementations of that interface, depending on the user). This means the developer goes through minimal steps to get the conversion.
eg.
Mappings m = new Mappings("mappings");
XML2JSON conv = new SpecificXML2JSON();
conv.convert("input.xml", "json.txt", flags);
// XML2JSON is an interface which is implemented by SpecificXML2JSON.
// flags is another object which holds certain user-defined settings for the conversion.
// The Mappings class will unmarshal the JSON Schema and get the mappings for each element.
Extensibility
The XML to JSON utility will initially provide a simple conversion facility. I will abstract the core functionality of the class into interfaces, so that a different implementation can be provided later to handle more complex XPath notations (or any major changes in the XPath / JSON notations).
Add-ons
Features to verify the correctness of the generated JSON by providing a utility to validate against the JSON Schema through the library itself.
The API will also be generic in nature, so it can easily be extended.
Proposed Timeline
May 3 - May 27
- Researching the Java XPath APIs that I will be using (and also the DOM/SAX parser APIs).
- Going through the links given by Mr. Yusuf.
- Getting more familiar with JSON, JSONPath and JSON Schema.
- Going through any coding/developer guidelines.
- Getting an understanding of the existing utilities for XML 2 JSON transformation.

Community Bonding Period: May 28 - June 16
- Finalizing the exact interfaces/API for the library.
- Discussing more with my mentor and documenting the possible issues and modifications required.
- Going through any other related projects at Emory University to get an idea of how development is done there.

June 17 - July 5
- Getting started on the first module: customized XML 2 JSON conversion.
- Major hurdle: mapping XPath result sets correctly to JSON fields (accurate mapping for multiple result sets).
- Working on reading mappings from a JSON Schema file.
- Resolving issues of associating similar nodes and fields.
- Writing unit tests for the module and testing it on different corner cases.
- Testing the module on some simple XML datasets and validating the generated JSON files.

July 6 - July 20
- Hopefully the first module will be finished by now, which should make the reverse transformation easier.
- Module: JSON 2 XML. For this I will provide two options:
  - Converting the JSON completely to XML (without a mappings file).
  - Using an XSD as a mappings file, with the XSD containing the JSONPath expressions (similar to the JSON Schema used for the XML 2 JSON direction).
- Testing: complete round-trip testing. Take an XML file, convert it to JSON, and convert the JSON back to XML (this should be the same as the original XML, for the part that was converted).
- Writing unit tests for the module and testing it on different corner cases, as in the previous module.

July 20 - 29
- Refactoring code and improving documentation.
- Getting ready for the mid-term submission.

August 1 - 20
- Improving the means of getting custom mappings from users.
- Adding more attributes to the JSON Schema.
- Increasing the efficiency and accuracy of handling multiple result sets from XPath queries.
- Continuing to test existing modules on different datasets and removing bugs accordingly.
- Verifying the generated JSON against the JSON Schema (mappings file) and providing stronger validation.
- Adding code to provide validation in the library itself, so that the user is exposed to a common interface.

August 21 - September 1
- Documenting any limitations of the library.
- Removing association problems and other anomalies as far as possible. If they cannot be removed completely, providing concrete documentation so that developers can use the library easily once they understand the limitations of the interface.
- Possibly allowing the JSON Schema to carry more parameters for more specific settings. (JSON will prove helpful for any future expansion.)

September 2 - 13
- Buffer period: time for any unforeseen bugs or roadblocks.
- Improving documentation and refactoring code (finishing touches).
- Creating a design and API manual to provide usage guidelines.

September 14 - 23
- Code submission to Google as well as deployment at Emory University.
Initial Approach ( Overview of Work Flow )
Instantiate a Java Document -> This is the easy part.
Using javax.xml.parsers and javax.xml.xpath:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(<uri_as_string>);
On this we can execute XPath queries ->
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(<xpath_expression>);
Relative or absolute XPaths can be specified.
The result can be retrieved as a NodeList (e.g. NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);)
or specific attributes can be obtained.
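The steps above can be put together as a minimal, JDK-only sketch. The class name `XPathQueryDemo`, its helper methods, and the sample XML are hypothetical and exist only for illustration; the real library may structure this differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathQueryDemo {

    // Hypothetical sample input, used only for this illustration.
    static final String SAMPLE_XML =
        "<patients>"
        + "<patient id=\"p1\" admitted=\"2013-01-05\"><name>Ann</name></patient>"
        + "<patient id=\"p2\" admitted=\"2013-02-11\"><name>Raj</name></patient>"
        + "</patients>";

    /** Parses an XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Evaluates an XPath expression and returns the text of every matching node. */
    static List<String> selectAll(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Works for both element nodes and attribute nodes.
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE_XML);
        System.out.println(selectAll(doc, "/patients/patient/name")); // element text
        System.out.println(selectAll(doc, "/patients/patient/@id"));  // attribute values
    }
}
```

Element text and attribute values come back through the same NODESET call, so a single helper can serve both kinds of mapping.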
Once we have this in place, we have to start building the JSON by fetching data through XPath.
The modified mappings file suggested by Mr. Nadir ->
[
{
"patient_id" : { "xpath" : "" , "type" : "patients/patient[0]/@id" } ,
"admitted" : { "xpath" : "" , "type" : "patients/patient[0]/@admitted" } ,
…………
} ,
{
"patient_id" : { "xpath" : "" , "type" : "patients/patient[1]/@id" } ,
"admitted" : { "xpath" : "" , "type" : "patients/patient[1]/@admitted" } ,
…………
}
]
can be replaced by an actual JSON Schema ->
{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "Patients",
"type": "array",
"items": {
"title": "Patient",
"type": "object",
"properties":{
"id":{
"description":"patients//patient/@id",
"type":"string"
},
"name":{
"type":"string",
"description": "patients//patient/name"
},
"admitted":{
"type":"string",
"description": "patients//patient/@admitted"
},
"address":{
"type":"string",
"description": "patients//patient/address/street , patients//patient/address/city , patients//patient/address/province"
},
"condition":{
"type":"string",
"description": "patients//patient/condition"
}
},
"required": ["id"]
}
}
This is a valid JSON Schema created in accordance with the standard specified at http://json-schema.org.
Thus the user only has to specify the general pattern; a separate mapping for each element is not required. Here I have added a required clause for "id", so only as many patient entries are created in the JSON array as there are ids (this feature can be implemented in the library).
Each property's description holds the corresponding XPath expression.
XPath expressions separated by "," will be evaluated individually and their results concatenated.
Also, if a patient tag in the XML is missing its id, that XML node will be omitted. Other such clauses can also be supported.
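The rules above (one JSON object per patient, comma-separated XPaths concatenated, patients without a required id omitted) can be sketched with the JDK alone. This is only an illustration under several assumptions that are not part of the proposal: the class `MappingSketch` and its `convert` method are hypothetical, the field XPaths are written relative to each patient node rather than rooted at the document as in the schema above, concatenated parts are joined with ", ", empty results are dropped, and plain Java maps stand in for JSON objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MappingSketch {

    // Hypothetical sample input: the second patient has no id.
    static final String SAMPLE_XML =
        "<patients>"
        + "<patient id=\"p1\"><name>Ann</name>"
        +   "<address><street>1 Elm St</street><city>Pune</city></address></patient>"
        + "<patient><name>NoId</name></patient>"
        + "<patient id=\"p3\"><name>Raj</name><condition>asthma</condition></patient>"
        + "</patients>";

    /**
     * Builds one map per node selected by rowPath. Each field XPath is
     * evaluated relative to that node; comma-separated XPaths are evaluated
     * one by one and their non-empty results joined with ", ".
     * Rows that lack the required field are skipped.
     */
    static List<Map<String, String>> convert(String xml, String rowPath,
            Map<String, String> fieldPaths, String requiredField) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(rowPath, doc, XPathConstants.NODESET);
            List<Map<String, String>> result = new ArrayList<Map<String, String>>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                Map<String, String> object = new LinkedHashMap<String, String>();
                for (Map.Entry<String, String> field : fieldPaths.entrySet()) {
                    StringBuilder value = new StringBuilder();
                    // Naive split: assumes no XPath in the list contains a comma itself.
                    for (String part : field.getValue().split(",")) {
                        String text = xpath.evaluate(part.trim(), row);
                        if (!text.isEmpty()) {
                            if (value.length() > 0) {
                                value.append(", ");
                            }
                            value.append(text);
                        }
                    }
                    if (value.length() > 0) {
                        object.put(field.getKey(), value.toString());
                    }
                }
                if (object.containsKey(requiredField)) {
                    result.add(object);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("id", "@id");
        fields.put("name", "name");
        fields.put("address", "address/street , address/city , address/province");
        fields.put("condition", "condition");
        System.out.println(convert(SAMPLE_XML, "/patients/patient", fields, "id"));
    }
}
```

Evaluating every field against the current patient node keeps each value attached to the patient it came from, which is one possible way to avoid the association problem discussed under the identified issues.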
Reading Mappings File and Creating JSON
As the Schema is itself JSON, we can easily parse it into a Java object. Two libraries that I am looking into are http://json-lib.sourceforge.net/ and https://code.google.com/p/google-gson/.
eg. using json-lib ->
JSONObject json = (JSONObject) JSONSerializer.toJSON( jsonText);
where jsonText is a string containing the JSON.
From this object we can easily get any attributes we require. Since I now have a JSON Schema, I can iterate through its properties and create a JSON object accordingly, populating it with values from the XML (via the XPath in "description"). As of now, I feel that a JSON Schema containing XPath notations is a good approach.
Validation by JSON Schema
The generated JSON can be validated against the JSON Schema (the mappings file itself), since it serves a dual purpose as both mappings and schema. A good library for this validation is https://github.com/fge/json-schema-validator. The API I create will wrap the function calls to this validation library and thus expose an easy-to-use interface for it (integrated with the utilities I create).
JSON to XML conversion
For this I will be using an XSD to specify the structure of the XML file; the XSD will contain the JSONPath notations that specify which JSON fields to use. The Java API's DOM model can be used to represent the XSD. https://code.google.com/p/json-path/ is a library I will use to parse and evaluate JSONPath expressions.
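As a rough illustration of the DOM-building half of this step, the sketch below creates an XML document from already-extracted field values and serializes it with the JDK's Transformer. It makes several assumptions that are not in the proposal: a list of plain maps stands in for the values that JSONPath evaluation would return, the element names `patients` and `patient` are fixed, the `id` field becomes an attribute, and every other field becomes a child element. The class `JsonToXmlSketch` is hypothetical, and no XSD is read here.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class JsonToXmlSketch {

    /** Builds a patients document from per-patient field maps and returns it as a string. */
    static String toXml(List<Map<String, String>> patients) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("patients");
            doc.appendChild(root);
            for (Map<String, String> fields : patients) {
                Element patient = doc.createElement("patient");
                root.appendChild(patient);
                for (Map.Entry<String, String> field : fields.entrySet()) {
                    if (field.getKey().equals("id")) {
                        // Assumed rule: id is written as an attribute.
                        patient.setAttribute("id", field.getValue());
                    } else {
                        // Assumed rule: every other field is a child element.
                        Element child = doc.createElement(field.getKey());
                        child.setTextContent(field.getValue());
                        patient.appendChild(child);
                    }
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> patient = new LinkedHashMap<String, String>();
        patient.put("id", "p1");
        patient.put("name", "Ann");
        System.out.println(toXml(Collections.singletonList(patient)));
    }
}
```

In the actual module the attribute-versus-element decision would come from the XSD rather than from a hard-coded rule.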
Currently Identified Issues and Limitations
As I have been going through the problem more closely, some more issues have come to mind.
For instance, in the example given on the ideas page, the XPath "patients//patient/condition" corresponds to "$.patients[*].condition" in JSONPath. Ideally we would assume that each result belongs to the same patient (parent) in both JSON and XML, but this might not be the case. An XPath query returns all the matching nodes, and some patients may not have a condition tag. Thus we cannot blindly associate the results with JSON patient objects by position. E.g. if patients 1 and 3 have a condition, then exactly those patients should have it in their corresponding JSON objects.
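One way to keep the association, sketched below with the JDK only, is to walk from each matched node back to its parent patient instead of pairing results with patients by position. The class `AssociationDemo`, the sample XML, and the use of the id attribute as the key are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AssociationDemo {

    // Hypothetical sample input: three patients, only the first and third have a condition.
    static final String SAMPLE_XML =
        "<patients>"
        + "<patient id=\"p1\"><condition>flu</condition></patient>"
        + "<patient id=\"p2\"/>"
        + "<patient id=\"p3\"><condition>asthma</condition></patient>"
        + "</patients>";

    /** Maps each patient id to its condition, using the parent of every matched node. */
    static Map<String, String> conditionByPatientId(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList conditions = (NodeList) xpath.evaluate(
                "//patient/condition", doc, XPathConstants.NODESET);
            Map<String, String> byId = new LinkedHashMap<String, String>();
            for (int i = 0; i < conditions.getLength(); i++) {
                Node condition = conditions.item(i);
                // The parent of a matched condition element is its patient element.
                Element owner = (Element) condition.getParentNode();
                byId.put(owner.getAttribute("id"), condition.getTextContent());
            }
            return byId;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Two results for three patients: pairing by index would give p2 the
        // condition that belongs to p3.
        System.out.println(conditionByPatientId(SAMPLE_XML));
    }
}
```

The query returns two nodes for three patients, so the parent lookup is what ties "asthma" to p3 rather than to the second patient.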
There might also be anomalies. E.g. //patient matches all patient nodes, no matter how deeply they are nested.
eg. XML->
<PatientsWithSimilarSymptoms>
<patient>
</patient>
</PatientsWithSimilarSymptoms>
Thus //patient will match even the inner nodes. [ This error can possibly be avoided on the user's end by giving the absolute path, but it still helps if the library can provide this correlation. ]
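The difference between the two forms can be checked with a small JDK-only sketch. The surrounding structure of the sample XML (a patient that contains a PatientsWithSimilarSymptoms block) is an assumption made for this illustration, as is the class name `DescendantAxisDemo`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescendantAxisDemo {

    // Hypothetical nesting: patient p7 is listed inside patient p1.
    static final String SAMPLE_XML =
        "<patients>"
        + "<patient id=\"p1\">"
        +   "<PatientsWithSimilarSymptoms><patient id=\"p7\"/></PatientsWithSimilarSymptoms>"
        + "</patient>"
        + "<patient id=\"p2\"/>"
        + "</patients>";

    /** Returns how many nodes the given XPath expression selects in the document. */
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(SAMPLE_XML, "//patient"));         // prints 3
        System.out.println(countMatches(SAMPLE_XML, "/patients/patient")); // prints 2
    }
}
```

The absolute path selects only the two top-level patients, while //patient also picks up the nested one.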
I am extremely excited about this project and about working with the people at Emory University. If you choose to accept my proposal, I assure you I will prove to be an asset and will work hard to fulfill all that I have set out to do.