Matlab xmlread DTD issues - nickcounts/MDRT GitHub Wiki

Matlab xmlread() DTD Issues

Parsing XML files that contain a DOCTYPE (DTD) declaration can cause Matlab's xmlread() function to fail. The best explanation of, and solution to, this issue came from user Suever on stackexchange.com. It is reproduced here:

Solution

You need to disable external DTD loading for the parser. To accomplish this, create a parser factory (a javax.xml.parsers.DocumentBuilderFactory), disable external DTD loading on it, and pass it as the second input to xmlread in place of a DocumentBuilder.

From the hidden xmlread documentation (visible if you open the file with edit xmlread):

    % Advanced use:
    %   Note that FILENAME can also be an InputSource, File, or InputStream object
    %   DOMNODE = XMLREAD(FILENAME,...,P,...) where P is a DocumentBuilder object
    %   DOMNODE = XMLREAD(FILENAME,...,'-validating',...) will create a validating
    %             parser if one was not provided.
    %   DOMNODE = XMLREAD(FILENAME,...,ER,...) where ER is an EntityResolver will
    %             will set the EntityResolver before parsing
    %   DOMNODE = XMLREAD(FILENAME,...,EH,...) where EH is an ErrorHandler will
    %             will set the ErrorHandler before parsing
    %   [DOMNODE,P] = XMLREAD(FILENAME,...) will return a parser suitable for passing
    %             back to XMLREAD for future parses.
    %   

So this ends up looking something like this:

    % Create the DocumentBuilderFactory
    factory = javax.xml.parsers.DocumentBuilderFactory.newInstance;
    
    % Disable loading of external DTDs
    factory.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false);
    
    % Read your file
    xml = xmlread(filename, factory);

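Keep in mind that skipping the external DTD could result in your file being parsed incorrectly, for example if the document relies on entities or default attribute values that are defined only in that DTD.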
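To try the approach without touching real data, the following is a minimal sketch that writes a throwaway XML file whose DOCTYPE points at a DTD that does not exist, then parses it with external DTD loading disabled. The file contents, the name `missing.dtd`, and the variable names are made up for illustration. It also uses the second output of xmlread, described in the help text quoted above, to get a parser that can be reused for later files.

```matlab
% Write a small XML file whose DOCTYPE references a DTD that does not exist.
% The file name and contents are hypothetical, for illustration only.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'wt');
fprintf(fid, '<?xml version="1.0" encoding="UTF-8"?>\n');
fprintf(fid, '<!DOCTYPE note SYSTEM "missing.dtd">\n');
fprintf(fid, '<note><to>reader</to></note>\n');
fclose(fid);

% Configure the parser factory so the external DTD is never fetched
factory = javax.xml.parsers.DocumentBuilderFactory.newInstance;
factory.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false);

% The second output is a parser that can be passed back to xmlread later
[dom, parser] = xmlread(xmlFile, factory);
rootName = char(dom.getDocumentElement().getNodeName());

% Clean up the temporary file
delete(xmlFile);
```

If many files need to be read, passing `parser` back in as the second argument of later xmlread calls should avoid rebuilding the factory each time.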

Update

Looking into this more closely: once the DTD loading failure is out of the way, the File Exchange (FEX) function xml2struct does not handle the DOCTYPE entry in the XML correctly and tries to process it like a normal node. You could modify the source of xml2struct to detect this node type internally:

    if node.getNodeType == node.DOCUMENT_TYPE_NODE
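To show where such a check would sit, here is a minimal sketch of a loop over the top-level children of a parsed document that skips the DOCTYPE node. It is not taken from xml2struct; the in-memory XML text and the variable names are assumptions made for illustration, and the loop only collects node names instead of building a struct.

```matlab
% Parse a small document held in memory (xmlread accepts an InputSource).
% The XML text is a made-up example with an internal DOCTYPE.
xmlText = ['<?xml version="1.0"?>' ...
           '<!DOCTYPE note [<!ELEMENT note ANY>]>' ...
           '<note>hello</note>'];
source = org.xml.sax.InputSource(java.io.StringReader(xmlText));
dom = xmlread(source);

% Walk the document's direct children and ignore the DOCTYPE entry
children = dom.getChildNodes();
names = {};
for k = 0:children.getLength() - 1        % DOM node lists are zero-based
    node = children.item(k);
    if node.getNodeType() == node.DOCUMENT_TYPE_NODE
        continue                          % skip the DOCTYPE node
    end
    names{end + 1} = char(node.getNodeName()); %#ok<AGROW>
end
```

In a recursive converter, the same test would go at the top of whatever function handles a single node, before any attempt to read children or attributes.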

However, it would probably be easier to remove the DOCTYPE declarations from all of your XML files. The following script should do this; note that it overwrites each file in place.

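    % Folder containing the XML files to clean
    folder = 'directory/where/all/files/live';
    
    files = dir(fullfile(folder, '*.xml'));
    
    for k = 1:numel(files)
        filename = fullfile(folder, files(k).name);
    
        % Read the whole file as a single row of characters
        fid = fopen(filename, 'rt');
        content = fread(fid, '*char')';
        fclose(fid);
    
        % Remove any DOCTYPE declaration that follows a line break,
        % up to the end of the line it starts on
        newcontent = regexprep(content, '\n\s*?<!DOCTYPE.*?(?=\n)', '');
    
        % Overwrite the original file with the stripped content
        fout = fopen(filename, 'wt');
        fwrite(fout, newcontent);
        fclose(fout);
    end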
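Before running the script on a whole folder, it can help to check the regular expression on a string in memory. The sketch below applies the same pattern to two made-up inputs: one where the DOCTYPE follows the XML declaration, and one where the DOCTYPE is the very first line. Because the pattern requires a line break before `<!DOCTYPE`, the second input is left untouched. A DOCTYPE that spans several lines would also be removed only up to its first line break.

```matlab
% Same pattern as in the script above
pattern = '\n\s*?<!DOCTYPE.*?(?=\n)';

% Case 1: DOCTYPE on the second line, after the XML declaration
withDecl = sprintf('<?xml version="1.0"?>\n<!DOCTYPE note SYSTEM "note.dtd">\n<note/>\n');
stripped = regexprep(withDecl, pattern, '');

% Case 2: DOCTYPE on the first line, so no line break precedes it
firstLine = sprintf('<!DOCTYPE note SYSTEM "note.dtd">\n<note/>\n');
unchanged = regexprep(firstLine, pattern, '');

% Compare against the expected strings
expected = sprintf('<?xml version="1.0"?>\n<note/>\n');
isStripped  = strcmp(stripped, expected);
isUntouched = strcmp(unchanged, firstLine);
```

If your files start directly with the DOCTYPE line or use a multi-line DOCTYPE, the pattern would need to be adapted before the script is used.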