Note that there are some explanatory texts on larger screens.

plurals
  1. POpython xml search by attribute value
    text
    copied!<p>I'm working on a python utility to search and present the full path to a record in a very big config file, stored as xml file. The file size can be 12M and can hold 294460 lines, and can grow to much larger size.</p> <p>Here is an example(simplified):</p> <pre><code>&lt;?xml version="1.0" encoding="UTF-8" standalone="yes" ?&gt; &lt;root version="1.1.1"&gt; &lt;record path=""&gt; &lt;record path="path1"&gt; &lt;field name="some_name1" value="1234"/&gt; &lt;record path="path2"&gt; &lt;field name="0" value="abcd0"/&gt; &lt;field name="1" value="abcd1"/&gt; &lt;field name="2" value="abcd2"/&gt; &lt;field name="28" value="abcd28"/&gt; &lt;field name="29" value="abcd29"/&gt; &lt;/record&gt; &lt;/record&gt; &lt;record path="pathx"&gt; &lt;record path="pathy"&gt; &lt;record path="pathz"&gt; &lt;/record&gt; &lt;record path="pathv"&gt; &lt;record path="pathw"&gt; &lt;field name="some_name1" value="yes"/&gt; &lt;field name="some_name2" value="2084"/&gt; &lt;field name="some_buffer_name" value="14"/&gt; &lt;record path="cache_value"&gt; &lt;field name="some_name7000" value="12"/&gt; &lt;/record&gt; &lt;/record&gt; &lt;record path="path_something"&gt; &lt;field name="key_word1" value="8"/&gt; &lt;field name="key_word2" value="9"/&gt; &lt;field name="key_word3" value="10"/&gt; &lt;field name="key5" value="1"/&gt; &lt;field name="key6" value="1"/&gt; &lt;field name="key7" value="yes"/&gt; &lt;/record&gt; &lt;/record&gt; &lt;/root&gt; </code></pre> <p>I'm interested to run on the file and file all the nodes that hold the search string in the path or field attribute of the node. Because the node "type" (or node name) can be or record or field, and thus the name of the attribute can change from path to name.</p> <p>I used minidom to parse and search in the xml, but my code is taking too much resources and much too much time.</p> <p>This is what I wrote: xml_file_location is the location of the file string_to_search is the string that I search for in the XML file and the path to that node that I found is stored in the node of type: record , in an attribute named: path, and this is what I print to the user.</p> <pre><code>with open(xml_file_location, 'r') as inF: # search the xml file for lines with the string for search for line in inF: if string_to_search in line: found_counter = found_counter + 1 node_type = line.strip(" ").split(" ")[0] node_type = re.sub('[^A-Za-z0-9]+', '', node_type) node_attr = line.strip(" ").split(" ")[1] node_value = re.sub('[^A-Za-z0-9_]+', '', node_attr.split("=")[1]) node_attr = re.sub('[^A-Za-z0-9]+', '', node_attr.split("=")[0]) #print node_type #print node_attr #print node_value if node_type in lines_dict: if not node_attr in lines_dict[node_type]: lines_dict[node_type][node_attr] = [nome_value] elif not node_value in lines_dict[node_type][node_attr]: lines_dict[node_type][node_attr].append(node_value) else: lines_dict[node_type] = {} lines_dict[node_type][node_attr] = [node_value] print "Found: %s strings in the xml file" %found_counter #pp = pprint.PrettyPrinter(indent=4) #pp.pprint(lines_dict) print "Parsing the xml file" dom = parse(xml_file_location) print "Locating the full path" for node_type in lines_dict: # for all types of node elements = dom.getElementsByTagName(node_type) # create the elements for those nodes for node in elements: # go over all nodes in the elements if node.hasAttribute(node_attr): if node.getAttribute(node_attr) in lines_dict[node_type][node_attr]: # check if the attribute appears in the lines dict result = node.getAttribute(node_attr) # holds the path parent = node.parentNode # create a pointer to point on the parent node while parent.getAttribute("path") != "": # while didn't reach the root of the conf - a record that has an empty path attribute result = parent.getAttribute("path") + "." + result # add the path of the parent to the full path parent = parent.parentNode # advance the parent pointer print print "Found: %s" %node.toprettyxml().split("\n")[0] print "Path: %s" %result </code></pre> <p>for example: I will search for: abcd1 the utility will print the full path: path1.path2 or I will search for: pathw and the utility will return: pathx.pathy.pathv</p> <p>I understand that it's very inefficient, I go over all nodes in the config and compare them to what I put in the list_dic in the simple string search.</p> <p>I tried to use external modules to do it, but without success</p> <p>I'm looking for a more efficient way to do this kind of search and I very appreciate the help.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload