python - lxml: difference between Element addnext() and insert() in handling tail


Given an lxml element xml, I iterate over its children c[0..n] by calling c.getnext(). Because I need to insert children on the fly if necessary, I can't use an iterator. The elements have both text and tail set.

Let me illustrate the different behavior of addnext() and insert() with the following example. Assume a simple XML string, parse it into an lxml tree, and then, for sanity's sake, inspect it:

    >>> import lxml.etree
    >>> s = "<p>This is <b>bold</b> and this is italic text.</p>"
    >>> # Create a new lxml element.
    >>> xml = lxml.etree.fromstring(s)
    >>> # Let's look at the element, its child, and their texts and tails.
    >>> lxml.etree.tostring(xml)
    b'<p>This is <b>bold</b> and this is italic text.</p>'
    >>> xml.text
    'This is '
    >>> xml.tail
    >>> xml[0].text
    'bold'
    >>> xml[0].tail
    ' and this is italic text.'

So far so good, and this is what I would have expected (for more on the lxml representation see here).

Now I want to wrap the word "italic" in <i> tags, just as "bold" is wrapped in <b> tags. To do that, I first find the index at which the "italic" substring starts:

    >>> # Find the index of the "italic" substring.
    >>> idx = xml[0].tail.find("italic")
    >>> idx
    13

Then I create the new lxml element:

    >>> # Create the new element and inspect it.
    >>> new_c = lxml.etree.fromstring("<i>italic</i>")
    >>> new_c.text
    'italic'
    >>> new_c.tail
    >>>

To insert the new element into the xml tree properly, I have to split the original xml[0].tail string into two substrings and remove "italic" from it:

    >>> new_c.tail = xml[0].tail[idx+len("italic"):]
    >>> xml[0].tail = xml[0].tail[:idx]
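The split itself can be factored into a small helper, which makes the invariant explicit: the two pieces plus the wrapped word must reassemble the original tail. This is only a sketch; split_around is a hypothetical name (not part of lxml), the sample tail is an assumption, and only the first occurrence of the word is handled.

```python
def split_around(text, word):
    """Split text at the first occurrence of word.

    Returns (before, after) with word itself removed, or None if
    word does not occur in text.
    """
    idx = text.find(word)
    if idx == -1:
        return None
    return text[:idx], text[idx + len(word):]


tail = " and this is italic text."
before, after = split_around(tail, "italic")
# before becomes the old element's tail, after the new element's tail.
assert before + "italic" + after == tail
```

Returning None for a missing word avoids the silent slicing bug that find() returning -1 would otherwise cause.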

Now everything is set to insert the new element into the xml element, and this is what puzzles me right now. Inserting the new child new_c after the given one, xml[0], gives different results depending on the method used, and the Element API doesn't give me any new information:

    >>> # Adds the element as a following sibling directly after this element.
    >>> # Note that tail text is automatically discarded when adding at the root level.
    >>> xml[0].addnext(new_c)
    >>> lxml.etree.tostring(xml)
    b'<p>This is <b>bold</b><i>italic</i> text. and this is </p>'

and

    >>> # Inserts a subelement at the given position in this element.
    >>> xml.insert(1 + xml.index(xml[0]), new_c)
    >>> lxml.etree.tostring(xml)
    b'<p>This is <b>bold</b> and this is <i>italic</i> text.</p>'

The two calls seem to handle the tail differently (see the comment on addnext() regarding the tail). Even taking that comment into account, the text is not discarded; instead the tail of <b> ends up appended to <i>. Nor is the root level handled differently from levels further down (i.e. the exact same behavior can be observed when wrapping the original XML in s in an additional <foo> tag).

What am I missing here?

Edit: there is a related discussion on the lxml mailing list here.
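The observation about nesting depth is easy to check. The following is only an illustrative sketch with a made-up sample string; wrap_with_addnext is a hypothetical helper, not part of lxml.

```python
from lxml import etree


def wrap_with_addnext(source):
    """Wrap the word "italic" in the tail of the first <b> using addnext()."""
    root = etree.fromstring(source)
    b = root.find(".//b")  # first <b> descendant of the root
    idx = b.tail.find("italic")
    i = etree.Element("i")
    i.text = "italic"
    i.tail = b.tail[idx + len("italic"):]
    b.tail = b.tail[:idx]
    b.addnext(i)
    return etree.tostring(root)


flat = wrap_with_addnext("<p>one <b>two</b> three italic four</p>")
nested = wrap_with_addnext("<foo><p>one <b>two</b> three italic four</p></foo>")
# The nested result is simply the flat result wrapped in <foo>...</foo>,
# so the tail handling does not depend on the nesting level.
assert nested == b"<foo>" + flat + b"</foo>"
```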

elem.addnext(nextelem) operates on the XML level, i.e. it adds nextelem directly after the element, moving the element's tail text behind the newly inserted element. This is done to make the new element the directly following sibling.

parent.insert(where, elem) works as if the parent element were a list of etree.Element objects. It puts the new element into that list without making any changes to the etree.Element instances. parent.append(elem) works the same way, as does any other list manipulation.

So, these functions have two different views of the element tree.
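A minimal sketch of that list view, using a small sample document chosen for illustration: after insert() and append(), the existing child still owns its tail, and the new elements arrive with none.

```python
from lxml import etree

a = etree.XML("<a>foo<b/>bar</a>")
b = a[0]

# insert() at the end of the child list: <c/> lands after b *and* b's tail.
a.insert(1, etree.Element("c"))
assert etree.tostring(a) == b"<a>foo<b/>bar<c/></a>"
assert b.tail == "bar"      # b keeps its tail
assert a[1].tail is None    # the new element gets none

# append() takes the same view of the tree.
a.append(etree.Element("d"))
assert etree.tostring(a) == b"<a>foo<b/>bar<c/><d/></a>"
assert [child.tag for child in a] == ["b", "c", "d"]
```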

    >>> from lxml import etree as ET
    >>>
    >>> x = ET.XML('<a>foo<b/>bar</a>')
    >>> y = ET.XML('<c>c!</c>')
    >>>
    >>> ET.dump(x)
    <a>foo<b/>bar</a>
    >>> x.find('b').addnext(y)
    >>> ET.dump(x)
    <a>foo<b/><c>c!</c>bar</a>

The tail moves from the b element to the c element, which keeps the XML document the same except for the inserted element.

Now, if the inserted element has a tail of its own, addnext inserts both the element and the text following it. They go directly after the XML element, not after the etree element-with-tail.

    >>> x = ET.XML('<a>foo<b/>bar</a>')
    >>> y = ET.XML('<c>c!</c>')
    >>> y.tail = 'more...'
    >>>
    >>> x.find('b').addnext(y)
    >>> ET.dump(x)
    <a>foo<b/><c>c!</c>more...bar</a>
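For the wrapping task in the question, where the tails have already been split by hand, the list view is the one that fits. A minimal sketch, assuming a sample string modeled on the question's; insert_after_tail is a hypothetical helper name, not part of lxml, and it assumes elem has a parent.

```python
from lxml import etree


def insert_after_tail(elem, new_elem):
    """Insert new_elem after elem and elem's tail text (list view).

    Assumes elem has a parent; a root element would need other handling.
    """
    parent = elem.getparent()
    parent.insert(parent.index(elem) + 1, new_elem)


p = etree.fromstring("<p>This is <b>bold</b> and this is italic text.</p>")
b = p[0]
idx = b.tail.find("italic")

i = etree.Element("i")
i.text = "italic"
i.tail = b.tail[idx + len("italic"):]   # " text."
b.tail = b.tail[:idx]                   # " and this is "

insert_after_tail(b, i)
result = etree.tostring(p)
```

With addnext() in place of the helper, the same pre-split tails would come out in swapped order, because the new element is placed between b and b's tail.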
