CDATA and lxml


So first off, I know that CDATA is generally hated and just shouldn't be done, but I'm simply required to parse it and spit it back out. Parsing is pretty easy with lxml, but it's the spitting back out ...
Posted On: Saturday 10th of November 2012 02:24:04 AM Total Views:  154
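The question above is cut off, so the following is only a minimal sketch of one way to round-trip CDATA with lxml, assuming the goal is to keep existing CDATA sections and to write new ones. The sample document and the `note` element are made up for illustration.

```python
from lxml import etree

# strip_cdata=False asks the parser to keep CDATA sections as CDATA
# instead of merging them into ordinary text nodes.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring("<doc><![CDATA[a < b & c]]></doc>", parser)

# Reading works as usual: .text returns the unwrapped character data.
original_text = root.text

# To emit new CDATA content, wrap the string in etree.CDATA.
note = etree.SubElement(root, "note")  # hypothetical element name
note.text = etree.CDATA("x > y")

serialized = etree.tostring(root, encoding="unicode")
```

With the default parser settings the first CDATA section would be serialized back as escaped text, which is usually the part that surprises people.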

RELATED TOPICS OF Python Programming PROGRAMMING LANGUAGE




encoding in lxml

I have a problem with character encoding in lxml. Here's how it goes: I read an HTML document from a third-party site. It is supposed to be in UTF-8, but unfortunately from time to time it's not. I parse the document like this:
html_doc = HTML(string_with_document)
Then I retrieve some info from the document with XPath:
xpath_nodes = html_doc('/html/body/something')
Now I'm guaranteed that the xpath_nodes list contains only one element. So I read its content:
xpath_nodes[0].text
And I get an exception here. The exception is coming from the text property of an Element object. The problem is that the text contains a non-UTF-8 character. lxml seems to be using strict decoding and I can't find a way to make it ignore the error. Is there anything I can do to retrieve the text without getting an exception?
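One possible workaround, sketched here under the assumption that the document arrives as raw bytes, is to decode it yourself with a lenient error handler before handing the text to lxml. The helper name `parse_lenient` and the sample bytes are hypothetical.

```python
import lxml.html


def parse_lenient(raw_bytes, encoding="utf-8"):
    """Decode bytes leniently, then parse the resulting text as HTML."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw_bytes.decode(encoding, errors="replace")
    return lxml.html.fromstring(text)


# b"\xe9" is Latin-1 for an accented e and is not valid UTF-8 here.
raw = b"<html><body><something>caf\xe9 ok</something></body></html>"
doc = parse_lenient(raw)
nodes = doc.xpath("/html/body/something")
value = nodes[0].text
```

The trade-off is that the bad bytes are lost and show up as replacement characters. If the real encoding can be guessed, decoding with that encoding instead would preserve the text.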
Posted On: Saturday 3rd November 2012 Total Views: 149

When I do from lxml import etree I get this error: AttributeError: 'module' object has no attribute 'BytesIO'

I'm on Ubuntu 8.04.1. I installed lxml with the easy_install lxml command. Now, when I load etree, I get this error:

$ python
Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 40, in lxml.etree (src/lxml/lxml.etree.c:119415)
AttributeError: 'module' object has no attribute 'BytesIO'
>>>

Do you have any idea about this issue?
Posted On: Saturday 3rd November 2012 Total Views: 128

lxml 2.1 released

Hi all,

lxml 2.1 has been released to PyPI. This is the first official lxml release that builds and works on Python 2.3, 2.4, 2.5, 2.6 (beta) and 3.0 (beta).

http://codespeak.net/lxml/dev/
http://pypi.python.org/pypi/lxml/2.1

Install with
easy_install lxml==2.1

What is lxml?
"""
In short: lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language. lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API.
"""

Feedback is very much appreciated, especially on new features like the namespace cleanup function and on Python 2.6/3.0 support.

Have fun,
Stefan
Posted On: Saturday 3rd November 2012 Total Views: 206

lxml validation and xpath id function

Hi, I'm trying to use the .xpath('id("foo")') function on an lxml tree but can't get it to work. Given the following XML: And its XMLSchema: Or in more readable, compact RelaxNG, form:
element root { element child { attribute id { xsd:ID } } }
Now I'm trying to parse the XML and use the .xpath() method to find the element using the id XPath function:
from lxml import etree
schema_root = etree.parse(file('schema.xsd'))
schema = etree.XMLSchema(schema_root)
parser = etree.XMLParser(schema=schema)
root = etree.fromstring('', parser)
root.xpath('id("foo")') --> []
I was expecting to get the element with that last statement (well, inside a list, that is), but instead I just get an empty list. Is there anything obvious I'm doing wrong? As far as I can see, the lxml documentation says this should work. Cheers, Floris
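The XML from the post did not survive, so the document below is a hypothetical stand-in matching the RelaxNG description. As far as I know, the XPath id() function only finds attributes the parser itself has typed as ID, so a workaround that does not depend on ID typing is to match the attribute value directly. This is a sketch of that workaround, not a fix for id() itself.

```python
from lxml import etree

# Hypothetical document: a root with child elements carrying an id attribute.
root = etree.fromstring('<root><child id="foo"/><child id="bar"/></root>')

# Match on the attribute value; $wanted is an XPath variable supplied
# as a keyword argument, which avoids building the expression by hand.
matches = root.xpath("//*[@id=$wanted]", wanted="foo")
```

This scans the tree rather than using an ID index, so it may be slower on very large documents.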
Posted On: Sunday 4th November 2012 Total Views: 209

lxml 2.1 beta3 released

Hi all,

I'm proud to release lxml 2.1beta3 to PyPI. This is the first lxml release that builds and works on Python 2.3, 2.4, 2.5, 2.6 (beta) and 3.0 (beta).

http://codespeak.net/lxml/dev/
http://pypi.python.org/pypi/lxml/2.1beta3

Install with
easy_install lxml==2.1beta3

What is lxml?
"""
In short: lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language. lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API.
"""

Unusual for a beta release, the third beta contains more new features than bug fixes, which is largely (but not only) due to adaptations with respect to Python 3. The changelog follows below. I expect this to be the last beta release before 2.1 final.

Feedback is very much appreciated, especially on the "experimental" features like the namespace cleanup function and on Python 2.6/3.0 support. Your feedback will help in making the final release the best lxml ever.

Have fun,
Stefan

2.1beta3 (2008-06-19)

Features added
* Major overhaul of tools/xpathgrep.py script.
* Pickling ElementTree objects in lxml.objectify.
* Support for parsing from file-like objects that return unicode strings.
* New function etree.cleanup_namespaces(el) that removes unused namespace declarations from a (sub)tree (experimental).
* XSLT results support the buffer protocol in Python 3.
* Polymorphic functions in lxml.html that accept either a tree or a parsable string will return either a UTF-8 encoded byte string, a unicode string or a tree, based on the type of the input. Previously, the result was always a byte string or a tree.
* Support for Python 2.6 and 3.0 beta.
* File name handling now uses a heuristic to convert between byte strings (usually filenames) and unicode strings (usually URLs).
* Parsing from a plain file object frees the GIL under Python 2.x.
* Running iterparse() on a plain file (or filename) frees the GIL on reading under Python 2.x.
* Conversion functions html_to_xhtml() and xhtml_to_html() in lxml.html (experimental).
* Most features in lxml.html work for XHTML namespaced tag names (experimental).

Bugs fixed
* ElementTree.parse() didn't handle target parser result.
* Crash in Element class lookup classes when the __init__() method of the super class is not called from Python subclasses.
* A number of problems related to unicode/byte string conversion of filenames and error messages were fixed.
* Building on MacOS-X now passes the "flat_namespace" option to the C compiler, which reportedly prevents build quirks and crashes on this platform.
* Windows build was broken.
* Rare crash when serialising to a file object with certain encodings.

Other changes
* Non-ASCII characters in attribute values are no longer escaped on serialisation.
* Passing non-ASCII byte strings or invalid unicode strings as .tag, namespaces, etc. will result in a ValueError instead of an AssertionError (just like the tag well-formedness check).
* Up to several times faster attribute access (i.e. tree traversal) in lxml.objectify.
Posted On: Sunday 4th November 2012 Total Views: 147

Remove namespace declaration from ElementTree in lxml

I want to remove an unused namespace declaration from the root element of an ElementTree in lxml. There doesn't seem to be any documented way to do this, so at the moment I'm reduced to sticking the output through str.replace() ... which is somewhat inelegant. Is there a better way? -[]z.
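For lxml versions that ship etree.cleanup_namespaces() (the 2.1 release notes elsewhere on this page list it as a new function), a minimal sketch could look like the following. The namespace URIs and element names are invented for the example.

```python
from lxml import etree

# Hypothetical document: one namespace is declared but never used.
root = etree.fromstring(
    '<root xmlns:unused="http://example.com/unused" '
    'xmlns:used="http://example.com/used"><used:item/></root>'
)

# Drops namespace declarations that no element or attribute in the
# (sub)tree refers to; declarations still in use are kept.
etree.cleanup_namespaces(root)

cleaned = etree.tostring(root, encoding="unicode")
```

This works on the tree itself, so there is no need to post-process the serialized string.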
Posted On: Monday 5th November 2012 Total Views: 244

lxml + mod_python: cannot unmarshal code objects in restricted execution mode

Dmitri Fedoruk wrote:
> def extApplyXslt(xslt, data, logger ):
>     try:
>         strXslt = urllib2.urlopen(xslt).read()
>         # i have to read the xslt url to the python string
>     except urllib2.HTTPError, e:
>         .......
>     except urllib2.URLError, e:
>         .............
>     try:
>         xslt_parser = etree.XMLParser()
>         xslt_parser.resolvers.add( PrefixResolver("XSLT") )
>
>         # and now I have to use the string; a more elegant solution,
> anyone?

Sure, lxml.etree can parse from file-like objects. Just hand in the result of urlopen(). Apart from that, I saw that you found your way to the lxml mailing list, I'll respond over there.

Stefan

On Sep 14, 3:04 am, Graham Dumpleton wrote:
> Try forcing mod_python to run your code in the first interpreter
> instance created by Python.
> PythonInterpreter main_interpreter

Thank you very much, that solved the problem! A more detailed discussion can also be found in the lxml-dev mailing list ( http://comments.gmane.org/gmane.comp...xml.devel/2942 )

Dmitri
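Stefan's suggestion to hand a file-like object straight to the parser can be sketched roughly as follows. To keep the example self-contained, io.BytesIO stands in for the object returned by urlopen(), and the helper name `load_xslt` and the tiny stylesheet are hypothetical.

```python
import io

from lxml import etree


def load_xslt(source):
    """Build an XSLT transformer from any file-like object yielding XML."""
    # etree.parse() reads from file-like objects directly, so there is
    # no need to read the whole stylesheet into a string first.
    return etree.XSLT(etree.parse(source))


# io.BytesIO plays the role of the HTTP response object here.
stylesheet = io.BytesIO(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><out><xsl:value-of select="/in"/></out></xsl:template>
</xsl:stylesheet>""")

transform = load_xslt(stylesheet)
result = transform(etree.fromstring("<in>hello</in>"))
```

The original function also registered a custom resolver on the parser; that could still be done by passing a configured parser as the second argument to etree.parse().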
Posted On: Monday 5th November 2012 Total Views: 304

replacing xml elements with other elements using lxml

I'm attempting to generate a random story using XML as the document, and lxml as the parser. I want the document to be simplified before processing it further, and am very close to accomplishing my goal. Below is what I have so far. Any ideas on how to move forward?

The goal: read and edit the XML file, replacing random elements with randomly picked content from within.

Completed:
[x] read xml
[x] access first random tag
[x] pick random content within random item
[o] need to replace tag with picked contents

xml sample: Here is some content. Here is some random content. Here is some more random content. Here is some content.

Python code:
from lxml import etree
from StringIO import StringIO
import random
theXml = "Here is some content.Here is some random content.Here is some more random content.Here is some content."
f = StringIO(theXml)
tree = etree.parse(f)
r = tree.xpath('//random')
if len(r) > 0:
    randInt = random.randint(0, len(r[0]) - 1)
    randContents = r[0][randInt][0]
    #replace parent random tag with picked content here

Now that I have the contents tag randomly chosen, how do I delete the parent tag and replace it, so that it looks like this?

final xml sample (goal): Here is some content. Here is some random content. Here is some content.

Any idea on how to do this? So close!
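The markup in the post was stripped, so the sketch below assumes a structure of `<random>` elements containing `<item>` children, each wrapping one `<content>` element, which is what the indexing `r[0][randInt][0]` suggests. Element names other than `random`, and the helper `collapse_random`, are guesses made for illustration.

```python
import random

from lxml import etree


def collapse_random(root, rng=random):
    """Replace each <random> element with one randomly chosen <content>."""
    for rand_el in root.xpath("//random"):
        parent = rand_el.getparent()
        items = rand_el.findall("item")
        if parent is None or not items:
            continue  # nothing to replace, or <random> is the root itself
        chosen = rng.choice(items).find("content")
        if chosen is None:
            continue
        # In lxml the tail text travels with an element, so copy the tail
        # of <random> over to keep any text that followed it in place.
        chosen.tail = rand_el.tail
        parent.replace(rand_el, chosen)
    return root


story = etree.fromstring(
    "<story><content>Here is some content.</content>"
    "<random><item><content>Here is some random content.</content></item>"
    "<item><content>Here is some more random content.</content></item></random>"
    "<content>Here is some content.</content></story>"
)
collapse_random(story, random.Random(0))
texts = [child.text for child in story]
```

Passing a seeded random.Random instance makes the choice reproducible, which is handy for testing; in normal use the default module-level generator is enough.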
Posted On: Monday 5th November 2012 Total Views: 184

Re: [lxml-dev] Python script to optimize XML text

If your XML is well-formed, an XSLT is probably your best choice. I believe even the most trivial 'pass through' example might produce the output you expect here.
-- Sidnei da Silva, Enfold Systems, http://enfoldsystems.com, Fax +1 832 201 8856, Office +1 713 942 2377 Ext 214
Posted On: Monday 5th November 2012 Total Views: 242

lxml 1.3.6 released

Hi all,

lxml 1.3.6 is up on PyPI. This is a bug fix release for the stable 1.3 series. It features two important fixes for crash bugs. Updating is recommended.

http://codespeak.net/lxml/
http://pypi.python.org/pypi/lxml/

** Install it with $ easy_install lxml==1.3.6 **

What is lxml?
"""
In short: lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language. lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API.
"""

Have fun,
Stefan

1.3.6 (2007-10-29)
==================

Bugs fixed
----------
* Backported decref crash fix from 2.0
* Well hidden free-while-in-use crash bug in ObjectPath

Other changes
-------------
* The test suites now run ``gc.collect()`` in the ``tearDown()`` methods. While this makes them take a lot longer to run, it also makes it easier to link a specific test to garbage collection problems that would otherwise appear in later tests.
Posted On: Monday 5th November 2012 Total Views: 359

lxml 1.2 released

Hi all,

lxml 1.2 has been released to the cheeseshop.

http://cheeseshop.python.org/pypi/lxml

This is a somewhat conservative release in that it brings no major new features. It rather contains a number of bug fixes and cleanups, both internally and at the API level. Building lxml should have become easier again, and hacking the build process should now be a lot simpler. The complete changelog follows.

What is lxml?
"""
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more. Lxml also features a sophisticated API for custom element classes. This is a simple way to write arbitrary XML driven APIs on top of lxml. There is a separate module lxml.objectify that implements a data-binding API on top of lxml.etree.
"""

See the web page for more information and documentation: http://codespeak.net/lxml/

Have fun,
Stefan

==========
ChangeLog:
==========

1.2 (2007-02-20)
================

Features added
--------------
* Rich comparison of QName objects
* Support for regular expressions in benchmark selection
* get/set emulation (not .attrib!) for attributes on processing instructions
* ElementInclude Python module for ElementTree compatible XInclude processing that honours custom resolvers registered with the source document
* ElementTree.parser property holds the parser used to parse the document
* setup.py has been refactored for greater readability and flexibility
* --rpath flag to setup.py to induce automatic linking-in of dynamic library runtime search paths has been renamed to --auto-rpath. This makes it possible to pass an --rpath directly to distutils; previously this was being shadowed.

Bugs fixed
----------
* Element instantiation now uses locks to prevent race conditions with threads
* ElementTree.write() did not raise an exception when the file wasn't writable
* Error handling could crash under Python
Posted On: Monday 5th November 2012 Total Views: 135

[ANN] lxml 1.0 released

Hallo,

I have the honour to announce the availability of lxml 1.0.

http://codespeak.net/lxml/

It's downloadable from cheeseshop: http://cheeseshop.python.org/pypi/lxml

"""
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much, much more.

Its goals are:
* Pythonic API.
* Documented. http://codespeak.net/lxml/#documentation
* FAST! http://codespeak.net/lxml/performance.html
* Use Python unicode strings in API.
* Safe (no segfaults).
* No manual memory management! (as opposed to the official libxml2 Python bindings)
"""

While the list of features added since the last beta version (1.0.beta) is rather small, this version contains a large number of bug fixes found by various users and testers. Thank you all for your help!

Stefan

Features added since 0.9.2:
* Element.getiterator() and the findall() methods support finding arbitrary elements from a namespace (pattern {namespace}*)
* Another speedup in tree iteration code
* General speedup of Python Element object creation and deallocation
* Writing C14N no longer serializes in memory (reduced memory footprint)
* PyErrorLog for error logging through the Python logging module
* element.getroottree() returns an ElementTree for the root node of the document that contains the element.
* ElementTree.getpath(element) returns a simple, absolute XPath expression to find the element in the tree structure
* Error logs have a last_error attribute for convenience
* Comment texts can be changed through the API
* Formatted output via pretty_print keyword to serialization functions
* XSLT can block access to file system and network via XSLTAccessControl
* ElementTree.write() no longer serializes in memory (reduced memory footprint)
* Speedup of Element.findall(tag) and Element.getiterator(tag)
* Support for writing the XML representation of Elements and ElementTrees to Python unicode strings via etree.tounicode()
* Support for writing XSLT results to Python unicode strings via unicode()
* Parsing a unicode string no longer copies the string (reduced memory footprint)
* Parsing file-like objects now reads chunks rather than the whole file (reduced memory footprint)
* Parsing StringIO objects from the start avoids copying the string (reduced memory footprint)
* Read-only 'docinfo' attribute in ElementTree class holds DOCTYPE information, original encoding and XML version as seen by the parser
* etree module can be compiled without libxslt by commenting out the line include "xslt.pxi" near the end of the etree.pyx source file
* Better error messages in parser exceptions
* Error reporting now also works in XSLT
* Support for custom document loaders (URI resolvers) in parsers and XSLT, resolvers are registered at parser level
* Implementation of exslt:regexp for XSLT based on the Python 're' module, enabled by default, can be switched off with 'regexp=False' keyword argument
* Support for exslt extensions (libexslt) and libxslt extra functions (node-set, document, write, output)
* Substantial speedup in XPath.evaluate()
* HTMLParser for parsing (broken) HTML
* XMLDTDID function parses XML into tuple (root node, ID dict) based on xml:id implementation of libxml2 (as opposed to ET compatible XMLID)

Bugs fixed since 0.9.2:
* Memory leak in Element.__setitem__
* Memory leak in Element.attrib.items() and Element.attrib.values()
* Memory leak in XPath extension functions
* Memory leak in unicode related setup code
* Element now raises ValueError on empty tag names
* Namespace fixing after moving elements between documents could fail if the source document was freed too early
* Setting namespace-less tag names on namespaced elements ('{ns}t' -> 't') didn't reset the namespace
* Unknown constants from newer libxml2 versions could raise exceptions in the error handlers
* lxml.etree compiles much faster
* On libxml2
Posted On: Monday 5th November 2012 Total Views: 115

lxml 1.1.2 released

Hi,

after a month of bug tracing and fixing, lxml 1.1.2 finally made it to the cheeseshop.

http://cheeseshop.python.org/pypi/lxml

This is mainly a bugfix release for the stable and production-ready 1.1 series, the changelog is below. As there were a number of important fixes, updating is recommended.

What is lxml?
"""
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more. Lxml also features a sophisticated API for custom element classes. This is a simple way to write arbitrary XML driven APIs on top of lxml. There is a separate module lxml.objectify that implements a data-binding API on top of lxml.etree.
"""

See the web page for more information and documentation: http://codespeak.net/lxml/

Have fun,
Stefan

1.1.2 (2006-10-30)

Features added
* Data elements in objectify support repr(), which is now used by dump()
* Source distribution now ships with a patched Pyrex
* New C-API function makeElement() to create new elements with text, tail, attributes and namespaces
* Reuse original parser flags for XInclude
* Simplified support for handling XSLT processing instructions

Bugs fixed
* Parser resources were not freed before the next parser run
* Open files and XML strings returned by Python resolvers were not closed/freed
* Crash in the IDDict returned by XMLDTDID
* Copying Comments and ProcessingInstructions failed
* Memory leak for external URLs in _XSLTProcessingInstruction.parseXSL()
* Memory leak when garbage collecting tailed root elements
* HTML script/style content was not propagated to .text
* Show text xincluded between text nodes correctly in .text and .tail
* 'integer * objectify.StringElement' operation was not supported
Posted On: Monday 5th November 2012 Total Views: 156

lxml and schema validation

Hi, I am validating an XML file against an XSD (My.xsd), but I notice that the XSD has an include which pulls in another XSD (My1.xsd). I have written a simple program to validate this:
from lxml import etree
xmlschemadoc=etree.parse("My.xsd")
xmlschema=etree.XMLSchema(xmlschemadoc)
xmldoc=etree.parse("My.XML")
xmlschema.assertValid(xmldoc)
Will my program validate against both My.xsd and My1.xsd? I would also like my program to continue validating against the XSD and not stop at the first failure. My question is: how do I do that in Python? Regards, Hrishy
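For the second part of the question, one option is to call validate() instead of assertValid() and then read the schema's error_log, which collects the problems reported during that run rather than raising on the first one. The inline schema and document below are invented so the sketch runs without My.xsd; how many errors libxml2 reports for a given document is not something this example guarantees.

```python
from lxml import etree

# Hypothetical stand-in for My.xsd: a root holding integer <n> elements.
schema = etree.XMLSchema(etree.fromstring(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="n" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

doc = etree.fromstring("<root><n>1</n><n>oops</n><n>also bad</n></root>")

# validate() returns a bool instead of raising like assertValid().
is_valid = schema.validate(doc)

# error_log holds the messages from the last validation run.
problems = [(entry.line, entry.message) for entry in schema.error_log]
```

On the first part: when the schema is parsed from a file path, as in the post, an include with a relative location should normally be resolved relative to that file, so both schema files take part in validation. That is worth confirming with a document that violates only My1.xsd.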
Posted On: Wednesday 7th November 2012 Total Views: 137

lxml removing tag, keeping text order

Using lxml to clean up auto-generated XML to validate against a DTD; I need to remove an element tag but keep the text in order. For example:
s0 = ''' first text ladida emphasized text middle text last text '''
I want to get rid of the tag but keep everything else as it is; that is, I need this result:
first text ladida emphasized text middle text last text
I'm beginning to think this is an impossible task, so I'm asking here to see if there is some method that will work. What I've done so far is this (outer encloses the parent, outside is the parent, inside is the child to remove):

from lxml import etree
import copy

def rm_tag(elem, outer, outside, inside):
    newdiv = etree.Element(outside)
    newdiv.text = ''
    for e0 in elem.getiterator(outside):
        for i,e1 in enumerate(e0.getiterator()):
            if i == 0:
                if e1.text:
                    newdiv.text += e1.text
            elif (e1.tag != inside):
                newdiv.append(copy.deepcopy(e1))
            elif (e1.text):
                newdiv.text += e1.text
    for t in elem.getiterator():
        if t.tag == outer:
            t.clear()
            t.append(newdiv)
            break
    return etree.ElementTree(elem)

print etree.tostring(rm_tag(el,'option','optional','emphasis'),pretty_print=True)

But the text is messed up using this method. I see why it's wrong, but not how to make it right. It returns:
first text emphasized text ladida last text
Maybe I should send the outside element (via tostring) to a regexp for removing the child and return that string? Regexp? Getting desperate, hey. Any pointers much appreciated, --Tim Arnold
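Reasonably recent lxml versions provide etree.strip_tags(), which removes the named tags and merges their text and tail into the surrounding content in document order. The markup in the post was stripped, so the sample below only borrows the element names from the rm_tag() call and is otherwise hypothetical.

```python
from lxml import etree

# Hypothetical reconstruction using the names passed to rm_tag() above.
root = etree.fromstring(
    "<option><optional>first text <emphasis>emphasized text</emphasis>"
    " middle text</optional></option>"
)

# Removes every <emphasis> tag but keeps its text and tail where they were.
etree.strip_tags(root, "emphasis")

flattened = etree.tostring(root, encoding="unicode")
```

This avoids rebuilding the parent by hand, which is where the text order went wrong in the attempt above, since .text and .tail have to be stitched together in the right sequence.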
Posted On: Wednesday 7th November 2012 Total Views: 183

lxml codespeak pages empty ?

I can no longer get codespeak's lxml page at http://codespeak.net/lxml/ (I get an empty HTML document)... Am I alone in this case? Any codespeaker reading? A+ Laurent.
Posted On: Wednesday 7th November 2012 Total Views: 117

lxml 1.1 and 1.0.4 released

Hi all, I'm proud to announce the release of lxml 1.1, right after lxml 1.0.4. http://codespeak.net/lxml/ Download: http://cheeseshop.python.org/pypi/lxml/1.1 http://cheeseshop.python.org/pypi/lxml/1.0.4 lxml 1.1 is a major new release that introduces many new features compared to the 1.0 series. lxml 1.0.4 is a ...
Posted On: Saturday 10th November 2012 Total Views: 139