On fixing DOMNameSpaceNode and DOM NS API Inconsistency Problems

  104687
March 12, 2019 22:54 kontakt@beberlei.de (Benjamin Eberlei)
Hi everyone,

While looking for things to work on in php-src my friend Thomas pointed me
to a peculiar special case in ext/dom that leads to massive inconsistency
problems in the API.

There is an undocumented class DOMNameSpaceNode that gets returned from
DOMElement::getAttributeNode(NS) if you select an attribute that represents
a namespace (xmlns). This special case is intentionally handled in the
code, contrary to Pythons DOM Extension which doesn't do this and contrary
to the DOM Specification, which does not have a special DOMNameSpaceNode.
Its all DOMAttr.

Problematically DOMNameSpaceNode doesn't extend from DOMAttr, you cannot
pass this class to DOMElement::removeAttributeNode(DOMAttr $attr):

    Fatal error: Uncaught TypeError: Argument 1 passed to
DOMElement::removeAttributeNode() must be an instance of DOMAttr, instance
of DOMNameSpaceNode given

Code example here: https://3v4l.org/jkC5s

In addition the DOMNameSpaceNode renames all properties compared to
DOMAttr, clearly violating the interface documented here
http://php.net/manual/de/domelement.getattributenode.php

Two potential fixes come to mind:

1. Have DOMNameSpaceNode extend DOMAttr - maybe this was the originally
intended behavior, becuase the properties are all named differently?
2. Have DOMElement::removeAttributeNode accept also a DOMNameSpaceNode
(3.) Remove the non standard DOMNameSpaceNode class and use a DOMAttr
instead

I think approach #1 is the right one from a BC perspective and also fixing
the DOM API to be more consistent with the standard.

But I am also not opposed to #3 in the longer term, looking at Github
DOMNamespaceNode in the open source world is only used for one apc test, in
PHP Compat and when dumping it in Symfony Var Dumper. I imagine its more
deeply used in closed source code working deeply with XML.

The second problem that this inconsistency creates is
DOMElement::removeAttributeNS($namespaceUri, $localName). When you want to
use it to remove a namespace attribute (xmlns). By default the xmlns
namespace has the uri http://www.w3.org/2000/xmlns/. Hence removing
xmlns:$name namespace would expected to be:

DOMElement::removeAttributeNS("http://www.w3.org/2000/xmlns/", "foo") //
removes xmlns:foo

But thats not how it works, there is special code implemented that requires
you to pick the URI from xmlns:foo value:

xmlns:foo="urn:foo" />

DOMElement::removeAttributeNS("urn:foo", "foo");

But if your node has an attribute in this namespace with the same name, it
gets removed as well:

xmlns:foo="urn:foo" foo:foo="bar" />

Would incorrectly become this, removing 2 attributes:



This is a bug that cannot be fixed in a BC way. I have no clue how to fix
this completly broken behavior in a good way. I propose to break BC here
and requiring "http://www.w3.org/2000/xmlns/" as namespaceURI from PHP 8.0
to delete a namespace (xmlns) attribute.

Should I put both issues into an RFC or are these rather bugs that can be
fixed without an RFC?

greetings
Benjamin
  104690
March 13, 2019 01:28 dzuelke@salesforce.com (David Zuelke)
I agree this should be fixed. It's pretty hilarious how this exact
case (fiddling with the xmlns prefix) is the only comment for
DOMElement::setAttributeNS:
http://php.net/manual/en/domelement.setattributens.php (although
unnecessary, as you don't have to "add namespaces" to a document, that
automatically happens internally).

My hypothesis as to why this special class even exists, and why it is
all implemented this way: whoever wrote that code didn't know about
this "hard-coded" part of the spec, or the spec wasn't even fully
finished regarding that part at that time.

For those who are curious, it's defined in Section 3 of the
"Namespaces in XML 1.1 (Second Edition)" spec:
https://www.w3.org/TR/2006/REC-xml-names11-20060816/#ns-decl

However, pay particular attention to the following note (emphasis mine):

"The prefix xmlns is used only to declare namespace bindings and is by
definition bound to the namespace name http://www.w3.org/2000/xmlns/.
It **must not be declared or undeclared**. Other prefixes must not be
bound to this namespace name, and it must not be declared as the
default namespace. Element names must not have the prefix xmlns."

The DOM Level 2 specification section 1.1.8 ("XML Namespaces")
contains another related definition:

"Note: In the DOM, all namespace declaration attributes are by
definition bound to the namespace URI:
"http://www.w3.org/2000/xmlns/". These are the attributes whose
namespace prefix or qualified name is "xmlns". Although, at the time
of writing, this is not part of the XML Namespaces specification
[Namespaces], it is planned to be incorporated in a future revision."

This means that both "xmlns:foo" and "xmlns" attributes are both in
this "hard-coded" namespace.

Both of these cases are already handled correctly by, I assume, the
underlying libxml; if you set "xmlns:blah" or "xmlns" qualified name
attributes in that namespace:

$doc = new DOMDocument();
$root = new DOMElement("root");
$doc->appendChild($root);
$root->setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:foo", "urn:bar");
$root->setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns", "urn:baz");
$root->setAttributeNS("urn:lol", "erp:foo", "whatup");
echo $doc->saveXML();

the namespace "xmlns" does not get declared, and PHP/libxml correctly produce:


xmlns:foo="urn:bar" xmlns="urn:baz" xmlns:erp="urn:lol" erp:foo="whatup"/>

Honestly, I can't imagine there being much code around that even
relies on any of the internal classes, at least not code that deserves
to be broken, because it was written by clowns who to this day would
claim that XML is terrible, when in fact they simply didn't understand
how namespaces work, that prefixes are irrelevant, and that XML
shouldn't be parsed with regular expressions ;)

There is really no use case in the DOM that ever requires anyone to
directly deal with "xmlns" attributes, as their creation (and removal)
is something that just happens automatically; the only real case where
you even have to worry about prefixes is with
DOMElement::setAttributeNS(), where you can cause prefix collisions.
But even that is trivial; internally, the mappings always get
optimized, e.g. in the case of duplicate prefixes for the same
namespace URI:

$doc = new DOMDocument();
$root = new DOMElement("root");
$doc->appendChild($root);
$root->setAttributeNS("urn:lol", "prefixone:foo", "whatup");
$root->setAttributeNS("urn:lol", "prefixtwo:bar", "whatup");
echo $doc->saveXML();

becomes, correctly:


xmlns:prefixone="urn:lol" prefixone:foo="whatup" prefixone:bar="whatup"/>

Anyway, I think an RFC and maybe some test cases would be good. I'll
be happy to help dig out my old XML-fu and help :)

David

On Tue, Mar 12, 2019 at 11:54 PM Benjamin Eberlei <kontakt@beberlei.de> wrote:
> > Hi everyone, > > While looking for things to work on in php-src my friend Thomas pointed me > to a peculiar special case in ext/dom that leads to massive inconsistency > problems in the API. > > There is an undocumented class DOMNameSpaceNode that gets returned from > DOMElement::getAttributeNode(NS) if you select an attribute that represents > a namespace (xmlns). This special case is intentionally handled in the > code, contrary to Pythons DOM Extension which doesn't do this and contrary > to the DOM Specification, which does not have a special DOMNameSpaceNode. > Its all DOMAttr. > > Problematically DOMNameSpaceNode doesn't extend from DOMAttr, you cannot > pass this class to DOMElement::removeAttributeNode(DOMAttr $attr): > > Fatal error: Uncaught TypeError: Argument 1 passed to > DOMElement::removeAttributeNode() must be an instance of DOMAttr, instance > of DOMNameSpaceNode given > > Code example here: https://3v4l.org/jkC5s > > In addition the DOMNameSpaceNode renames all properties compared to > DOMAttr, clearly violating the interface documented here > http://php.net/manual/de/domelement.getattributenode.php > > Two potential fixes come to mind: > > 1. Have DOMNameSpaceNode extend DOMAttr - maybe this was the originally > intended behavior, becuase the properties are all named differently? > 2. Have DOMElement::removeAttributeNode accept also a DOMNameSpaceNode > (3.) Remove the non standard DOMNameSpaceNode class and use a DOMAttr > instead > > I think approach #1 is the right one from a BC perspective and also fixing > the DOM API to be more consistent with the standard. > > But I am also not opposed to #3 in the longer term, looking at Github > DOMNamespaceNode in the open source world is only used for one apc test, in > PHP Compat and when dumping it in Symfony Var Dumper. I imagine its more > deeply used in closed source code working deeply with XML. > > The second problem that this inconsistency creates is > DOMElement::removeAttributeNS($namespaceUri, $localName). When you want to > use it to remove a namespace attribute (xmlns). By default the xmlns > namespace has the uri http://www.w3.org/2000/xmlns/. Hence removing > xmlns:$name namespace would expected to be: > > DOMElement::removeAttributeNS("http://www.w3.org/2000/xmlns/", "foo") // > removes xmlns:foo > > But thats not how it works, there is special code implemented that requires > you to pick the URI from xmlns:foo value: > > xmlns:foo="urn:foo" /> > > DOMElement::removeAttributeNS("urn:foo", "foo"); > > But if your node has an attribute in this namespace with the same name, it > gets removed as well: > > xmlns:foo="urn:foo" foo:foo="bar" /> > > Would incorrectly become this, removing 2 attributes: > > > > This is a bug that cannot be fixed in a BC way. I have no clue how to fix > this completly broken behavior in a good way. I propose to break BC here > and requiring "http://www.w3.org/2000/xmlns/" as namespaceURI from PHP 8.0 > to delete a namespace (xmlns) attribute. > > Should I put both issues into an RFC or are these rather bugs that can be > fixed without an RFC? > > greetings > Benjamin
  104691
March 13, 2019 05:50 pierre.php@gmail.com (Pierre Joye)
Hi,

Only adding Rob just in case :)

best,

On Wed, Mar 13, 2019, 5:54 AM Benjamin Eberlei <kontakt@beberlei.de> wrote:

> Hi everyone, > > While looking for things to work on in php-src my friend Thomas pointed me > to a peculiar special case in ext/dom that leads to massive inconsistency > problems in the API. > > There is an undocumented class DOMNameSpaceNode that gets returned from > DOMElement::getAttributeNode(NS) if you select an attribute that represents > a namespace (xmlns). This special case is intentionally handled in the > code, contrary to Pythons DOM Extension which doesn't do this and contrary > to the DOM Specification, which does not have a special DOMNameSpaceNode. > Its all DOMAttr. > > Problematically DOMNameSpaceNode doesn't extend from DOMAttr, you cannot > pass this class to DOMElement::removeAttributeNode(DOMAttr $attr): > > Fatal error: Uncaught TypeError: Argument 1 passed to > DOMElement::removeAttributeNode() must be an instance of DOMAttr, instance > of DOMNameSpaceNode given > > Code example here: https://3v4l.org/jkC5s > > In addition the DOMNameSpaceNode renames all properties compared to > DOMAttr, clearly violating the interface documented here > http://php.net/manual/de/domelement.getattributenode.php > > Two potential fixes come to mind: > > 1. Have DOMNameSpaceNode extend DOMAttr - maybe this was the originally > intended behavior, becuase the properties are all named differently? > 2. Have DOMElement::removeAttributeNode accept also a DOMNameSpaceNode > (3.) Remove the non standard DOMNameSpaceNode class and use a DOMAttr > instead > > I think approach #1 is the right one from a BC perspective and also fixing > the DOM API to be more consistent with the standard. > > But I am also not opposed to #3 in the longer term, looking at Github > DOMNamespaceNode in the open source world is only used for one apc test, in > PHP Compat and when dumping it in Symfony Var Dumper. I imagine its more > deeply used in closed source code working deeply with XML. > > The second problem that this inconsistency creates is > DOMElement::removeAttributeNS($namespaceUri, $localName). When you want to > use it to remove a namespace attribute (xmlns). By default the xmlns > namespace has the uri http://www.w3.org/2000/xmlns/. Hence removing > xmlns:$name namespace would expected to be: > > DOMElement::removeAttributeNS("http://www.w3.org/2000/xmlns/", "foo") // > removes xmlns:foo > > But thats not how it works, there is special code implemented that requires > you to pick the URI from xmlns:foo value: > > xmlns:foo="urn:foo" /> > > DOMElement::removeAttributeNS("urn:foo", "foo"); > > But if your node has an attribute in this namespace with the same name, it > gets removed as well: > > xmlns:foo="urn:foo" foo:foo="bar" /> > > Would incorrectly become this, removing 2 attributes: > > > > This is a bug that cannot be fixed in a BC way. I have no clue how to fix > this completly broken behavior in a good way. I propose to break BC here > and requiring "http://www.w3.org/2000/xmlns/" as namespaceURI from PHP 8.0 > to delete a namespace (xmlns) attribute. > > Should I put both issues into an RFC or are these rather bugs that can be > fixed without an RFC? > > greetings > Benjamin >
  104692
March 13, 2019 15:36 rrichards@ctindustries.net (Rob Richards)
There were specific reasons for it tho I can't recall all the details
off the top of my head. Part of it related to the underlying structures
of attributes and namespace nodes not being the same and cannot be user
interchangeable in the code, unless only inspected at the nodeptr level.
Trying to manipulate namespace nodes as attributes will cause memory
corruption so they need to be handled differently. I don't remember if
this was the only reason or there was more for the special class but
while I agree with basic DOM usage no one should ever need to be dealing
directly with namespace declarations but once you get into more complex
situation, especially that involve xpath, xslt, canonicalization and
even interoperability this starts coming into play and is quite
important to deal with namespace declarations directly.

Also some of the examples in previous messages are way overly simplistic
so the conclusions are not 100% accurate. i.e. removeAttributeNS. I am
assuming that its just due to overly aggressive namespace reconciliation
happening and removing the declaration since its no longer in use. With
a nested document which uses the namespace in child nodes I would expect
the namespace declaration to remain, tho possible it may be moved to the
first child using it (been a while since I looked at that code).

My recommendation would be to tread lightly here, look at every method
any changes to namespace declaration nodes would touch and make sure
they are all handle properly otherwise you run the risk of introducing
memory corruption issues which is far worse than some inconsistency for
supposedly unused and unneeded functionality ;)

BTW I have recently started looking at the libxml code base again and
considering getting more involved here again, assuming the toxicity
level has gone done from the past. Not sure this is the first thing I
would look at but feel free to shoot over any questions.

Rob

On 3/12/19 1:50 AM, Pierre Joye wrote:
> Hi, > > Only adding Rob just in case :) > > best, > > On Wed, Mar 13, 2019, 5:54 AM Benjamin Eberlei <kontakt@beberlei.de > <mailto:kontakt@beberlei.de>> wrote: > > Hi everyone, > > While looking for things to work on in php-src my friend Thomas > pointed me > to a peculiar special case in ext/dom that leads to massive > inconsistency > problems in the API. > > There is an undocumented class DOMNameSpaceNode that gets returned > from > DOMElement::getAttributeNode(NS) if you select an attribute that > represents > a namespace (xmlns). This special case is intentionally handled in the > code, contrary to Pythons DOM Extension which doesn't do this and > contrary > to the DOM Specification, which does not have a special > DOMNameSpaceNode. > Its all DOMAttr. > > Problematically DOMNameSpaceNode doesn't extend from DOMAttr, you > cannot > pass this class to DOMElement::removeAttributeNode(DOMAttr $attr): > >     Fatal error: Uncaught TypeError: Argument 1 passed to > DOMElement::removeAttributeNode() must be an instance of DOMAttr, > instance > of DOMNameSpaceNode given > > Code example here: https://3v4l.org/jkC5s > > In addition the DOMNameSpaceNode renames all properties compared to > DOMAttr, clearly violating the interface documented here > http://php.net/manual/de/domelement.getattributenode.php > > Two potential fixes come to mind: > > 1. Have DOMNameSpaceNode extend DOMAttr - maybe this was the > originally > intended behavior, becuase the properties are all named differently? > 2. Have DOMElement::removeAttributeNode accept also a DOMNameSpaceNode > (3.) Remove the non standard DOMNameSpaceNode class and use a DOMAttr > instead > > I think approach #1 is the right one from a BC perspective and > also fixing > the DOM API to be more consistent with the standard. > > But I am also not opposed to #3 in the longer term, looking at Github > DOMNamespaceNode in the open source world is only used for one apc > test, in > PHP Compat and when dumping it in Symfony Var Dumper. I imagine > its more > deeply used in closed source code working deeply with XML. > > The second problem that this inconsistency creates is > DOMElement::removeAttributeNS($namespaceUri, $localName). When you > want to > use it to remove a namespace attribute (xmlns). By default the xmlns > namespace has the uri http://www.w3.org/2000/xmlns/. Hence removing > xmlns:$name namespace would expected to be: > > DOMElement::removeAttributeNS("http://www.w3.org/2000/xmlns/", > "foo") // > removes xmlns:foo > > But thats not how it works, there is special code implemented that > requires > you to pick the URI from xmlns:foo value: > > xmlns:foo="urn:foo" /> > > DOMElement::removeAttributeNS("urn:foo", "foo"); > > But if your node has an attribute in this namespace with the same > name, it > gets removed as well: > > xmlns:foo="urn:foo" foo:foo="bar" /> > > Would incorrectly become this, removing 2 attributes: > > > > This is a bug that cannot be fixed in a BC way. I have no clue how > to fix > this completly broken behavior in a good way. I propose to break > BC here > and requiring "http://www.w3.org/2000/xmlns/" as namespaceURI from > PHP 8.0 > to delete a namespace (xmlns) attribute. > > Should I put both issues into an RFC or are these rather bugs that > can be > fixed without an RFC? > > greetings > Benjamin >
  104774
March 17, 2019 13:33 kontakt@beberlei.de (Benjamin Eberlei)
On Wed, Mar 13, 2019 at 4:36 PM Rob Richards <rrichards@ctindustries.net>
wrote:

> There were specific reasons for it tho I can't recall all the details off > the top of my head. Part of it related to the underlying structures of > attributes and namespace nodes not being the same and cannot be user > interchangeable in the code, unless only inspected at the nodeptr level. > Trying to manipulate namespace nodes as attributes will cause memory > corruption so they need to be handled differently. I don't remember if this > was the only reason or there was more for the special class but while I > agree with basic DOM usage no one should ever need to be dealing directly > with namespace declarations but once you get into more complex situation, > especially that involve xpath, xslt, canonicalization and even > interoperability this starts coming into play and is quite important to > deal with namespace declarations directly. >
That is a very helpful insight, thank you Rob. From the few hours i poked in the code base memory management seems to be tricky abstracing libxml behind this powerful DOM API. Also some of the examples in previous messages are way overly simplistic so
> the conclusions are not 100% accurate. i.e. removeAttributeNS. I am > assuming that its just due to overly aggressive namespace reconciliation > happening and removing the declaration since its no longer in use. With a > nested document which uses the namespace in child nodes I would expect the > namespace declaration to remain, tho possible it may be moved to the first > child using it (been a while since I looked at that code). >
I'll take a look again how that extra attr gets removed. However it doesn't change the fact that semantically DOMElement::removeAttributeNS(" http://www.w3.org/2000/xmlns/", "foo") doesnt work.
> > My recommendation would be to tread lightly here, look at every method any > changes to namespace declaration nodes would touch and make sure they are > all handle properly otherwise you run the risk of introducing memory > corruption issues which is far worse than some inconsistency for supposedly > unused and unneeded functionality ;) > > BTW I have recently started looking at the libxml code base again and > considering getting more involved here again, assuming the toxicity level > has gone done from the past. Not sure this is the first thing I would look > at but feel free to shoot over any questions. >
This sounds awesome :-) I can't comment on toxicity levels of the past as I only follow this list for about 3-4 years now, but it has gone down in those years I feel. It is still a draft but Thomas and I have started working on an RFC and code to update ext/dom to cover the latest standard release: https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing that soon, maybe you have some feedback.
> Rob > > On 3/12/19 1:50 AM, Pierre Joye wrote: > > Hi, > > Only adding Rob just in case :) > > best, > > On Wed, Mar 13, 2019, 5:54 AM Benjamin Eberlei <kontakt@beberlei.de> > wrote: > >> Hi everyone, >> >> While looking for things to work on in php-src my friend Thomas pointed me >> to a peculiar special case in ext/dom that leads to massive inconsistency >> problems in the API. >> >> There is an undocumented class DOMNameSpaceNode that gets returned from >> DOMElement::getAttributeNode(NS) if you select an attribute that >> represents >> a namespace (xmlns). This special case is intentionally handled in the >> code, contrary to Pythons DOM Extension which doesn't do this and contrary >> to the DOM Specification, which does not have a special DOMNameSpaceNode. >> Its all DOMAttr. >> >> Problematically DOMNameSpaceNode doesn't extend from DOMAttr, you cannot >> pass this class to DOMElement::removeAttributeNode(DOMAttr $attr): >> >> Fatal error: Uncaught TypeError: Argument 1 passed to >> DOMElement::removeAttributeNode() must be an instance of DOMAttr, instance >> of DOMNameSpaceNode given >> >> Code example here: https://3v4l.org/jkC5s >> >> In addition the DOMNameSpaceNode renames all properties compared to >> DOMAttr, clearly violating the interface documented here >> http://php.net/manual/de/domelement.getattributenode.php >> >> Two potential fixes come to mind: >> >> 1. Have DOMNameSpaceNode extend DOMAttr - maybe this was the originally >> intended behavior, becuase the properties are all named differently? >> 2. Have DOMElement::removeAttributeNode accept also a DOMNameSpaceNode >> (3.) Remove the non standard DOMNameSpaceNode class and use a DOMAttr >> instead >> >> I think approach #1 is the right one from a BC perspective and also fixing >> the DOM API to be more consistent with the standard. >> >> But I am also not opposed to #3 in the longer term, looking at Github >> DOMNamespaceNode in the open source world is only used for one apc test, >> in >> PHP Compat and when dumping it in Symfony Var Dumper. I imagine its more >> deeply used in closed source code working deeply with XML. >> >> The second problem that this inconsistency creates is >> DOMElement::removeAttributeNS($namespaceUri, $localName). When you want to >> use it to remove a namespace attribute (xmlns). By default the xmlns >> namespace has the uri http://www.w3.org/2000/xmlns/. Hence removing >> xmlns:$name namespace would expected to be: >> >> DOMElement::removeAttributeNS("http://www.w3.org/2000/xmlns/", "foo") // >> removes xmlns:foo >> >> But thats not how it works, there is special code implemented that >> requires >> you to pick the URI from xmlns:foo value: >> >> xmlns:foo="urn:foo" /> >> >> DOMElement::removeAttributeNS("urn:foo", "foo"); >> >> But if your node has an attribute in this namespace with the same name, it >> gets removed as well: >> >> xmlns:foo="urn:foo" foo:foo="bar" /> >> >> Would incorrectly become this, removing 2 attributes: >> >> >> >> This is a bug that cannot be fixed in a BC way. I have no clue how to fix >> this completly broken behavior in a good way. I propose to break BC here >> and requiring "http://www.w3.org/2000/xmlns/" as namespaceURI from PHP >> 8.0 >> to delete a namespace (xmlns) attribute. >> >> Should I put both issues into an RFC or are these rather bugs that can be >> fixed without an RFC? >> >> greetings >> Benjamin >> > >
  104775
March 17, 2019 13:52 cananian@wikimedia.org ("C. Scott Ananian")
On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de> wrote:

> > It is still a draft but Thomas and I have started working on an RFC and > code to update ext/dom to cover the latest standard release: > https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing > that soon, maybe you have some feedback. >
Updating the DOM extension would be something the Wikimedia Foundation would very much like to see happen. It's more complicated than just adding some new methods, though: there are significantly spec-compliance issues with the current code and performance problems too. We've been porting code from JS to PHP which (in the JS version) used a good spec-compliance DOM implementation, and have been keeping a list of all the crazy bugs and workarounds that have been necessary. Start from the basic fact that the modern DOM requires Node#nodeName to be uppercase for HTML elements, and the current code uses all lowercase. It's hard to see how that could be addressed without breaking backward compat. Here are our notes/discussions/etc: https://phabricator.wikimedia.org/T215000 https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted (and there's more where that came from) --scott PS. My personal feeling at this time is that it would be better to put the core libxml abstractions in an extension, to allow fast xpath and perhaps parse/serialize, but that the actual DOM should be built as a php library on top of that, in order to allow rapid changes (the WHATWG is pretty actively making additions/changes to the we spec these days) which are decoupled from the PHP release cycle.
  104777
March 17, 2019 17:35 kontakt@beberlei.de (Benjamin Eberlei)
On Sun, Mar 17, 2019 at 2:52 PM C. Scott Ananian <cananian@wikimedia.org>
wrote:

> On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de> > wrote: > >> >> It is still a draft but Thomas and I have started working on an RFC and >> code to update ext/dom to cover the latest standard release: >> https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing >> that soon, maybe you have some feedback. >> > > Updating the DOM extension would be something the Wikimedia Foundation > would very much like to see happen. It's more complicated than just adding > some new methods, though: there are significantly spec-compliance issues > with the current code and performance problems too. We've been porting > code from JS to PHP which (in the JS version) used a good spec-compliance > DOM implementation, and have been keeping a list of all the crazy bugs and > workarounds that have been necessary. > > Start from the basic fact that the modern DOM requires Node#nodeName to be > uppercase for HTML elements, and the current code uses all lowercase. It's > hard to see how that could be addressed without breaking backward compat. > > Here are our notes/discussions/etc: > https://phabricator.wikimedia.org/T215000 >
That is a really good resource of thinks we should look at :-) But it is not fully true that this javascript library is DOM spec compliant. It does provide extra features that https://dom.spec.whatwg.org/ doesn't have such as "body", "title", "head" properties on DOMDocument, or innerHtml/outerHtml attributes on elements. I didn't find a spec where these were defined, the html spec also doesn't mention them. The DOM Spec also doesn't impelement HtmlElement, that is from the HTML spec. The way forward without BC break like uppercase nodeName or getAttribute returning NULL and not empty strings in newer implementations could be to allow users to specify which implementation they want the DOMDocument to follow.
> > https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted > (and there's more where that came from) > --scott >
> PS. My personal feeling at this time is that it would be better to put the > core libxml abstractions in an extension, to allow fast xpath and perhaps > parse/serialize, but that the actual DOM should be built as a php library > on top of that, in order to allow rapid changes (the WHATWG is pretty > actively making additions/changes to the we spec these days) which are > decoupled from the PHP release cycle. >
It would still require libxml to be an extension, which would only happen in a newer version, so it is not going to help without requiring that version. I don't think the effort is worth it though, compared to just working on the existing ext/dom.
  104780
March 17, 2019 22:57 rrichards@ctindustries.net (Rob Richards)
On 3/17/19 6:52 AM, C. Scott Ananian wrote:
> On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de> wrote: > >> It is still a draft but Thomas and I have started working on an RFC and >> code to update ext/dom to cover the latest standard release: >> https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing >> that soon, maybe you have some feedback. >> > Updating the DOM extension would be something the Wikimedia Foundation > would very much like to see happen. It's more complicated than just adding > some new methods, though: there are significantly spec-compliance issues > with the current code and performance problems too. We've been porting > code from JS to PHP which (in the JS version) used a good spec-compliance > DOM implementation, and have been keeping a list of all the crazy bugs and > workarounds that have been necessary. > > Start from the basic fact that the modern DOM requires Node#nodeName to be > uppercase for HTML elements, and the current code uses all lowercase. It's > hard to see how that could be addressed without breaking backward compat. > > Here are our notes/discussions/etc: > https://phabricator.wikimedia.org/T215000 > https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted > (and there's more where that came from) > --scott > > PS. My personal feeling at this time is that it would be better to put the > core libxml abstractions in an extension, to allow fast xpath and perhaps > parse/serialize, but that the actual DOM should be built as a php library > on top of that, in order to allow rapid changes (the WHATWG is pretty > actively making additions/changes to the we spec these days) which are > decoupled from the PHP release cycle. >
I'll take a look through the lists you have but you need to remember that the DOM specs were not written for HTML. While HTML might have some more restrictive requirements that piggy back on the DOM specs themselves, they are not the standard for DOM. The output HTML functionality was more convenience methods than part of the core extension, Rob