Re: [PHP-DEV] On fixing DOMNameSpaceNode and DOM NS API Inconsistency Problems

  104775
March 17, 2019 13:52 cananian@wikimedia.org ("C. Scott Ananian")
On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de> wrote:

> It is still a draft but Thomas and I have started working on an RFC and
> code to update ext/dom to cover the latest standard release:
> https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing
> that soon, maybe you have some feedback.
>
Updating the DOM extension would be something the Wikimedia Foundation
would very much like to see happen. It's more complicated than just adding
some new methods, though: there are significant spec-compliance issues
with the current code, and performance problems too. We've been porting
code from JS to PHP which (in the JS version) used a good spec-compliant
DOM implementation, and have been keeping a list of all the crazy bugs and
workarounds that have been necessary.

Start from the basic fact that the modern DOM requires Node#nodeName to be
uppercase for HTML elements, and the current code uses all lowercase. It's
hard to see how that could be addressed without breaking backward compat.

Here are our notes/discussions/etc:
https://phabricator.wikimedia.org/T215000
https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted
(and there's more where that came from)
 --scott

PS. My personal feeling at this time is that it would be better to put the
core libxml abstractions in an extension, to allow fast xpath and perhaps
parse/serialize, but that the actual DOM should be built as a PHP library
on top of that, in order to allow rapid changes (the WHATWG is pretty
actively making additions/changes to the web spec these days) which are
decoupled from the PHP release cycle.
  104777
March 17, 2019 17:35 kontakt@beberlei.de (Benjamin Eberlei)
On Sun, Mar 17, 2019 at 2:52 PM C. Scott Ananian <cananian@wikimedia.org>
wrote:

> On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de>
> wrote:
>
>> It is still a draft but Thomas and I have started working on an RFC and
>> code to update ext/dom to cover the latest standard release:
>> https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing
>> that soon, maybe you have some feedback.
>
> Updating the DOM extension would be something the Wikimedia Foundation
> would very much like to see happen. It's more complicated than just adding
> some new methods, though: there are significant spec-compliance issues
> with the current code, and performance problems too. We've been porting
> code from JS to PHP which (in the JS version) used a good spec-compliant
> DOM implementation, and have been keeping a list of all the crazy bugs and
> workarounds that have been necessary.
>
> Start from the basic fact that the modern DOM requires Node#nodeName to be
> uppercase for HTML elements, and the current code uses all lowercase. It's
> hard to see how that could be addressed without breaking backward compat.
>
> Here are our notes/discussions/etc:
> https://phabricator.wikimedia.org/T215000
>
That is a really good resource of things we should look at :-) But it is
not fully true that this JavaScript library is DOM spec compliant. It
provides extra features that https://dom.spec.whatwg.org/ doesn't have,
such as the "body", "title" and "head" properties on DOMDocument, or the
innerHTML/outerHTML attributes on elements. I didn't find a spec where
these were defined; the HTML spec also doesn't mention them. The DOM spec
also doesn't define HTMLElement; that comes from the HTML spec.

A way forward that avoids BC breaks, such as uppercase nodeName or
getAttribute returning NULL instead of an empty string as newer
implementations do, could be to let users specify which implementation
they want the DOMDocument to follow.
> https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted
> (and there's more where that came from)
>  --scott
>
> PS. My personal feeling at this time is that it would be better to put the
> core libxml abstractions in an extension, to allow fast xpath and perhaps
> parse/serialize, but that the actual DOM should be built as a PHP library
> on top of that, in order to allow rapid changes (the WHATWG is pretty
> actively making additions/changes to the web spec these days) which are
> decoupled from the PHP release cycle.
>
It would still require libxml to be an extension, which would only happen in a newer version, so it is not going to help without requiring that version. I don't think the effort is worth it though, compared to just working on the existing ext/dom.
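
To illustrate the two backward-compatibility points raised in this message
(lowercase nodeName, and getAttribute returning an empty string rather than
NULL), here is a minimal sketch against the existing ext/dom classes. The
helpers specNodeName() and specGetAttribute() are hypothetical names made
up for this example and are not part of ext/dom or of the RFC. Note that
strtoupper() only approximates the spec rule, which uppercases names solely
for HTML-namespace elements in HTML documents.

```php
<?php
// Sketch only: specNodeName() and specGetAttribute() are hypothetical
// helpers for illustration, not part of ext/dom or the proposed RFC.

$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><html><body><p id="x">hi</p></body></html>');
$p = $doc->getElementsByTagName('p')->item(0);

// Approximates the spec's "HTML-uppercased qualified name" by uppercasing
// unconditionally; the real rule depends on namespace and document type.
function specNodeName(DOMElement $el): string
{
    return strtoupper($el->nodeName);
}

// Returns null for a missing attribute, as newer DOM implementations do,
// instead of the empty string that DOMElement::getAttribute() returns.
function specGetAttribute(DOMElement $el, string $name): ?string
{
    return $el->hasAttribute($name) ? $el->getAttribute($name) : null;
}

var_dump($p->nodeName);                     // string(1) "p"
var_dump($p->getAttribute('missing'));      // string(0) ""
var_dump(specNodeName($p));                 // string(1) "P"
var_dump(specGetAttribute($p, 'missing'));  // NULL
```

Wrapping the behaviour in userland like this leaves existing callers
untouched, which is one way to read the "let users choose the
implementation" idea without changing DOMElement itself.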
  104780
March 17, 2019 22:57 rrichards@ctindustries.net (Rob Richards)
On 3/17/19 6:52 AM, C. Scott Ananian wrote:
> On Sun, Mar 17, 2019, 9:34 AM Benjamin Eberlei <kontakt@beberlei.de> wrote:
>
>> It is still a draft but Thomas and I have started working on an RFC and
>> code to update ext/dom to cover the latest standard release:
>> https://wiki.php.net/rfc/dom_living_standard_api - we plan on proposing
>> that soon, maybe you have some feedback.
>>
> Updating the DOM extension would be something the Wikimedia Foundation
> would very much like to see happen. It's more complicated than just adding
> some new methods, though: there are significant spec-compliance issues
> with the current code, and performance problems too. We've been porting
> code from JS to PHP which (in the JS version) used a good spec-compliant
> DOM implementation, and have been keeping a list of all the crazy bugs and
> workarounds that have been necessary.
>
> Start from the basic fact that the modern DOM requires Node#nodeName to be
> uppercase for HTML elements, and the current code uses all lowercase. It's
> hard to see how that could be addressed without breaking backward compat.
>
> Here are our notes/discussions/etc:
> https://phabricator.wikimedia.org/T215000
> https://mediawiki.org/wiki/Parsoid/PHP/Help_wanted
> (and there's more where that came from)
>  --scott
>
> PS. My personal feeling at this time is that it would be better to put the
> core libxml abstractions in an extension, to allow fast xpath and perhaps
> parse/serialize, but that the actual DOM should be built as a PHP library
> on top of that, in order to allow rapid changes (the WHATWG is pretty
> actively making additions/changes to the web spec these days) which are
> decoupled from the PHP release cycle.
>
I'll take a look through the lists you have, but you need to remember
that the DOM specs were not written for HTML. While HTML might have some
more restrictive requirements that piggyback on the DOM specs
themselves, they are not the standard for DOM. The HTML output
functionality was more a set of convenience methods than part of the core
extension.

Rob
  104833
March 20, 2019 15:48 cananian@wikimedia.org ("C. Scott Ananian")
On Sun, Mar 17, 2019 at 6:57 PM Rob Richards <rrichards@ctindustries.net>
wrote:

> I'll take a look through the lists you have but you need to remember
> that the DOM specs were not written for HTML. While HTML might have some
> more restrictive requirements that piggy back on the DOM specs
> themselves, they are not the standard for DOM. The output HTML
> functionality was more convenience methods than part of the core extension,
>
Just dealing with the latest WHATWG DOM specs (not the html-specific part)
would still involve grappling with `getAttribute` behavior, the case of
nodeName (search for "HTML-uppercased qualified name" in the DOM spec), the
numeric node type of the root document, performance of methods on
namespaced nodes, etc. It would certainly be a useful step forward, even
if it weren't a full HTML DOM.

What would be even better is if the hooks were present to allow
subclassing the core DOM types so that you could implement HTMLElement in
pure PHP as a subclass of DOMElement, etc. For example, right now it is
impossible to create a new DOMNodeList from PHP, which makes pure PHP
implementations of the missing HTML DOM methods (like querySelectorAll)
impossible (for example wikimedia/zest-css has to return an array instead
of a DOMNodeList).
 --scott
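
As a rough illustration of the DOMNodeList limitation described above, the
sketch below walks the tree in pure PHP and collects matching elements. The
function findByClass() is a hypothetical helper invented for this example;
it is not taken from wikimedia/zest-css and is far simpler than a real
selector engine. Because userland code has no way to put the collected
nodes into a DOMNodeList, the helper has to return a plain array.

```php
<?php
// Sketch only: findByClass() is a hypothetical helper, not an ext/dom or
// zest-css API. It returns an array because PHP code cannot populate a
// DOMNodeList with nodes it has collected itself.
function findByClass(DOMNode $root, string $class): array
{
    $matches = [];
    foreach ($root->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text, comment and doctype nodes
        }
        // Split the class attribute on whitespace and test for membership.
        $classes = preg_split('/\s+/', trim($child->getAttribute('class')));
        if (in_array($class, $classes, true)) {
            $matches[] = $child;
        }
        // Pre-order recursion keeps the results in document order.
        $matches = array_merge($matches, findByClass($child, $class));
    }
    return $matches;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p class="a b">one</p>'
    . '<div><span class="b">two</span></div></body></html>'
);

$found = findByClass($doc, 'b');
echo count($found), "\n"; // prints 2
foreach ($found as $el) {
    echo $el->nodeName, "\n";
}
```

A caller that expects DOMNodeList semantics (item(), a length property)
would need an adapter around this array, which is the kind of hook the
message above is asking for.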
  106083
June 27, 2019 21:20 cananian@wikimedia.org ("C. Scott Ananian")
Brief update:

We found a few more PHP DOM compatibility bugs, including one with
DOMNode::normalize(): https://bugs.php.net/bug.php?id=78221

In addition, one of our engineers implemented a pure-PHP DOM
implementation: https://github.com/linehan/DOMOperator

We're not using that in production yet, but it's probably quicker than
trying to make PHP's DOM implementation spec-compliant.
 --scott

On Wed, Mar 20, 2019 at 11:48 AM C. Scott Ananian <cananian@wikimedia.org>
wrote:

> On Sun, Mar 17, 2019 at 6:57 PM Rob Richards <rrichards@ctindustries.net>
> wrote:
>
>> I'll take a look through the lists you have but you need to remember
>> that the DOM specs were not written for HTML. While HTML might have some
>> more restrictive requirements that piggy back on the DOM specs
>> themselves, they are not the standard for DOM. The output HTML
>> functionality was more convenience methods than part of the core
>> extension,
>
> Just dealing with the latest WHATWG DOM specs (not the html-specific part)
> would still involve grappling with `getAttribute` behavior, the case of
> nodeName (search for "HTML-uppercased qualified name" in the DOM spec), the
> numeric node type of the root document, performance of methods on
> namespaced nodes, etc. It would certainly be a useful step forward, even
> if it weren't a full HTML DOM.
>
> What would be even better is if the hooks were present to allow
> subclassing the core DOM types so that you could implement HTMLElement in
> pure PHP as a subclass of DOMElement, etc. For example, right now it is
> impossible to create a new DOMNodeList from PHP, which makes pure PHP
> implementations of the missing HTML DOM methods (like querySelectorAll)
> impossible (for example wikimedia/zest-css has to return an array instead
> of a DOMNodeList).
>  --scott
>
-- (http://cscott.net)