Offset-only results from preg_match

  104723
March 14, 2019 19:33 cananian@wikimedia.org ("C. Scott Ananian")
I'm floating an idea for an RFC here.

I'm working on the wikimedia/remex-html library for high-performance
PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
worthwhile to try to reduce the number of string copies made.  You can
generally perform matches using offsets into your master source string.
However, preg_match* will copy a substring for the entire matched region
($matches[0]) as well as for all captured patterns ($matches[1...n]).
These substring copies can get expensive if the matched region/captured
patterns are very large.

It would be helpful if PHP's preg_match* functions offered a flag, say
PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
matched/captured string.  In combination,
PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
element 0 and the numeric offset in element 1, and avoid the need to copy
the matched substring unnecessarily.  This would allow greatly reducing the
number of substring copies made during lexing.
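
For illustration, here is a minimal sketch of the offset-based matching described above, as it can be written today. The function name lex_offsets and the toy tag/text pattern are assumptions made for this example, not code from remex-html. It shows that $matches[0] still carries a copied string even when only its offset and length are wanted; the proposed PREG_LENGTH_CAPTURE flag does not exist in PHP and appears only in a comment.

```php
<?php
// Hypothetical helper: split $src into tokens, returning [offset, length]
// pairs instead of substrings.
function lex_offsets(string $src): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($src);
    while ($pos < $len) {
        // \G anchors the match at $pos, so $src itself is never sliced.
        $ok = preg_match('/\G(?:<[^>]*>|[^<]+)/', $src, $m, PREG_OFFSET_CAPTURE, $pos);
        if (!$ok) {
            break; // nothing lexable at this position (e.g. an unclosed "<")
        }
        // $m[0] is [matched string, byte offset]. The string is the copy that
        // a length-only flag (the proposed PREG_LENGTH_CAPTURE) would avoid.
        [$text, $offset] = $m[0];
        $tokens[] = [$offset, strlen($text)];
        $pos = $offset + strlen($text);
    }
    return $tokens;
}

// lex_offsets('<b>hi</b>') returns [[0, 3], [3, 2], [5, 4]]
```

The only use made of $text here is strlen(), which is the case where returning the length directly would save the allocation.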

Thoughts?
 --scott

ps. more ambitious would be to introduce a new "substring" type, which
would share the allocation of a parent string with its own offset and
length fields.  That would probably be as invasive as the ZVAL_INTERNED_STR
type, though -- a much much bigger project.
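
As a rough userland approximation of the "substring" idea, the sketch below wraps a parent string with an offset and a length. The class name StringView and its methods are hypothetical. It relies on PHP strings being reference-counted, so holding the parent in a property does not duplicate its bytes; a native type would presumably avoid the object overhead this adds.

```php
<?php
// Hypothetical stand-in for a "substring" type: a view into a parent string.
final class StringView
{
    private $parent;
    private $offset;
    private $length;

    public function __construct(string $parent, int $offset, int $length)
    {
        $this->parent = $parent; // shares the parent's buffer (refcounted)
        $this->offset = $offset;
        $this->length = $length;
    }

    public function length(): int
    {
        return $this->length;
    }

    // Compare against the parent in place, without materializing the view.
    public function startsWith(string $prefix): bool
    {
        $n = strlen($prefix);
        if ($n === 0) {
            return true;
        }
        if ($n > $this->length) {
            return false;
        }
        return substr_compare($this->parent, $prefix, $this->offset, $n) === 0;
    }

    // The copy happens only here, when a real string is requested.
    public function __toString(): string
    {
        return substr($this->parent, $this->offset, $this->length);
    }
}

$view = new StringView('<p>hello</p>', 3, 5);
```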

pps. while I'm wishing -- preg_replace would benefit from some way to pass
options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
the offset for each replacement match.  Knowing the offset of the match
allows you to do (for example) error reporting from the callback function.
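
Without such an option, one way to get offsets during replacement is to run preg_match_all with PREG_OFFSET_CAPTURE and splice the result together by hand. The sketch below assumes a hypothetical helper named replace_with_offsets whose callback receives the matched text and its byte offset; it covers only the whole-match case, not capture groups.

```php
<?php
// Hypothetical helper: like a simplified preg_replace_callback, but the
// callback also receives the byte offset of each match.
function replace_with_offsets(string $pattern, callable $callback, string $subject): string
{
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    $out = '';
    $last = 0;
    foreach ($matches as $m) {
        [$text, $offset] = $m[0];
        // Copy the unmatched gap, then the replacement for this match.
        $out .= substr($subject, $last, $offset - $last);
        $out .= $callback($text, $offset);
        $last = $offset + strlen($text);
    }
    // Append whatever follows the final match.
    return $out . substr($subject, $last);
}

$result = replace_with_offsets('/\d+/', function (string $text, int $offset): string {
    return "[$text@$offset]";
}, 'a12b345');
// $result is 'a[12@1]b[345@4]'
```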

-- 
(http://cscott.net)
  104770
March 16, 2019 08:49 markus@fischer.name (Markus Fischer)
On 14.03.19 20:33, C. Scott Ananian wrote:
> ps. more ambitious would be to introduce a new "substring" type, which
> would share the allocation of a parent string with its own offset and
> length fields.  That would probably be as invasive as the ZVAL_INTERNED_STR
> type, though -- a much much bigger project.

That feature would be incredible to have. Erlang/Elixir use this concept,
and it's incredible what they can do with it. However, the data types there
are also immutable, so it's a whole other league…

- Markus
  104786
March 18, 2019 13:43 nikita.ppv@gmail.com (Nikita Popov)
On Thu, Mar 14, 2019 at 8:33 PM C. Scott Ananian <cananian@wikimedia.org>
wrote:

> I'm floating an idea for an RFC here.
>
> I'm working on the wikimedia/remex-html library for high-performance
> PHP-native HTML5 parsing.  When creating a high-performance lexer, it is
> worthwhile to try to reduce the number of string copies made.  You can
> generally perform matches using offsets into your master source string.
> However, preg_match* will copy a substring for the entire matched region
> ($matches[0]) as well as for all captured patterns ($matches[1...n]).
> These substring copies can get expensive if the matched region/captured
> patterns are very large.
>
> It would be helpful if PHP's preg_match* functions offered a flag, say
> PREG_LENGTH_CAPTURE, which returned the numeric length instead of the
> matched/captured string.  In combination,
> PREG_OFFSET_CAPTURE|PREG_LENGTH_CAPTURE would return the numeric length in
> element 0 and the numeric offset in element 1, and avoid the need to copy
> the matched substring unnecessarily.  This would allow greatly reducing the
> number of substring copies made during lexing.
>

Generally sounds reasonable to me. Do you maybe have a sample input and regular expression where you suspect this is a particularly large problem, so we can test how much of a difference this makes?

> Thoughts?
>  --scott
>
> ps. more ambitious would be to introduce a new "substring" type, which
> would share the allocation of a parent string with its own offset and
> length fields.  That would probably be as invasive as the ZVAL_INTERNED_STR
> type, though -- a much much bigger project.
>
> pps. while I'm wishing -- preg_replace would benefit from some way to pass
> options, so that (for example) you could pass PREG_OFFSET_CAPTURE and get
> the offset for each replacement match.  Knowing the offset of the match
> allows you to do (for example) error reporting from the callback function.
>

I've implemented this bit in https://github.com/php/php-src/pull/3958.

Nikita
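
To illustrate the error-reporting use case from the pps, here is a sketch that assumes the flags argument which preg_replace_callback accepts as its sixth parameter as of PHP 7.4; it is an assumption here that this matches what the pull request above adds. The input text and the error message format are made up for the example.

```php
<?php
// Requires PHP 7.4+ for the flags argument (assumed to be this feature).
$src = "alpha\nbeta 42\ngamma 7";
$errors = [];

$out = preg_replace_callback(
    '/\d+/',
    function (array $m) use ($src, &$errors): string {
        // With PREG_OFFSET_CAPTURE each group is [string, byte offset].
        [$text, $offset] = $m[0];
        // Derive a 1-based line number from the offset for the report.
        $line = substr_count($src, "\n", 0, $offset) + 1;
        $errors[] = "unexpected number $text on line $line";
        return '#';
    },
    $src,
    -1,
    $count,
    PREG_OFFSET_CAPTURE
);
```

After this runs, $out has each number replaced by "#", and $errors holds one message per match with the line it occurred on.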