Re: [PHP-DEV] [RFC] token_get_all() TOKEN_AS_OBJECT mode

  108573
February 14, 2020 13:31 thruska@cubiclesoft.com (Thomas Hruska)
On 2/14/2020 1:48 AM, Nikita Popov wrote:
> On Thu, Feb 13, 2020 at 6:06 PM Larry Garfield <larry@garfieldtech.com>
> wrote:
>
>> On Thu, Feb 13, 2020, at 3:47 AM, Nikita Popov wrote:
>>> Hi internals,
>>>
>>> This has been discussed a while ago already, now as a proper proposal:
>>> https://wiki.php.net/rfc/token_as_object
>>>
>>> tl;dr is that it allows you to get token_get_all() output as an array of
>>> PhpToken objects. This reduces memory usage, improves performance, makes
>>> code more uniform and readable... What's not to like?
>>>
>>> An open question is whether (at least to start with) PhpToken should be
>>> just a data container, or whether we want to add some helper methods to
>>> it. If this generates too much bikeshed, I'll drop methods from the
>>> proposal.
>>>
>>> Regards,
>>> Nikita
>>
>> I love everything about this.
>>
>> 1) I would agree with Nicolas that a static constructor would be better.
>> I don't know about polyfilling it, but it's definitely more
>> self-descriptive.
>>
>> 2) I'm skeptical about the methods. I can see them being useful, but also
>> being bikeshed material. For instance, if you're doing annotation parsing
>> then docblocks are not ignorable. They're what you're actually looking for.
>>
>> Two possible additions, feel free to ignore if they're too complicated:
>>
>> 1) Should it return an array of token objects, or a lazy iterable? If I'm
>> only interested in certain types (eg, doc strings, classes, etc.) then a
>> lazy iterable would allow me to string some filter and map operations on to
>> it and use even less memory overall, since the whole tree is not in memory
>> at once.
>>
>
> I'm going to take you up on your offer and ignore this one :P Returning
> tokens as an iterator is inefficient because it requires full lexer state
> backups and restores for each token. Could be optimized, but I wouldn't
> bother with it for this feature. I also personally have no use-case for a
> lazy token stream. (It's technically sufficient for parsing, but if you
> want to preserve formatting, you're going to be preserving all the tokens
> anyway.)

Try passing a 10MB PHP file that's all code into token_get_all(). It's pretty easy to hit hard memory limits and/or start crashing PHP when token_get_all() tokenizes the whole thing into a giant array or set of objects. Calling gc_mem_caches() once the previous RAM bits aren't needed anymore helps.

Stream-based token parsing would be better for RAM usage, but I can see how that might be complex to implement and largely not worth it: such scenarios will be rare, it would require the ability to maintain lexer state externally as you mentioned, and it would only be used by this part of the software.

-- 
Thomas Hruska
CubicleSoft President

I've got great, time saving software that you will find useful.

http://cubiclesoft.com/

And once you find my software useful:

http://cubiclesoft.com/donate/
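
For anyone who wants to see the memory effect described above, here is a rough sketch that tokenizes a generated all-code source and reports how much memory the token array holds. The helper names build_source() and measure_tokenize() are made up for illustration, and the 128 KB input is an assumption chosen to stay well under a default memory_limit; scale it up toward 10MB at your own risk. It assumes PHP 7.0 or later with the tokenizer extension enabled.

```php
<?php
// Hypothetical helper: builds an all-code PHP source of roughly $bytes bytes.
function build_source(int $bytes): string
{
    $line = "\$x = \$x + 1;\n";
    return "<?php\n\$x = 0;\n" . str_repeat($line, intdiv($bytes, strlen($line)));
}

// Hypothetical helper: tokenizes $source and reports memory deltas in bytes.
function measure_tokenize(string $source): array
{
    $before = memory_get_usage();

    // The whole token list is materialized in memory at once.
    $tokens = token_get_all($source);
    $during = memory_get_usage();
    $count  = count($tokens);

    // Drop the tokens, then ask the memory manager to release cached blocks.
    unset($tokens);
    gc_mem_caches();
    $after = memory_get_usage();

    return [
        'tokens'        => $count,
        'held'          => $during - $before,
        'after_release' => $after - $before,
    ];
}

$source = build_source(128 * 1024);
$stats  = measure_tokenize($source);

printf(
    "source: %d bytes, tokens: %d, held by tokens: %d bytes, after release: %d bytes\n",
    strlen($source),
    $stats['tokens'],
    $stats['held'],
    $stats['after_release']
);
```

The interesting number is the ratio of "held by tokens" to the source size, which gives a feel for why a multi-megabyte file can run into memory_limit.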