"Reader" as alternative to Iterator

  99693
July 1, 2017 17:38 andreas@dqxtech.net (Andreas Hennings)
Hello internals,
(this is my first email to this list, hopefully I'm doing ok.)

--------------------------------------------------------------------------------

Background / motivation:

Currently in PHP we have an interface "Iterator", and a final class
"Generator" (and others) that implement it.

Using an iterator in foreach () is straightforward:
foreach ($iterator as $key => $value) {..}

However, if we want to iterate only a portion and then continue elsewhere
at the position where we stopped, we need to do something like this:

for ($iterator->rewind(); $iterator->valid(); $iterator->next()) {
  $value = $iterator->current();
  [..]
}

This is unpleasantly verbose, and also adds performance overhead due to
additional function calls.
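
A minimal sketch of the "iterate a portion, then continue elsewhere" case
described above, using only the existing Iterator methods. The helper name
takeSome() is an assumption made for illustration; it is not part of PHP.

```php
<?php
// Hypothetical helper: read up to $n values from an iterator that has
// already been rewound, leaving it positioned on the next unread item.
function takeSome(Iterator $it, int $n): array {
    $out = [];
    while ($n-- > 0 && $it->valid()) {
        $out[] = $it->current();
        $it->next();
    }
    return $out;
}

$it = new ArrayIterator(['a', 'b', 'c', 'd', 'e']);
$it->rewind();               // rewind once, up front

$head = takeSome($it, 2);    // consumes 'a' and 'b'
$tail = takeSome($it, 10);   // continues where the first call stopped
```

Each value costs three method calls here (valid, current, next), which is
the overhead the paragraph above refers to.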

Also, manually writing an iterator is quite painful.

I sometimes implement "readers" that can be used like this (*):

// Gets a reader at position zero.
$reader = $readerProvider->getReader();
while (FALSE !== $value = $reader->read()) {
  [..]
}

(*) Note that I am using FALSE as an equivalent for "end of data". Of
course it would be nice if we had a dedicated constant for this, that does
not constrain the range of possible values of the iterator.

Such readers are much easier to write than iterators.
However, there is no native support for foreach(), and for generator syntax
with yield.
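
A minimal sketch of what such a reader could look like in userland. The
names Reader and ArrayReader are assumptions for illustration; no such
interface exists in PHP core.

```php
<?php
// Hypothetical userland interface, not part of PHP.
interface Reader {
    /** @return mixed The next value, or FALSE at end of data. */
    public function read();
}

// A reader over a plain array: one method, one piece of state.
final class ArrayReader implements Reader {
    private $values;
    private $pos = 0;

    public function __construct(array $values) {
        $this->values = array_values($values);
    }

    public function read() {
        if ($this->pos >= count($this->values)) {
            return false;   // end of data
        }
        return $this->values[$this->pos++];
    }
}

$reader = new ArrayReader(['x', 'y']);
$seen = [];
while (false !== $value = $reader->read()) {
    $seen[] = $value;
}
```

Compared with Iterator, the class needs one method instead of five, at the
price that FALSE can no longer be a regular value.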

Adapters from Iterator to Reader and vice versa are possible to write.
However, such userland adapters add additional performance overhead: One
call to ->read() will trigger one call to ->valid(), one to ->current(),
one to ->next().
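
A sketch of such an Iterator-to-Reader adapter, to make the call pattern
visible. The class name IteratorReaderAdapter is borrowed from later in this
thread; the implementation is an assumption, not the author's actual code.

```php
<?php
// Hypothetical adapter: exposes any Iterator through a read() method.
final class IteratorReaderAdapter {
    private $it;
    private $started = false;

    public function __construct(Iterator $it) {
        $this->it = $it;
    }

    /** @return mixed The next value, or FALSE at end of data. */
    public function read() {
        if (!$this->started) {
            $this->it->rewind();      // only before the first read
            $this->started = true;
        }
        if (!$this->it->valid()) {    // call 1
            return false;
        }
        $value = $this->it->current(); // call 2
        $this->it->next();             // call 3
        return $value;
    }
}

function generate() {
    yield 'a';
    yield 'b';
    yield 'c';
}

$reader = new IteratorReaderAdapter(generate());
$out = [];
while (false !== $v = $reader->read()) {
    $out[] = $v;
}
```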

--------------------------------------------------------------------------------

Proposal:

Establish a new interface in core, "Reader" or "ReaderInterface" (*).
This interface has only one method, "->read()".
The existing interface Iterator will remain unchanged.

(*) I am open to other naming suggestions. In fact in my own projects I
called this thing "StreamInterface", and distinguish between
"ObjectStreamInterface", "RowStreamInterface" etc. Currently I think
"Reader" might be more suitable.

Let the final class "Generator", and possibly other native iterators,
implement Reader in addition to Iterator.

Optionally, add an interface "ReaderAggregate", or
"ReaderAggregateInterface", or "ReaderProvider". This interface has only
one method, "getReader()".

Let ReaderAggregate extend Traversable, and add foreach() support.
The key will simply be a counter.
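
The foreach behaviour proposed here can be approximated in userland today.
The sketch below assumes hypothetical Reader and ReaderAggregate interfaces
and a helper iterateReader(); none of these exist in PHP. A generator that
yields without an explicit key produces the counter keys 0, 1, 2, ...

```php
<?php
// Hypothetical interfaces mirroring the proposal; not part of PHP.
interface Reader {
    public function read();
}
interface ReaderAggregate {
    public function getReader(): Reader;
}

// Hypothetical helper: makes a ReaderAggregate usable in foreach.
function iterateReader(ReaderAggregate $aggregate): Generator {
    $reader = $aggregate->getReader();
    while (false !== $value = $reader->read()) {
        yield $value;   // keys are assigned automatically: 0, 1, 2, ...
    }
}

$lines = new class implements ReaderAggregate {
    public function getReader(): Reader {
        return new class implements Reader {
            private $left = ['first', 'second'];
            public function read() {
                return $this->left ? array_shift($this->left) : false;
            }
        };
    }
};

$pairs = [];
foreach (iterateReader($lines) as $key => $value) {
    $pairs[$key] = $value;
}
```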

Open questions:
- The naming of "Reader", "ReaderAggregate", "->read()".
- Which return value to use for "end of data", so that FALSE would become a
valid value.

I currently don't see a better option than FALSE.

--------------------------------------------------------------------------------

- Andreas Hennings
(https://github.com/donquixote)
  99696
July 3, 2017 02:15 pollita@php.net (Sara Golemon)
On Sat, Jul 1, 2017 at 1:38 PM, Andreas Hennings <andreas@dqxtech.net> wrote:
> Hello internals,
> (this is my first email to this list, hopefully I'm doing ok.)
>
Welcome to php-internals!

> Establish a new interface in core, "Reader" or "ReaderInterface" (*).
> This interface has only one method, "->read()".
> The existing interface Iterator will remain unchanged.
>
Rather than introduce a new kind of iteration mechanism, how about
just creating a library in userspace which provides a proxy interface:

class Reader implements Iterator {
  protected $it;
  public function __construct(iterable $it) { $this->it = $it; }
  public function current/key/next/rewind/valid() {
    $this->it->current/key/next/rewind/valid();
  }
  public function read() { /* What a reader should be */ }
}

Then you can use any already extant iterable via:

$reader = new Reader($iterable);

This could live in a composer/packagist library and work on already
released versions of PHP.

Or is there something I'm missing?

-Sara
  99697
July 3, 2017 02:49 andreas@dqxtech.net (Andreas Hennings)
Well, on a current project I made an attempt to write different
adapters in userland.
I finally decided that the clutter is not worth it.
So for this project I wrote everything as "readers", and not as iterators.

With a native solution, one could do this:

function generate() {
  yield 'a';
  yield 'b';
  yield 'c';
}

$it = generate();

while (FALSE !== $v = $it->read()) {
  print $v . "\n";
}

With a userland adapter, the code would read like this:

function generate() {
  yield 'a';
  yield 'b';
  yield 'c';
}

$it = generate();

// This is the added line.
$reader = new IteratorReaderAdapter($it);

while (FALSE !== $v = $reader->read()) {
  print $v . "\n";
}


This does add clutter and performance overhead.
In this example I'd say this is acceptable / survivable.

In the project I was working on, I have multiple layers of "readers":
- One layer to read raw data from a CSV file.
- One layer to re-key the rows by column labels.
- One layer to turn each row into an object, with structured
properties instead of just string cells.

With the "reader" approach, for each layer I would have one "reader
provider" class and one "reader" class per layer.
(In fact I have much more, but let's keep it simple)
Generator syntax would allow having only one class per layer.
But with the additional adapter, we again increase the number of classes.
(I wanted dedicated reader classes for different return types, e.g.
one "RowReader", one "AssocReader", one "ObjectReader". So here I
would need one adapter class per type. But let's focus on the simple
case, where you can use the same reader class.)

In addition to more classes and more function calls (performance), the
adapters also make stack traces heavier to look at.

Overall, for this project, I decided that it was not worth it.

The main purpose of the generator syntax is to simplify the code. The
need for userland adapters defeated this purpose.
Native readers would have made generators worthwhile for my use case.
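
For comparison, a sketch of the three layers described above written with
plain generator functions and no reader classes. The function names and the
in-memory sample lines are assumptions for illustration; a real
implementation would read from a file handle.

```php
<?php
// Layer 1: raw rows from CSV-formatted lines.
function rawRows(iterable $lines): Generator {
    foreach ($lines as $line) {
        yield str_getcsv($line, ',', '"', '\\');
    }
}

// Layer 2: re-key each row by the column labels from the first row.
function assocRows(iterable $rows): Generator {
    $labels = null;
    foreach ($rows as $row) {
        if ($labels === null) {
            $labels = $row;
            continue;
        }
        yield array_combine($labels, $row);
    }
}

// Layer 3: turn each associative row into an object.
function objects(iterable $assocRows): Generator {
    foreach ($assocRows as $assoc) {
        yield (object) $assoc;
    }
}

$lines = ['name,age', 'Ada,36', 'Linus,28'];
$people = iterator_to_array(objects(assocRows(rawRows($lines))), false);
```

The layers compose without adapters as long as every consumer uses foreach;
the adapter question only arises once some consumer wants ->read().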

The idea of "dedicated reader types" e.g. Reader could be
added to "Open questions"..
  99699
July 3, 2017 07:09 me@kelunik.com (Niklas Keller)
2017-07-03 4:49 GMT+02:00 Andreas Hennings <andreas@dqxtech.net>:

> Well, on a current project I made an attempt to write different
> adapters in userland.
> I finally decided that the clutter is not worth.
> So for this project I wrote everything as "readers", and not as iterators.
>
> With a native solution, one could do this:
>
> function generate() {
>   yield 'a';
>   yield 'b';
>   yield 'c';
> }
>
> $it = generate();
>
> while (FALSE !== $v = $it->read()) {
>   print $v . "\n";
> }
>
> With a userland adapter, the code would read like this:
>
> function generate() {
>   yield 'a';
>   yield 'b';
>   yield 'c';
> }
>
> $it = generate();
>
> // This is the added line.
> $reader = new IteratorReaderAdapter($it);
>
> while (FALSE !== $v = $reader->read()) {
>   print $v . "\n";
> }
>
>
> This does add clutter and performance overhead.
> In this example I'd say this is acceptable / survivable.
>
> In the project I was working on, I have multiple layers of "readers":
> - One layer to read raw data from a CSV file.
> - One layer to re-key the rows by column labels.
> - One layer to turn each row into an object, with structured
> properties instead of just string cells.
>
> With the "reader" approach, for each layer I would have one "reader
> provider" class and one "reader" class per layer.
> (In fact I have much more, but let's keep it simple)
> Generator syntax would allow to have only one class per layer.
> But with the additional adapter, we again increase the number of classes.
> (I wanted dedicated reader classes for different return types, e.g.
> one "RowReader", one "AssocReader", one "ObjectReader". So here I
> would need one adapter class per type. But let's focus on the simple
> case, where you can use the same reader class.)
>
> In addition to more classes and more function calls (performance), the
> adapters also make stack traces heavier to look at.
>
> Overall, for this project, I decided that it was not worth it.
>
> The main purpose of the generator syntax is to simplify the code. The
> need for userland adapters defeated this purpose.
> Native readers would have made generators worthwhile for my use case.
>
> The idea of "dedicated reader types" e.g. Reader could be
> added to "Open questions"..

Hey Andreas,

what you're trying to do here seems to be premature optimization. While you
save a bunch of method calls, your I/O will be the actual bottleneck there.
It's entirely fine to implement such logic in userland.

Amp has a similar interface for its streams, but those have only string|null
as types. If you want to allow all values, you either need a second method
or need to wrap all values in an object.

http://amphp.org/byte-stream/#inputstream +
http://amphp.org/amp/iterators/#iterator-consumption

Regards, Niklas
  99711
July 3, 2017 15:50 andreas@dqxtech.net (Andreas Hennings)
On Mon, Jul 3, 2017 at 9:09 AM, Niklas Keller <me@kelunik.com> wrote:
>
> Hey Andreas,
>
> what you're trying to do here seems to be premature optimization. While you
> save a bunch of method calls, your I/O will be the actual bottleneck there.
> It's entirely fine to implement such logic in userland.

I will let this stand unchallenged, until I have some reproducible data..

>
> Amp has a similar interface for its streams, but those have only string|null
> as types. If you want to allow all values, you either need a second method
> or need to wrap all values in an object.
>
> http://amphp.org/byte-stream/#inputstream +
> http://amphp.org/amp/iterators/#iterator-consumption

This library looks interesting.
It seems to do a lot more than I currently need, with its concurrency
approach.
I am a bit puzzled by the yield keyword in this code:

while (($chunk = yield $inputStream->read()) !== null) {
  $buffer .= $chunk;
}

In my experience with generators so far, you either use yield for
sending or for receiving. So either "yield $value;" or "$value =
yield;" In this case it seems to do both.

I assume this is to achieve the concurrency.

(This is not an argument for or against anything, just an observation)
  99719
July 3, 2017 17:17 me@kelunik.com (Niklas Keller)
2017-07-03 17:50 GMT+02:00 Andreas Hennings <andreas@dqxtech.net>:

> On Mon, Jul 3, 2017 at 9:09 AM, Niklas Keller <me@kelunik.com> wrote:
> >
> > Hey Andreas,
> >
> > what you're trying to do here seems to be premature optimization. While
> > you save a bunch of method calls, your I/O will be the actual bottleneck
> > there.
> > It's entirely fine to implement such logic in userland.
>
> I will let this stand unchallenged, until I have some reproducible data..
>
> >
> > Amp has a similar interface for its streams, but those have only
> > string|null as types. If you want to allow all values, you either need a
> > second method or need to wrap all values in an object.
> >
> > http://amphp.org/byte-stream/#inputstream +
> > http://amphp.org/amp/iterators/#iterator-consumption
>
> This library looks interesting.
> It seems to do a lot more than I currently need, with its concurrency
> approach.
> I am a bit puzzled by the yield keyword in this code:
>
> while (($chunk = yield $inputStream->read()) !== null) {
>   $buffer .= $chunk;
> }
>
> In my experience with generators so far, you either use yield for
> sending or for receiving. So either "yield $value;" or "$value =
> yield;" In this case it seems to do both.
>

Indeed, it can do both at the same time.

> I assume this is to achieve the concurrency.

Not exactly, it "outputs" a promise and "inputs" the resolution value or
exception once that promise resolves.

Regards, Niklas
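
A small sketch of yield used in both directions at once. This is not Amp's
implementation: a plain closure stands in for a promise, and the driver
run() resolves it synchronously. Both names are assumptions made for
illustration.

```php
<?php
// Each yield hands a "pending" computation out of the generator and
// receives its result back as the value of the yield expression.
function task(): Generator {
    $a = yield function () { return 2; };
    $b = yield function () { return 3; };
    return $a + $b;
}

// Hypothetical driver: resolves whatever the generator yields and sends
// the result back in with send().
function run(Generator $gen) {
    while ($gen->valid()) {
        $pending = $gen->current();   // what the generator "output"
        $gen->send($pending());       // what the generator "inputs"
    }
    return $gen->getReturn();
}

$result = run(task());
```

A real event loop would suspend the generator until the promise resolves
instead of calling it immediately, but the data flow through yield is the
same.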
  99702
July 3, 2017 13:42 pollita@php.net (Sara Golemon)
On Sun, Jul 2, 2017 at 10:49 PM, Andreas Hennings <andreas@dqxtech.net> wrote:
> (I wanted dedicated reader classes for different return types, e.g.
> one "RowReader", one "AssocReader", one "ObjectReader". So here I
> would need one adapter class per type. But let's focus on the simple
> case, where you can use the same reader class.)
>
You need that anyway. If the current iterator returns one type and
you want to transform that into another type, then you need something
to actually do that. Having a reader interface won't magically know
that you want to change the output type.

If I'm misunderstanding that, and you're saying that the output type
of the original iterator is already different and you somehow need a
different proxy to blindly pass through the different type then... No.
You don't. A single reader adapter will handle whatever type you pass
through it.

> the adapters also make stack traces heavier to look at.
>
That is the only slightly compelling argument I've seen so far. Not
compelling enough IMO.

> The main purpose of the generator syntax is to simplify the code. The
> need for userland adapters defeated this purpose.
>
ONE adapter, which lives in a library that you don't touch once it's
written. That doesn't defeat your purposes at all. That is, in fact,
identical to having done the implementation in the core, except when
done in userland you have the opportunity to use it tomorrow rather
than in December, and the ability to fix any bugs immediately rather
than on a release cycle.

> Native readers would have made generators worthwhile for my use case.
>
You've yet to demonstrate that they are not worthwhile.

> The idea of "dedicated reader types" e.g. Reader could be
> added to "Open questions"..
>
You have yet to demonstrate the need for dedicated and/or templatized
reader types.

-Sara
  99710
July 3, 2017 15:43 andreas@dqxtech.net (Andreas Hennings)
Thanks everyone so far for the replies!
I think I need to do some "homework", and come back with benchmarks
and provide real examples.
I remember that the overhead did make a difference in performance, but
I should back that up with real data.

For now just some inline replies.


On Mon, Jul 3, 2017 at 3:42 PM, Sara Golemon <pollita@php.net> wrote:
> On Sun, Jul 2, 2017 at 10:49 PM, Andreas Hennings <andreas@dqxtech.net> wrote:
>> (I wanted dedicated reader classes for different return types, e.g.
>> one "RowReader", one "AssocReader", one "ObjectReader". So here I
>> would need one adapter class per type. But let's focus on the simple
>> case, where you can use the same reader class.)
>>
> You need that anyway. If the current iterator returns one type and
> you want to transform that into another type, then you need something
> to actually do that. Having a reader interface won't magically know
> that you want to change the output type.
>
> If I'm misunderstanding that, and you're saying that the output type
> of the original iterator is already different and you somehow need a
> different proxy to blindly pass through the different type then... No.
> You don't. A single reader adapter will handle whatever type you pass
> through it.

Yes, the adapters were just blind proxies.

In my own library I had dedicated reader types like
XmlElementReaderInterface, RowReaderInterface (e.g. for CSV),
AssocReaderInterface, ObjectReaderInterface, each with their own
->read() methods like ->getElement(), ->getRow(), ->getAssoc(),
->getObject().
So I ended up writing a distinct iterator adapter for each reader type.

The different interfaces were helpful to prevent using a reader of the
wrong type.
But we would lose this anyway if this was implemented in core. All the
methods would collapse into just one method ->read().

Maybe some PhpDoc magic could help the IDE to distinguish the reader
type. Currently my IDE does not support such a thing.
  99706
July 3, 2017 15:27 johannes@schlueters.de (Johannes Schlüter)
On Sa, 2017-07-01 at 19:38 +0200, Andreas Hennings wrote:
> Hello internals,
> (this is my first email to this list, hopefully I'm doing ok.)
>
> -------------------------------------------------------------------
> -------------
>
> Background / motivation:
>
> Currently in PHP we have an interface "Iterator", and a final class
> "Generator"
> (and others) that implement it.
>
> Using an iterator in foreach () is straightforward:
> foreach ($iterator as $key => $value) {..}
>
> However, if we want to iterate only a portion and then continue
> elsewhere at the position where we stopped, we need to do something
> like this:
>
> for ($iterator->rewind(); $iterator->valid(); $iterator->next()) {
>   $value = $iterator->current();
>   [..]
> }
>
> This is unpleasantly verbose, and also adds performance overhead due
> to additional function calls.

Wouldn't SPL's NoRewindIterator be enough?

$nit = new NoRewindIterator($it);

foreach ($nit as $row) {
  break;
}
foreach ($nit as $row) {
  // continues same iteration
}

>
> Also, manually writing an iterator is quite painful.
>
> I sometimes implement "readers" that can be used like this (*):
>
> // Gets a reader at position zero.
> $reader = $readerProvider->getReader();
> while (FALSE !== $value = $reader->read()) {
>   [..]
> }
>
> (*) Note that I am using FALSE as an equivalent for "end of data". Of
> course it would be nice if we had a dedicated constant for this, that
> does not constrain the range of possible values of the iterator.

That distinction is the reason why next() and valid() are different
methods in iterators.

johannes
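
A runnable sketch of the NoRewindIterator suggestion, using an ArrayIterator
with sample values as an assumed data source. One detail worth knowing:
break leaves the loop before next() is called, so the element on which the
first loop stopped is delivered again to the second loop.

```php
<?php
$nit = new NoRewindIterator(new ArrayIterator([1, 2, 3, 4]));

// First loop: stop once two values have been collected.
$first = [];
foreach ($nit as $row) {
    $first[] = $row;
    if (count($first) === 2) {
        break;              // next() has not been called for this element
    }
}

// Second loop: rewind() is a no-op, so iteration resumes at the
// current position, which is still the element we broke on.
$second = [];
foreach ($nit as $row) {
    $second[] = $row;
}
```

If the element should not be seen twice, calling $nit->next() between the
two loops skips it.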
  99708
July 3, 2017 15:32 me@kelunik.com (Niklas Keller)
2017-07-03 17:27 GMT+02:00 Johannes Schlüter <johannes@schlueters.de>:

> On Sa, 2017-07-01 at 19:38 +0200, Andreas Hennings wrote:
> > Hello internals,
> > (this is my first email to this list, hopefully I'm doing ok.)
> >
> > -------------------------------------------------------------------
> > -------------
> >
> > Background / motivation:
> >
> > Currently in PHP we have an interface "Iterator", and a final class
> > "Generator"
> > (and others) that implement it.
> >
> > Using an iterator in foreach () is straightforward:
> > foreach ($iterator as $key => $value) {..}
> >
> > However, if we want to iterate only a portion and then continue
> > elsewhere at the position where we stopped, we need to do something
> > like this:
> >
> > for ($iterator->rewind(); $iterator->valid(); $iterator->next()) {
> >   $value = $iterator->current();
> >   [..]
> > }
> >
> > This is unpleasantly verbose, and also adds performance overhead due
> > to additional function calls.
>
> Wouldn't SPL's NoRewindIterator be enough?
>
> $nit = new NoRewindIterator($it);
>
> foreach ($nit as $row) {
>   break;
> }
> foreach ($nit as $row) {
>   // continues same iteration
> }
>
> >
> > Also, manually writing an iterator is quite painful.
> >
> > I sometimes implement "readers" that can be used like this (*):
> >
> > // Gets a reader at position zero.
> > $reader = $readerProvider->getReader();
> > while (FALSE !== $value = $reader->read()) {
> >   [..]
> > }
> >
> > (*) Note that I am using FALSE as an equivalent for "end of data". Of
> > course it would be nice if we had a dedicated constant for this, that
> > does not constrain the range of possible values of the iterator.
>
> That distinction is the reason why next() and valid() are different
> methods in iterators.

Not really, Iterator::next() returns void, so could as well return bool.

Regards, Niklas
  99712
July 3, 2017 15:53 johannes@schlueters.de (Johannes Schlüter)
On Mo, 2017-07-03 at 17:32 +0200, Niklas Keller wrote:

> > That distinction is the reason why next() and valid() are different
> > methods in iterators.
>
> Not really, Iterator::next() returns void, so could as well return
> bool.

Well, that story is a bit longer and I cut it short. Let's assume we
remove valid and use next's return value. Then we don't know if the
first element exists or not, as next is only called after the first
iteration. An alternative might be using only current() but then we
need a special, magic, return value to mark the end or references.
Both are bad.

johannes
  99715
July 3, 2017 16:08 andreas@dqxtech.net (Andreas Hennings)
On Mon, Jul 3, 2017 at 5:53 PM, Johannes Schlüter
<johannes@schlueters.de> wrote:
> On Mo, 2017-07-03 at 17:32 +0200, Niklas Keller wrote:
>
>> > That distinction is the reason why next() and valid() are different
>> > methods in iterators.

I would rather say this is the reason why current() and valid() are
different methods.
A combination of ->current() and next() would not cause problems.

>>
>> Not really, Iterator::next() returns void, so could as well return
>> bool.
>
> Well, that story is a bit longer and I cut it short. Let's assume we
> remove valid and use next's return value. Then we don't know if the
> first element exists or not, as next is only called after the first
> iteration. An alternative might be using only current() but then we
> need a special, magic, return value to mark the end oder references.
> Both are bad.

The proposed ->read() method is more or less the same as ->current()
plus ->next(), if ->current() always returns FALSE on end-of-data.

An object that has both ->read() and ->valid() would remove the need
for a special magic return value.

But in many many cases, having a magic return value FALSE is totally
acceptable.
Think of fgetcsv(), fread(), or PDOStatement::fetch().

So if we wanted, we could have one interface "SimpleReader" or
"FalseTerminatedReader" with only the ->read() method, and another
interface "Reader" with both ->read() and ->valid().

(Sometimes I would like to have a new data type "unique symbol" for
constants that are not strings or integers. So we could say "while
(EOF !== $value = $reader->read());" but this is another topic)

>
> johannes
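
The "unique symbol" idea can be approximated in userland with a singleton
sentinel object compared by identity. The class names EndOfData and
SentinelReader below are assumptions for illustration, not an existing API.
With such a sentinel, FALSE, 0 and NULL all remain valid values.

```php
<?php
// Hypothetical end-of-data marker: exactly one instance ever exists,
// so a strict comparison against it cannot collide with real data.
final class EndOfData {
    private static $instance;
    private function __construct() {}
    public static function get(): self {
        return self::$instance ?? (self::$instance = new self());
    }
}

final class SentinelReader {
    private $values;
    private $pos = 0;

    public function __construct(array $values) {
        $this->values = array_values($values);
    }

    public function read() {
        return $this->pos < count($this->values)
            ? $this->values[$this->pos++]
            : EndOfData::get();
    }
}

$EOF = EndOfData::get();
$reader = new SentinelReader([false, 0, null]);
$seen = [];
while ($EOF !== $value = $reader->read()) {
    $seen[] = $value;
}
```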
  99716
July 3, 2017 16:11 andreas@dqxtech.net (Andreas Hennings)
On Mon, Jul 3, 2017 at 6:08 PM, Andreas Hennings <andreas@dqxtech.net> wrote:
> On Mon, Jul 3, 2017 at 5:53 PM, Johannes Schlüter
> <johannes@schlueters.de> wrote:
>> On Mo, 2017-07-03 at 17:32 +0200, Niklas Keller wrote:
>>
>>> > That distinction is the reason why next() and valid() are different
>>> > methods in iterators.
>
> I would rather say this is the reason why current() and valid() are
> different methods.
> A combination of ->current() and next() would not cause problems.
>
>>>
>>> Not really, Iterator::next() returns void, so could as well return
>>> bool.

Just so we are on the same page:
Changing the behavior of the Iterator interface would be a BC break.
Introducing a new interface, and letting Generator implement it, would
not break anything.
(except for the possible name clash, if an interface with the same
name already exists in userland outside of a namespace)

I assume we already agree on this, and this was more of a thought experiment.
  99714
July 3, 2017 15:58 andreas@dqxtech.net (Andreas Hennings)
On Mon, Jul 3, 2017 at 5:27 PM, Johannes Schlüter
<johannes@schlueters.de> wrote:
>
> Wouldn't SPL's NoRewindIterator be enough?
>
> $nit = new NoRewindIterator($it);
>
> foreach ($nit as $row) {
>   break;
> }
> foreach ($nit as $row) {
>   // continues same iteration
> }

I had not noticed this class :)

My motivation was to be able to use iterators and readers interchangeably.
Readers are easier to implement as classes.
Iterators have the benefit of the generator syntax.
The idea is to have a library where some stuff is implemented as
reader classes, and some other things are implemented as iterator
classes.

Then for the actual operational code, I need to choose to use either
iterator syntax or reader syntax. So either I need adapters for all
iterators, or I need adapters for all readers.

The NoRewindIterator would make it more viable to work with iterator
syntax on operation level.

So yes, it is a helpful hint. Not sure if it makes me fully happy.
  99759
July 5, 2017 10:51 rowan.collins@gmail.com (Rowan Collins)
On 3 July 2017 16:58:16 BST, Andreas Hennings <andreas@dqxtech.net> wrote:
>My motivation was to be able to use iterators and readers
>interchangeably.
>Readers are easier to implement as classes.
>Iterators have the benefit of the generator syntax.
>The idea is to have a library where some stuff is implemented as
>reader classes, and some other things are implemented as iterator
>classes.
>
>Then for the actual operational code, I need to choose to use either
>iterator syntax or reader syntax. So either I need adapters for all
>iterators, or I need adapters for all readers.

An alternative to adapters in this case might be a Trait that allows you
to write the class as a Reader, but turn it into a fully-functioning
Iterator with one line of code. If you always call your read method read()
you only need a single adapter which calls that at the appropriate times
and buffers the value to know if it's reached the end. That adapter could
be "pasted" by a Trait into each class where read() is defined, and read()
could even be private.

Regards,

--
Rowan Collins
[IMSoP]
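
A sketch of how such a trait could look. The names ReaderIteratorTrait,
resetReader() and CountdownIterator are assumptions for illustration; read()
is protected here rather than private so that the abstract declaration also
works on PHP 7. The trait buffers one value ahead so that valid() can
answer without consuming anything.

```php
<?php
// Hypothetical trait: the using class supplies read() (FALSE means end of
// data) and may override resetReader() to support rewinding.
trait ReaderIteratorTrait {
    private $buffer = false;
    private $index = -1;

    abstract protected function read();

    protected function resetReader(): void {}

    public function rewind(): void {
        $this->resetReader();
        $this->index = -1;
        $this->next();            // pre-load the first value
    }

    public function valid(): bool {
        return $this->buffer !== false;
    }

    #[\ReturnTypeWillChange]
    public function current() {
        return $this->buffer;
    }

    #[\ReturnTypeWillChange]
    public function key() {
        return $this->index;
    }

    public function next(): void {
        $this->buffer = $this->read();
        $this->index++;
    }
}

// The class is written as a reader; the trait makes it an Iterator.
final class CountdownIterator implements Iterator {
    use ReaderIteratorTrait;

    private $n = 3;

    protected function resetReader(): void {
        $this->n = 3;
    }

    protected function read() {
        return $this->n > 0 ? $this->n-- : false;
    }
}

$values = iterator_to_array(new CountdownIterator());
```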
  99806
July 7, 2017 05:21 davey@php.net (Davey Shafik)
I believe the correct solution to this is an OuterIterator — an
OuterIterator contains the logic to constrain the loop, for example, we
have the LimitIterator, FilterIterator, RegexIterator, etc.

See this example:
http://php.net/manual/en/class.limititerator.php#example-4473

You can build your own OuterIterators, which implement whatever logic your
Reader should have.

OuterIterators are also composable, you can stack them, though of course
there are performance considerations here.
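
A small sketch of stacking SPL's OuterIterators, using sample data chosen
for illustration: a CallbackFilterIterator keeps the even numbers, and a
LimitIterator on top of it stops after the first two matches.

```php
<?php
$source = new ArrayIterator(range(1, 10));

// Inner layer: keep only even numbers.
$even = new CallbackFilterIterator($source, function ($n) {
    return $n % 2 === 0;
});

// Outer layer: take at most two items, starting at offset 0.
$firstTwoEven = new LimitIterator($even, 0, 2);

$result = iterator_to_array($firstTwoEven, false);
```

Each layer adds its own valid/current/next calls per element, which is the
performance consideration mentioned above.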

I would say that any alternative to OuterIterators should have a clear and
demonstrable benefit in both developer experience and performance before
it should be considered.

- Davey

On Wed, Jul 5, 2017 at 3:51 AM, Rowan Collins <rowan.collins@gmail.com>
wrote:

> On 3 July 2017 16:58:16 BST, Andreas Hennings <andreas@dqxtech.net> wrote:
> >My motivation was to be able to use iterators and readers
> >interchangeably.
> >Readers are easier to implement as classes.
> >Iterators have the benefit of the generator syntax.
> >The idea is to have a library where some stuff is implemented as
> >reader classes, and some other things are implemented as iterator
> >classes.
> >
> >Then for the actual operational code, I need to choose to use either
> >iterator syntax or reader syntax. So either I need adapters for all
> >iterators, or I need adapters for all readers.
>
> An alternative to adapters in this case might be a Trait that allows you
> to write the class as a Reader, but turn it into a fully-functioning
> Iterator with one line of code. If you always call your read method read()
> you only need a single adapter which calls that at the appropriate times
> and buffers the value to know if it's reached the end. That adapter could
> be "pasted" by a Trait into each class where read() is defined, and read()
> could even be private.
>
> Regards,
>
> --
> Rowan Collins
> [IMSoP]