[PHP-DEV] [RFC] [VOTE] is_literal

  115306
July 5, 2021 18:14 craig@craigfrancis.co.uk (Craig Francis)
Hi Internals,

I have opened voting on https://wiki.php.net/rfc/is_literal for the
is-literal function.

The vote closes 2021-07-19

The proposal is to add the function is_literal(), a simple way to identify
if a string was written by a developer, removing the risk of a variable
containing an Injection Vulnerability.

This implementation is for literal strings ONLY (after discussion over
allowing integers) and, thanks to the amazing work of Joe Watkins, now
works fully with compiler optimisations, interned strings etc.

Craig
  115308
July 6, 2021 06:38 george.banyard@gmail.com ("G. P. B.")
On Mon, 5 Jul 2021 at 20:15, Craig Francis <craig@craigfrancis.co.uk> wrote:

> Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. > > The vote closes 2021-07-19 > > The proposal is to add the function is_literal(), a simple way to identify > if a string was written by a developer, removing the risk of a variable > containing an Injection Vulnerability. > > This implementation is for literal strings ONLY (after discussion over > allowing integers) and, thanks to the amazing work of Joe Watkins, now > works fully with compiler optimisations, interned strings etc. > > Craig >
Hi Craig, Although I think the idea of the feature is useful, I'm not so sure about the implementation. I watched the talk you referenced a couple of times at different points in time (the first being a couple of years back), and I fail to see how this RFC is a similar implementation to it. As how they do it at Google is to have it part of the type systems (arguably in a weird way but nonaless), and due to the language being compiled the compiler will just flat out refuse to produce an executable if the types mismatch. From my understanding, the RFC's implementation is similar to what Google does, which is to "annotate" the string, but without having the guarantees of a compiler to back it up. This approach is totally reasonable for static analysis, as running it is akin to the compilation step in checking the validity. However, having this approach built into the language itself seems rather problematic to me. Ideally we would want to assign a variable to be of 'literal' type to ensure none of the actions applied to it demote it from being a literal, and when such a demotion would occur, for it to TypeError. Due to PHP's nature we cannot do this (yet?), therefore overloading the concatenation operation seems rather unwise. The case where concatenation between a literal and a non-literal happens, without error, is very similar to passing around a nullable type until one function/method/property doesn't accept null where it blows up into your face, and you need to track down where on earth did the null value came from, which might be multiple calls prior. And there has been a hell of a lot of talks/articles/etc. about *not* using nullable types due to this issue. This is I think the main issue with the current shape of the proposal. This implementation will detect certain security issues, but finding the root cause for them is going to be rather complicated, as the concatenation operation is basically kicking the can down the road about the responsibility of checking whether or not the result is a literal. Whereas using a function like concat_literal() which checks that the inputs are indeed literals provides immediate feedback that the type constraint is not being violated. Due to this reason, I'm voting against this proposal. Best regards, George P. Banyard
  115311
July 6, 2021 08:32 rowan.collins@gmail.com (Rowan Tommins)
On 06/07/2021 07:38, G. P. B. wrote:
> This is I think the main issue with the current shape of the proposal. > This implementation will detect certain security issues, but finding the > root cause for them is going to be rather complicated, as the concatenation > operation is basically kicking the can down the road about the > responsibility of checking whether or not the result is a literal. > Whereas using a function like concat_literal() which checks that the inputs > are indeed literals provides immediate feedback that the type constraint is > not being violated.
I still don't follow this reasoning, for the reasons I outlined here: https://externals.io/message/114835#114868 and again later: https://externals.io/message/115037#115156 You can write your own concat_literal() function with the current implementation: function concat_literal(string $a, string $b): string {    if( ! is_literal($a) || ! is_literal($b) ) {         throw new TypeError;    }    return $a . $b; } This does everything a native version would, and can be used in all the same places. What it won't do, is tell you when you've forgotten to use it, and used the normal string concatenation operator - but nor would a built-in implementation. Whether "$foo . $bar" is always non-literal, or non-literal only if one of its operands is, you're going to get an error about a non-literal string somewhere else in the program, and have to trace back to find where the "bad" concatenation happened. Regards, -- Rowan Tommins [IMSoP]
  115386
July 11, 2021 17:00 Danack@basereality.com (Dan Ackroyd)
Rowan Tommins collins@gmail.com> wrote:

> I still don't follow this reasoning, for the reasons I outlined here: > ... > Whether "$foo . $bar" is always non-literal, or non-literal only if one > of its operands is, you're going to get an error about a non-literal > string somewhere else in the program, and have to trace back to find > where the "bad" concatenation happened.
There are two differences. The first is "even if there is a problem, you at least can find the relevant code". With string concatenation preserving literal flags, when a problem is detected, the situation is: "There is a non-literal somewhere in this chunk of HTML. Good luck finding it in the string, and then figuring out where it comes from in the code." With having to jump through the hoop of using a specific function to preserve the literal flag, the situation is: "This line of code is wrong. It needs to be changed to use escaping." To be precise, you wouldn't _have_ to trace back to find where the non-literal value comes from. At the point where the error is you would probably just change it to use the appropriate escaping. e.g. if this line of fails, because the color suddenly becomes a non-literal: literal_concat("
"); you don't need to figure out where the color is coming from, you can just change it to use a function that does the appropriate escaping: html("
", $up->getColor()); In general, any use of literal_concat() where the correctness of it isn't obvious from reading the code, should probably start a conversation of "have you considered defaulting to using escaping/prepared query here"? # It prevents the problem reaching an end-user. The solution at Google prevents any security problem from reaching the end-user by refusing to compile code. One of the problems with PHP is that it is 'normal' for logs to be filled with warnings...which means that most people just ignore them. By allowing security problems to get through, rather than defaulting to being an error means that they will reach end-users, and the warning messages will end up in log files, and the process for finding and fixing them will take more time than people will like. So instead of defaulting to safe, it will default to being ignorable. And the whole point (in my understanding) of Chris Kern's talk, is that you need to make software safe by default.
> What it won't do, is tell you when you've forgotten to use it,
Which is a huge difference. If you're aware that you're touching a security sensitive bit of code, you're likely to do the right thing anyway, and so won't need the crutch of is_literal. If you're unaware that you're touching a security sensitive bit of code, you're more likely to accidentally include user input where it isn't meant to be used. That means the feature would be annoying for the people who need it most. Allowing errors to get through to users, for them to be allegedly fixed at a later date is referred to as "Normalization of deviance" and isn't a good choice for this RFC, imo. cheers Dan Ack * problems with static analysis as a solution ** Not everyone runs it, and it would be nicer to allow wordpress et al to provide this security threat detection rather than hoping users decided to use static analysis. ** It might require libraries also using the same static analysis annotations. ** It might get quite tricky to implement correctly, in particular for functions where the literal-ness of the output depends on the literal-ness of the input.
  115408
July 12, 2021 14:46 rowan.collins@gmail.com (Rowan Tommins)
On 11/07/2021 18:00, Dan Ackroyd wrote:
>> What it won't do, is tell you when you've forgotten to use it, > Which is a huge difference.
You've edited out the crucial last part of that sentence: > but nor would a built-in implementation I don't know how to make this point more emphatically: Neither a built-in literal_concat function, nor setting the internal flag to false when you use the "." operator, can prevent you using plain concatenation in the wrong place. Regardless of what the . operator does to the internal literal flag, the following code will give an error in exactly the same place: // What this line does to the literal flag is irrelevant: $blah = 'a' . 'b'; // This line will set the literal flag to false, but will not raise any errors: $foo = $blah . $_GET['foo']; // $foo is not literal, so $bar is not literal; but there are still no errors: $bar = $foo . '123'; // Finally, an error will be raised when trying to use $bar in literal_concat, regardless of whether it's built-in or user-implemented: $baz = literal_concat($bar, '456'); The only way to put that error in a different place would be if the . operator itself was able to raise errors, and I can't think of any way that would be possible. Regards, -- Rowan Tommins [IMSoP]
  115312
July 6, 2021 09:32 craig@craigfrancis.co.uk (Craig Francis)
On Tue, 6 Jul 2021 at 7:38 am, G. P. B. banyard@gmail.com> wrote:

> Although I think the idea of the feature is useful, > I'm not so sure about the implementation. > [...] > Whereas using a function like concat_literal() which checks that the > inputs are indeed literals provides immediate feedback that the type > constraint is not being violated.
Hi George, Thank you for your message. We have provided a userland `literal_concat()` function in the RFC to do exactly what you’re suggesting, while allowing developers to choose to do things like raise exceptions during development/testing, and ignore/log issues when running in production. So you absolutely can use it like that if you want. https://wiki.php.net/rfc/is_literal#support_functions We also agree that a dedicated type would be useful, but as noted by Joe and someniatko, that should come in 8.2 once the function is established (allowing us to potentially build on Intersection Types, and will involve a separate discussion). This is noted under "Future Scope". https://externals.io/message/114835#114847 The only difference is that we decided to allow string concatenation of literals, as we want to provide something that’s usable for everyone immediately, provides the same level of security, and doesn’t require a mass rewriting of existing code. (Which is basically a death sentence for most security-based improvements, as a lot of people won’t have the time/energy to do that. The more automatic security can be, the better, which is why something libraries can implement is ideal). https://wiki.php.net/rfc/is_literal#string_concatenation Excluding concatenation would almost certainly prevent libraries from using this check, simply because developers do use concatenation, which will result in too many invalid errors, requiring them to make substantial/unnecessary changes (i.e. replacing every string concat with `literal_concat()`, or using a special Query Builder for their SQL/HTML/CLI/etc). It’s also worth noting that developers who want to use `strict_types` are probably using static analysis already, where Psalm has just added support for this (thank you Matthew): https://github.com/vimeo/psalm/releases/tag/4.8.0 Thanks, Craig
  115387
July 11, 2021 17:09 Danack@basereality.com (Dan Ackroyd)
On Tue, 6 Jul 2021 at 10:32, Craig Francis <craig@craigfrancis.co.uk> wrote:
> which will result in too many invalid errors, requiring them to make > substantial/unnecessary changes
My understanding is that the majority of PHP developers use one of the many frameworks available. All of those provide libraries that people use rather than accessing the DB directly. At the other end of the scale, you have companies with large-ish engineering teams (e.g. 30+) where access to the DB is done through an in-house library. The only people who don't already go through an access layer of code to reach the DB are people who don't depend on frameworks, and maintain their own applications with a small (or singular) engineering team. This type of person may be over-represented in internals, but are probably relatively few in number compared to the majority of PHP users.
> requiring them to make substantial/unnecessary changes
My very strong suspicion is that if/when people start using this, the majority of changes that would be required would be actual security problems that ought to be fixed.
> (i.e. replacing every string concat with > `literal_concat()`, or using a special Query Builder for their > SQL/HTML/CLI/etc).
Most people should be using a library for building HTML and CLI escaping, because remembering how to do the escaping properly is really difficult. I certainly never remember the difference between the correct escaping for CSS and JS. But the changes required for supporting DB queries just don't seem that big to me. I've put a code example of an implementation below. I'm pretty sure that the changes needed to fix the actual security problems that exist in their code would be bigger, and also that tracking down a single leaked non-literal would take longer than migrating code to use the new feature. As I said in my reply to Rowan, making it easy to track down issues where they occur, minimises the cost of using this feature over the years it would be used. cheers Dan Ack # Before $query = "select * from preferences user_id = " . $_GET['user_id']; db_exec($query); # After $query[] = "select * from preferences user_id = "; $query[] = $_GET['user_id']; db_exec($query); And the function db_exec would be changed to something like: function db_exec(string|array $query_parts) { if (is_string($query_parts) && !is_literal($query_parts)) { throw new \Exception("Cannot use non literal string as query. Please pass the parts in as an array"); } else { foreach ($query_parts as $query_part) { if (is_string($query_part) && !is_literal($query_part) { throw new \Exception("non-literal string found [$query_part]"); } else if (is_int($query_part)) { // todo - decide if you want to allow this or not. // I personally wouldn't. } else { // todo - support other types } } $query = implode("", $query_parts); } // rest of db_exec here... }
  115392
July 12, 2021 01:09 craig@craigfrancis.co.uk (Craig Francis)
On Sun, 11 Jul 2021 at 18:09, Dan Ackroyd <Danack@basereality.com> wrote:

> As I said in my reply to Rowan, making it easy to track down issues where > they occur, minimises the cost of using this feature over the years it > would be used. >
This implementation allows you to do all of that. If you find debugging is a problem, you can choose not to use string concatenation (I haven't needed to). The implementation allows you to use a $query array, exactly as you describe: https://3v4l.org/aCFIT/rfc#vrfc.literals And the RFC itself provides you with a userland version of the `literal_concat()` function as you proposed, allowing anyone to customise it to their needs (e.g. only throwing an exception during development). But forcing every single project in existence to replace every instance of concatenation, especially when it is not necessary, and doesn’t improve security in any way, I would argue makes it too strict, and would harm adoption. Also, thanks for mentioning in your email to Rowan the problems with static analysis. Craig
  115405
July 12, 2021 13:16 Danack@basereality.com (Dan Ackroyd)
On Mon, 12 Jul 2021 at 02:09, Craig Francis <craig@craigfrancis.co.uk> wrote:
> > you can choose not to use string concatenation ... allowing anyone to customise it to their needs
Please can you explain how: i) An individual programmer can enforce that they don't accidentally use string concatenation for stuff that is used in a security sensitive context. ii) A team of 50 programmers can enforce that they don't accidentally use string concatenation for stuff that is used in a security sensitive context. iii) A library can enforce that string concatenation isn't used in the params passed to it.
> and doesn’t improve security in any way,
It prevents issues from being able to make it through to production. But the main reason would be to reduce the cost of using this feature long-term.
> you can choose not to use string concatenation (I haven't needed to).
Wait....what? Is your position both that preserving literal-ness across string concatenation is required, otherwise this feature is too hard to use, and at the same time, you've not needed that in your own applications. Is that right? Because if preserving literal-ness across strings wasn't required for you...why would it be required for every other project? And to be clear, I don't think it's required. I think by listening to feedback from people who aren't sure it's a good idea, who said "this is a good idea but only if it's really easy to start using it" that this RFC has been watered down from the most useful proposal. At the core, there is a good idea behind this RFC, but the set of trade-offs chosen just aren't the right ones, and aren't the "proven" trade-offs made at Google. cheers Dan Ack
  115414
July 12, 2021 18:56 craig@craigfrancis.co.uk (Craig Francis)
On Mon, 12 Jul 2021 at 14:17, Dan Ackroyd <Danack@basereality.com> wrote:

> On Mon, 12 Jul 2021 at 02:09, Craig Francis <craig@craigfrancis.co.uk> > wrote: > > > > you can choose not to use string concatenation ... allowing anyone to > customise it to their needs > > Please can you explain how: > > i) An individual programmer can enforce that they don't accidentally > use string concatenation for stuff that is used in a security > sensitive context. >
i) Usually if developers want to add specific restrictions for themselves or their team that the vast majority of a userbase would not need or use, they would use a separate tool that fills that niche to enforce their chosen coding style (like how they might want to enforce "tabs vs spaces" in their own codebase).
> ii) A team of 50 programmers can enforce that they don't accidentally > use string concatenation for stuff that is used in a security > sensitive context. >
ii) Same answer as i) as it scales. iii) A library can enforce that string concatenation isn't used in the
> params passed to it. >
iii) A library shouldn't care if the developer used concatenation, it should only care if user data has been included incorrectly (i.e. checking for an Injection Vulnerability). A library’s purpose is to ensure whether code is safe or not, not about enforcing personal coding styles.
> and doesn’t improve security in any way, > > It [disallowing concatenation] prevents issues from being able to make it > through to production. > But the main reason would be to reduce the cost of using this feature > long-term. >
Disallowing concatenation doesn’t guarantee that issues can’t get through to production though; for example, something that can sometimes be a non-literal, but during development defaults to a literal. From a security point of view, it doesn’t matter whether the error is caught at the point of concatenation, or when it’s checked as the library receives it, because the injection vulnerability gets caught either way. I take it the "cost" that you’re referring to is just the debugging time? Ultimately any extra debug time that might occur, is magnitudes less than the time it would take almost every developer to check and rewrite most of the projects that exist today, which is what the idea of not supporting concatenation requires. For a developer who really finds debugging too onerous, there are other ways of debugging - using tools better suited for it such as XDebug or Static Analysis tools like Psalm (and as above, if you were wanting to force yourself/your team to not concatenate, you would be using a Static Analysis tool already (i/ii)).
> you can choose not to use string concatenation (I haven't needed to). > > Wait....what? > > Is your position both that preserving literal-ness across string > concatenation is required, otherwise this feature is too hard to use, > and at the same time, you've not needed that in your own applications. > > Is that right? Because if preserving literal-ness across strings > wasn't required for you...why would it be required for every other > project? > > And to be clear, I don't think it's required. >
We are trying to improve the language for the majority of developers. I’m an experienced PHP developer who is genuinely passionate about security. I find it fun, curating my code to be as secure as possible is practically a hobby. That makes me an Outlier. Like a lot of us here. My focus is on writing an RFC that works for as many developers as possible, so whether it’s ‘necessary’ in my own personal projects is irrelevant to whether a simple safety feature should exist for the community. We need to make things easy and safe for the people who are /not/ just high-level programmers, but for people who don’t know everything we do and need things to be simple and functional as possible. My projects do use a lot of string concatenation. Not an erroneously high amount, probably about normal. So let’s say we do break string concatenation: For some numbers, I've found roughly 1,300 instances of SQL and HTML concatenation in my projects. And even if I would be willing to go through such a big task to replace them, the real problem is actually finding them, because if you’re trying to use a search, well, who would have thought looking for "." would return so many results?
> I think by listening to feedback from people who aren't sure it's a > good idea, who said "this is a good idea but only if it's really easy > to start using it" that this RFC has been watered down from the most > useful proposal. At the core, there is a good idea behind this RFC, > but the set of trade-offs chosen just aren't the right ones, and > aren't the "proven" trade-offs made at Google. >
The proposal hasn’t changed. This is keeping to the original concept, and while you wanted to remove string concatenation support, that does not mean that anything so strict was our intention. The proposal was always meant for the greatest and easiest adoption possible, but that was your creative difference with us, which is fine, but doesn’t mean that this isn’t exactly as originally intended. The only real change to `is_literal()` that has been made since the start of this RFC is improvements to the compilation process. Otherwise it is the same as day one. If it will put your mind at ease, Krzysztof Kotowicz is probably the best placed person to provide feedback, as he is the implementor of this same principle in JavaScript. And he provided feedback to this project, saying that they trust concatenated constants, and that yes, while a programmer could go out of their way to do something *intentionally* dangerous (like building up an array of single character literals, and joining them together based on user input), the “go-safe-html” library authors decided that "the ergonomics of trusting concatenated constants far outweighs the security concern". Craig
  115440
July 16, 2021 14:50 Danack@basereality.com (Dan Ackroyd)
On Mon, 12 Jul 2021 at 19:57, Craig Francis <craig@craigfrancis.co.uk> wrote:

> the “go-safe-html” library authors decided that > "the ergonomics of trusting concatenated constants far outweighs the security concern".
Go is a quite different programming language to PHP. When they say 'constants', they appear to be able to enforce that by using a clever feature of that language https://github.com/google/safehtml/blob/2057dd9c30f9e264f4d01c29d886d51f1b519302/template/template.go#L440-L452 : ``` // stringConstant is an unexported string type. Users of this package cannot // create values of this type except by passing an untyped string constant to // functions which expect a stringConstant. This type must be used only in // function and method parameters. type stringConstant string func stringConstantsToStrings(strs []stringConstant) []string { ret := make([]string, 0, len(strs)) for _, s := range strs { ret = append(ret, string(s)) } return ret } ``` i.e. this code works, as the url template is an actual string constant (not a variable): ``` safehtml.TrustedResourceURLFormatFromConstant( `//www.youtube.com/v/%{id}?hl=%{lang}`, map[string]string{ "id": "abc0def1", "lang": "en", }) ``` Attempting to use a string variable, fails to even compile: ``` format := `//www.youtube.com/v/%{id}?hl=%{lang}` safehtml.TrustedResourceURLFormatFromConstant( format, map[string]string{ "id": "abc0def1", "lang": "en", }) cannot use format (type string) as type safehtml.stringConstant in argument to safehtml.TrustedResourceURLFormatFromConstant ``` The current JavaScript equivalent ideas for string literals appear to be inactive or archived: * https://github.com/mikewest/tc39-proposal-literals - This repository has been archived by the owner. It is now read-only. * https://github.com/tc39/proposal-array-is-template-object - not particularly active. But my understanding is that like the Go implementation, they are proposing to enforce literal-ness of function parameters through special compilation rules for parameters to function calls, rather than passing literal strings around to arbitrary places - below* or https://tc39.es/proposal-array-is-template-object/#sec-gettemplateobject The other JavaScript approach for dealing with trusted types (https://auth0.com/blog/securing-spa-with-trusted-types/) is even more different than this proposal. It seems pretty inaccurate to claim that either the safehtml library or the proposal for JavaScript support the choice made for the PHP RFC. They don't appear to actually allow carrying literal-ness through variables, only through compile-time constants that are placed inside the parentheses or a function call. They also work in a different way, and use features of those languages not available in PHP. cheers Dan Ack * this is the definition for the proposed JavaScript literal implementation, I think. ``` Runtime Semantics: GetTemplateObject ( templateLiteral ) The abstract operation GetTemplateObject is called with a Parse Node, templateLiteral, as an argument. It performs the following steps: 1. Let realm be the current Realm Record. 2. Let templateRegistry be realm.[[TemplateMap]]. 3. For each element e of templateRegistry, do a. If e.[[Site]] is the same Parse Node as templateLiteral, then i. Return e.[[Array]]. 4. Let rawStrings be TemplateStrings of templateLiteral with argument true. 5. Let cookedStrings be TemplateStrings of templateLiteral with argument false. 6. Let count be the number of elements in the List cookedStrings. 7. Assert: count ≤ 232 - 1. 8. Let template be ! ArrayCreate(count). 9. Set template.[[TemplateObject]] to true. 10. Let rawObj be ! ArrayCreate(count). 11. Let index be 0. 12. ... ```
  115443
July 16, 2021 17:40 craig@craigfrancis.co.uk (Craig Francis)
On Fri, 16 Jul 2021 at 15:50, Dan Ackroyd <Danack@basereality.com> wrote:

> On Mon, 12 Jul 2021 at 19:57, Craig Francis <craig@craigfrancis.co.uk> > wrote: > > > the “go-safe-html” library authors decided that > > "the ergonomics of trusting concatenated constants far outweighs the > security concern". > > Go is a quite different programming language to PHP. >
Go is different, it's limited to running the check at compile time. That's why I was referencing the “go-safe-html” library. PHP is more dynamic, so we don't need to have the same restrictions (we can allow the developer to concatenate the string, which allows us to support existing code, rather than relying on the library). The current JavaScript equivalent ideas for string literals appear to be
> inactive or archived.
As noted in the RFC, it's this one: https://github.com/tc39/proposal-array-is-template-object Krzysztof has just confirmed that he’s working on it, and is currently getting it through tc39 (specifically updates related to Realms, a way of executing JavaScript within the context of a new global object, something PHP does not need to worry about). The other JavaScript approach for dealing with trusted types
> (https://auth0.com/blog/securing-spa-with-trusted-types/) is even more > different than this proposal.
While the article shows some React/Angular code, the focus is on Trusted Types, which works with this concept. It protects unsafe APIs (like innerHTML), where you can create a policy with methods to check/filter values (e.g. forcing the use of DOMPurify). The isTemplateObject method (which checks that a template string was created by the developer) will work with Trusted Types, so you don't need to rely on filtering (unreliable/limited). Craig
  115328
July 6, 2021 14:48 dik.takken@gmail.com (Dik Takken)
On 05-07-2021 20:14, Craig Francis wrote:
> Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function.
I am glad to see that the RFC eventually turned out as originally proposed. It is simple and provides clear and strong guarantees about the origins of string data. While it has its limitations I do see the potential it has as a building block for building strong security guarantees in applications. Success of this feature probably depends on it being integrated in frameworks. If I had a vote I would vote along with framework authors. Regards, Dik Takken
  115337
July 7, 2021 09:23 nikita.ppv@gmail.com (Nikita Popov)
On Mon, Jul 5, 2021 at 8:15 PM Craig Francis <craig@craigfrancis.co.uk>
wrote:

> Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. > > The vote closes 2021-07-19 > > The proposal is to add the function is_literal(), a simple way to identify > if a string was written by a developer, removing the risk of a variable > containing an Injection Vulnerability. > > This implementation is for literal strings ONLY (after discussion over > allowing integers) and, thanks to the amazing work of Joe Watkins, now > works fully with compiler optimisations, interned strings etc. >
The new implementation, while being very predictable, also comes at a cost. The RFC doesn't really explain how this works, so let me explain it here (based on a cursory look through the patch): Strings have a new flag that indicates whether they are "literal". The problem is that PHP performs string interning, which means that certain strings, mostly those seen during compilation of PHP code, are deduplicated. If you write "foo" two times in a piece of PHP code, there will only be one interned string "foo". Now, what happens if there is both a literal string "foo" and a non-literal string "foo" and we want to intern both? I believe what the original implementation did is to just assume that all interned strings are literal strings (this is an educated guess, I don't know where to find the old implementation). This makes things technically very simple, and is a pretty sensible assumption in practice. The reason is that interned strings are almost always derived from literal strings in the program. The main exception to that are a) single character strings, where a lot of code will prefer fetching an interned string rather than allocating a new one and b) the result of optimizations in conjunction with opcache, which will render strings like $a=1; $b="Foo".$a; interned as a result of constant propagation. What the new implementation does is to actually allow two separate interned strings for "foo"(literal) and "foo"(not literal). Part of the hashtable implementation needs to be duplicated, because it now needs to distinguish strings that are nominally equal. The complexity involved here doesn't look terrible, but I'm unsure about the performance and memory usage impact. Where per-request strings are concerned, I expect duplication to be low, because most per-request interned strings are going to be literal. However, I don't think this holds for permanent interned strings, which will be predominantly (or entirely?) non-literal. At least it doesn't look to me like names of internal functions/classes/etc are considered literal. I think this may be problematic, in that now the permanent non-literal interned string in the function/class tables will be different from the per-request literal interned string used by user code to reference it. This means that all of these are going to be duplicated, and will no longer benefit from efficient identity comparison, and will have to be compared based on string contents instead. I'm not sure what the actual impact on performance would be. The RFC indicates that the impact is minor, but I believe those measurements were made with the original version of the RFC, which did not try to distinguish literal and non-literal interned strings. Overall, this is a no vote from my side. While I was not entirely convinced by the RFC, I was willing to accept the original approach based on sheer simplicity -- tracking the "probably literal" flag didn't really cost us anything in terms of either technical complexity or performance. With the new implementation this isn't the case anymore. I could be convinced that the technical implications here are unproblematic based on further analysis, but the current RFC doesn't really discuss this point at all. Regards, Nikita
  115338
July 7, 2021 09:47 craig@craigfrancis.co.uk (Craig Francis)
On Wed, 7 Jul 2021 at 10:23 am, Nikita Popov ppv@gmail.com> wrote:

> The RFC indicates that the impact is minor, but I believe those > measurements were made with the original version of the RFC, which did not > try to distinguish literal and non-literal interned strings
Hi Nikita, The performance test results in the RFC are based on the latest version, updated on Monday (before the vote started), Máté Kocsis did these tests independently to ensure there was no bias from me (my own tests were roughly the same). Joe Watkins would be better placed to explain the details of the implementation. We have always tried to keep the implementation as simple as possible, but there were concerns that certain optimisations by the compiler would make certain strings appear as literals, not in an unsafe way, but in a way that might be confusing to developers, with the examples being noted in the RFC: https://wiki.php.net/rfc/is_literal#compiler_optimisations Thanks, Craig
  115339
July 7, 2021 09:50 krakjoe@gmail.com (Joe Watkins)
The perf numbers are for current implementation.

Predictability seems more important than simplicity, considering what the
feature is for.

The only way to make the simpler implementation predictable in the same
kind of way was found to be unacceptable during discussion.

Cheers
Joe

On Wednesday, 7 July 2021, Nikita Popov ppv@gmail.com> wrote:

> On Mon, Jul 5, 2021 at 8:15 PM Craig Francis <craig@craigfrancis.co.uk> > wrote: > > > Hi Internals, > > > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > > is-literal function. > > > > The vote closes 2021-07-19 > > > > The proposal is to add the function is_literal(), a simple way to > identify > > if a string was written by a developer, removing the risk of a variable > > containing an Injection Vulnerability. > > > > This implementation is for literal strings ONLY (after discussion over > > allowing integers) and, thanks to the amazing work of Joe Watkins, now > > works fully with compiler optimisations, interned strings etc. > > > > The new implementation, while being very predictable, also comes at a cost. > The RFC doesn't really explain how this works, so let me explain it here > (based on a cursory look through the patch): > > Strings have a new flag that indicates whether they are "literal". The > problem is that PHP performs string interning, which means that certain > strings, mostly those seen during compilation of PHP code, are > deduplicated. If you write "foo" two times in a piece of PHP code, there > will only be one interned string "foo". Now, what happens if there is both > a literal string "foo" and a non-literal string "foo" and we want to intern > both? > > I believe what the original implementation did is to just assume that all > interned strings are literal strings (this is an educated guess, I don't > know where to find the old implementation). This makes things technically > very simple, and is a pretty sensible assumption in practice. The reason is > that interned strings are almost always derived from literal strings in the > program. The main exception to that are a) single character strings, where > a lot of code will prefer fetching an interned string rather than > allocating a new one and b) the result of optimizations in conjunction with > opcache, which will render strings like $a=1; $b="Foo".$a; interned as a > result of constant propagation. > > What the new implementation does is to actually allow two separate interned > strings for "foo"(literal) and "foo"(not literal). Part of the hashtable > implementation needs to be duplicated, because it now needs to distinguish > strings that are nominally equal. > > The complexity involved here doesn't look terrible, but I'm unsure about > the performance and memory usage impact. Where per-request strings are > concerned, I expect duplication to be low, because most per-request > interned strings are going to be literal. However, I don't think this holds > for permanent interned strings, which will be predominantly (or entirely?) > non-literal. At least it doesn't look to me like names of internal > functions/classes/etc are considered literal. I think this may be > problematic, in that now the permanent non-literal interned string in the > function/class tables will be different from the per-request literal > interned string used by user code to reference it. This means that all of > these are going to be duplicated, and will no longer benefit from efficient > identity comparison, and will have to be compared based on string contents > instead. > > I'm not sure what the actual impact on performance would be. The RFC > indicates that the impact is minor, but I believe those measurements were > made with the original version of the RFC, which did not try to distinguish > literal and non-literal interned strings. > > Overall, this is a no vote from my side. While I was not entirely convinced > by the RFC, I was willing to accept the original approach based on sheer > simplicity -- tracking the "probably literal" flag didn't really cost us > anything in terms of either technical complexity or performance. With the > new implementation this isn't the case anymore. I could be convinced that > the technical implications here are unproblematic based on further > analysis, but the current RFC doesn't really discuss this point at all. > > Regards, > Nikita >
  115373
July 8, 2021 22:23 kocsismate90@gmail.com (=?UTF-8?B?TcOhdMOpIEtvY3Npcw==?=)
Hi Nikita,

I performed a few other benchmarks in order to provide a little bit more
insights into the performance aspect of the RFC. My latest measurement is
using the same setup as the previous
one, although I made a few smaller changes in the process (e.g. changing
the order of the tests). So at the end, I got the following results:

- Laravel demo app: +0.1%
- Symfony demo app: +0.43%
- bench.php: +0.4%
- custom concat bench: +3.89%

The results are comparing the "literals" branch with the commit in master
on which Joe's work is based. As far as I remember, the simpler variant of
is_literals had a 0.1% slowdown
in case of Symfony, and 2.09% in case of the custom concat benchmark.
Personally, I think even the current numbers are acceptable, especially
because PHP 8.1 is going to be
roughly 30% faster than PHP 8.0 according to my benchmark in case of the
Symfony demo app. Laravel's result is very similar, but I don't remember
about the exact numbers.

Regards:
Máté
  115435
July 16, 2021 10:04 nikita.ppv@gmail.com (Nikita Popov)
On Fri, Jul 9, 2021 at 12:23 AM Máté Kocsis <kocsismate90@gmail.com> wrote:

> Hi Nikita, > > I performed a few other benchmarks in order to provide a little bit more > insights into the performance aspect of the RFC. My latest measurement is > using the same setup as the previous > one, although I made a few smaller changes in the process (e.g. changing > the order of the tests). So at the end, I got the following results: > > - Laravel demo app: +0.1% > - Symfony demo app: +0.43% > - bench.php: +0.4% > - custom concat bench: +3.89% > > The results are comparing the "literals" branch with the commit in master > on which Joe's work is based. As far as I remember, the simpler variant of > is_literals had a 0.1% slowdown > in case of Symfony, and 2.09% in case of the custom concat benchmark. > Personally, I think even the current numbers are acceptable, especially > because PHP 8.1 is going to be > roughly 30% faster than PHP 8.0 according to my benchmark in case of the > Symfony demo app. Laravel's result is very similar, but I don't remember > about the exact numbers. >
Thanks Máté! While I'm not really happy with the technical implications, your numbers clearly show that there is barely any impact in practice. I've dropped my "no" vote for that reason. Regards, Nikita
>
  115378
July 9, 2021 10:49 lauri.kentta@gmail.com (=?UTF-8?Q?Lauri_Kentt=C3=A4?=)
On 2021-07-07 12:23, Nikita Popov wrote:
> permanent interned strings, which will be predominantly (or entirely?) > non-literal. At least it doesn't look to me like names of internal > functions/classes/etc are considered literal. I think this may be > problematic, in that now the permanent non-literal interned string in > the > function/class tables will be different from the per-request literal > interned string used by user code to reference it.
Just wondering, if this is a major performance problem, is there a compelling reason why the internal names couldn't be marked literal (even if it's technically "wrong")? It seems unlikely to get a function or class name accidentally in a SQL query and even less likely that user input was involved. -- Lauri Kenttä
  115381
July 9, 2021 14:07 craig@craigfrancis.co.uk (Craig Francis)
On Fri, 9 Jul 2021 at 11:49, Lauri Kenttä kentta@gmail.com> wrote:

> is there a compelling reason why the internal names couldn't be marked > literal > (even if it's technically "wrong")?
Hi Lauri, Just again quickly noting we’re only talking about ~0.43% difference, nothing major in any way. But we wanted to ensure developers didn't wonder why something was seen as a literal in one case, but not in another (we wanted to ensure consistency), because most people don’t know much about how the internals of PHP work - especially optimisations which are supposed to be kind of ‘invisible’ to developers. This change also helped when a couple of other compiler optimisations happen: https://wiki.php.net/rfc/is_literal#compiler_optimisations Craig
  115425
July 14, 2021 14:52 craig@craigfrancis.co.uk (Craig Francis)
Injection Vulnerabilities remain at the top of the OWASP "Top 10 Web
Application Security Risks".

It’s important to remember that Injection Vulnerabilities don't just affect
the developer, but rather the data of potentially thousands of people using
the website/system.

These can even occur when using libraries. Take this example from CakePHP,
where the developer has dangerously included user data into the SQL:

  $users->find()->where(['age >= ' . $_GET['age']]);

By distinguishing strings from a trusted developer, from strings that may
be attacker controlled, libraries can ensure values that go directly into
the SQL, HTML, CLI, etc have not been "Injected" with user data.

PHP is now lagging behind other languages, where Java and Go can already
test for developer defined strings (it's also being implemented in
JavaScript).

is_literal() is a simple and minor change that simply utilises a currently
unused flag on strings to mark whether the string was written by the
developer. It requires no rewriting of code by the developer to work, no
grand visionary overhaul of the PHP language, with only a 0.43% difference
in speed that is too small to measure with normal internet/database
variability. It’s just a basic but effective way of being able to warn
about and locate Injection Vulnerabilities (and therefore providing a way
for libraries to directly educate developers).

The vote for this RFC ends on Monday the 19th of July, 7:30pm UK time and
6:30pm UTC, and needs your support.
https://wiki.php.net/rfc/is_literal

The following link provides more examples of these mistakes, based on code
I’ve found on production servers. They show how similar they are to the
examples found in the libraries official documentation, and how easy it is
for a developer to make a small tweak that ends up being very dangerous:
https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/mistakes.php

I have created 3 example libraries you can experiment with, to see what
is_literal() can do:
https://github.com/craigfrancis/php-is-literal-rfc/tree/main/examples

I'm happy to take questions on and off list.

Vote ends on Monday the 19th of July, 7:30pm UK time and 6:30pm UTC.
https://wiki.php.net/rfc/is_literal

Thanks,
Craig

>
  115429
July 16, 2021 00:47 craig@craigfrancis.co.uk (Craig Francis)
Just another day, and another injection vulnerability (please patch):

https://woocommerce.com/posts/critical-vulnerability-detected-july-2021/

If only escaping wasn't being used, so user values did not get included in
certain strings :-)

diff -r
woocommerce.5.5.0/includes/data-stores/class-wc-webhook-data-store.php
woocommerce.5.5.1/includes/data-stores/class-wc-webhook-data-store.php
280c280
< $search          = ! empty( $args['search'] ) ? "AND `name` LIKE '%" .
$wpdb->esc_like( sanitize_text_field( $args['search'] ) ) . "%'" : '';
---
> $search = ! empty( $args['search'] ) ? $wpdb->prepare( "AND `name` LIKE %s", '%' . $wpdb->esc_like( sanitize_text_field(
$args['search'] ) ) . '%' ) : ''; The vote for the is_literal RFC ends on Monday the 19th of July, 7:30pm UK time and 6:30pm UTC, and needs your support. https://wiki.php.net/rfc/is_literal
  115444
July 16, 2021 20:23 divinity76@gmail.com (Hans Henrik Bergan)
short of a bug in esc_like(), i don't even see the vulnerability issue in
that code?
that sanitize call looks like a data corruption issue and i bet it fails to
search for binary data, but i don't see the critical vulnerability?
  115446
July 17, 2021 01:44 craig@craigfrancis.co.uk (Craig Francis)
On Fri, 16 Jul 2021 at 21:24, Hans Henrik Bergan <divinity76@gmail.com>
wrote:

> short of a bug in esc_like(), i don't even see the vulnerability issue in > that code? >
Sorry Hans, I copied the wrong diff. There were only 2 changes from woocommerce 5.5.0 to 5.5.1. Like you I was wondering what that diff was doing before posting - I'm fairly sure it's just to be consistent with the other lines (which all use $wpdb->prepare). The diff I should have copied is: diff -r woocommerce.5.5.0/packages/woocommerce-blocks/src/StoreApi/Utilities/ProductQueryFilters.php woocommerce.5.5.1/packages/woocommerce-blocks/src/StoreApi/Utilities/ProductQueryFilters.php 86c86,92 < $attributes_to_count = array_map( 'wc_sanitize_taxonomy_name', $attributes ); ---
> $attributes_to_count = array_map( > function( $attribute ) { > $attribute = wc_sanitize_taxonomy_name( $attribute ); > return esc_sql( $attribute ); > }, > $attributes > );
In context `$attributes_to_count` simply goes to: $attributes_to_count_sql = 'AND term_taxonomy.taxonomy IN ("' . implode( '","', $attributes_to_count ) . '")'; Where the the esc_sql() is basically a call to mysqli_real_escape_string(), which explains why it needs risky quotes in/around implode. Craig
  115449
July 17, 2021 07:58 divinity76@gmail.com (Hans Henrik Bergan)
oh thanks, now the vulnerability is clear. (i would still complain on that
as a pull request though, using double quotes for strings is just a
horrible idea, it's not compliant with ISO sql, and it depends on the MySQL
server it's running on *not* having @@SQL_MODE=ANSI_QUOTES enabled which
changes double-quotes to be ISO-compliant instead of mysql-borky (double
quotes for strings only work on MySQL because of a MySQL quirk, ISO SQL say
double quotes is meant for identifiers just like ` ), and it's not portable
across different SQL servers, and all of these problems go away if you just
use the sensible alternative: single quotes - seriously someone should
complain to whoever wrote that, so at least he doesn't do the same in the
future
i can tell from only that diff that, at least as of 5.5.1,  woocommerce is
not compatible with @@SQL_MODE=ANSI_QUOTES :p
)


On Sat, 17 Jul 2021 at 03:45, Craig Francis <craig@craigfrancis.co.uk>
wrote:

> On Fri, 16 Jul 2021 at 21:24, Hans Henrik Bergan <divinity76@gmail.com> > wrote: > >> short of a bug in esc_like(), i don't even see the vulnerability issue in >> that code? >> > > > Sorry Hans, I copied the wrong diff. > > There were only 2 changes from woocommerce 5.5.0 to 5.5.1. > > Like you I was wondering what that diff was doing before posting - I'm > fairly sure it's just to be consistent with the other lines (which all use > $wpdb->prepare). > > The diff I should have copied is: > > diff -r > woocommerce.5.5.0/packages/woocommerce-blocks/src/StoreApi/Utilities/ProductQueryFilters.php > woocommerce.5.5.1/packages/woocommerce-blocks/src/StoreApi/Utilities/ProductQueryFilters.php > 86c86,92 > < $attributes_to_count = array_map( 'wc_sanitize_taxonomy_name', > $attributes ); > --- > > $attributes_to_count = array_map( > > function( $attribute ) { > > $attribute = wc_sanitize_taxonomy_name( $attribute ); > > return esc_sql( $attribute ); > > }, > > $attributes > > ); > > In context `$attributes_to_count` simply goes to: > > $attributes_to_count_sql = 'AND term_taxonomy.taxonomy IN ("' . implode( > '","', $attributes_to_count ) . '")'; > > Where the the esc_sql() is basically a call > to mysqli_real_escape_string(), which explains why it needs risky quotes > in/around implode. > > Craig > > > >
  115451
July 17, 2021 09:59 craig@craigfrancis.co.uk (Craig Francis)
On Sat, 17 Jul 2021 at 08:59, Hans Henrik Bergan <divinity76@gmail.com>
wrote:

> i can tell from only that diff that, at least as of 5.5.1, woocommerce is > not compatible with @@SQL_MODE=ANSI_QUOTES :p
Yep, and I did that years ago - I preferred to use single quotes for strings in PHP (so variables stood out), and double quotes for SQL. Just for fun, `NO_BACKSLASH_ESCAPES` is a good way to mess with people (I believe the "real" escaping method can catch this, but I wouldn't trust it). When it comes to escaping in general, my favourite mistake is: $sql = 'WHERE id = ' . $mysqli->real_escape_string($_GET['id']); Where the assumed to be number hasn't been quoted at all :-) PDO::quote() does work a bit better (by adding the quotes itself), but I still prefer its documentation - noting how the quoted string is only "theoretically safe", and that "you are strongly recommended to use PDO::prepare() to prepare SQL statements with bound parameters". https://www.php.net/manual/en/pdo.quote.php Craig
  115488
July 19, 2021 11:50 guilliam.xavier@gmail.com (Guilliam Xavier)
On Fri, Jul 16, 2021 at 2:47 AM Craig Francis <craig@craigfrancis.co.uk>
wrote:

> Just another day, and another injection vulnerability (please patch): > > https://woocommerce.com/posts/critical-vulnerability-detected-july-2021/ > > If only escaping wasn't being used, so user values did not get included in > certain strings :-) > > diff -r > woocommerce.5.5.0/includes/data-stores/class-wc-webhook-data-store.php > woocommerce.5.5.1/includes/data-stores/class-wc-webhook-data-store.php > 280c280 > < $search = ! empty( $args['search'] ) ? "AND `name` LIKE '%" . > $wpdb->esc_like( sanitize_text_field( $args['search'] ) ) . "%'" : ''; > --- > > $search = ! empty( $args['search'] ) ? $wpdb->prepare( "AND > `name` LIKE %s", '%' . $wpdb->esc_like( sanitize_text_field( > $args['search'] ) ) . '%' ) : ''; >
On Sat, Jul 17, 2021 at 3:45 AM Craig Francis <craig@craigfrancis.co.uk> wrote:
> On Fri, 16 Jul 2021 at 21:24, Hans Henrik Bergan <divinity76@gmail.com> > wrote: > > > short of a bug in esc_like(), i don't even see the vulnerability issue in > > that code? > > > > > Sorry Hans, I copied the wrong diff. > > There were only 2 changes from woocommerce 5.5.0 to 5.5.1. > > Like you I was wondering what that diff was doing before posting - I'm > fairly sure it's just to be consistent with the other lines (which all use > $wpdb->prepare). >
I don't think so. Looking at https://developer.wordpress.org/reference/functions/sanitize_text_field/ and https://developer.wordpress.org/reference/classes/wpdb/esc_like/ you can see that they *don't* escape single quotes, so there was *indeed* an SQL injection vulnerability in that code. (Which is [one of the reasons] why I avoid WordPress [and especially third-party themes / plugins] as much as possible.) -- Guilliam Xavier
  115492
July 19, 2021 14:19 craig@craigfrancis.co.uk (Craig Francis)
On Mon, 19 Jul 2021 at 12:51, Guilliam Xavier xavier@gmail.com>
wrote:

> there was *indeed* an SQL injection vulnerability in that code.
Yep, you're right, there was an issue in there as well. esc_like() also needs to use esc_sql() for the value to be added directly to the SQL string. By changing to $wpdb->prepare(), assuming a literal string for the $query argument, then esc_sql() would have been used automatically (technically escape_by_ref). Shall we just put it down as another example of escaping going wrong, why Parameterised Queries work, and how Taint Checking isn't really a solution (as the value was escaped, ish). And if you want another fun one, not that anyone should be using inline JavaScript, but esc_js() doesen't escape single quotes: $variable = '\' onmouseover=alert(1) a=\''; $html = "Link"; https://developer.wordpress.com/themes/escaping/#javascript Craig
  115455
July 17, 2021 15:05 ocramius@gmail.com (Marco Pivetta)
Hey Craig,

On Mon, Jul 5, 2021 at 8:15 PM Craig Francis <craig@craigfrancis.co.uk>
wrote:

> Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. > > The vote closes 2021-07-19 > > The proposal is to add the function is_literal(), a simple way to identify > if a string was written by a developer, removing the risk of a variable > containing an Injection Vulnerability. > > This implementation is for literal strings ONLY (after discussion over > allowing integers) and, thanks to the amazing work of Joe Watkins, now > works fully with compiler optimisations, interned strings etc. > > Craig >
I ended up voting "no" on this. Even though I used to work on Doctrine ORM for years, and we had a private discussion around this, my belief is that this is not a runtime problem, but rather a type-level issue with tainted/untainted input/output. This has been brought up in the discussion phase by Tyson Andre, as far as I remember. There is good progress in taint analysis in Psalm, specifically: * https://psalm.dev/articles/detect-security-vulnerabilities-with-psalm * https://psalm.dev/docs/security_analysis/ There's also the issue, from my end, that I would like the ecosystem to pick up static analysis more, rather than people adding more runtime roadblocks to their code, leading to further complexity in error analysis. This is my personal push for putting more problems into build-time rather than runtime. You could call it a political choice to push a tool like psalm, instead of the language having a runtime construct to further "duck-type safety". In addition to that, a mechanism to un-taint values is missing, but it is not a concept symmetrical to `is_literal()`. Greets, Marco Pivetta http://twitter.com/Ocramius http://ocramius.github.com/
  115458
July 17, 2021 16:48 craig@craigfrancis.co.uk (Craig Francis)
On Sat, 17 Jul 2021 at 4:05 pm, Marco Pivetta <ocramius@gmail.com> wrote:

> my belief is that this is not a runtime problem, but rather a type-level > issue with tainted/untainted input/output. >
Thank you for the feedback Marco, As you appreciate, I don’t believe we can get every PHP developer to use Static Analysis. It’s an extra step that developers with less time, energy, or care, will not setup and use. Putting something in the base language, means that libraries can just use it, and people using the sites/systems of rushed or lazier developers will have these checks helping keep their data secure. Data breeches can have life-changing consequences for people, Injection Vulnerabilities are one of the biggest causes of them, and since we have the ability for libraries to warn all developers about these mistake, we should. At the moment our house can catch on fire and we don’t even have a smoke alarm. This is the smoke alarm. And there are reasons why it’s builders and landlords that have to install them, and we don’t rely on the tenants going and sorting them out themselves. Because if they don’t, for the best or the worse reasons, either way there are severe consequences to everybody. In regards to Taint Checking, it has a significant problem as it creates a false sense of security, hence these examples in the RFC: $sql = 'SELECT * FROM users WHERE id = ' . $db->real_escape_string($id); // INSECURE $html = ""; // INSECURE $html = "..."; // INSECURE Fortunately Psalm has just implemented the is_literal() concept, so those developers who do use Psalm can protect themselves from these issues: https://github.com/vimeo/psalm/releases/tag/4.8.0 In addition to that, a mechanism to un-taint values is missing,
>
That’s the main flaw with Taint Checking, because it’s not possible to mark something as safe without knowing about the context. As in, developers use an escaping function (to mark as untainted), think the value is now “safe”, and incorrectly use that value in a way that causes a security vulnerability. is_literal() simplifies this problem considerably, by just identifying developer defined strings, and instead using libraries to handle user values. Craig
  115459
July 18, 2021 02:41 jordan.ledoux@gmail.com (Jordan LeDoux)
Related to the general topic of injection attacks, I was considering
submitting a PR to change the default of PDO::ATTR_EMULUATE_PREPARES to
FALSE, since this mistakenly can lead people to believe that using prepared
statements with PDO and MySQL protects against injection attacks. In fact,
this is only the case if they create the PDO object with the option
specified as false. I'm not aware however to reasoning for enabling prepare
emulation by default for MySQL. I would assume it's a performance choice,
but how long ago was this choice made and is it worth revisiting? Would
this be something that requires its own RFC? It's a single line change.

Jordan

On Sat, Jul 17, 2021 at 9:48 AM Craig Francis <craig@craigfrancis.co.uk>
wrote:

> On Sat, 17 Jul 2021 at 4:05 pm, Marco Pivetta <ocramius@gmail.com> wrote: > > > my belief is that this is not a runtime problem, but rather a type-level > > issue with tainted/untainted input/output. > > > > > Thank you for the feedback Marco, > > As you appreciate, I don’t believe we can get every PHP developer to use > Static Analysis. It’s an extra step that developers with less time, energy, > or care, will not setup and use. > > Putting something in the base language, means that libraries can just use > it, and people using the sites/systems of rushed or lazier developers will > have these checks helping keep their data secure. Data breeches can have > life-changing consequences for people, Injection Vulnerabilities are one of > the biggest causes of them, and since we have the ability for libraries to > warn all developers about these mistake, we should. > > At the moment our house can catch on fire and we don’t even have a smoke > alarm. This is the smoke alarm. And there are reasons why it’s builders and > landlords that have to install them, and we don’t rely on the tenants going > and sorting them out themselves. Because if they don’t, for the best or the > worse reasons, either way there are severe consequences to everybody. > > In regards to Taint Checking, it has a significant problem as it creates a > false sense of security, hence these examples in the RFC: > > $sql = 'SELECT * FROM users WHERE id = ' . $db->real_escape_string($id); // > INSECURE > > $html = ""; // INSECURE > > $html = "..."; // INSECURE > > Fortunately Psalm has just implemented the is_literal() concept, so those > developers who do use Psalm can protect themselves from these issues: > > https://github.com/vimeo/psalm/releases/tag/4.8.0 > > > > In addition to that, a mechanism to un-taint values is missing, > > > > > That’s the main flaw with Taint Checking, because it’s not possible to mark > something as safe without knowing about the context. As in, developers use > an escaping function (to mark as untainted), think the value is now “safe”, > and incorrectly use that value in a way that causes a security > vulnerability. > > is_literal() simplifies this problem considerably, by just identifying > developer defined strings, and instead using libraries to handle user > values. > > Craig >
  115460
July 18, 2021 07:47 php.lists@allenjb.me.uk (AllenJB)
On 18/07/2021 03:41, Jordan LeDoux wrote:
> Related to the general topic of injection attacks, I was considering > submitting a PR to change the default of PDO::ATTR_EMULUATE_PREPARES to > FALSE, since this mistakenly can lead people to believe that using prepared > statements with PDO and MySQL protects against injection attacks. In fact, > this is only the case if they create the PDO object with the option > specified as false. I'm not aware however to reasoning for enabling prepare > emulation by default for MySQL. I would assume it's a performance choice, > but how long ago was this choice made and is it worth revisiting? Would > this be something that requires its own RFC? It's a single line change. > > Jordan
There's some BC-breaks to be aware of when switching emulated prepares. One example I know of is that when using emulated prepares you can reuse the same placeholder (as in the following example), but with emulated prepares disabled this does not work. $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, '%') OR column2 LIKE CONCAT('%', :searchValue, '%');"; $params = [     "searchValue" => "test", ]; With emulated prepares disabled you have to use: $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, '%') OR column2 LIKE CONCAT('%', :searchValue2, '%');"; $params = [     "searchValue" => "test",     "searchValue2" => "test", ];
  115461
July 18, 2021 08:51 jordan.ledoux@gmail.com (Jordan LeDoux)
That sounds like something that would require both a deprecation and an RFC
for the change then, even if the actual change in the source is small.

It still may be worth exploring, since this surely gives a large number of
people false confidence in protection against injection attacks, as nearly
every available tutorial on using PDO in PHP declares that simply using
prepared statements protects you from injection attacks.

On Sun, Jul 18, 2021 at 12:47 AM AllenJB lists@allenjb.me.uk> wrote:

> > On 18/07/2021 03:41, Jordan LeDoux wrote: > > Related to the general topic of injection attacks, I was considering > > submitting a PR to change the default of PDO::ATTR_EMULUATE_PREPARES to > > FALSE, since this mistakenly can lead people to believe that using > prepared > > statements with PDO and MySQL protects against injection attacks. In > fact, > > this is only the case if they create the PDO object with the option > > specified as false. I'm not aware however to reasoning for enabling > prepare > > emulation by default for MySQL. I would assume it's a performance choice, > > but how long ago was this choice made and is it worth revisiting? Would > > this be something that requires its own RFC? It's a single line change. > > > > Jordan > > There's some BC-breaks to be aware of when switching emulated prepares. > One example I know of is that when using emulated prepares you can reuse > the same placeholder (as in the following example), but with emulated > prepares disabled this does not work. > > $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, > '%') OR column2 LIKE CONCAT('%', :searchValue, '%');"; > $params = [ > "searchValue" => "test", > ]; > > With emulated prepares disabled you have to use: > > $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, > '%') OR column2 LIKE CONCAT('%', :searchValue2, '%');"; > $params = [ > "searchValue" => "test", > "searchValue2" => "test", > ]; > > >
  115466
July 18, 2021 22:37 benjamin.morel@gmail.com (Benjamin Morel)
> > There's some BC-breaks to be aware of when switching emulated prepares. > One example I know of is that when using emulated prepares you can reuse > the same placeholder (as in the following example), but with emulated > prepares disabled this does not work. > > $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, > '%') OR column2 LIKE CONCAT('%', :searchValue, '%');"; > $params = [ > "searchValue" => "test", > ]; >
This, and do note that from a performance point of view, disabling emulated prepares is not innocuous : most of the time, you effectively send twice as many queries to the database : one prepare, one execute. There is a *small* performance improvement in using prepared statements if you're executing the same statement many times; but in all other cases, you're spending twice as much time waiting for the database. Are there documented SQL injection opportunities when using emulated prepares? I'm not aware of any. — Benjamin
  115467
July 19, 2021 02:11 jordan.ledoux@gmail.com (Jordan LeDoux)
> Are there documented SQL injection opportunities when using emulated prepares? I'm not aware of any.
This was from my reading of the actual source, which of course may be flawed. It appeared that if emulated prepares were used the values were escaped and then passed as strings as part of the query, the same as if it had been concatenated and wrapped in real_escape_string. I hadn't gone too far in actually debugging it yet to find out how it behaved under different circumstances as I was still trying to figure out how "small" of a change this was from the perspective of internals. On Sun, Jul 18, 2021 at 3:38 PM Benjamin Morel morel@gmail.com> wrote:
> There's some BC-breaks to be aware of when switching emulated prepares. >> One example I know of is that when using emulated prepares you can reuse >> the same placeholder (as in the following example), but with emulated >> prepares disabled this does not work. >> >> $sql = "SELECT * FROM table WHERE column1 LIKE CONCAT('%', :searchValue, >> '%') OR column2 LIKE CONCAT('%', :searchValue, '%');"; >> $params = [ >> "searchValue" => "test", >> ]; >> > > This, and do note that from a performance point of view, disabling > emulated prepares is not innocuous : most of the time, you effectively send > twice as many queries to the database : one prepare, one execute. > There is a *small* performance improvement in using prepared statements > if you're executing the same statement many times; but in all other cases, > you're spending twice as much time waiting for the database. > > Are there documented SQL injection opportunities when using emulated > prepares? I'm not aware of any. > > — Benjamin >
  115470
July 19, 2021 04:04 pierre.php@gmail.com (Pierre Joye)
Good morning,

On Mon, Jul 19, 2021 at 9:11 AM Jordan LeDoux ledoux@gmail.com> wrote:
> > > Are there documented SQL injection opportunities when using emulated > prepares? I'm not aware of any. > > This was from my reading of the actual source, which of course may be > flawed. It appeared that if emulated prepares were used the values were > escaped and then passed as strings as part of the query, the same as if it > had been concatenated and wrapped in real_escape_string. I hadn't gone too > far in actually debugging it yet to find out how it behaved under different > circumstances as I was still trying to figure out how "small" of a change > this was from the perspective of internals.
I also don't think there is any left over possible SQL injection in any of the core DB extensions (PDO or 'native'). It will indeed do not prevent inserting invalid data but if there were any actual SQL injection left, I am very confident we would have known it by now. :) -- Pierre @pierrejoye | http://www.libgd.org
  115472
July 19, 2021 07:24 nikita.ppv@gmail.com (Nikita Popov)
On Sun, Jul 18, 2021 at 4:42 AM Jordan LeDoux ledoux@gmail.com>
wrote:

> Related to the general topic of injection attacks, I was considering > submitting a PR to change the default of PDO::ATTR_EMULUATE_PREPARES to > FALSE, since this mistakenly can lead people to believe that using prepared > statements with PDO and MySQL protects against injection attacks. In fact, > this is only the case if they create the PDO object with the option > specified as false. I'm not aware however to reasoning for enabling prepare > emulation by default for MySQL. I would assume it's a performance choice, > but how long ago was this choice made and is it worth revisiting? Would > this be something that requires its own RFC? It's a single line change. > > Jordan >
Please do refrain from spreading this FUD. While there are certain tradeoffs between choosing emulated or native prepared statements, security considerations do not factor into it. There's a very narrow window where emulated prepared statements can lead to incorrect escaping (it involves picking an exotic non-ASCII-compatible charset, not specifying it in the connection DSN, and switching to it at runtime), but it's not something you can hit by accident. Regards, Nikita On Sat, Jul 17, 2021 at 9:48 AM Craig Francis <craig@craigfrancis.co.uk>
> wrote: > > > On Sat, 17 Jul 2021 at 4:05 pm, Marco Pivetta <ocramius@gmail.com> > wrote: > > > > > my belief is that this is not a runtime problem, but rather a > type-level > > > issue with tainted/untainted input/output. > > > > > > > > > Thank you for the feedback Marco, > > > > As you appreciate, I don’t believe we can get every PHP developer to use > > Static Analysis. It’s an extra step that developers with less time, > energy, > > or care, will not setup and use. > > > > Putting something in the base language, means that libraries can just use > > it, and people using the sites/systems of rushed or lazier developers > will > > have these checks helping keep their data secure. Data breeches can have > > life-changing consequences for people, Injection Vulnerabilities are one > of > > the biggest causes of them, and since we have the ability for libraries > to > > warn all developers about these mistake, we should. > > > > At the moment our house can catch on fire and we don’t even have a smoke > > alarm. This is the smoke alarm. And there are reasons why it’s builders > and > > landlords that have to install them, and we don’t rely on the tenants > going > > and sorting them out themselves. Because if they don’t, for the best or > the > > worse reasons, either way there are severe consequences to everybody. > > > > In regards to Taint Checking, it has a significant problem as it creates > a > > false sense of security, hence these examples in the RFC: > > > > $sql = 'SELECT * FROM users WHERE id = ' . $db->real_escape_string($id); > // > > INSECURE > > > > $html = ""; // INSECURE > > > > $html = "..."; // INSECURE > > > > Fortunately Psalm has just implemented the is_literal() concept, so those > > developers who do use Psalm can protect themselves from these issues: > > > > https://github.com/vimeo/psalm/releases/tag/4.8.0 > > > > > > > > In addition to that, a mechanism to un-taint values is missing, > > > > > > > > > That’s the main flaw with Taint Checking, because it’s not possible to > mark > > something as safe without knowing about the context. As in, developers > use > > an escaping function (to mark as untainted), think the value is now > “safe”, > > and incorrectly use that value in a way that causes a security > > vulnerability. > > > > is_literal() simplifies this problem considerably, by just identifying > > developer defined strings, and instead using libraries to handle user > > values. > > > > Craig > > >
  115480
July 19, 2021 09:07 jordan.ledoux@gmail.com (Jordan LeDoux)
Thanks Nikita, that's good to know. I'm still familiarizing myself with the
source right now, so I apologize if this is something that commonly gets
spread as false information, I honestly was exploring the workings of
injection protection in the source after following this discussion, but
that's why I asked. My intention was not to spread FUD, I just thought I'd
found something that slipped through the cracks.

Cheers,
Jordan

On Mon, Jul 19, 2021 at 12:24 AM Nikita Popov ppv@gmail.com> wrote:

> On Sun, Jul 18, 2021 at 4:42 AM Jordan LeDoux ledoux@gmail.com> > wrote: > >> Related to the general topic of injection attacks, I was considering >> submitting a PR to change the default of PDO::ATTR_EMULUATE_PREPARES to >> FALSE, since this mistakenly can lead people to believe that using >> prepared >> statements with PDO and MySQL protects against injection attacks. In fact, >> this is only the case if they create the PDO object with the option >> specified as false. I'm not aware however to reasoning for enabling >> prepare >> emulation by default for MySQL. I would assume it's a performance choice, >> but how long ago was this choice made and is it worth revisiting? Would >> this be something that requires its own RFC? It's a single line change. >> >> Jordan >> > > Please do refrain from spreading this FUD. While there are certain > tradeoffs between choosing emulated or native prepared statements, security > considerations do not factor into it. There's a very narrow window where > emulated prepared statements can lead to incorrect escaping (it involves > picking an exotic non-ASCII-compatible charset, not specifying it in the > connection DSN, and switching to it at runtime), but it's not something you > can hit by accident. > > Regards, > Nikita > > On Sat, Jul 17, 2021 at 9:48 AM Craig Francis <craig@craigfrancis.co.uk> >> wrote: >> >> > On Sat, 17 Jul 2021 at 4:05 pm, Marco Pivetta <ocramius@gmail.com> >> wrote: >> > >> > > my belief is that this is not a runtime problem, but rather a >> type-level >> > > issue with tainted/untainted input/output. >> > > >> > >> > >> > Thank you for the feedback Marco, >> > >> > As you appreciate, I don’t believe we can get every PHP developer to use >> > Static Analysis. It’s an extra step that developers with less time, >> energy, >> > or care, will not setup and use. >> > >> > Putting something in the base language, means that libraries can just >> use >> > it, and people using the sites/systems of rushed or lazier developers >> will >> > have these checks helping keep their data secure. Data breeches can have >> > life-changing consequences for people, Injection Vulnerabilities are >> one of >> > the biggest causes of them, and since we have the ability for libraries >> to >> > warn all developers about these mistake, we should. >> > >> > At the moment our house can catch on fire and we don’t even have a smoke >> > alarm. This is the smoke alarm. And there are reasons why it’s builders >> and >> > landlords that have to install them, and we don’t rely on the tenants >> going >> > and sorting them out themselves. Because if they don’t, for the best or >> the >> > worse reasons, either way there are severe consequences to everybody. >> > >> > In regards to Taint Checking, it has a significant problem as it >> creates a >> > false sense of security, hence these examples in the RFC: >> > >> > $sql = 'SELECT * FROM users WHERE id = ' . >> $db->real_escape_string($id); // >> > INSECURE >> > >> > $html = ""; // INSECURE >> > >> > $html = "..."; // INSECURE >> > >> > Fortunately Psalm has just implemented the is_literal() concept, so >> those >> > developers who do use Psalm can protect themselves from these issues: >> > >> > https://github.com/vimeo/psalm/releases/tag/4.8.0 >> > >> > >> > >> > In addition to that, a mechanism to un-taint values is missing, >> > > >> > >> > >> > That’s the main flaw with Taint Checking, because it’s not possible to >> mark >> > something as safe without knowing about the context. As in, developers >> use >> > an escaping function (to mark as untainted), think the value is now >> “safe”, >> > and incorrectly use that value in a way that causes a security >> > vulnerability. >> > >> > is_literal() simplifies this problem considerably, by just identifying >> > developer defined strings, and instead using libraries to handle user >> > values. >> > >> > Craig >> > >> >
  115503
July 19, 2021 18:59 craig@craigfrancis.co.uk (Craig Francis)
On Mon, 5 Jul 2021 at 19:14, Craig Francis <craig@craigfrancis.co.uk> wrote:

> Hi Internals, > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. >
This RFC has been rejected; with 10 votes in favour, and 23 against. I'd like to thank everyone who has been involved in this project, in particular Joe Watkins for creating the implementation and helping me though this process, Máté Kocsis for the independent performance tests, Rowan Francis for helping me with the RFC text and emails, Derick Rethans for covering this on the PHP Internals News podcast, and for everyone who has discussed this implementation and provided their input (especially Larry Garfield, Scott Arciszewski, and Rowan Tommins). And thank you to Matthew Brown for adding the 'literal-string' type to Psalm: https://github.com/vimeo/psalm/releases/tag/4.8.0 As an aside, only 4 of 23 'no' voters provided any comment as to why they voted that way on the mailing list, which I feel undermines the point of the Request For Comment process, with an additional 5 responding personally off-list after prompting. This makes it harder (or impossible) for points to be discussed and addressed. Craig
  115505
July 20, 2021 00:23 tysonandre775@hotmail.com (tyson andre)
Hi Craig Francis,

> As an aside, only 4 of 23 'no' voters provided any comment as to why they > voted that way on the mailing list, which I feel undermines the point of > the Request For Comment process, with an additional 5 responding personally > off-list after prompting. This makes it harder (or impossible) for points > to be discussed and addressed.
1. My earlier comments about static analysis, and on behavior depending on whether opcache is enabled 2. This might prevent certain optimizations in the future. For example, currently, 1-byte strings are all interned to save memory. If is_literal was part of php prior to proposing that optimization, then that optimization may be rejected. 3. PHP's `string` type is used both for (hopefully) valid unicode strings and for low level operations on literal byte arrays (e.g. cryptogrophy). It seems really, really strange for a type system to track trustedness for a low level primitive to track byte arrays. (the php 6 unicode string refactoring failed) Further, if this were to be extended in the future beyond the original proposal (e.g. literal strings that are entirely digits are automatically interned or marked as trusted), this might open previously safe code acting on byte arrays to side channel issues such as timing attacks (https://en.wikipedia.org/wiki/Timing_attack) 4. Internal functions and userland polyfills for those functions may unintentionally differ significantly for the resulting taintedness, e.g. base64_decode in userland being built up byte by byte would end up being possibly untainted? 5. The fact that 1-byte strings are almost always interned seems like a noticeable inconsistency (though library authors can deal with it once they're aware of it), though for it to become an issue a library may need to take multiple strings as input (e.g. a contrived example`"echo -- " . $trustedPrefix . shell_escape($notTrusted)` for $trustedPrefix of "'" (or "\n") and $notTrusted of "; evaluate command" 6. Including it in core would make it harder to remove later if it interfered with opcache or jit work, or to migrate code to alternative interpreters for php if those were ever implemented (if frameworks were to extensively depend on is_literal) 7. Tracking whether a string is untrusted is a definition only suitable for a few (extremely common) formats for php. But for less common features, even stringified integers may be a problem (e.g. binary file formats, etc) This is relatively minor given that php is typically used in a web programming context with json or html or js/css output I'd think is_interned()/intern_string() is much closer to tracking something that corresponds with php's internals (e.g. and may allow saving memory in long-running processes which receive duplicate strings as input), though the 10 people who wanted fully featured trustedness checking would probably want is_literal instead 8. Serializing and unserializing data would lose information about trustedness of inputs, unpredictably (e.g. unserialize() in php 8.0 interns array keys). There's no (efficient) way to change trusted strings to untrusted or vice versa, though there are inefficient workarounds (modifying a byte and restoring it to stop trusting it, imploding single characters to create a trusted string) This may done implicitly in frameworks using APCu/memcached/redis as a cache (I definitely don't think the serialization data format should track is_literal()) 9. Future refactorings, optimizations or deoptimizations (or security fixes) to unserialize(), etc. may unexpectedly break code using is_literal that throw instead of warn (more bug reports, discourage users from upgrading, etc.) 10. This RFC adds an unknown amount of future work for php-src and PECLs to *intuitively* support mapping trusted inputs to trusted outputs or vice versa - less commonly used or unmaintained functions may not behave as expected for a while 11. https://pecl.php.net/package/taint is available already for a use case with some overlap for setups that need this Aside: I'd have to wonder if ZSTR_IS_INTERNED (and the function to make an interned string) would make sense to expose in a PECL as a regular `extension` (not a `zend_extension`) if is_interned also fails. Unlike the zend_extension for https://www.php.net/manual/en/function.is-tainted.php , something simple may be possible without needing the performance hit and future conflicts with XDebug that I assume https://www.php.net/manual/en/function.is-tainted.php would be prone to. (https://pecl.php.net/package/taint seems to use a separate bit to track this. The latest release of the Taint pecl fixes XDebug compatibility) - Other languages, such as Java, have exposed this for memory management purposes (rather than security) though it's rarely used directly or in frameworks, e.g. https://docs.oracle.com/javase/10/docs/api/java/lang/String.html#intern() (https://docs.python.org/3/library/sys.html#sys.intern) Thanks, Tyson
  115542
July 21, 2021 15:10 craig@craigfrancis.co.uk (Craig Francis)
> On 20 Jul 2021, at 01:23, tyson andre <tysonandre775@hotmail.com> wrote: > >> As an aside, only 4 of 23 'no' voters provided any comment as to why they >> voted that way on the mailing list, which I feel undermines the point of >> the Request For Comment process, with an additional 5 responding personally >> off-list after prompting. This makes it harder (or impossible) for points >> to be discussed and addressed. > > 1. My earlier comments about static analysis, and on behavior depending on whether opcache is enabled
Thanks Tyson, Just to check I've not missed anything, are you referring to your comments from March 2020? https://news-web.php.net/php.internals/109192 <https://news-web.php.net/php.internals/109192> I mentioned your email in the RFC, as I agree that developers should use Static Analysis, but I think it's highly unlikely we will be able to get most PHP developers to use it: https://wiki.php.net/rfc/is_literal#static_analysis <https://wiki.php.net/rfc/is_literal#static_analysis> As to the OPcache, Joe's implementation already works with its optimisations, so the results are consistent. https://wiki.php.net/rfc/is_literal#compiler_optimisations <https://wiki.php.net/rfc/is_literal#compiler_optimisations>
> 2. This might prevent certain optimizations in the future. For example, currently, 1-byte strings are all interned to save memory. > If is_literal was part of php prior to proposing that optimization, then that optimization may be rejected.
Like the OPcache optimisations, the 1-byte strings and other existing optimisations work with Joe's implementation. I cannot comment or predict any future optimisation work, but this argument could be made for any change to PHP.
> 3. PHP's `string` type is used both for (hopefully) valid unicode strings and for low level operations on literal byte arrays (e.g. cryptogrophy). > It seems really, really strange for a type system to track trustedness for a low level primitive to track byte arrays. (the php 6 unicode string refactoring failed) > > Further, if this were to be extended in the future beyond the original proposal (e.g. literal strings that are entirely digits are automatically interned or marked as trusted), > this might open previously safe code acting on byte arrays to side channel issues such as timing attacks (https://en.wikipedia.org/wiki/Timing_attack)
We're just adding a flag to say if the string was written by the developer, which is key for identifying Injection Vulnerabilities (it's similar to the Interned flag). With cryptography and timing attacks, the only part affected would be during string concatenation, which has already got several optimisations that introduce variability in timings (e.g. concatenating variables containing 'a' with 'b' is about twice as slow than 'a' with ''). We did have a discussion about integer support, but the way integers/floats/booleans are currently stored in memory would require a big change to note if those values were defined in the developers PHP source code. We also explored the idea of allowing all integers, and while they cannot lead to an Injection Vulnerability, the consensus was that a developer defined string is easier to understand: https://wiki.php.net/rfc/is_literal#integer_values <https://wiki.php.net/rfc/is_literal#integer_values>
> 4. Internal functions and userland polyfills for those functions may unintentionally differ significantly for the resulting taintedness, > e.g. base64_decode in userland being built up byte by byte would end up being possibly untainted?
All of these functions would return a non-literal string, as the values were not written by the developer (as in, directly defined in their source code).
> 5. The fact that 1-byte strings are almost always interned seems like a noticeable inconsistency (though library authors can deal with it once they're aware of it), though for it to become an issue a library may need to take multiple strings as input > (e.g. a contrived example`"echo -- " . $trustedPrefix . shell_escape($notTrusted)` for $trustedPrefix of "'" (or "\n") and $notTrusted of "; evaluate command"
1-byte strings have already been addressed in the implementation. Libraries need to take user values separately from developer defined strings, because developers frequently make mistakes: https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/mistakes.php?ts=4 <https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/mistakes.php?ts=4> Especially when it comes to escaping values: https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/escaping.php?ts=4 <https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification/escaping.php?ts=4> In your command example, the library could copy how Parameterised Queries work in SQL. https://github.com/craigfrancis/php-is-literal-rfc/blob/main/examples/cli-basic.php <https://github.com/craigfrancis/php-is-literal-rfc/blob/main/examples/cli-basic.php>
> 6. Including it in core would make it harder to remove later if it interfered with opcache or jit work, or to migrate code to alternative interpreters for php if those were ever implemented (if frameworks were to extensively depend on is_literal)
This is why Joe was very careful with the implementation. Overall the approach is simple, where it's just providing libraries with the key bit of information to identify their primary source of Injection Vulnerabilities (when a developer has incorrectly included user data).
> 7. Tracking whether a string is untrusted is a definition only suitable for a few (extremely common) formats for php. But for less common features, even stringified integers may be a problem (e.g. binary file formats, etc) > > This is relatively minor given that php is typically used in a web programming context with json or html or js/css output > > I'd think is_interned()/intern_string() is much closer to tracking something that corresponds with php's internals (e.g. and may allow saving memory in long-running processes which receive duplicate strings as input), though the 10 people who wanted fully featured trustedness checking would probably want is_literal instead
The implementation is pretty close to matching the Interned flag, but isn't exactly the same, as the Interned flag can be unpredictable for normal PHP developers (e.g. the chr() function), and it could be changed in the future (e.g. Joe noted how there was a suggestion to intern keys while JSON decoding or unserializing).
> 8. Serializing and unserializing data would lose information about trustedness of inputs, unpredictably (e.g. unserialize() in php 8.0 interns array keys). > > There's no (efficient) way to change trusted strings to untrusted or vice versa, though there are inefficient workarounds (modifying a byte and restoring it to stop trusting it, imploding single characters to create a trusted string) > > This may done implicitly in frameworks using APCu/memcached/redis as a cache > > (I definitely don't think the serialization data format should track is_literal())
Agreed, serializing and unserializing (or any other modifications) does not set the literal flag. As to providing an easy way to mark a non-literal string to a literal - in short, we do not want to recreate the biggest flaw of Taint Checking, where naive developers would see this as a way to mark escaped strings as "safe", giving them a false sense of security that these strings can now be trusted.
> 9. Future refactorings, optimizations or deoptimizations (or security fixes) to unserialize(), etc. may unexpectedly break code using is_literal that throw instead of warn (more bug reports, discourage users from upgrading, etc.)
The implementation already includes several tests, and output from functions like unserialize() do not set the literal flag.
> 10. This RFC adds an unknown amount of future work for php-src and PECLs to *intuitively* support mapping trusted inputs to trusted outputs or vice versa - less commonly used or unmaintained functions may not behave as expected for a while
This shouldn't be an issue, as the flag is only being used to identify literal strings. If new strings are being created or modified, they are no longer considered literal strings (the flag is off by default).
> 11. https://pecl.php.net/package/taint is available already for a use case with some overlap for setups that need this
Libraries need something that's available to everyone (we all make mistakes; and the developers most likely to introduce these vulnerabilities are unlikely to install an extension, or use Static Analysis). And with the Taint extension in particular, it has really helped me experiment with this concept, but that approach does give a false sense of security, which is why this RFC did not re-create it: https://wiki.php.net/rfc/is_literal#taint_checking <https://wiki.php.net/rfc/is_literal#taint_checking>
> Aside: I'd have to wonder if ZSTR_IS_INTERNED (and the function to make an interned string) would make sense to expose in a PECL as a regular `extension` (not a `zend_extension`) if is_interned also fails. > Unlike the zend_extension for https://www.php.net/manual/en/function.is-tainted.php , > something simple may be possible without needing the performance hit and > future conflicts with XDebug that I assume https://www.php.net/manual/en/function.is-tainted.php would be prone to. > (https://pecl.php.net/package/taint seems to use a separate bit to track this. The latest release of the Taint pecl fixes XDebug compatibility)
As noted above, we couldn't expose the Interned flag directly, as it's a little unpredictable. That's why we added a new flag - one that at first glance might look the same, but handles the edge cases. Thanks again, Craig
  115985
September 7, 2021 09:49 craig@craigfrancis.co.uk (Craig Francis)
On Mon, 19 Jul 2021 at 19:59, Craig Francis <craig@craigfrancis.co.uk>
wrote:

> On Mon, 5 Jul 2021 at 19:14, Craig Francis <craig@craigfrancis.co.uk> > wrote: > >> I have opened voting on https://wiki.php.net/rfc/is_literal for the >> is-literal function. >> > > This RFC has been rejected; with 10 votes in favour, and 23 against. > [...] > And thank you to Matthew Brown for adding the 'literal-string' type to > Psalm: > https://github.com/vimeo/psalm/releases/tag/4.8.0 >
FYI: The 'literal-string' type has now been added to PHPStan, thanks to Ondřej Mirtes: https://github.com/phpstan/phpstan/releases/tag/0.12.97 Obviously I'd still like libraries to be able to protect everyone from introducing Injection Vulnerabilities (as the majority of programmers don't use static analysis), but that's for another day. Craig
  116000
September 8, 2021 06:33 claude.pache@gmail.com (Claude Pache)
> Le 7 sept. 2021 à 11:49, Craig Francis <craig@craigfrancis.co.uk> a écrit : > > > Obviously I'd still like libraries to be able to protect everyone from > introducing Injection Vulnerabilities (as the majority of programmers don't > use static analysis), but that's for another day. >
Hi, We all want to protect from injection vulnerability, but I think there are better way than is_literal. One way is to use templates, an area where PHP is ironically lagging behind. I suggest looking at JS tagged templates: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals> For example: $qb->select('u') ->from('User', 'u') ->where('u.id = ' . $_GET['id']); // INSECURE could be written as exec ` SELECT u FROM User u WHERE u.id = %{ $_GET['id'] } ` ?> where the part between %{ ... } is transformed into an SQL literal string (with delimiters "...", not just “escaping”) when it is a string; into the SQL expression NULL when it is null; into an SQL subexpression if it is an object (provided by the library) that represents a well-formed SQL subexpression, etc. —Claude
  116003
September 8, 2021 08:33 php-lists@koalephant.com (Stephen Reay)
> On 8 Sep 2021, at 13:33, Claude Pache pache@gmail.com> wrote: > > > >> Le 7 sept. 2021 à 11:49, Craig Francis <craig@craigfrancis.co.uk> a écrit : >> >> >> Obviously I'd still like libraries to be able to protect everyone from >> introducing Injection Vulnerabilities (as the majority of programmers don't >> use static analysis), but that's for another day. >> > > > Hi, > > We all want to protect from injection vulnerability, but I think there are better way than is_literal. > > One way is to use templates, an area where PHP is ironically lagging behind. I suggest looking at JS tagged templates: > > https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals> > > For example: > > > $qb->select('u') > ->from('User', 'u') > ->where('u.id = ' . $_GET['id']); // INSECURE > could be written as > > > $qb->exec ` > SELECT u > FROM User u > WHERE u.id = %{ $_GET['id'] } > ` > ?> > > where the part between %{ ... } is transformed into an SQL literal string (with delimiters "...", not just “escaping”) when it is a string; into the SQL expression NULL when it is null; into an SQL subexpression if it is an object (provided by the library) that represents a well-formed SQL subexpression, etc. > > —Claude >
Resending from on-list address because I’m an idiot. Apologies for the dupe Claude/Craig. Hi Claude, I had my share of issues with Craig’s PR, but I think the original goal of it was a good and useful concept - provide developers (mostly lib authors, but its not like it couldn’t be used by end developers too) a way to _know_ that a string came from something hard coded in a php file. A ‘tagged template’ like that doesn’t help solve the problem in any way that parameterised queries can’t already do, and if you want to make it more ’templated’ like that, you could implement the same thing already by passing a printf-compatible template and the arguments to a function/method. None of that helps solve what the `is_literal` function (or potential type hint) would help with: when the part of the query that needs to be substituted, is something that cannot be parameterised at the SQL level (i.e. a column name) you _really_ don’t want that to accept user input of any kind. Cheers Stephen
  116042
September 14, 2021 14:50 craig@craigfrancis.co.uk (Craig Francis)
On Wed, 8 Sept 2021 at 07:33, Claude Pache pache@gmail.com> wrote:

> We all want to protect from injection vulnerability, but I think there are > better way than is_literal. > > One way is to use templates, an area where PHP is ironically lagging > behind. I suggest looking at JS tagged templates: >
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates Hi Claude, Posting on-list, as I've not had a reply (was confirming I've not missed anything). I have looked at JavaScript Tagged Templates before, and while they could be made to work (ish), I don't believe they are better than the `is_literal()` proposal to protect against Injection Vulnerabilities: 1) It would require developers and libraries to re-write all of their existing code to use Tagged Templates. 2) If we copied JavaScript, the methods/functions can still be called incorrectly: function template(html, ...values) {
> console.log(html, values); > } > template`

Hi ${name}

`; > template([`

Hi ${name}

`]); // Wrong > template(['

Hi ', name, '

']); // Wrong

PHP could provide a way for Libraries to check the developer has used a Tagged Template, but that's basically what the `is_literal()` proposal does. With JavaScript, this is why `isTemplateObject()` is being developed, and Trusted Types might get `fromLiteral()`. 3) Libraries would not be able to use Tagged Templates and easily support older versions of PHP. 4) The backtick character is already used for `shell_exec()` like functionality. Craig