Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip processing handler if this is not a valid shortcode #26

Closed
wants to merge 1 commit into from
Closed

Skip processing handler if this is not a valid shortcode #26

wants to merge 1 commit into from

Conversation

rhukster
Copy link

I was getting some very strange content corruption when I enabled my shortcode plugin that uses your fantastic little shortcode library. This was caused by my having some square bracketed content that did not match up with any of my defined shortcode handlers:

##### html([title][, alt][, classes])

!! In Markdown this method is implicitly called when using the `![]` syntax.

The `html` action will output a valid HTML tag for the media based on the current display mode.

This [title], [, alt] and [, classes] text is used in my documentation to display optional method params. However, it was corrupting that title to something like:

<h5>html([t[title] alt][, classes])</h5>

I debugged this in the Shortcode library and discovered that you were first collecting all the potential shortcodes in the entire page (quite a lot for me), and then iterating over them. However, when you retrieved the handler for every shortcode, it would continue to try to process the shortcode even if the handler was null. This ultimately leads to all kinds of corruption of the end content.

Simply checking to see if this handler is null, and then continuing without doing any more processing ensures there is no corruption, and also speeds up the page parsing considerably.

FYI - I created a core shortcode plugin for Grav CMS, and a subsequent shortcode plugin that adds some UI specific shortcodes in case you are interested.

@rhukster
Copy link
Author

Weird it didn't pass travis checks, looks valid to me :)

Anyway, something else is going on. Even with valid shortcodes that work in other pages, it's not working in this one page. so will investigate further.

@rhukster
Copy link
Author

Ok, have narrowed down the corruption problem with valid shortcodes to this in my content before any of the shortcodes:

`![](image.jpg)`

I'm guessing the [] in the markdown URL is causing the issue. Not had chance to dig in deeper to see where that is tripping things up.

@thunderer
Copy link
Owner

@rhukster Hi, thanks for reporting the issue! I'd really like to help you, but first let me clarify some points you made here:

  • first of all, shortcode name can be only a non-empty string with alphanumeric characters and dash -, so there will be no matched shortcodes in ![](image.jpg) and only one in html([title][, alt][, classes]) (the [title]), the rest of the bracketed parts (both [, alt] and [, classes]) you mentioned have comma as the name so it won't be even matched,
  • secondly, the PR you submitted skips processing shortcodes if no handler is present right in the heart of Processor::processIteration() method, but if you look closely there is a method Processor::processHandler() called just under that line which does exactly that (look at the ternary operator at the end). It does not "skip" processing the shortcode but if there is no handler it just returns its raw string match. So if there is no handler for [title], the result of processing it will be [title]. There is one exception though - if you set up the default handler using HandlerContainer::setDefault() then it will be processed with to the callable you passed there,
  • to add more to the previous point, this is because shortcode without handler may contain other shortcodes inside its content that do have handlers and such case needs to be processed correctly (you can of course configure Processor to disable recursion, but the point remains valid),
  • as for performance penalty for parsing all shortcodes it is recommended to cache the processing result and reprocess only when saving edited version of the article,
  • while writing this answer I found a bug in RegularParser that allowed arbitrary shortcode names (no validation for the rules mentioned above), it was fixed in master in a22cd63 . RegexParser is not affected, though I really recommend using RegularParser.

In order to fully debug this, I'd need the text that is causing problems for you and which parser you used (I saw RegexParser in the plugin code, is that it?). I'd check first for what is returned from the parser and if that is correct I'd then check if the replacements are applied correctly. Could you please prepare a gist and tell me what did you get and what did you expect? Thanks in advance.

@thunderer thunderer self-assigned this Jan 20, 2016
@rhukster
Copy link
Author

I didn't take into account all the various nested situations that may arise, so my fix was probably a bit hasty. Also I switched to the RegularParser and all my corruption problems went away.

I definitely feel like my markdown image code sample: ![](image.jpg) was triggering an opening empty shortcode. Surely an empty bracket pair [] is not a valid shortcode? maybe the RegexParser is treating it as such? Anyway i'll close this PR as it seems it is not valid. Cheers!

@rhukster rhukster closed this Jan 20, 2016
@thunderer
Copy link
Owner

@rhukster I'm glad that everything is correct now. I've tagged release 0.5.2 with the fix mentioned above so you can go back to using thunderer/shortcode in composer.json.

I'd really want to take a look at the problem you had with RegexParser though. I added invalid test cases extracted from the data you provided and since tests passed I'm pretty sure that none of the parsers included in this library recognize shortcodes in mentioned invalid sequences ([] and [, alt]). Also from the whole ParserTest class you see that I made sure that no invalid string can get inside. :) Could you please send me a gist with the problematic text and sample Shortcode implementation?

All in all I'm very happy that you use Shortcode in GravCMS! If you have any ideas how to improve the library or any other problems with using it, let me know! I'm now working in my spare time on the better documentation, there is also events subsystem and bundled handlers waiting in line, you can see the code in the other branches in this repository.

@rhukster
Copy link
Author

I actually have a test setup for you. I think you will find it interesting.. give me a few to package it up.

@rhukster
Copy link
Author

Ok, so please clone https://github.com/rhukster/shortcode-test into your webroot somewhere and composer install.

I have created a simple test to replicate the issues as closely as possible.

Basically I am loading a markdown file, using Parsedown to process it and get the resulting HTML. Then i'm running it through Shortcode with a couple of simple shortcode handlers.

  1. As is, it's using the RegularParser and a small version of the file. All looks good, although it's very slow taking 224ms even on this small chunk of content.

    2016-01-20 11-46-44

  2. If you change this to the RegexParser, the corruption occurs, although things are much faster, taking only 3.2ms:

    2016-01-20 11-48-02

  3. Now switching the content file to examle-full.md on line 16, and rerunning, it is of course still corrupted, but processing is still only 38ms, slower than I would prefer but acceptable as Grav does of course cache the contents of these files.

    2016-01-20 11-50-09

  4. Now changing back to the RegularParser, the page fails to load because it takes > 30 seconds to process. If I turn off the time limit, it was still spinning after 15 minutes!

Grav is a flat-file CMS where speed is paramount. A typical Grav site renders every page within 20ms or so after cache, and certainly < 100ms without cache. So we really need the speed of the RegexParser without the corruption issues :)

Hopefully this test repo will give you a good largish-real world example to do some performance optimizations.

Cheers!

@rhukster
Copy link
Author

Oh one last thing.. I noticed this in PHP 5.6 and 5.7, where as PHP 5.5 doesn't seem to cause the problems with the RegexParser at least in my test setup.

@rhukster
Copy link
Author

Arrggg.. sorry for the rambling but i think i found the culprit. I had this text:

## Where to put your media files

Usually you’ll use a media file within a page...

Looks harmless but I noticed that in PHP 5.5 the apostrophe showed as a special character in my test setup. But looked like a regular quote in PHP 5.6 and 7.0. So this was pasted in as an extended UTF-8 character.

This appears to be throwing the regex out of whack (I tried WordpressParser and RegexParser with same results).

Changing this to a regular ' stops the corruption in all versions of PHP.

@rhukster
Copy link
Author

After adding an mb_internal_encoding("UTF-8"); in the test index.php i can get the same results under PHP5.5 -> 7.0 as PHP 5.6 encoding was set to UTF-8 by default.

@thunderer
Copy link
Owner

@rhukster Thanks for the test repository, it's great! I already made some changes and shaved off several milliseconds. :) I'm happy that you resolved the issue with your Markdown code. Internal encoding is a great clue because I use mb_*() functions a lot, possibly forcing the encoding to UTF-8 inside parsers and processor will solve such problems. This could also solve the issue #25 where @michaloo had to change multibyte functions to regular ones because he suspected that it may be corrupting his output. I'll also add some tests with various UTF-8 sequences to be sure that everything is okay.

BTW Could you tell me what your default mb_internal_encoding() result is? Something different than UTF-8? What is the environment you're working in (OS, HTTP server, PHP version, FPM or mod_php)?

@rhukster
Copy link
Author

I know i didn't make it very clear, but let me recount the situation:

PHP 5.5 - Default internal encoding is ISO-8859-1, the UTF-8 character displayed strangely as a garbled font character, but the shortcodes rendered OK.

PHP 5.6+ - Default internal encoding is already UTF-8, the UTF-8 character displays correctly , but the shorcodes are corrupted.

I forced the mb_internal_encoding to UTF-8 in the test so PHP 5.5 would act the same as 5.6+

This is also set this way in Grav for consistency.

Regarding platform. I'm running on Mac OS X 10.11.3. PHP 5.5, 5.6 and 7.0. I am able to run Apache, Nginx, Caddy server in parallel (different ports), with mod_php, and php-fpm, all exhibit the same behavior now irrespective of the PHP version or web server.

@rhukster
Copy link
Author

BTW, in summary there are two issues currently outstanding:

  1. Regex and WordPress parsers are both effected by UTF-8 characters causing corruption in page content when used in a PHP environment with UTF-8 encoding.
  2. RegularParser can easily crash PHP when used on real-world content. Your testing only provides tiny snippets of shortcodes and does not show how crippling slow and processor-intensive this parser is. It should definitely NOT be used for real-world situations until sorted. Regex and WordPress parsers are the only ones that you can safely use beyond a few lines of text.

Should I open up issues for each of these?

@thunderer
Copy link
Owner

@rhukster thanks for the clarification. I think I understand what is going on.

I spent some time tinkering on the "small" example from your repo and Blackfire reported 54% improvement in RegularParser, you can see the changes on the new branch. I could make a PR right now, but I want to analyze it a bit more as I have a couple of ideas to try.

Unfortunately "full" example is still unable to complete, but I know where the problem is: there are many unclosed shortcodes like [media-name]. RegularParser was built first and foremost for parsing correctness (that is impossible with any regex-based parser) so for each opening tag it searches till the end of the text for the closing tag, matching every nested shortcode on the way. If it reaches the end without finding closing tag, it discards everything, marks that shortcode as "just closed" and starts over from the next opening tag. If those shortcodes would be "self-closed" like [media-name /] then they would be consumed by parser in no time. I could "dodge" that problem by automatically merging all found shortcodes on the second level without reparsing the rest of the text, but then it would create exact same problem for the third nesting level and so on. Just like in programming language there is no way of telling that code will have a syntax error without parsing it up to the place with that error. If I'll find any better approach, you'll be first to know.

I didn't have enough time to check the corruption problem in RegexParser yet, but I'll work on it as soon as possible. Opening separate issues for these problems is a good idea, especially given the excellent input you've provided here. Discussing isolated issues is much better than commenting on everything on closed PR. :)

@rhukster
Copy link
Author

ok will do.. btw i have some ideas that would improve flexibility too.. i'll open up an issue to discuss that too.

@rhukster
Copy link
Author

Probably already saw but I added some ideas to the ideas issue: #16

I think at this point we can communicate on those 3 open issue, and this one can remained closed, and RIP :)

@rhukster rhukster mentioned this pull request Jan 21, 2016
35 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants