Element.hasClass ignores html strict mode #1446

Open
evaknin opened this issue Nov 2, 2020 · 7 comments

Comments

evaknin commented Nov 2, 2020

Hi,
jsoup matches class selectors case-insensitively.
This happens whether or not the document uses HTML strict mode (<!DOCTYPE html>).
In strict mode, this differs from how a browser behaves.

For example:
<!DOCTYPE html>
<html><head><style type="text/css">
.c1{ font-size:44px; }
.C1{ color:red; }
</style></head><body>
<div class="c1">
Some text
</div></body></html>

The following selects the div, even though its class is written with a lowercase c:
document.select(".C1");

My findings:
The class evaluator's matches method calls Element.hasClass.
Element.hasClass checks for a match while ignoring case.

LIKP0 commented Apr 16, 2021

Excuse me. I have reproduced the behaviour by:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Repro {
    public static void main(String[] args) {
        // Inline HTML with a strict-mode doctype; the div's class is lowercase "c1".
        String html = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "    <style type=\"text/css\">\n" +
                "        .c1 {\n" +
                "            font-size: 44px;\n" +
                "        }\n" +
                "\n" +
                "        .C1 {\n" +
                "            color: #ffa578;\n" +
                "        }\n" +
                "    </style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<div class=\"c1\">\n" +
                "    Some text\n" +
                "</div>\n" +
                "</body>\n" +
                "</html>";
        Document doc = Jsoup.parse(html);
        // In the reported behaviour, both selectors match the same div.
        System.out.println(doc.select("[class=C1]").get(0).text());
        System.out.println(doc.select("[class=c1]").get(0).text());
    }
}

Could you tell me how to use html strict mode so I can test and add some features for Jsoup.select()?

evaknin commented Apr 16, 2021

Hi,
In HTML5, we set strict mode by adding <!DOCTYPE html> at the beginning of the HTML.
If we remove it, we don't use strict mode.

RyderCRD (Contributor) commented

Hi,
I think jsoup currently does not support case-sensitive select(), and this does not depend on whether the document is in HTML strict mode.
From here you can see that selectors in jsoup are case-insensitive.

As a simple workaround, you could do a text replacement before calling select(), replacing the uppercase or lowercase name with a different one to remove the conflict, or you could apply your own case-sensitive check to the results after selection.

There is no doubt that your findings are correct. In the source code of jsoup 1.13.1 (the latest version so far), if we change line 1374 of Element.java from "return className.equalsIgnoreCase(classAttr);" to "return className.equals(classAttr);", the problem with the example you gave is solved: an element with class "c1" (lowercase c) is no longer selected by document.select(".C1");.

To solve this problem completely, we need to add a case-sensitive option to the selectors in jsoup. Because Java does not support default parameters, and to avoid disturbing existing functions, overloading the hasXxx methods seems a good solution.

For example:

public boolean hasClass(String className) {
    // Keep the existing case-insensitive behaviour as the default.
    return this.hasClass(className, false);
}

public boolean hasClass(String className, boolean caseSensitive) {
    //some code here

    if (len == wantLen) {
        if (caseSensitive)
            return className.equals(classAttr);
        return className.equalsIgnoreCase(classAttr);
    }

    //some code here
}

But then a bunch of methods would need to be modified like this, since the methods are nested and the boolean value has to be passed from head to tail. This will make jsoup more complex, and I'm not sure whether it will have side effects.
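
As a rough, self-contained illustration of the overload idea (not jsoup's actual code), the sketch below applies the same flag to a raw class attribute string. The class name ClassMatchSketch and its static hasClass helpers are hypothetical; they only mirror the shape of the proposed methods and assume class names are separated by whitespace.

```java
public class ClassMatchSketch {

    // Hypothetical helper: returns true if the raw class attribute value
    // (for example "c1 active") contains className as a whole token.
    public static boolean hasClass(String classAttr, String className, boolean caseSensitive) {
        if (classAttr == null || className == null || className.isEmpty()) {
            return false;
        }
        // Class names inside the attribute are separated by whitespace.
        for (String token : classAttr.trim().split("\\s+")) {
            boolean match = caseSensitive
                    ? token.equals(className)
                    : token.equalsIgnoreCase(className);
            if (match) {
                return true;
            }
        }
        return false;
    }

    // Two-argument overload keeps the case-insensitive default,
    // so existing callers would see no change in behaviour.
    public static boolean hasClass(String classAttr, String className) {
        return hasClass(classAttr, className, false);
    }

    public static void main(String[] args) {
        System.out.println(hasClass("c1", "C1"));          // case-insensitive default
        System.out.println(hasClass("c1", "C1", true));    // case-sensitive check
        System.out.println(hasClass("c1 C1", "C1", true)); // exact token present
    }
}
```

The point of the two-method shape is that only callers who opt in by passing true get case-sensitive matching; everything else keeps today's behaviour.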

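Also, as far as I know, HTML class names are case-sensitive, while in CSS selectors it is element and attribute names that are case-insensitive for HTML documents; class selectors are matched case-insensitively only in quirks mode. My suggestion is that we should always write code case-sensitively.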

RyderCRD (Contributor) commented

I have tried to fix this issue in pull request #1527.
With that change, you can select classes case-sensitively with .select(".classname", true) if you want.

RyderCRD (Contributor) commented

Here's the code. Hope this helps you.

evaknin commented Apr 25, 2021

Great.
Thanks :)

RyderCRD commented Apr 26, 2021

You're welcome! Just a reminder, you can also write it like this to determine automatically whether to use strict mode.

        // documentType() returns null when the document has no doctype.
        DocumentType docType = doc.documentType();
        boolean htmlStrictMode = docType != null && "html".equals(docType.name());
        // select(String, boolean) is the overload proposed in #1527.
        doc.select(".classname", htmlStrictMode);
