The "PatternCaptureGroupTokenFilter" generates identical offsets, which causes issues with highlighting the string. #13783

Open
shikhasharma3708 opened this issue Sep 13, 2024 · 0 comments

Description

I am using PatternCaptureGroupTokenFilter to generate tokens from multiple regular expressions, with the goal of highlighting any matches found within the string. I am working with Lucene 9 and have run into the following problem.

I'm using one of the latest Lucene jars (9.11.1) for PatternCaptureGroupTokenFilter, but the token offsets are not as expected: every token captured from the same source token carries that source token's offsets, which makes it difficult to highlight the search results accurately.

Here are the tokens generated for the string:

{ "tokens" : [
  { "token" : "test:data", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
  { "token" : "test", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
  { "token" : ":data", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
  { "token" : "test:", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 },
  { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "word", "position" : 1 }
] }

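Because every captured token reports the offsets of the whole source token (0 to 9), the generated offsets cannot be used to highlight only the part of the text that matched the search query.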
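For comparison, the offsets a highlighter would need can be sketched with plain java.util.regex, outside Lucene. This is only an illustration: the three patterns are assumptions chosen because they happen to reproduce the captured terms listed above (the issue does not state the real patterns), and `CaptureOffsets` and `Capture` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureOffsets {

    /** A captured term with offsets measured in the original text (hypothetical helper type). */
    public static final class Capture {
        public final String term;
        public final int start;
        public final int end;

        Capture(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " [" + start + "," + end + "]";
        }
    }

    /**
     * Collects every non-empty capture group of every pattern and shifts the
     * group's token-relative offsets by the token's start offset in the text.
     */
    public static List<Capture> captures(String token, int tokenStart, Pattern... patterns) {
        List<Capture> out = new ArrayList<>();
        for (Pattern pattern : patterns) {
            Matcher m = pattern.matcher(token);
            while (m.find()) {
                for (int g = 1; g <= m.groupCount(); g++) {
                    int s = m.start(g);
                    int e = m.end(g);
                    if (s == -1 || s == e) {
                        continue; // group did not participate, or matched the empty string
                    }
                    out.add(new Capture(token.substring(s, e), tokenStart + s, tokenStart + e));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed patterns; the real ones are not given in the issue.
        List<Capture> caps = captures("test:data", 0,
                Pattern.compile("^(\\w+):"),
                Pattern.compile("(:\\w+)"),
                Pattern.compile("^(\\w+:)"));
        for (Capture c : caps) {
            System.out.println(c);
        }
    }
}
```

Under these assumed patterns, "test" would span 0 to 4, ":data" 4 to 9, and "test:" 0 to 5, rather than all three reporting 0 to 9.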

To generate distinct offsets, I tried setting them through org.apache.lucene.analysis.tokenattributes.OffsetAttribute. This gives each captured token its own offsets, but I now encounter another error while indexing the document.
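The indexing error itself is not quoted here, so the following is a guess at the cause rather than a diagnosis. As far as I understand, Lucene's indexing chain rejects a token whose start offset is smaller than the previous token's start offset in the same field ("offsets must not go backwards"), and per-capture offsets emitted in capture order can violate that. The plain-Java sketch below mimics that check and shows that ordering tokens by start offset would satisfy it; `OffsetOrder` and its methods are hypothetical names, and the offsets are the ones the captured terms above would have if each carried its own span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OffsetOrder {

    /**
     * Returns the index of the first {start, end} pair that starts before its
     * predecessor or ends before it starts, or -1 if the sequence is acceptable.
     * This imitates the ordering rule described above; it is not Lucene code.
     */
    public static int firstBackwardsOffset(List<int[]> offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.size(); i++) {
            int start = offsets.get(i)[0];
            int end = offsets.get(i)[1];
            if (start < lastStart || end < start) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    /** Returns a copy ordered by start offset, then by end offset. */
    public static List<int[]> sortedByStart(List<int[]> offsets) {
        List<int[]> copy = new ArrayList<>(offsets);
        copy.sort(Comparator.<int[]>comparingInt(o -> o[0]).thenComparingInt(o -> o[1]));
        return copy;
    }

    public static void main(String[] args) {
        // test:data, test, :data, test:, test (emission order, per-capture offsets)
        List<int[]> emitted = Arrays.asList(
                new int[] {0, 9}, new int[] {0, 4}, new int[] {4, 9},
                new int[] {0, 5}, new int[] {10, 14});
        System.out.println("first backwards index: " + firstBackwardsOffset(emitted));
        System.out.println("after sorting: " + firstBackwardsOffset(sortedByStart(emitted)));
    }
}
```

In a real TokenFilter, reordering would mean buffering all captures of a source token before emitting them, and the position increment would have to be set on whichever token ends up first; the sketch leaves both out.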

I am using the Java code below.
```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.CharsRefBuilder;

public final class PatternCaptureGroupTokenFilter extends TokenFilter {

    private final CharTermAttribute charTermAttr = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAttr = addAttribute(PositionIncrementAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    private final TypeAttribute typeAttribute = addAttribute(TypeAttribute.class);
    private State state;
    private final Matcher[] matchers;
    private final CharsRefBuilder spare = new CharsRefBuilder();
    private final int[] groupCounts;
    private final boolean preserveOriginal;
    private int[] currentGroup;
    private int currentMatcher;
    private int main_token_start;
    private int main_token_end;

    public PatternCaptureGroupTokenFilter(TokenStream input,
            boolean preserveOriginal, Pattern... patterns) {
        super(input);

        this.preserveOriginal = preserveOriginal;
        this.matchers = new Matcher[patterns.length];
        this.groupCounts = new int[patterns.length];
        this.currentGroup = new int[patterns.length];
        for (int i = 0; i < patterns.length; i++) {
            this.matchers[i] = patterns[i].matcher("");
            this.groupCounts[i] = this.matchers[i].groupCount();
            this.currentGroup[i] = -1;
        }
    }

    // Selects the matcher whose next non-empty capture group starts earliest.
    private boolean nextCapture() {

        int min_offset = Integer.MAX_VALUE;
        currentMatcher = -1;
        Matcher matcher;

        for (int i = 0; i < matchers.length; i++) {
            matcher = matchers[i];
            if (currentGroup[i] == -1) {
                currentGroup[i] = matcher.find() ? 1 : 0;
            }
            if (currentGroup[i] != 0) {
                while (currentGroup[i] < groupCounts[i] + 1) {
                    final int start = matcher.start(currentGroup[i]);
                    final int end = matcher.end(currentGroup[i]);

                    // skip empty groups, and groups equal to the preserved original token
                    if (start == end || preserveOriginal && start == 0
                            && spare.length() == end) {
                        currentGroup[i]++;
                        continue;
                    }
                    if (start < min_offset) {
                        min_offset = start;
                        currentMatcher = i;
                    }
                    break;
                }
                if (currentGroup[i] == groupCounts[i] + 1) {
                    currentGroup[i] = -1;
                    i--;
                }
            }
        }
        return currentMatcher != -1;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (currentMatcher != -1 && nextCapture()) {
            assert state != null;
            clearAttributes();
            restoreState(state);
            final int start = matchers[currentMatcher]
                    .start(currentGroup[currentMatcher]);
            final int end = matchers[currentMatcher]
                    .end(currentGroup[currentMatcher]);

            // modified code starts
            // restoreState() has put back the source token's offsets; shift the
            // group's token-relative offsets by the source token's start offset
            main_token_start = offsetAtt.startOffset();
            main_token_end = offsetAtt.endOffset();

            final int newStart = start + main_token_start;
            final int newEnd = end + main_token_start;

            offsetAtt.setOffset(newStart, newEnd);
            // modified code ends

            posAttr.setPositionIncrement(0);

            charTermAttr.copyBuffer(spare.chars(), start, end - start);
            currentGroup[currentMatcher]++;
            return true;
        }

        if (!input.incrementToken()) {
            return false;
        }

        char[] buffer = charTermAttr.buffer();
        int length = charTermAttr.length();
        spare.copyChars(buffer, 0, length);
        state = captureState();

        for (int i = 0; i < matchers.length; i++) {
            matchers[i].reset(spare.get());
            currentGroup[i] = -1;
        }

        if (preserveOriginal) {
            currentMatcher = 0;
        } else if (nextCapture()) {
            final int start = matchers[currentMatcher]
                    .start(currentGroup[currentMatcher]);
            final int end = matchers[currentMatcher]
                    .end(currentGroup[currentMatcher]);

            // if we start at 0 we can simply set the length and save the copy
            if (start == 0) {
                charTermAttr.setLength(end);
            } else {
                charTermAttr.copyBuffer(spare.chars(), start, end - start);
            }
            currentGroup[currentMatcher]++;
        }

        return true;

    }

    @Override
    public void reset() throws IOException {
        super.reset();
        state = null;
        currentMatcher = -1;
    }

}
```

I came across this GitHub link for reference: #9820. However, I couldn't find a relevant solution to the issue, as it is still marked open.

Is this achievable? Please let me know if anyone has any suggestions.

Version and environment details

I am using Lucene version 9.6.0.
