Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use flat memory layout #62

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

use flat memory layout #62

wants to merge 1 commit into from

Conversation

marler8997
Copy link

@marler8997 marler8997 commented Mar 6, 2021

This is a proposed change to the storage mechanism of compiled regular expressions. Currently, a compiled regex is made up of an array of these fixed-size structures:

typedef struct regex_t
{
  unsigned char  type;   /* CHAR, STAR, etc.                      */
  union
  {
    unsigned char  ch;   /*      the character itself             */
    unsigned char* ccl;  /*  OR  a pointer to characters in class */
  } u;
} regex_t;

The newly proposed data structure is a sequence of variable-length nodes of this structure:

typedef struct regex_t
{
  unsigned char type;    /* CHAR, STAR, etc.                      */
  unsigned char data_len;
  unsigned char data[0];
} regex_t;

The first difference between the two is that the new structure does not contain any pointers to any other data. All the data for the regex is self-contained which means there's no ownership dependencies to handle.

The next difference is fixed-length vs variable-length. The current fixed-length regex_t structure is 8 bytes on 32-bit systems and 16 bytes on 64-bit systems. This is because even though the type field is a single byte, the second field contains a pointer which needs to be aligned on the system's word boundary (4 for 32-bit 8 for 64-bit). Note that it also wouldn't matter whether the field order were changed, because the size of the structure also needs to align the next element if it were in an array. In comparison, the newly proposed data structure is variable length. It is 2 bytes for most types with CHAR, CHAR_CLASS and INV_CHAR_CLASS taking more bytes to store their raw characters. The new structure requires less memory to store any regex and typically ends up taking 2 to 4 times less memory.

It's also worth noting that the new structure has no alignment requirement, so it can be moved to any location in memory with something like memcpy.

Another difference is the new data structure should be a bit more cache friendly while checking for matches. Along with being smaller, it also turned the original structure of 2 buffers into 1 buffer which guarantees that the type id is right next the data it needs while checking for matches.

Yet another difference is the buffers to store both the regex objects and the character data are "uncolored". This means you don't have to adjust two different sizes of two different buffers to make sure you have enough memory to store a regex, there is only one size to adjust.

I also tested run time using time ./tests/test2 with this change. Before this changes times range from 4 to 5.5 seconds, after this change they ranged from 3.8 to 4 seconds. Times were more consistent and around 20% faster on average, which makes sense considering that it should be more cache friendly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant