Home > LLVM-RE-Gen

LLVM-RE-Gen

LLVM-RE-Gen is a project mainly written in C and C++, it's free.

Posix Regular Expression Compiler forr LLVM

INTRODUCTION

LLVM Regular Expresion Generator is a compiler that generates state machines from ASCII POSIX regular expression and compile them in LLVM IR so the compiled regular expression can be very fast to execute at run-time.

This project is part of the CoCoT project, is EXPERIMENTAL and probably INCOMPLETE. It is designed to be a proof-of-concept and only handles ASCII characters.

This project is aimed to be used inside a lexer. It therefore does not handles ^ nor $. However, it should not be very hard to implement. It simply considers that each given regexp should start on each given string. In other word, any '(regexp)' is in fact the equivalent of '^(regexp)' in other regular expression libraries.

FEATURES AND SYNTAX

Here is a list of currently supported features: ab = 'a' followed by 'b' a|b = 'a' or 'b' a? = 'a' zero or one time a = 'a' zero or any times a+ = aa (ab) = group composed by 'ab'. Can be used as a single token: '(ab)*', '(ab)|c', etc. [abc] = 'a' or 'b' or 'c' [^abc] = any character but 'a' or 'b' or 'c' [a-c] = any character from 'a' to 'c'

The escape characters recognized are : a, , f, , , as well as the hexadecimal escape sequence : x6D (='m')

Inside character sequence, you can add: [:alnum:] = [A-Za-z0-9] [:word:] = [A-Za-z0-9] [:alpha:] = [A-Za-z] [:blank:] = [ ] [:cntrl:] = [x00-x1Fx7F] [:digit:] = [0-9] [:graph:] = [x21-x7E] [:lower:] = [a-z] [:print:] = [x20-x7E] [:punct:] = [][!"#$%&'()*+,./:;<=>?@^`{|}~-] [:space:] = [ vf] [:upper:] = [A-Z] [:xdigit:] = [A-Fa-f0-9] Example: [[:alnum:]@] = [A-Za-z0-9@]

The parser is VERY context-dependant. Hence, this is a correct regular expression: [...]] It means: Any character from '.' to ']'.

  • Because we are in a character sequence, '.' is not a special char.
  • Because it is the begining of the sequence, '..' is not recognized as it needs at least a character to begin the range. Therefore '.' is recognized as a simple character.
  • '..' is recognized as the range token.
  • Because the end of the range is expected, ']' is recognized as the end of the range and not the end of the sequence.
  • The second ']' is recognized as the end of the sequence. This highly context-dependant parser permits to avoid escape characters in a lot of weird cases BUT it can generate weird errors. For example the [a..]b will generate a "Incomplete character sequence" error and not "Incomplete character range" error because the range a..] is correct and therefore the ']' is recognized as the end of the character range, so the sequence is still not terminated. Another weird effect can be showed with this expression: '[A..]c[de]'. In this case, this expression defines a SINGLE character sequence that can recognize any charcater from 'A' to ']' or the charcters 'c', or '[', or 'd', or 'e'. Because we are in a character sequence, '[' cannot be used as a begin sequence token and is therefore recognized as a simple character inside the sequence.

LLVM-REGEXP USAGE

llvm-regexp is the binary utility that generates LLVM bitcode form regular expressions. Just run llvm-regexp without arguments to see it's usage. To create a compiled transactionnal unit object (.o) that can be linked by a C or C++ compiler (like gcc or g++), follow this example :

llvm-regexp -H C++ -p demoTest 'a[bc]*c' | llc -filetype=obj -o demoTest.o Llvm-regexp generates the LLVM bitcode and llc compiles it into a transactionnal unit. In this case, llvm-regex also generates a demoTest.h C++ header file. You can then use the generated function in a C++ code like this :

include

include "demoTest.h"

int main() { std::cout << demoTest_0("abcbce") << std::endl; } Of course, the same example could work on regular C code, just change the -H parameter of llvm-regexp to -H C and use printf instead of std::cout.

SIMPLE LIBRARY USAGE

All the code is not documented yet (my research must go on), but the library is very easy to use :

// The only header you need

include

// Create a Regular expression function. LLVMRE::Func func = LLVMREInstance().createRE("[[:alpha:]][[:alnum:]_]");

// Execute it. Return the number of character matched (0 if none). int match = func->execute("testString");

As stated before, LLVMRE does not support $ and ^ but it is very easy to implement :

// Implementation of ^ std::string test = "testString"; LLVMRE::Func func = LLVMREInstance().createRE("[[:alpha:]][[:alnum:]_]"); int match = func->execute(test.c_str()); if (match != test.length()) match = 0;

// Implementation of $ std::string test = "...testString"; LLVMRE::Func func = LLVMREInstance().createRE("[[:alpha:]][[:alnum:]_]"); int match; int start = 0; while (start < test.length() && (match = func->execute(test.c_str())) == 0) ++start;

LIBRARY TWEAKS

Simply read the comments on the LLVMRE.h file, all the library is explained.

INSTALLATION

Use cmake or cmake-gui. This library needs a CMAKE enabled LLVM library (most of distribution ships with LLVM without it's CMAKE part).

Example under linux :

mkdir build cd build cmake -D CMAKE_BUILD_TYPE:STRING=Release .. make sudo make install