API Tutorial 3: Lexers and Lexemes

This article is part of a series on using the Understand API (Part 1, Part 2).

Many API scripts/programs rely on the entities and references stored in the Understand database, but sometimes you need to descend into the text of the file itself. Understand lets you do that with the lexer function and the lexeme class.

Lexeme – a chunk of text that means something to the parser: a string, a comment, a variable, etc.
Lexer – a stream of lexemes.

With Understand, we can walk through that stream of lexemes and query each one for its text, the entity or reference associated with it, its token kind (Punctuation, Comment, Preprocessor, etc.), and the line it is on.
So if you have a simple line like this:

int a=5;//radius

 

Its lexemes would have the following information:

[Image: table of the lexemes for this line]
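The same kind of listing can be produced from a script. The following is a minimal sketch rather than the plugin's own code: the helper names describe_lexemes and print_lexemes are made up for illustration, and the code assumes it is handed a lexer obtained from a file entity in a database that is already open.

```python
def describe_lexemes(lexer):
    """Return one (line, token, text) tuple per lexeme in the stream."""
    rows = []
    for lexeme in lexer:
        # repr() keeps whitespace and newline lexemes visible in the listing
        rows.append((lexeme.line_begin(), lexeme.token(), repr(lexeme.text())))
    return rows


def print_lexemes(lexer):
    """Print the lexeme stream as a simple three-column listing."""
    for line, token, text in describe_lexemes(lexer):
        print(f"{line:>5}  {token:<14} {text}")


# Hypothetical usage, assuming `file` is a file entity from db.lookup():
#     print_lexemes(file.lexer())
```

Returning the rows separately from printing them makes it easy to filter by token or line before displaying anything.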

This plugin for the Understand GUI shows what the lexical values are for any file or entity.

Download: tokenizer.upl

To install, drag the file into the Understand GUI, then right-click an entity and select Interactive Reports -> Tokenizer.

For this sample line, the plugin would show:

[Image: Tokenizer report for the sample line]

An Example

This example returns the text of a file with all inactive code and comments removed and macros expanded. It assumes the Understand database has already been opened (see the templates at the end of the first tutorial).

Perl:

# Search for the first file named 'test' and print
# the file name and the cleaned text
my ($ent) = $db->lookup("*test*","file");
print $ent->longname() . "\n";
print fileCleanText($ent);

sub fileCleanText {
    my $file = shift;
    my $returnText = "";
    # Open the file lexer with macros expanded and
    # inactive code removed. Arguments: lookup_ents,
    # tabstop, show_inactive, expand_macros. The tabstop
    # is given explicitly because Perl drops an empty
    # list slot, so (0,,0,1) would pass only three values.
    my $lexer = $file->lexer(0,8,0,1);
    # Return undef if the lexer won't open
    return unless $lexer;
    # Go through all lexemes in the file and append the
    # text of non-comments to returnText
    foreach my $lexeme ($lexer->lexemes()) {
        if ($lexeme->token ne "Comment") {
            $returnText .= $lexeme->text;
        }
    }
    return $returnText;
}

Python:

def fileCleanText(file):
    returnString = ""
    # Open the file lexer with macros expanded and
    # inactive code removed
    for lexeme in file.lexer(False, 8, False, True):
        # Go through lexemes in the file and append
        # the text of non-comments to returnString
        if lexeme.token() != "Comment":
            returnString += lexeme.text()
    return returnString

# Search for the first file named 'test' and print
# the file name and the cleaned text
file = db.lookup(".*test.*", "file")[0]
print(file.longname())
print(fileCleanText(file))
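The same loop can also ask each lexeme which entity it belongs to. This is a rough sketch under a few assumptions: the helper name entities_by_line is hypothetical, the lexer must be opened with entity lookup enabled (the first argument, which the example above sets to False), and lexemes with no associated entity are taken to return None from ent().

```python
def entities_by_line(lexer):
    """Map each line number to the names of the entities found on it."""
    found = {}
    for lexeme in lexer:
        ent = lexeme.ent()  # assumed to be None when no entity is attached
        if ent is not None:
            found.setdefault(lexeme.line_begin(), []).append(ent.name())
    return found


# Hypothetical usage, assuming `file` is a file entity from db.lookup();
# the first argument enables entity lookup for each lexeme:
#     for line, names in sorted(entities_by_line(file.lexer(True, 8, False, True)).items()):
#         print(line, ", ".join(names))
```

Grouping by line keeps the output compact; swapping ent() for ref() would be the starting point for inspecting references instead.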

Final Thoughts

These tutorials cover some of the high-level concepts of the Understand APIs along with a few small examples, but they only scratch the surface of what you can do. For more ideas and examples, we recommend looking at the Perl and Python API samples that ship with Understand in the Scripts folder. More in-depth API documentation is available in our manuals section. If you get stuck, just send us an email at support@scitools.com; we’re always happy to help.