Parsing Infix Expressions using JavaScript

The following guide describes a programming pattern that can be used to parse infix expressions. The language used in the programming examples is JavaScript but the pattern can be applied using any language.

Instead of building a complete interpreter or parser, the focus of this guide is teaching the solution to the problem at the core of building an infix parser: working out how to move through the infix expression.

For the sake of simplicity, the parser in the example only handles single character operators and integers. Furthermore, it will execute the expression in-place without building a separate tree structure and it will only support infix operators.

Link to the full example.

Structure

The following is the structure of the example parser:

function Parser()
{
  var stream = null; // The stream of text being parsed
  var offset = -1;   // The stream pointer

  function getChar() // Get the next non-whitespace character
  {
  }

  function getOperator() // Get the next operator
  {
  }

  function executeOperator(left, right, operation) // Perform the given operation on the operands 
  {
  } 

  function getFactor() // Get the next factor
  {
  } 

  function getTerm(weight) // Get the next term
  {
  } 

  function getExpression() // Get an expression
  {
  } 

  this.execute = function(text) // Execute the expression in the text
  { 
    stream = text; 
    offset = 0; 

    return getExpression(); 
  } 
}

The functions of interest are getFactor, getTerm and getExpression. This is where the core of the problem is solved. But before they can be completed, getChar,getOperator and  executeOperator will have to be filled in.

Reading & Executing

getChar reads the next non-whitespace character from the stream and returns null if the end of the stream is reached first:

function getChar()
{
  var token;

  while(offset < stream.length)
  {
    token = stream[offset];
    offset++;

    if(token != ' ') // If token is not a space, return it
      return token;
  }
  
  return null;
}

getOperator reads the next operator in the stream and returns null if the end of the stream is reached first or the next character is not an operator:

function getOperator()
{
  var token, position;

  position = offset; // Save the current stream pointer position

  token = getChar();
  if(!token)
    return null;
 
  if(token == '+' || token == '-')
    return {operation: token, weight: 1};
  else if(token == '*' || token == '/')
    return {operation: token, weight: 2};
 
  offset = position; // Restore the previous stream pointer position
 
  return null;
}

executeOperator executes the supplied operation on the given operands:

function executeOperator(left, right, operation)
{
  if(operation == '+')
    return left + right;
  else if(operation == '-')
    return left - right;
  else if(operation == '*')
    return left * right;
  else if(operation == '/')
    return left / right;
}

Reading Factors

Walking through the expression itself is a bit trickier. We’ll start with getFactor. This function can do two things: read the next integer in the stream or execute a sub-expression. If present, the sub-expression will be demarcated by round brackets ().

function getFactor()
{
  var token, value;
  
  token = getChar();
  if(!token)
    return null;
   
  if(token == '(') // a sub expression
  {
    value = getExpression();
    token = getChar();
    
    if(!token || token != ')')
      throw new Error(`col ${offset}: Expected a closing bracket, found: ` + (token ? token : 'end-of-line'));
     
    return value;
  }
    
  else if(/\d/.test(token)) // an integer
    return token * 1;       // convert from string to number
   
  throw new Error(`col ${offset}: Expected an integer, found: ` + token);
}

The logic behind this function is relatively straight forward. The function:

  • Gets the next character in the stream or returns null if there is none.
  • Checks if that character is the beginning of a sub-expression. If so, executes the sub-expression and checks for the closing bracket, then returns the expression’s value. If not, checks if the next character is an number then returns it.
  • If character is neither a sub-expression nor a number, throw an exception.

That is all there is to reading the factors from the stream. For a more complex parser, this is where the program would check for different types of literals or identifiers. In a program that supports prefix operators, that check would precede the checks for literals and identifiers and if found, would lead to a call to getTerm.

Reading Expressions

The reading of terms, and by extension the expression, is performed using two almost identical functions: getTerm and getExpressiongetExpression contains the base design of the function:

function getExpression()
{
  var left, right, operator;
  
  left = getFactor(); // Get a factor, it must exist
  if(!left)
    throw new Error('Missing expression encountered'); 
   
  while(true) // Deliberate infinte loop
  {
    operator = getOperator();
    if(!operator)
      return left;
    
    right = getTerm(operator.weight);
    left  = executeOperator(left, right, operator.operation);
  }
}

What the function does is to first get a factor. The least that can constitute an expression is a factor, so it is mandatory that it be present.

Following this is an infinite loop that processes all subsequent terms. The loop starts by fetching an operator, which is optional. If none exists, the end of the expression has been reached and the value that is currently stored in left is returned as the result of the expression.

If an operator does exist, it must be followed by a term since the parser only supports infix operators. The result of that term is stored in right. The weight of the operator is always passed to getTerm so the function can determine if it should execute its own operator immediately, or yield to the previously reached operator.

The operator is then executed and the result is stored in left. The loop then begins from the top.

Reading Terms

getTerm contains a slightly modified version of getExpression and it also takes the weight of the previously encountered operator as a parameter:

function getTerm(weight)
{
  var left, right, position, operator;
  
  left = getFactor();
  if(!left)
    throw new Error('Missing term encountered'); 
   
  while(true) // deliberate infinte loop
  {
    position = offset; // Save the current stream pointer position
    
    operator = getOperator();
    if(!operator)
      return left;
     
    else if(operator.weight <= weight) // If the current operator has a lower or equal precedence, yield
    {
      offset = position; // Restore the previous stream pointer position
      return left;
    }
    
    right = getTerm(operator.weight);
    left  = executeOperator(left, right, operator.operation);
  }
}

Like getExpressiongetTerm demands a factor because that is the least that can constitute a term. Afterwards, a similar loop processes all subsequent terms.

The loop starts by saving the current position of the stream pointer. This is done so that the function is able to jump back to this position if it discovers that the previously encountered operator should execute first.

The loop then fetches an optional operator. If none exists, the end of the term has been reached and the value that is currently stored in left is returned as the result of the term.

  • If an operator is found, the weight of the operator is checked against the supplied weight of the previously encountered operator.
  • If the given weight is greater than the current weight, the function should yield so the higher precedence operator executes first.
  • If the given weight is equal to the current weight, the function should yield so operators are executed from left to right.
  • If the given weight is less than the current weight, proceed with the loop because the current operator should execute first.

If the operator was found and has the higher precedence, the loop fetches the next term. The weight of the current operator is passed to the function so it can decide whether or not to yield if it reaches an operator. The result of that term is stored in right.

The operator is then executed and the result is stored in left. The loop then begins from the top.

Conclusion

That is all there is to parsing infix expressions. When constructing a more complex system, it is wise to separate the lexer and the parser, and to build a syntax tree that is executed elsewhere in order to separate parsing from execution.

The End.

Leave a Reply

Your email address will not be published. Required fields are marked *