Hoa\Compiler Keyword-Identifier Clash


#1

How do I resolve a keyword-identifier clash?

Example (truncated .pp file):

...
%token	in			in
...
%token	id			[A-Za-z][A-Za-z0-9_]*
...

The following is parsed correctly:

for x in range(1,3):
	Favourite.Food.select("Cake");
;

This one…:

for x in range(1,3):
	Favourite.Food.insert("Cake");
;

…produces error:

Fatal error: Uncaught Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "in" (in) at line 2 and column 17:
	Favourite.Food.insert("Cake");

I’ve attempted to switch the order of the tokens, but that seems to make it so that keywords are recognized as identifiers. How do I solve this?


#2

I’ve taken a look at the Lexer.php file to investigate. It seems like there’s an issue with the matchLexeme() function, which I think parses what’s remaining of the input with a regex that reads like:

#\G(?| <regex> )# <option> (from line 277)

So when it gets to this part…:

insert("Cake");; (space characters ignored)

…it does this:

(Courtesy of regex101)
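
For anyone who wants to reproduce this without regex101, here is a minimal sketch in plain PHP. It assumes the lexer really builds its pattern in the \G(?| <regex> ) shape quoted above; the variable names are mine, not Hoa's.

```php
<?php
// Hypothetical reproduction: the "in" token tried at offset 0 of the
// remaining input, using the \G-anchored pattern shape quoted above.
$input = 'insert("Cake");';

// "in" is declared before "id" in the .pp file, so it is tried first.
$matched = preg_match('#\G(?|in)#', $input, $matches, 0, 0);

var_dump($matched);    // int(1): the keyword pattern matches
var_dump($matches[0]); // string(2) "in": only a prefix of "insert"
```

So the keyword wins as soon as it matches a prefix, and the id token never gets a chance at the whole word.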

I’m not sure if my understanding of the lexer mechanism is correct, but if it is, shouldn’t it work differently? Shouldn’t it, instead of analyzing all of the remaining text, analyze progressively more characters until no match is found? Here’s an illustration of what I mean:

Text      Token
i         ::id::
in        ::in::
ins       ::id::
...       ...
insert    ::id::
insert(   END

(Use #^(?| <regex> )$# <option> instead)

When the lexer reaches (, it rolls back to the last valid token ::id::, and repeats the process for the remaining text: ("Cake");;.
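
To make the idea concrete, here is a rough standalone sketch of one step of that process. It is not Hoa code: progressiveMatch() and the $tokens array are made up for this post, and the two regexes are just the ones from my .pp file.

```php
<?php
// Hypothetical sketch of the "grow the prefix" idea; none of these names
// come from Hoa. Tokens are tried in declaration order, as in the .pp file.
$tokens = [
    'in' => 'in',
    'id' => '[A-Za-z][A-Za-z0-9_]*',
];

function progressiveMatch(array $tokens, string $text, int $offset): ?array
{
    $last = null;
    $max  = strlen($text) - $offset;
    for ($length = 1; $length <= $max; $length++) {
        $prefix = substr($text, $offset, $length);
        $found  = null;
        foreach ($tokens as $name => $regex) {
            // The whole prefix must match: ^...$ instead of \G...
            if (preg_match('#^(?|' . $regex . ')$#', $prefix) === 1) {
                $found = ['token' => $name, 'value' => $prefix];
                break;
            }
        }
        if ($found === null) {
            break; // no token matches this prefix: roll back to $last
        }
        $last = $found;
    }
    return $last;
}

$result = progressiveMatch($tokens, 'insert("Cake");', 0);
echo $result['token'], ' ', $result['value'], "\n"; // id insert
```

I'm aware this stops too early for tokens whose shorter prefixes are not valid tokens themselves (a quoted string, for example), so take it as an illustration only.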


#3

I made some changes to the Lexer.php file, and it seems to work like it should. Instead of implementing the changes I suggested in the previous post, I changed nextToken() so that the longest token is returned after matching every token in $tokenArray (instead of returning early in the for-loop).

Please have a look at this edited version:

protected function nextToken($offset) {
	$tokenArray = &$this->_tokens[$this->_lexerState];
	
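	// Try every token and keep the longest match; on a tie, the token
	// declared first wins because the comparison below is strict.
	$longest = null;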
	foreach ($tokenArray as $lexeme => $bucket) {
		$result = $this->matchLexeme($lexeme, $bucket[0], $offset);
		if ($result !== null && 
				($longest === null || 
						$result['length'] > $longest['length']))
			$longest = [
				'token' => $lexeme,
				'value' => $result['value'],
				'length' => $result['length'],
				'regex' => $bucket[0],
				'state' => $bucket[1]
			];
	}
	
	if ($longest !== null) {
		if ($longest['state'] === null)
			$longest['state'] = $this->_lexerState;
		
		if ($longest['state'] !== $this->_lexerState) {
			$shift = false;
			if ($this->_nsStack !== null &&
					preg_match('#^__shift__(?:\s*\*\s*(\d+))?$#',
							$longest['state'], $matches) !== 0) {
				$i = isset($matches[1]) ? intval($matches[1]) : 1;
				if ($i > ($c = count($this->_nsStack)))
					throw new Compiler\Exception\Lexer(
						'Cannot shift namespace %d-times, from token ' .
						'%s in namespace %s, because the stack ' .
						'contains only %d namespaces.',
						1,
						[
							$i,
							$longest['token'],
							$this->_lexerState,
							$c
						]
					);
					
				while ($i-- >= 1)
					$longest['state'] = $this->_nsStack->pop();
				$shift = true;
			}
			
			if (!isset($this->_tokens[$longest['state']]))
				throw new Compiler\Exception\Lexer(
					'Namespace %s does not exist, called by token %s ' .
					'in namespace %s.',
					2,
					[
						$longest['state'],
						$longest['token'],
						$this->_lexerState
					]
				);
				
			if ($this->_nsStack !== null && $shift === false)
				$this->_nsStack[] = $this->_lexerState;
				
			$this->_lexerState = $longest['state'];
		}
		
		return [
			'token'  => $longest['token'],
			'value'  => $longest['value'],
			'length' => $longest['length'],
			'namespace' => $this->_lexerState,
			'keep' => 'skip' !== $longest['token']
		];
	}
	
	return null;
}
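
If you'd rather try the idea outside of Hoa, here is a stripped-down, self-contained version of the same selection rule. The longestMatch() helper and the $tokens array are hypothetical and only mimic the shape of the lexer; namespaces are left out entirely.

```php
<?php
// Hypothetical, self-contained illustration of longest-match selection;
// it mirrors the idea of the edited nextToken() but is not Hoa code.
$tokens = [
    'in' => 'in',
    'id' => '[A-Za-z][A-Za-z0-9_]*',
];

function longestMatch(array $tokens, string $text, int $offset): ?array
{
    $longest = null;
    foreach ($tokens as $name => $regex) {
        if (preg_match('#\G(?|' . $regex . ')#', $text, $m, 0, $offset) !== 1) {
            continue;
        }
        $length = strlen($m[0]);
        // Strict ">" keeps the earlier-declared token on a tie.
        if ($length > 0 && ($longest === null || $length > $longest['length'])) {
            $longest = ['token' => $name, 'value' => $m[0], 'length' => $length];
        }
    }
    return $longest;
}

$keyword    = longestMatch($tokens, 'in range(1,3):', 0);
$identifier = longestMatch($tokens, 'insert("Cake");', 0);
echo $keyword['token'], ' ', $identifier['token'], "\n"; // in id
```

The keyword still wins for a bare in, since both tokens match two characters there and in is declared first.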

#4

Hello :-),

To avoid token clashes, you can use token namespaces. They are designed precisely to avoid this issue.

Also, you can use hoa compiler:pp $grammar $language --token-sequence to observe the result of the lexer directly in your terminal. It is helpful.


#5

Thanks for the reply, but I’m not sure how namespaces can solve my problem. I think the “token clashes” you refer to are fundamentally different from what I’m experiencing.

Here are situations where namespaces can be useful as presented in the hack book:

  1. When you need to give a character sequence different names in different contexts (quote_ and string:_quote both refer to ")
  2. When you need to give different character sequences in different contexts the same name (foo and ns2:foo; ns1:tada and ns2:tada)

What none of the examples demonstrates is how to use namespaces to tell the lexer what to do when ambiguity arises (e.g. should insert be kept as a whole, or should it be divided into in and sert), which seems to be resolved simply with the order the tokens appear in the .pp file.

I’d love to know if there was a way around this using namespaces, but I can’t see how namespaces can help, because the namespace transition seems to occur only after a new token is created (I don’t want to create a new token right after in).


#6

You have a conflict between the in and id tokens. It is always better to place the most generic token at the end, which is exactly what you did. But unfortunately, in this specific case, it creates a conflict with a method name. Well, namespaces to the rescue. How?

You can jump to another namespace when you start reading chain calls. It does not solve the issue with the first identifier though.

So we can probably do something smarter. The in token can be defined like this:

%token in in\b

This way, it will be recognized only if a word boundary follows in. Consequently, insert will not be lexed as in + sert.
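
A quick way to convince yourself, as a sketch in plain PHP rather than through the Hoa lexer: the firstMatch() helper below is made up for this post and simply returns the first token whose pattern matches at the offset, which is the behaviour you described.

```php
<?php
// Hypothetical first-match lexer step (declaration order wins), used only
// to check the effect of the \b suffix on the "in" token.
function firstMatch(array $tokens, string $text, int $offset): ?array
{
    foreach ($tokens as $name => $regex) {
        if (preg_match('#\G(?|' . $regex . ')#', $text, $m, 0, $offset) === 1) {
            return ['token' => $name, 'value' => $m[0]];
        }
    }
    return null;
}

$tokens = [
    'in' => 'in\b', // keyword, only when followed by a word boundary
    'id' => '[A-Za-z][A-Za-z0-9_]*',
];

$a = firstMatch($tokens, 'insert("Cake");', 0);
$b = firstMatch($tokens, 'in range(1,3):', 0);
echo $a['token'], ' ', $a['value'], "\n"; // id insert
echo $b['token'], ' ', $b['value'], "\n"; // in in
```

The token order stays as it is in your grammar; only the keyword pattern changes.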

See what we did in Grammar.pp of Hoa\Ruler: the not token is declared as (?i)not\b, so it is case-insensitive and must be followed by a word boundary.


#7

Thanks. That’s precisely what I was looking for.