As I dive into creating a recursive descent parser from scratch for educational purposes, I've encountered some challenges along the way.
To illustrate my point, let's take a quick look at a snippet of the CSS3 grammar:
simple_selector = type_selector | universal;
type_selector = [ namespace_prefix ]? element_name;
namespace_prefix = [ IDENT | '*' ]? '|';
element_name = IDENT;
universal = [ namespace_prefix ]? '*';
One issue that caught me off guard was realizing that namespace_prefix
is actually optional within both the type_selector
and universal
. This caused problems as the type_selector
consistently failed when given input like
*|*</code. It turns out it was being evaluated for any input matching the <code>namespace_prefix
production.
Recursive descent parsing seems fairly straightforward, but in order to make informed decisions throughout the process, I modified my productions to return Boolean values. This change allowed me to easily determine if a particular production succeeded or not.
I'm currently utilizing a linked list data structure to handle flexible look-ahead capabilities, allowing me to backtrack to the original position if a production fails. However, while attempting a production, I find myself passing around mutable states to build a Document Object Model (DOM). This approach isn't ideal since there's no clear indication of whether the production will be successful, leading to potential difficulties in reverting changes if needed.
So, here's my question: Would it be beneficial to introduce an abstract syntax tree as an intermediary representation and proceed from there? Is this a common workaround for addressing such issues? It appears that the main challenge lies in the fact that the DOM might not be the most suitable tree data structure for recursion.