Tokenizer, a simple lexer in PHP to extract tokens

Tokenizer V1.0.0

Recently, after discovering Emmet for Sublime Text 3, my interest in parsing grammars and generating my own basic programming language and syntax was rekindled. Having the ability to use a simple, easy to understand syntax to quickly generate common boilerplate code (or, in the case of Emmet, html) is a tempting prospect.

Why build a Lexer?

Although most modern PHP IDEs do have snippet support; the ability to create re-usable sections of code, they’re not grammars. You can’t use them to construct code from many little building blocks or keywords. Just one at a time. Snippets are useful, though. If you find yourself writing for loops on a regular basis, then having a keyword you can type into your editor that quickly expands into a for loop ready for programming is a life saver.

But what if you wanted nested for loops? Or a for loop within a for loop, with three nested if statements inside it, followed by a while loop proceeding the first for loop? Yes, you could use snippets, but they wouldn’t be quite as fast. It would be nice to have a syntax as simple as Emmet’s, but designed specifically for PHP.

My first attempt at creating a Lexer was all-too basic, and I found myself trying to write an ever-more complex regular expression to try and extract all the tokens I wanted from my string. Which is why I decided to build a simple Lexer class, instead.

Tokenizer in action


This is a basic example of the tokenizer in action. You create a series of patterns to match using regular expressions, and the tokenizer will take an input string and extract the tokens from it. If the tokenizer encounters any problems, it will report where in the patterns array the error occurred.

Why PHP?

Why not? As a scripting language, PHP makes it much easier to focus on developing the language itself than getting too bogged down in the syntax of a more complex programming language. This may be at a performance cost due to PHP being an interpreted language, though. However, one of the benefits of an interpreted language like PHP, Javascript or Python is the function eval(). This function interprets a string as inline code, and then processes it as code. Although some people will tell you eval() is evil, I think it’s just misunderstood. Let’s say you wanted to a parse a mathematical expression in your new programming language, and return the result. You could write your own math parser. Going through the process of adding in operator precedence, including parsing expressions in brackets first, or..

However, if you had performance in mind, but still wanted to maintain the ability to have some of the work done for you, then you could look at Lua and ChaiScript, two scripting languages you can use within C++. This gives you the performance of a compiled language, and the flexibility of a scripting language, all in one.

What next?

There is of course much more to creating a programming language than just recognising the individual tokens the syntax is made of. Once you have your tokens, you’ll need to parse them with a parser which can then interpret the tokens as a grammar, and then perform actions based on that grammar. One common approach is creating what is known as an abstract syntax tree. This can be accomplished by using the Interpreter design pattern, for example.

There are also pre-built lexers and parsers out there if you want a good head start, such as ANTLR and Bison, however you may find building your own will give you a much better understanding of how grammars can be built. I recommend looking into Chomsky hierarchy and Backus–Naur Form as they’re a good place to start.

If you’re looking for a light introduction into building a simple programming language, howCode have an excellent set of tutorials on their youtube channel. The playlist can be found here. The tutorials are written in Python, but the techniques can be applied to a variety of different imperative programming languages such as PHP, C++, Java and even Javascript.

Download Tokenizer on Github

The example.php file on github includes a more fleshed out example file to get you started on creating your own programming language.

2 thoughts on “Tokenizer, a simple lexer in PHP to extract tokens

  1. Hello .
    You’ve done a very good work with this Lexer. I enjoyed using it to understand how an interpreter works.

    But it seems to have a little problem .
    When i give to the input string a number and number value is 0(zero) , the tokenizer stops tokenizing forward.

    I tried to understand why this happens , but no result

    I look forward for you answer ! Thank you.

    • Sorry for the late reply! Thanks for spotting this bug. The error was in the example while loop when fetching each token. Provided the found token was a “truthy” type value, it would return the next token. Because 0 is a “falsey” type value, it was causing the while loop to exit earlier than it should.

      I’ve updated the github repo with the fixed example that uses a strict comparison.

Leave a Reply

Your email address will not be published. Required fields are marked *