
Lexer V1

Lancine Doumbia edited this page Jul 24, 2022 · 1 revision

Tokenizer Op begin

6/27 159am

Rewriting the tokenizer syntax from python to javascript.

217pm

Still working on the tokenizer/lexer. Having a problem extracting multi-digit numbers and multi-character words from the input expression and storing them in the lexer number object.

this works in python:

        #input expr 456
        if re.match(r"[0-9]", char):
            value = ''
            #nested iteration if a number is multi-num 
            while re.match(r"[0-9]", char):
                value += char
                current = current+1
                char = input_expression[current];
                
            tokens.append({
                'type': 'number',
                'value': value   #number token value 456
            })
            continue

but this doesn't work in javascript:

        //create + add token to array for numbers
        if (char.match(numbers)){
            var number = ''
            /* Problematic code at while loop condition
            while (char.match(numbers)){
                number += char
                current += 1
                char = input_expression[current];
            }
            */
            tokens.push({
                'type': 'number',
                'value': number
            })
            continue
        }

It keeps giving me this error: TypeError: Cannot read properties of undefined (reading 'match')

So far, I've had no problem adding new token generators for characters: +, -, *, /, {, }, =

244pm Idea

Add name property to token object to make things clearer. Do it after building a compiler.

    {
        'type': 'number',
        'name': 'whatever',
        'value': value
    }

622pm

Added token generators for characters: [, ], :, ;

643pm

Updated the error condition. The tokenizer will ignore the unknown character.

723pm

Added token generators for characters: !, ?

1038pm

It's still difficult to get the number token generator working properly.

243am 6/28

Maybe I can find a workaround. I can simplify the number and letter token generators, then add extra code that traverses the array, finds the objects with the 'word' or 'number' type, merges their values into a new token, and returns a new array.

306am

//check if token object type is word. Get the index
for (let index = 0; index < tokens.length; index++) {
    if(tokens[index].type == 'word'){
        console.log(`Letter? Yes. Index is ${index}`)
    }else{    
        console.log("No")
    }
        
}
//check if token object type is number. Get the index
for (let index = 0; index < tokens.length; index++) {
    if(tokens[index].type == 'number'){
        console.log(`Digit? Yes. Index is ${index}`)
    }else{
        console.log("No")
    }
        
}

Get rid of the else statements and merge the for loops together.
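A minimal sketch of that merged loop, assuming a small made-up tokens array. The names sampleTokens and findWordAndNumberIndexes are hypothetical and not from the actual tokenizer.

```javascript
// Hypothetical sample tokens; the real array comes from the tokenizer.
const sampleTokens = [
  { type: 'word', value: 'a' },
  { type: 'paren', value: '(' },
  { type: 'number', value: '4' },
];

// Collect the indexes of word and number tokens in a single pass.
// No else branch: tokens of any other type are simply skipped.
function findWordAndNumberIndexes(tokens) {
  const found = [];
  for (let index = 0; index < tokens.length; index++) {
    const { type } = tokens[index];
    if (type === 'word' || type === 'number') {
      found.push({ index, type });
    }
  }
  return found;
}

console.log(findWordAndNumberIndexes(sampleTokens));
```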

401am

Getting messy with the code here

721pm 6/29/2022

Still fixing my code up for the number and letter token generators. There must be a way to store multi number/alphabet tokens.

//PSEUDOCODE - probably should've done this first.

Function tokenizer (input_expression)
    init currentindex to 0
    make an array for holding token objects
    Init runner to false

    while currentindex < length of input_expression:
        move each input_expression element to char variable  //char may not be needed.

        if char matches whitespace regex
            current++
            continue
        endif
        //all character tokens
        if char equals '('
            push token object into tokens array
            //object {type:'what', value:'token value'}
            current++
            continue
        endif

        //numbers [Focus here]
        if char matches digits regex

            
            num variable holds number characters
            set runner to true

            while runner
                push token object into tokens array
                runner false
            endwhile
            current++
            continue
        endif

        //

        else
            display "unknown char" 
            current++
            continue
        endelse

    endwhile



End Function tokenizer

819pm

Almost works.

1137pm

I think I found the solution! Peter Leonov - Writing a JS lexer

1156pm

Didn't work. Trying again.

1222am 6/30/2022

I FOUND THE SOLUTION. And it's not what I expected.

The solution was right in front of me the whole time.

In python:

if re.match(r"[0-9]", char):
    value = ''
    #nested iteration if a number is multi-num 
    while re.match(r"[0-9]", char):
        value += char
        current = current+1
        char = input_expression[current];
                
    tokens.append({
        'type': 'number',
        'value': value
    })
    continue

javascript:

if (char.match(/[0-9]/g)){ 
    var num = ''
            
    while(char.match(/[0-9]/g)){
        num += char
        current++
        char = input_expression[current] //THIS LINE IS IMPORTANT. SHOULD NOT HAVE TAKEN IT OUT.
    }
    tokens.push({
        'type': 'number',
        'value': num
    })      
    continue
}

Didn't know why my terminal was having problems with my while loop before

It took me 3 days to figure this out. Didn't give up

103am

And now my terminal is giving me the same error. Why!?

Parser Op 134am 6/30/2022

Skip to Code generator operation. Seems simple

I've gotten impatient. Breezing through this project at this point.

7/1/2022 Friday 303am Vue Options API generator Trial 2

I got frustrated because I had a difficult time pushing multiword/digit tokens without causing an error.

755pm

Rebuilding the tokenizer from scratch.

926pm

Discovering Moo - nearley.js

1154pm

I FINALLY FUCKING SOLVED THE MULTI NUM/WORD PROBLEM!!!!!!!!!!!!!!

I took a risk and it worked!

    //if digit found, get into this statement. While loop will run immediately
    if(el.match(/[0-9]/)){

        var number = '' //digit character storage

        /*My solution - since the regex match function behaves well with if statements
        and since while loops are great for allowing the storage of each element until
        reaching an index in the expression where there isn't a digit, it therefore 
        makes sense to make a while loop with the simplest condition. A boolean condition
        */
        //init speed to true outside for loop.
        while(speed){
            //secondary digit checker. As long as there's a digit, keep storing them 
            if(el.match(/[0-9]/)){ //find the first digit again. This time, store it
                number += el    
                //get to the next index of the expression and test its corresponding element
                //while at the same time, update the for loop
                el = expression[++where] 
            }else{  //No digit? Push this token with multi num value. Then break the while loop 
                tokens.push({
                    'type': 'number',
                    'value': number,
                    'index': where   //Don't need this pair. Thought I did for a complicated store and push.
                })
                speed=false //after pushing the token, falsify the condition to stop the loop.
            }  
        }
            
    }

I was on the verge of giving up a few times because of how complex I thought it was. But I didn't, because I believed there was a way to solve this. I just hadn't tried it yet.

Alternative: move the token push block from else statement to below the while loop.

1227am 7/2/2022 Saturday

Small oof. The speed variable should be in the if statements.

        
        if(el.match(/[0-9]/)){    
            var number = ''
            var speed = true
            //while loop here
        }
        if(el.match(/[A-Za-z_]/)){
            var word = ''
            var speed = true
            //while loop here
        }


1230am.

The code broke again. Why!?

Checking it now. https://rollbar.com/blog/javascript-typeerror-cannot-read-property-of-undefined/

myVar !== undefined //Solution? In the if statements of the while loop?

Already expecting this to fail. But I'll try anyway. Knew it.

https://help.heroku.com/7XGGEGZH/cannot-read-property-match-of-undefined

Updating my node and npm

Node: v16.13.2 -> v18.4.0, npm: v8.1.2 -> v8.12.1

606am

I think the TypeError is being thrown because the code within the while loop itself is flawed. That line in the while loop points to the next element, which doesn't exist once the index reaches expression.length, outside the bounds of the expression. The element is then undefined, and calling match on undefined throws.

        if(el.match(/[0-9]/)){
            number += el
            el = expression[++where] //actual problematic code line here

Hypothesis is true. I was right. I thought, my code seemed error proof. There must be something I missed

Apparently putting a whitespace character after the string avoids the TypeError, because the loop stops on the whitespace before the index can run past the end.

Otherwise, add a checker to stop the while loop once the where variable reaches the expression length
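One way that checker could look, as a sketch only. It puts the length check directly in the while condition, ahead of the regex test. The function name readNumber and the returned next field are hypothetical, and this is not the code the tokenizer ended up with.

```javascript
// Read a run of digits starting at `start`. The bounds check comes first,
// so expression[where] is never read past the end of the string.
function readNumber(expression, start) {
  let where = start;
  let number = '';
  while (where < expression.length && /[0-9]/.test(expression[where])) {
    number += expression[where];
    where++;
  }
  // `next` is the index of the first character that is not part of the number.
  return { token: { type: 'number', value: number }, next: where };
}

console.log(readNumber('456', 0));  // number at the very end of the input
console.log(readNumber('12+3', 0)); // number followed by another character
```

Because `&&` short-circuits, the regex test is skipped as soon as `where` reaches `expression.length`, so no trailing whitespace is needed.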

622am

Fixed it. For real this time.

        if(el.match(/[0-9]/)){

            var number = ''
            var speed = true
            while(speed){
                if(el.match(/[0-9]/)){
                    number += el

                    //added this block. When you reach the end of the string, push that token.
                    //otherwise, keep scanning
                    if((where + 1) == expression.length){ 
                        tokens.push({
                            'type': 'number',
                            'value': number
                        })
                        speed=false
                    }else{
                        el = expression[++where]
                    }
                    
                }
                       
            }   

        }



645am Ran into a problem.

The tokenizer now cant process both multi numbers and words on the same expression.

728am

I can't find an alternative to the extra whitespace at the end of the expression string, so I'll leave it.

tokenize(`${input} `) //trailing space; \s only means whitespace inside a regex, not inside a template literal

That explains the continue keywords at the end of the python if statements of the while loop. Moving on.

Now to add more token builders for characters and move on to the parser

212pm

Adding more token makers for logical and comparison operators

1203am 7/3/2022

I have solved the logical operator problem. Before, a single & or | character was generating a token for the and/or operators.


        if(el.match(/[&|!]/)){
            var op = ''

            while(el.match(/[&|!]/)){ 
                //||
                op += el
                //next index
                el = expression[++where] 

                //|| operator
                if(op.match(/[|]{2}/)){
                    tokens.push({
                        'type': 'logic_op',
                        'value': op
                    })
                    op = '' //clear op holder after pushing
                }
                
                    
            }
            
        }

TOKENIZER ORGANIZATION - 1253am 7/3/2022

tokenizer(input) token array

main for loop
    local main var named element for iterating the input string
    if statement: element matching for specific characters
        local var for multi char value
        while loop: element matching for characters
            add each character to local var
            get next index of the input string and assign to element

            if statement: local var matching for regex 
                push token: type and its value
                clear local var string
            similar code repeats

        endwhile
    endif

    similar if statement code repeats or simple tokenpush
endfor

return tokens

226pm

        //Further analysis
        if(el.match(/[A-Za-z_]/)){
            var word = ''
            
            while(el.match(/[A-Za-z_]/)){
                word += el
                //ACHILLES HEEL. 
                //Will take you to the next index out of the array bounds once 
                //the while loop reads the final character of the expression.
                el = expression[++where]  

            }    
            tokens.push({
                'type': 'word',
                'value': word
            })
            
        }

606am 7/2/2022 Sat: Apparently putting a whitespace character after the string alleviates the type error.

There's another way. The extra whitespace character after the complete string expression annoys me


        if(el.match(/[A-Za-z_]/)){
            var word = ''
            
            //Final upgrade to while loop.
            while(el.match(/[A-Za-z_]/)){
                word += el
  
                //ADD THIS BLOCK
                //if you reach the final character, break this loop
                if(where == (expression.length - 1)){
                    break
                //if not, continue on to the next index  
                } else {
                    el = expression[++where]  
                }


            }    

            tokens.push({
                'type': 'word',
                'value': word
            })
            
        }

Possible drawback: each while loop within the if statements needs that block, which means code redundancy. But it's better than depending on a whitespace character after the last character of the expression string, which would just be a catastrophic source of failure.

428pm

The code block I added in that while loop works as expected. No final whitespace string after the expression needed

            if(where == (expression.length - 1)){
                break     
            } else {
                el = expression[++where]  
            }

Also, the operation handler is having a problem:

SyntaxError: Invalid regular expression: /[&|!+-*/=<>%]/: Range out of order in character class

    if (el.match(/[&|!+-*/=<>%]/))

Made a separate file placeholder.js to test the op handler in a mini tokenize function named lexi (short for lexer).

Solution found: To solve the error, you can either add the hyphen as the first or last character in the character class or escape it. https://bobbyhadz.com/blog/javascript-invalid-regular-expression-range-out-of-order

/[-a-zA-Z0-9]/g     //Good
/[a-zA-Z0-9-]/g     //Also good
/[a--zA-Z0-9 ]/g    //Bad


Fixed: /[-&|!+*\/=<>%]/

Note: An unescaped delimiter (/) must be escaped with a backslash (\)

\/ then.
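A small check of the hyphen rule, as a sketch. The broken pattern is built with new RegExp from a string so the SyntaxError can be caught at run time. The variable names operatorChars and rangeError are made up for this example.

```javascript
// Hyphen first, so it is a literal '-' rather than a range operator.
const operatorChars = /[-&|!+*\/=<>%]/;

// Building the broken pattern from a string lets the SyntaxError be caught
// at run time instead of stopping the whole script when it is parsed.
let rangeError = null;
try {
  new RegExp('[&|!+-*/=<>%]'); // '+-*' reads as a range from '+' down to '*'
} catch (err) {
  rangeError = err;
}

console.log(operatorChars.test('-'));           // true
console.log(operatorChars.test('/'));           // true
console.log(rangeError instanceof SyntaxError); // true
```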

442pm

Small oof. The lexi function wasn't reading the first character because of the block order in this loop:

            while (el.match(/[-&|!+*\/=<>%]/)) {
				op += el;

                //This block got in the way. Put it below if statements. 
				if (where == expression.length - 1) {
					break;
				} else {
					el = expression[++where];
				}

				if (op.match(/[|]{2}/)) {
					tokens.push({
						type: 'logic_or',
						value: op
					});
					op = '';
				}   
            }

453pm

Another small oof.

Expectation: In: >= Out: [ { type: 'greater_than_equal_to', value: '>=' } ]

Actual: In: >= Out: [ { type: 'greater_than', value: '>' }, { type: 'assign', value: '=' } ]

1053pm


        if (el.match(/[-&|!+*\/=<>%]/)) {

			var op = '';

			while (el.match(/[-&|!+*\/=<>%]/)) {

				op += el;

                //move those if statements to below this while loop
				if (op.match(/[|]{2}/)) {
					tokens.push({
						type: 'logic_or',
						value: op
					});
					op = '';
				}
				if (op.match(/[&]{2}/)) {
					tokens.push({
						type: 'logic_and',
						value: op
					});
					op = '';
				}
				if (op.match(/!/)) {
					tokens.push({
						type: 'logic_not',
						value: op
					});
					op = '';
				}
				if (op.match(/[=]/)) {
					tokens.push({
						type: 'assign',
						value: op
					});
					op = '';
				}
                //

                //leave that block alone
                if (where == expression.length - 1) {
					break;
				} else {
					el = expression[++where];
				}
                //
			}

		}
//Would it work? Prediction - Yes.

Previous thought: Make a local string variable and test


            var equal = /[==]/  //should be a string.

            if (op == equal) {
                tokens.push({
                    type: 'equals',
                    value: op
                });
                op = '';
            }

Expectation:

==

[ { type: 'assign', value: '==' } ]

Reality:

[ { type: 'equals', value: '=' }, { type: 'equals', value: '=' } ]

This happened because the if statements were inside the while loop, which builds a multi-character token one character at a time, so each = was matched and pushed on its own.

1103pm

Prediction true! The op handler works as it should now

1109pm

The op handler is messing up.

[ { type: 'logic_not', value: '!==' } ]

Should be

[ { type: 'logic_not', value: '!' } ]

I'll have to test the reg exps with https://regex101.com/

1114pm

Apparently a match(regexp) call with a character class isn't suited to testing for an exact sequence of characters. I put <= as the input. This is what I got:

[ { type: 'assign', value: '<=' } ]

I expected:

[ { type: 'less_than_equal_to', value: '<=' } ]

Replace all if statement conditions.


            //FROM
            if (op.match(/[>=]/)) {
                tokens.push({
                    type: 'greater_than_equal_to',
                    value: op
                });
                op = '';
            }

            //TO
            if (op == "operator symbol") {
                tokens.push({
                    type: 'name of operator',
                    value: op
                });
                op = '';
            }

1122pm

The change worked!

A character class like /[>=]/ finds a match if the string contains any one of the listed characters. The brackets don't describe a sequence; a sequence would be written without them, like />=/. Interesting.
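A quick check of the difference between a character class and a plain sequence, as a sketch. The variable names charClass and exact are made up for this example. It shows why the <= input landed in the wrong branch and why string equality fixed it.

```javascript
const charClass = /[>=]/; // any ONE of '>' or '=' anywhere in the string
const exact = /^>=$/;     // the two-character sequence '>=' and nothing else

console.log('<='.match(charClass) !== null); // true, because '<=' contains '='
console.log(exact.test('<='));               // false
console.log(exact.test('>='));               // true
console.log('<=' === '>=');                  // false, same answer as the anchored regex
```

For fixed operator symbols, the plain string comparison is the simpler of the two working options.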

128am 7/4/2022

Doing some finishing touches on my tokenizer. Adding extra code to my letter handler. Handle reserved words: function, pass, class, struct, var, whatever else


        var identifier_chars = /[A-Za-z_]/
        if(el.match(identifier_chars)){
            var identifier = ''
            
            while(el.match(identifier_chars)){
                identifier += el
                if(where == (expression.length - 1)){
                    break
                }else{
                    el = expression[++where] 
                }
                 
            }   

            /*PLAN
            Reserved words array 
            Add a for loop for iterating through the reserved words array. For of loop maybe
                if identifier == reserved[index], 
                    push reserved token, 
                    clear identifier variable
                otherwise
                    push identifier token
                    clear identifier variable
            */

            //Move this block to inside the for loop
            tokens.push({
                'type': 'identifier',
                'value': identifier
            })
            
        }  

Test the if statement block on placeholder.js.

243am

Somewhat worked.

[
  { type: 'identifier', value: 'high' },
  { type: 'identifier', value: '' }, //shouldn't be pushed again after the first one
  { type: 'identifier', value: '' },
  { type: 'identifier', value: '' },
  { type: 'identifier', value: '' },
    //all the way to index 45
]

Adding break keywords on the last lines of the statements

            for (let iter=0;iter<reserved.length ;iter++) {

                if(identifier == reserved[iter]){
                    tokens.push({
                        'type': 'reserved_keyword',
                        'value': identifier
                    }) 
                    
                    identifier = ''
                    break

                }else{
                    tokens.push({
                        'type': 'identifier',
                        'value': identifier
                    })
                    
                    identifier = ''
                    break
                }

            }

249am

Successful!

257am

Sort of successful.

[
    ...
  { type: 'identifier', value: 'while' },  //reserved
  { type: 'identifier', value: 'sitting' },
  { type: 'identifier', value: 'on' },
  { type: 'identifier', value: 'a' },
  { type: 'identifier', value: 'chair' },
  { type: 'identifier', value: 'and' },      //reserved
  { type: 'identifier', value: 'eating' },
  { type: 'identifier', value: 'a' },
  { type: 'identifier', value: 'sandwich' },
  ...
]

The break keywords might be causing a problem.

1051am

Separate for loops (one that generates reserved tokens if the identifier variable matches a reserved array element, and another that generates regular tokens if there isn't a match) also didn't work.

1057am

Almost works. Not really

            var flip = false
			for (let iter = 0; iter < reserved.length; iter++) {
				if (identifier == reserved[iter]) {
					tokens.push({
						type: 'reserved_keyword',
						value: identifier
					});

					identifier = '';
                    flip = true
				}
                if(flip){
                    tokens.push({
						type: 'identifier',
						value: identifier
					});

					identifier = '';
                    flip = false
                }
			}


1230pm

Use the continue keyword to replace break

Nope. Not better.

1240pm

        for (let iter = 0; iter < reserved.length; iter++) {
                console.log(`Running ${iter}`);

				if (identifier == reserved[iter]) {
                    
					tokens.push({
						type: 'reserved_keyword',
						value: identifier
					});
								
				} 
			
		}

           
        tokens.push({
            type: 'identifier',
            value: identifier
        });
            
[
  { type: 'identifier', value: 'it' },
  { type: 'identifier', value: 'me' },
  { type: 'reserved_keyword', value: 'throw' },
  { type: 'identifier', value: 'throw' },
  { type: 'identifier', value: 'down' }
]

Unexpected - after a reserved keyword is found, not only does the reserved token get generated, but so too does a non reserved token. Solution?

1247pm

Solution found.

            //IN ID HANDLER
            var reserved = [ String array of reserved words here ]
            
            //init to false. Resets to false when letter handler runs
            var gate = false    
			for (let iter = 0; iter < reserved.length; iter++) {
                //console.log(`Running ${iter}`);  Debugging

                //if a match is found, push the reserved token
				if (identifier == reserved[iter]) {
                    
					tokens.push({
						type: 'reserved_keyword',
						value: identifier
					});

                    //if reserved keyword token is generated, keep the gate false so it doesn't generate the non reserved token. Keep this here b/c if statement condition is always checked during each iteration
                    gate = false  
                    //when reserved keyword is found, break this loop. Or else the for loop will keep running and set the gate variable to true when the if statement doesn't find any matches causing the regular id token to be made. We don't want that.
                    break         
							
				} else {
                    gate = true
                }
	
			}

            //if statement is needed. 
            if(gate){ //You shall not pass!
                tokens.push({
                    type: 'identifier',
                    value: identifier
                });
                
            }
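For comparison, a shorter sketch of the same decision using Array.prototype.includes instead of a loop with a gate variable. This is an alternative, not the code above. The reservedWords list and the makeWordToken name are hypothetical.

```javascript
// Hypothetical keyword list; the real one lives in the id handler.
const reservedWords = ['while', 'case', 'throw', 'function', 'var'];

// includes() answers "is it reserved?" in one call, so every identifier
// produces exactly one token and no gate variable or break is needed.
function makeWordToken(identifier) {
  const type = reservedWords.includes(identifier) ? 'reserved_keyword' : 'identifier';
  return { type, value: identifier };
}

console.log(['it', 'me', 'throw', 'down'].map(makeWordToken));
```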


7/5/2022 454am

Adding more token generators for =>, ->, &, ', ", and \

555am

Added a token generator for **

611am

Added a token generator for `

736am

Adding a token generator for //

12pm

Removing the // token generator. // means a comment in other programming languages. I also had trouble figuring out what to do with the . token operator. Do I have my lexer treat . as a decimal point for the number handler, or as a class member access operator? I'll have the parser decide that.

Also also, I noticed that the op handler has a lot of if statements (op token generators) comparing the input sequence with the symbol sequence the generators are looking for. I wonder if I can just put all of the symbols into an array and iterate through them with a for loop. And inside that loop, there is just one if statement that generates only one token with type 'operator' holding the value of said op.

142pm

Basically:

if element matches one of the operator characters:

    set operator variable to empty string

    while element matches one of the operator characters:

        store element to operator variable

        if the entire expression string is read:
            stop this loop
        otherwise:
            go to the next index of the expression
    
    //From this:
    ":" token generator      
    "::" token generator 
    "." token generator   
    "||" token generator
    "&&" token generator
    "!" token generator 
    "=" token generator
    "+" token generator
    "-" token generator
    "*" token generator
    "/" token generator
    "%" token generator
    "<" token generator
    ">" token generator
    "?" token generator
    "==" token generator
    "!=" token generator
    "<=" token generator
    ">=" token generator
    "++" token generator
    "--" token generator
    "+=" token generator
    "-=" token generator
    "*=" token generator
    "/=" token generator
    "%=" token generator
    "!==" token generator
    "===" token generator
    "&" token generator
    "**" token generator
    "=>" token generator   
    "->" token generator

    unknown op. don't generate that token

    32 token generators total! A lot of lines.

To this:


    operator array

    op variable

    //below while loop
    
    for loop iterating the operator array:
        if op matches the iteration of the operator string:
            generate token; type operator, value of op content
            break 

        if you've reached the end of the array and found no matches
            say "unknown op. Can't make that token" 

    //the idea could work.
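A sketch of the operator array idea, assuming op already holds the full run of operator characters collected by the while loop. The operators list here is only a subset, and makeOperatorToken is a hypothetical name. One difference from the pseudocode: the unknown-op message sits after the loop, so there is no need to test for the last index inside it.

```javascript
// Hypothetical subset of the operator list from the log.
const operators = ['||', '&&', '!', '=', '==', '!=', '<=', '>=', '<', '>', '+', '-', '**', '=>'];

// `op` is the whole run of operator characters collected by the while loop.
function makeOperatorToken(op) {
  for (const candidate of operators) {
    if (op === candidate) {
      return { type: 'operator', value: op }; // returning also ends the loop
    }
  }
  // Only reached when no candidate matched.
  console.log(`This op ${op} is unknown`);
  return null; // unknown op: no token is generated
}

console.log(makeOperatorToken('>='));
console.log(makeOperatorToken('=>>'));
```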

514pm

Just to be safe, I'm adding the same tokenizer function with the name scanner on temporary test file placeholder.js that will handle the operators

610pm

Will the break keyword be needed? I'm omitting it.

5 minutes later

Break keyword is not needed. Totally optional.

    //INSIDE OPERATOR HANDLER:

    //The if condition checks two things. The prompt should only run once the loop reaches the last index,
    //and only if the operator at that last index doesn't match op. Without the first check, the prompt
    //would run on every iteration that doesn't find a match.
    
    if(((index + 1) == operation.length) && op != operation[index]){
		console.log(`This op ${op} is unknown`)
    }

639pm

The for loop works as expected. Now moving the updated code to main file's tokenizer

227 lines of 33 conditional statements now down to 11 lines! Code optimized.

647pm

Destroyed the scanner function. It already served its purpose

Now optimizing the for loop inside the id handler:

            //Previous jumbled up code//

            for every iteration of the reserved array:
				if identifier matches with the reserved value at that index:
					generate reserved id token; type: 'reserved_keyword', value: identifier
					keep gate variable false
					break out of this loop
				otherwise:
					make gate variable true

			if gate variable is true:
				generate regular id token

Plan:

            for every iteration of the reserved array:
				if identifier matches with the reserved value at that index:
					generate reserved id token; type: 'reserved_keyword', value: identifier
					
				if you reach the end of the array and can't find a match:
					generate regular id token
				
			
			//remove local gate variable and if statement condition that needs it

701pm

Small hiccup.

Input: while down case

[
  { type: 'reserved_keyword', value: 'while' },
  { type: 'identifier', value: 'while' },
  { type: 'identifier', value: 'down' },
  { type: 'reserved_keyword', value: 'case' },
  { type: 'identifier', value: 'case' }
]

740pm

Need to fix the if statement condition. I need to make sure that the regular id token gets generated if the loop doesn't find a match

6 minutes later

Strike that. The if statement inside the operator handler for loop is messing up too. When the operator is valid, it shouldn't display the unknown op message.

828pm

Solution found! Add a break keyword at the end of the if statement

            for every iteration of the reserved array:
				if identifier matches with the reserved value at that index:
					generate reserved id token; type: 'reserved_keyword', value: identifier
                    then break out of this loop
					
				if you reach the end of the array and can't find a match:
					generate regular id token

Leave the if condition alone.

857pm

Optimized the operator handler while loop:

    while element matches one of the operator characters:
        store element to operator variable

        if the entire expression string is read:
            break this loop
        
        go to the next index of the expression

Also, what would happen if break were omitted from the while loop? It depends on how the loop is built. With the earlier version:

if the entire expression is scanned, break this loop; else, go to the next index of the expression

the loop would run forever at the end of the input, because the index never advances and the condition keeps matching the same final element.

With the optimized version, a TypeError is thrown instead, because nothing stops the line el = expression[++where] from going out of bounds. Better that than a while loop without a terminating condition.
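A short demonstration of where that TypeError comes from, plus one possible way around it, as a sketch. It assumes expression is a string, since charAt does not exist on arrays. The variable indexError is made up for this example.

```javascript
const expression = 'ab';

// Indexing one past the end gives undefined, and undefined has no .match.
let indexError = null;
try {
  expression[expression.length].match(/[a-z]/);
} catch (err) {
  indexError = err;
}
console.log(indexError instanceof TypeError); // true

// charAt returns '' when out of range, and ''.match(...) is simply null,
// so a loop condition built on it ends quietly at the end of the input.
console.log(expression.charAt(expression.length) === '');         // true
console.log(expression.charAt(expression.length).match(/[a-z]/)); // null
```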

915pm

The id handler is working as it should. Yep, works well now

10 minutes later

Now to change the last 10 token generating if statements into switch cases.

1021pm

Removing the invalid handler. I've decided to have the tokenizer ignore whitespaces and invalid characters

1103pm.

I think this is it! I'm finished with my tokenizer. Now to move on to the parser.

Total amount of time spent: About 8 days

1015pm 7/13/2022

Building a separate scanner and tokenizer to see what I've learned from Vaidehi Joshi's blog.

616pm 7/16/2022

Separate scanner and tokenizer completed since 4pm. Syntactic Analyzer operation now in progress.
