Lexer V1
Rewriting the tokenizer from Python to JavaScript.
Still working on the tokenizer/lexer. I'm having a problem extracting multi-digit numbers and multi-character words from the input expression and storing them in the lexer's number token object.
#input expr 456
if re.match(r"[0-9]", char):
    value = ''
    #nested iteration if a number is multi-num
    while re.match(r"[0-9]", char):
        value += char
        current = current+1
        char = input_expression[current]
    tokens.append({
        'type': 'number',
        'value': value #number token value 456
    })
    continue
//create + add token to array for numbers
if (char.match(numbers)){
var number = ''
/* Problematic code at while loop condition
while (char.match(numbers)){
number += char
current += 1
char = input_expression[current];
}
*/
tokens.push({
'type': 'number',
'value': number
})
continue
}
It keeps giving me this error: TypeError: Cannot read properties of undefined (reading 'match')
So far, I've had no problem adding new token generators for characters: +, -, *, /, {, }, =
Add name property to token object to make things clearer. Do it after building a compiler.
{
'type': 'number',
'name': 'whatever',
'value': value
}
Added token generators for characters: [, ], :, ;
Updated the error condition. The tokenizer will ignore the unknown character.
Added token generators for characters: !, ?
It is still difficult to get the number token generator to work properly.
Maybe I can find a workaround. I can simplify the number and letter token generators and add extra code to traverse the array to find the objects that have the 'word'/'number' type, make a new token that has the merged values of the objects together, and make a new array to return
//check if token object type is word. Get the index
for (let index = 0; index < tokens.length; index++) {
if(tokens[index].type == 'word'){
console.log(`Letter? Yes. Index is ${index}`)
}else{
console.log("No")
}
}
//check if token object type is number. Get the index
for (let index = 0; index < tokens.length; index++) {
if(tokens[index].type == 'number'){
console.log(`Digit? Yes. Index is ${index}`)
}else{
console.log("No")
}
}
Get rid of the else statements and merge the for loops together.
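A minimal sketch of that merge workaround, done in a single pass instead of two loops. The helper name mergeAdjacent and the sample tokens are made up for illustration, and it assumes the simplified generators emit one token per character.

```javascript
// Hypothetical helper: merge runs of adjacent 'word' or 'number' tokens into one token.
function mergeAdjacent(tokens) {
  const merged = [];
  for (const token of tokens) {
    const last = merged[merged.length - 1];
    const mergeable = token.type === 'word' || token.type === 'number';
    if (mergeable && last !== undefined && last.type === token.type) {
      // Same type as the previous token: extend its value instead of pushing.
      last.value += token.value;
    } else {
      // Copy the token so the input array is left untouched.
      merged.push({ ...token });
    }
  }
  return merged;
}

// Assumed sample input: one token per character of "456+ab".
const singleCharTokens = [
  { type: 'number', value: '4' },
  { type: 'number', value: '5' },
  { type: 'number', value: '6' },
  { type: 'plus', value: '+' },
  { type: 'word', value: 'a' },
  { type: 'word', value: 'b' },
];
const mergedTokens = mergeAdjacent(singleCharTokens);
console.log(mergedTokens);
```

One catch with this approach: if whitespace is skipped without leaving a token behind, "12 34" would merge into a single 1234 token, so the merge only holds up when something separates the runs.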
Getting messy with the code here
Still fixing my code up for the number and letter token generators. There must be a way to store multi number/alphabet tokens.
//PSEUDOCODE - probably should've done this first.
Function tokenizer (input_expression)
init currentindex to 0
make an array for holding token objects
Init runner to false
while currentindex < length of input_expression:
move each input_expression element to char variable //char may not be needed.
if char matches whitespace regex
current++
continue
endif
//all character tokens
if char equals '('
push token object into tokens array
//object {type:'what', value:'token value'}
current++
continue
endif
//numbers [Focus here]
if char matches digits regex
num variable holds number characters
set runner to true
while runner
push token object into tokens array
runner false
endwhile
current++
continue
endif
//
else
display "unknown char"
current++
continue
endelse
endwhile
End Function tokenizer
Almost works.
I think I found the solution! Peter Leonov - Writing a JS lexer
Didn't work. Trying again.
I FOUND THE SOLUTION. And it's not what I expected.
The solution was right in front of me the whole time.
In python:
if re.match(r"[0-9]", char):
    value = ''
    #nested iteration if a number is multi-num
    while re.match(r"[0-9]", char):
        value += char
        current = current+1
        char = input_expression[current]
    tokens.append({
        'type': 'number',
        'value': value
    })
    continue
javascript:
if (char.match(/[0-9]/g)){
var num = ''
while(char.match(/[0-9]/g)){
num += char
current++
char = input_expression[current] //THIS LINE IS IMPORTANT. SHOULD NOT HAVE TAKEN IT OUT.
}
tokens.push({
'type': 'number',
'value': num
})
continue
}
Didn't know why my terminal was having problems with my while loop before
It took me 3 days to figure this out. Didn't give up
And now my terminal is giving me the same error. Why!?
I've gotten impatient. Breezing through this project at this point.
I got frustrated because I had a difficult time pushing multiword/digit tokens without causing an error.
Rebuilding the tokenizer from scratch.
Discovering Moo - nearley.js
I FINALLY FUCKING SOLVED THE MULTI NUM/WORD PROBLEM!!!!!!!!!!!!!!
I took a risk and it worked!
//if digit found, get into this statement. While loop will run immediately
if(el.match(/[0-9]/)){
var number = '' //digit character storage
/*My solution - since the regex match function behaves well with if statements
and since while loops are great for allowing the storage of each element until
reaching an index in the expression where there isn't a digit, it therefore
makes sense to make a while loop with the simplest condition. A boolean condition
*/
//init speed to true outside for loop.
while(speed){
//secondary digit checker. As long as there's a digit, keep storing them
if(el.match(/[0-9]/)){ //find the first digit again. This time, store it
number += el
//get to the next index of the expression and test its corresponding element
//while at the same time, update the for loop
el = expression[++where]
}else{ //No digit? Push this token with multi num value. Then break the while loop
tokens.push({
'type': 'number',
'value': number,
'index': where //Don't need this pair. Thought I did for a complicated store and push.
})
speed=false //after pushing the token, falsify the condition to stop the loop.
}
}
}
I was on the verge of giving up a few times because of how complex I thought it was. But I didn't, because I believed there was a way to solve this. I just hadn't tried it yet.
Alternative: move the token push block from else statement to below the while loop.
Small oof. The speed variable should be in the if statements.
if(el.match(/[0-9]/)){
var number = ''
var speed = true
//while loop here
}
if(el.match(/[A-Za-z_]/)){
var word = ''
var speed = true
//while loop here
}
The code broke again. Why!?
Checking it now. https://rollbar.com/blog/javascript-typeerror-cannot-read-property-of-undefined/
myVar !== undefined
//Solution? In the if statements of the while loop?
Already expecting this to fail, but I'll try anyway. Knew it.
https://help.heroku.com/7XGGEGZH/cannot-read-property-match-of-undefined
Updating my node and npm
//v16.13.2 Node -> v18.4.0 //v8.1.2 npm -> v8.12.1
I think the TypeError is being thrown because the code within the while loop itself is flawed: that line in the while loop points to the next element, which doesn't exist once the index reaches expression.length and is no longer within the string's bounds.
if(el.match(/[0-9]/)){
number += el
el = expression[++where] //actual problematic code line here
Hypothesis confirmed. I thought my code was error-proof, so there had to be something I missed.
Apparently putting a whitespace character after the string alleviates the type error.
Otherwise, add a checker to stop the while loop once the where variable reaches the expression length
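A minimal sketch of that checker, assuming the bound is tested in the while condition itself. The helper name scanNumber and its return shape are made up for illustration and are not the tokenizer's actual code.

```javascript
// Indexing one past the end of a string yields undefined,
// and undefined has no .match method, hence the TypeError.
const expression = '456';
console.log(expression[expression.length]); // undefined

// Hypothetical helper: collect consecutive digits starting at index `start`.
function scanNumber(expression, start) {
  let where = start;
  let number = '';
  // Check the bound first so expression[where] is never undefined when tested.
  while (where < expression.length && /[0-9]/.test(expression[where])) {
    number += expression[where];
    where++;
  }
  // `next` is the index of the first character that was not consumed.
  return { token: { type: 'number', value: number }, next: where };
}

const result = scanNumber('12+456', 3);
console.log(result);
```

Because the bound is checked before the character is read, the number can sit at the very end of the input without any trailing whitespace.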
Fixed it. For real this time.
if(el.match(/[0-9]/)){
var number = ''
var speed = true
while(speed){
if(el.match(/[0-9]/)){
number += el
//added this block. When you reach the end of the string, push that token.
//otherwise, keep scanning
if((where + 1) == expression.length){
tokens.push({
'type': 'number',
'value': number
})
speed=false
}else{
el = expression[++where]
}
}
}
}
The tokenizer now can't process both multi-digit numbers and words in the same expression.
I can't find a way around needing the extra whitespace at the end of the expression string, so I'll leave it.
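tokenize(`${input} `) //a literal trailing space. \s is only a regex class; inside a template literal it would append the letter s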
That explains the continue keywords at the end of the python if statements of the while loop. Moving on.
Now to add more token builders for characters and move on to the parser
Adding more token makers for logical and comparison operators
I have solved the logical operator problem. Before, tokens for the and/or operators were being generated from a single & or | character.
if(el.match(/[&|!]/)){
var op = ''
while(el.match(/[&|!]/)){
//||
op += el
//next index
el = expression[++where]
//|| operator
if(op.match(/[|]{2}/)){
tokens.push({
'type': 'logic_op',
'value': op
})
op = '' //clear op holder after pushing
}
}
}
TOKENIZER ORGANIZATION - 1253am 7/3/2022
tokenizer(input) token array
main for loop
local main var named element for iterating the input string
if statement: element matching for specific characters
local var for multi char value
while loop: element matching for characters
add each character to local var
get next index of the input string and assign to element
if statement: local var matching for regex
push token: type and its value
clear local var string
similar code repeats
endwhile
endif
similar if statement code repeats or simple tokenpush
endfor
return tokens
//Further analysis
if(el.match(/[A-Za-z_]/)){
var word = ''
while(el.match(/[A-Za-z_]/)){
word += el
//ACHILLES HEEL.
//Will take you to the next index out of the array bounds once
//the while loop reads the final character of the expression.
el = expression[++where]
}
tokens.push({
'type': 'word',
'value': word
})
}
606am 7/2/2022 Sat: Apparently putting a whitespace character after the string alleviates the type error.
There's another way. The extra whitespace character after the complete string expression annoys me
if(el.match(/[A-Za-z_]/)){
var word = ''
//Final upgrade to while loop.
while(el.match(/[A-Za-z_]/)){
word += el
//ADD THIS BLOCK
//if you reach the final character, break this loop
if(where == (expression.length - 1)){
break
//if not, continue on to the next index
} else {
el = expression[++where]
}
}
tokens.push({
'type': 'word',
'value': word
})
}
Possible drawback: every while loop within the if statements needs that block, which is redundant code. But it's better than depending on a whitespace character after the last character of the expression string, which would be a catastrophic source of failure if it were ever left off.
The code block I added in that while loop works as expected. No trailing whitespace after the expression is needed.
if(where == (expression.length - 1)){
break
} else {
el = expression[++where]
}
Also, the operation handler is having a problem:
SyntaxError: Invalid regular expression: /[&|!+-*/=<>%]/: Range out of order in character class
if (el.match(/[&|!+-*/=<>%]/))
Made a separate file, placeholder.js, to test the op handler in a mini tokenize function named lexi (short for lexer).
Solution found: To solve the error, you can either add the hyphen as the first or last character in the character class or escape it. https://bobbyhadz.com/blog/javascript-invalid-regular-expression-range-out-of-order#:~:text=The%20%22Invalid%20regular%20expression%3A%20Range,the%20regex%20or%20escape%20it.
/[-a-zA-Z0-9]/g //Good
/[a-zA-Z0-9-]/g //Also good
/[a--zA-Z0-9 ]/g //Bad
Fixed: /[-&|!+*\/=<>%]/
Note: An unescaped delimiter (/) inside the regex literal must also be escaped with a backslash (\), which gives \/.
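A quick check of the hyphen rule. The patterns are built with new RegExp from strings here (my choice for the demo, not what the tokenizer does) so the bad one can be caught at runtime instead of stopping the whole file at parse time.

```javascript
// '+-*' inside the brackets reads as a range from '+' (0x2B) down to '*' (0x2A),
// which is out of order, so constructing the pattern throws.
let message = '';
try {
  new RegExp('[&|!+-*/=<>%]');
} catch (err) {
  message = err.name;
}
console.log(message); // SyntaxError

// Two ways to make the hyphen literal: put it first, or escape it.
// (In a string source '/' needs no escape; in a /.../ literal it does.)
const hyphenFirst = new RegExp('[-&|!+*/=<>%]');
const hyphenEscaped = new RegExp('[&|!+\\-*/=<>%]');

const checks = ['-', '+', '*', '/', 'a'].map((ch) => [
  ch,
  hyphenFirst.test(ch),
  hyphenEscaped.test(ch),
]);
console.log(checks);
```

Both fixed patterns accept the four operator characters and reject the letter, so either placement works for the op handler's character class.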
Small oof. Lexi function wasn't reading the first character because
while (el.match(/[-&|!+*\/=<>%]/)) {
op += el;
//This block got in the way. Put it below if statements.
if (where == expression.length - 1) {
break;
} else {
el = expression[++where];
}
if (op.match(/[|]{2}/)) {
tokens.push({
type: 'logic_or',
value: op
});
op = '';
}
}
Another small oof.
Expectation: In: >= Out: [ { type: 'greater_than_equal_to', value: '>=' } ]
Actual: In: >= Out: [ { type: 'greater_than', value: '>' }, { type: 'assign', value: '=' } ]
if (el.match(/[-&|!+*\/=<>%]/)) {
var op = '';
while (el.match(/[-&|!+*\/=<>%]/)) {
op += el;
//move those if statements to below this while loop
if (op.match(/[|]{2}/)) {
tokens.push({
type: 'logic_or',
value: op
});
op = '';
}
if (op.match(/[&]{2}/)) {
tokens.push({
type: 'logic_and',
value: op
});
op = '';
}
if (op.match(/!/)) {
tokens.push({
type: 'logic_not',
value: op
});
op = '';
}
if (op.match(/[=]/)) {
tokens.push({
type: 'assign',
value: op
});
op = '';
}
//
//leave that block alone
if (where == expression.length - 1) {
break;
} else {
el = expression[++where];
}
//
}
}
//Would it work? Prediction - Yes.
Previous thought: Make a local string variable and test
var equal = /[==]/ //should be a string.
if (op == equal) {
tokens.push({
type: 'equals',
value: op
});
op = '';
}
Expectation:
==
[ { type: 'assign', value: '==' } ]
Reality:
[ { type: 'equals', value: '=' }, { type: 'equals', value: '=' } ]
Because if statements were inside the while loop that generates a multi character token by storing each character at a time.
Prediction true! The op handler works as it should now
The op handler is messing up.
[ { type: 'logic_not', value: '!==' } ]
Should be
[ { type: 'logic_not', value: '!' } ]
I'll have to test the reg exps with https://regex101.com/
Apparently the match(regexp) method isn't great at finding a precise match for a sequence of characters. I put <= as the input.
This is what i got:
[ { type: 'assign', value: '<=' } ]
I expected:
[ { type: 'less_than_equal_to', value: '<=' } ]
Replace all if statement conditions.
//FROM
if (op.match(/[>=]/)) {
tokens.push({
type: 'greater_than_equal_to',
value: op
});
op = '';
}
//TO
if (op == "operator symbol") {
tokens.push({
type: 'name of operator',
value: op
});
op = '';
}
The change worked!
A character class such as /[>=]/ matches as soon as it finds any one of the listed characters. There is no sequence inside the brackets. Interesting.
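A small comparison of why the bracketed pattern misfired on <= and why the plain string comparison does not. The anchored pattern is shown only as an alternative for contrast; the tokenizer itself uses the string comparison.

```javascript
const op = '<=';

// Character class: true, because '=' is one of the listed characters.
const classMatch = /[>=]/.test(op);

// Anchored literal sequence: false, '<=' is not exactly '>='.
const sequenceMatch = /^>=$/.test(op);

// Plain string comparison, the approach the if statements were switched to.
const isGreaterEqual = op === '>=';
const isLessEqual = op === '<=';

console.log({ classMatch, sequenceMatch, isGreaterEqual, isLessEqual });
```

So the brackets were the problem rather than regular expressions in general: without them, and with ^ and $ anchors, a pattern does match a whole sequence.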
Doing some finishing touches on my tokenizer. Adding extra code to my letter handler. Handle reserved words: function, pass, class, struct, var, whatever else
var identifier_chars = /[A-Za-z_]/
if(el.match(identifier_chars)){
var identifier = ''
while(el.match(identifier_chars)){
identifier += el
if(where == (expression.length - 1)){
break
}else{
el = expression[++where]
}
}
/*PLAN
Reserved words array
Add a for loop for iterating through the reserved words array. For of loop maybe
if identifier == reserved[index],
push reserved token,
clear identifier variable
otherwise
push identifier token
clear identifier variable
*/
//Move this block to inside the for loop
tokens.push({
'type': 'identifier',
'value': identifier
})
}
Test the if statement block on placeholder.js.
Somewhat worked.
[
{ type: 'identifier', value: 'high' },
{ type: 'identifier', value: '' }, //shouldn't be pushed again after the first one
{ type: 'identifier', value: '' },
{ type: 'identifier', value: '' },
{ type: 'identifier', value: '' },
//all the way to index 45
]
Adding break keywords on the last lines of the statements
for (let iter=0;iter<reserved.length ;iter++) {
if(identifier == reserved[iter]){
tokens.push({
'type': 'reserved_keyword',
'value': identifier
})
identifier = ''
break
}else{
tokens.push({
'type': 'identifier',
'value': identifier
})
identifier = ''
break
}
}
Successful!
Sort of successful.
[
...
{ type: 'identifier', value: 'while' }, //reserved
{ type: 'identifier', value: 'sitting' },
{ type: 'identifier', value: 'on' },
{ type: 'identifier', value: 'a' },
{ type: 'identifier', value: 'chair' },
{ type: 'identifier', value: 'and' }, //reserved
{ type: 'identifier', value: 'eating' },
{ type: 'identifier', value: 'a' },
{ type: 'identifier', value: 'sandwich' },
...
]
The break keywords might be causing a problem.
Separate for loops ( one that handles generating reserved tokens if identifier variable matches with a reserved array element, and the other that handles generating regular tokens if there isn't a match) also didn't work.
Almost works. Not really
var flip = false
for (let iter = 0; iter < reserved.length; iter++) {
if (identifier == reserved[iter]) {
tokens.push({
type: 'reserved_keyword',
value: identifier
});
identifier = '';
flip = true
}
if(flip){
tokens.push({
type: 'identifier',
value: identifier
});
identifier = '';
flip = false
}
}
Use the continue keyword to replace break
Nope. Not better.
for (let iter = 0; iter < reserved.length; iter++) {
console.log(`Running ${iter}`);
if (identifier == reserved[iter]) {
tokens.push({
type: 'reserved_keyword',
value: identifier
});
}
}
tokens.push({
type: 'identifier',
value: identifier
});
[
{ type: 'identifier', value: 'it' },
{ type: 'identifier', value: 'me' },
{ type: 'reserved_keyword', value: 'throw' },
{ type: 'identifier', value: 'throw' },
{ type: 'identifier', value: 'down' }
]
Unexpected - after a reserved keyword is found, not only does the reserved token get generated, but so too does a non reserved token. Solution?
Solution found.
//IN ID HANDLER
var reserved = [ String array of reserved words here ]
//init to false. Resets to false when letter handler runs
var gate = false
for (let iter = 0; iter < reserved.length; iter++) {
//console.log(`Running ${iter}`); Debugging
//if a match is found, push the reserved token
if (identifier == reserved[iter]) {
tokens.push({
type: 'reserved_keyword',
value: identifier
});
//if reserved keyword token is generated, keep the gate false so it doesn't generate the non reserved token. Keep this here b/c if statement condition is always checked during each iteration
gate = false
//when reserved keyword is found, break this loop. Or else the for loop will keep running and set the gate variable to true when the if statement doesn't find any matches causing the regular id token to be made. We don't want that.
break
} else {
gate = true
}
}
//if statement is needed.
if(gate){ //You shall not pass!
tokens.push({
type: 'identifier',
value: identifier
});
}
Adding more token generators for =>, ->, &, ', ", and \
Added a token generator for **
Added a token generator for `
Adding a token generator for //
Removing the // token generator, since // starts a comment in other programming languages. I also had trouble figuring out what to do with the . operator token. Do I have my lexer treat . as a decimal point for the number handler, or as a class member access operator? I'll have the parser decide that.
Also also, I noticed that the op handler has a lot of if statements (op token generators) comparing the input sequence with the symbol sequence the generators are looking for. I wonder if I can just put all of the symbols into an array and iterate through them with a for loop. And inside that loop, there is just one if statement that generates only one token with type 'operator' holding the value of said op.
Basically:
if element matches one of the operator characters:
set operator variable to empty string
while element matches one of the operator characters:
store element to operator variable
if the entire expression string is read:
stop this loop
otherwise:
go to the next index of the expression
//From this:
":" token generator
"::" token generator
"." token generator
"||" token generator
"&&" token generator
"!" token generator
"=" token generator
"+" token generator
"-" token generator
"*" token generator
"/" token generator
"%" token generator
"<" token generator
">" token generator
"?" token generator
"==" token generator
"!=" token generator
"<=" token generator
">=" token generator
"++" token generator
"--" token generator
"+=" token generator
"-=" token generator
"*=" token generator
"/=" token generator
"%=" token generator
"!==" token generator
"===" token generator
"&" token generator
"**" token generator
"=>" token generator
"->" token generator
unknown op. don't generate that token
32 token generators total! A lot of lines.
To this:
operator array
op variable
//below while loop
for loop iterating the operator array:
if op matches the iteration of the operator string:
generate token; type operator, value of op content
break
if you've reached the end of the array and found no matches
say "unknown op. Can't make that token"
//the idea could work.
Just to be safe, I'm adding a copy of the tokenizer function, named scanner, to the temporary test file placeholder.js; it will handle only the operators.
Will the break keyword be needed? I'm omitting it.
Break keyword is not needed. Totally optional.
//INSIDE OPERATOR HANDLER:
//This if statement inside the for loop checks two things: the loop has reached the last index, and op doesn't match the operator there.
//We don't want the prompt to run on every iteration that fails to match, only once after the whole array has been checked.
if(((index + 1) == operation.length) && op != operation[index]){
console.log(`This op ${op} is unknown`)
}
The for loop works as expected. Now moving the updated code to main file's tokenizer
227 lines of 33 conditional statements now down to 11 lines! Code optimized.
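A minimal sketch of the array lookup, using Array.prototype.includes in place of the hand-written for loop. The operation array below is a shortened, assumed sample of the full list, and operatorToken is a made-up helper name.

```javascript
// Assumed, shortened sample; the full list in this log has 32 operators.
const operation = ['||', '&&', '!', '=', '==', '!=', '<=', '>=', '+', '-', '=>'];

// Hypothetical helper: turn a collected operator string into a token,
// or report it as unknown and return null.
function operatorToken(op) {
  if (operation.includes(op)) {
    return { type: 'operator', value: op };
  }
  console.log(`This op ${op} is unknown`);
  return null;
}

const known = operatorToken('>=');
const unknown = operatorToken('<>'); // logs the unknown-op message
console.log(known, unknown);
```

With includes there is no index bookkeeping, so the unknown-op message can only print when nothing in the array matched.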
Destroyed the scanner function. It already served its purpose.
Now optimizing the for loop inside the id handler:
//Previous jumbled up code//
for every iteration of the reserved array:
if identifier matches with the reserved value at that index:
generate reserved id token; type: 'reserved_keyword', value: identifier
keep gate variable false
break out of this loop
otherwise:
make gate variable true
if gate variable is true:
generate regular id token
Plan:
for every iteration of the reserved array:
if identifier matches with the reserved value at that index:
generate reserved id token; type: 'reserved_keyword', value: identifier
if you reach the end of the array and can't find a match:
generate regular id token
//remove local gate variable and if statement condition that needs it
Small hiccup.
Input: while down case
[
{ type: 'reserved_keyword', value: 'while' },
{ type: 'identifier', value: 'while' },
{ type: 'identifier', value: 'down' },
{ type: 'reserved_keyword', value: 'case' },
{ type: 'identifier', value: 'case' }
]
Need to fix the if statement condition. I need to make sure that the regular id token gets generated if
the loop doesn't find a match
Strike that. The if statement inside the operator handler for loop is messing up too. When the operator is valid, it shouldn't display the unknown op message.
Solution found! Add a break keyword at the end of the if statement
for every iteration of the reserved array:
if identifier matches with the reserved value at that index:
generate reserved id token; type: 'reserved_keyword', value: identifier
then break out of this loop
if you reach the end of the array and can't find a match:
generate regular id token
Leave the if condition alone.
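A minimal sketch of the same decision without the loop, assuming a lookup with Array.prototype.includes is acceptable here. The reserved list below is only a sample and identifierToken is a made-up helper name.

```javascript
// Assumed sample of reserved words; the real list is longer.
const reserved = ['while', 'case', 'throw', 'function', 'var'];

// Hypothetical helper: exactly one token per identifier, so a reserved
// word can never also be pushed as a regular identifier.
function identifierToken(identifier) {
  const type = reserved.includes(identifier) ? 'reserved_keyword' : 'identifier';
  return { type, value: identifier };
}

const idTokens = 'while down case'.split(' ').map(identifierToken);
console.log(idTokens);
```

Since the type is picked in a single expression, there is no gate variable or break to get wrong.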
Optimized the operator handler while loop:
while element matches one of the operator characters:
store element to operator variable
if the entire expression string is read:
break this loop
go to the next index of the expression
Also, what happens if break is omitted from the while loop? With the original structure, the loop would keep executing forever, since its condition keeps testing the current element, which still passes the match function. That is, if the while loop is built like:
if the entire expression is scanned, break this loop; else, go to the next index of the expression
But with the optimized version, a TypeError is thrown instead, because nothing stops the line el = expression[++where] from going out of the array bounds.
Better that than a while loop without a falsifying condition.
The id handler is working as it should. Yep, works well now
Now to change the last 10 token generating if statements into switch cases.
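A rough sketch of what the switch version could look like. Which ten generators are meant isn't listed here, so the characters, the type names, and the helper name punctuationToken are all placeholders.

```javascript
// Hypothetical helper; type names are placeholders, not the tokenizer's actual ones.
function punctuationToken(el) {
  switch (el) {
    case '(':
    case ')':
      return { type: 'paren', value: el };
    case '{':
    case '}':
      return { type: 'brace', value: el };
    case '[':
    case ']':
      return { type: 'bracket', value: el };
    case ';':
      return { type: 'semicolon', value: el };
    default:
      // Not punctuation: return null so the other handlers can look at it.
      return null;
  }
}

const punctuation = ['(', ']', ';', 'x'].map((ch) => punctuationToken(ch));
console.log(punctuation);
```

Stacked case labels let several characters share one token type, which is where most of the line savings over separate if statements would come from.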
Removing the invalid handler. I've decided to have the tokenizer ignore whitespaces and invalid characters
I think this is it! I'm finished with my tokenizer. Now to move on to the parser.
Total amount of time spent: About 8 days