# Bold Terms

Write a function that takes in a document in the form of a string and a list of
search terms. The algorithm should modify the document so that character sequences
matching any of the given terms are wrapped in HTML `<b></b>` tags, similarly to how a
search engine highlights results.

## Assumptions, Clarifications, and Edge Cases

Here are some good assumptions and clarifications to consider:

 - Will we want to match search terms only when they form their own word
   (and how do we define a word)? Or, given the term `dom` and the document
   text `dominic`, will we want to bold the occurrence of `dom` within
   `dominic`? This is a good question to ask, and the algorithm in this
   repository bolds *any* occurrence of a term.
 - We'll want the passed-in terms list to consist only of unique terms so
   that we don't bold the same term multiple times (once for each time it
   appears in the terms list). Making this assumption is convenient, but we
   could equally deduplicate the terms ourselves if we can't make it.
 - What should we do if a matching term in the document is already
   surrounded by bold tags? In this case, the tags must have come from
   the original document, as our last assumption makes it clear that we
   will not attempt to bold the same term multiple times. My first thought
   was that if we were creating a real-world algorithm, we might want to check
   whether a term has already been bolded before bolding it, as we'd be working with,
   mutating, and re-displaying segments of the document's original HTML (in which
   a term might be bolded). *On second thought*, if we could do a little preprocessing
   (or at least assume some is done for us) and assume we're only dealing with the plaintext
   from the source document, stripped of all HTML markup, we can skip this check and bold terms
   as we wish, as no `<b>` or `</b>` character sequences will ever be present in our document's
   text.
 - What should we do if a search term happens to be a subset of another term, and both match
   in the document? For example, given the terms `mini` and `minimum`, should we return:
   - `<b>minimum</b>`
   - `<b><b>mini</b>mum</b>`
   - `<b>mini</b>mum`

**[Regarding the final point above]** In this case we'll say it doesn't matter, though it would be
good to look into this if we were implementing the algorithm in the real world. A good method here
would be to process the longest search terms first, which would produce the second output above;
that output is still valid HTML.
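
As a quick sketch, ordering the terms longest-first before processing could look like this (the helper name is illustrative, not something from this repository):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort the terms so the longest are processed first, ensuring a term
// that is a subset of another term is bolded after its superset.
void sortLongestFirst(std::vector<std::string>& terms) {
    std::sort(terms.begin(), terms.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });
}
```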

## Edge cases with possible solutions

This section is a deeper dive into the real-world edge cases that we may want our algorithm to
cater to. In this trivial interview-style implementation of the algorithm, we won't be implementing
solutions for them.
### Searching for terms that match characters the algorithm adds

In this algorithm, if we bold some search term (by surrounding it with `<b>` and `</b>`),
and then later search for the terms `<b>` or `</b>`, we'll bold these inserted tag sequences,
because it is convenient to treat them as valid document text. In a real-world
algorithm, however, we'd probably want to ensure we don't bold things we've added to the
document as a side effect of running this algorithm. We want the modifications we're making
to be purely cosmetic and not contribute to the content of the document.
| 59 | + |
| 60 | +To address this concern in the real world, we have a few (good?) options: |

1. From the given list of search terms, eliminate all terms that match HTML
   bold tags, or any other HTML tags (if we wish to generalize this
   algorithm). The primary drawback of this method is that if the document's
   plaintext, once stripped of HTML markup, contains escaped text
   like `&lt;b&gt;` in order to render the literal `<b>`, there's no
   way to highlight the rendered document text `<b>` because we can't search for
   `<b>`. Searching for the escaped terms will not work either, because the rendered
   counterparts don't match the escaped terms.
2. We could do something similar to what Google seems to do (likely for security), which
   is to omit certain characters from search terms that might be using those characters to
   form HTML markup. In this case, searching `<b>` would really be the same as searching `b`,
   as we remove the `<` and `>`. This way, searching for valid HTML does not result in an
   empty query, so we retain the potential of bolding something useful. Note that given our
   first assumption, we assume the list of terms is unique. This allows the list to consist of
   the terms `<b>` and `b`; while they are unique, they search for the same thing, so if going
   with this solution we'd need to redefine "unique term" to mean uniqueness by what the term
   would actually end up searching for.
3. Warning: this solution is a little messy. If someday we might be adding more than just HTML tags to
   our document, we may not be able to stop users from searching for everything we add (for example, if
   we add general text, our users will often want to search for general text). To ensure that
   we don't process or bold text that we've added to the document ourselves, we may want to use a map to
   keep track of which characters in the document string we've added ourselves. This way, when we search
   for a term, we can determine whether a given occurrence of that term is a subset of
   text we've added ourselves, or whether it appears in naturally-occurring text. From my perspective it
   would be difficult to update the map to reflect the new positions of previously-added text
   once we add more, but this isn't relevant right now, is it :)
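
Options 2 and 3 above could be sketched roughly as follows; all names here are illustrative, not code from this repository:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Option 2: strip '<' and '>' so that "<b>" searches as "b", then dedupe
// by the effective search text (our redefined notion of "unique term").
std::vector<std::string> sanitizeTerms(const std::vector<std::string>& terms) {
    std::set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string& term : terms) {
        std::string cleaned;
        for (char c : term)
            if (c != '<' && c != '>') cleaned += c;
        if (!cleaned.empty() && seen.insert(cleaned).second)
            result.push_back(cleaned);
    }
    return result;
}

// Option 3: a parallel flag vector marking which characters we inserted.
struct TrackedDocument {
    std::string text;
    std::vector<bool> inserted;  // inserted[i] is true if we added text[i]

    explicit TrackedDocument(const std::string& original)
        : text(original), inserted(original.size(), false) {}

    // Insert a tag at pos, keeping the flags aligned with the text.
    void insertTag(std::size_t pos, const std::string& tag) {
        text.insert(pos, tag);
        inserted.insert(inserted.begin() + pos, tag.size(), true);
    }

    // A match at [pos, pos + len) only counts if we added none of it.
    bool isNaturalMatch(std::size_t pos, std::size_t len) const {
        for (std::size_t i = pos; i < pos + len; ++i)
            if (inserted[i]) return false;
        return true;
    }
};
```

As the warning above suggests, keeping the flag vector aligned through repeated insertions is the fiddly part; `insertTag` handles it here by inserting flags at the same position as the text.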

## Proposed Solutions

### Naive Solution (implemented)

One solution is, for each term, to find all occurrences of it in the document
and, before each occurrence, add the characters `<b>`, and after it, `</b>`. This can
be done by using `std::string::find` to locate an occurrence, performing the aforementioned
insertions, and doing the same for subsequent occurrences.

The `std::string::find` method takes an optional second parameter, an index to start
searching from, which we can use to find later occurrences of a term. We can maintain some `findIdx`,
initially `0`, from which we search for the current term. Once we find an occurrence, at say
position `x`, we'll want to:

 - Set `findIdx = x` and insert `<b>` right before the term (at position `findIdx`)
 - Set `findIdx = findIdx + 3` to account for adding three characters ahead of the occurrence
 - Insert `</b>` right after the term (at position `findIdx + termLength`)
 - Set `findIdx = findIdx + termLength + 4` to start searching for the next occurrence `termLength + 4`
   positions away, or right after the term we've found plus our character additions
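
The steps above can be sketched in C++ as follows; this is written from the description, so treat it as an approximation rather than the repository's exact code:

```cpp
#include <string>
#include <vector>

// Naive pass: for each term, find every occurrence in the document and
// wrap it in <b></b>, resuming the search just past the inserted tags.
std::string boldTerms(std::string document, const std::vector<std::string>& terms) {
    for (const std::string& term : terms) {
        if (term.empty()) continue;
        auto findIdx = document.find(term);
        while (findIdx != std::string::npos) {
            document.insert(findIdx, "<b>");                     // 3 chars before the match
            document.insert(findIdx + 3 + term.size(), "</b>");  // 4 chars after it
            // Resume right after the match plus the 7 added characters
            findIdx = document.find(term, findIdx + term.size() + 7);
        }
    }
    return document;
}
```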

It's important to realize that we'd want to generalize our offsets and character additions if we wanted
to support adding more than one type of HTML tag to our document.

### Complexity Analysis

Note that we traverse the entire document `k` times if there are `k` search terms. This gives us:

 - Time complexity: `O(n*k)`, if `n` is the length of the document
 - Space complexity: `O(n)`, since we return a new, modified copy of the document

### Optimized solution

We could save ourselves from iterating over the document so many times if we were able to perform
the task in a single pass over the document. To do this we'd want to traverse the document with the
ability to match each search term as it comes up. In this case we'd probably want to match the longest
terms first, so we'd likely need some way to look ahead. For example, in the document `minimum`, with
search terms `m` and `minimum`, we'd clearly have a match of `m` right away, but instead of greedily
acting on this match, it would be nice if we could save our spot and keep looking ahead while the
current string we're looking at matches some part of another, longer search term. If we match some
longer search terms on the way, we can save this information somewhere, but we don't want to act on a
match until we've either matched or failed to match the longest so-far-matching search term.

The implementation of this would likely be messy and non-trivial, but would produce excellent efficiency.
We might be able to keep track of the longest so-far-matching search term by using a trie (the
Aho–Corasick string-matching algorithm formalizes a closely related idea), but I'm just spitballing here.
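
A rough single-pass sketch of this look-ahead idea, using a trie and acting only on the longest match at each position (exploratory code, not from this repository):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One-pass bolding: build a trie of the terms, then scan the document,
// bolding only the longest term that matches at each position.
std::string boldTermsOnePass(const std::string& document,
                             const std::vector<std::string>& terms) {
    struct Node {
        std::unordered_map<char, int> children;  // char -> node index
        bool isTerm = false;
    };
    std::vector<Node> trie(1);  // node 0 is the root

    for (const std::string& term : terms) {
        if (term.empty()) continue;
        int node = 0;
        for (char c : term) {
            auto it = trie[node].children.find(c);
            if (it == trie[node].children.end()) {
                int next = static_cast<int>(trie.size());
                trie[node].children[c] = next;
                trie.emplace_back();
                node = next;
            } else {
                node = it->second;
            }
        }
        trie[node].isTerm = true;
    }

    std::string out;
    std::size_t i = 0;
    while (i < document.size()) {
        // Walk the trie from position i, remembering the longest match.
        int node = 0;
        std::size_t longest = 0;
        for (std::size_t j = i; j < document.size(); ++j) {
            auto it = trie[node].children.find(document[j]);
            if (it == trie[node].children.end()) break;
            node = it->second;
            if (trie[node].isTerm) longest = j - i + 1;
        }
        if (longest > 0) {
            out += "<b>" + document.substr(i, longest) + "</b>";
            i += longest;
        } else {
            out += document[i++];
        }
    }
    return out;
}
```

Note that this variant produces the first candidate output from earlier (`<b>minimum</b>`) rather than nested tags, since it acts only on the longest match; also, re-walking the trie from every position means it is not strictly `O(n)` in the worst case.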

### Complexity Analysis

 - Time complexity: `O(n)`
 - Space complexity: `O(k)`, since we'd need to store our search terms in a new structure capable of
   the look-ahead technique we previously talked about.