Commit ef1804b

algo(boldTerms): add algo + docs + tests
1 parent e3a115d commit ef1804b

5 files changed

+245
-0
lines changed

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
CC=g++
PROGRAMS=bold
CFLAGS= -g
all: $(PROGRAMS)

bold: bold.cpp
	$(CC) $^ -o $@ $(CFLAGS)


clean:
	rm -rf $(PROGRAMS) *.dSYM
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
# Bold Terms

Write a function that takes a document in the form of a string and a list of
search terms. The algorithm should modify the document such that character sequences
matching any of the given terms are wrapped in HTML `<b></b>` tags, similar to how a search
engine behaves.

## Assumptions, Clarifications, and Edge Cases

Here are some good assumptions and clarifications to consider:

- Do we want to match only search terms that appear as their own word
  (which requires defining "word")? Or, given the term `dom` and the document
  text `dominic`, do we want to bold the occurrence of `dom` within
  `dominic`? This is a good question to ask; the algorithm in this
  repository bolds *any* occurrence of a term.
- We'll want our passed-in terms list to consist only of
  unique terms so that we don't bold the same term multiple
  times (once for each time it appears in the terms list). Making this
  assumption would be convenient, but if we can't make it we can
  deduplicate the terms ourselves.
- What should we do if a matching term in the document is already
  surrounded by bold tags? In this case, the tags must have come from
  the original document, as the previous assumption makes it clear that we
  will not attempt to bold the same term multiple times. My first thought
  was that in a real-world algorithm we might want to check
  whether a term has already been bolded before bolding it, as we'd be working with,
  mutating, and re-displaying segments of the document's original HTML (in which
  a term might be bolded). *On second thought*, if we can do a little preprocessing
  (or at least assume some is done for us) and assume we're only dealing with the plaintext
  from the source document, stripped of all HTML markup, we can skip this check and bold terms
  as we wish, as no `<b>` or `</b>` character sequences will ever be present in our document's
  text.
- What should we do if a search term happens to be a substring of another term, and both match
  in the document? For example, given the terms `mini` and `minimum`, should we return:
  - `<b>minimum</b>`
  - `<b><b>mini</b>mum</b>`
  - `<b>mini</b>mum`

**[Regarding the final point above]** In this case we'll say it doesn't matter; however, it would be
good to look into this if we were implementing this in the real world. A good method here would be to
process the longest search terms first, which would produce the second output above, which is still
valid HTML.
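
The longest-first ordering, together with the deduplication mentioned in the assumptions, can be sketched as below. This is a minimal sketch and not part of this commit: the helper name `longestFirst` is hypothetical, and it assumes that ties between terms of equal length may be broken alphabetically.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not in this commit): returns the unique terms, longest first.
std::vector<std::string> longestFirst(std::vector<std::string> terms) {
    // Sort lexicographically so duplicates become adjacent, then drop them.
    std::sort(terms.begin(), terms.end());
    terms.erase(std::unique(terms.begin(), terms.end()), terms.end());

    // Order by descending length. stable_sort keeps the alphabetical
    // order among terms of equal length, so the result is deterministic.
    std::stable_sort(terms.begin(), terms.end(),
                     [](const std::string& a, const std::string& b) {
                         return a.size() > b.size();
                     });
    return terms;
}
```

Feeding the terms to the bolding loop in this order would bold `minimum` before `mini`, regardless of the order in which the caller supplied them.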

## Edge cases with possible solutions

This section is a deeper dive into the real-world edge cases that we may want our algorithm to
cater to. In this trivial interview-style implementation of the algorithm we won't be implementing
any of them.

### Searching for terms that match characters the algorithm adds

In this algorithm, if we bold some search term (by surrounding it with `<b>` and `</b>`),
and then later search for the terms `<b>` or `</b>`, we'll bold these inserted tag sequences
because it is convenient to treat them as valid document text. In a real-world
algorithm, however, we'd probably want to ensure we don't bold things we've added to the
document as a side effect of running this algorithm. We want the modifications we're making
to be purely cosmetic and not contribute to the content of the document.

To address this concern in the real world, we have a few (good?) options:

1. Eliminate from the given list of search terms all terms
   that match HTML bold tags, or any other HTML tags (if we wish to
   generalize this algorithm). The primary drawback of this method is that
   if the document's plaintext, once stripped of HTML markup, contains text
   like `&lt;b>` or `<b&gt;` in order to render the literal `<b>`, there's no
   way to highlight the rendered document text `<b>` because we can't search for
   `<b>`. Searching for the escaped terms will not work either because the rendered
   counterparts don't match the escaped terms.
2. We could do something similar to what Google seems to do (likely for security), which
   is to omit certain characters from search terms that might be using those characters to
   form HTML markup. In this case, searching for `<b>` would really be the same as searching for `b`,
   as we can remove the `<` and `>`. This means that searching for valid HTML does not result in an
   empty query, so we still have the potential to bold something useful. Note that our uniqueness
   assumption still allows the list to contain both
   the terms `<b>` and `b`. While they are unique, they search for the same thing, so if going with
   this solution we'd need to redefine "unique term" as "unique search term": uniqueness
   is determined by what the term would actually end up searching for.
3. Warning: This solution is a little messy. If someday we add more than just HTML tags to
   our document, we may not be able to stop users from searching for everything we add (for example, if
   we add general text, our users are often going to want to search for general text). To ensure that
   we don't process or bold text that we've added to a document ourselves, we may want to use a map to keep
   track of which characters in the document string we've added ourselves. This way, when we search for the
   occurrence of a term, we can determine whether a given occurrence of that term is a subset of
   text we've added ourselves, or whether it appears in naturally occurring text. From my perspective it
   would be difficult to update the map to reflect the new positions of the text we've previously added
   once we add more, but this isn't relevant right now, is it :)
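
As a rough illustration of option 2, the sketch below strips `<` and `>` from each term and then deduplicates by what the term would actually search for. It is not part of this commit: the helper names `stripAngleBrackets` and `normalizeTerms` are hypothetical, and the choice to drop terms that become empty is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: removes '<' and '>' from a term, so "<b>" becomes "b".
std::string stripAngleBrackets(std::string term) {
    term.erase(std::remove_if(term.begin(), term.end(),
                              [](char c) { return c == '<' || c == '>'; }),
               term.end());
    return term;
}

// Hypothetical helper: normalizes every term, then keeps the first occurrence
// of each non-empty result, so "<b>" and "b" count as the same search term.
std::vector<std::string> normalizeTerms(const std::vector<std::string>& terms) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string& raw : terms) {
        std::string term = stripAngleBrackets(raw);
        // Assumption: a term that is empty after stripping is dropped.
        if (!term.empty() && seen.insert(term).second) {
            result.push_back(term);
        }
    }
    return result;
}
```

Note that this only removes the two characters named above, so a term such as `</b>` is reduced to `/b` rather than `b`.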

## Proposed Solutions

### Naive Solution (implemented)

One solution is simply, for each term, to find all occurrences of it in the document
and add the characters `<b>` before each occurrence and `</b>` after it. This can
be done by using `std::string::find` to find an occurrence, performing the aforementioned
insertions, and doing the same for later occurrences.

The `std::string::find` method takes an optional second parameter, an index to start
searching from, which we can use to find later occurrences of a term. We can maintain some `foundIdx`,
initially `0`, which serves both as the index to search from and as the index at which the current occurrence
was found. Once we find an occurrence, we'll want to:

- Insert `<b>` right before the term (at position `foundIdx`)
- Set `foundIdx = foundIdx + 3` to account for the three characters added in front of the occurrence
- Insert `</b>` right after the term (at position `foundIdx + termLength`)
- Set `foundIdx = foundIdx + termLength + 4` so that the next search starts `termLength + 4` positions
  further on, right after the term we've found plus the closing tag

It's important to realize that we'd want to generalize our offsets and character additions if we wanted to support
adding more than one type of HTML tag to our document.
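
One way to generalize the offsets is to derive them from the tag strings instead of hard-coding `3` and `4`. The sketch below is illustrative and not part of this commit: the function name `wrapTerm` and its `open` and `close` parameters are hypothetical, and the guard against empty terms is an added assumption.

```cpp
#include <string>

// Hypothetical generalization (not in this commit): wraps every occurrence
// of `term` in `doc` with the given opening and closing tags.
std::string wrapTerm(std::string doc, const std::string& term,
                     const std::string& open, const std::string& close) {
    if (term.empty()) return doc;  // an empty term would match at every index

    std::size_t foundIdx = 0;
    while ((foundIdx = doc.find(term, foundIdx)) != std::string::npos) {
        doc.insert(foundIdx, open);
        foundIdx += open.size();                 // the occurrence moved right
        doc.insert(foundIdx + term.size(), close);
        foundIdx += term.size() + close.size();  // resume after the closing tag
    }
    return doc;
}
```

For example, `wrapTerm("a one and one", "one", "<em>", "</em>")` yields `a <em>one</em> and <em>one</em>`, and passing `"<b>"` and `"</b>"` reproduces the offsets of `3` and `4` described above.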

### Complexity Analysis

Note that we traverse the entire document `k` times if there are `k` search terms. This gives us:

- Time complexity: `O(n*k)`, if `n` is the length of the document. This treats each search as a linear scan
  and ignores the cost of the insertions, each of which shifts the rest of the string.
- Space complexity: `O(n)`, since we're returning a new modified copy of the document

### Optimized solution

We could potentially avoid iterating over the document so many times if we were able to perform
the task in a single pass over the document. To do this we'd want to traverse the document with the ability to
match each search term as it comes up. In this case we'd probably want to match the longest terms first, so we'd
likely need some way to look ahead. For example, in the document `minimum`, with search terms `m` and `minimum`, we'd
clearly have a match of `m` right away, but instead of greedily acting on this match, it would be nice if we
could save our spot and keep looking ahead while the current string we're looking at matches some part of another, longer
search term. If we match some longer search terms on the way, we can save this information somewhere, but we don't want
to act on a match until we've either matched or failed to match the longest so-far-matching search term.

The implementation of this would likely be messy and non-trivial, but it would reduce the number of passes over the
document from `k` to one. We might be
able to keep track of the longest so-far-matching search term by using a trie, but I'm just spitballing here.
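
To make the trie idea concrete, here is a minimal sketch, not part of this commit. The names `TrieNode`, `buildTrie`, and `boldLongest` are hypothetical. It assumes that the longest term starting at a position wins and that matched text is not searched again, so nested tags are never produced. Because the look-ahead restarts at every position, its worst case is proportional to the document length times the longest term length; a strictly linear pass would need something like the Aho-Corasick automaton.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie node: one child per next character.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isTerm = false;  // true if the path from the root spells a search term
};

// Hypothetical helper: builds a trie from the non-empty terms.
std::unique_ptr<TrieNode> buildTrie(const std::vector<std::string>& terms) {
    std::unique_ptr<TrieNode> root(new TrieNode());
    for (const std::string& term : terms) {
        if (term.empty()) continue;
        TrieNode* node = root.get();
        for (char c : term) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->isTerm = true;
    }
    return root;
}

// Hypothetical single-pass variant: bolds the longest term starting at each position.
std::string boldLongest(const std::string& doc, const std::vector<std::string>& terms) {
    const std::unique_ptr<TrieNode> root = buildTrie(terms);
    std::string out;
    std::size_t i = 0;
    while (i < doc.size()) {
        // Look ahead from i, remembering the longest complete term seen so far.
        const TrieNode* node = root.get();
        std::size_t longest = 0;
        for (std::size_t j = i; j < doc.size(); ++j) {
            auto it = node->children.find(doc[j]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->isTerm) longest = j - i + 1;
        }
        if (longest > 0) {
            out += "<b>" + doc.substr(i, longest) + "</b>";
            i += longest;  // skip past the text we just bolded
        } else {
            out += doc[i++];
        }
    }
    return out;
}
```

With the terms `m` and `minimum`, the document `minimum` comes out as `<b>minimum</b>`: the early match on `m` is remembered but only acted on if no longer term completes.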

### Complexity Analysis

- Time complexity: `O(n)`
- Space complexity: `O(k)`, since we'd need to store our search terms in a new structure capable of the look-ahead technique
  we previously talked about.
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
#include <iostream>
#include <vector>
#include <string>
#include <unordered_set>

// Source: ---

/**
 * For description and complexity analysis see: README.md
 */

std::string boldTerms(std::string doc, const std::vector<std::string>& terms) {
    // Deduplicate the terms so that no term is bolded more than once.
    // Note: the iteration order of std::unordered_set is unspecified.
    std::unordered_set<std::string> termSet(terms.begin(), terms.end());

    for (const std::string& term : termSet) {
        if (term.empty()) continue;  // an empty term would match at every index

        // find() returns std::size_t, so keep the index in the same type as npos
        std::size_t foundIdx = 0;
        while ((foundIdx = doc.find(term, foundIdx)) != std::string::npos) {
            doc.insert(foundIdx, "<b>");
            foundIdx += 3;  // skip past the inserted "<b>"
            doc.insert(foundIdx + term.size(), "</b>");
            foundIdx += term.size() + 4;  // skip past the term and the inserted "</b>"
        }
    }

    return doc;
}

int main() {
    int numTests, numTerms;
    std::string doc, tmpTerm;
    std::vector<std::string> terms;

    std::cin >> numTests;
    for (int i = 0; i < numTests; i++) {
        std::cin >> numTerms;
        std::cin.ignore();  // discard the newline left after numTerms
        std::getline(std::cin, doc);

        // Collect all of our search terms
        for (int j = 0; j < numTerms; j++) {
            std::cin >> tmpTerm;
            terms.push_back(tmpTerm);
        }

        std::cout << boldTerms(doc, terms) << std::endl;
        terms.clear();
    }

    return 0;
}
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
9
0
this should have no bold tags
5
this should also have no bold tags
apple
zoo
teacher
student
shouldn't
1
this should have one bold tag
one
3
this should have two bold tags
two
two
bold
1
this document should have <b>bolded</b> end-bold tags
</b>
2
this is the minimum example our algorithm is not well-defined for
mini
minimum
2
this is the minimum example our algorithm is not well-defined for
minimum
mini
2
in this sentence we end up bolding characters that our algorithm added! that's not good
<b>
good
2
in this sentence we don't end up bolding characters our algorithm added because of the order of the terms! that's not good either
good
<b>
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
this should have no bold tags
this should also have no bold tags
this should have <b>one</b> bold tag
this should have <b>two</b> <b>bold</b> tags
this document should have <b>bolded<b></b></b> end-bold tags
this is the <b><b>mini</b>mum</b> example our algorithm is not well-defined for
this is the <b>mini</b>mum example our algorithm is not well-defined for
in this sentence we end up bolding characters that our algorithm added! that's not <b><b></b>good</b>
in this sentence we don't end up bolding characters our algorithm added because of the order of the terms! that's not <b>good</b> either
