642. Design Search Autocomplete System


Design a search autocomplete system for a search engine. Users may input a sentence (at least one word, ending with the special character '#'). For each character they type except '#', you need to return the top 3 historical hot sentences whose prefix matches the part of the sentence typed so far. Here are the specific rules:

  1. The hot degree of a sentence is defined as the number of times a user has typed exactly the same sentence before.
  2. The returned top 3 hot sentences should be sorted by hot degree (the first is the hottest one). If several sentences have the same hot degree, use ASCII-code order (the smaller one appears first).
  3. If fewer than 3 hot sentences exist, return as many as you can.
  4. When the input is a special character, it means the sentence ends, and in this case, you need to return an empty list.
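
The ordering in rule 2 can be written as a comparator. The following is a minimal sketch, not part of the required API: the class name HotOrder and the helper top3 are hypothetical names chosen for illustration. It relies on String.compareTo, which agrees with ASCII order for the characters allowed here (lower-case letters and blank space).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating rule 2; not part of the problem's API.
class HotOrder {
    // Higher hot degree first; ties broken by ascending ASCII order of the sentence.
    static final Comparator<Map.Entry<String, Integer>> BY_HOTNESS = (a, b) -> {
        int byCount = Integer.compare(b.getValue(), a.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    };

    // Returns at most three sentences from counts, hottest first (rules 2 and 3).
    static List<String> top3(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(BY_HOTNESS);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(3, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```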

Your job is to implement the following functions:

The constructor function:

AutocompleteSystem(String[] sentences, int[] times): This is the constructor. The input is historical data. Sentences is a string array consisting of previously typed sentences. Times holds the number of times each corresponding sentence has been typed. Your system should record this historical data.

Now, the user wants to input a new sentence. The following function will provide the next character the user types:

List<String> input(char c): The input c is the next character typed by the user. The character will only be a lower-case letter ('a' to 'z'), a blank space (' ') or the special character ('#'). Once a sentence is finished, it should also be recorded in your system. The output is the top 3 historical hot sentences whose prefix matches the part of the sentence typed so far.


Example:
Operation: AutocompleteSystem(["i love you", "island","ironman", "i love leetcode"], [5,3,2,2])
The system has already tracked the following sentences and their corresponding times:
"i love you" : 5 times
"island" : 3 times
"ironman" : 2 times
"i love leetcode" : 2 times
Now, the user begins another search:

Operation: input('i')
Output: ["i love you", "island","i love leetcode"]
Explanation:
There are four sentences that have the prefix "i". Among them, "ironman" and "i love leetcode" have the same hot degree. Since ' ' has ASCII code 32 and 'r' has ASCII code 114, "i love leetcode" should come before "ironman". We only need to output the top 3 hot sentences, so "ironman" is ignored.

Operation: input(' ')
Output: ["i love you","i love leetcode"]
Explanation:
There are only two sentences that have prefix "i ".

Operation: input('a')
Output: []
Explanation:
There are no sentences that have prefix "i a".

Operation: input('#')
Output: []
Explanation:
The user has finished the input, so the sentence "i a" should be saved as a historical sentence in the system. The following input will be counted as a new search.


Note:

  1. The input sentence will always start with a letter and end with '#', and only one blank space will exist between two words.
  2. The number of complete sentences to be searched won't exceed 100. The length of each sentence, including those in the historical data, won't exceed 100.
  3. Please use double-quote instead of single-quote when you write test cases even for a character input.
  4. Please remember to RESET your class variables declared in class AutocompleteSystem, as static/class variables are persisted across multiple test cases.


Solution


Approach #1 Brute Force [Time Limit Exceeded]

In this solution, we make use of a HashMap, map, which stores entries of the form (sentence_i, times_i). Here, times_i is the number of times sentence_i has been typed earlier.

AutocompleteSystem: We take each sentence from sentences and its corresponding count from times, and add the entry to map.

input(c): We keep a current sentence tracker variable, cur_sent, which stores the sentence entered so far. For the current input c, we first append c to cur_sent and then iterate over all the keys of map to find the keys whose initial characters match cur_sent. We add all such keys to a list. Then we sort this list as required by the rules and return its first three values. If c is '#', we instead record cur_sent in map, reset cur_sent, and return an empty list.
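
The steps above can be sketched in Java as follows. This is an illustrative sketch rather than the article's reference code: the class name BruteForceAutocomplete and the field name curSent are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative brute-force sketch; class and field names are assumptions.
class BruteForceAutocomplete {
    private final Map<String, Integer> map = new HashMap<>();      // sentence -> times typed
    private final StringBuilder curSent = new StringBuilder();     // sentence typed so far

    BruteForceAutocomplete(String[] sentences, int[] times) {
        for (int i = 0; i < sentences.length; i++) {
            map.merge(sentences[i], times[i], Integer::sum);
        }
    }

    List<String> input(char c) {
        if (c == '#') {
            // Sentence finished: record it and start a new search.
            if (curSent.length() > 0) {
                map.merge(curSent.toString(), 1, Integer::sum);
            }
            curSent.setLength(0);
            return new ArrayList<>();
        }
        curSent.append(c);
        String prefix = curSent.toString();

        // Scan every stored sentence and keep those starting with the prefix.
        List<String> list = new ArrayList<>();
        for (String key : map.keySet()) {
            if (key.startsWith(prefix)) {
                list.add(key);
            }
        }
        // Hottest first; ties in ascending ASCII order.
        list.sort((a, b) -> {
            int byCount = Integer.compare(map.get(b), map.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(list.subList(0, Math.min(3, list.size())));
    }
}
```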

Performance Analysis

  • AutocompleteSystem() takes O(k * l) time. Putting an entry in a HashMap takes O(1) time, but to compute the hash value of a sentence of average length k, the sentence must be scanned at least once. We need to put l such entries in map.

  • input() takes O(n + m log m) time. We iterate over the n sentences stored in map so far, taking O(n) time, to populate the list of candidates for the hot sentences. Then we sort this list of length m, taking O(m log m) time.


Approach #2 Using One Level Indexing [Accepted]

This method is almost the same as the last approach, except that instead of a single HashMap storing the sentences along with their number of occurrences, we use a two-level structure: an array indexed by first character, where each slot holds a HashMap.

Thus, we make use of an array, arr, of HashMaps. Each element of arr corresponds to one letter of the alphabet. Each element is itself a HashMap, which stores sentences and their number of occurrences as in the last approach. For example, arr[0] refers to the HashMap that stores the sentences starting with 'a'.

The process of adding the data in AutocompleteSystem and retrieving it in input remains the same as in the last approach, except that we first index into arr by the sentence's first character to reach the required HashMap.
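
A minimal Java sketch of this indexing scheme follows. The class name IndexedAutocomplete and the helper bucket are assumptions made for this example, and a List of maps stands in for the array of HashMaps to avoid a generic array. It relies on the note that every sentence starts with a letter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of one-level indexing; names are assumptions.
class IndexedAutocomplete {
    // arr.get(0) holds sentences starting with 'a', ..., arr.get(25) with 'z'.
    private final List<Map<String, Integer>> arr = new ArrayList<>();
    private final StringBuilder curSent = new StringBuilder();

    IndexedAutocomplete(String[] sentences, int[] times) {
        for (int i = 0; i < 26; i++) {
            arr.add(new HashMap<>());
        }
        for (int i = 0; i < sentences.length; i++) {
            bucket(sentences[i]).merge(sentences[i], times[i], Integer::sum);
        }
    }

    // Selects the HashMap for the first character (assumed to be 'a'..'z').
    private Map<String, Integer> bucket(CharSequence s) {
        return arr.get(s.charAt(0) - 'a');
    }

    List<String> input(char c) {
        if (c == '#') {
            if (curSent.length() > 0) {
                bucket(curSent).merge(curSent.toString(), 1, Integer::sum);
            }
            curSent.setLength(0);
            return new ArrayList<>();
        }
        curSent.append(c);
        String prefix = curSent.toString();
        Map<String, Integer> map = bucket(curSent);

        // Only sentences sharing the first character are scanned.
        List<String> list = new ArrayList<>();
        for (String key : map.keySet()) {
            if (key.startsWith(prefix)) {
                list.add(key);
            }
        }
        list.sort((a, b) -> {
            int byCount = Integer.compare(map.get(b), map.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(list.subList(0, Math.min(3, list.size())));
    }
}
```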

Performance Analysis

  • AutocompleteSystem() takes O(k * l + 26) time. Creating the 26 HashMaps takes O(26) time. Putting an entry in a HashMap takes O(1) time, but to compute the hash value of a sentence of average length k, the sentence must be scanned at least once. We need to put l such entries in arr.

  • input() takes O(s + m log m) time. We iterate over only the one HashMap that holds sentences starting with the first character of the current sentence to populate the list of candidates. Here, s is the size of that HashMap. Then we sort the list of length m, taking O(m log m) time.


Approach #3 Using Trie [Accepted]

A Trie is a tree-like data structure used to store strings. It consists of nodes and edges. Each node has at most 26 children, one for each letter of the English alphabet, and an edge connects each parent node to each of its children. For this problem, one extra child is needed for the blank space.

Strings are stored top to bottom according to their prefixes. All prefixes of length 1 are stored at level 1, all prefixes of length 2 at level 2, and so on.

A Trie is very commonly used to represent the words stored in a dictionary. Each level represents one character of the word being formed. A word in the dictionary can be read off from the Trie by starting at the root and following edges down to the node where the word ends.

With a small modification to this structure, we can also include an entry, times, for the number of times a word has been previously typed. This entry is stored in the node at which the word ends, which is a leaf unless the word is a prefix of a longer one.

To implement the AutocompleteSystem constructor, we consider each character of every sentence in the sentences array and add a node for each character at successive levels of the trie. At the final node of each sentence, we update the times entry with the number of times that sentence has been typed.

The following figure shows a trie structure for the words "A", "to", "tea", "ted", "ten", "i", "in", and "inn", occurring 15, 7, 3, 4, 12, 11, 5 and 9 times respectively.

[Figure: Trie]
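
Building a trie with a times entry can be sketched as below, using the eight words listed for the figure. The class CountingTrie and its methods insert and timesOf are hypothetical names for this example, and children are kept in a HashMap so that the upper-case "A" needs no special handling. Note that "i" and "in" end at internal nodes, since longer words continue below them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie with a per-node times entry; names are assumptions.
class CountingTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        int times; // 0 means no word ends at this node
    }

    private final Node root = new Node();

    // Walks one level per character, creating nodes as needed,
    // and adds to the times entry at the node where the word ends.
    void insert(String word, int times) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.times += times;
    }

    // Returns the stored count for word, or 0 if the word was never inserted.
    int timesOf(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return 0;
            }
        }
        return node.times;
    }
}
```

For example, after inserting the eight words with their counts, timesOf("ten") yields 12, while timesOf("te") yields 0 because "te" is only a prefix.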

Similarly, to implement the input(c) function, for every input character c, we append the character to the sentence being formed, cur_sent. Then we traverse the trie from the root until all the characters of cur_sent have been consumed.

From this point onwards, we traverse all the branches below the node reached, add the sentences formed by these branches to a list along with their number of occurrences, and pick the best 3 of them as in the last approach. The following animation shows a typical illustration.

[Animation: trie traversal for input(c)]
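
The whole approach can be sketched in Java as follows. This is an illustration under stated assumptions, not the article's reference code: the names TrieAutocomplete, Node, Hit, curSent and collect are chosen for this example, and each node has 27 branches (26 letters plus the blank space).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative trie-based sketch; all names are assumptions.
class TrieAutocomplete {
    private static class Node {
        final Node[] branches = new Node[27]; // 'a'..'z' plus ' ' at index 26
        int times;                            // 0 means no sentence ends here
    }

    private static class Hit {
        final String sentence;
        final int times;
        Hit(String sentence, int times) { this.sentence = sentence; this.times = times; }
    }

    private final Node root = new Node();
    private final StringBuilder curSent = new StringBuilder();

    TrieAutocomplete(String[] sentences, int[] times) {
        for (int i = 0; i < sentences.length; i++) {
            insert(sentences[i], times[i]);
        }
    }

    private static int index(char c) { return c == ' ' ? 26 : c - 'a'; }
    private static char letter(int i) { return i == 26 ? ' ' : (char) ('a' + i); }

    private void insert(String s, int times) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            int idx = index(s.charAt(i));
            if (node.branches[idx] == null) {
                node.branches[idx] = new Node();
            }
            node = node.branches[idx];
        }
        node.times += times;
    }

    List<String> input(char c) {
        if (c == '#') {
            if (curSent.length() > 0) {
                insert(curSent.toString(), 1);
            }
            curSent.setLength(0);
            return new ArrayList<>();
        }
        curSent.append(c);
        // Follow the typed prefix down from the root.
        Node node = root;
        for (int i = 0; i < curSent.length() && node != null; i++) {
            node = node.branches[index(curSent.charAt(i))];
        }
        // Gather every sentence in the subtree below the prefix node.
        List<Hit> list = new ArrayList<>();
        if (node != null) {
            collect(node, new StringBuilder(curSent), list);
        }
        list.sort((a, b) -> {
            int byCount = Integer.compare(b.times, a.times);
            return byCount != 0 ? byCount : a.sentence.compareTo(b.sentence);
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(3, list.size()); i++) {
            result.add(list.get(i).sentence);
        }
        return result;
    }

    // Depth-first traversal; path holds the characters from the root to node.
    private void collect(Node node, StringBuilder path, List<Hit> out) {
        if (node.times > 0) {
            out.add(new Hit(path.toString(), node.times));
        }
        for (int i = 0; i < 27; i++) {
            if (node.branches[i] != null) {
                path.append(letter(i));
                collect(node.branches[i], path, out);
                path.setLength(path.length() - 1);
            }
        }
    }
}
```

This sketch walks from the root on every call for simplicity. Remembering the node reached by the previous character would avoid re-walking the prefix, at the cost of a little extra state.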

Performance Analysis

  • AutocompleteSystem() takes O(k * l) time. We need to iterate over l sentences, each of average length k, to create the trie for the given set of sentences.

  • input() takes O(p + q + m log m) time. Here, p is the length of the sentence formed so far, cur_sent, and q is the number of nodes in the subtree rooted at the node reached by cur_sent. We then sort the list of length m holding the candidate hot sentences, which takes O(m log m) time.


Analysis written by: @vinod23