duoble-array

duoble-array SOFIWARE—PRACTICE AND EXPERIENCE, VOL. 22(9), 695–721 (SEPTEMBER 1992) An Efficient Implementation of Trie Structures JUN-ICHI AOE AND KATSUSHI MORIMOTO Department of Information Science and Intelligent Systemsj University of Tokushima, Minami-Josanjima- C...

SOFIWARE—PRACTICE AND EXPERIENCE, VOL. 22(9), 695–721 (SEPTEMBER 1992) An Efficient Implementation of Trie Structures JUN-ICHI AOE AND KATSUSHI MORIMOTO Department of Information Science and Intelligent Systemsj University of Tokushima, Minami-Josanjima- Cho, Tokushima-Shi 770, Japan AND TAKASHI SATO Department of Arts and Sciences, Osaka Kyoiku University, 3-1-1, Ikeda 563, Japan SUMMARY A new internal array structure, called a double-array, implementing a trie structure is presented. The double-array combines the fast access of a matrix form with the compactness of a list form. The algorithms for retrieval, insertion and deletion are introduced through examples. Although insertion is rather slow, it is still practical, and both the deletion and the retrieval time can be improved from the list form. From the comparison with the list for various large sets of keys, it is shown that the size of the double-array can be about 17 per cent smaller than that of the list, and that the retrieval speed of the double-array can be from 3·1 to 5·1 times faster than that of the list. KEY WORDS Dictionary Information retrieval Key retrieval strategies Natural language processing INTRODUCTION In many information retrieval applications, it is necessary to be able to adopt a trie search l’2 for looking at the input character by character. Examples include a lexical analyser of a compiler, a bibliographic search 3,5 a spelling checker, 6 and morphologi- cal dictionaries 7 in natural language processing, and so on. Each node of the trie is an array indexed by the next ‘item’. The element indexed is a final-state flag, plus a pointer to either a new node or a null pointer. Figure 1 gives an example of an array-structured trie for the set K = {baby, bachelor, badge, jar}. Retrieval, deletion and insertion on the trie are very fast, but it takes lots of space because the space complexity is proportional to the product of the number of nodes and the number of characters. A well-known strategy for compressing the trie is to list the arcs out of each node, with the null pointer at the end of the list. Figure 2 shows an example of a list-structured trie for the set K. The list-structured trie enables us to save the space by use of null pointers of the array-structured trie, but the retrieval becomes slow if there are many arcs leaving each node. This paper presents a technique of compressing the trie into two one-dimensional arrays BASE and CHECK called a double-array. In the double-array, non-empty locations of node n are mapped, by the array BASE, into the array CHECK such that 0038–0644/92/090695–27$18.50 Received 27 February 1990 © 1992 by John Wiley & Sons, Ltd. Revised 14 April 1992 696 J.-I. AOE, K. MORIMOTO AND T. SATO Figure 1. An array-structured trie for bachelor, baby, badge, jar no two non-empty locations in each node are mapped to the same position in CHECK. Each arc of the trie can be retrieved from the double-array in 0 (1) time, that is, the worst-case time complexity for retrieving a key becomes 0 ( k ) for the length k of that key. The trie has many nodes for a large set of keys, so it is important to make the double-array compact. In order to implement the trie for a large set of keys, the double-array stores only as much of the prefix in the trie as is necessary to disambiguate the key, and the tail of the key not needed for further disambiguation is stored in a string array, denoted as TAIL. REPRESENTATION OF A TRIE A trie is a tree structure in which each path from the root to a leaf corresponds to one key in the represented set. The paths in the trie correspond to the characters of the keys in the set. To avoid confusion between words like ‘the’ and ‘then’, a special endmarker symbol, #, is used at the end of each word in the set. The following definitions will be used in the explanations that follow. K is the set of keys represented by the trie. The trie is formed of nodes and arcs. Arc labels AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 697 Figure 2. A list-structured trie for bachelor, baby, badge, jar consist of symbols, called characters. An arc labeled a from node n to m is rep- resented by the notation g(n, a)=m. For a key in K, the node m with g(n,a)=m is a separate node if a is a sufficient character (or arc label) for distinguishing that key from all others in K. The concat- enation of the arc labels from separate node m to the terminal node is called a single string for m, denoted as STR[ m ]. The characters of key K remaining after the single string is deleted from K are called the tail of K. A tree constructed only of the arcs from the root to the separate nodes for all keys in K is called the reduced trie. An example of a reduced trie for the set K= {baby#, bachelor#, badge#, jar#} is shown in Figure 3. This same reduced trie representation is also shown in Figure 3, using a double-array and an array of characters for tail storage. Question marks (?) in TAIL indicate garbage; their use will be explaining when analysing the insertion and deletion algorithms. The following relations between the reduced trie and the double-array shown in Figure 3 exist: 698 J.-I. AOE, K. MORIMOTO AND T. SATO Figure 3. The reduced trie and the double-array for K 1. If there is an arc g(n,a) =m on the reduced trie, then BASE [ n ] +a=m and CHECK [ m ]= n. {For the arc labels: ‘#’=1, ‘a’=2, ‘b’=3, ‘c’=4, etc. } 2. If the node m is a separate node such that the tail string STR [ m ] = b 1, b2, . . . . b h t h e n (a) (b) These two Retrieval BASE [ m ]< 0 let, p =– BASE [ m ], TAIL [ p ]= b l, TAIL [ p +l]= b 2, . . . . TAIL [ p+h -1]= bh relations will remain throughout this paper. Retrieval operations using the double array are straightforward. For example, to retrieve ‘bachelor#’ from the double-array shown in Figure 3, the following steps are performed: Step 1. Step 2. Steps 3,4. Step 5. Store the root node at position 1 of BASE in the double-array, and start at that position. The value for the character ‘b’ is 3, so from relation 1 above BASE [ n ]+ a =BASE [ l]+ ‘b’= BASE [l]+3=4+3=7 Observe that CHECK [7] = 1 Since the value found for BASE in step 1 is positive, proceed. Use the value 7 from step 1 as the new index into BASE, and the value 2 for character ‘a’, so: BASE [7]+ ‘a’= BASE [7] +2=l+2=3, and CHECK [3]=7 Proceed as above, using 4 for ‘c’: BASE [3]+ ‘c’= BASE [3]+4=1+4=5, and CHECK [5]=3 The value in BASE [5] is – 1. A negative value indicates that the rest of AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 699 the word is located in TAIL, starting at TAIL [– BASE [5]] = TAIL [l]. The other words in the list can be retrieved using a similar technique, always starting at the root node at position 1 in BASE. Observe that retrieval involves only direct array lookups (no searching is required) and addition, making retrievals extremely efficient in this implementation. Insertion Insertion into a double-array is also straightforward. During insertion, any of the four cases below arises: 1. Insertion of the new word when the double-array is empty. 2. Insertion of the new word without any collisions. 3. Insertion of the new word with a collision; in this case, additional characters must be added to the BASE and characters must be removed from the TAIL array to resolve the collision, but nothing already in the BASE array must be removed. 4. When insertion of the new word with a collision as in case 3 occurs, values in the BASE array must be moved. A collision indicates that two different characters have the same index value within the double-array. These four insertion cases will be demonstrated by adding ‘bach- elor’ (case 1), ‘jar#’ (case 2), ‘badge#’ (case 3) and ‘baby#’ (case 4) to an empty double-array structure like the one shown in Figure 4. We define the size denoted by DA_SIZE, of the double-array as the maximum index of the non-zero entries of CHECK. Note that the BASE and CHECK entries exceeding the DA_SIZE can be allocated dynamically as zero entries if needed. Case 1: the insertion of the new word when the double-array is empty To insert for example ‘bachelor#’, perform the following steps: Step 1. Start at position 1 of BASE in the double-array. The value for the character ‘b’ is 3, so: BASE [1]+‘b’ =BASE [l]+3=l+3=4, and CHECK [4]=0 ¹ l Step 2. The value 0 in CHECK [4] indicates insertion of the rest of the word. That Figure 4. The reduced trie and the double-array with no data 700 J.-I. AOE, K. MORIMOTO AND T. SATO is, ‘b’ is defined on the double-array (by the relation g(l ,‘b’) =4) so store into TAIL the remaining string, ‘achelor#’. Step 3. Set BASE [4] ¬ – POS =–1 to indicate that the rest of the word is stored into TAIL starting from position POS. And set CHECK [4] ¬ 1 to indicate the node number it comes from, within the double-array. Step 4. Set the pointer to TAIL POS ¬ 9 which is the location for the next insertion. Figure 5 shows the reduced trie and the double-array after inserting ‘bachelor#’. Case 2: insertion, when the new word is inserted without collisions Perform the following steps to insert ‘jar#’: Step 1. Start at position 1 of BASE in the double-array. The value for the character ‘j’ is 11, so: BASE [l]+‘j’= BASE [1]+11=1+11=12, and CHECK [12]= 0 ¹ l Step 2. The value 0 in CHECK [12] indicates insertion of the rest of the word implying that no collision with ‘bachelor#’ occurred. Store the remaining string, ‘ar#’, into TAIL from position POS = 9. Step 3. Set BASE [12] ¬ -POS =-9 Figure 5. The reduced trie and the double-array for insertion of ‘bachelor#’ AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 701 to indicate that the rest of the word is stored into TAIL starting from position POS. And set CHECK [12] ¬ 1 to indicate the node number it comes from, within the double-array. Step 4. Set the pointer to TAIL POS ¬ 12 which is the location for the next insertion. As can be observed, there is no difference between case 1 and case 2 insertions so their classification is only conceptual and not operational. The resulting reduced trie and double-array after inserting ‘jar#’ are shown in Figure 6. To study returns the Figure 6. The reduced trie and the double-array for insertion of ‘jar#’ the insertion cases 3 and 4, consider the function X_CHECK(LIST) which minimum integer q such that q> 0 and CHECK [ q+c ] =0 for all c in LIST. q always starts with the value- 1 and has unitary Case 3: insertion, when a collision occurs Perform the following steps for ‘badge#’: Step 1. Start at position 1 of BASE in the character ‘b’ is 3, so: increments at- analysis time. double-array. The value for the BASE [l]+‘b’= BASE [l]+3=l+3=4, and CHECK [4]=1 The non-zero value in CHECK [4] indicates that an arc definition from the node indicated by the value in CHECK [4], i.e. 1, to node 4 exists. Step 2. Since the value found for BASE in step 1 is positive, proceed. The value 4 from step 1 is used as the new index into BASE, but: BASE [4]=–1 702 Step 3. Step 4. Step 5. Step 6. Step 7. Step 8. J.-I. AOE, K. MORIMOTO AND T. SATO This negative value indicates that searching has finished and string com- parison is to be performed. Retrieve from TAIL the string starting at position – BASE [4] = 1, i.e. ‘achelor#’ and compare it with the remaining part of the string to be inserted, i.e. ‘adge#’. As comparison fails, that is the strings are different from each other, insert the common prefix into the double-array as indicated in steps 4, 5, and 6. Save the current value of – BASE [4] into a temporal variable: TEMP ¬ -BASE [4]=1 Calculate X_CHECK [{‘a’}] for the prefix character ‘a’ common to the two strings ‘adge#’ and ‘achelor#’: CHECK [ q+a ]= CHECK [ l+‘a’] = CHECK [l+2]= CHECK [3]=0 The value of q, i.e. 1, is the candidate for new value of BASE [4], and the 0 value of CHECK [3] indicates that node 3 is available, so: Store the new value for BASE [4]: BASE [4] ¬ q = 1 And the new value of CHECK for the available node 3: CHECK [ BASE [4]+‘a’] = CHECK [l+2] = CHECK [3] ¬ 4 This indicates an arc definition from the node value in CHECK [3], i.e. 4, to node 3. Note: Due to the common prefix in this example, steps 5 and 6 are not repeated, but these two steps must be performed as many times as prefix values exist. To store the remaining strings ‘chelor#’ and ‘dge#’, calculate the value to be stored into BASE [3] for two arc labels ‘c’ and ‘d’ according to the closest neighbour available by X_CHECK ( {‘c’ ,‘d’} ) as follows. For ‘c’: CHECK [q+‘c’] = CHECK [l +4] = CHECK [5] =0 Þ available For ‘d’: CHECK [q+‘d’] = CHECK [l+5]= CHECK [6] =0 Þ available as q= 1 is the candidate, set BASE [3] ¬ 1 Calculate the node number CHECK taking as parameter BASE [3]+‘c’=1+4=5 for the reference to ‘chelor#’ in BASE and the first character of the string: BASE [5] ¬ –TEMP= –1 and CHECK [5] ¬ 3 Establishing the reference to TAIL by means of BASE, and the arc defi- nition between states 3 and 5 by means of CHECK. Step 9. Step 10. Step 11. AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 703 Store the rest of the string ‘helor#’ into TAIL starting at position BASE [5] = 1, but TAIL [7] and TAIL [8] become garbage in Figure 7. For the remaining string ‘dge#’: BASE [3]+‘d’=l+5=6 BASE [6] ¬ – POS = – 12 and CHECK [6] ¬ 3 Store ‘ge#’ into TAIL starting at POS . Finally set POS to the new insertion position, at the end of the used part of TAIL. POS ¬ 12+length[‘ge# ’]=12+3=15 In summary, when a collision occurs the prefix common to the collisioned strings is extracted from TAIL and inserted into the double-array. The values within the double-array for the collisioned strings, including the new string, are moved to the nearest neighbour position available and adjusted to such new positions (see Figure 7 ). Case 4: insertion, when a new word is inserted with a collision As in case 3, values in the BASE array must be moved; perform the following steps for ‘baby #’: Step 1. The root node is stored at position 1 of BASE in the double-array, so start at position 1. For the first three characters the values for BASE and CHECK are: BASE [l]+‘b’= BASE [ l]+3=l+3=4, and CHECK [4]=1 BASE [4]+‘a’= BASE [4] +2=l+2=3, and CHECK [3]=4 BASE [3]+‘b’= BASE [3 ]+3=l+3=4, and CHECK [4]=l ¹ 3 The inconsistency in CHECK [4] indicates an undefined status, this means Figure 7. The reduced trie and the double-array for insertion of ‘badge#’ 704 J.-I. AOE, K. MORIMOTO AND T. SATO that the values for nodes 1 and 3 should be modified to allow the new insertion, thus proceed as follows. Step 2. Set a temporal variable TEMP_NODE1 to TEMP_NODE1 ¬ BASE [ 3]+ ‘b’=1+3=4 If CHECK [4] were equal to 0 then this will imply availability, so the case would be a straightforward insertion of the string into TAIL at POS. As this is not the case, do the following. Step 3. Store in a list, having as index number the node number where the inconsistency was found, the characters that correspond to the arcs leaving that node. LlST [3] ¬ [‘c’,‘d’] And in another list, having as index the last CHECK value, the charac- ters that correspond to the arcs leaving that node. LIST [l] ¬ [‘b’,‘j’] Step 4. As the purpose here is to associate the new string with node 3, compare the length of the two lists incrementing the length of LIST [3] by 1. This increment is necessary to consider the new character to be added to node 3. compare (length [ LlST [3]]+ 1, length [ LIST [l]])= compare (3,2) If length [ LlST [3]]+ 1 < length [ LIST [l]] the current node, 3, would be modified. But as length [ LlST [3]] + 1 ³ length [ LIST [l]] modify the alterna- tive node, i.e. node 1, as follows. Step 5. Set a temporal variable for the node referenced by BASE. TEMP_BASE ¬ BASE [l]=l and calculate a new BASE using LIST [l] according to the closest neighbour available as follows: X_CHECK [‘b’]: CHECK [q+‘b’] = CHECK [1+3] =CHECK [4]=l ¹ 0 CHECK [2+3] =CHECK [5]=–l ¹ 0 CHECK [3+3] =CHECK [6]=–14 ¹ 0 CHECK [4+3] =CHECK [7]=0 Þ available and X_CHECK [‘j’]: CHECK [q+‘j’] = CHECK [4+11]=CHECK [15] =0 Þ available as q = 4 is the candidate, set AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 705 BASE [1] ¬ 4 Step 6. For ‘b’, store the value for the states to be modified in temporal variables: TEMP_NODE1 ¬ TEMP_BASE+ ‘b’ =1+3=4 TEMP_NODE2 ¬ BASE[1]+ ‘b’=4+3=7 Copy the BASE value from the original status to the new status: BASE[TEMP_NODE2] ¬ BASE[TEMP_NODE1] i.e. BASE [7] ¬ BASE [4] = 1 And set the CHECK value for the new node: CHECK [ TEMP_NODE2 ] =CHECK [7] ¬ CHECK [4]=1 Step 7. As BASE[TEMP_NODE1] = BASE [4]=1>0 That is, the BASE value for the original status is an internal pointer instead of a pointer to TAIL, calculate the offset w to obtain the node value to be modified to point to the new node. CHECK [ BASE ] TEMP_ NODE1 ]+ w ] =TEMP_NODE1 i.e. CHECK [ BASE [4]+ w ]=4; CHECK [1+ w ] = 4 Þ w =2 and modify CHECK to point to the new status: CHECK [ BASE [4]+2] = CHECK [l+2]= CHECK [3] ¬ TEMP_NODE2 =7 Step 8. Initialize the old BASE and CHECK: BASE [ TEMP_NODE1 ] = BASE [4] ¬ 0 CHECK [ TEMP_NODE1 ] = CHECK [4] ¬ 0 Step 9. For ‘j’, store the value for the states to be modified in temporal variables: TEMP_NODE1 ¬ TEMP_BASE + ‘j’ =1+11=12 TEMP_NODE2 ¬ BASE [1]+ ‘j’ =4+11=15 Copy the BASE value from the original status to the new status: 706 J.-I. AOE, K. MORIMOTO AND T. SATO BASE [ TEMP_NODE2 ¬ BASE [ TEMP_NODE1 ] i.e. BASE [15] ¬ BASE [12]=-9 And set the CHECK value for the new node: CHECK [ TEMP_NODE2 ] = CHECK [15] ¬ CHECK [12]=1 Step 10. As BASE [ TEMP_NODE1 ] =BASE [12]=–9<0 That is, the BASE value for the original status is a pointer to TAIL, so only initialize the old BASE and CHECK: BASE [ TEMP_NODE1 ] = BASE [12] ¬ 0 CHECK [ TEMP_NODE1 ] = CHECK [12 ] ¬ 0 Now the conflict generated by the collision of ‘b’ from ba b y has been solved. Finally, insert the remaining part of the new string ‘by#’ into TAIL. Step 11. Considering the original BASE node number, i.e. 3, where the inconsist- ency was generated (see step 3) as pivot, store in a temporal variable the node number for the reference to the new string: TEMP_NODE ¬ BASE [3]+ ‘b’= 1+3=4 Step 12. Store in the new BASE node the pointer to TAIL for the new string: BASE [ TEMP_NODE ] =BASE [4] ¬ – POS =–15 and in CHECK the internal pointer to the referencing node CHECK [ TEMP_NODE ] =CHECK [4] ¬ 3 Step 13. Insert the new string in TAIL: TAlL [ POS ] =TAlL [15] +‘y#’ Step 14. To finish set the pointer to TAIL to its new value: POS ¬ POS+ length [’y#’] = 15+2= 17 In summary, when a collision occurs the values in the double-array must be moved since there is no room for the specification of the new word, the node that collided, or the previous node, as indicated by CHECK. The node with fewer branches is then moved to another place within the double-array to give space for the specification AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 707 of more nodes to the node that collided. The connecting node numbers are modified to point to the new place within the double-array where the node specification was moved. Finally, the new string is inserted (see Figure 3 ). Deletion Deletion of words from a double-array is also straightforward. Deletion has the same scanning process as in case 2 insertion. In fact, the only difference consists of resetting, for the word to be deleted, the necessary pointers to TAIL within the double-array. For example, to delete the word ‘badge#’ perform the following: Step 1. The root node is stored at position 1 of BASE in the double-array, so start at position 1. For the first three characters the values for BASE and CHECK are The BASE [1]+‘b’= BASE [1]+3=4+3=7, and CHECK [7]=1 BASE [7]+‘a’= BASE [7]+2=l+2=3, and CHECK [3]=7 BASE [3]+‘d’=BASE [3]+5=l+5=6, and CHECK [6]=3 BASE [6] = – 12<0 Þ separate node separate node is the pointer to TAIL. Step 2

                    本文档为【duoble-array】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

duoble-array

你可能还喜欢