SOFIWARE—PRACTICE AND EXPERIENCE, VOL. 22(9), 695–721 (SEPTEMBER 1992)
An Efficient Implementation of Trie
Structures
JUN-ICHI AOE AND KATSUSHI MORIMOTO
Department of Information Science and Intelligent Systemsj University of Tokushima,
Minami-Josanjima- Cho, Tokushima-Shi 770, Japan
AND
TAKASHI SATO
Department of Arts and Sciences, Osaka Kyoiku University, 3-1-1, Ikeda 563, Japan
SUMMARY
A new internal array structure, called a double-array, implementing a trie structure is presented. The
double-array combines the fast access of a matrix form with the compactness of a list form. The
algorithms for retrieval, insertion and deletion are introduced through examples. Although insertion is
rather slow, it is still practical, and both the deletion and the retrieval time can be improved from the
list form. From the comparison with the list for various large sets of keys, it is shown that the size of
the double-array can be about 17 per cent smaller than that of the list, and that the retrieval speed of
the double-array can be from 3·1 to 5·1 times faster than that of the list.
KEY WORDS Dictionary Information retrieval Key retrieval strategies Natural language processing
INTRODUCTION
In many information retrieval applications, it is necessary to be able to adopt a trie
search l’2 for looking at the input character by character. Examples include a lexical
analyser of a compiler, a bibliographic search 3,5 a spelling checker, 6 and morphologi-
cal dictionaries 7 in natural language processing, and so on. Each node of the trie is
an array indexed by the next ‘item’. The element indexed is a final-state flag, plus
a pointer to either a new node or a null pointer. Figure 1 gives an example of an
array-structured trie for the set K = {baby, bachelor, badge, jar}. Retrieval, deletion
and insertion on the trie are very fast, but it takes lots of space because the space
complexity is proportional to the product of the number of nodes and the number
of characters. A well-known strategy for compressing the trie is to list the arcs out
of each node, with the null pointer at the end of the list. Figure 2 shows an example
of a list-structured trie for the set K. The list-structured trie enables us to save the
space by use of null pointers of the array-structured trie, but the retrieval becomes
slow if there are many arcs leaving each node.
This paper presents a technique of compressing the trie into two one-dimensional
arrays BASE and CHECK called a double-array. In the double-array, non-empty
locations of node n are mapped, by the array BASE, into the array CHECK such that
0038–0644/92/090695–27$18.50 Received 27 February 1990
© 1992 by John Wiley & Sons, Ltd. Revised 14 April 1992
696 J.-I. AOE, K. MORIMOTO AND T. SATO
Figure 1. An array-structured trie for bachelor, baby, badge, jar
no two non-empty locations in each node are mapped to the same position in CHECK.
Each arc of the trie can be retrieved from the double-array in 0 (1) time, that is,
the worst-case time complexity for retrieving a key becomes 0 ( k ) for the length k
of that key. The trie has many nodes for a large set of keys, so it is important to
make the double-array compact. In order to implement the trie for a large set of
keys, the double-array stores only as much of the prefix in the trie as is necessary
to disambiguate the key, and the tail of the key not needed for further disambiguation
is stored in a string array, denoted as TAIL.
REPRESENTATION OF A TRIE
A trie is a tree structure in which each path from the root to a leaf corresponds to
one key in the represented set. The paths in the trie correspond to the characters
of the keys in the set. To avoid confusion between words like ‘the’ and ‘then’, a
special endmarker symbol, #, is used at the end of each word in the set.
The following definitions will be used in the explanations that follow. K is the set
of keys represented by the trie. The trie is formed of nodes and arcs. Arc labels
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 697
Figure 2. A list-structured trie for bachelor, baby, badge, jar
consist of symbols, called characters. An arc labeled a from node n to m is rep-
resented by the notation g(n, a)=m.
For a key in K, the node m with g(n,a)=m is a separate node if a is a sufficient
character (or arc label) for distinguishing that key from all others in K. The concat-
enation of the arc labels from separate node m to the terminal node is called a single
string for m, denoted as STR[ m ]. The characters of key K remaining after the single
string is deleted from K are called the tail of K. A tree constructed only of the arcs
from the root to the separate nodes for all keys in K is called the reduced trie.
An example of a reduced trie for the set K= {baby#, bachelor#, badge#, jar#}
is shown in Figure 3. This same reduced trie representation is also shown in Figure
3, using a double-array and an array of characters for tail storage. Question marks
(?) in TAIL indicate garbage; their use will be explaining when analysing the insertion
and deletion algorithms.
The following relations between the reduced trie and the double-array shown in
Figure 3 exist:
698 J.-I. AOE, K. MORIMOTO AND T. SATO
Figure 3. The reduced trie and the double-array for K
1. If there is an arc g(n,a) =m on the reduced trie, then BASE [ n ] +a=m and
CHECK [ m ]= n. {For the arc labels: ‘#’=1, ‘a’=2, ‘b’=3, ‘c’=4, etc. }
2. If the node m is a separate node such that the tail string STR [ m ] = b 1, b2, . . . .
b h t h e n (a)
(b)
These two
Retrieval
BASE [ m ]< 0
let, p =– BASE [ m ], TAIL [ p ]= b l, TAIL [ p +l]= b 2, . . . . TAIL [ p+h -1]= bh
relations will remain throughout this paper.
Retrieval operations using the double array are straightforward. For example, to
retrieve ‘bachelor#’ from the double-array shown in Figure 3, the following steps
are performed:
Step 1.
Step 2.
Steps 3,4.
Step 5.
Store the root node at position 1 of BASE in the double-array, and
start at that position. The value for the character ‘b’ is 3, so from
relation 1 above
BASE [ n ]+ a =BASE [ l]+ ‘b’= BASE [l]+3=4+3=7
Observe that CHECK [7] = 1
Since the value found for BASE in step 1 is positive, proceed. Use the
value 7 from step 1 as the new index into BASE, and the value 2 for
character ‘a’, so:
BASE [7]+ ‘a’= BASE [7] +2=l+2=3, and CHECK [3]=7
Proceed as above, using 4 for ‘c’:
BASE [3]+ ‘c’= BASE [3]+4=1+4=5, and CHECK [5]=3
The value in BASE [5] is – 1. A negative value indicates that the rest of
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 699
the word is located in TAIL, starting at TAIL [– BASE [5]] = TAIL [l]. The
other words in the list can be retrieved using a similar technique,
always starting at the root node at position 1 in BASE.
Observe that retrieval involves only direct array lookups (no searching is required)
and addition, making retrievals extremely efficient in this implementation.
Insertion
Insertion into a double-array is also straightforward. During insertion, any of the
four cases below arises:
1. Insertion of the new word when the double-array is empty.
2. Insertion of the new word without any collisions.
3. Insertion of the new word with a collision; in this case, additional characters
must be added to the BASE and characters must be removed from the TAIL
array to resolve the collision, but nothing already in the BASE array must be
removed.
4. When insertion of the new word with a collision as in case 3 occurs, values in
the BASE array must be moved.
A collision indicates that two different characters have the same index value within
the double-array. These four insertion cases will be demonstrated by adding ‘bach-
elor’ (case 1), ‘jar#’ (case 2), ‘badge#’ (case 3) and ‘baby#’ (case 4) to an empty
double-array structure like the one shown in Figure 4. We define the size denoted
by DA_SIZE, of the double-array as the maximum index of the non-zero entries of
CHECK. Note that the BASE and CHECK entries exceeding the DA_SIZE can be
allocated dynamically as zero entries if needed.
Case 1: the insertion of the new word when the double-array is empty
To insert for example ‘bachelor#’, perform the following steps:
Step 1. Start at position 1 of BASE in the double-array. The value for the character
‘b’ is 3, so:
BASE [1]+‘b’ =BASE [l]+3=l+3=4, and CHECK [4]=0 ¹ l
Step 2. The value 0 in CHECK [4] indicates insertion of the rest of the word. That
Figure 4. The reduced trie and the double-array with no data
700 J.-I. AOE, K. MORIMOTO AND T. SATO
is, ‘b’ is defined on the double-array (by the relation g(l ,‘b’) =4) so store
into TAIL the remaining string, ‘achelor#’.
Step 3. Set
BASE [4] ¬ – POS =–1
to indicate that the rest of the word is stored into TAIL starting from
position POS. And set
CHECK [4] ¬ 1
to indicate the node number it comes from, within the double-array.
Step 4. Set the pointer to TAIL
POS ¬ 9
which is the location for the next insertion.
Figure 5 shows the reduced trie and the double-array after inserting ‘bachelor#’.
Case 2: insertion, when the new word is inserted without collisions
Perform the following steps to insert ‘jar#’:
Step 1. Start at position 1 of BASE in the double-array. The value for the character
‘j’ is 11, so:
BASE [l]+‘j’= BASE [1]+11=1+11=12, and CHECK [12]= 0 ¹ l
Step 2. The value 0 in CHECK [12] indicates insertion of the rest of the word
implying that no collision with ‘bachelor#’ occurred. Store the remaining
string, ‘ar#’, into TAIL from position POS = 9.
Step 3. Set
BASE [12] ¬ -POS =-9
Figure 5. The reduced trie and the double-array for insertion of ‘bachelor#’
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 701
to indicate that the rest of the word is stored into TAIL starting from
position POS. And set
CHECK [12] ¬ 1
to indicate the node number it comes from, within the double-array.
Step 4. Set the pointer to TAIL
POS ¬ 12
which is the location for the next insertion.
As can be observed, there is no difference between case 1 and case 2 insertions so
their classification is only conceptual and not operational. The resulting reduced trie
and double-array after inserting ‘jar#’ are shown in Figure 6.
To study
returns the
Figure 6. The reduced trie and the double-array for insertion of ‘jar#’
the insertion cases 3 and 4, consider the function X_CHECK(LIST) which
minimum integer q such that q> 0 and CHECK [ q+c ] =0 for all c in LIST.
q always starts with the value- 1 and has unitary
Case 3: insertion, when a collision occurs
Perform the following steps for ‘badge#’:
Step 1. Start at position 1 of BASE in the
character ‘b’ is 3, so:
increments at- analysis time.
double-array. The value for the
BASE [l]+‘b’= BASE [l]+3=l+3=4, and CHECK [4]=1
The non-zero value in CHECK [4] indicates that an arc definition from the
node indicated by the value in CHECK [4], i.e. 1, to node 4 exists.
Step 2. Since the value found for BASE in step 1 is positive, proceed. The value
4 from step 1 is used as the new index into BASE, but:
BASE [4]=–1
702
Step 3.
Step 4.
Step 5.
Step 6.
Step 7.
Step 8.
J.-I. AOE, K. MORIMOTO AND T. SATO
This negative value indicates that searching has finished and string com-
parison is to be performed.
Retrieve from TAIL the string starting at position – BASE [4] = 1, i.e.
‘achelor#’ and compare it with the remaining part of the string to be
inserted, i.e. ‘adge#’. As comparison fails, that is the strings are different
from each other, insert the common prefix into the double-array as
indicated in steps 4, 5, and 6.
Save the current value of – BASE [4] into a temporal variable:
TEMP ¬ -BASE [4]=1
Calculate X_CHECK [{‘a’}] for the prefix character ‘a’ common to the two
strings ‘adge#’ and ‘achelor#’:
CHECK [ q+a ]= CHECK [ l+‘a’] = CHECK [l+2]= CHECK [3]=0
The value of q, i.e. 1, is the candidate for new value of BASE [4], and
the 0 value of CHECK [3] indicates that node 3 is available, so:
Store the new value for BASE [4]:
BASE [4] ¬ q = 1
And the new value of CHECK for the available node 3:
CHECK [ BASE [4]+‘a’] = CHECK [l+2] = CHECK [3] ¬ 4
This indicates an arc definition from the node value in CHECK [3], i.e. 4,
to node 3.
Note: Due to the common prefix in this example, steps 5 and 6 are
not repeated, but these two steps must be performed as many times as
prefix values exist.
To store the remaining strings ‘chelor#’ and ‘dge#’, calculate the value
to be stored into BASE [3] for two arc labels ‘c’ and ‘d’ according to the
closest neighbour available by X_CHECK ( {‘c’ ,‘d’} ) as follows.
For ‘c’: CHECK [q+‘c’] = CHECK [l +4] = CHECK [5] =0 Þ available
For ‘d’: CHECK [q+‘d’] = CHECK [l+5]= CHECK [6] =0 Þ available
as q= 1 is the candidate, set
BASE [3] ¬ 1
Calculate the node number
CHECK taking as parameter
BASE [3]+‘c’=1+4=5
for the reference to ‘chelor#’ in BASE and
the first character of the string:
BASE [5] ¬ –TEMP= –1 and CHECK [5] ¬ 3
Establishing the reference to TAIL by means of BASE, and the arc defi-
nition between states 3 and 5 by means of CHECK.
Step 9.
Step 10.
Step 11.
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 703
Store the rest of the string ‘helor#’ into TAIL starting at position
BASE [5] = 1, but TAIL [7] and TAIL [8] become garbage in Figure 7.
For the remaining string ‘dge#’:
BASE [3]+‘d’=l+5=6
BASE [6] ¬ – POS = – 12 and CHECK [6] ¬ 3
Store ‘ge#’ into TAIL starting at POS .
Finally set POS to the new insertion position, at the end of the used part
of TAIL.
POS ¬ 12+length[‘ge# ’]=12+3=15
In summary, when a collision occurs the prefix common to the collisioned strings
is extracted from TAIL and inserted into the double-array. The values within the
double-array for the collisioned strings, including the new string, are moved to
the nearest neighbour position available and adjusted to such new positions (see
Figure 7 ).
Case 4: insertion, when a new word is inserted with a collision
As in case 3, values in the BASE array must be moved; perform the following
steps for ‘baby #’:
Step 1. The root node is stored at position 1 of BASE in the double-array, so
start at position 1. For the first three characters the values for BASE and
CHECK are:
BASE [l]+‘b’= BASE [ l]+3=l+3=4, and CHECK [4]=1
BASE [4]+‘a’= BASE [4] +2=l+2=3, and CHECK [3]=4
BASE [3]+‘b’= BASE [3 ]+3=l+3=4, and CHECK [4]=l ¹ 3
The inconsistency in CHECK [4] indicates an undefined status, this means
Figure 7. The reduced trie and the double-array for insertion of ‘badge#’
704 J.-I. AOE, K. MORIMOTO AND T. SATO
that the values for nodes 1 and 3 should be modified to allow the new
insertion, thus proceed as follows.
Step 2. Set a temporal variable TEMP_NODE1 to
TEMP_NODE1 ¬ BASE [ 3]+ ‘b’=1+3=4
If CHECK [4] were equal to 0 then this will imply availability, so the case
would be a straightforward insertion of the string into TAIL at POS. As
this is not the case, do the following.
Step 3. Store in a list, having as index number the node number where the
inconsistency was found, the characters that correspond to the arcs
leaving that node.
LlST [3] ¬ [‘c’,‘d’]
And in another list, having as index the last CHECK value, the charac-
ters that correspond to the arcs leaving that node.
LIST [l] ¬ [‘b’,‘j’]
Step 4. As the purpose here is to associate the new string with node 3, compare
the length of the two lists incrementing the length of LIST [3] by 1. This
increment is necessary to consider the new character to be added to node
3.
compare (length [ LlST [3]]+ 1, length [ LIST [l]])= compare (3,2)
If length [ LlST [3]]+ 1 < length [ LIST [l]] the current node, 3, would be
modified. But as length [ LlST [3]] + 1 ³ length [ LIST [l]] modify the alterna-
tive node, i.e. node 1, as follows.
Step 5. Set a temporal variable for the node referenced by BASE.
TEMP_BASE ¬ BASE [l]=l
and calculate a new BASE using LIST [l] according to the closest neighbour
available as follows:
X_CHECK [‘b’]: CHECK [q+‘b’]
= CHECK [1+3] =CHECK [4]=l ¹ 0
CHECK [2+3] =CHECK [5]=–l ¹ 0
CHECK [3+3] =CHECK [6]=–14 ¹ 0
CHECK [4+3] =CHECK [7]=0 Þ available
and
X_CHECK [‘j’]: CHECK [q+‘j’] = CHECK [4+11]=CHECK [15]
=0 Þ available
as q = 4 is the candidate, set
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 705
BASE [1] ¬ 4
Step 6. For ‘b’, store the value for the states to be modified in temporal variables:
TEMP_NODE1 ¬ TEMP_BASE+ ‘b’ =1+3=4
TEMP_NODE2 ¬ BASE[1]+ ‘b’=4+3=7
Copy the BASE value from the original status to the new status:
BASE[TEMP_NODE2] ¬ BASE[TEMP_NODE1]
i.e.
BASE [7] ¬ BASE [4] = 1
And set the CHECK value for the new node:
CHECK [ TEMP_NODE2 ] =CHECK [7] ¬ CHECK [4]=1
Step 7. As
BASE[TEMP_NODE1] = BASE [4]=1>0
That is, the BASE value for the original status is an internal pointer
instead of a pointer to TAIL, calculate the offset w to obtain the node
value to be modified to point to the new node.
CHECK [ BASE ] TEMP_ NODE1 ]+ w ] =TEMP_NODE1
i.e.
CHECK [ BASE [4]+ w ]=4; CHECK [1+ w ] = 4 Þ w =2
and modify CHECK to point to the new status:
CHECK [ BASE [4]+2] = CHECK [l+2]= CHECK [3] ¬ TEMP_NODE2 =7
Step 8. Initialize the old BASE and CHECK:
BASE [ TEMP_NODE1 ] = BASE [4] ¬ 0
CHECK [ TEMP_NODE1 ] = CHECK [4] ¬ 0
Step 9. For ‘j’, store the value for the states to be modified in temporal variables:
TEMP_NODE1 ¬ TEMP_BASE + ‘j’ =1+11=12
TEMP_NODE2 ¬ BASE [1]+ ‘j’ =4+11=15
Copy the BASE value from the original status to the new status:
706 J.-I. AOE, K. MORIMOTO AND T. SATO
BASE [ TEMP_NODE2 ¬ BASE [ TEMP_NODE1 ]
i.e.
BASE [15] ¬ BASE [12]=-9
And set the CHECK value for the new node:
CHECK [ TEMP_NODE2 ] = CHECK [15] ¬ CHECK [12]=1
Step 10. As
BASE [ TEMP_NODE1 ] =BASE [12]=–9<0
That is, the BASE value for the original status is a pointer to TAIL, so
only initialize the old BASE and CHECK:
BASE [ TEMP_NODE1 ] = BASE [12] ¬ 0
CHECK [ TEMP_NODE1 ] = CHECK [12 ] ¬ 0
Now the conflict generated by the collision of ‘b’ from ba b y has been solved. Finally,
insert the remaining part of the new string ‘by#’ into TAIL.
Step 11. Considering the original BASE node number, i.e. 3, where the inconsist-
ency was generated (see step 3) as pivot, store in a temporal variable
the node number for the reference to the new string:
TEMP_NODE ¬ BASE [3]+ ‘b’= 1+3=4
Step 12. Store in the new BASE node the pointer to TAIL for the new string:
BASE [ TEMP_NODE ] =BASE [4] ¬ – POS =–15
and in CHECK the internal pointer to the referencing node
CHECK [ TEMP_NODE ] =CHECK [4] ¬ 3
Step 13. Insert the new string in TAIL:
TAlL [ POS ] =TAlL [15] +‘y#’
Step 14. To finish set the pointer to TAIL to its new value:
POS ¬ POS+ length [’y#’] = 15+2= 17
In summary, when a collision occurs the values in the double-array must be moved
since there is no room for the specification of the new word, the node that collided,
or the previous node, as indicated by CHECK. The node with fewer branches is then
moved to another place within the double-array to give space for the specification
AN EFFICIENT IMPLEMENTATION OF TRIE STRUCTURES 707
of more nodes to the node that collided. The connecting node numbers are modified
to point to the new place within the double-array where the node specification was
moved. Finally, the new string is inserted (see Figure 3 ).
Deletion
Deletion of words from a double-array is also straightforward. Deletion has the
same scanning process as in case 2 insertion. In fact, the only difference consists of
resetting, for the word to be deleted, the necessary pointers to TAIL within the
double-array.
For example, to delete the word ‘badge#’ perform the following:
Step 1. The root node is stored at position 1 of BASE in the double-array, so start
at position 1. For the first three characters the values for BASE and CHECK
are
The
BASE [1]+‘b’= BASE [1]+3=4+3=7, and CHECK [7]=1
BASE [7]+‘a’= BASE [7]+2=l+2=3, and CHECK [3]=7
BASE [3]+‘d’=BASE [3]+5=l+5=6, and CHECK [6]=3
BASE [6] = – 12<0 Þ separate node
separate node is the pointer to TAIL.
Step 2
本文档为【duoble-array】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。