Fast Regular Expression Matching using Small TCAMs for Network
Intrusion Detection and Prevention Systems
Chad R. Meiners Jignesh Patel Eric Norige Eric Torng Alex X. Liu
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824-1226, U.S.A.
{meinersc, patelji1, norigeer, torng, alexliu}@cse.msu.edu
Abstract
Regular expression (RE) matching is a core component
of deep packet inspection in modern networking and
security devices. In this paper, we propose the first
hardware-based RE matching approach that uses Ternary
Content Addressable Memories (TCAMs), which are
off-the-shelf chips and have been widely deployed in
modern networking devices for packet classification. We
propose three novel techniques to reduce TCAM space
and improve RE matching speed: transition sharing, ta-
ble consolidation, and variable striding. We tested our
techniques on 8 real-world RE sets, and our results show
that small TCAMs can be used to store large DFAs and
achieve potentially high RE matching throughput. For
space, we were able to store each of the corresponding 8
DFAs with as many as 25,000 states in a 0.59Mb TCAM
chip where the number of TCAM bits required per DFA
state were 12, 12, 12, 13, 14, 26, 28, and 42. Using
a different TCAM encoding scheme that facilitates pro-
cessing multiple characters per transition, we were able
to achieve potential RE matching throughputs of between
10 and 19 Gbps for each of the 8 DFAs using only a sin-
gle 2.36 Mb TCAM chip.
1 Introduction
1.1 Background and Problem Statement
Deep packet inspection is a key part of many networking
devices on the Internet such as Network Intrusion De-
tection (or Prevention) Systems (NIDS/NIPS), firewalls,
and layer 7 switches. In the past, deep packet inspec-
tion typically used string matching as a core operator,
namely examining whether a packet’s payload matches
any of a set of predefined strings. Today, deep packet in-
spection typically uses regular expression (RE) matching
as a core operator, namely examining whether a packet’s
payload matches any of a set of predefined regular ex-
pressions, because REs are fundamentally more expres-
sive, efficient, and flexible in specifying attack signatures
[27]. Most open source and commercial deep packet in-
spection engines such as Snort, Bro, TippingPoint X505,
and many Cisco networking appliances use RE match-
ing. Likewise, some operating systems such as Cisco
IOS and Linux have built RE matching into their layer 7
filtering functions. As both traffic rates and signature set
sizes are rapidly growing over time, fast and scalable RE
matching is now a core network security issue.
RE matching algorithms are typically based on the De-
terministic Finite Automata (DFA) representation of reg-
ular expressions. A DFA is a 5-tuple (Q,Σ, δ, q0, A),
where Q is a set of states, Σ is an alphabet, δ : Q×Σ→
Q is the transition function, q0 is the start state, and
A ⊆ Q is a set of accepting states. Any set of regu-
lar expressions can be converted into an equivalent DFA
with the minimum number of states. The fundamental
issue with DFA-based algorithms is the large amount of
memory required to store transition table δ. We have to
store δ(q, a) = p for each state q and character a.
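To ground the definition, here is a minimal sketch of DFA-based matching with δ stored as an explicit per-(state, character) table; the two-letter DFA below is a toy example invented here for the RE "ab", not one of the paper's RE sets:

```python
# delta stores one destination per (state, character) pair; this explicit
# table is exactly the memory cost that DFA-based RE matching must pay.
def run_dfa(delta, q0, accepting, payload):
    q, matched = q0, False
    for ch in payload:
        q = delta[(q, ch)]
        if q in accepting:
            matched = True   # report the match but keep scanning
    return matched

# Toy 3-state DFA recognizing occurrences of "ab" over the alphabet {a, b}.
delta = {
    (0, 'a'): 1, (0, 'b'): 0,   # state 0: nothing matched yet
    (1, 'a'): 1, (1, 'b'): 2,   # state 1: just saw 'a'
    (2, 'a'): 1, (2, 'b'): 0,   # state 2: accepting, "ab" just seen
}
```

For example, run_dfa(delta, 0, {2}, "aab") reports a match. Storing every (q, a) pair explicitly is what makes the transition table so large for real RE sets.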
Prior RE matching algorithms are either software-
based [4, 6, 7, 12, 16, 18, 19] or FPGA-based [5, 7, 13, 14,
22, 24, 29]. Software-based solutions must be implemented in customized ASIC chips to achieve high speed; such chips have a high deployment cost and are hard-wired to a specific solution, so they have limited ability to adapt to new RE matching solutions. Although FPGA-based solutions can be modified, resynthesizing and updating FPGA circuitry in a deployed system to handle regular expression updates is slow and difficult; this makes FPGA-based solutions hard to deploy in many networking devices (such as NIDS/NIPS and firewalls) where the regular expressions need to be updated frequently [18].
1.2 Our Approach
To address the limitations of prior art on high-speed RE
matching, we propose the first Ternary Content Address-
able Memory (TCAM) based RE matching solution. We
use a TCAM and its associated SRAM to encode the
transitions of the DFA built from an RE set where one
TCAM entry might encode multiple DFA transitions.
TCAM entries and lookup keys are encoded in ternary
as 0’s, 1’s, and *’s where *’s stand for either 0 or 1.
A lookup key matches a TCAM entry if and only if
the corresponding 0’s and 1’s match; for example, key
0001101111 matches entry 000110****. TCAM circuits
compare a lookup key with all occupied entries in parallel and return the index (or sometimes the content) of the first entry that the key matches; this index is then used to retrieve the corresponding decision in SRAM.
Given an RE set, we first construct an equivalent min-
imum state DFA [15]. Second, we build a two column
TCAM lookup table where each column encodes one of
the two inputs to δ: the source state ID and the input char-
acter. Third, for each TCAM entry, we store the destina-
tion state ID in the same entry of the associated SRAM.
Fig. 1 shows an example DFA, its TCAM lookup table,
and its SRAM decision table. We illustrate how this DFA
processes the input stream “01101111, 01100011”. We
form a TCAM lookup key by appending the current input
character to the current source state ID; in this example,
we append the first input character “01101111” to “00”,
the ID of the initial state s0, to form “0001101111”. The
first matching entry is the second TCAM entry, so “01”,
the destination state ID stored in the second SRAM en-
try is returned. We form the next TCAM lookup key
“0101100011” by appending the second input character
“01100011” to this returned state ID “01”, and the pro-
cess repeats.
[The DFA of Figure 1 has three states: s0 goes to s1 on [a,o] and otherwise loops to itself; s1 loops to itself on b, goes to s2 on a and [c,o], and otherwise returns to s0; s2 goes to s1 on [b,c], loops to itself on a and [d,o], and otherwise returns to s0.]

     TCAM                 SRAM
     Src ID  Input        Dst ID
s0:  00      0110 0000    00 (s0)
     00      0110 ****    01 (s1)
     00      **** ****    00 (s0)
s1:  01      0110 0000    00 (s0)
     01      0110 0010    01 (s1)
     01      0110 ****    10 (s2)
     01      **** ****    00 (s0)
s2:  10      0110 0000    00 (s0)
     10      0110 001*    01 (s1)
     10      0110 ****    10 (s2)
     10      **** ****    00 (s0)

Figure 1: A DFA with its TCAM table
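To make the walkthrough above concrete, the following sketch replays the Figure 1 lookups in software (the ternary-match loop stands in for the TCAM's parallel comparison):

```python
# Entries of the Figure 1 TCAM table: a 2-bit source state ID concatenated
# with an 8-bit ternary character pattern; sram holds the destination IDs.
tcam = ["0001100000", "000110****", "00********",                # state s0
        "0101100000", "0101100010", "010110****", "01********",  # state s1
        "1001100000", "100110001*", "100110****", "10********"]  # state s2
sram = ["00", "01", "00", "00", "01", "10", "00", "00", "01", "10", "00"]

def step(state_id, char_bits):
    """One DFA step: form the lookup key, take the first matching entry."""
    key = state_id + char_bits
    for i, pattern in enumerate(tcam):
        if all(p == '*' or p == k for p, k in zip(pattern, key)):
            return sram[i]

# Replay the example input stream "01101111, 01100011" from state s0 (00).
s = step("00", "01101111")   # second entry matches, returning "01" (s1)
s = step(s, "01100011")      # returns "10" (s2)
```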
Advantages of TCAM-based RE Matching There
are three key reasons why TCAM-based RE matching
works well. First, a small TCAM is capable of encoding
a large DFA with carefully designed algorithms lever-
aging the ternary nature and first-match semantics of
TCAMs. Our experimental results show that each of the
DFAs built from 8 real-world RE sets with as many as
25,000 states, 4 of which were obtained from the authors
of [6], can be stored in a 0.59Mb TCAM chip. The two
DFAs that correspond to primarily string matching RE
sets require 28 and 42 TCAM bits per DFA state; 5 of
the remaining 6 DFAs which have a sizeable number of
‘.*’ patterns require 12 to 14 TCAM bits per DFA state
whereas the 6th DFA requires 26 TCAM bits per DFA
state. Second, TCAMs facilitate high-speed RE matching
because TCAMs are essentially high-performance paral-
lel lookup systems: any lookup takes constant time (i.e.,
a few CPU cycles) regardless of the number of occupied
entries. Using Agrawal and Sherwood’s TCAM model
[1] and the resulting required TCAM sizes for the 8 RE
sets, we show that it may be possible to achieve through-
puts ranging between 5.36 and 18.6 Gbps using only a
single 2.36 Mb TCAM chip. Third, because TCAMs are
off-the-shelf chips that are widely deployed in modern
networking devices, it should be easy to design network-
ing devices that include our TCAM based RE matching
solution. It may even be possible to immediately deploy
our solution on some existing devices.
Technical Challenges There are two key technical
challenges in TCAM-based RE matching. The first is en-
coding a large DFA in a small TCAM. Directly encoding
a DFA in a TCAM using one TCAM entry per transi-
tion will lead to a prohibitive amount of TCAM space.
For example, consider a DFA with 25000 states that con-
sumes one 8 bit character per transition. We would need
a total of 140.38 Mb (= 25000 × 2^8 × (8 + ⌈log2 25000⌉) bits).
This is infeasible given the largest available TCAM chip
has a capacity of only 72 Mb. To address this challenge,
we use two techniques that minimize the TCAM space
for storing a DFA: transition sharing and table consol-
idation. The second challenge is improving RE match-
ing speed and thus throughput. One way to improve the
throughput by up to a factor of k is to use k-stride DFAs
that consume k input characters per transition. However,
this leads to an exponential increase in both state and
transition spaces. To avoid this space explosion, we use
the novel idea of variable striding.
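The space estimate for the naive encoding above can be reproduced directly:

```python
import math

# Naive encoding: one TCAM entry per transition. Each of the 25,000 states
# has 2^8 outgoing transitions, and each entry is the 8-bit character plus
# a ceil(log2(25000))-bit source state ID.
states = 25000
entry_bits = 8 + math.ceil(math.log2(states))   # 8 + 15 = 23 bits
total_bits = states * 2**8 * entry_bits         # 147,200,000 bits
print(round(total_bits / 2**20, 2))             # 140.38 Mb
```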
Key Idea 1 - Transition Sharing The basic idea is to
combine multiple transitions into one TCAM entry by
exploiting two properties of DFA transitions: (1) char-
acter redundancy where many transitions share the same
source state and destination state and differ only in their
character label, and (2) state redundancy where many
transitions share the same character label and destina-
tion state and differ only in their source state. One rea-
son for the pervasive character and state redundancy in
DFAs constructed from real-world RE sets is that most
states have most of their outgoing transitions going to
some common “failure” state; such transitions are often
called default transitions. The low entropy of these DFAs
opens optimization opportunities. We exploit character
redundancy by character bundling (i.e., input character
sharing) and state redundancy by shadow encoding (i.e.,
source state sharing). In character bundling, we use a
ternary encoding of the input character field to repre-
sent multiple characters and thus multiple transitions that
share the same source and destination states. In shadow
encoding, we use a ternary encoding for the source state
ID to represent multiple source states and thus multiple
transitions that share the same label and destination state.
Key Idea 2 - Table Consolidation The basic idea is
to merge multiple transition tables into one transition
table using the observation that some transition tables
share similar structures (e.g., common entries) even if
they have different decisions. This shared structure can
be exploited by consolidating similar transition tables
into one consolidated transition table. When we con-
solidate k TCAM lookup tables into one consolidated
TCAM lookup table, we store k decisions in the asso-
ciated SRAM decision table.
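A minimal sketch of this idea, assuming two hypothetical per-state tables that happen to share identical predicates (the paper consolidates tables that are merely similar), where the SRAM entry stores both decisions and a per-lookup selector picks the right one:

```python
# Hypothetical per-state rule lists with the same ternary predicates but
# different decisions; patterns and state names are invented for illustration.
table_a = [("0110****", "s1"), ("********", "s0")]   # state x's rules
table_b = [("0110****", "s2"), ("********", "s0")]   # state y's rules

# One consolidated TCAM table; each SRAM entry now holds k = 2 decisions.
consolidated = [(pred_a, (dec_a, dec_b))
                for (pred_a, dec_a), (_, dec_b) in zip(table_a, table_b)]

def lookup(table, key, which):
    """First-match lookup; 'which' selects the decision for the original table."""
    for pred, decisions in table:
        if all(p == '*' or p == k for p, k in zip(pred, key)):
            return decisions[which]
```

With which=0 a lookup behaves like state x's table, with which=1 like state y's, yet only one copy of the predicates occupies TCAM space.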
Key Idea 3 - Variable Striding The basic idea is to
store transitions with a variety of strides in the TCAM so
that we increase the average number of characters con-
sumed per transition while ensuring all the transitions fit
within the allocated TCAM space. This idea is based on
two key observations. First, for many states, we can cap-
ture many but not all k-stride transitions using relatively
few TCAM entries whereas capturing all k-stride tran-
sitions requires prohibitively many TCAM entries. Sec-
ond, with TCAMs, we can store transitions with different
strides in the same TCAM lookup table.
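A minimal sketch of the second observation, with an invented toy table (the patterns and state IDs are illustrative, not the paper's construction): every entry is padded out to the maximum stride, and the SRAM decision records both the destination and how many characters the entry consumes:

```python
# Hypothetical mixed-stride table for a 1-bit state ID and up to 2 input
# characters per lookup; '*' padding lets 1-stride entries coexist with
# 2-stride entries in the same table. All patterns here are invented.
table = [
    ("0" + "01100001" + "01100010", ("s0", 2)),  # on "ab": consume 2 chars
    ("0" + "01100001" + "********", ("s1", 1)),  # on "a" + anything: 1 char
    ("0" + "********" + "********", ("s0", 1)),  # default: consume 1 char
]

def vs_lookup(state_bits, next16):
    """Return (destination, stride) from the first matching entry."""
    key = state_bits + next16
    for pattern, decision in table:
        if all(p == '*' or p == k for p, k in zip(pattern, key)):
            return decision

bits = lambda s: "".join(format(ord(c), "08b") for c in s)
```

Listing longer-stride entries first lets first-match semantics prefer them; the stride in the decision tells the scanner how far to advance the input pointer.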
The rest of this paper proceeds as follows. We review
related work in Section 2. In Sections 3, 4, and 5, we
describe transition sharing, table consolidation, and vari-
able striding, respectively. We present implementation
issues, experimental results, and conclusions in Sections
6, 7, and 8, respectively.
2 Related Work
In the past, deep packet inspection typically used string
matching (often called pattern matching) as a core op-
erator; string matching solutions have been extensively
studied [2, 3, 28, 30, 32, 33, 35]. TCAM-based solutions
have been proposed for string matching, but they do not
generalize to RE matching because they only deal with
independent strings [3, 30, 35].
Today deep packet inspection often uses RE match-
ing as a core operator because strings are no longer ad-
equate to precisely describe attack signatures [25, 27].
Prior work on RE matching falls into two categories:
software-based and FPGA-based. Prior software-based
RE matching solutions focus on either reducing mem-
ory by minimizing the number of transitions/states or
improving speed by increasing the number of characters
per lookup. Such solutions can be implemented on gen-
eral purpose processors, but customized ASIC chip im-
plementations are needed for high speed performance.
For transition minimization, two basic approaches have
been proposed: alphabet encoding that exploits charac-
ter redundancy [6, 7, 12, 16] and default transitions that
exploit state redundancy [4, 6, 18, 19]. Previous alphabet
encoding approaches cannot fully exploit local charac-
ter redundancy specific to each state. Most use a sin-
gle alphabet encoding table that can only exploit global
character redundancy that applies to every state. Kong
et al. proposed using 8 alphabet encoding tables by par-
titioning the DFA states into 8 groups with each group
having its own alphabet encoding table [16]. Our work
improves upon previous alphabet encoding techniques
because we can exploit local character redundancy spe-
cific to each state. Our work improves upon the default
transition work because we do not need to worry about
the number of default transitions that a lookup may go
through because TCAMs allow us to traverse an arbitrar-
ily long default transition path in a single lookup. Some
transition sharing ideas have been used in some TCAM-
based string matching solutions for Aho-Corasick-based
DFAs [3, 11]. However, these ideas do not easily ex-
tend to DFAs generated by general RE sets, and our
techniques produce at least as much transition sharing
when restricted to string matching DFAs. For state min-
imization, two fundamental approaches have been pro-
posed. One approach is to first partition REs into multi-
ple groups and build a DFA from each group; at run time,
packet payload needs to be scanned by multiple DFAs
[5, 26, 34]. This approach is orthogonal to our work and
can be used in combination with our techniques. In par-
ticular, because our techniques achieve greater compres-
sion of DFAs than previous software-based techniques,
less partitioning of REs will be required. The other ap-
proach is to use scratch memory to store variables that
track the traversal history and avoid some duplication of
states [8,17,25]. The benefit of state reduction for scratch
memory-based FAs does not come for free. The size of
the required scratch memory may be significant, and the
time required to update the scratch memory after each
transition may be significant. This approach is orthogo-
nal to our approach. While we have only applied our
techniques to DFAs in this initial study of TCAM-based
RE matching, our techniques may work very well with
scratch memory-based automata.
Prior FPGA-based solutions exploit the parallel pro-
cessing capabilities of FPGA technology to implement
nondeterministic finite automata (NFA) [5, 7, 13, 14, 22,
24,29] or parallel DFAs [23]. While NFAs are more com-
pact than DFAs, they require more memory bandwidth
to process each transition as an NFA may be in multiple
states whereas a DFA is always only in one state. Thus,
each character that is processed might be processed in
up to |Q| transition tables. Prior work has looked at
ways for finding good NFA representations of the REs
that limit the number of states that need to be processed
simultaneously. However, FPGAs cannot be quickly re-
configured, and they have clock speeds that are slower
than ASIC chips.
There has been work [7, 12] on creating multi-stride
DFAs and NFAs. This work primarily applies to FPGA
NFA implementations since multiple character SRAM
based DFAs have only been evaluated for a small number
of REs. The ability to increase stride has been limited
by the constraint that all transitions must be increased
in stride; this leads to excessive memory explosion for
strides larger than 2. With variable striding, we increase
stride selectively on a state by state basis. Alicherry et al.
have explored variable striding for TCAM-based string
matching solutions [3] but not for DFAs that apply to ar-
bitrary RE sets.
3 Transition Sharing
The basic idea of transition sharing is to combine mul-
tiple transitions into a single TCAM entry. We pro-
pose two transition sharing ideas: character bundling and
shadow encoding. Character bundling exploits intra-state
optimization opportunities and minimizes TCAM tables
along the input character dimension. Shadow encoding
exploits inter-state optimization opportunities and mini-
mizes TCAM tables along the source state dimension.
3.1 Character Bundling
Character bundling exploits character redundancy by
combining multiple transitions from the same source
state to the same destination into one TCAM entry. Char-
acter bundling consists of four steps. (1) Assign each
state a unique ID of ⌈log |Q|⌉ bits. (2) For each state,
enumerate all 256 transition rules where for each rule,
the predicate is a transition’s label and the decision is the
destination state ID. (3) For each state, treating the 256
rules as a 1-dimensional packet classifier and leveraging
the ternary nature and first-match semantics of TCAMs,
we minimize the number of transitions using the op-
timal 1-dimensional TCAM minimization algorithm in
[20, 31]. (4) Concatenate the |Q| 1-dimensional minimal
prefix classifiers together by prepending each rule with
its source state ID. The resulting list can be viewed as a
2-dimensional classifier where the two fields are source
state ID and transition label and the decision is the des-
tination state ID. Fig. 1 shows an example DFA and its
TCAM lookup table built using character bundling. The
three chunks of TCAM entries encode the 256 transi-
tions for s0, s1, and s2, respectively. Without character
bundling, we would need 256× 3 entries.
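To make step (3) concrete, here is a deliberately simplified sketch of character bundling. It is not the optimal 1-dimensional TCAM minimization algorithm of [20, 31] (which also exploits first-match shadowing); it merely covers a state's 256-entry transition row with aligned ternary prefix blocks of uniform decision, so it may emit more entries than the optimum:

```python
def prefix_cover(row, lo=0, size=256, bits=8):
    """Cover [lo, lo+size) of a transition row with aligned prefix blocks:
    if all destinations in the block agree, emit one ternary entry;
    otherwise recurse on the two halves."""
    dests = {row[c] for c in range(lo, lo + size)}
    if len(dests) == 1:
        fixed = bits - (size.bit_length() - 1)   # number of fixed high bits
        pattern = format(lo, f"0{bits}b")[:fixed] + "*" * (bits - fixed)
        return [(pattern, dests.pop())]
    half = size // 2
    return (prefix_cover(row, lo, half, bits) +
            prefix_cover(row, lo + half, half, bits))

# State s0 of Fig. 1: characters 0x61..0x6F ([a,o]) go to s1, the rest to s0.
row = ["s1" if 0x61 <= c <= 0x6F else "s0" for c in range(256)]
rules = prefix_cover(row)

def classify(c):
    b = format(c, "08b")
    for pattern, dest in rules:
        if all(p == '*' or p == x for p, x in zip(pattern, b)):
            return dest
```

For s0 this disjoint cover needs 9 entries, whereas the shadowing-aware minimization behind Fig. 1 needs only 3, which is why the paper uses the optimal algorithm.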
3.2 Shadow Encoding
Whereas character bundling uses ternary codes in the in-
put character field to encode multiple input characters,
shadow encoding uses ternary codes in the source state
ID field to encode multiple source states.
3.2.1 Observations
We use our running example in Fig. 1 to illustrate shadow
encoding. We observe that all transitions with source
states s1 and s2 have the same destination state except
for the transitions on character c. Likewise, source state
s0 differs from source states s1 and s2 only in the char-
acter range [a, o]. This implies there is a lot of state re-
dundancy. The table in Fig. 2 shows how we can ex-
ploit state redundancy to further reduce required TCAM
space. First, since states s1 and s2 are more similar, we
give them the state IDs 00 and 01, respectively. State
s2 uses the ternary code of 0* in the state ID field of its
TCAM entries to share transitions with state s1. We give
state s0 the state ID of 10, and it uses the ternary code of
∗∗ in the state ID field of its TCAM entries to share tran-
sitions with both states s1 and s2. Second, we order the
state tables in the TCAM so that state s1 is first, state s2
is second, and state s0 is last. This facilitates the sharing
of transitions among different states where earlier states
have incomplete tables deferring some transitions to later
tables.
     TCAM                        SRAM
     Src State ID  Input         Dest State ID
s1:  00            0110 0011     01 : s2
     0*            0110 001*     00 : s1
s2:  0*            0110 0000     10 : s0
     0*            0110 ****     01 : s2
s0:  **            0110 0000     10 : s0
     **            0110 ****     00 : s1
     **            **** ****     10 : s0

Figure 2: TCAM table with shadow encoding
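The correctness of this encoding can be checked exhaustively in a few lines; the sketch below compares the Figure 2 table, first-match semantics included, against a direct transcription of the Figure 1 DFA:

```python
# The Figure 2 table: ternary source state code + ternary char pattern.
shadow = [
    ("00" + "01100011", "s2"),   # s1 on 'c'
    ("0*" + "0110001*", "s1"),   # s1 and s2 on [b,c] not matched above
    ("0*" + "01100000", "s0"),
    ("0*" + "0110****", "s2"),
    ("**" + "01100000", "s0"),
    ("**" + "0110****", "s1"),
    ("**" + "********", "s0"),
]
state_id = {"s1": "00", "s2": "01", "s0": "10"}   # binary IDs from Fig. 2

def dfa(q, c):   # the Figure 1 DFA, written out directly
    if q == "s0":
        return "s1" if 0x61 <= c <= 0x6F else "s0"
    if q == "s1":
        if c == 0x62: return "s1"
        if c == 0x61 or 0x63 <= c <= 0x6F: return "s2"
        return "s0"
    # q == "s2"
    if 0x62 <= c <= 0x63: return "s1"
    if c == 0x61 or 0x64 <= c <= 0x6F: return "s2"
    return "s0"

def shadow_lookup(q, c):
    """First-match lookup in the shadow-encoded table."""
    key = state_id[q] + format(c, "08b")
    for pattern, dest in shadow:
        if all(p == '*' or p == k for p, k in zip(pattern, key)):
            return dest
```

Seven shadow-encoded entries replace the eleven entries of Figure 1 while giving identical destinations for every state and character.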
We must solve three problems to implement shadow
encoding: (1) Find the best order of the state tables in
the TCAM given that any order is allowed. (2) Identify
entries to remove from each state table given this order.
(3) Choose binary IDs and ternary codes for each state
that support the given order and removed entries. We
solve these problems in the rest of this section.
Our shadow encoding technique builds upon prior
work with default transitions.