CERIAS Tech Report 2004-13
LINGUISTIC STEGANOGRAPHY: SURVEY, ANALYSIS, AND ROBUSTNESS
CONCERNS FOR HIDING INFORMATION IN TEXT
by Krista Bennett
Center for Education and Research in
Information Assurance and Security,
Purdue University, West Lafayette, IN 47907-2086
Krista Bennett
Department of Linguistics
Purdue University
West Lafayette, IN 47906
kbennett@cerias.purdue.edu
Abstract. Steganography is an ancient art. With the advent of computers, we have vast
accessible bodies of data in which to hide information, and increasingly sophisticated
techniques with which to analyze and recover that information. While much of the recent
research in steganography has been centered on hiding data in images, many of the
solutions that work for images are more complicated when applied to natural language
text as a cover medium. Many approaches to steganalysis attempt to detect statistical
anomalies in cover data which predict the presence of hidden information. Natural
language cover texts must not only pass the statistical muster of automatic analysis, but
also the minds of human readers. Linguistically naïve approaches to the problem use
statistical frequency of letter combinations or random dictionary words to encode
information. More sophisticated approaches use context-free grammars to generate
syntactically correct cover text which mimics the syntax of natural text. None of these
uses meaning as a basis for generation, and little attention is paid to the semantic
cohesiveness of a whole text as a data point for statistical attack. This paper provides a
basic introduction to steganography and steganalysis, with a particular focus on text
steganography. Text-based information hiding techniques are discussed, providing
motivation for moving toward linguistic steganography and steganalysis. We highlight
some of the problems inherent in text steganography as well as issues with existing
solutions, and describe linguistic problems with character-based, lexical, and syntactic
approaches. Finally, the paper explores how a semantic and rhetorical generation
approach suggests solutions for creating more believable cover texts, presenting some
current and future issues in analysis and generation. The paper is intended to be both
general enough that linguists without training in information security and computer
science can understand the material, and specific enough that the linguistic and
computational problems are described in adequate detail to justify the conclusions
suggested.
Introduction
Steganography is the art of sending hidden or invisible messages. The name is taken from a work
by Trithemius (1462-1516) entitled “Steganographia” and comes from the Greek στεγανό-ς,
γραφ-ειν meaning “covered writing” (Petitcolas et al 1999: 1062, Petitcolas 2000: 2, etc.). The
practice of sending secret messages is nothing new, and attempts to cover the messages by hiding
them in something else (or by making them look like something else) have been made for
millennia. Many of the standard examples used by modern researchers to explain steganography,
in fact, come from the writings of Herodotus. For example, in around 440 BC, Herodotus writes
about Histiæus, who was being held captive and wanted to send a message without being
detected. He shaved the head of his favorite slave, tattooed a message on his scalp, and waited
for the hair to regrow, obscuring the message from guards (Petitcolas 2000: 3). Petitcolas
mentions that this method was in fact still used by Germans in the early 20th century.
Modern steganography is generally understood to deal with electronic media rather than physical
objects and texts. This makes sense for a number of reasons. First of all, because the size of the
information is generally (necessarily) quite small compared to the size of the data in which it
must be hidden (the cover text), electronic media is much easier to manipulate in order to hide
data and extract messages. Secondly, extraction itself can be automated when the data is
electronic, since computers can efficiently manipulate the data and execute the algorithms
necessary to retrieve the messages. Also, because there is simply so much electronic information
available, there are a huge number of potential cover texts available in which to hide
information, and there is a gargantuan amount of data an adversary attempting to find
steganographically hidden messages must process. Electronic data also often includes redundant,
unnecessary, and unnoticed data spaces which can be manipulated in order to hide messages. In a
sense, these data spaces provide a sort of conceptual “hidden compartment” into which secret
messages can be inserted and sent off to the receiver.
This work provides an introduction to steganography in general, and discusses linguistic
steganography in particular. While much of modern steganography focuses on images, audio
signals, and other digital data, there is also a wealth of text sources in which information can be
hidden. While there are various ways in which one may hide information in text, there is a
specific set of techniques which uses the linguistic structure of a text as the space in which
information is hidden. We will discuss text methods, and provide justification for linguistic
solutions. Additionally, we will analyze the state-of-the-art in linguistic steganography, and
discuss both problems with these solutions, and a suggested vector for future solutions.
In section 1, we discuss general steganography and steganalysis, as well as some well-known
areas of steganography. Section 2 discusses the main focus of this paper, text steganography in
general and linguistic steganography in particular. Section 3 explores the linguistic problems
with existing text steganographic methods. Finally, Section 4 gives suggestions for constructing
the next generation of linguistically and statistically robust cover texts based upon the methods
described in Sections 1 and 2, and the issues discussed in Section 3.
1 Steganography, Steganalysis, and Mimicking
Because the focus of this text is on linguistic steganography, it is important to understand just
what we mean by this term. Chapman et al define linguistic steganography as “the art of using
written natural language to conceal secret messages” (Chapman et al 2001: 156). Our definition
is somewhat more specific than this, requiring not only that the steganographic cover be
composed of natural language text of some sort, but that the text itself is either generated to have
a cohesive linguistic structure, or that the cover text is natural language text to begin with. To
further elaborate, we will first introduce steganography as a field and discuss current techniques
in information hiding. We then show how these are applied to texts, differentiating between non-
linguistic and linguistic methods. Section 1.1 describes modern steganography with some
examples of steganographic techniques, and defines linguistic steganography within the context
of text steganography in general. Section 1.2 introduces steganalysis and adversarial models,
which are, in a sense, the driving force behind the creation of new steganographic methods.
Finally, section 1.3 discusses “mimicking”, which is an encapsulation of the idea of using the
statistical properties of a normal data object as the basis for generating a steganographic cover.
These are intended as background information in order to motivate the discussion of text
steganography and cover generation in section 2.
1.1 Steganography
Steganographic information can be hidden in almost anything, and some cover objects are more
suitable for information hiding than others. This section will simply detail a few common
steganographic methods applied to various kinds of electronic media, along with an explanation
of the steganographic techniques used. Techniques can be grouped in many different ways;
Johnson and Katzenbeisser group steganographic techniques into six categories by how the
algorithm encodes information in the cover object: substitution systems, transform domain
techniques, spread spectrum techniques, statistical methods, distortion techniques, and cover
generation methods (2000: 43-44). In terms of linguistic steganography, we will be mainly
concerned with cover generation methods, although some statistical methods and substitution
systems will be described. Substitution systems insert the hidden message into redundant areas of
the cover object, statistical methods use the statistical profile of the cover text in order to encode
information, and cover generation methods encode information in the way the cover object itself is
generated (44). The descriptions that follow are not supposed to be an exhaustive survey, but
merely an introduction to some of the existing methods; for a much more comprehensive
description of modern steganographic techniques, see Katzenbeisser and Petitcolas (2000) or
Wayner (2002).
One further comment should be made; Kerckhoffs’ principle, which states that one must assume
that an attacker has knowledge of the protocol used and that all security must thus lie in the key
used in the protocol, is not to be ignored (Anderson and Petitcolas 1998, Petitcolas 2000). While
we do not specifically discuss secret keys here, it should be stated that we assume that the hidden
message is encrypted before being hidden in the cover text. This does not protect a
protocol from being attacked if the introduction of random-looking data is inappropriate in the
context of the cover data, but it does prevent the message from being read. In many cases, the
fact that encrypted data looks like random data is intentionally used in spaces where such
random noise could realistically occur. In any event, we assume that the message itself is
cryptographically secure, and we therefore focus on the protocols intended to hide such messages.
1.1.1 Image steganography
Image steganography has gotten more popular press in recent years than other kinds of
steganography, possibly because of the flood of electronic image information available with the
advent of digital cameras and high-speed internet distribution. Image steganography often
involves hiding information in the naturally occurring “noise” within the image, and provides a
good illustration for such techniques.
Most kinds of information contain some kind of noise. Noise can be described as unwanted
distortion of information within the signal. Within an audio signal, the concept of noise is
obvious. For images, however, noise generally refers to the imperfections inherent in the process
of rendering an analog picture as a digital image. For example, not only will the values of colors in the palette
for a digital image differ from the exact colors in the real image, but the distribution of
these colors will also be imperfect. As Wayner mentions, the instantaneous measurement of
photons made by a digital camera also captures the randomness inherent in their behavior,
leading to a set of “imperfect” measurements which balance out to become a digital image
(Wayner 2002: 152). By changing the least significant bit (LSB) in the color representation for
selected pixels, information can be hidden while often not significantly changing the visual
appearance of the image; this is known as “image downgrading” (Katzenbeisser 2000: 29,
Johnson and Katzenbeisser 2000: 49). The greater the number of bits used to represent colors,
the less obvious the changes in palette values are in the visual representation of the final image.
While changing the LSB in order to hide information is a widely used steganographic method,
Petitcolas et al note that it is “trivial for a capable opponent to remove” such information (1999:
1065). Furthermore, lossy compression and other image transformations can easily destroy
hidden messages (Johnson and Jajodia 1998: 30).
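The LSB substitution described above can be illustrated with a minimal sketch. It is not taken from any of the tools cited here; the function names `embed_lsb` and `extract_lsb` are hypothetical, and the cover is assumed to be a flat sequence of 8-bit color channel values rather than a real image file.

```python
def embed_lsb(pixels: bytes, message: bytes) -> bytes:
    """Hide the bits of `message` in the least significant bit of each value."""
    # Unpack the message into individual bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover is too small for this message")
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        # Clear the LSB of the cover value, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Read back `n_bytes` of hidden data from the LSBs of `pixels`."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (pixels[i * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)


# 32 hypothetical 8-bit channel values stand in for image data.
cover = bytes(range(200, 232))
stego = embed_lsb(cover, b"hi!!")
recovered = extract_lsb(stego, 4)
```

Each cover value changes by at most one, which is why the visual effect is usually small. The sketch also shows why the scheme is fragile: any transformation that rewrites the low-order bits, such as lossy compression, destroys the message.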
There are many other methods for image steganography, however. For the images below, an
algorithm was used that attempts to avoid statistical distortion by taking advantage of the discrete
cosine transforms that are used to compress and approximate a digital image. The F5 algorithm
improves upon previous methods for steganographic JPEGs by not only modifying the
compression coefficients, but by attempting to spread the modifications through the file in such a
way that their statistical profile still approximates a non-steganographic JPEG image (Westfeld
2001; Wayner 2002: 181). The image below is a bitmap image which is compressed to a JPEG
image by the tool as the secret message is embedded.
(Original image thanks to reasonablyclever.com)
(An original image (182k bitmap – left), and the same image with a 1k file embedded in it using
the F5 steganography tool (Westfeld, 2003) – note that the distortions in the background of the
second image resemble image distortions commonly seen when converting images; however, the
distortion in the two images is also clearly detectable by a human. This might or might not tip off
an attacker, depending upon whether or not such image distortion can be expected in context.
Furthermore, the statistical profile of the distortion may tell an attacker much more about the
stego-object.)
1.1.2 Audio steganography
Audio steganography, the hiding of messages in audio “noise” (and in frequencies which humans
can’t hear), is another area of information hiding that relies on using an existing source as a
space in which to hide information. Audio steganography can be problematic, however, since
musicians, audiophiles, and sound engineers have been reported to be able to detect the “high-
pitched whine” associated with extra high-frequency information encoded in messages. In
addition to storing information in non-audible frequencies or by distorting the audible signal to
include additional noise, Johnson and Katzenbeisser also mention a technique known as “echo
hiding.” An echo is introduced into the signal, and the size of the echo displacement from the
original signal can be used to indicate 1’s or 0’s, depending on the size (2000: 62). Regardless of
the kind of signal modification used, as with many steganographic techniques applied to images,
changing an existing signal modifies its statistical profile in addition to potentially changing the
audible qualities of the signal. Making such steganography less detectable depends on making
the changes look like legitimately occurring noise or signal distortion.
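As a rough illustration of echo hiding, the sketch below adds a delayed, attenuated copy of each signal segment and lets the delay encode one bit. All constants and function names are hypothetical. The cover is assumed to be white noise, which lets a plain autocorrelation comparison recover the bits; real audio is strongly self-correlated, so practical decoders are generally described as using more careful analysis (for example, of the cepstrum). The echo gain is also exaggerated for clarity.

```python
import random

DELAY_ZERO = 50   # echo offset in samples that encodes a 0 (hypothetical value)
DELAY_ONE = 80    # echo offset in samples that encodes a 1 (hypothetical value)
SEGMENT = 2000    # samples used per hidden bit (hypothetical value)
ECHO_GAIN = 0.5   # echo amplitude relative to the original (exaggerated)


def embed_echo(signal, bits):
    """Add an echo to each segment; the size of the delay encodes the bit."""
    if len(bits) * SEGMENT > len(signal):
        raise ValueError("signal is too short for this many bits")
    out = list(signal)
    for k, bit in enumerate(bits):
        start = k * SEGMENT
        delay = DELAY_ONE if bit else DELAY_ZERO
        for n in range(start + delay, start + SEGMENT):
            out[n] += ECHO_GAIN * signal[n - delay]
    return out


def autocorr(segment, lag):
    """Unnormalized autocorrelation of a segment at a single lag."""
    return sum(segment[n] * segment[n + lag] for n in range(len(segment) - lag))


def extract_echo(signal, n_bits):
    """Decide each bit by which candidate delay shows the stronger echo."""
    bits = []
    for k in range(n_bits):
        seg = signal[k * SEGMENT:(k + 1) * SEGMENT]
        stronger_one = autocorr(seg, DELAY_ONE) > autocorr(seg, DELAY_ZERO)
        bits.append(1 if stronger_one else 0)
    return bits


rng = random.Random(0)  # fixed seed so the example is repeatable
cover = [rng.gauss(0.0, 1.0) for _ in range(8 * SEGMENT)]
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_echo(cover, secret)
recovered = extract_echo(stego, len(secret))
```

The same autocorrelation test that the receiver uses is available to an attacker who knows the candidate delays, which is one way the altered statistical profile mentioned above can be exposed.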
The following is a short visual sample of what happens when a 13k audio file is embedded into a 168k
.wav file using Steganos (Steganos GmbH, 2004); the audio of the steganographic file sounds
fuzzy, like a radio that is not well-tuned (the original sound file is quite clear). The representation
captured below shows the end of the sound trailing off into silence. While both images are
similar, an examination of the steganographic representation shows that there is additional noise
(this is made particularly obvious by the fact that the line at the end of the right sample is much
thicker than that of the left sample):
(Left: clean .wav file (cover object). Right: stego .wav file. Images from the Audacity audio program.)
While the original audio is still clearly discernable in the stego-object, the changes are detectable
by the human listener. While they fall within the range of the kind of noise heard on a radio
station or through interference, if such noise is not expected in context, it may raise questions
about the audio file itself.
1.1.3 Text steganography
Text steganography, which is what this paper specifically deals with, uses text as the medium in
which to hide information. Our definition of text steganography remains broad in order to
differentiate it from the more specific “linguistic steganography”. Text steganography can
involve anything from changing the formatting of an existing text, to changing words within a
text, to generating random character sequences or using context-free grammars to generate
readable texts. With any of these methods, the common denominator is that hidden messages are
embedded in character-based text. What sets each method apart from another, coarsely speaking,
is whether or not the text is preexisting or is completely generated by the steganographic tool,
and whether or not the resulting text is a result of random generation/modification, “statistical”
generation, or linguistically-driven generation/modification. Only linguistically-driven
generation and modification of cover texts qualifies as linguistic steganography by our
definition, and will be further detailed in section 2. Each of these text-based methods has its own
unique characteristics and, from both a linguistic and a security standpoint, its own problems.
To give an idea of what text steganography looks like, here is a short sample of steganographic
text generated by a tool called spammimic. spammimic is a web-based steganography tool
which uses a “spam” grammar and Wayner’s mimic algorithm for context-free-grammars to
produce spam-like steganographic text (spammimic.com, 2000). For example, entering “Lovely
spam” as a secret message into the web interface creates the following (lengthy) steganographic
message:
Dear Friend , This letter was specially selected to be sent to you
! This is a one time mailing there is no need to request removal
if you won't want any more ! This mail is being sent in compliance
with Senate bill 1622 ; Title 6 ; Section 305 ! This is NOT
unsolicited bulk mail . Why work for somebody else when you can
become rich in 10 days ! Have you ever noticed nearly every
commercial on television has a .com on in it and people love
convenience . Well, now is your chance to capitalize on this ! WE
will help YOU turn your business into an E-BUSINESS and process
your orders within seconds ! You are guaranteed to succeed because
we take all the risk ! But don't believe us ! Mr Jones of Indiana
tried us and says "Now I'm rich, Rich, RICH" ! This offer is 100%
legal . Because the Internet operates on "Internet time" you must
act now ! Sign up a friend and you'll get a discount of 50% .
Thanks .
The message is created through a grammar of spam phrases, each of which expresses a bit or
series of bits. In some sense, the spammimic variant of Wayner’s mimic grammars is one of the
more convincing text methods, since so much essentially meaningless spam content is readily
available on the internet (one estimate claims (as of May 2004) that 82% of all email in the U.S.
was spam (Gaudin 2004)). Wading through the volumes of “legitimate” spam to find hidden
messages would be a hefty job indeed; in fact, one of the ways to improve the chances that a
steganographic message won’t be found is to camouflage the data in such a way that there is a
high cost for the attacker to search for the message (Petitcolas et al 1999: 1065). While this
“needle in a haystack” metaphor is appropriate and is relied upon by many steganographic
methods, given the grammar used to create the message or a known steganographic message
created with the grammar, the task becomes significantly less daunting.
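The idea that each phrase choice expresses a bit or series of bits can be made concrete with a toy sketch. This is not spammimic's grammar or Wayner's mimic algorithm, both of which use context-free grammars; here we assume a fixed sequence of slots, each offering a power-of-two number of alternative phrases. The phrase table and function names are hypothetical.

```python
# Each slot offers 2**k alternatives, so choosing one phrase encodes k bits.
SLOTS = [
    ["Dear Friend ,", "Dear Colleague ,", "Dear Professional ,", "Dear Cybercitizen ,"],
    ["This is NOT unsolicited bulk mail .", "This is a one time mailing ."],
    ["Why work for somebody else when you can become rich in 10 days !",
     "Why pay more when you can act now !"],
    ["Thanks .", "Best regards .", "Thank-you for your time .", "God Bless ."],
]


def bits_per_slot(options):
    """Number of bits carried by one choice among `options`."""
    return len(options).bit_length() - 1


def encode(bits: str) -> str:
    """Turn a string of '0'/'1' characters into cover text."""
    phrases, pos = [], 0
    for options in SLOTS:
        k = bits_per_slot(options)
        chunk = bits[pos:pos + k]
        if len(chunk) < k:
            raise ValueError("not enough bits for this grammar")
        phrases.append(options[int(chunk, 2)])
        pos += k
    if pos != len(bits):
        raise ValueError("too many bits for this grammar")
    return " ".join(phrases)


def decode(text: str) -> str:
    """Recover the bit string by recognizing which phrase fills each slot."""
    bits, rest = [], text
    for options in SLOTS:
        k = bits_per_slot(options)
        for index, phrase in enumerate(options):
            if rest.startswith(phrase):
                bits.append(format(index, f"0{k}b"))
                rest = rest[len(phrase):].lstrip()
                break
        else:
            raise ValueError("text was not produced by this grammar")
    return "".join(bits)


cover_text = encode("100111")
recovered = decode(cover_text)
```

Note that `decode` needs nothing but the phrase table. This is the point made above: once an attacker has the grammar, checking whether a given message was generated by it, and reading off the bits, is straightforward.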
1.2 Steganalysis
As with most areas of information security, steganography is an arms race. This is perhaps best
exemplified by Ross Ande