CERIAS Tech Report 2004-13

LINGUISTIC STEGANOGRAPHY: SURVEY, ANALYSIS, AND ROBUSTNESS CONCERNS FOR HIDING INFORMATION IN TEXT

by Krista Bennett
Center for Education and Research in Information Assurance and Security, Purdue University, West Lafayette, IN 47907-2086

Krista Bennett
Department of Linguistics
Purdue University
West Lafayette, IN 47906
kbennett@cerias.purdue.edu

Abstract. Steganography is an ancient art. With the advent of computers, we have vast accessible bodies of data in which to hide information, and increasingly sophisticated techniques with which to analyze and recover that information. While much of the recent research in steganography has been centered on hiding data in images, many of the solutions that work for images are more complicated when applied to natural language text as a cover medium. Many approaches to steganalysis attempt to detect statistical anomalies in cover data which predict the presence of hidden information. Natural language cover texts must not only pass the statistical muster of automatic analysis, but also the minds of human readers. Linguistically naïve approaches to the problem use statistical frequency of letter combinations or random dictionary words to encode information. More sophisticated approaches use context-free grammars to generate syntactically correct cover text which mimics the syntax of natural text. None of these uses meaning as a basis for generation, and little attention is paid to the semantic cohesiveness of a whole text as a data point for statistical attack. This paper provides a basic introduction to steganography and steganalysis, with a particular focus on text steganography. Text-based information hiding techniques are discussed, providing motivation for moving toward linguistic steganography and steganalysis.
We highlight some of the problems inherent in text steganography as well as issues with existing solutions, and describe linguistic problems with character-based, lexical, and syntactic approaches. Finally, the paper explores how a semantic and rhetorical generation approach suggests solutions for creating more believable cover texts, presenting some current and future issues in analysis and generation. The paper is intended to be both general enough that linguists without training in information security and computer science can understand the material, and specific enough that the linguistic and computational problems are described in adequate detail to justify the conclusions suggested.

Introduction

Steganography is the art of sending hidden or invisible messages. The name is taken from a work by Trithemius (1462-1516) entitled “Steganographia” and comes from the Greek στεγανό-ς, γραφ-ειν meaning “covered writing” (Petitcolas et al 1999: 1062, Petitcolas 2000: 2, etc.). The practice of sending secret messages is nothing new, and attempts to cover the messages by hiding them in something else (or by making them look like something else) have been made for millennia. Many of the standard examples used by modern researchers to explain steganography, in fact, come from the writings of Herodotus. For example, in around 440 BC, Herodotus writes about Histiæus, who was being held captive and wanted to send a message without being detected. He shaved the head of his favorite slave, tattooed a message on his scalp, and waited for the hair to regrow, obscuring the message from guards (Petitcolas 2000: 3). Petitcolas mentions that this method was in fact still used by Germans in the early 20th century.

Modern steganography is generally understood to deal with electronic media rather than physical objects and texts. This makes sense for a number of reasons.
First of all, because the size of the information is generally (necessarily) quite small compared to the size of the data in which it must be hidden (the cover text), electronic media is much easier to manipulate in order to hide data and extract messages. Secondly, extraction itself can be automated when the data is electronic, since computers can efficiently manipulate the data and execute the algorithms necessary to retrieve the messages. Also, because there is simply so much electronic information available, there are a huge number of potential cover texts available in which to hide information, and there is a gargantuan amount of data an adversary attempting to find steganographically hidden messages must process. Electronic data also often includes redundant, unnecessary, and unnoticed data spaces which can be manipulated in order to hide messages. In a sense, these data spaces provide a sort of conceptual “hidden compartment” into which secret messages can be inserted and sent off to the receiver.

This work provides an introduction to steganography in general, and discusses linguistic steganography in particular. While much of modern steganography focuses on images, audio signals, and other digital data, there is also a wealth of text sources in which information can be hidden. While there are various ways in which one may hide information in text, there is a specific set of techniques which uses the linguistic structure of a text as the space in which information is hidden. We will discuss text methods, and provide justification for linguistic solutions. Additionally, we will analyze the state of the art in linguistic steganography, and discuss both the problems with these solutions and a suggested vector for future solutions.

In section 1, we discuss general steganography and steganalysis, as well as some well-known areas of steganography.
Section 2 discusses the main focus of this paper, text steganography in general and linguistic steganography in particular. Section 3 explores the linguistic problems with existing text steganographic methods. Finally, Section 4 gives suggestions for constructing the next generation of linguistically and statistically robust cover texts based upon the methods described in Sections 1 and 2, and the issues discussed in Section 3.

1 Steganography, Steganalysis, and Mimicking

Because the focus of this text is on linguistic steganography, it is important to understand just what we mean by this term. Chapman et al define linguistic steganography as “the art of using written natural language to conceal secret messages” (Chapman et al 2001: 156). Our definition is somewhat more specific than this, requiring not only that the steganographic cover be composed of natural language text of some sort, but that the text itself is either generated to have a cohesive linguistic structure, or that the cover text is natural language text to begin with. To further elaborate, we will first introduce steganography as a field and discuss current techniques in information hiding. We then show how these are applied to texts, differentiating between non-linguistic and linguistic methods.

Section 1.1 describes modern steganography with some examples of steganographic techniques, and defines linguistic steganography within the context of text steganography in general. Section 1.2 introduces steganalysis and adversarial models, which are, in a sense, the driving force behind the creation of new steganographic methods. Finally, section 1.3 discusses “mimicking”, which is an encapsulation of the idea of using the statistical properties of a normal data object as the basis for generating a steganographic cover. These are intended as background information in order to motivate the discussion of text steganography and cover generation in section 2.
1.1 Steganography

Steganographic information can be hidden in almost anything, and some cover objects are more suitable for information hiding than others. This section will simply detail a few common steganographic methods applied to various kinds of electronic media, along with an explanation of the steganographic techniques used. Techniques can be grouped in many different ways; Johnson and Katzenbeisser group steganographic techniques into six categories by how the algorithm encodes information in the cover object: substitution systems, transform domain techniques, spread spectrum techniques, statistical methods, distortion techniques, and cover generation methods (2000: 43-44). In terms of linguistic steganography, we will be mainly concerned with cover generation methods, although some statistical methods and substitution systems will be described. Substitution systems insert the hidden message into redundant areas of the cover object, statistical methods use the statistical profile of the cover text in order to encode information, and cover generation methods encode information in the way the cover object itself is generated (44). The descriptions that follow are not supposed to be an exhaustive survey, but merely an introduction to some of the existing methods; for a much more comprehensive description of modern steganographic techniques, see Katzenbeisser and Petitcolas (2000) or Wayner (2002).

One further comment should be made; Kerckhoffs’ principle, which states that one must assume that an attacker has knowledge of the protocol used and that all security must thus lie in the key used in the protocol, is not to be ignored (Anderson and Petitcolas 1998, Petitcolas 2000). While we do not specifically discuss secret keys here, it should be stated that we assume that the hidden message is encrypted before being hidden in the cover text.
This does not protect a protocol from being attacked if the introduction of random-looking data is inappropriate in the context of the cover data, but it does prevent the message from being read. In many cases, the fact that encrypted data looks like random data is intentionally used in spaces where such random noise could realistically occur. In any event, we assume that the message itself is cryptographically secure, and we therefore focus on the protocols intended to hide such messages.

1.1.1 Image steganography

Image steganography has gotten more popular press in recent years than other kinds of steganography, possibly because of the flood of electronic image information available with the advent of digital cameras and high-speed internet distribution. Image steganography often involves hiding information in the naturally occurring “noise” within the image, and provides a good illustration for such techniques.

Most kinds of information contain some kind of noise. Noise can be described as unwanted distortion of information within the signal. Within an audio signal, the concept of noise is obvious. For images, however, noise generally refers to the imperfections inherent in the process of rendering an analog picture as a digital image. For example, the values of colors in the palette for a digital image will not be the exact colors in the real image, and the distribution of these colors will also be imperfect. As Wayner mentions, the instantaneous measurement of photons made by a digital camera also captures the randomness inherent in their behavior, leading to a set of “imperfect” measurements which balance out to become a digital image (Wayner 2002: 152). By changing the least significant bit (LSB) in the color representation for selected pixels, information can be hidden while often not significantly changing the visual appearance of the image; this is known as “image downgrading” (Katzenbeisser 2000: 29, Johnson and Katzenbeisser 2000: 49).
The greater the number of bits used to represent colors, the less obvious the changes in palette values are in the visual representation of the final image. While changing the LSB in order to hide information is a widely used steganographic method, Petitcolas et al note that it is “trivial for a capable opponent to remove” such information (1999: 1065). Furthermore, lossy compression and other image transformations can easily destroy hidden messages (Johnson and Jajodia 1998: 30).

There are many other methods for image steganography, however. For the images below, an algorithm was used that attempts to avoid statistical distortion by taking advantage of the discrete cosine transforms that are used to compress and approximate a digital image. The F5 algorithm improves upon previous methods for steganographic JPEGs by not only modifying the compression coefficients, but by attempting to spread the modifications through the file in such a way that their statistical profile still approximates a non-steganographic JPEG image (Westfeld 2001; Wayner 2002: 181). The image below is a bitmap image which is compressed to a JPEG image by the tool as the secret message is embedded.

[Figure: an original image (182k bitmap, left), and the same image with a 1k file embedded in it using the F5 steganography tool (Westfeld, 2003). Original image thanks to reasonablyclever.com.]

Note that the distortions in the background of the second image resemble image distortions commonly seen when converting images; however, the distortion in the two images is also clearly detectable by a human. This might or might not tip off an attacker, depending upon whether or not such image distortion can be expected in context. Furthermore, the statistical profile of the distortion may tell an attacker much more about the stego-object.
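
The least-significant-bit substitution described in this section can be illustrated with a minimal sketch. It assumes the cover is a flat sequence of 8-bit color samples and that the message is written into the first samples in order; the function names `embed_lsb` and `extract_lsb` are hypothetical, and real tools typically select pixels with a key and work on actual image formats. It is not the F5 algorithm, which operates on JPEG coefficients.

```python
def embed_lsb(samples: bytes, message: bytes) -> bytes:
    """Hide `message` in the least significant bits of `samples`.

    Assumption for illustration: one message bit per 8-bit color sample,
    written sequentially from the start of the cover.
    """
    # Expand the message into individual bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("cover is too small for this message")
    stego = bytearray(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the sample, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract_lsb(samples: bytes, n_bytes: int) -> bytes:
    """Recover `n_bytes` of hidden data from the lowest bit of each sample."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for sample in samples[8 * i : 8 * i + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


if __name__ == "__main__":
    cover = bytes(range(256))          # stand-in for real pixel data
    stego = embed_lsb(cover, b"hi")
    print(extract_lsb(stego, 2))
```

Each sample changes by at most one intensity level, which is why the change is hard to see, and also why overwriting or recompressing the low-order bits is enough to destroy the message.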
1.1.2 Audio steganography

Audio steganography, the hiding of messages in audio “noise” (and in frequencies which humans can’t hear), is another area of information hiding that relies on using an existing source as a space in which to hide information. Audio steganography can be problematic, however, since musicians, audiophiles, and sound engineers have been reported to be able to detect the “high-pitched whine” associated with extra high-frequency information encoded in messages. In addition to storing information in non-audible frequencies or by distorting the audible signal to include additional noise, Johnson and Katzenbeisser also mention a technique known as “echo hiding.” An echo is introduced into the signal, and the size of the echo displacement from the original signal can be used to indicate 1’s or 0’s, depending on the size (2000: 62). Regardless of the kind of signal modification used, as with many steganographic techniques applied to images, changing an existing signal modifies its statistical profile in addition to potentially changing the audible qualities of the signal. Making such steganography less detectable depends on making the changes look like legitimately occurring noise or signal distortion.

The following is a short visual sample of what happens when a 13k audio file is embedded into a 168k .wav file using Steganos (Steganos GmbH, 2004); the audio of the steganographic file sounds fuzzy, like a radio that is not well-tuned (the original sound file is quite clear). The representation captured below shows the end of the sound trailing off into silence.
While both images are similar, an examination of the steganographic representation shows that there is additional noise (this is made particularly obvious by the fact that the line at the end of the right sample is much thicker than that of the left sample):

[Figure: clean .wav file (cover object, left) and stego .wav file (right); images from the Audacity audio program.]

While the original audio is still clearly discernible in the stego-object, the changes are detectable by the human listener. While they fall within the range of the kind of noise heard on a radio station or through interference, if such noise is not expected in context, it may raise questions about the audio file itself.

1.1.3 Text steganography

Text steganography, which is what this paper specifically deals with, uses text as the medium in which to hide information. Our definition of text steganography remains broad in order to differentiate it from the more specific “linguistic steganography”. Text steganography can involve anything from changing the formatting of an existing text, to changing words within a text, to generating random character sequences or using context-free grammars to generate readable texts. With any of these methods, the common denominator is that hidden messages are embedded in character-based text. What sets each method apart from another, coarsely speaking, is whether the text is preexisting or is completely generated by the steganographic tool, and whether the resulting text is a result of random generation/modification, “statistical” generation, or linguistically-driven generation/modification. Only linguistically-driven generation and modification of cover texts qualifies as linguistic steganography by our definition, and will be further detailed in section 2. Each of these text-based methods has its own unique characteristics and, from both a linguistic and a security standpoint, its own problems.
To give an idea of what text steganography looks like, here is a short sample of steganographic text generated by a tool called spammimic. spammimic is a web-based steganography tool which uses a “spam” grammar and Wayner’s mimic algorithm for context-free grammars to produce spam-like steganographic text (spammimic.com, 2000). For example, entering “Lovely spam” as a secret message into the web interface creates the following (lengthy) steganographic message:

Dear Friend , This letter was specially selected to be sent to you ! This is a one time mailing there is no need to request removal if you won't want any more ! This mail is being sent in compliance with Senate bill 1622 ; Title 6 ; Section 305 ! This is NOT unsolicited bulk mail . Why work for somebody else when you can become rich in 10 days ! Have you ever noticed nearly every commercial on television has a .com on in it and people love convenience . Well, now is your chance to capitalize on this ! WE will help YOU turn your business into an E-BUSINESS and process your orders within seconds ! You are guaranteed to succeed because we take all the risk ! But don't believe us ! Mr Jones of Indiana tried us and says "Now I'm rich, Rich, RICH" ! This offer is 100% legal . Because the Internet operates on "Internet time" you must act now ! Sign up a friend and you'll get a discount of 50% . Thanks .

The message is created through a grammar of spam phrases, each of which expresses a bit or series of bits. In some sense, the spammimic variant of Wayner’s mimic grammars is one of the more convincing text methods, since so much essentially meaningless spam content is readily available on the internet (one estimate claims (as of May 2004) that 82% of all email in the U.S. was spam (Gaudin 2004)).
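
The idea that each phrase choice “expresses a bit or series of bits” can be sketched with a toy example. This is a deliberately simplified illustration, not spammimic’s actual grammar or Wayner’s mimic algorithm: the `SLOTS` table, the phrase lists, and the `encode`/`decode` names are all assumptions, and the “grammar” here is a flat cycle of slots rather than a true context-free grammar with nested productions.

```python
# Hypothetical toy "grammar": each slot offers 2**k interchangeable phrases,
# so picking one phrase from a slot encodes k bits of the secret message.
SLOTS = [
    ["Dear Friend ,", "Dear Colleague ,"],                         # 1 bit
    ["This letter was specially selected to be sent to you !",
     "This is a one time mailing !",
     "This offer is 100% legal .",
     "You must act now !"],                                        # 2 bits
    ["Thanks .", "Best regards ."],                                # 1 bit
]


def bits_per_slot(slot: list) -> int:
    """Number of bits a slot encodes (slot length is assumed a power of two)."""
    return len(slot).bit_length() - 1


def encode(message: bytes) -> str:
    """Turn the message bits into a sequence of phrases, one per line."""
    bits = "".join(f"{byte:08b}" for byte in message)
    lines, pos, slot_index = [], 0, 0
    while pos < len(bits):
        slot = SLOTS[slot_index % len(SLOTS)]
        k = bits_per_slot(slot)
        chunk = bits[pos:pos + k].ljust(k, "0")   # pad a short final chunk
        lines.append(slot[int(chunk, 2)])
        pos += k
        slot_index += 1
    return "\n".join(lines)


def decode(text: str) -> bytes:
    """Recover the message by looking up which phrase was chosen in each slot."""
    if not text:
        return b""
    bits = ""
    for slot_index, line in enumerate(text.split("\n")):
        slot = SLOTS[slot_index % len(SLOTS)]
        bits += f"{slot.index(line):0{bits_per_slot(slot)}b}"
    usable = len(bits) - len(bits) % 8            # drop any padding bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


if __name__ == "__main__":
    cover = encode(b"Lovely spam")
    print(cover)
    print(decode(cover))
```

The sketch also shows why knowledge of the grammar matters: anyone holding the same `SLOTS` table can run `decode` directly, so the security of such a scheme cannot rest on the grammar staying secret.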
Wading through the volumes of “legitimate” spam to find hidden messages would be a hefty job indeed; in fact, one of the ways to improve the chances that a steganographic message won’t be found is to camouflage the data in such a way that there is a high cost for the attacker to search for the message (Petitcolas et al 1999: 1065). While this “needle in a haystack” metaphor is appropriate and is relied upon by many steganographic methods, the task becomes significantly less daunting for an attacker who has the grammar used to create the message, or a known steganographic message created with that grammar.

1.2 Steganalysis

As with most areas of information security, steganography is an arms race. This is perhaps best exemplified by Ross Ande