Linux下文件压缩和解压缩分析研究与实现

Linux下文件压缩和解压缩分析研究与实现Linux下文件压缩和解压缩分析研究与实现北方民族大学学士学位论文论文题目: Linux下文件压缩和解压缩分析研究与实现院(部)名称: 电气信息工程学院学生姓名: XXX 专业: 信息工程学号: 00000000 指导教师姓名: XX 教授论文提交时间: 2013.5.15 论文答辩时间: 2013.5.25 学位授予时间: 北方民族大学教务处制摘要在现代社会，计算机技术的发展，使得现代社会更加丰富多彩，我们可以随时随地在任何地方了解到世界各地的信息，而这又必须依赖信息的...

Linux下文件压缩和解压缩分析研究与实现北方民族大学学士学位论文论文题目: Linux下文件压缩和解压缩分析研究与实现院(部)名称: 电气信息工程学院学生姓名: XXX 专业: 信息工程学号: 00000000 指导教师姓名: XX 教授论文提交时间: 2013.5.15 论文答辩时间: 2013.5.25 学位授予时间: 北方民族大学教务处制摘要在现代社会，计算机技术的发展，使得现代社会更加丰富多彩，我们可以随时随地在任何地方了解到世界各地的信息，而这又必须依赖信息的传递。在信息化高度发达的当今社会，我们必须对信息的传递有着较高的要求，我们希望信息在传递的过程中，能够保持节省性和保密性和无损性，而著名的霍夫曼编码就能够达到这样的要求。因此研究霍夫曼编码对信息的压缩和解压缩就时相当有必要的，我们用C/C++语言对霍夫曼编码给出算法以实现对文件的压缩和解压缩。而Linux系统提供了编辑器(vim)、编译链接器(gcc)、调试器(gdb)及项目管理工具(make)。利用这些工具我们可以非常方便的进行C/C++程序的开发以实现对文件的压缩解压缩。本文将利用霍夫曼树与数据结构中最优二叉树的相似性，以及通过对文件I/O的操作，在Linux环境下实现对文件的压缩与解压缩。关键词: 压缩，解压缩，Linux，霍夫曼编码 ABSTRACT In modern society, the development of the communication, the more colorful modern society, we have learned anywhere anytime, anywhere around the world, which in turn must rely on the transmission of information. In the highly developed information technology in today's society, we have a higher demand on the transmission of information, we hope that the information in the delivery process can save and confidentiality and non-destructive, and the famous Huffman coding will be able to achieve such requirement. A result of Huffman coding compression and decompression of the information on quite necessary, with C/C++ language for Huffman coding algorithm is given in order to achieve the compression and decompression of files. The Linux system provides an editor (vim), compiler linker (gcc), debugger (gdb) and project management tools (make). The use of these tools can be very convenient for the development of the C program to implement file compression decompression. The paper will use the optimal binary the Huffman tree data structure, as well as file compression and decompression file I/O operation in the Linux. KEY WORDS: compression , Decompression，Linux , Huffman coding 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现目录第1章绪论 ............................................................................................................. 1 1.1数据压缩技术简介 .......................................................................................... 1 1.2数据压缩的分类.............................................................................................. 1 1.3本文的主要工作.............................................................................................. 2 第2章 Linux编程环境概述 ................................................................................... 3 2.1 Linux系统的由来及发展现状 ........................................................................ 3 2.2 Linux下C/C++语言编程的主要工具 ............................................................ 4 2.2.1编辑器vim................................................................................................ 4 2.2.2 编译链接器gcc........................................................................................ 5 2.2.3 调试器gdb ............................................................................................... 7 2.2.4 工程管理器make..................................................................................... 7 第3章霍夫曼编码原理 ......................................................................................... 9 3.1 霍夫曼编码的理论基础 ................................................................................. 9 3.2 霍夫曼编码 ...................................................................................................10 3.2.1 霍夫曼编码步骤 .....................................................................................10 3.2.2 霍夫曼表 .................................................................................................10 3.2.3 霍夫曼树 .................................................................................................11 3.2.4 霍夫曼树与压缩编码 .............................................................................12 第4章基于霍夫曼编码的文件压缩与解压缩的实现 ..........................................15 4.1 程序的设计思想 ............................................................................................15 4.2 编码程序设计 ...............................................................................................154.3 译码程序设计 ...............................................................................................17 4.4 软件的运行结果 ............................................................................................19 第5章结论 ............................................................................................................21 致谢 .......................................................................................................................22 参考文献 .................................................................................................................23 附录1:程序源代码 ...............................................................................................25 附录2:英文原文 ...................................................................................................38 附录3:中文译文 ...................................................................................................47 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现第1章绪论 1.1数据压缩技术简介 [1]随着计算机技术的发展，数据压缩技术有了越来越重要的作用。只有数据有重复性，冗余性，才能够实现压缩。在我们的生活中，一部电影，一首歌曲等，这些数据都有着相当的重复部分。比如一篇英文文章，即使很长，它也是由26个英文字母组成的。如果我们对这些数据进行编码，将大大减少数据的存储空间。总之，生活中的许多数据都是有重复性的，而减少重复，以尽可能少的码来 [2]编排一段具有重复性的有效数据便是数据压缩的精髓。 1.2数据压缩的分类数据压缩的研究己有几十年的历史，其间，人们提出了许多压缩算法。在分类上，也存在几种不同的方法，一般根据压缩过程是否可逆分为两种:无失真压缩编码(Noiseless Coding)与有失真压缩编码(Noise Coding);也有人按编码的建模不同将其分为:模型基编码和波形基编码;还可按压缩技术所使用的方法进行分类，可分为预测编码(Predictive Coding)、变换编码(Trannnsform Coding)和统计编 [3]码(Statistical Coding)三大类。其中第一类是最常用的分类方法。无失真压缩，也可称之为冗余度压缩(Redundancy Compression)，原始数据可通过对压缩数据的解码完整的还原。这种压缩方法的原理是除去或尽量除去数据中重复和冗余部分，而不丢失其中的信息，从而确保被压缩了的数据还原后与压缩前完全一致。无失真压缩方法主要应用于不允许出现失真的场合例如对程序文件，文本文件的压缩等。常用的无失真压缩技术有:霍夫曼(Huffman)编码，算术(Arithmetic)编码，LZ编码，游程编码(RLE)等。有失真压缩:也可称为熵压缩(Entropy Compression)，这是一种不可逆压缩。 1 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现它主要适用于允许数据出现失真的场合。例如图像，视频等。由于丢失的部分信息，所以这种压缩方法可以实现较高的压缩率。在这些方法中，霍夫曼编码是一种无损压缩方法，霍夫曼编码是可变字长编码(VLC)的一种。一般用来压缩文本文件，程序文件。霍夫曼压缩属于可变代码长度算法一族。意思是不同符号(例如，文本文件中的字符)用一个特定长度的位序列替代。在文件中出现频率高的符号，使用较短的编码，而那些很少出现的符号，则用较长的编码。在本文中将利用霍夫曼编码实现对文件的压缩与解压缩。 1.3本文的主要工作本文的主要内容将分为三部分: 第一部分，将对Linux系统做简单的介绍，对Linux系统中常用的工具简做单的介绍。第二部分，将对霍夫曼编码进行介绍，并研究利用霍夫曼编码实现对文件的压缩与解压缩。第三部分，将在Linux系统中，利用霍夫曼编码实现对文件的压缩与解压缩。并给出给出了具体的实现示例，通过截图的方式直观地表现在论文中。 2 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现第2章 Linux编程环境概述实现文本的压缩与解压缩，首先应找到一个非常利于编程的环境。Linux环境下的C/C++语言程序设计与在其它环境中的C/C++程序设计一样,主要包括编辑器(vim)、编译链接器(gcc)、调试器(gdb)和项目管理工具(make)。利用这些工具以及Linux开源的特点，在Linux下可以方便的进行C程序软件的开发。 2.1 Linux系统的由来及发展现状 Linux系统的创始人是林纳斯?托瓦兹(Linus Torvalds)。而Linux系统的出现则是无数自由软件运动爱好者智慧的结晶。 Linux是类UNIX操作系统，UNIX上的软件不需或经过小的改动就可以在Linux上运行。1983年，托瓦兹创建了以创建一个自由的操作系统为目标的GNU 计划。在其后的几年中越来越多的开发者加入这个组织，致力于Linux内核的开发。终于，在1992年，在GNU GPL下Linux内核被重新授权使用，产生了第 [4]一个―Linux发行版本‖。在1994年3月, 托瓦兹认为内核的所有组件已经完全成熟，同年，他发布了Linux的1.0版本，Linux内核版本的发展如表2-1所示。XFree86项目组提供了一个图形化操作界面(GUI)。同年red hat(红帽)公司和SUSE发行了他们各自的Linux 1.0分发版本。相比windows操作系统，Linux系统最大的特点是开源性。程序员可以阅读到Linux系统中任何程序的源代码。也正是这个特点使Linux系统迅速发展。Linux的应用领域不断扩展，从最早的Web、FTP、邮件服务开始，逐步扩张到个人桌面应用、网络安全、远程教育、集群计算、网络计算、嵌入式系统等各个领域。更是吸引了想IBM、SUN、惠普这样的IT巨头积极参与到Linux应用的开发和推广中去。Linux之前主要应用于服务器及计算集群，未来应该该在个人计算机上有所发展，优化目前的图形化界面，以及加快桌应用的开发，以及在智能终端的应用。 3 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现表2-1 Linux内核版本版本号发布时间说明 0.00 1991.2 两个进程分别显示AAA BBB 0.10 1991.9 第一个正式向外公布的Linux 内核版本 0.02 1991.10 内部版本，目前已经无法找到 0.10 1991.10 由Ted Ts'o 发布的Linux 内核版本 0.11 1991.12 基本可以正常运行的内核版本 0.12 1992.1 主要加入对数学协处理器的软件模拟程序 0.95(0.13) 1992.3 开始加入虚拟文件系统思想的内核版本 0.96 1992.5 开始加入网络支持和虚拟文件系统VFS 0.97 1992.8 增加了对SCSI驱动程序的支持 0.98 1992.9 改善了对TCP/IP网络的支持 0.99 1992.12 从新设计内存分配，每个进程有4G空间 1.0 1994.3 第一个正式版本 2.2 Linux下C/C++语言编程的主要工具 2.2.1编辑器vim vim是Linux系统中常用的文本编辑器。vim是在vi的基础上增加部分功能发展而来的。安装好Linux操作系统后，一般已经默认安装完毕了vim编辑器。使用vim可以方便是对程序进行编辑，vim提供了相当丰富的命令。例如:插入命令就有i，I，a，A;删除整行dd等。正是vim提供的丰富的命令使得程序员的手不离开键盘的主键盘区便可以完成对程序的编辑修改。vim的操作界面如图2-1所示，图中文字为vim常用设置。 4 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现图2-1 vim工作界面 2.2.2 编译链接器gcc GNU CC(简称为gcc)是GNU项目中符合ANSI C标准的编译系统;gcc能够编译用C、C++和Object C等语言编写的程序。gcc不仅功能强大，而且可以编译如C、C++、Object C、Java、Fortran、Pascal对交互式数据压缩、Modula-3和Ada等许多语言。因此在用C/C++实现文件的压缩与解压缩过程中，强大的gcc是不可或缺的编译工具。表2-2 gcc支持的后缀名及解释后缀名对应语言后缀名对应语言 .c C原始程序 .s/S 汇编原始程序 .C/cc/cxx C++原始程序 .h 预处理文件(头文件) .m Objective-C原始程序 .o 目标文件 .i 预处理后C程序 .a/so 编译后的库文件 .ii 预处理后C++原始程序 gcc编译分为:预处理阶段，编译阶段，汇编阶段，链接阶段。gcc编译过 5 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现程如图2-2所示。图 2-2 gcc编译过程预处理阶段，在该阶段，对包含的头文件(#include)和宏定义(#define、#ifdef等)进行处理。可以使用gcc的选项―-E‖ 让gcc在预处理结束后停止编译过程。以程序―hello.c‖为例，其命令为:[root@localhost gcc]# gcc –E hello.c –o hello.i。接下来进行的是编译阶段，在这个阶段中，gcc首先要检查代码书写是否规范，是否含有语法错误，以确定代码接下来要做的工作。检查结束后，gcc将把代码翻译成汇编语言。此时，用户可以使用[root@localhost gcc]# gcc –S hello.i命令进行查看，该选项只是进行编译。汇编阶段，汇编生成目标文件.o，但是没有经过链接，所以不是可执行文件。在此可使用选项―-c‖就可看到汇编代码已转化为―.o‖的二进制目标代码了。命令为:[root@localhost gcc]# gcc –c hello.s –o hello.o。最后的链接阶段，如果没有特别说明，gcc会到系统默认的路径―/usr/lib‖ 6 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现下查找，链接到libc.so.6库函数，也即实现了printf函数。 2.2.3 调试器gdb gdb是一款由GUN开发组织研究并发布的UNIX/Linux下的程序调试器。它虽然没有图形界面，但是它拥有强大的功能不亚于VC等调试工具。一般来说，gdb可以实现一下四个功能:启动目标程序后，按照程序员意愿运行程序;设置断点，程序会在断点处停止;当程序被停住时，可以检查此时程序运行的现场， [5]实时改变程序的运行环境。使用命令[root@localhost gcc]# gcc –g hello.c –o hello及[root@localhost gcc]# gdb hello 来运行gdb。此时可以看出,在gdb的启动画面中指出了gdb的版本号、使用的库文件等信息。然后可以通过设置断点，单步运行等方法来调试程序。图 2-3 工作环境相关命令命令格式含义 set args运行时的参数指定运行参数 show args 查看运行参数 path dir 设定程序运行路径 show path 查看程序运行路径 set environment var[=value] 设置环境变量 show environment [var] 查看环境变量 cd dir 进入dir路径 pwd 显示当前路径 shell command 运行shell中command命令 2.2.4 工程管理器make 工程管理器就是管理较多文件的工具。它能构根据文件时间戳自动发现更新过的文件，从而减少编译的工作量，同时，它通过读入makefile文件的内容来执 [6]行大量的编译工作。 makefile是make读入的配置文件。在一个makefile中通常包含如下内容:需要由make工具创建的目标体(target)，通常是目标文件或可执行文件;要创 7 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现建的目标体所依赖的文件(dependency_file);创建每个目标体时需要运行的命令(command)，这一行必须以制表符(tab键)开头。例如对―hell.c‖进行编译其makefile为: $ make hello.o gcc –c hello.c –o hello.o $ ls hello.c hello.h hello.o makefile 8 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现第3章霍夫曼编码原理霍夫曼编码是熵编码中最常用的压缩与解压缩方法之一。在熵编码中霍夫曼编码应用较为广泛，而且比较容易实现。1952年David. A. Huffman提出了一种构造最佳码的方法称为霍夫曼码。霍夫曼码非常适用于多元独立信源，对于多元独立信源来说是最佳码。它充分利用了信源概率分布的特性进行编码，它是一种最佳的逐个符号的编码方法。 3.1 霍夫曼编码的理论基础 1928年霍特莱首先提出了信息定量化的初步设想，他定义信息量为:消息数的对数。若信源含有m种消息，并且每个消息是以相等可能产生的,那么该信源的信息量为I=-log(m)。一个事件集合x,x2,……xn,处于一个基本概率空间，其相应概率为1 p1,p2,……pn,且p1,p2,……pn之和为1，每一个事件的信息量为I(x)=-log(p),knk如定义在空间中的每一事件的概率不相等的平均不肯定程度或平均信息量叫做H，则 H=E{I(x)}=?pI(x)=- ?plogp。对于图像来说，n=2m个灰度级xi,则kkkka(k) p(xi)为各灰度级出现的概率，熵表示平均信息量为多少比特，熵是编码所需比特 [8]数的下限，即编码所需的最少比特。编码一定要用不比熵少的比特数编码才能完全保持原图像完整信息，这便是图像压缩的下限。当a=2是，H的单位是比特。霍夫曼编码是1952年为文本文件而建立，是一种统计编码。霍夫曼编码是常用的无损编码方法，广泛应用于图像压缩技术。JPEG标准中的基准模式采用的就是霍夫曼编码。霍夫曼编码是不定长编码，即代表各元素的码字长度不等。该编码是基于不同符号的概率分布，在信息源中出现概率越大的符号，相应的码越短;出现概率越小的符号，其码越长，从而达到用尽可能少的码符号表示源数 [9]据。它在变长编码中是最佳的。在计算机信息处理中，―霍夫曼编码‖是一种一致性编码法(又称"熵编码法")。 9 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 3.2 霍夫曼编码 3.2.1 霍夫曼编码步骤霍夫曼编码一般分为一下几个步骤，设信息源空间为[A*P]:{A:a1 a2 ……an}{P(A):P(a1) P(a2) P(a3)……P(an)}其中P(ak)=1,先用r个码的号码符号集X:{x1,x2,……xr}对信源A中的每一个符号ak进行编码。 1. 把信源符号ai按照其出现的概率的大小顺序排列起来; 2. 把最末两个具有最小概率的元素的概率加起来; 3. 把该概率之和同其余概率由大到小排队，然后再把两个最小概率加起来，再排队; 4. 重复第(2) 、 (3)步骤, 直到概率和达到 1 为止: 5. 在每次合并消息时，将被合并的消息赋以1和0或0和1; 6. 寻找从每个信源符号到概率为1处的路径，记录下路径上的1和0; 7. 对每个符号写出"1"、"0"序列(从码数的根到终节点); 8. 创建霍夫曼表; 9. 压缩编码时，将码值用码字代替; 10. 在解码时，将码字用码值代替。 3.2.2 霍夫曼表将所有码值和码字的关系整理成一张表，为了整字节输出码字，表中还含有各码字的长度。这种表就称为霍夫曼表，如表3-1所示。进行压缩编码时，只要将码值用码字代替即可。所以源符码的编码为: “000100001001”。 10 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现表 3-1 霍夫曼表码值码字长度码字 a4 2 00 a1 3 010 a2 3 011 a3 3 101 a5 3 110 a6 3 111 a7 4 1000 a8 4 1001 则平均码长:B = 0.1*3+0.1*3+0.15*3+0.2*2+0.15*30.15*3+0.1*4+0.05*4 = 2.95(b) 熵 H = 2.9087 编码效率 N = H/B = 98.6% 如果概率统计十分不准确，则压缩效率会变得很低，甚至起不到压缩效果。霍夫曼树平均码长是霍夫曼树的带权路径长度，由于霍夫曼树是最优二叉树，带 [10]权路径长度最小的二叉树，故其压缩效果最好。 3.2.3 霍夫曼树霍夫曼编码实际上构造了一个码树，即最优二叉树。码树从最上层的端点开始构造，直到树根结束，最后得到一个横放的码树即霍夫曼树。这个霍夫曼树的建立遵循一个原则——概率大的字符尽量编码短。反映在霍夫曼树上即从树根到概率大的叶子的路径短。这里举个例子说明如何生成霍夫曼树。假设对由a、a、12a、a、a、a、a、a八个信源符号组成的源信息字符串:―a a a a a a a a a a 4a a a a a a a a a a‖进行霍夫曼编码。首先应对信息中各数字出现的次数进4555666778 行统计，得出各数字出现的相对概率。假设各数字出现的次数及概率如表3-2所示。具体过程是这样的，先将所有符号排成一行构成8个最底层节点。首先找到概率最小的码值相加即0.05+0.1 = 0.15，得到新的节点作为这两个节点的根，去掉那两个最小的概率值，同时添加新得到的概率值，然后此时的概率值为:0.2, 0.1, 11 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 0.1, 0.15, 0.15, 0.15, 0.15。找出此时两个最小的概率值，继续相加得到新的节点 [11]作为这两个节点的根。以此类推知道使概率值相加为1。规定节点左侧分支为0，右侧分支为2。若反过来规定亦可。此时便得到霍夫曼树如图3-1所示。表3-2 各符号出现的次数与概率码值 a1 a2 a3 a4 a5 a6 a7 a8 次数 2 2 3 4 3 3 3 1 概率 0.1 0.1 0.15 0.2 0.15 0.15 0.1 0.05 图3-1 霍夫曼树对于各值(码值)的代码(码字)就是从根节点出发到底层节点所经历的分支序列。如a的代码(码字)为00，a的码字为111... ...通常a和a等称为码4646值，00和111等称为码字。所有码值和码字对应关系如表2-2所示。表3-3 码值和码字对应关系码值 a1 a2 a3 a4 a5 a6 a7 a8 码字 010 011 101 00 110 111 1000 1001 3.2.4 霍夫曼树与压缩编码在霍夫曼树与压缩编码的小节将举例说明利用霍夫曼编码实现对文本的压 12 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现缩与解压缩。在一则文本共1000个字符，分为8种字符(A…H), 若不编码压缩文本的总比特数为:8000bit。每种字符出现的概率如表3-4所示。通过此例子将介绍利用霍夫曼编码实现对文本文件压缩与解压缩的原理，以及与普通编码比较霍夫曼编 [12]码的压缩效率。表3-4 各个字符出现的概率 A B C D E F G H 字符 0.05 0.29 0.07 0.08 0.14 0.23 0.03 0.11 概率若直接编码很容易得出八个字符的编码表3-5。此时很容易的出通过编码后文章所占总比特数为:3000bit。表3-5 编码表 A:000 E:100 B:001 F:101 C:010 G:110 D:011 H:111 若使用霍夫曼编码，以3.2.3介绍原理可以得出霍夫曼树如图3-2。此时规定左支为0，右支为1。便容易得出霍夫曼编码表如表3-6所示。此时得出总比特数为:4*50 + 2*290 + 4*70 + 4*80 + 3*140 + 4*30 + 2*230 + 3*110 = 2710bit。对比原文的8000bit以及普通编码的3000bit可以看出霍夫曼编码可以很好的实现对 [13]文本文件的压缩。综合本小节的叙述可以得出霍夫曼编码实现对数据进行压缩存储的步骤:对数据进行统计，统计每种符号发生的概率;根据符号概率的大小，生成霍夫曼树;根据霍夫曼树对每个符号编写(编)码表;按照给定的码表对数据进行编码;解码 [14]时需要原编码方案。译码时从根结点出发，按二进制位串中的0和1确定是进入左分支还是右分支，当到达叶子结点时译出该叶子对应的字符。然后重新从根节点出发重复上述步骤，直至结束。 13 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现图3-2 霍夫曼树表 3-6 霍夫曼编码表编码长度频度 A:0110 4 50 B:11 2 290 C:1000 4 70 D:1001 4 80 E:101 3 140 F:0111 4 30 G:00 2 230 H:010 3 110 14 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现第4章基于霍夫曼编码的文件压缩与解压缩的实现本章将主要介绍利用C语言编程实现对文件的压缩与解压缩，编程环境为Linux redhat5.5(虚拟机)。 4.1 程序的设计思想先是读取文件，统计文件中不同字符出现的次数，并存入数组中，然后将存入另外数组中，按照数组个ASCII及相应数组中不同的字符对应的ASCII值出现的频率的来建立霍夫曼编码，然后将码表存入压缩文件，再将各字符对应的霍夫曼编码写入压缩文件中，实现文件压缩。解压缩时也是先把文件整体读入， [15]通过霍夫曼编码的长短，依次解码，从原来的位存储还原到字节存储。建立霍夫曼树根据霍夫曼树编码根据霍夫曼树解码统计字符，得出统计出字符的权值n 对二进制文件进行对编码进行压缩解码生成霍夫曼树生成二进制文件生成对应文件图4-1 程序设计思想 4.2 编码程序设计首先运行的时候，用户主界面上有菜单提示该如何使用软件，如图4-2。根 15 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现据菜单提示选择所要执行的选项，依次进行，因为各个环节之间有先后顺序。第一步为输入压缩软件的名称，由键盘输入文件路径和文件名称，读入字符数组中，打开该文件，按照提示进行压缩。若打不开，则继续输入。文件将信息存放在字符数组中，计算每个字符出现的次数，申请一个结构体 [16]数组空间，用读取的字符减去字符结束符作为下标记录字符的频率。将所记录的字符的频率作为权值来创建霍夫曼树，依次选择权值最小的两个字符作为左右孩子，其和作为父结点的权值，依次进行下去，直到所有的字符结点都成为叶子结点。根据创建的霍夫曼树来确定个字符的01编码，左侧结点0，侧结点为1。读取文件，依次将每个字符用他们的编码表示，即完成一次编码。图4-2 压缩软件界面 16 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现开始键入文件名读文件操作 N 字符权值统计文件是否读完 Y 生成霍夫曼树生成霍夫曼编码生成编码文件结束图4-3 编码程序流程图 4.3 译码程序设计读取编码文件，依据创建的霍夫曼树，定义一个指针指向根结点。从根结点开始，每读一个字符，指针变化一次(当读取的字符是?1‘时，指针指向当前所指结点的右侧结点，当读取的字符是?0‘时，指针指向当前所指结点的左侧结点)，直至该指针所指结点为叶子结点时结束(即当结点的左右结点均为空时)。将当前叶子结点所代表的字符值输出到译码文件中，依次读取编码文件中的字符，按 17 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现照上述方法依次进行下去直至文件结束。开始键入文件名读文件操作文件是否读完 N 扫描霍夫曼树 Y N 是否为叶节点 Y 译码写入新文件，并返回根节点 N 是否为根节点 Y 输出错误编码结束图4-4 译码流程图 18 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 4.4 软件的运行结果以压缩本文附录程序文件为例，验证软件的正确性。首先查看源代码文件属性，大小为11.3KB如图4-5所示。然后进行压缩，得到文件aa.hub大小为8.5KB如图4-6所示。图4-5 压缩前文件属性图4-6 压缩后文件属性最后执行解压缩，对文件aa.hub进行解压缩操作得到文件c.cpp，类型与原文件a.cpp一致，即C++源代码。大小同位11.3KB。如图4-7所示。 19 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现图4-7 解压后文件属性 20 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现第5章结论在当今信息时代，如何采用有效的数据压缩技术来节省数据文件的存储空间和计算机网络的传送时间已越来越引起人们的重视，霍夫曼编码正是一种应用广泛且非常有效的数据压缩技术。在毕业设计过程中，我选择了《linux下文件压缩和解压缩分析研究与实现》这一课题。在这个完成这个课题的过程中。我通过书籍和网络等途径学习到了以前所未学到的知识。同时也加深了我对已学习知识的理解。对Linux系统，对C/C++语言有了根深的了解。理解了文件压缩与解压缩的意义。本次课程设计所实现的软件比较适用于文本文件。由于知识水平的限制本文的程序利用的是静态的霍夫曼编码，有很大的局限性。相比自适应霍夫曼编码，静态霍夫曼编码所需时间较长，效率较低。在实际的运用中，对文件的压缩与解压缩一般采用的是自适应霍夫曼编码。 21 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现致谢在这里我首先要感谢我的论文知道老师——XX教授。在我完成论文的过程中，是您给予了我巨大的帮助。您治学严谨，知识渊博，在课题的选定直到到论文撰写的最终完成的过程中，XX教授都始终给予我细心的指导。在此谨向您致以诚挚的谢意和崇高的敬意。我要感谢我的父母，是你们，含辛茹苦养育了我;是你们，一直在背后默默支持着我;是你们，给了我所有的一切，给了我今天。感谢我的大学，在这里，各位老师教会了我各种专业知识，教会了我做人的道理，感谢我的各位老师;在这里，我遇到了我的同学，我的挚友，在我遇到困难的时候，是你们陪在我的身边，和我一起共渡难关，我们一同嬉笑过、拼搏过，感谢我的朋友以及同学。 22 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现参考文献 [1] 傅祖芸.信息论——基础理论与应用[M]. 北京:电子工业出版社，2007. [2] Brian W Kernighan , Dennis M Ritche . C语言程序设计语言[M]. 北京:机械工业出版社，2004. [3] Mark Allen Weiss.数据结构与算法分析——C语言描述[M]. 北京:机械工业出版社，2004. [4] 栾阳.关于数据压缩技术的研究[J]. 硅谷，2009. 15(12),192-197. [5] 曾玲.数据压缩技术在通信中的应用[D]. 四川:西南交通大学，2003. [6] 张敬敬，朱永利，郝宁等. Huffman压缩算法在智能电网通信系统中的应用 [J]. 河北工业科技，2001,9(19A),123-130. [7] 赵晓莉.浅析多媒体数据压缩技术[J]. 科技信息，2003,2(10),23-31. [8] 董雨西.无损音频编码算法研究[D]. 天津:天津大学，2011. [9] Sayood K.数据压缩导论[M]. 北京:人民邮电出版社，2009. [10] 吴乐南.数据压缩[M]. 北京:电子工业出版社，2012. [11] 金子一.图像信源压缩编码及信道传输理论与新技术[M]. 北京:北京工业大学出版社，2006. [12] 鸟哥.鸟哥的Linux私房菜 ——基础学习篇[M].北京:人民邮电出版社，2010. [13] 刘忆智.Linux从入门到精通[M]. 北京:清华大学出版社，2010. [14] 杨宗德，吕光宏，刘雍. Linux高级程序设计[M].北京:清华大学出版社，2009. [15] 邢国庆.Red Hat Enterprise Linux 6从入门到精通[M]. 北京:电子工业出版社，2011. [16] 霍顿. C语言入门经典[M]. 北京:清华大学出版社，2008. 23 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 [17] 曾俊.无损压缩技术在姿态数据采集系统中的应用研究[D]. 南京:南京理工大学，2009. [18] 庄明强.基于数据压缩与GPRS的配变数据采集与监控系统研究[D]. 杭州: 浙江大学，2008. [19] 王敬生,罗家融,李俊照.Linux和Unix编译链接配置文件分析[J].皖西学院学报，2009,10(5),18-22. [20] 刘燕清,龚声蓉. 基于一次排序动态编码的Huffman编码算法[J]. 计算机应用与软件,2009,8(9A):55-62. [22] 折小利.抽样数字图像压缩方法的改进[D]. 西安:西安电子科技大学，2009. [23] 路淼，周卫东. 基于嵌入式零树编码(EZW)的脑电信号压缩算法[J]. 航天医学与医学工程,2004,12(9):59-72. [24] 韩菲.基于雷达视频的Huffman编码研究[J]. 舰船电子工程，2004， 13(15):110-124. 24 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现附录1:程序源代码 #include #include #include #include #include #include struct huofuman { unsigned char b; //记录字符在数组中的位置 long count; //字符出现频率(权值) long parent,lch,rch; //定义霍夫曼树指针变量 char bits[256]; //定义存储霍夫曼编码的数组 } header[512],tmp; //压缩函数 void compress() { char filename[255],outputfile[255],buf[512]; unsigned char c; long i,j,m,n,f,lh=0; long min1,pt1,flength,length1,length2; double div; FILE *readfile,*writefile; 25 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 cout<<"请输入要压缩的文件名:"; cin>>filename; readfile=fopen(filename,"rb"); if(readfile==NULL) {cout<<"文件打开失败～～,请检查该文件是否存在，或遇到未知文件"<>outputfile; writefile=fopen(strcat(outputfile,".hub"),"wb"); if(writefile==NULL) { cout<<"文件压缩失败!"<header[j].count) { pt1=j; min1=header[j].count; continue; } } header[i].count=header[pt1].count; header[pt1].parent=i; //依据parent域值(结点层数)确定树中结点之间的关系 header[i].lch=pt1; //计算左分支权值大小 min1=999999999; for(j=0;jheader[j].count) { pt1=j; min1=header[j].count; continue; } } header[i].count+=header[pt1].count; header[i].rch=pt1; //计算右分支权值大小 header[pt1].parent=i; 28 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 } for(i=0;i=8) //对霍夫曼编码位操作进行压缩存储 { for(i=0;i<8;i++) { if(buf[i]=='1') c=(c<<1)|1; else c=c<<1; } fwrite(&c,1,1,writefile); 30 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 pt1++; //统计压缩后文件的长度 strcpy(buf,buf+8); //一个字节一个字节拼接 j=strlen(buf); } if(f==flength) break; } if(j>0) //对霍夫曼编码位操作进行压缩存储 { strcat(buf,"00000000"); for(i=0;i<8;i++) { if(buf[i]=='1') c=(c<<1)|1; else c=c<<1; } fwrite(&c,1,1,writefile); pt1++; } fseek(writefile,4,SEEK_SET); fwrite(&pt1,sizeof(long),1,writefile); fseek(writefile,pt1,SEEK_SET); fwrite(&n,sizeof(long),1,writefile); for(i=0;i>filename; readfile=fopen(strcat(filename,".hub"),"rb"); if(readfile==NULL) { cout<<"文件打开失败～～,请检查该文件是否存在，或遇到未知文件"<>outputfile; writefile=fopen(outputfile,"wb"); if(writefile==NULL) { cout<<"文件解压失败～"<0) m=p/8+1; else m=p/8; for(j=0;jf;l--) { strcat(header[i].bits,"0"); } strcat(header[i].bits,buf); } header[i].bits[p]=0; } for(i=0;istrlen(header[j].bits)) { tmp=header[i]; header[i]=header[j]; header[j]=tmp; } } } p=strlen(header[n-1].bits); fseek(readfile,8,SEEK_SET); m=0; bx[0]=0; while(1) //通过霍夫曼编码的长短，依次解码，从原来的位存储还原到字节存储 { while(strlen(bx)<(unsigned int)p) { fread(&c,1,1,readfile); f=c; itoa(f,buf,2); f=strlen(buf); for(l=8;l>f;l--) //在单字节内对相应位置补0 { strcat(bx,"0"); } strcat(bx,buf); } 35 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 for(i=0;i>a; while(a!=0 && a!=1 && a!=2) 36 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 { cout<<"输入错误，请核对后再输入"<>a; } switch(a) { case 1:compress();break; case 2:decompression();break; case 0:exit(0); } } system("pause"); //任意键继续 system("cls"); //清屏 } 37 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现附录2:英文原文 From:www.mdpi.com/1999-4893/3/1/63 Authors: R. M. Capocelli Interactive Compression of Digital Data Abstract: If we can use previous knowledge of the source (or the knowledge of a source. that is correlated to the one we want to compress) to exploit the compression process then we can have significant gains in compression. By doing this in the fundamental source. coding theorem we can substitute entropy with conditional entropy and we have a new theoretical limit that allows for better compression. To do this, when data compression is used for data transmission, we canassume some degree of interaction between the compressor and the decompressor that can allow a more efficient usage of the previous knowledge they both have of the source. In this paper we review previous work that applies interactive approaches to data compression and discuss this possibility. Keywords: data compression, entropy, conditional compression 1 Introduction Data Compression is crucial for the transmission and storage of digital data. It is possible to improve the compression performance when we can use previous knowledge of the emitting source in the compression process. In this case in the fundamental source coding theorem we can substitute entropy with conditional entropy and we have a new theoretical limit that allows for better compression. When data compression is used for data transmission we might assume some degree of interaction between the compressor and the de compressor that allows a more efficient usage of the previous knowledge of the source they both might have. The price we accept to pay, in some cases, is a negligible probability of errors. 38 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 In this paper we review recent work that applies previous knowledge of the emitting source to data compression and interactive approaches and discuss these possibilities. We will concentrate on lossless, dictionary based, compression algorithms. In the next section we review relevant concepts about the theoretical limits of data compression and about the dictionary based compression methods. Section 3 discusses how to use the previous knowledge of the source when the shared knowledge is represented by common, identical files that both sides have available but there is no possibility of interaction between the Sender and the Receiver that need to communicate. Section 4 reviews the situation in which both Sender and Receiver have knowledge of the source, but do not share necessarily the same, identical, files. In this case, by taking advantage of the interaction between the two communication sides, Sender and Receiver can improve the compression process by using the shared knowledge if they accept a small possibility of communication errors. Section 5 presents our conclusions and points to new research directions. 2 Compression, Entropy and Dictionaries Data and information today are digital. We need data compression to deal with digital data: without compression we cannot store and/or transmit the huge amount of data we treat every day. The theoretical background of data compression techniques is strong and well established. It dates back to the seminal work of Claude Elwood Shannon (see ). In the late 1940s, he gave a formal definition of the notion of information content or randomness of a source, which he called entropy. This theoretical background gives precise limits on the performance of a compression algorithm. Any process that generates information can be viewed as a source that emits a sequence of symbols from a finite alphabet. The source output can be seen as a sequence of words: i.e., finite length sequences of alphabet symbols. From the practical point of view to divide the source outputs into words might involve a semantic analysis that depends on the source and that it is possible only if the output of the source is interpreted in a determinate context. For example, a word of a source 39 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 that emits English sentences can be a sequence of one or more common English words, including blanks, punctuation marks, etc. If we consider a first-order source S, in which successive symbols are statistically independent, Shannon and Weaver in [1] showed that, given the set of probabilities P = {p1, p2, ..., pn) for the source symbols, the optimal expected number of code bits is: a quantity which they called the Entropy of the source S, usually denoted by H(S). Conditional entropy H(S1|S2) is the natural analogue of the entropy H(S) but it is based on the conditional probability. It can be shown that: H(S1|S2) ? H(S1). Entropy is a measure of uncertainty about an event: it has a zero value if and only if there is absolute certainty and it is maximal when all the events are equip robable: i.e., there is absolute uncertainty. The ―fundamental source-coding theorem‖ can be stated as in Storer : ―Let S be a first-order source over an alphabet let be an alphabet of r > 1 characters. Then encoding the characters of S with characters of requires an average of Hr(s) characters of per character of Furthermore, for any real number > 0 there exists a coding scheme that uses an average of Hr(s) + of per character of Because of the fundamental source coding theorem entropy is a limit to the length of the string that we can use to lossless code a message. Today lossless compressors are very efficient. While it is not possible to prove that they always achieve the theoretical limit, the performances of the state of the art compressors for specific types of data, like natural language text, are certainly very close to this theoretical limit. Current research on lossless compression focuses on developing compressors for new types of digital data or on improving, often only by small amounts, existing compressors for a specific class of data. Even a small improvement is an important achievement, given the economical impact it can have on data transmission and storage. Many data compression methods are based on the notion of ―dictionary‖. In this paper we will primarily consider dictionary based compression methods. Dictionaries can be static, i.e., built at the beginning of the compression/communication process and never updated and in this case the compression method is called static dictionary method. Otherwise, if the dictionaries are consistently and independently built, maintained, and updated by both compressor 40 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 and decompressor, then the compression method is called dynamic dictionary method. We can use static dictionary methods when the source is known in advance. When we use data compression to communicate, the Sender and the Receiver use the same (static) dictionary and eventually, at the beginning of the communication, the Sender needs to send the dictionary to the Receiver. for example, in vector quantization algorithms the compressor builds up a dictionary on a training set of images and it uses this dictionary to compress the new data. The decompressor needs the same dictionary to decompress; therefore the compressor shall transmit this dictionary to the decompressor. The cost of making this dictionary available depends on its dimensions. Generally this cost is not considered in the analysis of a static dictionary compression method by using the argument that the constant cost can be amortized over time by compressing a large amount of data. In real life situations this is not always true. Frequently we have as compression target only small or medium size files, that represent computer programs, images, text data, music, videos, etc. For example it is often necessary to send only a few computer programs or a few images from a source to a destination, possibly via a telephone line by using a modem or via a fast internet connection. In these cases the cost of sending the dictionary might have an important impact on the total cost of the communication, making static compression methods unpractical. 3 Conditional Compression One option we have to increase compression is to use the knowledge of similar messages that Sender and Receiver have compressed in the past from the same source and to design algorithms that efficiently compress and decompress given this previous knowledge: by doing this the conditional entropy is the new theoretical limit and it allows for better compression. Figure 1 (from [3]) shows the results of this approach for MacNS6FullInstaller.sea.bin. We have plotted on the x-axis the number of 2,000 bytes segments on which we apply the difference operator (between this file and MacNS6FullInstaller2.sea.bin) and on the y-axis the size of the compressed file. For 41 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 convenience we have also reported the size of the gzip compressed version of MacNS6FullInstaller.sea.bin (the straight line). Figure 1. Applying the difference operator between file 1 We see in the figure that the first part of the first file and the first part of the second file (up to aboutthe first two hundreds 2,000 bytes segments) are strictly correlated. By operating a difference betweenthe actual distribution of the software and the previous version we have used previous knowledge of thesource (or the knowledge of a source that is correlated to the one we want to compress) to exploit the compression process. The savings we have obtained are limited because of the original nature and correlation of the files we are experimenting, but it is significant that even with this type of data and at least in this case, this approach works. the experiment we have used a segmentation of the two files in fixed size segments. This implies a sort of alignment of the two files we are comparing. While it has been proven that the related ―optimal vector alignment problem‖ is NP complete (see Carpentieri and Storer [4]), we can expect to be able to find heuristically good approximations of this optimal alignment and therefore to further improve the compression obtained in similar experiments by a careful alignment of the files. Cilibrasi in [10] and Cilibrasi and Vitany in [11] studied the Normalized Compression Distance (NCD) a parameter-free, universal, similarity measure based on a data compressor. NCD is a non-negative number representing how different two files are. NCD is a new way of looking at data compression that allows us to use compression programs for a wide variety of problems, such as learning, clustering and classification, data mining, etc. 42 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 NCD gives a different way of looking at the problem we deal with in this paper, the intuition is that if two files are close (in the NCD sense) then probably we can do a better job in sending one of them to a destination that already has the other file if we can use interaction. 4 Interactive Compression If we assume some degree of interaction between the compressor and the decompressor then we can exploit the previous knowledge they both have of the source when data compression is used for data ransmission. The price we accept to pay, in some cases, is a low probability of communication errors. The solution of the communication problem, in this case, is an interactive protocol that allows Sender and Receiver to take advantage of the knowledge they have of the source and that exploits the interaction to minimize the communication cost. In real life this could be used in situations in which finite size files are transmitted over a bidirectional communication line, like a telephone line, internet, a satellite link or any other line between two systems that have already available a large number of files of the same type. As an example consider an internet user that has to frequently download the same kind of file(a newsletter, an image, a report, etc.) from the same source, or a system manager that has to download repetitively an update file for a program or for an antivirus, or an ftp mirroring site that has to be periodically brought up to date, etc.The compression algorithms involved in the communication might be static or dynamic dictionary methods or, in the case of images, Vector Quantization. 4.1 Interactive data compression An efficient communication protocol that allows a Sender and a Receiver to communicate via dictionaries that are built independently from previous (and may be discordant) examples of communication messages that each of the two sides has available is presented by Carpentieri in [8]. This communication protocol results in an improvement in compression paid with a negligible or almost null possibility of communication errors. 43 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 4.2 An interactive communication protocol Consider an information source U that emits words over a certain alphabet and for which it is possible to partition the words emitted by the source into two sets: one D of cardinality m that has probability P(D) = 1 ? α and another Dcomp with probability P(Dcomp) = α, where α << 1/2, and the words w D have probability pw such that |(1 ? α)/m ? pw| < ε for an appropriately chosen ε. The set D might be viewed as the ideal ―Dictionary‖ for the source. It is possible to recognize this property in many real life situations. As an example consider a source that outputs programs. For this source an ideal set D might include all the keywords and reserved words of the programming language and the set of the most common identifiers. Consider now a situation in which Sender and Receiver have access to different corpora of the same language: for example different sets of books and/or web pages and/or newspapers, etc. They can use these different corpora to independently construct their own dictionaries. If the Sender and the Receiver have both knowledge of the source in the form of a large number of messages that are recorded in two files of length approximately n, if n is big enough and nw is the number of occurrences of w in the file and δ is an opportunely chosen number in [0, 1/2], then we have that for each word w, (P|nw/n? pw| < ε) > (1? δ). The Sender and the Receiver build up, respectively, dictionaries D S and DR of size m. They can estimate the probability of each word w by counting the number of its appearances in the file of length n. Now consider the possibility that a word is in DS but not in DR. This could happen if the frequency count of w done by the Sender does not reflect the probability p w, and therefore w belongs to DS incorrectly and w does not belong to DR correctly, or if the Receiver has a wrong estimate of p w and therefore w belongs to DS correctly but incorrectly does not belong to DR or, finally, if both Sender and Receiver have a wrong estimate of pw. Therefore, if n is big enough the probability of a word w in D S but not in DR is: P(w belongs to DS but does not belong to DR) < 2δ ? δ2. 44 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 5 Conclusions and Future Work The interaction between a Sender and a Receiver that have to communicate over a bidirectional line can improve the transmission process if both Sender and Receiver share previous knowledge about the data they want to communicate. By doing this in the fundamental source coding theorem we can substitute entropy with conditional entropy and therefore we have a new theoretical limit that allows for better compression. In this paper we have reviewed recent work that applies interactive approaches to data compression and discussed this possibility. Future work will focus on the improvement of the algorithms presented and on a wider testing of the approaches described. There are points of contacts between part of the work described in this paper and the transfer learning techniques studied in machine learning and data mining. Future work might also include a wider analysis of this relationship between the techniques that are described in this paper to improve the interaction between the Sender and the Receiver and the transfer learning techniques. Acknowledgements I would like to thank James A. Storer for fruitful discussions we had on interactive data compression, and my students Mario Immobile Molaro and Vincenzo Viola for carrying out some of the experimental work described in this paper. Thanks also to the anonymous reviewers that, with their constructive comments, have helped 45 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 in improving this paper. 46 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现附录3:中文译文来源:作者:R. M 卡伯曼互动式的数字数据压缩摘要:如果我们可以使用以前的知识源(或我们要压缩一个相关的知识来源。)利用压缩过程，那么我们就可以有显着的收益压缩。这样做的根本源泉。编码定理，我们可以替代熵条件熵，我们有一个新的理论的限制，可以更好的压缩。要做到这一点，用于数据传输的数据压缩时，我们可以假定某种程度的压缩机和减压装置，可以允许更有效地使用的以前的知识，它们都具有的源之间的相互作用。本文回顾以前的工作，适用于交互式的数据压缩方法，并讨论这种可能性。关键词:数据压缩，熵，有条件压缩 1 前言数据压缩的数字数据的传输和存储的关键。这是可能的，以提高压缩性能时，我们可以使用以前的知识在压缩过程中的发射源。在这种情况下，我们的根本源编码定理可以替代熵条件熵，我们有一个新的理论的限制，可以更好的压缩。数据压缩是用于数据传输时，我们可能会假定压缩机和减压装置，允许更有效地使用的源的以前的知识，它们都可能具有某种程度之间的相互作用。我们接受支付的价格，在某些情况下，是一个错误的概率微乎其微。在本文中，我们回顾近期的工作，适用于以前的知识的发射源数据压缩和互动方法，并讨论这些可能性。我们将集中精力，基于字典的无损压缩算法。在下一节，我们将审查有关数据压缩的理论极限的概念的基于字典的压缩方法。第3节讨论如何使用以前的知识共享知识源为代表的常见，双方都有相同的 47 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现文件，但有没有可能需要通信的发送者和接收者之间的互动。第4节评论Sender和Receiver有知识的来源，但不分享不一定是相同的，相同的，文件的情况。在这种情况下，通过利用在通信双方，发送者和接收者之间的相互作用，可以提高压缩过程中，通过使用共享的知识，如他们接受的通信错误的可能性小。第5节介绍我们的结论和新的研究方向。 2 压缩，熵和字典今天的数据和信息，是数字化的。我们需要的数据压缩处理数字数据:不压缩的情况下，我们不能存储和/或传输巨量的数据，我们把每一天都。数据压缩技术的理论背景是强大且完善。它可以追溯到克劳德?埃尔伍德香的开创性工作(见[1])。在20世纪40年代后期，他给的信息内容或随机性的来源，这是他称为熵的概念的正式定义。这样的理论背景下对压缩算法的性能，可以精确地限制。不限信息生成的过程，可以被看作是一个源，其发射来自有限字母表的符号序列。源输出可以被看作是单词序列:即，有限长度的字母符号序列。从实用的角度出发，以分割成字的源输出可能涉及语义分析，取决于信号源，它有可能只在一个确定的范围内，如果输出的源程序被解释。例如，一个字的源极，其发射英语句子，如果我们考虑一阶的源极S，连续符号是统计独立的，可以是一个序列中的一个或多个常见的英语单词，包括空格，标点符号等[1]中的香农和韦弗表明，给定的一组概率P ={P1，P2，...，pn)的源符号的码位是最佳的预期数量: 他们称之为熵的源极S的量，通常记为H(S)。视情况而定，熵H(S1| S2)是天然的熵H(S)的类似物，但它是基于的条件概率。它可以显示的是:H(S1| S2)?H(S1)。熵是衡量有关事件的不确定性:当且仅当有绝对的把握，它具有零值时，所有的事件是等概率的，这是最大的:即，是绝对的不确定性。可以表述为―基本的信源编码定理‖斯托勒:―让S是一阶源超过一个字母R>1个字符。然后HR(次)此外，每个字符的字符，字符编码字符的S平均需要对任意实数，存在使用平均 48 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现每个字符的根本源泉编码定理熵编码方案一个限制，我们可以使用无损编码，消息的字符串的长度。现在，无损压缩是非常有效的。虽然这是不可能的，以证明他们总是达到的理论极限，本技术领域的状态，压缩机的特定类型的数据的性能，如自然语言文本，当然，这个理论的限制非常接近。无损压缩目前的研究侧重于开发新的数字数据类型的压缩机或改善，往往只有少量特定类的数据，现有压缩机。即使是很小的改善是一个重要的成就，因为它可以对数据传输和存储的经济影响。许多数据压缩方法的基础上的―字典‖的概念。在本文中，我们将主要考虑基于字典的压缩方法。字典可以是静态的，即，上面的压缩/通信过程的开始和永远不会更新，在这种情况下的压缩方法被称为静态辞典的方法。否则，如果在字典一致，独立建立，维护和更新由压缩器和解压，则压缩方法被称为动态辞典的方法。当源是已知的，我们可以使用静态字典方法。当我们使用数据压缩通信的发送器和接收器使用相同的(静态的)词典，最终，在开始通信时，发送器需要发送到接收器的字典。例如，在压缩机的矢量量化算法建立的图像训练集的字典，它使用此字典压缩的新的数据。解压需要相同的字典，因此压缩机应将本字典解压。使本字典的成本取决于其尺寸。这笔费用一般不被视为一个静态字典压缩方法的分析，使用成本不变的说法，随着时间的推移，通过压缩大量数据可以摊销。这是在现实生活中的情况并非总是如此。我们经常有压缩目标只有小型或中型大小的文件，指计算机程序，图像，文字，音乐，视频等，例如，它常常是必要的，只有少数计算机程序或几个图像从一个源发送到一个目的地，可能是通过调制解调器通过电话线或通过一个快速的互联网连接。在这些情况下，发送的字典的成本可能具有的通信的总成本产生重要影响，使静态的压缩方法不实用。 3 有条件的压缩一个选项，我们必须增加压缩是使用类似的消息，发送方和接收方都压缩在过去从同一来源的知识，并设计算法，有效地压缩和解压，这个以前的知识:做这个条件熵新的理论极限，它允许更好的压缩。 49 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现其结果示于图1，这种方法的MacNS6FullInstaller.sea.bin。我们已绘制在x轴的数量为2000字节的段上的应用差分算子(此文件并 MacNS6FullInstaller2.sea.bin之间)和在y轴的压缩文件的大小。为了方便起见，我们也报道了gzip压缩版本的大小MacNS6FullInstaller.sea.bin(直线)。图1 两个文件之间对比我们看到在图中的第一个文件的第一部分和第二个文件的第一部分(最多攻方第一两百2000个字节的段)是严格的相关性。通过操作的软件和以前的版本，我们已经使用以前的知识源(或知识的源泉，我们要压缩一个相关)利用压缩过程中的实际分布之间的差异。我们所获得的节约是有限的，因为原来的性质和相关的文件，我们正在尝试，但是，即使在这种情况下，这种类型的数据和至少，这种方法效果是显着的。在实验中，我们已经使用了固定大小的段中的两个文件的分割。这意味着一种我们比较两个文件对齐。虽然它已被证明，相关的―最佳载体对齐问题‖NP完全(见卡皮特和斯托勒)，我们可以期望能够找到这个最佳路线启发式很好的近似，并因此得到进一步提高压缩在类似的实验，通过精心调整的文件。 Cilibrasi Vitany Cilibrasi研究归压缩距离(NCD)一个无参数，通用，相似措施的基础上的数据压缩。 NCD是一个非负的数字代表不同的两个文件是如何。 NCD是看数据压缩的一种新的方式，使我们能够学习，聚类和分类，数据挖掘等各种各样的问题，如使用压缩程序慢性非传染性疾病的看问题，我们处理本文给出了一种不同的方式，直觉是，如果两个文件都接近(NCD感)，那么我们可能 50 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现可以做一个更好的工作，其中之一发送到目的地，已经有其他文件，如果我们可以使用互动。 4 互动压缩如果我们假设某种程度的压缩机与解压缩器之间的相互作用，那么我们就可以利用以前的知识，它们都具有数据压缩时使用的数据ransmission源。我们接受支付的价格，在某些情况下，通信错误的概率很低。通信问题的解决方案，在这种情况下，是一个交互式的协议，它允许发送者和接收者，以利用它们的源，它利用的相互作用，以尽量减少通信成本的知识。在现实生活中，这可以用于在有限大小的文件是一个双向的通信线路传输的情况下，如电话线，网络技术，卫星链路或两个系统之间的已经可用的任何其他线路的大量的文件相同的类型。作为一个例子考虑，有一个互联网用户频繁下载同一个文件的种 (通讯，图像，报告等)从同一来源，或有重复下载的更新文件的程序或为防病毒系统经理，或一个ftp镜像网站，有定期长大迄今为止，参与通信等。压缩算法可能会对静态或动态的字典的方法，或在图像的情况下，矢量量化。 4.1 互动数据压缩一个高效的通信协议，它允许一个发送器和一个接收器，通过沟通从以前的(可能是不和谐的)例子，沟通信息，双方都有可用由Carpentieri提出[8]，建立独立的字典。这种通信协议的结果，改善在支付的压缩，可以忽略不计或通信错误的可能性几乎为零。 4.2 一个交互式通信协议考虑一个信息源Û单发射词语超过目标字母，这就是它未能分区由源发射的词语，分成两组:一个D的基数为m的概率P(D)= 1 - α和其他DCOMP概率P(DCOMP)=α，其中α<< 1/2，和词语的瓦特的概率呃，使得|(1 - α)/米 - PW | <ε用于选择适当ε。集合D可能被视为理想的―词典‖为源。这是在许多现实生活中的情况下，能够认识到这一点物业。作为一个例子考 51 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现虑一个源输出方案。对于这个源可能包括一个理想的集合D的所有关键字和保留字的编程语言和最常见的标识符的组。现在考虑一种情况，即发送方和接收方使用同一种语言的不同语料库:例如不同的套书和/或网页和/或报纸等，他们可以使用这些不同的语料库独立构建自己的字典。如果发送器和接收器有两个长约n，若n是足够大的，和nw是在该文件中出现的瓦特数在两个文件中记录了大量的消息，这些消息的形式和知识的源δ是一个适时选择号码[0，1/2]，然后，为每个词w(P | nw/n- PW | <ε)>(1 - δ)。发送者和接收者建立，分别字典大小为m的DS和DR。他们能够评估每个词w的概率计数长度为n的文件中它的出现的数量。现在考虑一个字是在DS，但不是在DR的可能性。这可能发生，如果通过发件人的w频率计数不能反映的概率PW，因此瓦特属于DS不正确和w不属于DR正确，或者如果接收器有一个错误的估计，因此PW瓦特属于到DS正确，但错误地不属于DR，最后，如果发送者和接收者有一个错误的估计PW。因此，如果n是足够大的一个词w在DS中，但不是在DR的概率是:P(瓦特属于DS，但不属于DR)<2δ - δ2。图2 发送与接受原理图 5. 结论和展望有通信的双向线发送器和接收器之间的相互作用可以改善传输过程中，如果发送者和接收者的数据共享以前的知识，他们要沟通。这样做的根本源编码定理，我们可以替代熵与条件熵，因此，我们有一个新的理论限制，可以更好的压缩。在本文中，我们回顾了近年来的工作，适用于交互式的数据压缩方法，并讨论了这种可能性。未来的工作将集中于改进的算法更 52 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现广泛的测试描述的方法。本文中所描述的工作和迁移学习技术，在机器学习和数据挖掘研究的一部分之间的接触点。未来的工作可能还包括更广泛的分析，本文中所描述的技术，发件人和接收器之间的互动和转移学习技术的提高之间的这种关系。致谢我想感谢和詹姆斯.A.斯托勒富对交互式数据压缩有成果的讨论，和我的学生马里奥对在本文中所描述的一些试点工作的开展。也感谢匿名审稿，有建设性的意见，这些都有助于改善本文。 53 北方民族大学学士学位论文 Linux下文件压缩和解压缩分析研究与实现 54

                    本文档为【Linux下文件压缩和解压缩分析研究与实现】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

Linux下文件压缩和解压缩分析研究与实现

你可能还喜欢