Checking Whether a File Is UTF-8 Encoded in Java

Introduction: UTF-8 encoding rules

A UTF-8 encoded character can in theory be up to 6 bytes long, although 16-bit BMP characters need at most 3 bytes. (RFC 3629 has since limited UTF-8 to U+0000 through U+10FFFF, so current decoders accept at most 4 bytes per character; the 5- and 6-byte forms below belong to the original definition.) The sorting order of big-endian UCS-4 byte strings is preserved. The bytes 0xFE and 0xFF are never used in UTF-8. The following byte sequences represent a character; which one is used depends on the character's Unicode code number.

    U-00000000 - U-0000007F: 0xxxxxxx
    U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
    U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx positions are filled with the bits of the character's code number in binary; the rightmost x is the least significant bit. Only the shortest multi-byte sequence that can represent the code number may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.

For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:

    11000010 10101001 = 0xC2 0xA9

and the character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0

Special rule: a file may begin with the three bytes EF BB BF (hexadecimal), the byte order mark. This rule is not universal; whether the mark is written depends on the editing tool.

The official name of this encoding is spelled UTF-8, where UTF stands for UCS Transformation Format. Do not use any other name (such as utf8 or UTF_8) for UTF-8 in documentation, unless of course you are referring to a variable name rather than the encoding itself.

Source implementation:

    package com.yy.game.test;

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.CharBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileChannel.MapMode;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    public class UTF8Checker {

        public static void main(String[] args) throws IOException {
            File dir = new File("F:\\test");
            File[] files = dir.listFiles(); // null if dir is not a readable directory
            if (files == null) {
                return;
            }
            for (File file : files) {
                System.out.format("%s: %s, %s%n", file, check(file), check2(file));
            }
        }

        /**
         * Implementation using the JDK's built-in API.
         */
        public static boolean check2(File file) throws IOException {
            long start = System.nanoTime();
            try (FileInputStream fis = new FileInputStream(file);
                 FileChannel fc = fis.getChannel()) {
                MappedByteBuffer buf = fc.map(MapMode.READ_ONLY, 0, fc.size());
                CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
                // Size the output for the worst case so the decoder cannot overflow.
                CharBuffer cbuf = CharBuffer.allocate((int) (buf.limit() * decoder.maxCharsPerByte()));
                CoderResult result = decoder.decode(buf, cbuf, true);
                return !result.isError();
            } finally {
                long end = System.nanoTime();
                System.out.println("used(ns):" + (end - start));
            }
        }

        /**
         * Custom implementation.
         */
        public static boolean check(File file) throws IOException {
            long start = System.nanoTime();
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                StreamBuffer sbuf = new StreamBuffer(in, 1024);
                // Special rule: a leading EF BB BF (BOM) is taken as proof of UTF-8.
                if (sbuf.next() == 0xEF && sbuf.next() == 0xBB && sbuf.next() == 0xBF) {
                    return true;
                }
                sbuf.redo(); // no BOM: rewind and validate from the first byte
                // 1. U-00000000 - U-0000007F: 0xxxxxxx
                // 2. U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
                // 3. U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
                // 4. U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                // 5. U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                // 6. U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                for (int ch; (ch = sbuf.next()) != -1;) {
                    int n;
                    if (ch <= 0x7F) {
                        n = 1;
                    } else if (ch <= 0xBF) {
                        return false; // continuation byte where a lead byte is expected
                    } else if (ch <= 0xDF) {
                        n = 2;
                    } else if (ch <= 0xEF) {
                        n = 3;
                    } else if (ch <= 0xF7) {
                        n = 4;
                    } else if (ch <= 0xFB) {
                        n = 5;
                    } else if (ch <= 0xFD) {
                        n = 6;
                    } else {
                        return false; // 0xFE and 0xFF never occur in UTF-8
                    }
                    while (--n > 0) {
                        // Each following byte must be 10xxxxxx. At end of file next()
                        // returns -1, which also fails this test (truncated sequence).
                        if ((sbuf.next() & 0xC0) != 0x80) {
                            return false;
                        }
                    }
                }
                return true;
            } finally {
                long end = System.nanoTime();
                System.out.println("used(ns):" + (end - start));
            }
        }

        static class StreamBuffer {
            final InputStream in;
            final byte[] buf;
            int pos = 0; // index of the next unread byte in buf
            int len = 0; // number of valid bytes in buf; -1 once end of stream is reached

            public StreamBuffer(InputStream in, int size) {
                this.in = in;
                if (size < 3) {
                    size = 3;
                }
                this.buf = new byte[size];
            }

            /** Rewinds to the start of the current buffer (used after peeking at the first 3 bytes). */
            public void redo() {
                this.pos = 0;
            }

            /** Returns the next byte as 0..255, or -1 at end of stream. */
            public int next() throws IOException {
                if (len < 0) {
                    return -1;
                }
                if (pos >= len) {
                    len = in.read(buf); // read() returns -1 at end of stream
                    if (len <= 0) {
                        len = -1;
                        return -1;
                    }
                    pos = 0;
                }
                return this.buf[this.pos++] & 0xFF;
            }
        }
    }

In a test on the local machine, the JDK API, which has to allocate a CharBuffer, was clearly slower: more than 25% slower on every file, and in these three runs it took roughly 1.5 to 9.5 times as long. For each file below, the first time is the custom check and the second is the JDK-based check2.

    used(ns):472420
    used(ns):4490075
    F:\test\b334d5fd-b8a7-48f4-9099-f6011c7e5a48.sql: true, true
    used(ns):122515
    used(ns):343490
    F:\test\b334d5fd-b8a7-48f4-9099-f6011c7e5a482.sql: false, false
    used(ns):55164
    used(ns):82425
    F:\test\test.sql: false, false
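The bit layout in the encoding rules can be checked by hand. The following is a minimal sketch, not part of the original article: the class name Utf8EncodeSketch and the methods encode and hex are hypothetical names chosen for illustration. It fills the x positions for the 1- to 4-byte forms only, and it does not reject the surrogate range U+D800 to U+DFFF, which a complete encoder would.

```java
class Utf8EncodeSketch {

    /** Encodes one code point (U+0000..U+10FFFF) using the shortest form from the table. */
    public static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /** Formats bytes as space-separated hex, e.g. "0xC2 0xA9". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x00A9))); // prints 0xC2 0xA9
        System.out.println(hex(encode(0x2260))); // prints 0xE2 0x89 0xA0
    }
}
```

Both printed values match the two worked examples in the encoding rules (the copyright sign and the not-equal sign).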
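A structural check of lead and continuation bytes and the JDK decoder do not always agree, which is worth keeping in mind when comparing the two approaches above. The sketch below works on in-memory byte arrays rather than files; the class name Utf8ValidateSketch and the methods isValidUtf8 and looksLikeUtf8 are hypothetical, and the structural check here is a simplified variant limited to 4-byte sequences, not the article's exact routine. It assumes the whole input fits in memory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8ValidateSketch {

    /** Strict check: the JDK decoder throws on any malformed sequence. */
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Structural check only: the lead byte gives the length, every following byte must be 10xxxxxx. */
    public static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int n;
            if (b <= 0x7F) {
                n = 1;
            } else if (b <= 0xBF) {
                return false;            // continuation byte in lead position
            } else if (b <= 0xDF) {
                n = 2;
            } else if (b <= 0xEF) {
                n = 3;
            } else if (b <= 0xF7) {
                n = 4;
            } else {
                return false;            // 5- and 6-byte forms, 0xFE and 0xFF
            }
            if (i + n > data.length) {
                return false;            // sequence truncated at end of input
            }
            for (int k = 1; k < n; k++) {
                if ((data[i + k] & 0xC0) != 0x80) {
                    return false;        // not of the form 10xxxxxx
                }
            }
            i += n;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] copyright = { (byte) 0xC2, (byte) 0xA9 }; // U+00A9
        byte[] badTail   = { (byte) 0xC2, (byte) 0xE9 }; // second byte is 11101001
        byte[] overlong  = { (byte) 0xC0, (byte) 0x80 }; // non-shortest form of U+0000
        System.out.println(looksLikeUtf8(copyright) + " " + isValidUtf8(copyright));
        System.out.println(looksLikeUtf8(badTail) + " " + isValidUtf8(badTail));
        System.out.println(looksLikeUtf8(overlong) + " " + isValidUtf8(overlong));
    }
}
```

The overlong case is where the two checks differ: it has the right byte shapes, so the structural check accepts it, but it violates the shortest-form rule, so the JDK decoder rejects it. The badTail case shows why the continuation test masks with 0xC0 rather than 0x80: 0xE9 has its top bit set but is not a continuation byte.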
