Detecting Text File Encodings with C#

中游鱼 · Published 2024-09-27 21:05:24

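Preface
1. Files with a BOM
2. Files without a BOM
3. Simplified Chinese character encodings
4. Detecting the encoding in C#
5. Listing the Chinese characters of each encoding
6. Listing the code page and name of each encoding
7. Verification
8. Source code download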

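Preface

On Windows, a text file may or may not start with a BOM. The BOM (Byte Order Mark) is a short byte sequence at the very beginning of a text file; the Unicode standard uses it to signal which Unicode encoding, and which byte order, the file uses. A Unicode file with a BOM is easy to identify. For a file without a BOM, the encoding has to be inferred from the byte patterns in the file content.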

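Unicode is referred to as Unicode 16 LE (UTF-16, little-endian)
BigEndianUnicode is referred to as Unicode 16 BE (UTF-16, big-endian)
UTF32 is referred to as Unicode 32 LE (UTF-32, little-endian)
UTF32BE is referred to as Unicode 32 BE (UTF-32, big-endian)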

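1. Files with a BOM

Encoding	BOM (file bytes 0-3)	Bytes per Chinese character	Encoding of "中文" followed by CRLF
Unicode	0xFF, 0xFE, then the first character (bytes 2 and 3 are not both 0)	2 bytes per character; "中文" takes 4 bytes	2D 4E 87 65 0D 00 0A 00
BigEndianUnicode	0xFE, 0xFF, then the first character (bytes 2 and 3 are not both 0)	2 bytes per character; "中文" takes 4 bytes	4E 2D 65 87 00 0D 00 0A
UTF32	0xFF, 0xFE, 0, 0	4 bytes per character; "中文" takes 8 bytes	2D 4E 00 00 87 65 00 00 0D 00 00 00 0A 00 00 00
UTF32BE	0, 0, 0xFE, 0xFF	4 bytes per character; "中文" takes 8 bytes	00 00 4E 2D 00 00 65 87 00 00 00 0D 00 00 00 0A
UTF8	0xEF, 0xBB, 0xBF	3 bytes per character; "中文" takes 6 bytes	E4 B8 AD E6 96 87 0D 0A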

Comparison using text that mixes English and Chinese:

Encoding	String	Hex bytes
Unicode 16 LE	LOVE C#\r\n中文	FF FE 4C 00 4F 00 56 00 45 00 20 00 43 00 23 00 0D 00 0A 00 2D 4E 87 65
Unicode 16 BE	LOVE C#\r\n中文	FE FF 00 4C 00 4F 00 56 00 45 00 20 00 43 00 23 00 0D 00 0A 4E 2D 65 87
Unicode 32 LE	LOVE C#\r\n中文	FF FE 00 00 4C 00 00 00 4F 00 00 00 56 00 00 00 45 00 00 00 20 00 00 00 43 00 00 00 23 00 00 00 0D 00 00 00 0A 00 00 00 2D 4E 00 00 87 65 00 00
Unicode 32 BE	LOVE C#\r\n中文	00 00 FE FF 00 00 00 4C 00 00 00 4F 00 00 00 56 00 00 00 45 00 00 00 20 00 00 00 43 00 00 00 23 00 00 00 0D 00 00 00 0A 00 00 4E 2D 00 00 65 87
UTF8	LOVE C#\r\n中文	EF BB BF 4C 4F 56 45 20 43 23 0D 0A E4 B8 AD E6 96 87

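The BOM-prefixed samples above show that:
Unicode 16 LE and Unicode 16 BE differ only in that every 2-byte unit is byte-reversed.
Unicode 32 LE and Unicode 32 BE differ only in that every 4-byte unit is byte-reversed.
In Unicode 16, an English letter takes 2 bytes, one of which is 00. Unicode 32 takes more space: an English letter takes 4 bytes, three of which are 00. UTF8 stores an English letter in 1 byte and a Chinese character in 3 bytes, so it is the most compact of the three for English-heavy text and contains almost no 00 bytes.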
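The byte layouts in the tables above can be reproduced with the framework's own encoders. The following is a minimal sketch, not part of the original program; the class name `BomTable` and the helper `Hex` are made up for illustration. For each of the five Unicode encodings it prints the BOM returned by `GetPreamble()` and the bytes of "中文".

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Formats bytes as space-separated hex, e.g. "2D 4E 87 65".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.Unicode,              // UTF-16 LE
            Encoding.BigEndianUnicode,     // UTF-16 BE
            Encoding.UTF32,                // UTF-32 LE
            new UTF32Encoding(true, true), // UTF-32 BE (bigEndian, byteOrderMark)
            Encoding.UTF8
        };

        foreach (Encoding enc in encodings)
        {
            // GetPreamble() is the BOM; GetBytes() is the text itself, without a BOM.
            Console.WriteLine("{0}\tBOM: {1}\t中文: {2}",
                enc.WebName, Hex(enc.GetPreamble()), Hex(enc.GetBytes("中文")));
        }
    }
}
```

Keeping `GetPreamble()` and `GetBytes()` separate mirrors the tables: the BOM is a file-level prefix, not part of the encoded text.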

2. Files without a BOM

Encoding	BOM (file bytes 0-3)	Bytes per Chinese character	Encoding of "中文" followed by CRLF
Unicode	none	2 bytes per character; "中文" takes 4 bytes	2D 4E 87 65 0D 00 0A 00
BigEndianUnicode	none	2 bytes per character; "中文" takes 4 bytes	4E 2D 65 87 00 0D 00 0A
UTF32	none	4 bytes per character; "中文" takes 8 bytes	2D 4E 00 00 87 65 00 00 0D 00 00 00 0A 00 00 00
UTF32BE	none	4 bytes per character; "中文" takes 8 bytes	00 00 4E 2D 00 00 65 87 00 00 00 0D 00 00 00 0A
UTF8	none	3 bytes per character; "中文" takes 6 bytes	E4 B8 AD E6 96 87 0D 0A
ANSI	none	2 bytes per character; "中文" takes 4 bytes	D6 D0 CE C4 0D 0A

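Comparison using text that mixes English and Chinese:

Encoding	String	Hex bytes (no BOM at the start; the encoding can be inferred from the byte order)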
Unicode 16 LE	LOVE C#\r\n中文	4C 00 4F 00 56 00 45 00 20 00 43 00 23 00 0D 00 0A 00 2D 4E 87 65
Unicode 16 BE	LOVE C#\r\n中文	00 4C 00 4F 00 56 00 45 00 20 00 43 00 23 00 0D 00 0A 4E 2D 65 87
Unicode 32 LE	LOVE C#\r\n中文	4C 00 00 00 4F 00 00 00 56 00 00 00 45 00 00 00 20 00 00 00 43 00 00 00 23 00 00 00 0D 00 00 00 0A 00 00 00 2D 4E 00 00 87 65 00 00
Unicode 32 BE	LOVE C#\r\n中文	00 00 00 4C 00 00 00 4F 00 00 00 56 00 00 00 45 00 00 00 20 00 00 00 43 00 00 00 23 00 00 00 0D 00 00 00 0A 00 00 4E 2D 00 00 65 87
UTF8	LOVE C#\r\n中文	4C 4F 56 45 20 43 23 0D 0A E4 B8 AD E6 96 87

Compared with the BOM-prefixed samples, the samples without a BOM differ only in lacking the BOM bytes at the start of the file.
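One consequence is that BOM-based detection alone cannot classify these files. As an illustration that is not taken from the article's program, the sketch below uses the standard `StreamReader` constructor with `detectEncodingFromByteOrderMarks` set to `true`. When a BOM is present the reader switches to the matching encoding; when it is absent the reader simply keeps the fallback it was given. The class name `BomSniffer` and the method `Sniff` are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding StreamReader settles on after inspecting the first bytes.
    public static Encoding Sniff(byte[] data, Encoding fallback)
    {
        using (var reader = new StreamReader(new MemoryStream(data), fallback, true))
        {
            reader.Peek(); // forces the reader to fill its buffer and look for a BOM
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        byte[] withBom = { 0xFF, 0xFE, 0x2D, 0x4E, 0x87, 0x65 }; // UTF-16 LE BOM + "中文"
        byte[] withoutBom = { 0x2D, 0x4E, 0x87, 0x65 };          // same text, no BOM

        Console.WriteLine(Sniff(withBom, Encoding.ASCII).WebName);
        Console.WriteLine(Sniff(withoutBom, Encoding.ASCII).WebName);
    }
}
```

The second call has nothing to go on, which is why the rest of the article falls back on content heuristics for files without a BOM.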

3. Simplified Chinese character encodings

Chinese encoding (newest first)	Code page	String	Encoding of "中文"
Unicode 32	12000	中文	2D 4E 00 00 87 65 00 00
Unicode 32 BE	12001	中文	00 00 4E 2D 00 00 65 87
Unicode	1200	中文	2D 4E 87 65
Unicode BE	1201	中文	4E 2D 65 87
UTF8	65001	中文	E4 B8 AD E6 96 87
GB18030	CP54936	中文	D6 D0 CE C4
GBK	CP936	中文	D6 D0 CE C4
GB2312	CP20936	中文	D6 D0 CE C4
HZ-GB2312	CP52936	中文	7E 7B 56 50 4E 44 7E 7D
BIG5	CP950	中文	A4 A4 A4 E5
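Several rows of this table can be checked by asking the runtime for an encoding by code page number. The sketch below is an illustration under one assumption: it targets .NET Core 3.0 or later (including .NET 5+), where the legacy code pages 936, 950 and 54936 become available only after `CodePagesEncodingProvider` is registered. On .NET Framework these code pages are built in and the registration is not needed. The class name `CodePageBytes` and its methods are hypothetical.

```csharp
using System;
using System.Text;

public static class CodePageBytes
{
    static CodePageBytes()
    {
        // Assumption: running on .NET Core / .NET 5+, where GBK, BIG5 and GB18030
        // are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    // Encodes text with the encoding identified by a code page number.
    public static string Encode(int codePage, string text)
    {
        return Hex(Encoding.GetEncoding(codePage).GetBytes(text));
    }

    public static void Main()
    {
        int[] codePages = { 12000, 12001, 1200, 1201, 65001, 54936, 936, 950 };
        foreach (int cp in codePages)
        {
            Console.WriteLine("{0}\t{1}\t{2}", cp, Encoding.GetEncoding(cp).WebName, Encode(cp, "中文"));
        }
    }
}
```

Looking encodings up by number avoids depending on how a given runtime maps names such as "GB2312" to code pages.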

4. Detecting the encoding in C#

// Namespaces used by the code below
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms;

(1) File selection button code:

        private void button1_Click(object sender, EventArgs e)
        {
            OpenFileDialog Openfiledialog = new OpenFileDialog();
            Openfiledialog.InitialDirectory = Application.StartupPath;
            Openfiledialog.Filter = "Text files (*.TXT)|*.TXT|All files (*.*)|*.*";
            Openfiledialog.Title = "Open file";
            Openfiledialog.Multiselect = true;
            Openfiledialog.FilterIndex = 1; // FilterIndex is 1-based

            // Rebuild the two grid columns: file path and detected encoding
            dataGridView1.Columns.Clear();
            dataGridView1.DefaultCellStyle.Alignment = DataGridViewContentAlignment.MiddleCenter;
            string[] HeaderText = { "File", "Encoding" };
            for (int i = 0; i < HeaderText.Length; i++)
            {
                DataGridViewColumn col = new DataGridViewColumn();
                col.HeaderText = HeaderText[i];
                col.CellTemplate = new DataGridViewTextBoxCell();
                col.HeaderCell.Style.Alignment = DataGridViewContentAlignment.MiddleCenter;
                dataGridView1.Columns.Add(col);
            }

            if (Openfiledialog.ShowDialog() == DialogResult.OK)
            {
                // Add one row per selected file
                foreach (string file in Openfiledialog.FileNames)
                {
                    DataGridViewRow row = new DataGridViewRow();

                    DataGridViewCell cell = new DataGridViewTextBoxCell();
                    cell.Value = file;
                    row.Cells.Add(cell);

                    cell = new DataGridViewTextBoxCell();
                    cell.Value = GetEncoding(file).WebName;
                    row.Cells.Add(cell);

                    dataGridView1.Rows.Add(row);
                }
            }
            dataGridView1.AutoResizeColumns();
        }

(2) Getting the file encoding: detecting files with a BOM

        /// <summary>Gets the encoding of a file.</summary>
        /// <param name="file_name">Path of the file.</param>
        /// <returns>The detected encoding.</returns>
        private static Encoding GetEncoding(string file_name)
        {
            // On Windows a text file is either ANSI or Unicode.
            // For Unicode, Windows supports three encodings:
            // little-endian (Unicode), big-endian (BigEndianUnicode), and UTF-8.
            if (file_name == null)
            {
                throw new ArgumentNullException(nameof(file_name));
            }

            // Reads the whole file and closes the handle, even when an exception is thrown
            byte[] bytes = File.ReadAllBytes(file_name);
            int n = bytes.Length;

            // Each BOM test checks the length first, so short files never index past the end.
            // The 4-byte UTF32 BOM is tested before Unicode because it starts with the same FF FE.
            if (n >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE && bytes[2] == 0 && bytes[3] == 0) // UTF32 BOM
            {
                return Encoding.UTF32;
            }
            if (n >= 4 && bytes[0] == 0x00 && bytes[1] == 0x00 && bytes[2] == 0xFE && bytes[3] == 0xFF) // UTF32BE BOM
            {
                return Encoding.GetEncoding("utf-32BE");
            }
            if (n >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) // UTF8 BOM
            {
                return Encoding.UTF8;
            }
            if (n >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) // Unicode BOM
            {
                return Encoding.Unicode;
            }
            if (n >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) // BigEndianUnicode BOM
            {
                return Encoding.BigEndianUnicode;
            }

            // No BOM: first try to recognize UTF16 without a BOM
            Encoding encoding = CheckUtf16Ascii(bytes, bytes.Length);
            if (encoding != null)
            {
                return encoding; // UTF16 without a BOM
            }
            if (IsOnlyAscii(bytes)) // ASCII characters only
            {
                return Encoding.ASCII;
            }
            if (IsUTF8Bytes(bytes)) // UTF8 without a BOM
            {
                return Encoding.UTF8;
            }

            // UTF16, UTF8 and pure ASCII are ruled out; what remains is treated as Chinese text.
            // The byte ranges of these encodings overlap, so the result is a best guess,
            // tried from the narrowest range to the widest.
            // Note: on .NET Core / .NET 5+ these names resolve only after calling
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            if (IsGB2312(bytes)) // first-generation Simplified Chinese encoding
            {
                return Encoding.GetEncoding("GB2312");
            }
            else if (IsBIG5(bytes)) // first-generation Traditional Chinese encoding
            {
                return Encoding.GetEncoding("BIG5");
            }
            else if (IsGBK(bytes)) // second-generation Chinese encoding (Simplified and Traditional)
            {
                return Encoding.GetEncoding("GBK");
            }
            else if (IsGB18030(bytes)) // third-generation Chinese encoding; the fourth generation is Chinese in UTF
            {
                return Encoding.GetEncoding("GB18030");
            }
            else
            {
                return Encoding.Default; // not recognized; fall back to the system default encoding
            }
        }

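(3) Getting the file encoding: detecting UTF8 files without a BOM

        private static bool IsUTF8Bytes(byte[] data)
        {
            int charByteCounter = 1; // number of bytes still expected for the character being analyzed
            byte curByte;            // byte currently being analyzed

            for (int i = 0; i < data.Length; i++)
            {
                curByte = data[i];
                if (charByteCounter == 1)
                {
                    if (curByte >= 0x80)
                    {
                        // Count the leading 1 bits of the lead byte to get the sequence length
                        while (((curByte <<= 1) & 0x80) != 0)
                        {
                            charByteCounter++;
                        }
                        // A lead byte >= 0x80 must start with at least two 1 bits:
                        // 110xxxxx (2 bytes) up to 11110xxx (4 bytes, the longest valid UTF-8 form)
                        if (charByteCounter == 1 || charByteCounter > 4)
                        {
                            return false;
                        }
                    }
                }
                else
                {
                    // In UTF-8 every continuation byte must have the form 10xxxxxx
                    if ((curByte & 0xC0) != 0x80)
                    {
                        return false;
                    }
                    charByteCounter--;
                }
            }

            // The data ended in the middle of a multi-byte character: not valid UTF-8
            if (charByteCounter > 1)
            {
                return false;
            }
            return true;
        }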
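As a cross-check for the hand-written scan above, the framework's own decoder can be asked to reject invalid input. This is an alternative sketch rather than the article's method: it relies on the documented `UTF8Encoding(bool encoderShouldEmitUTF8Identifier, bool throwOnInvalidBytes)` constructor, and the class name `StrictUtf8` and method `IsValid` are hypothetical.

```csharp
using System;
using System.Text;

public static class StrictUtf8
{
    // throwOnInvalidBytes: true makes decoding throw instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true when data decodes as UTF-8 without any invalid sequence.
    public static bool IsValid(byte[] data)
    {
        try
        {
            Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = { 0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87 }; // "中文" in UTF8
        byte[] gbk = { 0xD6, 0xD0, 0xCE, 0xC4 };              // "中文" in GBK
        Console.WriteLine(IsValid(utf8));
        Console.WriteLine(IsValid(gbk));
    }
}
```

The strict decoder also rejects cases a bit-pattern scan does not look at, such as overlong forms, at the cost of decoding the whole buffer.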

(4) Getting the file encoding: detecting UTF16 files without a BOM

        private static Encoding CheckUtf16Ascii(byte[] buffer, int size)
        {
            if (buffer == null)
            {
                throw new ArgumentNullException(nameof(buffer));
            }

            if (size < 0)
            {
                throw new ArgumentOutOfRangeException(nameof(size));
            }

            if (size < 2)
            {
                return null;
            }

            // Reduce size by 1 so the pair reads below never run past the end of the buffer
            size--;

            const double threshold = 0.5; // tolerate some UTF-16 characters outside the English ISO-8859-1 subset while still detecting the encoding
            const double limit = 0.1;

            var leAsciiChars = 0;
            var beAsciiChars = 0;

            var pos = 0;
            while (pos < size)
            {
                byte ch1 = buffer[pos++];
                byte ch2 = buffer[pos++];

                // Zero byte at the even offset: looks like a big-endian ASCII-range character
                if (ch1 == 0 && ch2 != 0)
                {
                    beAsciiChars++;
                }

                // Zero byte at the odd offset: looks like a little-endian ASCII-range character
                if (ch1 != 0 && ch2 == 0)
                {
                    leAsciiChars++;
                }
            }

            // Restore size
            size++;

            double leAsciiCharsPct = leAsciiChars * 2.0 / size;
            double beAsciiCharsPct = beAsciiChars * 2.0 / size;

            if (leAsciiCharsPct > threshold && beAsciiCharsPct < limit)
            {
                return Encoding.Unicode;
            }

            if (beAsciiCharsPct > threshold && leAsciiCharsPct < limit)
            {
                return Encoding.BigEndianUnicode;
            }

            // Encoding not recognized
            return null;
        }
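To see what the two ratios look like in practice, the sketch below repeats only the pair-counting step on two samples. It is an illustration: the class `Utf16Heuristic` and the method `CountAsciiPairs` are hypothetical names, and the thresholds 0.5 and 0.1 are the ones used above. For "LOVE C#\r\n中文" in UTF-16 LE, 9 of the 11 byte pairs have a zero high byte, so the LE ratio is 18/22, about 0.82. For "中文" alone there are no zero bytes at all, so a file of purely Chinese UTF16 text without a BOM would not be recognized by this check.

```csharp
using System;
using System.Text;

public static class Utf16Heuristic
{
    // Counts byte pairs that look like one ASCII-range UTF-16 character:
    // (x, 00) suggests little-endian, (00, x) suggests big-endian.
    public static void CountAsciiPairs(byte[] data, out int le, out int be)
    {
        le = 0;
        be = 0;
        for (int i = 0; i + 1 < data.Length; i += 2)
        {
            if (data[i] != 0 && data[i + 1] == 0) le++;
            if (data[i] == 0 && data[i + 1] != 0) be++;
        }
    }

    public static void Main()
    {
        foreach (string text in new[] { "LOVE C#\r\n中文", "中文" })
        {
            byte[] data = Encoding.Unicode.GetBytes(text); // UTF-16 LE, no BOM
            int le, be;
            CountAsciiPairs(data, out le, out be);
            Console.WriteLine("{0} bytes, LE ratio {1:F2}, BE ratio {2:F2}",
                data.Length, le * 2.0 / data.Length, be * 2.0 / data.Length);
        }
    }
}
```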

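(5) Detecting ASCII and Chinese encodings for files that are not UTF8, UTF16 or UTF32

        private static bool IsOnlyAscii(byte[] bytes)
        {
            if (bytes == null)
            {
                throw new ArgumentNullException(nameof(bytes));
            }
            for (int i = 0; i < bytes.Length; i++)
            {
                if (bytes[i] > 127) return false; // ASCII only uses 0x00-0x7F
            }
            return true;
        }

        // Shared scan: every byte must be ASCII or begin a valid lead/trail pair.
        // The checks work on the raw bytes; decoding first and matching the decoded
        // text against byte ranges would not test the encoding at all.
        private static bool IsDoubleByte(byte[] bytes, Func<byte, bool> isLead, Func<byte, bool> isTrail)
        {
            for (int i = 0; i < bytes.Length; i++)
            {
                if (bytes[i] < 0x80) continue;
                if (!isLead(bytes[i]) || i + 1 >= bytes.Length || !isTrail(bytes[i + 1])) return false;
                i++; // skip the trail byte
            }
            return true;
        }

        // BIG5: lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0xA1-0xFE
        private static bool IsBIG5(byte[] bytes)
        {
            return IsDoubleByte(bytes,
                b => b >= 0x81 && b <= 0xFE,
                b => (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE));
        }

        // GBK: lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F
        private static bool IsGBK(byte[] bytes)
        {
            return IsDoubleByte(bytes,
                b => b >= 0x81 && b <= 0xFE,
                b => b >= 0x40 && b <= 0xFE && b != 0x7F);
        }

        // GB2312: lead byte 0xA1-0xF7, trail byte 0xA1-0xFE
        private static bool IsGB2312(byte[] bytes)
        {
            return IsDoubleByte(bytes,
                b => b >= 0xA1 && b <= 0xF7,
                b => b >= 0xA1 && b <= 0xFE);
        }

        // GB18030: GBK-style pairs plus four-byte sequences
        // (0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39)
        private static bool IsGB18030(byte[] bytes)
        {
            for (int i = 0; i < bytes.Length; i++)
            {
                if (bytes[i] < 0x80) continue;
                if (bytes[i] == 0x80 || bytes[i] == 0xFF || i + 1 >= bytes.Length) return false;
                byte second = bytes[i + 1];
                if (second >= 0x30 && second <= 0x39) // four-byte form
                {
                    if (i + 3 >= bytes.Length) return false;
                    if (bytes[i + 2] < 0x81 || bytes[i + 2] > 0xFE) return false;
                    if (bytes[i + 3] < 0x30 || bytes[i + 3] > 0x39) return false;
                    i += 3;
                }
                else if (second >= 0x40 && second != 0x7F && second != 0xFF) // two-byte form
                {
                    i++;
                }
                else
                {
                    return false;
                }
            }
            return true;
        }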
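The byte ranges of these encodings overlap, so for short inputs a range check cannot separate them. The sketch below illustrates this with the two encodings of "中文" from the table in section 3. It is a simplified, hypothetical helper (`RangeOverlap.AllPairsInRange`) that assumes the input consists of double-byte characters only.

```csharp
using System;

public static class RangeOverlap
{
    // True when data is a sequence of byte pairs whose lead byte lies in
    // [leadLo, leadHi] and whose trail byte lies in [trailLo, trailHi].
    public static bool AllPairsInRange(byte[] data, byte leadLo, byte leadHi, byte trailLo, byte trailHi)
    {
        if (data.Length % 2 != 0) return false;
        for (int i = 0; i < data.Length; i += 2)
        {
            if (data[i] < leadLo || data[i] > leadHi) return false;
            if (data[i + 1] < trailLo || data[i + 1] > trailHi) return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] gb = { 0xD6, 0xD0, 0xCE, 0xC4 };   // "中文" in GB2312/GBK
        byte[] big5 = { 0xA4, 0xA4, 0xA4, 0xE5 }; // "中文" in BIG5

        // GB2312 double-byte area: lead A1-F7, trail A1-FE
        Console.WriteLine(AllPairsInRange(gb, 0xA1, 0xF7, 0xA1, 0xFE));
        Console.WriteLine(AllPairsInRange(big5, 0xA1, 0xF7, 0xA1, 0xFE));
        // BIG5 with the high trail range: lead 81-FE, trail A1-FE
        Console.WriteLine(AllPairsInRange(gb, 0x81, 0xFE, 0xA1, 0xFE));
    }
}
```

Both samples pass both range checks, which is why the order in which the checks run decides the answer and why the result should be treated as a guess rather than a proof.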

5. Listing the Chinese characters of each encoding

            // Decoders used by the loops below. On .NET Core / .NET 5+, call
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
            Encoding GB18030 = Encoding.GetEncoding("GB18030");
            Encoding BIG5 = Encoding.GetEncoding("BIG5");

            Console.WriteLine("GB2312 Chinese character area B0-F7, A1-FE");
            for (int i = 176; i < 248; i++) // GB2312 area: first byte B0-F7, second byte A1-FE
            {
                for (int j = 161; j < 255; j++)
                {
                    Console.Write("{0},{1},{2},{3} ", i, j, i.ToString("X") + j.ToString("X"), GB18030.GetString(new byte[] { (byte)i, (byte)j }));
                }
                Console.Write("\r\n");
            }
            Console.WriteLine("");

            Console.WriteLine("CJK3 Chinese character area 81-A0, 40-FE");
            for (int i = 129; i < 161; i++) // CJK3 area: first byte 81-A0, second byte 40-FE
            {
                for (int j = 64; j < 255; j++)
                {
                    Console.Write("{0},{1},{2},{3} ", i, j, i.ToString("X") + j.ToString("X"), GB18030.GetString(new byte[] { (byte)i, (byte)j }));
                }
                Console.Write("\r\n");
            }
            Console.WriteLine("");

            Console.WriteLine("CJK4 Chinese character area AA-FE, 40-A0");
            for (int i = 170; i < 255; i++) // CJK4 area: first byte AA-FE, second byte 40-A0
            {
                for (int j = 64; j < 161; j++)
                {
                    Console.Write("{0},{1},{2},{3} ", i, j, i.ToString("X") + j.ToString("X"), GB18030.GetString(new byte[] { (byte)i, (byte)j }));
                }
                Console.Write("\r\n");
            }
            Console.WriteLine("");

            Console.WriteLine("BIG5 basic Chinese character area A4-C6, 40-7E");
            for (int i = 164; i < 199; i++) // BIG5 area: first byte A4-C6, second byte 40-7E
            {
                for (int j = 64; j < 127; j++) // upper bound 127 so that 7E itself is included
                {
                    Console.Write("{0},{1},{2},{3} ", i, j, i.ToString("X") + j.ToString("X"), BIG5.GetString(new byte[] { (byte)i, (byte)j }));
                }
                Console.Write("\r\n");
            }
            Console.WriteLine("");

            Console.WriteLine("BIG5 supplementary Chinese character area C9-F9, 40-D5");
            for (int i = 201; i < 250; i++) // BIG5 area: first byte C9-F9, second byte 40-D5
            {
                for (int j = 64; j < 214; j++)
                {
                    Console.Write("{0},{1},{2},{3} ", i, j, i.ToString("X") + j.ToString("X"), BIG5.GetString(new byte[] { (byte)i, (byte)j }));
                }
                Console.Write("\r\n");
            }

6. Listing the code page and name of each encoding

            using (StreamWriter writer = new StreamWriter(Path.Combine(Application.StartupPath, "EncodingCodePage.TXT"), false, Encoding.Default))
            {
                // Write one tab-separated line per encoding known to the runtime
                EncodingInfo[] encodings = Encoding.GetEncodings();
                writer.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}", "Code page", "Name", "Header name", "Web name", "Windows code page");
                foreach (var encodingInfo in encodings)
                {
                    Encoding encoding = Encoding.GetEncoding(encodingInfo.CodePage);

                    writer.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}", encodingInfo.CodePage, encodingInfo.Name, encoding.HeaderName, encoding.WebName, encoding.WindowsCodePage);
                }
            }