How to read a UTF-8 file with a BOM?

Wikipedia on the UTF-8 BOM

The byte order mark (BOM) is a Unicode character, U+FEFF byte order mark (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:[1]

  • What byte order, or endianness, the text stream is stored in;
  • The fact that the text stream is Unicode, to a high level of confidence;
  • Which of several Unicode encodings that text stream is encoded as.

BOM use is optional, and, if used, should appear at the start of the text stream.
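To make the third point concrete, here is a minimal sketch of how a program could guess the encoding from the first bytes of a stream. The class name BomSniffer and the method detect are made up for this illustration. The byte sequences are the standard BOM encodings of U+FEFF. Treat the result as a hint only: for example, FF FE 00 00 could also be UTF-16LE text that starts with a NUL character.

```java
class BomSniffer {

    /** Returns the encoding suggested by a leading BOM, or null if no BOM is found. */
    static String detect(byte[] head) {
        // The 4-byte UTF-32 marks are checked first, because FF FE 00 00
        // also begins with the 2-byte UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8)); // prints "UTF-8"
    }
}
```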

Problem

UTF-8 files are a special case: adding a BOM to them is not recommended, because it can break tools that do not expect one, and Java is among them. Java's UTF-8 decoder assumes the input has no BOM, so if one is present it is not discarded and is passed through as data (the character U+FEFF). I ran into this problem while working on integrations. I had built a web service that accepts request data. The request payload looked fine, and I could not understand why the XML validator was failing, until I noticed a strange character at the very beginning of the data.
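The effect is easy to reproduce. The following sketch (the class name BomIsData and the sample payload are invented for this illustration) decodes a few bytes that start with the UTF-8 BOM and shows that the BOM survives as the first character of the resulting string:

```java
import java.nio.charset.StandardCharsets;

class BomIsData {

    // Decodes bytes as UTF-8 exactly as the JDK does, without any BOM handling.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the four characters of "<a/>".
        byte[] payload = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        String text = decode(payload);

        System.out.println(text.length());            // prints 5, not 4
        System.out.println(text.charAt(0) == 0xFEFF); // prints true
        System.out.println(text.startsWith("<"));     // prints false
    }
}
```

A strict XML parser that expects the document to start with "<" would likely reject such a string, which matches the validation failure described above.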

Solution

Here is a Java example of how to read a text file and ignore the BOM (GitHub link):

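package com.vilbay;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Main {

    // After UTF-8 decoding, the three BOM bytes (EF BB BF) become the single character U+FEFF.
    public static final String UTF8_BOM = "\uFEFF";

    public static void main(String[] args) throws IOException {
        // example.txt is looked up on the classpath, in the same package as Main.
        InputStream inputStream = Main.class.getResourceAsStream("example.txt");
        if (inputStream == null) {
            throw new FileNotFoundException("example.txt not found on the classpath");
        }

        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
            boolean firstLine = true;
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                if (firstLine) {
                    // Only the very first line can start with the BOM.
                    line = removeUTF8BOM(line);
                    firstLine = false;
                }
                System.out.println(line);
            }
        }
    }

    // Returns the string without a leading BOM character, if there is one.
    private static String removeUTF8BOM(String string) {
        if (string.startsWith(UTF8_BOM)) {
            string = string.substring(1);
        }
        return string;
    }
}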
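An alternative, sketched here rather than taken from the original project, is to drop the BOM at the byte level before any decoding happens, so that readers and parsers further down never see it. The class name BomSkipper and its methods are hypothetical. The sketch relies on PushbackInputStream from java.io and on readNBytes and readAllBytes, which require Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkipper {

    /** Wraps a stream so that a leading EF BB BF sequence, if present, is consumed. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // may return fewer than 3 on short input
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the full content.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Convenience helper for in-memory data: decodes as UTF-8 without a leading BOM. */
    static String decode(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        System.out.println(decode(data)); // prints "ok"
    }
}
```

Compared with checking the first line, this approach also works when the data is handed to a parser that reads from an InputStream directly, as an XML validator typically does.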

 
