Working with text in Java: Using BreakIterator API

java.text.BreakIterator finds boundaries within a text: character, word, sentence and line-break boundaries. The class provides a factory method for each kind of iterator: getCharacterInstance(), getWordInstance(), getSentenceInstance() and getLineInstance().

// Instantiating a word iterator with optional locale parameter
BreakIterator anIterator = BreakIterator.getWordInstance(locale);

// Without locale parameter
BreakIterator anotherIterator = BreakIterator.getWordInstance();

The locale parameter (a java.util.Locale) is optional and selects locale-specific break rules. Without it, the factory method uses the default locale. The locale matters for languages such as Arabic or Chinese, whose boundary rules may differ from those of English.
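As a minimal sketch of passing a locale, the helper below collects the words of a string with a word iterator created for a given locale. The class name LocaleWords and the method words() are hypothetical names chosen for this illustration; only the BreakIterator calls come from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleWords {
    // Hypothetical helper: returns the words of `text`, using the
    // word-break rules of `locale`.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Keep only segments that begin with a letter or digit,
            // which skips whitespace and punctuation segments.
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Explicit locale
        System.out.println(words("Hello, world", Locale.ENGLISH)); // [Hello, world]
        // Same call with the JVM default locale, which is what the
        // no-argument getWordInstance() uses
        System.out.println(words("Hello, world", Locale.getDefault()));
    }
}
```

Taking the locale as a parameter keeps the helper usable for text whose language differs from the default locale of the machine it runs on.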

Once we have a BreakIterator instance, iterating through the boundaries works the same way whichever kind of iterator it is. The API offers first(), last(), previous(), next(), preceding() and following() to move between boundaries.
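The navigation methods other than first() and next() can be sketched as follows, assuming an English word iterator over the sample sentence used in this post. The class name BoundaryNavigation and the helper wordIterator() are hypothetical; the offsets in the comments are the word boundaries of that sentence.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundaryNavigation {
    // Hypothetical helper: a word iterator already positioned on `text`.
    static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        BreakIterator it = wordIterator("This is a sample text");

        // Offset 6 lies inside the word "is", which spans (5, 7)
        System.out.println(it.following(6));  // first boundary after 6: 7
        System.out.println(it.preceding(6));  // last boundary before 6: 5
        System.out.println(it.isBoundary(5)); // true, 5 is the start of "is"

        // Walking backwards from the last boundary to the first
        for (int i = it.last(); i != BreakIterator.DONE; i = it.previous()) {
            System.out.print(i + " "); // 21 17 16 10 9 8 7 5 4 0
        }
        System.out.println();
    }
}
```

following() and preceding() are handy when the starting point is an arbitrary offset, such as a cursor position, rather than the beginning of the text.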

Iterating through boundaries

// Requires: import java.text.BreakIterator;

// Initialising the word iterator
BreakIterator aWordIterator = BreakIterator.getWordInstance();
String targetString = "This is a sample text";
aWordIterator.setText(targetString);

// anIndex and nextIndex hold the start and end of the current segment
int anIndex = -1;
int nextIndex = -1;

// Iterating through the boundaries
nextIndex = aWordIterator.first();
while (BreakIterator.DONE != nextIndex)
{
	anIndex = nextIndex;
	nextIndex = aWordIterator.next();

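	// Print only segments that start with a letter or digit;
	// this skips the whitespace segments between words
	if ((BreakIterator.DONE != nextIndex) &&
		Character.isLetterOrDigit(targetString.charAt(anIndex)))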
	{
		System.out.format("%10s (%d, %d) \n",
			targetString.substring(anIndex, nextIndex),
			anIndex, nextIndex);
	}
}

The constant BreakIterator.DONE is returned by next() once the last boundary has been passed, which ends the loop.

Output

      This (0, 4)
        is (5, 7)
         a (8, 9)
    sample (10, 16)
      text (17, 21)
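The same loop carries over to the other iterator types. As a rough sketch, the helper below swaps in a sentence iterator and collects each sentence. The class name SentenceSplitter and the method sentences() are hypothetical, and fixing the locale to English is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Hypothetical helper: splits `text` into sentences using the
    // English sentence-break rules.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A sentence segment includes its trailing whitespace, so trim it
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello there. How are you? Fine.")) {
            System.out.println(s);
        }
        // Prints:
        // Hello there.
        // How are you?
        // Fine.
    }
}
```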

Useful links
http://java.sun.com/docs/books/tutorial/i18n/text/examples/BreakIteratorDemo.java – Sun’s BreakIterator demo
http://java.sun.com/docs/books/tutorial/i18n/text/index.html – Sun’s tutorial on “Working with text”
