Tagged: publishing

  • Subinkrishna Gopi 12:34 pm on February 21, 2009 Permalink |
    Tags: iterator, publishing, text, text processing

    Working with text in Java: Using BreakIterator API 

    java.text.BreakIterator is a very good API for finding boundaries (character, word, sentence and line break) within a text. The API provides factory methods to create the iterator of our choice.

    // Instantiating a word iterator with optional locale parameter
    BreakIterator anIterator = BreakIterator.getWordInstance(locale);
    // Without locale parameter
    BreakIterator anotherIterator = BreakIterator.getWordInstance();

    The locale (java.util.Locale) is an optional parameter that gives locale-specific breaks. We can also instantiate BreakIterator without it, in which case the default locale is used. The locale matters when we are working with languages like Arabic or Chinese, where the boundary rules may differ from English.

    Once we have an instance of BreakIterator, iterating through the boundaries / breaks works the same way whichever kind of iterator it is. The API offers methods like first(), last(), previous(), next(), preceding() and following() to move between boundaries.

    Iterating through boundaries

    BreakIterator aWordIterator = null;
    String targetString = null;
    int nextIndex = -1;
    int anIndex = -1;
    // Initialising the word iterator
    aWordIterator = BreakIterator.getWordInstance();
    targetString = "This is a sample text";
    // The iterator has to be given the text before iterating
    aWordIterator.setText(targetString);
    // Iterating through the boundaries
    nextIndex = aWordIterator.first();
    while (BreakIterator.DONE != nextIndex) {
        anIndex = nextIndex;
        nextIndex = aWordIterator.next();
        // Print only the segments that start with a letter or digit,
        // which skips the whitespace between words
        if ((BreakIterator.DONE != nextIndex) &&
            Character.isLetterOrDigit(targetString.charAt(anIndex))) {
            System.out.format("%10s (%d, %d) \n",
                targetString.substring(anIndex, nextIndex),
                anIndex, nextIndex);
        }
    }

    The constant BreakIterator.DONE indicates the end of boundaries.


          This (0, 4)
            is (5, 7)
             a (8, 9)
        sample (10, 16)
          text (17, 21)
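
    The other factory and navigation methods mentioned above can be sketched in the same way. This is only an illustration: the class name BreakIteratorSketch and the helpers sentences() and wordAt() are made up for this example. It uses getSentenceInstance(locale) to split a text into sentences, and following() together with previous() to find the segment around a given offset.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original post
public class BreakIteratorSketch {

    // Splits the text into sentences using a locale-specific sentence iterator
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    // Returns the segment (word, space or punctuation) that contains the
    // given offset. Assumes 0 <= offset < text.length().
    static String wordAt(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int end = it.following(offset); // first boundary after the offset
        int start = it.previous();      // the boundary just before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "This is a sample text. It has two sentences.";
        System.out.println(sentences(text, Locale.ENGLISH));
        System.out.println(wordAt(text, 11));
    }
}
```

    Note that following() positions the iterator at the boundary it returns, which is why a plain previous() call afterwards gives the start of the same segment.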

    Useful links
    http://java.sun.com/docs/books/tutorial/i18n/text/examples/BreakIteratorDemo.java – Sun’s BreakIterator demo
    http://java.sun.com/docs/books/tutorial/i18n/text/index.html – Sun’s tutorial on “Working with text”

  • Subinkrishna Gopi 2:46 pm on February 17, 2009 Permalink |
    Tags: gnome, Linux, publishing

    Today’s tool: Gnome Blog 


    Gnome Blog is a nice tool for publishing to our blogs. It currently supports Blogger.com / Blogspot.com, Advogato.org, Movable Type, WordPress, LiveJournal.com, Pyblosxom and any other blog using the bloggerAPI or MetaWeblog.

    There is a post on Ankur’s blog about this tool. I am yet to try it.

    What is Gnome? Wikipedia/Gnome

    Home page: http://www.gnome.org/~seth/gnome-blog/
    Download page: http://www.gnome.org/~seth/gnome-blog/download.html

  • Subinkrishna Gopi 6:10 pm on February 4, 2009 Permalink |
    Tags: jdom, publishing

    A basic Jdom parser for RSS 


    Almost two years back I posted a SAX-based RSS parser (find it here) which was intended for J2ME. But we also have the JDOM parser, which I think is a lot easier to use than SAX. In this post you can find a very simple JDOM-based RSS parser.

    Know more about RSS here: http://en.wikipedia.org/wiki/RSS_(file_format)

    Step 1: Import the JDOM libraries

    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.input.SAXBuilder;

    Step 2:

    // I have not given the implementation of
    // getUrlConnectionInputStream(url)
    inputStream = getUrlConnectionInputStream(url);
    if(null == inputStream)
       throw new Exception("No input stream for " + url);
    saxBuilder = new SAXBuilder();
    document = saxBuilder.build(inputStream);
    rssFeed = _build(document);
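
    The post leaves getUrlConnectionInputStream(url) out. One possible shape for it, purely as an assumption about what such a helper might look like, is sketched below with java.net.URLConnection. The class name UrlStreams and the 10 second timeouts are made up for this example; the method returns null on failure so that it fits the null check in the snippet above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, not the author's implementation
public class UrlStreams {

    // Opens the URL and returns its input stream, or null when the URL
    // is malformed or the connection cannot be opened
    static InputStream getUrlConnectionInputStream(String url) {
        try {
            URLConnection connection = new URL(url).openConnection();
            connection.setConnectTimeout(10_000); // milliseconds, arbitrary choice
            connection.setReadTimeout(10_000);
            return connection.getInputStream();
        } catch (IOException e) {
            // MalformedURLException is an IOException too
            return null;
        }
    }
}
```

    The caller is still responsible for closing the returned stream once the document has been built.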

    Step 3:
    Implementation of _build(org.jdom.Document)

    // Entry point. Returns a RssFeed object corresponding
    // to the given RSS document
    private RssFeed _build(Document document)
    throws Exception {
      RssFeed rssFeed = null;
      Element rootElement = null;
      Element channelElement = null;
      String  rssFeedVersion = null;
      if (null == document)
        throw new Exception("Empty document");
      rootElement = document.getRootElement();
      if (!"rss".equalsIgnoreCase(rootElement.getName()))
        throw new Exception("Invalid XML");
      rssFeedVersion = rootElement.getAttributeValue("version");
      channelElement = rootElement.getChild("channel");
      if (null == channelElement)
        throw new Exception("Empty feed");
      // Getting the feed contents
      rssFeed = _getHeader(channelElement);
      rssFeed.version = rssFeedVersion;
      _addFeedItems(channelElement, rssFeed);
      return (rssFeed);
    }

    Step 4: _getHeader(org.jdom.Element)

    This method reads the feed header and sets the values to the RssFeed object.

    // Sets the RSS feed header information to the
    // RssFeed object
    private RssFeed _getHeader(Element channelElement) {
      RssFeed rssFeed = new RssFeed();
      rssFeed.title = getValueOfChildElement(channelElement, "title");
      rssFeed.link = getValueOfChildElement(channelElement, "link");
      rssFeed.description = getValueOfChildElement(channelElement, "description");
      return (rssFeed);
    }

    Step 5:
    _addFeedItems(org.jdom.Element, subin.xml.RssFeed)

    This method extracts each feed item from the XML and adds it to the given RssFeed object.

    // Iterates through the feed item list, extracts the
    // feed item details, creates corresponding RssItem object
    // and adds it to the RssFeed item list
    private void _addFeedItems(Element channelElement, RssFeed rssFeed) {
      java.util.List<Element> itemElements = null;
      RssItem anRssItem = null;
      // JDOM 1.x returns a raw List here, so this assignment
      // compiles with an unchecked warning
      itemElements = channelElement.getChildren("item");
      if (null != itemElements) {
        if (null == rssFeed.items)
          rssFeed.items = new java.util.ArrayList<RssItem>();
        for (Element anItemElement : itemElements) {
          anRssItem = new RssItem();
          anRssItem.title = getValueOfChildElement(anItemElement, "title");
          anRssItem.link = getValueOfChildElement(anItemElement, "link");
          anRssItem.description = getValueOfChildElement(anItemElement, "description");
          anRssItem.pubDate = getValueOfChildElement(anItemElement, "pubDate");
          // Adding the item to the feed
          rssFeed.items.add(anRssItem);
        }
      }
    }

    Step 6: getValueOfChildElement(org.jdom.Element, String)

    This method extracts the value of the child node (specified by its name) from the given JDOM Element.

    // Get the child node value, or null if there is no such child
    private String getValueOfChildElement(Element parentElement,
      String tagName) {
      Element childElement = null;
      String  tagValue     = null;
      childElement = parentElement.getChild(tagName);
      tagValue = (null != childElement)
        ? childElement.getValue().trim() : null;
      return (tagValue);
    }

    Step 7: RssFeed & RssItem

    Two classes to hold the feed information.

    class RssFeed {
      public  String version;
      public  String title;
      public  String description;
      public  String link;
      public  java.util.List<RssItem> items;
    }
    class RssItem {
      public  String title;
      public  String description;
      public  String link;
      public  String pubDate;
    }

    If you find it difficult to follow as it is not a single file, I’m very sorry. But I hope this will be useful.
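
    For readers who would like the steps above in one file, here is a rough single-file sketch. Please note the swap: it does not use JDOM. It uses the DOM parser that ships with the JDK (javax.xml.parsers and org.w3c.dom) so that it runs without any extra library. The class name DomRssSketch and the helpers children() and childText() are made up for this example; they play the roles of JDOM's getChildren() and of getValueOfChildElement() from Step 6.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical single-file variant built on the JDK DOM parser, not JDOM
public class DomRssSketch {

    static class RssItem {
        String title, link, description, pubDate;
    }

    static class RssFeed {
        String version, title, link, description;
        final List<RssItem> items = new ArrayList<>();
    }

    // Direct child elements of 'parent' with the given tag name
    static List<Element> children(Element parent, String tagName) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tagName.equals(n.getNodeName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // Trimmed text of the first matching child, or null when it is absent
    static String childText(Element parent, String tagName) {
        List<Element> found = children(parent, tagName);
        return found.isEmpty() ? null : found.get(0).getTextContent().trim();
    }

    // Same flow as _build(): check the root, find the channel,
    // read the header, then read the items
    static RssFeed parse(InputStream in) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in).getDocumentElement();
        if (!"rss".equalsIgnoreCase(root.getNodeName())) {
            throw new Exception("Invalid XML");
        }
        List<Element> channels = children(root, "channel");
        if (channels.isEmpty()) {
            throw new Exception("Empty feed");
        }
        Element channel = channels.get(0);

        RssFeed feed = new RssFeed();
        feed.version = root.getAttribute("version");
        feed.title = childText(channel, "title");
        feed.link = childText(channel, "link");
        feed.description = childText(channel, "description");

        for (Element itemElement : children(channel, "item")) {
            RssItem item = new RssItem();
            item.title = childText(itemElement, "title");
            item.link = childText(itemElement, "link");
            item.description = childText(itemElement, "description");
            item.pubDate = childText(itemElement, "pubDate");
            feed.items.add(item);
        }
        return feed;
    }

    public static void main(String[] args) throws Exception {
        // A tiny in-memory feed, so no network access is needed
        String xml = "<rss version=\"2.0\"><channel><title>Demo</title>"
                + "<item><title>First post</title></item></channel></rss>";
        RssFeed feed = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(feed.title + " has " + feed.items.size() + " item(s)");
    }
}
```

    One difference from the JDOM version: getAttribute() returns an empty string, not null, when the version attribute is missing.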

    • jimmyzhang 7:23 am on December 2, 2009 Permalink

      you may also want to check out vtd-xml, the latest and most advanced xml processing model


    • Subinkrishna G 2:45 pm on December 2, 2009 Permalink

      Jimmy, Thanks for the information. I will definitely have a look at it.


  • Subinkrishna Gopi 3:13 pm on July 25, 2007 Permalink |
    Tags: feeds, publishing, sax

    RSS Parser (SAX) 

    RSS (Really Simple Syndication)
    RSS is a way to publish frequently changing content such as blog posts, news updates and stock quotes. An RSS document, which is called a “feed,” “web feed,” or “channel,” contains either a summary of content from an associated web site or the full text. RSS formats are specified using XML, a generic specification for the creation of data formats.

    I have attached a simple SAX parser for RSS. Please let me know if there is any flaw in the attached code. This code is provided for learning purposes, with less focus on coding standards and efficiency. You are free to use and modify it.

    package subin.rnd.xml;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Properties;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class RssParser extends DefaultHandler {
        private String        urlString;
        private RssFeed       rssFeed;
        private StringBuilder text;
        private Item          item;
        private boolean       imgStatus;

        public RssParser(String url) {
            this.urlString = url;
            this.text = new StringBuilder();
        }

        public void parse() {
            InputStream urlInputStream = null;
            SAXParserFactory spf = null;
            SAXParser sp = null;
            try {
                URL url = new URL(this.urlString);
                // Set the proxy if needed. Replace the placeholders in
                // _setProxy() or remove this call when no proxy is used
                _setProxy();
                urlInputStream = url.openConnection().getInputStream();
                spf = SAXParserFactory.newInstance();
                if (spf != null) {
                    sp = spf.newSAXParser();
                    sp.parse(urlInputStream, this);
                }
            }
            /*
             * Exceptions need to be handled
             * MalformedURLException
             * ParserConfigurationException
             * IOException
             * SAXException
             */
            catch (Exception e) {
                System.out.println("Exception: " + e);
            }
            finally {
                try {
                    if (urlInputStream != null) urlInputStream.close();
                }
                catch (Exception e) {}
            }
        }

        public RssFeed getFeed() {
            return (this.rssFeed);
        }

        public void startElement(String uri, String localName, String qName,
                Attributes attributes) {
            // Start every element with an empty buffer so that the text
            // of the previous element does not leak into this one
            this.text.setLength(0);
            if (qName.equalsIgnoreCase("channel"))
                this.rssFeed = new RssFeed();
            else if (qName.equalsIgnoreCase("item") && (this.rssFeed != null))
                this.item = new Item();
            else if (qName.equalsIgnoreCase("image") && (this.rssFeed != null))
                this.imgStatus = true;
        }

        public void endElement(String uri, String localName, String qName) {
            if (this.rssFeed == null)
                return;
            if (qName.equalsIgnoreCase("item")) {
                // The item is complete, add it to the feed
                if (this.item != null) this.rssFeed.addItem(this.item);
                this.item = null;
            }
            else if (qName.equalsIgnoreCase("image"))
                this.imgStatus = false;
            else if (qName.equalsIgnoreCase("title")) {
                if (this.item != null) this.item.title = this.text.toString().trim();
                else if (this.imgStatus) this.rssFeed.imageTitle = this.text.toString().trim();
                else this.rssFeed.title = this.text.toString().trim();
            }
            else if (qName.equalsIgnoreCase("link")) {
                if (this.item != null) this.item.link = this.text.toString().trim();
                else if (this.imgStatus) this.rssFeed.imageLink = this.text.toString().trim();
                else this.rssFeed.link = this.text.toString().trim();
            }
            else if (qName.equalsIgnoreCase("description")) {
                if (this.item != null) this.item.description = this.text.toString().trim();
                else this.rssFeed.description = this.text.toString().trim();
            }
            else if (qName.equalsIgnoreCase("url") && this.imgStatus)
                this.rssFeed.imageUrl = this.text.toString().trim();
            else if (qName.equalsIgnoreCase("language"))
                this.rssFeed.language = this.text.toString().trim();
            else if (qName.equalsIgnoreCase("generator"))
                this.rssFeed.generator = this.text.toString().trim();
            else if (qName.equalsIgnoreCase("copyright"))
                this.rssFeed.copyright = this.text.toString().trim();
            else if (qName.equalsIgnoreCase("pubDate") && (this.item != null))
                this.item.pubDate = this.text.toString().trim();
            else if (qName.equalsIgnoreCase("category") && (this.item != null))
                this.rssFeed.addItem(this.text.toString().trim(), this.item);
        }

        public void characters(char[] ch, int start, int length) {
            this.text.append(ch, start, length);
        }

        public static void _setProxy()
        throws IOException {
            Properties sysProperties = System.getProperties();
            sysProperties.put("proxyHost", "<Proxy IP Address>");
            sysProperties.put("proxyPort", "<Proxy Port Number>");
        }

        public static class RssFeed {
            public  String title;
            public  String description;
            public  String link;
            public  String language;
            public  String generator;
            public  String copyright;
            public  String imageUrl;
            public  String imageTitle;
            public  String imageLink;
            private ArrayList <Item> items;
            private HashMap <String, ArrayList <Item>> category;

            // Adds the item to the list of all items in the feed
            public void addItem(Item item) {
                if (this.items == null)
                    this.items = new ArrayList<Item>();
                this.items.add(item);
            }

            // Adds the item to the list of items of the given category
            public void addItem(String category, Item item) {
                if (this.category == null)
                    this.category = new HashMap<String, ArrayList<Item>>();
                if (!this.category.containsKey(category))
                    this.category.put(category, new ArrayList<Item>());
                this.category.get(category).add(item);
            }
        }

        public static class Item {
            public  String title;
            public  String description;
            public  String link;
            public  String pubDate;

            public String toString() {
                return (this.title + ": " + this.pubDate + "\n" + this.description);
            }
        }
    }

    Using RssParser.java :

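    // Note: items and category are private in RssFeed, so this snippet
    // is assumed to run inside RssParser itself (e.g. in a main method)
    RssParser rp = new RssParser("<RSS Feed URL>");
    rp.parse(); // The feed is available only after parsing
    RssFeed feed = rp.getFeed();
    // Listing all categories & the no. of elements in each category
    if (feed.category != null) {
        System.out.println("Category List: ");
        for (String category : feed.category.keySet()) {
            System.out.println(category
                + ": "
                + ((ArrayList<Item>)feed.category.get(category)).size());
        }
    }
    // Listing all items in the feed
    for (int i = 0; feed.items != null && i < feed.items.size(); i++) {
        System.out.println(feed.items.get(i));
    }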
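
    A handler like this can be tried out without a network connection or proxy by feeding the SAX parser an in-memory string. The sketch below is a cut-down illustration, not the parser above: the class name SaxTitleSketch and the method parseItemTitles() are made up for this example, and it only collects the titles of the items. It shows the same idea of buffering text in characters() and reading the buffer in endElement().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical cut-down handler that collects only the item titles
public class SaxTitleSketch extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // Empty the buffer so each element sees only its own text
        text.setLength(0);
        if (qName.equalsIgnoreCase("item")) {
            inItem = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("item")) {
            inItem = false;
        } else if (qName.equalsIgnoreCase("title") && inItem) {
            titles.add(text.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, hence the append
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns the item titles in order
    static List<String> parseItemTitles(String xml) throws Exception {
        SaxTitleSketch handler = new SaxTitleSketch();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><title>Feed title</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item></channel></rss>";
        System.out.println(parseItemTitles(xml));
    }
}
```

    The channel title is skipped here because it is not inside an item, which is the same distinction the full parser makes with its item field.
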
    • sciafranz 9:20 pm on September 9, 2008 Permalink

      thank you!! I need it!

    • Subinkrishna G 10:22 am on September 10, 2008 Permalink

      My pleasure 🙂

    • Subinkrishna G 12:55 pm on January 22, 2009 Permalink

      This is a very basic RSS parser. My intention was to write a parser for J2ME enabled mobile devices. That's why I wrote it with SAX. By changing some of the Collection objects used in the code we can use it in J2ME (e.g. HashMap to Hashtable).

      But I personally prefer to use JDOM parser for all XML parsing needs. JDOM is again SAX based and gives us a DOM-like document object. And it’s very simple to use too. Find more about it here: http://www.jdom.org

      I will try to post a JDOM based parser soon.

    • vandershraaf 12:49 am on December 29, 2010 Permalink

      Dude, this is amazing. Thanks!

    • Isaac Ojeda 1:31 pm on January 22, 2011 Permalink

      Thanks!!! 😀

    • glennbech 4:53 am on March 28, 2011 Permalink

      Thanks. I need to do something like this in Android. I thought about using Rome, but it depends on JDOM, and before you know it 200kb + is added to the download just for the XML parsing.

      Great example on how to do stuff the “right way” on mobile!

    • Subinkrishna G 10:26 am on March 28, 2011 Permalink

      Sure. This code is written back in 2007 and may be missing some key. Please feel free to customize it for your needs.
