HTML to TEXT in Python

April 29, 2011

I just wrote a small Python program. One part of the script needed to take the body of a web page and strip out all the HTML tags, JavaScript, CSS styles, HTML comments, and so on. So I searched Google, found several threads on Stack Overflow, and then found this: http://www.aaronsw.com/2002/html2text/ It looks cool. But when I tested it against the 'about me' page of my blog, it didn't work because of some broken tags! So I started writing an HTML-to-text function myself to get only the plain text. With the help of regular expressions I solved my problem (but maybe I created more problems!). Here is my Python code:

import re

def html_to_text(data):
    # remove the newlines
    data = data.replace("\n", " ")
    data = data.replace("\r", " ")

    # collapse consecutive spaces into a single one
    data = " ".join(data.split())

    # get only the body content
    bodyPat = re.compile(r'< body[^<>]*?>(.*?)< / body >', re.I)
    result = re.findall(bodyPat, data)
    data = result[0]

    # now remove the JavaScript
    p = re.compile(r'< script[^<>]*?>.*?< / script >', re.I)
    data = p.sub('', data)

    # remove the css styles
    p = re.compile(r'< style[^<>]*?>.*?< / style >', re.I)
    data = p.sub('', data)

    # remove html comments
    p = re.compile(r'< !--.*?-- >')
    data = p.sub('', data)

    # remove all the tags
    p = re.compile(r'<[^<]*?>')
    data = p.sub('', data)

    return data

Note that in order to use the function, you need to remove some space characters; a plain copy-paste won't work. Guess why. ;)
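For readers who would rather not hunt for the spaces, here is a copy-paste-ready sketch of the same function with them removed. It assumes a few small changes that are not part of the listing above: the comment pattern is written as `<!--.*?-->`, the script and style patterns ignore case, and an empty string is returned when no body tag is found (the choice of return value is my assumption).

```python
import re

def html_to_text(data):
    # collapse newlines and runs of whitespace into single spaces
    data = " ".join(data.split())

    # keep only the body content (assumption: return "" if there is none)
    match = re.search(r'<body[^<>]*?>(.*?)</body>', data, re.I)
    if match is None:
        return ""
    data = match.group(1)

    # drop scripts, styles and comments, then every remaining tag
    data = re.sub(r'<script[^<>]*?>.*?</script>', '', data, flags=re.I)
    data = re.sub(r'<style[^<>]*?>.*?</style>', '', data, flags=re.I)
    data = re.sub(r'<!--.*?-->', '', data)
    data = re.sub(r'<[^<]*?>', '', data)

    return data
```

Like the original, this removes tags without inserting anything in their place, so text from adjacent elements runs together unless the page already has whitespace between them.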

Please share if you have better ideas or know useful libraries that can perform better than my code. Don't forget to test it first! :)
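One option that needs nothing outside the standard library is Python 3's `html.parser`, which tends to be forgiving about unclosed tags. The sketch below is only one possible approach, not a drop-in replacement: the names `TextExtractor` and `html_to_text_parser` are made up for illustration, and unlike the regex version it keeps text from the whole document (including the title), not just the body.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # comments never reach this method, so they are dropped automatically
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text_parser(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # normalize whitespace the same way the regex version does
    return " ".join("".join(extractor.parts).split())
```

Because the parser tracks where script and style elements start and end, markup-like strings inside a script are not mistaken for tags, which is one of the places where a regex approach can go wrong.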

Comments

thinXer said…

I think Beautiful Soup could do this quite well, as it's quite a good HTML parser.

April 29, 2011 at 7:25 AM

Ryan said…

While not a pure Python solution, I use this in Python 3 on a system with Lynx installed and callable from the command line:

import subprocess

def stripblog(html_file):
    # call lynx and do a text dump into stdout
    process = subprocess.Popen(["lynx", "-dump", "-nolist", "-nonumbers", html_file], shell=False, stdout=subprocess.PIPE)
    # convert binary data to a utf-8 string
    text = process.communicate()[0].decode("utf-8")
    return text

April 29, 2011 at 10:23 AM

José L. Romero P. said…

Hello Subeen:

Great blog! Do you know of any "Python regular expression for dummies" material? I mean, really entry level stuff...

June 9, 2011 at 7:54 PM

decorr said…

Please look at NLTK's clean_html function:

http://nltk.googlecode.com/svn/trunk/doc/api/index.html

June 9, 2011 at 9:24 PM

Dennis said…

just an addition to your script:

import HTMLParser
pars = HTMLParser.HTMLParser()
data = pars.unescape(data)

converts html entities to unicode characters
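In Python 3 the `HTMLParser` module became `html.parser`, and the `unescape` method was later removed. A minimal sketch of the same step, assuming Python 3.4 or newer, uses `html.unescape` instead (the sample string is made up for illustration):

```python
import html

# html.unescape turns named and numeric entities into the characters they stand for
data = "Tom &amp; Jerry &lt;3 caf&#233;"
text = html.unescape(data)
print(text)
```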

July 5, 2011 at 7:21 PM