HTML to TEXT in Python

I just wrote a small Python program. In one part of the script I needed to get the body of a web page and strip all the HTML tags, JavaScript, CSS styles, HTML comments, etc. So I searched Google, found several threads on Stack Overflow, and then found this: http://www.aaronsw.com/2002/html2text/ It looks cool, but when I tested it against the 'about me' page of my blog, it didn't work because of some broken tags! So I started writing the HTML-to-text function myself to get only the plain text. With the help of regular expressions I solved my problem (but maybe I created more problems!). Here is my Python code:

import re

def html_to_text(data):
    # remove the newlines
    data = data.replace("\n", " ")
    data = data.replace("\r", " ")

    # collapse consecutive whitespace into a single space
    data = " ".join(data.split())

    # get only the body content
    bodyPat = re.compile(r'<body[^<>]*?>(.*?)</body>', re.I)
    result = re.findall(bodyPat, data)
    data = result[0]

    # now remove the javascript
    p = re.compile(r'<script[^<>]*?>.*?</script>', re.I)
    data = p.sub('', data)

    # remove the css styles
    p = re.compile(r'<style[^<>]*?>.*?</style>', re.I)
    data = p.sub('', data)

    # remove html comments
    p = re.compile(r'<!--.*?-->')
    data = p.sub('', data)

    # remove all the tags
    p = re.compile(r'<[^<]*?>')
    data = p.sub('', data)

    return data

Note that the tag patterns must not contain any stray spaces (`<body`, not `< body`), otherwise they will never match real markup. The function also assumes the page has a body tag; without one, `result[0]` raises an IndexError.
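To see why the spaces matter, here is a quick check with a tiny made-up page. The `page` string and the names `spaced` and `fixed` are only for illustration; the two patterns differ in nothing but the spaces inside the tags.

```python
import re

# a tiny made-up page, just for the check
page = "<html><body class='x'><p>Hi</p></body></html>"

# pattern with stray spaces inside the tags
spaced = re.compile(r'< body[^<>]*?>(.*?)< / body >', re.I)
# the same pattern with the spaces removed
fixed = re.compile(r'<body[^<>]*?>(.*?)</body>', re.I)

print(spaced.findall(page))  # []
print(fixed.findall(page))   # ['<p>Hi</p>']
```

The spaced pattern looks for the literal text `< body`, which never occurs in normal HTML, so `findall` returns an empty list.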

Please share if you have better ideas or know useful libraries that can perform better than my code. Don't forget to test it first! :)
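For comparison, a parser-based approach that uses only the standard library might look roughly like the sketch below. This is not what I used in my script, and I have not tried it on my broken page. The names `TextExtractor` and `html_to_text_parser` are made up for illustration, and unlike the regex version it does not restrict itself to the body.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # comments never arrive here; the parser routes them to handle_comment
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text_parser(markup):
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    # collapse runs of whitespace, as the regex version does
    return " ".join(" ".join(extractor.parts).split())


print(html_to_text_parser(
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script><!-- note --></body>"
))
```

Joining the pieces with a space keeps words from neighbouring tags apart, at the cost of an occasional extra space before punctuation.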

Comments

thinXer said…
I think Beautiful Soup could do this quite well, as it's quite a good HTML parser.
Ryan said…
While not a pure Python solution, I use this in Python 3 on a system with Lynx installed and callable from the command line:

import subprocess

def stripblog(html_file):
    # call lynx and do a text dump into stdout
    process = subprocess.Popen(["lynx", "-dump", "-nolist", "-nonumbers", html_file], shell=False, stdout=subprocess.PIPE)
    # convert binary data to a utf-8 string
    text = process.communicate()[0].decode("utf-8")
    return text
Hello Subeen:

Great blog! Do you know of any "Python regular expression for dummies" material? I mean, really entry level stuff...
decorr said…
Please look at NLTK's clean_html function:

http://nltk.googlecode.com/svn/trunk/doc/api/index.html
Dennis said…
just an addition to your script:

import HTMLParser
pars = HTMLParser.HTMLParser()
data = pars.unescape(data)

This converts HTML entities to Unicode characters.
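The snippet above is written for Python 2. In Python 3 the standard library has `html.unescape` for the same job, so a minimal sketch could look like this (the sample string and the variable name `decoded` are made up for illustration):

```python
import html

# named entities are turned back into the characters they stand for
decoded = html.unescape("Fish &amp; chips &lt;3 &copy; 2009")
print(decoded)  # Fish & chips <3 © 2009
```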
Ferdous said…
import re
from django.utils.encoding import force_unicode

def strip_tags(value):
    return re.sub(r'<[^>]*?>', '', force_unicode(value))
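Here is a quick check of what a tag-only substitution like this does and does not remove. To run without Django, this sketch swaps `force_unicode` for the built-in `str`; that swap and the `sample` string are assumptions for illustration only.

```python
import re

def strip_tags(value):
    # same pattern as above; str() stands in for Django's force_unicode
    return re.sub(r'<[^>]*?>', '', str(value))

sample = "<p>Hi</p><script>alert(1)</script>"
stripped = strip_tags(sample)
print(stripped)  # Hialert(1)
```

The tags are gone but the script body is still there, which is why the function in the post removes script and style blocks before stripping the remaining tags.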
Tamim Shahriar said…
Check the discussion in this post: http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html
gulam said…
wow.. Nice blog..
Unknown said…
I think Beautiful Soup should do that more efficiently :)
Tamim Shahriar said…
Beautiful Soup couldn't perform the task (because of the broken HTML), so forget about efficiency. :)
