HTML to TEXT in Python
I just wrote a small Python program. Part of the script needed to fetch the body of a web page and strip out all the HTML tags, JavaScript, CSS styles, HTML comments, and so on. So I searched Google, found several threads on Stack Overflow, and then found this: http://www.aaronsw.com/2002/html2text/ It looks cool, but when I tested it against the 'about me' page of my blog, it didn't work because of some broken tags! So I started writing the HTML-to-text function myself, to get only the plain text. With the help of regular expressions I solved my problem (but maybe I created more problems!). Here is my Python code:
import re

def html_to_text(data):
    # remove the newlines
    data = data.replace("\n", " ")
    data = data.replace("\r", " ")
    # collapse consecutive whitespace into a single space
    data = " ".join(data.split())
    # get only the body content (keep the whole input if there is no body tag)
    bodyPat = re.compile(r'<body[^<>]*?>(.*?)</body>', re.I)
    result = bodyPat.findall(data)
    if result:
        data = result[0]
    # now remove the JavaScript
    p = re.compile(r'<script[^<>]*?>.*?</script>', re.I)
    data = p.sub('', data)
    # remove the CSS styles
    p = re.compile(r'<style[^<>]*?>.*?</style>', re.I)
    data = p.sub('', data)
    # remove HTML comments
    p = re.compile(r'<!--.*?-->')
    data = p.sub('', data)
    # remove all the remaining tags
    p = re.compile(r'<[^<]*?>')
    data = p.sub('', data)
    return data

Note that blog engines tend to swallow anything that looks like a tag (guess why ;)), so double-check the patterns after you copy-paste the function.
Please share if you have better ideas or know useful libraries that can perform better than my code. Don't forget to test it first! :)
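One alternative that avoids regular expressions is the parser in Python's standard library, which does not give up on unclosed tags. The following is only a rough sketch, assuming Python 3; the names TextExtractor and html_to_text_parser are made up for illustration, and unlike the function above it does not restrict itself to the body element.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; only text nodes do
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, as the regex version does
    return " ".join("".join(parser.parts).split())


sample = """<html><head><style>p {color: red}</style></head>
<body><p>Hello <b>world</b></p>
<script>var x = "<b>";</script>
<!-- note -->
<p>Tom &amp; Jerry</p></body></html>"""

print(html_to_text_parser(sample))
print(html_to_text_parser("<p>unclosed <b>bold"))
```

Because the parser reports text as it goes, a missing closing tag just means the rest of the input is treated as ordinary text.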
Comments
import subprocess

def stripblog(html_file):
    # call lynx and do a text dump into stdout
    process = subprocess.Popen(["lynx", "-dump", "-nolist", "-nonumbers", html_file], shell=False, stdout=subprocess.PIPE)
    # convert binary data to a utf-8 string
    text = process.communicate()[0].decode("utf-8")
    return text
Great blog! Do you know of any "Python regular expression for dummies" material? I mean, really entry level stuff...
http://nltk.googlecode.com/svn/trunk/doc/api/index.html
import HTMLParser
pars = HTMLParser.HTMLParser()
data = pars.unescape(data)
This converts HTML entities to Unicode characters.
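The HTMLParser module name in the snippet above is the Python 2 spelling. On Python 3 the same conversion is available as html.unescape in the standard library; here is a minimal sketch, with a made-up sample string:

```python
import html

# Sample input with named and numeric entities (made up for illustration)
escaped = "Tom &amp; Jerry &lt;3 caf&eacute; &#169; 2009"

# html.unescape replaces each entity with the character it stands for
text = html.unescape(escaped)
print(text)
```

Run it after the tags are stripped, otherwise an escaped &lt; turns into a real < and gets treated as the start of a tag.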
import re
from django.utils.encoding import force_unicode

def strip_tags(value):
    # coerce the value to a unicode string, then drop anything that looks like a tag
    return re.sub(r'<[^>]*?>', '', force_unicode(value))
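To try that regex without installing Django, and to see where it gets confused, here is a small sketch. The name strip_tags_plain is hypothetical and the inputs are made up; the pattern is the one from the comment above.

```python
import re

TAG_RE = re.compile(r'<[^>]*?>')


def strip_tags_plain(value):
    """Same regex as strip_tags above, minus the Django dependency."""
    return TAG_RE.sub('', value)


# Well-formed markup with no '>' inside attributes works as expected
simple = strip_tags_plain('<p>Hello <b>world</b></p>')

# A '>' inside an attribute value ends the match too early
tricky = strip_tags_plain('<a title="a > b">link</a>')

print(repr(simple))  # 'Hello world'
print(repr(tricky))  # ' b">link'
```

The second result shows the usual weakness of the regex approach: the pattern has no idea about quoted attribute values, so part of the tag leaks into the text.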