Strip HTML tags using Python
We often need to strip HTML tags from a string (or from HTML source). I usually do it with a simple regular expression in Python. Here is my function to strip HTML tags:
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
Here is another function that collapses runs of consecutive whitespace into a single space:
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
Note that the re module must be imported in order to use regular expressions.
Updated code that extracts the text from HTML is available here: http://love-python.blogspot.com/2011/04/html-to-text-in-python.html
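As a quick check, the two functions can be chained to turn a small HTML fragment into plain text. This is only a minimal sketch: the helper name html_to_plain_text and the sample string are made up for illustration and are not part of the original post.

```python
import re


def remove_html_tags(data):
    # Non-greedy match: everything from a '<' up to the nearest '>'.
    p = re.compile(r'<.*?>')
    return p.sub('', data)


def remove_extra_spaces(data):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    p = re.compile(r'\s+')
    return p.sub(' ', data)


def html_to_plain_text(data):
    # Hypothetical convenience wrapper: strip tags first, then tidy whitespace.
    return remove_extra_spaces(remove_html_tags(data)).strip()


sample = '<p>Hello,   <b>world</b>!</p>'
print(html_to_plain_text(sample))  # Hello, world!
```

Because "." does not match a newline by default, a tag that spans several lines is left in place by this pattern.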
Comments
Don't count on people escaping < as &lt;. It's best to check for known tag names, and perhaps to limit tag length to 10 characters.
how about
>>> re.sub("</?[^\W].{0,10}?>", "", "<a>what if 3 < 5 </>")
what if 3 < 5
The regex should probably check for a list of valid HTML tags, either closing the tag immediately after the tag name, or following it with a space and other characters (for attributes) until closing the tag.
Would still be one regex, but guess a good solution isn't one-line simple.
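One way the known-tag idea described above might look is sketched here. The tag list is a small illustrative subset chosen for this example, not a complete list of HTML elements, and the names KNOWN_TAGS, TAG_RE and strip_known_tags are assumptions made for the sketch.

```python
import re

# Illustrative subset only; a real list would cover many more elements.
KNOWN_TAGS = ("a", "b", "br", "div", "em", "i", "li", "p", "span", "strong", "ul")

# Optional '/', a known tag name, then either nothing or whitespace plus
# attribute text, an optional trailing '/', and the closing '>'.
TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?/?>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def strip_known_tags(data):
    # Remove only things that look like one of the known tags.
    return TAG_RE.sub("", data)


print(repr(strip_known_tags("<a>what if 3 < 5 </a>")))       # 'what if 3 < 5 '
print(repr(strip_known_tags('a <b style="blah">gsts</b>')))  # 'a gsts'
print(repr(strip_known_tags("x <foo> y")))                   # 'x <foo> y'
```

This is still a heuristic: text such as "3 <b and c> 5" would be treated as a b tag and removed.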
p = re.compile(r'<[^<]*?>')
import re
import os, sys, glob
from os import system
from urllib.request import urlopen
page = urlopen("http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html").read()
myfile = open('testfile.txt', 'w')
fileencoding = "iso-8859-1"
txt = page.decode(fileencoding)
def remove_html_tags(txt):
    p = re.compile(r'<[^<]*?/>')
    return p.sub('', txt)
myfile.write(txt)
Fixed version:
x = re.compile(r'<[^<]*?/?>')
x.sub('', 'a <b style="blah">gsts</b>')
-> 'a gsts'
.sub('', 'a t')
The fixed one, re.compile(r'<[^<]*?/?>'), matches opening tags as well.
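The difference between the two patterns is easy to check side by side. In this small sketch the variable names and the sample string are made up for illustration; the two patterns are the ones quoted in the comments above.

```python
import re

# Requires a literal '/>' at the end, so it only matches self-closing tags.
self_closing_only = re.compile(r'<[^<]*?/>')

# The '/' is optional here, so opening and closing tags match too.
any_tag = re.compile(r'<[^<]*?/?>')

sample = 'a <b style="blah">gsts</b> <br/>'

print(repr(self_closing_only.sub('', sample)))  # 'a <b style="blah">gsts</b> '
print(repr(any_tag.sub('', sample)))            # 'a gsts '
```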
http://kodos.sourceforge.net/
Then again, it's still a great way to prevent an error against the few who don't change them. :)
Note: (ampersand) = &. I had to write it differently so that it wouldn't show up as < or >.
The regular expression fails because this is not a valid HTML tag.
How would an HTML parser work with "3 < 5 " ? By throwing an error? Not much use then.
If OP just wanted to display strings, I suppose it would also work to encode the pointies as &lt; and &gt;, at the expense of messy output.
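For what it's worth, both points can be tried with Python's standard library. The sketch below assumes Python 3; the names TextCollector and strip_tags are made up for illustration. html.parser is lenient and passes a stray "<" through as text rather than raising an error, and html.escape does the encoding for display.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text content, including a lone '<' that starts no tag.
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return ''.join(parser.chunks)


sample = '<a>what if 3 < 5 </a>'
print(repr(strip_tags(sample)))  # 'what if 3 < 5 '
print(escape(sample))            # &lt;a&gt;what if 3 &lt; 5 &lt;/a&gt;
```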
Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regex on HTML is apply heuristics, and those will not work in every case: for any regex that tries to parse HTML, it is possible to find an input that breaks it. Please go read about the Chomsky hierarchy. You will see that the set of context-free languages is bigger than the set of regular languages, and that is WHY this discussion makes no sense based on the principles of Computer Science!
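A concrete input of that kind, for one of the patterns suggested earlier in the thread, is a ">" inside a quoted attribute value. The sample string here is made up for illustration.

```python
import re

tag_re = re.compile(r'<[^<]*?>')

# The '>' inside the attribute value ends the match too early.
html = '<a title="1 > 0">link</a>'

print(repr(tag_re.sub('', html)))  # ' 0">link'
```

The pattern stops at the first ">", so part of the attribute leaks into the output instead of just "link".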