Strip HTML tags using Python

July 26, 2008

We often need to strip HTML tags from a string (or from HTML source). I usually do it with a simple regular expression in Python. Here is my function to strip HTML tags:

def remove_html_tags(data):
    # Lazily match from each '<' to the nearest '>' and delete the match
    p = re.compile(r'<.*?>')
    return p.sub('', data)

Here is another function that collapses runs of consecutive whitespace into a single space:

def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)

Note that the re module needs to be imported in order to use regular expressions.
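
Putting the two functions together, a minimal runnable sketch might look like the following. The patterns are the ones from the post; the helper name html_to_plain_text and the sample string are made up for illustration.

```python
import re

# Compile once at import time instead of on every call.
TAG_RE = re.compile(r'<.*?>')    # lazy match from '<' to the nearest '>'
SPACE_RE = re.compile(r'\s+')    # one or more whitespace characters


def remove_html_tags(data):
    return TAG_RE.sub('', data)


def remove_extra_spaces(data):
    return SPACE_RE.sub(' ', data)


def html_to_plain_text(data):
    # Hypothetical helper: strip tags first, then tidy the whitespace.
    return remove_extra_spaces(remove_html_tags(data)).strip()


sample = "<p>Hello,   <b>world</b>!</p>\n\n<p>Bye.</p>"
print(html_to_plain_text(sample))
```

Keep in mind that "." does not match a newline by default, so a tag that is split across lines is left in place unless the pattern is compiled with re.DOTALL.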

Here you can find updated code that gets the text from HTML: http://love-python.blogspot.com/2011/04/html-to-text-in-python.html

Comments

Graham Poulter said…

The regex will kill most of the string, if it contains "well, if a < b, then blah <em>blah</em>"

Don't count on people using &lt;, best to check for known tag names, and perhaps limit tag length to 10 characters.

how about

>>> re.sub(r"</?[^\W].{0,10}?>", "", "<a>what if 3 < 5 </a>")
'what if 3 < 5 '

August 20, 2008 at 7:37 PM

Tamim Shahriar said…

Yes, your regexp is better. Thanks.

August 23, 2008 at 11:30 PM

Graham Poulter said…

Oops... spotted a problem: it won't match if there's attributes.

The regex should probably check for a list of valid HTML tags, either closing the tag immediately after the tag name, or following it with a space and other characters (for attributes) until closing the tag.

Would still be one regex, but guess a good solution isn't one-line simple.

August 24, 2008 at 10:54 PM
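
The idea in the comment above (match only known tag names, followed either directly by the closing bracket or by whitespace and attributes) could be sketched roughly as below. The tag list is a small illustrative subset and the names KNOWN_TAGS, KNOWN_TAG_RE and strip_known_tags are assumptions, not something from the thread.

```python
import re

# Illustrative subset only; a real list would need every tag you expect.
KNOWN_TAGS = ("a", "b", "br", "div", "em", "i", "p", "span", "strong")

# "<" or "</", a known tag name, then optionally whitespace plus attributes,
# an optional "/" for self-closing tags, and the closing ">".
KNOWN_TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?/?>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def strip_known_tags(text):
    return KNOWN_TAG_RE.sub("", text)


print(strip_known_tags('a <b style="blah">gsts</b>'))
print(strip_known_tags("what if 3 < 5 <em>really</em>?"))
print(strip_known_tags("x <y> z"))  # unknown tag name, left untouched
```

The trade-off is that anything missing from the list survives in the output, so the list has to be maintained for the markup you actually expect.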

Tamim Shahriar said…

What do you think about this one:
p = re.compile(r'<[^<]*?>')

August 24, 2008 at 10:59 PM

Graham Poulter said…

that's a good one - telling it to avoid a match if there's another "<" anywhere in the potential tag should let it parse "3 < 5" safely

August 24, 2008 at 11:18 PM
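
To make the difference between the two patterns concrete, here is a small comparison on the sentence from the first comment. The constant names are made up for this sketch.

```python
import re

ORIGINAL_RE = re.compile(r'<.*?>')     # pattern from the post
REVISED_RE = re.compile(r'<[^<]*?>')   # pattern proposed in the comments

text = "well, if a < b, then blah <em>blah</em>"

# The original pattern starts at the stray "<" and eats up to the first ">".
print(repr(ORIGINAL_RE.sub('', text)))

# The revised pattern gives up as soon as it meets a second "<".
print(repr(REVISED_RE.sub('', text)))
```

The revised pattern is still a heuristic: a stray "<" that is later followed by a ">" with no other "<" in between, as in "a < b and c > d", is removed along with everything between the two.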

Unknown said…

I'm a n00b to programming in general and was just trying to write text from this website to a .txt file. Seems like whatever I do I keep getting all of the tags. I'm sure it's obvious but I don't get what's wrong. Any ideas?

import re
import os, sys, glob
from os import system
from urllib.request import urlopen

page = urlopen("http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html").read()
myfile = open('testfile.txt', 'w')
fileencoding = "iso-8859-1"
txt = page.decode(fileencoding)

def remove_html_tags(txt):
    p = re.compile(r'<[^<]*?/>')
    return p.sub('', txt)

myfile.write(txt)

March 9, 2009 at 11:53 AM
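
Independent of the pattern, one thing stands out in the script above: remove_html_tags is defined but never called, so the page is written to the file unchanged. A minimal sketch of the intended flow follows, using the pattern suggested earlier in the thread. The helper name save_as_text, the utf-8 encoding and the sample markup are assumptions, and the download step is left out so that the example runs offline.

```python
import os
import re
import tempfile

TAG_RE = re.compile(r'<[^<]*?>')


def remove_html_tags(txt):
    return TAG_RE.sub('', txt)


def save_as_text(html, path, encoding="utf-8"):
    # Hypothetical helper: call the stripping function, then write its result.
    text = remove_html_tags(html)
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)
    return text


# Stand-in for the decoded page; in the script above this would be `txt`.
page = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
out_path = os.path.join(tempfile.gettempdir(), "testfile.txt")
print(save_as_text(page, out_path))
```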

Graham Poulter said…

Ok anpanman, there's a bug in that one too (didn't we test it?).

Fixed version:

x = re.compile(r'<[^<]*?/?>')

x.sub('', 'a <b style="blah">gsts</b>')

-> 'a gsts'

.sub('', 'a t')

March 9, 2009 at 1:28 PM

Unknown said…

Hmm. I don't really get it but I suspect it's more because of my lack of fundamental skill and knowledge than your reply. Thanks for your time Graham.

March 11, 2009 at 1:50 PM

Graham Poulter said…

anpanman, the broken regex re.compile(r'<[^<]*?/>') requires a "/" right before the closing ">", so it only matches tags written like <br/>

the fixed one, re.compile(r'<[^<]*?/?>'), makes that "/" optional, so it matches open-tags and close-tags as well

March 11, 2009 at 2:15 PM
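
The two patterns can be compared side by side on a short string. The constant names and the sample markup are made up for this sketch.

```python
import re

SLASH_REQUIRED = re.compile(r'<[^<]*?/>')    # "/" must sit right before ">"
SLASH_OPTIONAL = re.compile(r'<[^<]*?/?>')   # "/" may be present or absent

markup = '<br/>text <b style="blah">bold</b>'

print(SLASH_REQUIRED.sub('', markup))
print(SLASH_OPTIONAL.sub('', markup))
```

With the slash required, only the "<br/>" is removed; with the slash optional, every tag in the sample goes.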

Graham Poulter said…

Also, the open-source app Kodos makes it much easier to test Python regexes

http://kodos.sourceforge.net/

March 11, 2009 at 2:29 PM

Karthik Viswanathan said…

I don't believe it is necessary to take "a < b" into account. HTML strings should always have < replaced with (ampersand)lt; and > replaced with (ampersand)gt; if they aren't used for a tag.

Then again, it's still a great way to prevent an error against the few who don't change them. :)

Note: (ampersand) = &. I had to write it differently so that it wouldn't show up as < or >.

April 15, 2009 at 12:43 PM
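
When the source does escape its angle brackets as described above, the standard library's html module can produce and undo the escaping. This is a rough sketch; the function name strip_tags_and_unescape and the sample text are assumptions.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]*?>')


def strip_tags_and_unescape(markup):
    # Remove tags first, then turn entities such as &lt; back into characters.
    return html.unescape(TAG_RE.sub('', markup))


# html.escape replaces &, < and > (and quotes) with their entities.
escaped = html.escape("if a < b & b > c")
print(escaped)

print(strip_tags_and_unescape("<p>" + escaped + ", then <em>blah</em></p>"))
```

Unescaping has to come after the tags are stripped; doing it first would reintroduce the literal "<" and ">" that the pattern then mistakes for markup.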

staff said…

thanks!!

September 16, 2009 at 1:17 AM

Unknown said…

Google Buzz Export to Twitter[...]I've written a python script to grab your Google Buzz feed (as detailed in the Buzz API), and automatically post your Buzz-es to Twitter. It includes a link back to the original Buzz URL[...]

February 11, 2010 at 12:39 PM

NIket said…

In the following example, tag =
the regular expression fails, because this is not a valid HTML tag.

July 14, 2010 at 5:05 PM

Igor Partola said…

Please don't! Using regular expressions to parse HTML makes kittens cry: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

August 22, 2010 at 7:13 AM

Graham Poulter said…

Igor Partola: the regex's purpose is to strip parts that vaguely resemble a pointy-bracket markup, not for parsing HTML.

How would an HTML parser work with "3 < 5 " ? By throwing an error? Not much use then.

If OP just wanted to display strings I suppose it would also work to encode the pointies as &lt; and &gt; , at the expense of messy output.

August 23, 2010 at 4:40 PM
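
On the question of how a parser copes with "3 < 5": Python's own html.parser module is lenient and does not raise on such input. A minimal sketch of a text extractor built on it is below; the class name TextExtractor and the function extract_text are made up for illustration, and other parsers may behave differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(extract_text("3 < 5 <em>really</em>"))
```

Note that this sketch also keeps the contents of script and style elements, since those arrive through handle_data as well.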

musicaonrails said…

It must be a joke that you are trying to use a Regex to parse HTML. Let's start with some Language Theory from the basic Computer Science classes.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regex on HTML is heuristics but that will not work on every condition. It is possible to find a problem in all regex that are trying to parse HTML. Please, go read "Chomsky hierarchy", and then, you are going to know that the context-free language set is bigger than the regular language set, and that is the WHY this discussion here makes no sense based on the principles of Computer Science!

January 11, 2012 at 6:40 PM
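
The heuristic nature of the approach is easy to demonstrate with inputs where a ">" or "<" appears somewhere the pattern does not expect it. The sketch below uses the pattern from earlier in the thread; the function name strip_tags and the sample inputs are assumptions.

```python
import re

TAG_RE = re.compile(r'<[^<]*?>')


def strip_tags(markup):
    return TAG_RE.sub('', markup)


# A ">" inside a quoted attribute value ends the match too early.
print(repr(strip_tags('<a title="x > y">link</a>')))

# The tags around a script are removed, but its body leaks into the text.
print(repr(strip_tags('<script>if (a < b) { run(); }</script>done')))
```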

Tamim Shahriar said…

@musicaonrails, yes, you are theoretically correct, but in practice, regex works very well, especially when you are trying to parse some specific websites. :)

January 11, 2012 at 7:28 PM

Graham Poulter said…

Remember, the regex does not in fact claim to parse HTML. What it does is strip any and all HTML tags with no regard for the grammar of HTML as a whole. It would thus work just as well on invalid tag soup as on actual HTML.

January 11, 2012 at 8:10 PM

Joany said…

I need to extract the data between HTML tags. How do I do that?

September 4, 2012 at 12:47 AM
