But the problem is not with the regular expression. You just need to add the following two lines after you fetch the content of a web page:
content = content.replace("\n", "")
content = content.replace("\r", "")
Now the regex should work, provided everything else is OK!
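
Here is a small sketch of why this helps. By default "." in a Python regex does not match a newline, so a pattern that has to span several lines finds nothing until the line breaks are gone. The sample content and the title pattern are made up for illustration; they are not from the original scraper. The last line shows an alternative I am assuming would suit the same case: keep the content as it is and pass re.DOTALL.

```python
import re

# Hypothetical page content; in practice this comes from the fetched page.
content = "<title>\r\nExample\r\nPage</title>"

# Hypothetical pattern, used only to illustrate the problem.
pattern = re.compile(r"<title>(.*)</title>")

# "." does not match "\n" by default, so nothing is found here.
before = pattern.search(content)

# Strip the line breaks as described above, then search again.
flat = content.replace("\n", "").replace("\r", "")
after = pattern.search(flat)

# Alternative: leave the content alone and let "." match newlines too.
dotall = re.search(r"<title>(.*)</title>", content, re.DOTALL)

print(before)
print(after.group(1))
print(repr(dotall.group(1)))
```

Note that stripping the line breaks joins the text on neighbouring lines together, while re.DOTALL keeps the line breaks inside the captured group.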


3 comments:
Well, I always use multiline regular expressions. They work.
Besides this, it's good to use domxml to parse the content. It won't fail.
Another good technique would be using tidy and XSLT for proper scraping...
I use urllib2 for fetching content from a website.