Regular Expression not working in scraper?

July 05, 2008

This is a very common problem for beginners writing a web crawler / spider / scraper. The content is fetched, but the regex does not match. :(

But the problem is usually not with the regular expression itself. By default, "." in a regex does not match a newline, so a pattern cannot match across the line breaks in the fetched HTML. You just need to add the following two lines after you fetch the content of a web page:


content = content.replace("\n", "")
content = content.replace("\r", "")

Now the regex should work, provided everything else is OK!
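To see why this helps, here is a minimal sketch. The HTML string and the title pattern are made up for illustration; they are not from any real page. Note that removing line breaks also joins the text on either side of them.

```python
import re

# Hypothetical page content: the title is split across lines (CRLF endings).
content = "<html><head><title>\r\nExample\r\nPage</title></head></html>"

# A typical beginner pattern. By default "." does not match "\n".
pattern = re.compile(r"<title>(.*?)</title>")

# The pattern cannot cross the line breaks, so there is no match.
before = pattern.search(content)

# The two-line fix from this post, chained into one expression.
flattened = content.replace("\n", "").replace("\r", "")
after = pattern.search(flattened)

print(before)          # None
print(after.group(1))  # ExamplePage
```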

Comments

Shiplu said…

Well, I always use multiline regular expressions. They work.
Besides this, it's good to use domxml to parse the content. It won't fail.
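A rough Python sketch of both suggestions, under some assumptions: in Python, the flag that lets "." match newlines is re.DOTALL (re.MULTILINE only changes how "^" and "$" behave), and the standard library's html.parser stands in here for domxml, which is a PHP extension. The sample HTML and the TitleParser class are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content with line breaks inside the title.
content = "<html><head><title>\nExample\nPage</title></head></html>"

# Option 1: keep the newlines and let "." match them with re.DOTALL.
match = re.search(r"<title>(.*?)</title>", content, re.DOTALL)
title_via_regex = match.group(1).strip()


# Option 2: use a real parser instead of a regex.
class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


parser = TitleParser()
parser.feed(content)
parser.close()
title_via_parser = "".join(parser.parts).strip()

print(repr(title_via_regex))
print(repr(title_via_parser))
```

Both approaches keep the original line breaks inside the extracted text, so nothing is glued together the way it is when newlines are stripped first.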

July 13, 2008 at 12:10 AM

Shiplu said…

Another good technique would be using tidy and xslt for proper scraping...

July 13, 2008 at 12:12 AM

Tamim Shahriar said…

I use urllib2 for fetching content from a website.
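For readers on Python 3, where urllib2 no longer exists, its role is taken by urllib.request. A minimal sketch follows; the fetch helper, its timeout default, and the UTF-8 fallback are assumptions for illustration, not the code used on this blog.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed through the two replace() calls from the post, for example content = fetch(url).replace("\n", "").replace("\r", "").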

July 13, 2008 at 10:09 AM
