Regular Expression not working in scraper?

This is a very common problem for beginners who try to write a web crawler / spider / scraper. The content is fetched, but the regex does not match what it should. :(

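But the problem is usually not with the regular expression itself. By default, "." in a pattern does not match a newline, so a pattern that has to span several lines of HTML finds nothing. You just need to add the following two lines after you fetch the content of a web page: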

content = content.replace("\n", "")
content = content.replace("\r", "")



Now the regex should work if everything else is ok!
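
To see why this helps, here is a minimal sketch in Python 3. The HTML string and the pattern are made up for illustration and stand in for whatever your scraper fetches and searches for. It also shows an alternative that may suit you better: leaving the content untouched and passing re.DOTALL so that "." matches newlines too.

```python
import re

# Hypothetical page content, standing in for what a real fetch would return.
content = '<div class="title">\r\n  Hello, world\r\n</div>'

# Hypothetical pattern: capture everything between the opening and closing tag.
pattern = r'<div class="title">(.*?)</div>'

# "." does not match "\n" by default, so the pattern cannot span the lines.
before = re.findall(pattern, content)

# The fix from this post: drop the line-break characters first.
flat = content.replace("\n", "").replace("\r", "")
after = re.findall(pattern, flat)

# Alternative: keep the content as it is and let "." match newlines as well.
dotall = re.findall(pattern, content, re.DOTALL)

print(before)
print(after)
print(dotall)
```

Stripping the line breaks changes the text you capture (words that were on separate lines end up joined), while re.DOTALL keeps the line breaks inside the match, so pick whichever suits the data you are extracting.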

Comments

Shiplu said…
Well, I always use multiline regular expressions. They work.
Besides this, it's good to use domxml to parse the content. It won't fail.
Shiplu said…
Another good technique would be using tidy and XSLT for proper scraping...
Tamim Shahriar said…
I use urllib2 for fetching content from a website.
