Python code to scrape email addresses using regular expressions
When programmers learn to write web spiders or crawlers, they often try writing a script that parses and collects email addresses from a website. Here is a Python class that can harvest email addresses using regular expressions.
Note that it's not a bulletproof program, and don't use it for any bad purpose ;)
import re


class EmailScraper:
    def __init__(self):
        self.emails = []

    def reset(self):
        self.emails = []

    def collectAllEmail(self, htmlSource):
        "collects all possible email addresses from a string, but it can still miss some addresses"
        # example: t.s@d.com
        # the dot before the top-level domain is escaped so it matches a literal "."
        email_pattern = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
        self.emails = re.findall(email_pattern, htmlSource)

    def collectEmail(self, htmlSource):
        "collects all emails that start with mailto: in the html source string"
        # example: <a href="mailto:t.s@d.com">
        email_pattern = re.compile(r'<a\s+href="mailto:([a-zA-Z0-9._@]*)">', re.IGNORECASE)
        self.emails = re.findall(email_pattern, htmlSource)
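To see what the two methods pick up, here is a minimal sketch that applies the same two patterns to a small string. The `sample_html` content and the variable names are made up for illustration, and the dot before the top-level domain is escaped here so it only matches a literal `.`.

```python
import re

# Hypothetical input; in practice this would be the HTML of a fetched page.
sample_html = '''
<p>Contact: t.s@d.com or <a href="mailto:info@example.org">mail us</a></p>
'''

# Same idea as collectAllEmail: any address-like token anywhere in the text.
any_email = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
# Same idea as collectEmail: only addresses inside <a href="mailto:..."> links.
mailto_email = re.compile(r'<a\s+href="mailto:([a-zA-Z0-9._@]*)">', re.IGNORECASE)

all_found = any_email.findall(sample_html)
mailto_found = mailto_email.findall(sample_html)

print(all_found)     # both addresses
print(mailto_found)  # only the one inside the mailto: link
```

With the class itself, the equivalent would be to create an `EmailScraper()`, call `collectAllEmail(html)` or `collectEmail(html)` with the page source as a string, and then read the results from the `emails` attribute. Each call overwrites `emails`, so copy the list first if you want to keep results from both methods.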
Comments
Is there a more advanced version that does not miss the addresses you mentioned?
thanks a lot.
How do I use it?
thanks
Thx btw ;)
['verified_listing@2x-4ab84159ae2ff5f4ecd817beef9ede50.png', 'favorite_notif@2x-6c64c717f1101c319ee357505bbac5cd.jpg', 'activity_empty@2x-307af746773b2fc77d3b5c0ca83d65e9.png', 'rent_back_notif@2x-5682bc7a8194336bf86ec7fb60019037.jpg', 'account_creation@2x-b22082bfcd48013d684a68fb9989180a.jpg', 'top_cities@2x-d268f37ec8600943158855c910fbd9ed.png', 'powered-by-housing@2x-d73306a6a71886351a2b4af5beacd8c6.png']
How can I solve this? Please help me.
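The list above shows retina image file names such as `top_cities@2x-....png` being matched, because `@2x-...` followed by `.png` looks like a domain to the pattern. One possible workaround, sketched here under the assumption that the false positives always end in an image extension, is to drop any match that ends like a file name. The function name `find_emails` and the `IMAGE_EXTENSIONS` list are illustrative and not part of the original class.

```python
import re

EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

# Assumed list of extensions to reject; extend it for other asset types.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def find_emails(html_source):
    """Return address-like matches, skipping ones that end like an image file."""
    candidates = EMAIL_PATTERN.findall(html_source)
    # str.endswith accepts a tuple, so one call checks every extension.
    return [c for c in candidates if not c.lower().endswith(IMAGE_EXTENSIONS)]


# Made-up input mixing one image file name and one real-looking address.
sample = 'src="top_cities@2x-d268f37ec8600943158855c910fbd9ed.png" mail t.s@d.com'
print(find_emails(sample))
```

This is only a filter on top of the same pattern, so it will still miss unusual addresses and will not catch false positives with other endings.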