Python code to scrape email addresses using regular expressions

When programmers learn to write web spiders or crawlers, they often try writing a script to collect email addresses from a website. Here is a Python class that harvests email addresses from a string of HTML using regular expressions.


import re

class EmailScraper:
    def __init__(self):
        self.emails = []

    def reset(self):
        self.emails = []

    def collectAllEmail(self, htmlSource):
        "collects all possible email addresses from a string, but it can still miss some addresses"
        # example: t.s@d.com
        # raw string; the dot before the top-level domain is escaped so it only matches a literal "."
        email_pattern = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
        self.emails = re.findall(email_pattern, htmlSource)

    def collectEmail(self, htmlSource):
        "collects all emails that start with mailto: in the html source string"
        # example: <a href="mailto:t.s@d.com">
        email_pattern = re.compile(r'<a\s+href="mailto:([a-zA-Z0-9._@]*)">', re.IGNORECASE)
        self.emails = re.findall(email_pattern, htmlSource)

Note that this is not a bulletproof program, and please don't use it for any bad purpose ;)

Comments

This was very useful for me, thanks.
Is there a more advanced version that does not miss the addresses you mentioned?

Thanks a lot.
Tamim Shahriar said…
Good to know that it was useful to you. You can definitely try to improve it and make an advanced version. Thanks.
Anonymous said…
How do I specify which URL I want to scrape? Sorry... I'm just trying to learn Python for the first time.
Tamim Shahriar said…
@Matt, you can't do that with this code alone, because it does not fetch pages. Once you have the content of a URL as a string, you can pass that string to one of the methods to extract the email addresses.
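For readers with the same question, here is a minimal sketch of that idea using only the standard library. The names `fetch_html` and `find_emails` are made up for this illustration and are not part of the class in the post; `find_emails` uses the same pattern as `collectAllEmail`, with the dot before the top-level domain escaped, and the URL in the comment is only a placeholder.

```python
import re
from urllib.request import urlopen

# Same idea as collectAllEmail in the post; the dot before the TLD is escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_emails(html_source):
    """Return every substring of html_source that looks like an email address."""
    return EMAIL_PATTERN.findall(html_source)


# Typical use (placeholder URL, needs network access):
# html_source = fetch_html("https://example.com/")
# print(find_emails(html_source))
```

With the class from the post, the downloaded string would go to `collectAllEmail` or `collectEmail` instead of `find_emails`, and the results would then be in the `emails` attribute.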
fishcooker said…
If I have file:///home/knoppix/src.html, how do I use it?
Tamim Shahriar said…
Just read the file into a variable and pass that variable to the scraper.
What does that mean, read it into a variable? Can you explain?
Thanks.
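"Reading the file into a variable" could look roughly like the sketch below. It assumes a POSIX-style path and a UTF-8 encoded file; `file_url_to_path` and `read_html_file` are hypothetical helper names, not something from the post.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def file_url_to_path(file_url):
    """Turn a file:// URL into a plain filesystem path (POSIX-style assumption)."""
    return unquote(urlparse(file_url).path)


def read_html_file(path, encoding="utf-8"):
    """Read the whole file into one string; undecodable bytes are replaced."""
    return Path(path).read_text(encoding=encoding, errors="replace")


# Typical use with the file from the question:
# path = file_url_to_path("file:///home/knoppix/src.html")
# html_source = read_html_file(path)
# scraper = EmailScraper()
# scraper.collectAllEmail(html_source)
# print(scraper.emails)
```

The string in `html_source` is what the methods of the class expect as their `htmlSource` argument.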
ndj said…
I have changed your regex a little; now no white space is allowed: [a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z]+[\.-0-9a-zA-Z]*\.[a-zA-Z]+

Thx btw ;)
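One thing to watch in this variant: inside the second character class, `\.-0` is parsed by Python's `re` module as a range from `.` to `0`, so the class ends up containing `.`, `/`, `0`, `-`, `9` and the letters, but not the digits 1 to 8. The sketch below compares the pattern as posted with an adjusted one in which the hyphen is moved to the front of the class; the adjusted pattern and the sample address are my own illustration, not ndj's.

```python
import re

# ndj's pattern, copied as posted in the comment.
NDJ_PATTERN = re.compile(
    r"[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z]+[\.-0-9a-zA-Z]*\.[a-zA-Z]+"
)

# Hypothetical adjustment: a leading or trailing hyphen in a class is literal,
# so "0-9" is read as the digit range again.
ADJUSTED_PATTERN = re.compile(
    r"[a-zA-Z0-9+_.-]+@[0-9a-zA-Z]+[-.0-9a-zA-Z]*\.[a-zA-Z]+"
)

# Made-up sample with a digit between 1 and 8 inside the domain name.
sample = "write to bob@mail.host5a.example.com today"

ndj_result = NDJ_PATTERN.findall(sample)            # ['bob@mail.host']
adjusted_result = ADJUSTED_PATTERN.findall(sample)  # ['bob@mail.host5a.example.com']

print(ndj_result)
print(adjusted_result)
```

The posted pattern stops at the `5` and reports a truncated address, while the adjusted one keeps the whole domain.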
Unknown said…
thanks, still useful...
Unknown said…
It also gives results like:
['verified_listing@2x-4ab84159ae2ff5f4ecd817beef9ede50.png', 'favorite_notif@2x-6c64c717f1101c319ee357505bbac5cd.jpg', 'activity_empty@2x-307af746773b2fc77d3b5c0ca83d65e9.png', 'rent_back_notif@2x-5682bc7a8194336bf86ec7fb60019037.jpg', 'account_creation@2x-b22082bfcd48013d684a68fb9989180a.jpg', 'top_cities@2x-d268f37ec8600943158855c910fbd9ed.png', 'powered-by-housing@2x-d73306a6a71886351a2b4af5beacd8c6.png']



How can I solve this? Please help.
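These matches are image file names such as `name@2x-<hash>.png`, which happen to have the same shape as an address. One possible workaround, sketched below, is to drop every match that ends in a known asset extension. The extension list and the name `find_emails_skipping_assets` are assumptions for this example and would need to be adapted to the pages being scraped.

```python
import re

# Same idea as collectAllEmail in the post; the dot before the TLD is escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

# Assumed list of file extensions that should never count as a top-level domain.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")


def find_emails_skipping_assets(html_source):
    """Find address-like strings, then discard those that look like file names."""
    candidates = EMAIL_PATTERN.findall(html_source)
    return [
        candidate
        for candidate in candidates
        if not candidate.lower().endswith(ASSET_EXTENSIONS)
    ]


html_source = (
    '<img src="verified_listing@2x-4ab84159ae2ff5f4ecd817beef9ede50.png">'
    '<a href="mailto:t.s@d.com">mail</a>'
)
print(find_emails_skipping_assets(html_source))  # ['t.s@d.com']
```

A stricter alternative would be to accept only matches whose last label is purely alphabetic, but that still lets `.png` through, so an explicit reject list is the simpler choice here.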
