Python code to scrape email addresses using regular expressions

Programmers who are learning to write web spiders or crawlers often try to write a script that collects email addresses from a website. Here is a Python class that can harvest email addresses. It uses regular expressions.


import re


class EmailScraper:
    def __init__(self):
        self.emails = []

    def reset(self):
        self.emails = []

    def collectAllEmail(self, htmlSource):
        "collects all possible email addresses from a string, but it can still miss some addresses"
        # example: t.s@d.com
        # raw string; the dot before the domain suffix is escaped so it matches only a literal "."
        email_pattern = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")
        self.emails = re.findall(email_pattern, htmlSource)

    def collectEmail(self, htmlSource):
        "collects all emails that start with mailto: in the html source string"
        # example: <a href="mailto:t.s@d.com">
        email_pattern = re.compile(r'<a\s+href="mailto:([a-zA-Z0-9._@]*)">', re.IGNORECASE)
        self.emails = re.findall(email_pattern, htmlSource)

Note that this is not a bulletproof program, and please don't use it for any bad purpose ;)
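A minimal usage sketch, assuming the "collect all" logic is condensed into a single helper function. The name `find_emails` and the sample HTML are made up for illustration, and the pattern is the one from the class with the dot before the domain suffix escaped.

```python
import re

# Pattern modeled on the class above; "\." matches only a literal dot.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")


def find_emails(html_source):
    """Return every substring of html_source that looks like an email address."""
    return EMAIL_PATTERN.findall(html_source)


# Hypothetical sample page content.
sample = '<p>Contact <a href="mailto:t.s@d.com">t.s@d.com</a> or admin@example.org</p>'
found = find_emails(sample)
print(found)  # ['t.s@d.com', 't.s@d.com', 'admin@example.org']
```

The same address appears twice because it occurs both in the `mailto:` link and in the link text; wrapping the result in `set()` would remove duplicates if the order does not matter.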

Comments

This was very useful for me. Thanks.
Is there a more advanced version that does not miss the ones you mentioned?

thanks a lot.
Tamim Shahriar said…
Good to know that it was useful to you. You can definitely try to improve it and make an advanced version. Thanks.
Anonymous said…
How do I specify which URL I want to scrape? Sorry... I'm just trying to learn Python for the first time.
Tamim Shahriar said…
@Matt, this code does not fetch pages by itself. Once you have downloaded the content of a URL, you can pass that content to the scraper to parse it and extract the email addresses.
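One way to do that, as a rough sketch assuming Python 3 and the standard `urllib.request` module. The helper names `fetch_html` and `emails_from_html` and the URL in the usage comment are placeholders, not part of the original class.

```python
import re
from urllib.request import urlopen

# Same idea as the pattern in the post, with the dot escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")


def fetch_html(url, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def emails_from_html(html_source):
    """Extract email-like strings from already downloaded HTML."""
    return EMAIL_PATTERN.findall(html_source)


# Usage (needs network access; the URL is only a placeholder):
# html = fetch_html("https://example.com/contact")
# print(emails_from_html(html))
```

Keeping the download and the extraction in separate functions means the extraction step can be tried on any string, without touching the network.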
fishcooker said…
If I have file:///home/knoppix/src.html

how do I use it?!
Tamim Shahriar said…
Just read the file into a variable and use it.
What does that mean? read it in a variable? Can you explain?
thanks
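"Reading the file into a variable" could look like the sketch below, assuming Python 3. The function name `emails_from_file` is made up for illustration, and the demo writes a throwaway file so that it runs anywhere.

```python
import re
import tempfile
from pathlib import Path

# Same idea as the pattern in the post, with the dot escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")


def emails_from_file(path):
    """Read a local HTML file into a string, then extract email-like strings."""
    html_source = Path(path).read_text(encoding="utf-8", errors="replace")
    return EMAIL_PATTERN.findall(html_source)


# Demo with a temporary file. For the question above, the argument would be
# the plain local path "/home/knoppix/src.html" (without the "file://" prefix).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "src.html"
    demo_path.write_text('<a href="mailto:t.s@d.com">write to me</a>', encoding="utf-8")
    demo_emails = emails_from_file(demo_path)

print(demo_emails)  # ['t.s@d.com']
```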
ndj said…
I have changed your regex a little; now no white space is allowed: [a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z]+[\.-0-9a-zA-Z]*\.[a-zA-Z]+

Thx btw ;)
Unknown said…
thanks, still useful...
Unknown said…
It also gives results like:
['verified_listing@2x-4ab84159ae2ff5f4ecd817beef9ede50.png', 'favorite_notif@2x-6c64c717f1101c319ee357505bbac5cd.jpg', 'activity_empty@2x-307af746773b2fc77d3b5c0ca83d65e9.png', 'rent_back_notif@2x-5682bc7a8194336bf86ec7fb60019037.jpg', 'account_creation@2x-b22082bfcd48013d684a68fb9989180a.jpg', 'top_cities@2x-d268f37ec8600943158855c910fbd9ed.png', 'powered-by-housing@2x-d73306a6a71886351a2b4af5beacd8c6.png']

How can I solve this? Please help me.
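One possible way to drop such matches is to filter out candidates that end in an image file extension, since names like `something@2x-....png` are retina image assets rather than addresses. This is only a sketch: the extension list and the name `find_emails` are assumptions, and a real address whose ending happens to coincide with one of these extensions would be dropped too.

```python
import re

# Same idea as the pattern in the post, with the dot escaped.
EMAIL_PATTERN = re.compile(r"[-a-zA-Z0-9._]+@[-a-zA-Z0-9_]+\.[a-zA-Z0-9_.]+")

# Assumed list of endings that indicate an image file, not an email address.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def find_emails(text):
    """Return email-like matches, skipping ones that look like image file names."""
    candidates = EMAIL_PATTERN.findall(text)
    return [c for c in candidates if not c.lower().endswith(IMAGE_EXTENSIONS)]


sample = (
    '<img src="verified_listing@2x-4ab84159ae2ff5f4ecd817beef9ede50.png"> '
    '<a href="mailto:t.s@d.com">mail</a>'
)
filtered = find_emails(sample)
print(filtered)  # ['t.s@d.com']
```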
