# Intro
There are several main parts in the framework:

- **browser** - class to navigate a specific website and extract data from its pages (requests, selenium, etc.)
- **link filter** - class to extract URLs from a page corresponding to the website. There are two link filters:
    - `DefaultLinkFilter` - extracts all URLs corresponding to the website
    - `ContactInfoLinkFilter` - extracts only contact URLs, e.g. /contact/, /about-us/, etc.
- **data extractor** - class to extract data from a page. At the moment there are two data extractors:
    - `EmailExtractor` - extracts all emails from the page
    - `LinkedinExtractor` - extracts all links to LinkedIn profiles from the page
- **factories** - combinations of link filters and data extractors, e.g. `DefaultFilterAndEmailFactory` or `ContactFilterAndEmailAndLinkedinFactory`
- `DefaultWorker` - all data extraction goes through here
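To see how the parts above fit together, here is a minimal, self-contained sketch of the factory/worker pattern. These are hypothetical stand-ins written for illustration, not the library's actual classes; the real implementations also handle crawling depth, browsers, and saving.

```python
import re


class DefaultLinkFilter:
    """Keep only URLs that belong to the target website."""

    def __init__(self, website_url: str):
        self.website_url = website_url

    def filter(self, urls: list[str]) -> list[str]:
        return [u for u in urls if u.startswith(self.website_url)]


class EmailExtractor:
    """Pull email addresses out of raw page text."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def extract(self, page_text: str) -> list[str]:
        return self.EMAIL_RE.findall(page_text)


class Factory:
    """Bundle a link filter with one or more data extractors."""

    def __init__(self, website_url: str):
        self.link_filter = DefaultLinkFilter(website_url)
        self.extractors = [EmailExtractor()]


class Worker:
    """Drive extraction: filter links, run every extractor on a page."""

    def __init__(self, factory: Factory):
        self.factory = factory

    def get_data(self, page_text: str, urls: list[str]):
        links = self.factory.link_filter.filter(urls)
        data = []
        for extractor in self.factory.extractors:
            data.extend(extractor.extract(page_text))
        return links, data


factory = Factory("https://example.com")
worker = Worker(factory)
links, emails = worker.get_data(
    "Contact us at info@example.com",
    ["https://example.com/about", "https://other.org/x"],
)
print(links)   # ['https://example.com/about']
print(emails)  # ['info@example.com']
```

Swapping in a different factory (say, one that pairs `ContactInfoLinkFilter` with both extractors) changes what the worker collects without changing the worker itself, which is the point of the factory layer.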
## Simple Usage
### As library
```python
from pathlib import Path

from extract_emails import DefaultFilterAndEmailFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
from extract_emails.data_savers import CsvSaver

websites = [
    "website1.com",
    "website2.com",
]

browser = Browser()
data_saver = CsvSaver(save_mode="a", output_path=Path("output.csv"))

for website in websites:
    factory = Factory(
        website_url=website, browser=browser, depth=5, max_links_from_page=1
    )
    worker = DefaultWorker(factory)
    data = worker.get_data()
    data_saver.save(data)
```
### As CLI tool
```shell
$ extract-emails --help
$ extract-emails --url https://en.wikipedia.org/wiki/Email -of output.csv -d 1
$ cat output.csv
email,page,website
bob@b.org,https://en.wikipedia.org/wiki/Email,https://en.wikipedia.org/wiki/Email
```
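The CSV has three columns: the extracted email, the page it was found on, and the website the crawl started from. It can be consumed with the standard library; a short sketch, using an inline sample in the same format rather than a real `output.csv`:

```python
import csv
import io

# Sample rows in the output format shown above (inline for illustration;
# in practice you would open output.csv instead).
sample = (
    "email,page,website\n"
    "bob@b.org,https://en.wikipedia.org/wiki/Email,"
    "https://en.wikipedia.org/wiki/Email\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["email"])  # bob@b.org
```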