Skip to content

Intro

Installation

pip install extract_emails[all]
# or
pip install extract_emails[httpx]
# or
pip install extract_emails[playwright]
playwright install chromium --with-deps

Quick Usage

As library

from pathlib import Path

from extract_emails import DefaultWorker
from extract_emails.browsers import ChromiumBrowser, HttpxBrowser
from extract_emails.models import PageData

def main():
    with ChromiumBrowser() as browser:
        worker = DefaultWorker("https://example.com, browser)
        data = worker.get_data()
        PageData.to_csv(data, Path("output.csv"))

    with HttpxBrowser() as browser:
        worker = DefaultWorker("https://example.com, browser)
        data = worker.get_data()
        PageData.to_csv(data, Path("output.csv"))

async def main():
    async with ChromiumBrowser() as browser:
        worker = DefaultWorker("https://example.com, browser)
        data = await worker.aget_data()
        await PageData.to_csv(data, Path("output.csv"))

    async with HttpxBrowser() as browser:
        worker = DefaultWorker("https://example.com, browser)
        data = await worker.aget_data()
        await PageData.to_csv(data, Path("output.csv"))

As CLI tool

$ extract-emails --help

$ extract-emails --url https://en.wikipedia.org/wiki/Email -of output.csv
$ cat output.csv
email,page,website
bob@b.org,https://en.wikipedia.org/wiki/Email,https://en.wikipedia.org/wiki/Email
There are several main parts in the framework:

  • browser - Class to navigate through specific website and extract data from the webpages (httpx, playwright etc.)
  • link filter - Class to extract URLs from a page corresponding to the website. There are two link filters (ContactInfoLinkFilter by default):
  • data extractor - Class to extract data from a page. At the moment there are two data extractors (both by default):
  • DefaultWorker - All data extractions goes here