Intro

There are several main parts in the framework:

  • browser - Class to navigate through a specific website and extract data from its webpages (based on requests, Selenium, etc.)
  • link filter - Class to extract URLs from a page that belong to the website. There are two link filters: a default filter and a contact-page filter (reflected in the factory names below)
  • data extractor - Class to extract data from a page. At the moment there are two data extractors: one for emails and one for LinkedIn URLs (again, see the factory names below)
  • factories - Combinations of a link filter and one or more data extractors, e.g. DefaultFilterAndEmailFactory or ContactFilterAndEmailAndLinkedinFactory
  • DefaultWorker - Runs the whole extraction process: takes a factory and returns the extracted data (see the usage example below)

Simple Usage:

As a library

from pathlib import Path

from extract_emails import DefaultFilterAndEmailFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
from extract_emails.data_savers import CsvSaver


websites = [
    "website1.com",
    "website2.com",
]

browser = Browser()  # requests-based browser
data_saver = CsvSaver(save_mode="a", output_path=Path("output.csv"))  # append results to output.csv

for website in websites:
    factory = Factory(
        website_url=website, browser=browser, depth=5, max_links_from_page=1
    )
    worker = DefaultWorker(factory)
    data = worker.get_data()  # crawl the website and collect the extracted emails
    data_saver.save(data)
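
The factory determines which link filter and data extractors are applied. As a rough sketch, the ContactFilterAndEmailAndLinkedinFactory mentioned above could be swapped in the same way; this assumes it is importable from extract_emails and accepts the same constructor arguments as DefaultFilterAndEmailFactory, which may differ in practice.

from pathlib import Path

from extract_emails import ContactFilterAndEmailAndLinkedinFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
from extract_emails.data_savers import CsvSaver

browser = Browser()
data_saver = CsvSaver(save_mode="a", output_path=Path("contacts.csv"))

# Assumption: same constructor signature as DefaultFilterAndEmailFactory;
# this factory follows contact-like links and extracts emails plus LinkedIn URLs.
factory = Factory(
    website_url="website1.com", browser=browser, depth=5, max_links_from_page=1
)
worker = DefaultWorker(factory)
data_saver.save(worker.get_data())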

As a CLI tool

$ extract-emails --help

$ extract-emails --url https://en.wikipedia.org/wiki/Email -of output.csv -d 1
$ cat output.csv
email,page,website
bob@b.org,https://en.wikipedia.org/wiki/Email,https://en.wikipedia.org/wiki/Email
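
Either way, the resulting CSV has email, page and website columns, so it can be read back with the standard library. A minimal sketch, assuming the output.csv produced above:

import csv
from pathlib import Path

# Columns written by CsvSaver / the CLI: email, page, website
with Path("output.csv").open(newline="") as f:
    for row in csv.DictReader(f):
        print(row["email"], "found on", row["page"])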