Link Filters
LinkFilterBase
Bases: ABC
Base class for link filters
Source code in extract_emails/link_filters/link_filter_base.py
__init__(website)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
website | str | website address (scheme and domain), e.g. https://example.com | required |
Source code in extract_emails/link_filters/link_filter_base.py
filter(urls)
abstractmethod
Filter links according to criteria defined by the subclass
Parameters:
Name | Type | Description | Default |
---|---|---|---|
urls | Iterable[str] | List of URLs for filtering | required |
Returns:
Type | Description |
---|---|
list[str] | List of filtered URLs |
Source code in extract_emails/link_filters/link_filter_base.py
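Because filter is abstract, every concrete filter has to supply its own implementation. The following is a minimal sketch of that pattern. To stay self-contained it uses a stand-in class, LinkFilterSketch, that only mirrors the interface documented above; it is not the library class, and PdfLinkFilter is a hypothetical example filter.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable
from urllib.parse import urljoin


class LinkFilterSketch(ABC):
    """Stand-in mirroring the documented interface (not the library class)."""

    def __init__(self, website: str):
        self.website = website

    @abstractmethod
    def filter(self, urls: Iterable[str]) -> list[str]:
        """Return the URLs that pass the filter."""


class PdfLinkFilter(LinkFilterSketch):
    """Hypothetical filter that drops links pointing to PDF files."""

    def filter(self, urls: Iterable[str]) -> list[str]:
        kept = []
        for url in urls:
            if url.lower().endswith(".pdf"):
                continue  # skip PDF documents
            # Resolve relative links against the website address.
            kept.append(urljoin(self.website, url))
        return kept


pdf_filter = PdfLinkFilter("https://example.com")
filtered = pdf_filter.filter(["/report.pdf", "/about"])
```

With the real library, the same subclass would presumably derive from LinkFilterBase instead of the stand-in.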
get_links(page_source)
staticmethod
Extract all URLs corresponding to current website
Examples:
>>> from extract_emails.link_filters import LinkFilterBase
>>> links = LinkFilterBase.get_links(page_source)
>>> links
["example.com", "/example.com", "https://example2.com"]
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page_source | str | HTML page source | required |
Returns:
Type | Description |
---|---|
list[str] | List of URLs |
Source code in extract_emails/link_filters/link_filter_base.py
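For readers who want to see what this kind of extraction involves, here is a rough sketch that collects href values from anchor tags with the standard library's html.parser. It is an illustration under that assumption, not the library's implementation; collect_hrefs and _HrefCollector are hypothetical names.

```python
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def collect_hrefs(page_source: str) -> list[str]:
    """Return href values in document order, without any filtering."""
    parser = _HrefCollector()
    parser.feed(page_source)
    parser.close()
    return parser.hrefs


html = (
    '<a href="example.com">one</a>'
    '<a href="/example.com">two</a>'
    '<a href="https://example2.com">three</a>'
)
links = collect_hrefs(html)
```

The result mixes relative and absolute links, including links to other sites, which is why a filter is applied afterwards.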
get_website_address(url)
staticmethod
Extract scheme and domain name from a URL
Examples:
>>> from extract_emails.link_filters import LinkFilterBase
>>> website = LinkFilterBase.get_website_address('https://example.com/list?page=134')
>>> website
'https://example.com/'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL for parsing | required |
Returns:
Type | Description |
---|---|
str | scheme and domain name from URL, e.g. https://example.com |
Source code in extract_emails/link_filters/link_filter_base.py
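The same result can be reproduced with urllib.parse from the standard library. This is a sketch of one way to do it, not necessarily how the library does it; website_address is a hypothetical helper name.

```python
from urllib.parse import urlparse


def website_address(url: str) -> str:
    """Return scheme and domain of a URL, with a trailing slash."""
    parsed = urlparse(url)
    # netloc holds the domain (and the port, if one is given).
    return f"{parsed.scheme}://{parsed.netloc}/"


address = website_address("https://example.com/list?page=134")
```

The path and query string are discarded; only the scheme and network location survive.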
DefaultLinkFilter
Bases: LinkFilterBase
Default filter for links
Source code in extract_emails/link_filters/default_link_filter.py
filter(links)
Excludes every URL that starts with neither self.website nor '/'.
As the example shows, the remaining URLs are returned as absolute URLs with duplicates dropped.
Examples:
>>> from extract_emails.link_filters import DefaultLinkFilter
>>> test_urls = ["https://example.com/page1.html","/page.html","/page.html", "https://google.com"]
>>> link_filter = DefaultLinkFilter("https://example.com/")
>>> filtered_urls = link_filter.filter(test_urls)
>>> filtered_urls
["https://example.com/page1.html", "https://example.com/page.html"]
Parameters:
Name | Type | Description | Default |
---|---|---|---|
links | Iterable[str] | List of links for filtering | required |
Returns:
Type | Description |
---|---|
list[str] | List of filtered URLs |
Source code in extract_emails/link_filters/default_link_filter.py
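The behaviour shown in the example above can be approximated in a few lines. The sketch below is an assumption about how such a filter might work, not the library's code; filter_same_site is a hypothetical function name.

```python
from collections.abc import Iterable
from urllib.parse import urljoin


def filter_same_site(website: str, links: Iterable[str]) -> list[str]:
    """Keep links on `website`, make them absolute, drop duplicates."""
    seen = set()
    result = []
    for link in links:
        if link.startswith(website):
            absolute = link
        elif link.startswith("/"):
            # Relative link: resolve it against the website address.
            absolute = urljoin(website, link)
        else:
            continue  # link points to another site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


test_urls = [
    "https://example.com/page1.html",
    "/page.html",
    "/page.html",
    "https://google.com",
]
filtered_urls = filter_same_site("https://example.com/", test_urls)
```

A list plus a seen set keeps the original order while removing duplicates, which a plain set would not guarantee.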
ContactInfoLinkFilter
Bases: LinkFilterBase
Contact information filter for links.
Keeps only the links that might contain contact information.
Examples:
>>> from extract_emails.link_filters import ContactInfoLinkFilter
>>> link_filter = ContactInfoLinkFilter("https://example.com")
>>> filtered_links = link_filter.filter(['/about-us', '/search'])
>>> filtered_links
['https://example.com/about-us']
>>> from extract_emails.link_filters import ContactInfoLinkFilter
>>> link_filter = ContactInfoLinkFilter("https://example.com", use_default=True)
>>> filtered_links = link_filter.filter(['/blog', '/search'])
>>> filtered_links
['https://example.com/blog', 'https://example.com/search']
>>> from extract_emails.link_filters import ContactInfoLinkFilter
>>> link_filter = ContactInfoLinkFilter("https://example.com", use_default=False)
>>> filtered_links = link_filter.filter(['/blog', '/search'])
>>> filtered_links
[]
>>> from extract_emails.link_filters import ContactInfoLinkFilter
>>> link_filter = ContactInfoLinkFilter("https://example.com", contruct_candidates=['search'])
>>> filtered_links = link_filter.filter(['/blog', '/search'])
>>> filtered_links
['https://example.com/search']
Source code in extract_emails/link_filters/contact_link_filter.py
__init__(website, contruct_candidates=None, use_default=False)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
website | str | website address (scheme and domain), e.g. https://example.com | required |
contruct_candidates | list[str] \| None | keywords for filtering the list of URLs, default: see | None |
use_default | bool | if True and no contact-info URLs are found, return filtered_urls instead of an empty list | False |
Source code in extract_emails/link_filters/contact_link_filter.py
filter(urls)
Filter out the links without keywords
Parameters:
Name | Type | Description | Default |
---|---|---|---|
urls | Iterable[str] | List of URLs for filtering | required |
Returns:
Type | Description |
---|---|
list[str] | List of filtered URLs |
Source code in extract_emails/link_filters/contact_link_filter.py