Email Extraction and Categorization

Scenario: We have a large block of text from various sources, and it contains several email addresses. We want to extract these emails and categorize them by their domain to see which are the most common domains in the text.

Approach:

1. Use a regular expression to find all email addresses in the text.
2. Use a dictionary to count occurrences by domain.
3. Sort and display the results.

Here’s the detailed code:

import re
from collections import defaultdict

# Sample text containing various email addresses
text = """
Contact us at support@example.com for further inquiries.
Alternatively, reach out to the helpdesk at help@example.org or sales@example.com.
Our team in Germany can be reached at kontakt@beispiel.de for more specialized assistance.
For collaborations, contact our team at partnerships@example.co.uk.
Visit us at www.example.com or follow us on contact@example.com.
"""

# Regular expression to extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Find all matches in the text
emails = re.findall(email_pattern, text)

# Dictionary to count occurrences by domain
domain_count = defaultdict(int)

# Process each email found
for email in emails:
    # Split the email on '@' and get the domain part
    domain = email.split('@')[1]
    # Increment the count for this domain
    domain_count[domain] += 1

# Sort domains by count in descending order
sorted_domains = sorted(domain_count.items(), key=lambda item: item[1], reverse=True)

# Display the results
for domain, count in sorted_domains:
    print(f"{domain}: {count}")
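
The three steps above can also be packaged as a reusable function. The following is a minimal sketch, not part of the original lesson: the name count_email_domains is hypothetical, collections.Counter is swapped in for defaultdict, and lowercasing the domain (so that Example.com and example.com are counted together) is an added assumption.

```python
import re
from collections import Counter

# Same pattern as in the lesson, compiled once so it can be reused.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')


def count_email_domains(text):
    """Return (domain, count) pairs, most frequent domain first.

    Hypothetical helper: domains are lowercased before counting,
    which the original script does not do.
    """
    domains = (
        email.rsplit('@', 1)[1].lower()
        for email in EMAIL_PATTERN.findall(text)
    )
    # most_common() sorts by count, highest first.
    return Counter(domains).most_common()


sample = "Write to a@Example.com, b@example.com or c@example.org."
for domain, count in count_email_domains(sample):
    print(f"{domain}: {count}")
# example.com: 2
# example.org: 1
```

Counter.most_common() replaces the explicit sorted() call with a lambda; whether that is preferable to the defaultdict version is mostly a matter of taste.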

Code Explanation:

Import Necessary Modules:
- re for regular expressions to match email patterns.
- defaultdict from collections to facilitate counting without initializing keys.
Define Sample Text:
- A multiline string text simulates a realistic scenario where emails are embedded in text.
Regular Expression for Email Extraction:
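- email_pattern matches a local part (letters, digits, and the characters . _ % + -), then an @ sign, then a domain made of letters, digits, dots, and hyphens, and finally a dot followed by a top-level domain of at least two letters. The \b anchors keep a match from starting or ending in the middle of a word. The pattern is a practical approximation, not a validator for every address the email standards allow.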
Finding Emails:
- re.findall() searches the text for all non-overlapping occurrences of the pattern. It returns a list of email addresses found in the text.
Counting Domain Occurrences:
- The loop splits each address at the @ and takes the part after it as the domain. domain_count is a defaultdict(int), so the first lookup of an unseen domain creates the key with a value of 0, and += 1 works without checking whether the key already exists.
Sorting and Displaying Results:
- Domains are sorted by count in descending order using sorted() with a lambda function as the key. The lambda returns the second element of each (domain, count) tuple, so the count decides the order. Because sorted() is stable, domains with equal counts stay in the order in which they were first seen.
- Finally, the sorted domains and their counts are printed.
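
To make the pieces of the regular expression easier to see, the same pattern can be written with re.VERBOSE, which allows whitespace and comments inside the pattern. This is an illustrative sketch only: the variable name email_regex and the test sentence are assumptions, not part of the lesson's code.

```python
import re

# The lesson's pattern, spread over several lines with one comment per part.
email_regex = re.compile(r"""
    \b                   # start at a word boundary
    [A-Za-z0-9._%+-]+    # local part: letters, digits and . _ % + -
    @                    # literal separator
    [A-Za-z0-9.-]+       # domain labels: letters, digits, dots, hyphens
    \.                   # dot before the top-level domain
    [A-Za-z]{2,}         # top-level domain: at least two letters
    \b                   # end at a word boundary
""", re.VERBOSE)

# A bare host name has no '@', so only the address is returned.
print(email_regex.findall("Mail sales@example.co.uk, not www.example.com."))
# ['sales@example.co.uk']
```

Note that the trailing comma after the address is not part of the match, because the pattern has to end on letters followed by a word boundary.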
