Email Extraction and Categorization

Scenario: We have a large block of text gathered from various sources, and it contains several email addresses. We want to extract these addresses and group them by domain to see which domains appear most often in the text.

Approach:

  1. Use a regular expression to find all email addresses in the text.
  2. Use a dictionary to count occurrences by domain.
  3. Sort and display the results.

Here’s the detailed code:

import re
from collections import defaultdict

# Sample text containing various email addresses
text = """
Contact us at support@example.com for further inquiries.
Alternatively, reach out to the helpdesk at help@example.org or sales@example.com.
Our team in Germany can be reached at kontakt@beispiel.de for more specialized assistance.
For collaborations, contact our team at partnerships@example.co.uk.
Visit us at www.example.com or follow us on contact@example.com.
"""

# Regular expression to extract all email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Find all matches in the text
emails = re.findall(email_pattern, text)

# Dictionary to count occurrences by domain
domain_count = defaultdict(int)

# Process each email found
for email in emails:
    # Split the email on '@' and get the domain part
    domain = email.split('@')[1]
    # Increment the count for this domain
    domain_count[domain] += 1

# Sort domains by count in descending order
sorted_domains = sorted(domain_count.items(), key=lambda item: item[1], reverse=True)

# Display the results
for domain, count in sorted_domains:
    print(f"{domain}: {count}")
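
As a variation on the counting and sorting steps above, the same tally can be sketched with collections.Counter, whose most_common() method returns (element, count) pairs ordered from most to least frequent. The helper name count_domains, the sample string, and the choice to lowercase each domain (so that Example.COM and example.com are counted together) are assumptions added for illustration; they are not part of the original listing.

```python
import re
from collections import Counter

# Pattern from the listing above (top-level domain class: letters only)
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')


def count_domains(text):
    """Hypothetical helper: return (domain, count) pairs, most common first."""
    # Assumption: domains are lowercased because domain names are case-insensitive
    domains = (
        email.rsplit('@', 1)[1].lower()
        for email in EMAIL_PATTERN.findall(text)
    )
    # most_common() orders by count, highest first
    return Counter(domains).most_common()


# Illustrative input, not taken from the original sample text
sample = "Mail Ann@Example.COM or bob@example.com; sales is at info@shop.example.org."
print(count_domains(sample))
# [('example.com', 2), ('shop.example.org', 1)]
```

Whether to normalize case depends on the data; without the .lower() call, Example.COM and example.com would be reported as two separate domains.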

Code Explanation:

  1. Import Necessary Modules:
    • re for regular expressions to match email patterns.
    • defaultdict from collections to facilitate counting without initializing keys.
  2. Define Sample Text:
    • A multiline string text simulates a realistic scenario where emails are embedded in text.
  3. Regular Expression for Email Extraction:
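    • email_pattern matches, in order: a local part (one or more letters, digits, or the symbols . _ % + -), an @, a domain made of letters, digits, dots, and hyphens, and finally a dot followed by a top-level domain of at least two letters. The \b anchors at both ends keep a match from starting or ending in the middle of a word. The pattern is a practical approximation, not a full validator of email syntax.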
  4. Finding Emails:
    • re.findall() searches the text for all non-overlapping occurrences of the pattern. Because the pattern contains no capturing groups, it returns a list of the full matched addresses, in the order they appear in the text.
  5. Counting Domain Occurrences:
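    • In a loop, each email address is split on '@' into its local part and its domain. The domain is used as the key whose count is incremented in domain_count. Because domain_count is a defaultdict(int), a key that does not exist yet is created with the value int(), which is 0, so no explicit check or initialization is needed.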
  6. Sorting and Displaying Results:
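    • Domains are sorted by their occurrence counts in descending order using sorted() with a lambda function as the key. The lambda function specifies that sorting should be based on the second element of each (domain, count) tuple, the count. Since sorted() is stable, domains with equal counts keep the order in which they were first seen.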
    • Finally, the sorted domains and their counts are printed.
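
To make the behavior of the pattern described in step 3 more concrete, the following sketch runs it against a handful of probe strings. The probe strings are hypothetical and chosen so that each one exercises a single part of the pattern; the pattern is written here with a letters-only top-level domain class.

```python
import re

# Pattern from the listing above (top-level domain class: letters only)
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Hypothetical probe strings, one per behavior of interest
probes = [
    "Write to sales@example.com.",             # trailing period is left out of the match
    "Team: partnerships@example.co.uk",        # domain with several dot-separated labels
    "News: first.last+news@mail.example.org",  # dots and '+' in the local part
    "Visit www.example.com",                   # no '@', so nothing matches
    "Broken: user@host.c",                     # top-level domain shorter than 2 letters
]

# Map each probe string to the list of addresses found in it
results = {probe: re.findall(email_pattern, probe) for probe in probes}

for probe, found in results.items():
    print(f"{probe!r} -> {found}")
```

The first three probes each yield one address, while the last two yield an empty list. Probing a pattern this way is a quick check before running it over a large body of text.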