How Do You Use Python to Process (and Thereby Clean) a Flat File That Is a List of Email Addresses with Extraneous Symbols?

Problem scenario
You have an input file (.txt) that is a list of email addresses.  The addresses are guaranteed to be separated by one or more spaces or by a new line.  Beyond that, each email address may be followed by one or two commas, or by one or two semicolons.  A single address is never followed by a mix of the two punctuation marks.  That is, you will not see both a comma and a semicolon after the same email address.

You want to write a Python program to produce a clean output with commas between the email addresses.  You want to eliminate duplicate email addresses, duplicate commas, and all semicolons.  You want to have the email addresses in all lowercase characters.  How do you use Python to manipulate (or process) a .txt file that has a list of email addresses and produce a new file (as output) that is a clean .csv?
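To make the cleaning rules concrete, here is a minimal sketch of what should happen to a single entry from the input file.  The function name normalize and the sample values are hypothetical and used only for illustration.  It assumes, as stated above, that the unwanted punctuation only ever trails the address.

```python
def normalize(token):
    """Strip trailing commas/semicolons from one token and lowercase it."""
    return token.rstrip(",;").lower()


# Hypothetical sample entries, one per punctuation case described above.
samples = ["Basic@Basic.com;;", "fun@fun.com,,", "try@try.com,", "cool@cool.com"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
# ['basic@basic.com', 'fun@fun.com', 'try@try.com', 'cool@cool.com']
```

Normalizing every entry this way before comparing entries matters, because "try@try.com," and "try@try.com" should count as the same address when duplicates are removed.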

Solution
Use the following file, csvmaker.py.  Follow the usage instructions in the comments at the top.

# Written by continualintegration.com.  
# Usage instructions and requirements.
# This Python program produces a clean .csv file of email addresses.  It requires an input file named inputfile.txt.
# inputfile.txt must be in the same directory as this Python program.
# inputfile.txt can have one or more email addresses on each line, each with or without a trailing semicolon or comma (but never a mix of those punctuation marks after one address).
# The output file will be final.csv in the same directory as this .py file and inputfile.txt.
# Call this file "csvmaker.py".  Run it like this: "python csvmaker.py"
"""
To create an input file to demonstrate this program's features, create inputfile.txt with this as the content:

cool@cool.com
basic@basic.com;; another@another.com
good@good.com, whatever@whatever.com
fun@fun.com,,
try@try.com,

something@something.com;
example@example.com
"""
emails = set()  # a set keeps each address only once
with open('inputfile.txt', 'r') as infile:
    for line in infile:
        for word in line.split():  # split the line on whitespace
            if '@' in word:  # keep only the words that look like email addresses
                # Strip trailing commas/semicolons and lowercase BEFORE deduplicating,
                # so that "A@b.com," and "a@b.com" count as the same address.
                emails.add(word.rstrip(',;').lower())
with open('final.csv', 'w') as outfile:
    # Alphabetize the addresses and put a single comma between them.
    outfile.write(','.join(sorted(emails)) + '\n')
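
After running csvmaker.py, a quick check of the output can confirm that it meets the goals from the problem scenario (no duplicates, no semicolons, all lowercase).  The following is a rough sketch, not part of the original program.  The helper names load_addresses and find_problems are hypothetical, and the sketch assumes the output is a single comma-separated file such as final.csv.

```python
import csv


def load_addresses(path):
    """Return the addresses stored in a comma-separated file."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    # Flatten the rows and drop empty fields (for example, from a trailing comma).
    return [field.strip() for row in rows for field in row if field.strip()]


def find_problems(addresses):
    """Return a list of problems found; an empty list means the data looks clean."""
    problems = []
    if len(addresses) != len(set(addresses)):
        problems.append("duplicate addresses")
    if any(a != a.lower() for a in addresses):
        problems.append("uppercase characters")
    if any(";" in a for a in addresses):
        problems.append("semicolons")
    return problems


# Example usage (assumes final.csv exists in the current directory):
# print(find_problems(load_addresses("final.csv")) or "final.csv looks clean")
```

An empty list from find_problems means none of the three checks failed.  It does not validate that each entry is a well-formed email address.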
