Determining Someone’s Name From Their Email Address
This post has nothing to do with our awesome web performance products, but is about how I used software and some public data to solve an interesting problem, and how that helped improve how Zoompf sells our product.
Hundreds of people a week use our free performance report. Our sales team sends these people a personalized email thanking them and asking if they have any questions about the report. We don't want to spam people, so I look for ways to use engineering to help the sales team send the best emails possible.
I know that personalized emails have higher open and conversion rates, so it would help if I collected someone's name when they run a free report. That is certainly something I can do. However, I also know that conversion rates drop when a form has too many fields. This means adding a name field could hurt our sales effort more than it helps if fewer people run free reports.
Is there a way that I could help sales send personalized emails without having to add a form field? What if I could extract a person's name from the email address programmatically? Specifically, I wanted to determine someone's first name for a given email address. This is the story of how I did that.
Crazy addresses and common formats
Let's frame the problem. While people have some crazy email aliases, I am primarily interested in business-related email addresses. These tend to take on a form like one of these:
Of course, I can't just look for these formats. Someone could have an email address like
email@example.com but chances are their first name isn't Mr! Matching a specific format alone isn't going to help me. I needed some way to know if a string of letters is a valid first name. Really what I needed was a list of valid first names to compare against.
Building a list of valid first names
You know all those baby name websites? They tell you how popular names are from different years, going back decades? Ever wonder how they know how popular the name Agatha was in 1912? Well, in the US, the Social Security Administration has published data about the first names of people born in every year since 1880. I used this as my starting point to build my list of first names. The US is a pretty good melting pot, so this list actually includes a large number of names you might consider to be associated with other countries.
Next, I wrote a little program to keep a running total of all names, and the number of people born with each name, across all years since 1945. Basically, how many people were born named Carol or Frank? I chose 1945 as an arbitrary cutoff, based on my belief that baby boomers and later generations are the ones likely to have an email address. I created a file with all of the names and the count of people with each name, sorted by count. Fun fact: just over 4 million people named Michael have been born in the US in the last 70 years.
I only use the top 15,000 or so most popular names in my list. The last name in my list is "Karrigan", of whom there are 407 US citizens roaming the world. I limited my list to avoid false positives. Someone could have a really obscure name that I won't find, but chances are I'm not emailing with 1 of the 5 people named "Leeric" ever born.
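A minimal Python sketch of this aggregation step might look like the following. It is not the original program: it assumes each yearly data file holds lines of the form `name,sex,count`, and the function names and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def aggregate_counts(rows_by_year: Dict[int, Iterable[str]],
                     first_year: int = 1945) -> Counter:
    """Sum births per name across all years >= first_year."""
    totals: Counter = Counter()
    for year, rows in rows_by_year.items():
        if year < first_year:
            continue  # ignore years before the cutoff
        for row in rows:
            row = row.strip()
            if not row:
                continue
            # Assumed line format: name,sex,count
            name, _sex, count = row.split(",")
            totals[name.lower()] += int(count)
    return totals


def top_names(totals: Counter, limit: int = 15000) -> List[Tuple[str, int]]:
    """Return the `limit` most common names, most popular first."""
    return totals.most_common(limit)


# Made-up sample rows, for illustration only.
sample = {
    1940: ["Carol,F,100"],
    1950: ["Carol,F,30", "Frank,M,20", "Leslie,F,5", "Leslie,M,4"],
    1960: ["Carol,F,10", "Frank,M,25"],
}
totals = aggregate_counts(sample)
ranked = top_names(totals, limit=2)
```

Note that this sketch adds together the counts for names given to both boys and girls, which is one reasonable choice when all you want is a greeting.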
Extracting names from email addresses
With my list of common first names, I was ready to begin. I narrowed my scope to extracting the first name from emails that followed these formats:
Luckily, we only need to look at the local part of the email, which is everything in front of the @ symbol.
Extracting a first name from format 1 is very easy. Take the entire local part of the email, and check it against our list of names. If we have a match, we have a first name.
Formats 2 and 3 are also fairly easy. We look to see if the local part has an obvious separator like
_. While other valid email characters could conceivably be used we err on the side of being safe.
If possible, we split the local part into 2 parts, so
billy. Next, we check each part against our list of first names. If we get a match, we have the first name.
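The whole-local-part check and the separator split can be sketched in Python roughly as follows. This is an illustration under assumptions, not the original code: the separator set (`.`, `_`, `-`), the function name, and the tiny name set are all hypothetical.

```python
import re
from typing import Optional, Set

# Assumed set of "obvious" separators; adjust to whatever you consider safe.
SEPARATORS = re.compile(r"[._-]")


def first_name_from_local_part(email: str, names: Set[str]) -> Optional[str]:
    """Return a known first name found in the local part, or None."""
    # The local part is everything in front of the @ symbol.
    local = email.split("@", 1)[0].lower()

    # Whole local part is a name.
    if local in names:
        return local

    # Otherwise, split on a separator and only accept exactly two parts.
    parts = [p for p in SEPARATORS.split(local) if p]
    if len(parts) == 2:
        for part in parts:
            if part in names:
                return part
    return None


# Hypothetical, tiny name list for demonstration.
known_names = {"billy", "carol", "frank"}
```

Anything that does not split cleanly into two parts returns `None` here, in keeping with erring on the safe side.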
The Samantha Cassidy Problem
This approach works pretty well, but we have what I like to call the Samantha Cassidy problem. Sam is a friend of mine who just happens to have a last name that is also a female first name. What happens if Sam's email address was
firstname.lastname@example.org? Our algorithm might think her name is "Cassidy", which, knowing Sam, she would be annoyed by. What can we do?
Luckily, we can use the popularity of a name to help determine the order. In this case, "Samantha" is the 79th most common name, with a hair over 500,000 people with that name. "Cassidy" is decidedly less popular, at number 747 on my list with only 54,917 people. If both segments of the local part are common first names, I choose the more popular of the two.
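As a rough sketch, the popularity tie-break could be written in Python like this. The function name is hypothetical, and the Samantha count is only approximate (the exact figure is not given above); the Cassidy count is the one quoted above.

```python
from typing import Dict, List, Optional


def pick_first_name(parts: List[str], counts: Dict[str, int]) -> Optional[str]:
    """Among the parts that are known first names, return the most popular."""
    candidates = [p for p in parts if p in counts]
    if not candidates:
        return None
    return max(candidates, key=lambda name: counts[name])


counts = {
    "samantha": 500000,  # approximate: "a hair over 500,000"
    "cassidy": 54917,
}
```

With these counts, both `["samantha", "cassidy"]` and `["cassidy", "samantha"]` resolve to "samantha", so the order of the segments in the address does not matter.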
The giant run-on email address
What about email format #4, where the first and last name are concatenated together like
email@example.com? Well, we can do a brute force match. We could try all 15,000 names in our list and see if an email address begins with a name.
This sounds like a good idea. Indeed, it will extract "Billy" from
firstname.lastname@example.org. But what about something like
JimboJambo55@example.com? We would extract the name "Jim", which may or may not be correct. Or
SarabandingSphinx@example.com. "Sara" is a name, but "saraband" is an actual English word.
To avoid these cases, I only brute force match with first names that are 5 or more characters.
Another challenge is the length of the match. Consider the email
email@example.com. "Chris" would match, but so would "Christopher". In this case, I try every name in my list that is 5 or more characters long, and use the longest name that matches.
While the person who uses
firstname.lastname@example.org might be called "Chris" by his friends, I would rather err on the side of being correct than say the wrong name entirely.
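The brute-force prefix match, with the minimum length and the longest-match rule, can be sketched in Python as follows. The function name, the sample local parts, and the small name list are assumptions for illustration.

```python
from typing import Iterable, Optional

MIN_PREFIX_LEN = 5  # skip short names like "jim" or "sara" to avoid false positives


def longest_prefix_name(local_part: str, names: Iterable[str],
                        min_len: int = MIN_PREFIX_LEN) -> Optional[str]:
    """Return the longest known name (>= min_len chars) that the local part starts with."""
    local = local_part.lower()
    best: Optional[str] = None
    for name in names:
        if len(name) >= min_len and local.startswith(name):
            if best is None or len(name) > len(best):
                best = name
    return best


# Hypothetical, tiny name list for demonstration.
known_names = {"jim", "sara", "chris", "christopher", "billy"}
```

A linear scan over roughly 15,000 names is cheap enough per address that nothing cleverer, such as a trie, is needed for a sketch like this.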
Extracting last names?
Since I've found a way to determine first names from email addresses, should I also try to extract last names? You can, and I did. I spent an afternoon downloading the Death Master File (seriously, that's what it's called) of everyone who has died in the US since 1962. I then parsed out last names, aggregated them, and did a similar style of lookup. While that was fun from an engineering perspective, it wasn't very useful. We never use last names in our sales outreach. Determining a valid last name really just helped us confirm we had found someone's first name, because the only remaining part of their address was a last name. This improved our accuracy in some cases, but just wasn’t worth the time.
Using data from the Social Security Administration, and the algorithms described above, I am able to reliably determine the first name of a user from the email address. When run against our entire database of free report users, I was able to determine a first name for around 40% of all emails! Below is a redacted snippet of the report we generate of new free scan users:
If you found this topic interesting, you should see what we can do when we focus our time and energy on auditing website performance! Our free performance report will examine your site for over 400 performance issues. You can also stay on top of your website performance by joining the free Zoompf Alerts beta to automatically scan your website every day and get alerted when we detect one of the common causes of slow website performance. Either way, you better believe that if we can figure out your first name from your email address, you’ll get a personal greeting from us!