Sunday, February 09, 2014

Initials data, part III

     Yes, after more than two years, I'm working on this project again.

     I've used data regarding first names together with data regarding last names to put together the following chart:



     Five combinations account for at least 1% of all Americans each: JB, JC, JH, JM, and JS. JM is the most common set of initials; 1 out of every 77 Americans should have it. Toward the other extreme, many combinations are rarer than 1 in every 20,000 Americans (represented above by very pale gray). The least common set of initials, XX, is rarer than 1 in 68 million. Chances are there are only four or five XX's in the United States.
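
     The "four or five" figure is just a rate multiplied by a population. Here is a minimal sketch of that arithmetic. The population value is an assumption (roughly the 2014 US figure) and does not come from the data behind the chart, and `expected_count` is a made-up helper name. Since XX is rarer than 1 in 68 million, the result for XX is an upper estimate.

```python
# Assumed US population (roughly the 2014 figure); not part of the original data.
US_POPULATION = 317_000_000


def expected_count(one_in_n, population=US_POPULATION):
    """Expected number of people whose initials occur in 1 out of every one_in_n people."""
    return population / one_in_n


# JM: 1 out of every 77 Americans.
print(round(expected_count(77)))              # about 4.1 million

# XX: rarer than 1 in 68 million, so this is an upper estimate.
print(round(expected_count(68_000_000), 1))   # about 4.7
```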

     We can see that first names seem to have a stronger impact than last names. All the combinations with frequencies greater than 1% have J as the first initial. Also, three columns are made up only of combinations with frequencies less than 0.005% (Q, U, and X first initials), while only one row is (X last initials).

     What color group are your initials in? Mine are yellow.

     Now for the technicalities. First of all (not that this is a bad thing), the first name data I used here reflect everyone living in the US in 1990, as opposed to individuals born in a single year, as in my previous first name post.

     Second, what I said in the previous initials posts about extrapolating still holds: I only had data on first names for about 90% of the population, and only had data on last names for about 79.5% of the population. I assumed the remaining parts of the population upheld the frequencies calculated for the population I did have data on. I think the possibility that this is false is somewhat greater in the case of last names.

     Third, the population has obviously changed since 1990, but that's the year from which the most data are available.

     Fourth, and crucially, the calculations and statements I made about the frequencies of combinations make a huge assumption: that the frequencies of first names are independent of the frequencies of last names. That is, I simply assumed that if letter 1 accounts for 10% of all first initials, and letter 2 accounts for 10% of all last initials, then the combination letter 1 letter 2 accounts for 1% (that's 10% of 10%) of the whole population.
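
     To make the independence assumption concrete, here is a minimal sketch of how a table of combination frequencies can be built from two lists of initial frequencies. The shares below are made-up numbers that mirror the 10% example, not the real 1990 figures, and `joint_frequencies` is a hypothetical helper name.

```python
def joint_frequencies(first_shares, last_shares):
    """Combine first- and last-initial shares into pair shares.

    Assumes the first initial and the last initial are independent,
    so each pair's share is simply the product of the two shares.
    """
    return {
        first + last: p_first * p_last
        for first, p_first in first_shares.items()
        for last, p_last in last_shares.items()
    }


# Made-up shares for illustration only (not the 1990 data).
first_shares = {"A": 0.10, "B": 0.02}
last_shares = {"A": 0.05, "B": 0.10}

table = joint_frequencies(first_shares, last_shares)
print(f"{table['AB']:.2%}")  # 10% of 10% -> 1.00%
```

     If alliterative names really are preferred, the true shares of pairs like AA and BB would come out higher than the products computed this way.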

     But is that true? That's actually the entire point of this project: to answer that question. Are alliterative names (or identical initials) more common than we'd expect by chance? Do parents prefer such names, and if so, to what extent? Are comic books lying to us? Is there enough room in this country for two Xavier Xiongs? Stay tuned to find out.

Bonus: the most common last names in the US (in 1990) beginning with each letter:
Anderson
Brown
Clark
Davis
Evans
Flores
Garcia
Harris
Ingram
Johnson
King
Lewis
Miller
Nelson
Owens
Perez
Quinn
Robinson
Smith
Taylor
Underwood
Vasquez
Williams
Xiong
Young
Zimmerman

Edit: Here are the same data presented slightly differently. The most common initials are colored the darkest red, the least common initials are the darkest blue, and the initials with median frequency are white: