Sunday, February 09, 2014

Initials data, part III

     Yes, after more than two years, I'm working on this project again.

     I've used data regarding first names together with data regarding last names to put together the following chart:



     Five combinations account for at least 1% of all Americans each: JB, JC, JH, JM, and JS. JM is the most common set of initials; 1 out of every 77 Americans should have it. Toward the other extreme, many combinations are rarer than 1 in every 20,000 Americans (represented above by very pale gray). The least common set of initials, XX, is rarer than 1 in 68 million. Chances are there are only whour or figh... I mean, four or five XX's in the United States.

     We can see that first names seem to have a stronger impact than last names. All the combinations with frequencies greater than 1% have J as the first initial. Also, three columns are made up only of combinations with frequencies less than .005% (Q, U, and X first initials), while only one row is (X last initials).

     What color group are your initials in? Mine are yellow.

     Now for the technicalities. First of all (not that this is a bad thing), the first name data I used here reflect everyone living in the US in 1990, as opposed to individuals born in a single year, like in my previous first name post.

     Second, what I said in the previous initials posts about extrapolating still holds: I only had data on first names for about 90% of the population, and only had data on last names for about 79.5% of the population. I assumed the remaining parts of the population upheld the frequencies calculated for the population I did have data on. I think the possibility that this is false is somewhat greater in the case of last names.

     Third, the population has obviously changed since 1990, but that's the year from which the most data are available.

     Fourth, and crucially, the calculations and statements I made about the frequencies of combinations make a huge assumption: that the frequencies of first names are independent of the frequencies of last names. That is, I simply assumed that if letter 1 accounts for 10% of all first initials, and letter 2 accounts for 10% of all last initials, then the combination letter 1 letter 2 accounts for 1% (that's 10% of 10%) of the whole population.
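     To make that assumption concrete, here is a minimal Python sketch of the calculation. The function names and the two-letter toy shares are made up for illustration; they are not the real 1990 figures, and this is not necessarily how the chart itself was produced.

```python
def joint_frequency(first_initial_share, last_initial_share):
    """Expected share of one initials pair, ASSUMING first and last
    initials are statistically independent (the assumption above)."""
    return first_initial_share * last_initial_share


def initials_table(first_shares, last_shares):
    """Build {initials: expected share} for every first/last letter pair."""
    return {
        first + last: joint_frequency(p_first, p_last)
        for first, p_first in first_shares.items()
        for last, p_last in last_shares.items()
    }


# Hypothetical toy shares (NOT real data): each set must sum to 1.
first_shares = {"A": 0.10, "B": 0.90}
last_shares = {"C": 0.10, "D": 0.90}

table = initials_table(first_shares, last_shares)
# 10% of first initials times 10% of last initials gives about 1%.
print(f"AC: {table['AC']:.2%}")
```

     If the independence assumption is wrong, the real share of a pair will differ from this product, which is exactly what the rest of the project sets out to measure.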

     But is that true? That's actually the entire point of this project: to answer that question. Are alliterative names (or identical initials) more common than we'd expect by chance? Do parents prefer such names, and if so, to what extent? Are comic books lying to us? Is there enough room in this country for two Xavier Xiongs? Stay tuned to find out.

Bonus: the most common last names in the US (in 1990) beginning with each letter:
Anderson
Brown
Clark
Davis
Evans
Flores
Garcia
Harris
Ingram
Johnson
King
Lewis
Miller
Nelson
Owens
Perez
Quinn
Robinson
Smith
Taylor
Underwood
Vasquez
Williams
Xiong
Young
Zimmerman

Edit: Here is the same data presented slightly differently. The most common initials are colored the darkest red, the least common initials are the darkest blue, and the initials with median frequency are white:


Tuesday, December 10, 2013

This post will thench you

The other day I was looking at the etymology of drink when I was reminded that the Proto-Germanic language, an ancestor of English, had a morphological causative. Quick explanation for non-linguists: a causative construction is one that means "cause to do something" or "make do something". Usually in English, we just express it like that: "make X" or "cause to X". Sometimes we have entirely separate words, like kill, which means "cause to die". But other languages have a morphological causative, meaning they get that causative meaning across by adding something to a verb or changing the verb slightly.

Why is this at all interesting? Well, modern English doesn't have a morphological causative. But it has kept at least one set of distinct descendants from both a Proto-Germanic verb and its causative form: drink and drench. That is, drench originally meant "cause someone to drink". Isn't that cool? I think that's cool.

But the part I like best is that, because sound changes are regular, there isn't really any reason why you shouldn't see the same pattern with other verbs that rhyme with drink, as long as they come from verbs that existed when this process worked.

Let's take sink. Making an analogy with drink ~ drench, we get sench, "cause to sink". I was happy to see that this one actually existed (thanks, Wiktionary!). Why not use it the next time you play Battleship? Notice how English got rid of a perfectly good word and made sink serve double duty.

Next! How about stink? Applying the pattern, we get stench. Interestingly, this one already exists as a noun, but why should we let that stop us? I guess this could be useful, maybe when talking about skunks.

Thench is (or would be) a great word, even though think has a complicated word history that means thench never could have existed. You could use it for a non-particular kind of thinking, with the same meaning as "be thought provoking", or for a particular thought, like X thenched me that...

And shrink gives us shrench! Wouldn't shrench be the best? Shrink is like sink in that it plays double duty, meaning "become smaller" and "cause to become smaller". A clothes dryer shrenches things. So does a wizard.

There are only a few more to consider. Slink ~ slench and wink ~ wench sound cool, but I can't really think of any contexts to use them in. Blink ~ blench could be better; you could blench your Christmas lights, but if they're too bright, they might blench you.
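For anyone who wants to generate these in bulk, the analogy can be written as a simple string substitution. This is only a spelling-level sketch in Python: it swaps the ending -ink for -ench and does not model the actual historical sound changes, and the function name is made up for illustration.

```python
def causative_by_analogy(verb):
    """Form a mock causative on the pattern drink ~ drench.

    Purely orthographic: replaces a final "ink" with "ench".
    """
    if not verb.endswith("ink"):
        raise ValueError(f"{verb!r} does not rhyme with 'drink'")
    return verb[:-3] + "ench"


# The verbs discussed in this post.
verbs = ["drink", "sink", "stink", "think", "shrink", "slink", "wink", "blink"]
pairs = {verb: causative_by_analogy(verb) for verb in verbs}

for verb, causative in pairs.items():
    print(f"{verb} ~ {causative}")
```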

Wednesday, August 28, 2013

Noch ein Spracheprogramm

I promise I do think about other things, but here's another language program. This time it's German, and focuses on adjective endings and articles for different cases and genders. I think it's helped me quite a bit.

Saturday, August 24, 2013

Nouns! In Navajo!

Here's another program I've been working on for a long time. It's very simple and plain; I use it myself to keep my Navajo vocabulary from getting too rusty. Maybe someone else can get some use from it. If so, I should probably come up with a more interesting name than Navajo nouns.

Friday, August 16, 2013

U redu

Despite not being an expert on the language, I made a program to teach and practice elementary Croatian. It's called U Redu. Get it here.

Wednesday, May 08, 2013

Bucking trends

     After last November's election, I heard more than one TV pundit remark upon a certain correlation: densely-populated areas tend to vote, in modern US presidential elections, for the Democratic candidate, and less densely-populated areas tend to vote for the Republican. That this seemed to be news to them puzzled me, but it did get me thinking: what are the counterexamples? Where are the rural liberal places, and where are the urban conservative places?

     All it took was election returns by county, census data on population density, and R. I ended up with this plot:

The ringed points on the plot represent the most positive and most negative residuals, i.e. the data with the greatest difference from the line of best fit.

     Obama's vote share is on the y-axis, and the logarithm of population density is on the x-axis (because adding, say, 100 people to a county could have a huge political effect if the county only has 100 people to begin with; adding 100 people to Los Angeles County, on the other hand, wouldn't change much).
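     The fit behind a plot like this can be sketched in a few lines. I did the real thing in R; the Python below is just an illustrative stand-in under stated assumptions: an ordinary least-squares line of vote share against log10 of density, with made-up numbers in place of the actual county data, and hypothetical function names.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def density_residuals(density, share):
    """Residuals of vote share (in percentage points) from the line
    fitted against log10(population density)."""
    xs = [math.log10(d) for d in density]
    slope, intercept = fit_line(xs, share)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, share)]
    return slope, intercept, resid


# Made-up illustrative values (NOT real election or census data).
density = [1, 10, 100, 1000, 10000]   # people per square mile
share = [30, 35, 80, 45, 60]          # vote share, percent

slope, intercept, resid = density_residuals(density, share)
most_positive = resid.index(max(resid))  # far above the line
most_negative = resid.index(min(resid))  # far below the line
print(slope, intercept, most_positive, most_negative)
```

     A positive residual means the county voted more Democratic than its density alone would predict; a negative one means it voted less so.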

     Clearly, the correlation we assumed holds up, although it's not as strong as I thought it'd be; as density increases, Obama's vote share increases. But what of the outliers, those data points down and to the right, or up and to the left?

     The red points are those where Obama's share was more than 26 percentage points lower than the model predicts; for the blue points, Obama's share was more than 34 percentage points higher than expected (those cutoff points aren't really significant). Here are the locations of those counties:



     The greatest outlier on the conservative side is Utah County, Utah, just south of Salt Lake City. It contains Provo, Utah's third largest city. The greatest left-leaning outlier is Shannon County, South Dakota, which lies within the Pine Ridge Indian Reservation. Over half of its population is below the poverty line.

     There are some obvious geographical patterns. Interestingly, the Deep South has outliers in both directions, located very near each other. A range of sparsely-populated, Democratic-leaning counties stretches from southwest Colorado to southern Texas, while there's a cluster of relatively dense conservative counties in northern Utah and southern Idaho. There's also a group of rural, blue counties in the Dakotas.

     It might not be a surprise that race is an important factor. Most of the red counties are 90% or more white; all are at least 80% white. The westernmost group is in the heart of Mormon country, so Mitt Romney's candidacy may have made these counties vote even more Republican than usual (I only used 2012 presidential returns). Other red areas include Randall County, Texas, which skirts the southern edge of Amarillo; Montgomery County, Texas, containing northern suburbs of Houston; Livingston Parish, Louisiana, between Baton Rouge and New Orleans; and Leslie County, Kentucky, a mountainous, coal-mining county with a community named "Hell For Certain".

     The blue areas in Montana, North and South Dakota, Wisconsin, and Arizona are majority-American Indian. The blue Texas counties and most of the blue counties in New Mexico are majority Hispanic/Latino, and the blue counties in the Southeast and Maryland are majority-black. Hawaiʻi County, Hawaii, on the other hand, is very diverse and doesn't seem to be majority-anything, although Obama's Hawaiian childhood may have played a part here. Finally, Windham County, Vermont, and two of the three blue counties in Colorado are majority-white (non-Hispanic). Who are these rural white liberals? Vermont hippies, maybe, but I'm not sure about Colorado. San Miguel County, in particular, had the greatest residual of any of the majority-white counties. It contains the town of Telluride, which I know nothing about, except that it's apparently an excellent ski resort. It's interesting that it's right across the border from conservative Utah.

     So, race probably accounts for much of the aberration from the trend. On the other hand, we could say that the correlation is only "normal" for the Midwest and the coasts, and reflects the perspective of the people who live there, not everywhere in the United States.

Friday, September 09, 2011

Initials data, part II

     Back to my initials project. This time, let’s look at last names. I don’t have as much to say here as I did about first names. Here’s the chart:

     Fascinating. The colors just group together different ranges of frequency. I can’t really make much of this, except that all the “Mc”/“Mac” names probably contribute to the M count. And I guess “Smith” alone bumps up the S count. Beyond these ideas, I can’t find much reason. Any ideas?