Winning Pub Quizzes with pandas
I was in my local pub quiz recently when as usual we were faced with the anagram speed round. The quiz-master Dr Paul reads out a collection of words with a clue to a mystery person's identity, and the first team to correctly shout out the answer wins a spot prize. It's often a box of Tunnock's Teacakes - so well worth the effort!
We were told on this particular occasion that the jumbled letters 'OR ANAL MR BOND' (!) were an anagram of a famous actor. As the words are being read out the well-worn technique is to write them in a "cloud" on the page, and then try to decipher the correct name. This can often take some time, but on this night myself and a couple of other teams almost instantaneously shouted "Orlando Bloom!". One of the easiest anagrams we'd had, I thought. But I was wrong. It wasn't Orlando Bloom, and so we had to get to thinking again. It turned out that the correct name we were after was old-Godfather himself, Marlon Brando.
I had to double-check that Orlando Bloom didn't fit the original character set - that indeed, Orlando Bloom and Marlon Brando aren't anagrams of each other. They're close, but they're different. This got me to wondering whether there are any famous actors who's names are anagrams of each other, and to investigate this I took to Python.
The Internet Movie Database (or IMDb for short) is an encyclopedia of all things film and TV. On the page http://www.imdb.com/list/ls058011111/ I found a list of the "Top 1000 Actors and Actresses" created by user hagennelson. This forms the basis for the following research.
Since the list spreads over four pages we first need to download the elements and concatenate them into a single pandas data frame.
import pandas as pd
frame_list = []
for start in [1, 251, 501, 751]:
URL = "http://www.imdb.com/list/ls058011111/?start=" + str(start) + "&view=compact&sort=listorian:asc"
dfs = pd.read_html(URL, match="Name", header=0, index_col=0)
frame_list.append(dfs[0])
df = pd.concat(frame_list)
df.index = df.index.astype(int)
The next task is to create a new column containing the alphabetically sorted characters of the actors’ names:
df['Chars'] = df['Name'].apply(lambda x: ''.join(sorted(x.lower().strip().replace(" ", ""))))
We can see the effect of this in the top five rows of the data frame using df.head()
.
One quick way to check for anagrams now is just to look for duplicates in the Chars column. If we have 1000 unique character strings then we know that we have no anagrams at all. In fact it turns out that we have only 997 unique character strings, so it looks like we have three pairs of anagrams.
print len(df['Chars'].unique())
df_duplicates = df[df.duplicated(subset='Chars')][['Name', 'Chars']]
for x in df_duplicates['Chars']:
print df[df['Chars']==x][['Name', 'Chars', 'Birth Date']]
997
Name Chars
#
292 Neve Campbell abceeellmnpv
846 Neve Campbell abceeellmnpv
Name Chars
#
773 Paul Scofield acdefillopsu
949 Paul Scofield acdefillopsu
Name Chars
#
708 Patrick Wilson aciiklnoprstw
954 Patrick Wilson aciiklnoprstw
Unfortunately what we actually find above is that three of the actors appear erroneously on the list twice each! The actual list of 997 actors do not contain any self-anagrams whatsoever.
Given that we weren’t able to find anagrams, and that the downloaded list contains the actors birthdates, I thought it would be fun to see whether any of the famous thespians share the same date of birth. This turns out to be far more fruitful.
First we need to rid ourselves of the duplicated actors discovered above.
df = df.drop_duplicates(subset='Name')
Similar to above we now wish to find out which, if any, birth dates are duplicated within our dataframe.
df_duplicates = df[df.duplicated(subset='Birth Date')][['Name', 'Birth Date']]
for bd in df_duplicates['Birth Date']:
print df[df['Birth Date']==bd][['Name', 'Birth Date']].to_csv(header=False)
195,Julianne Moore,3 December 1960
199,Daryl Hannah,3 December 1960
238,Lena Headey,3 October 1973
292,Neve Campbell,3 October 1973
397,Anna Paquin,24 July 1982
493,Elisabeth Moss,24 July 1982
145,Glenda Jackson,9 May 1936
571,Albert Finney,9 May 1936
494,Adrianne Palicki,6 May 1983
587,Gabourey Sidibe,6 May 1983
523,Tom Wilkinson,5 February 1948
668,Christopher Guest,5 February 1948
248,Gillian Anderson,9 August 1968
673,Eric Bana,9 August 1968
379,Ellen Page,21 February 1987
779,Ashley Greene,21 February 1987
786,Thandie Newton,6 November 1972
827,Rebecca Romijn,6 November 1972
747,Keri Russell,23 March 1976
836,Michelle Monaghan,23 March 1976
753,Djimon Hounsou,24 April 1964
871,Cedric the Entertainer,24 April 1964
77,Angelina Jolie,4 June 1975
920,Russell Brand,4 June 1975
629,Catalina Sandino Moreno,19 April 1981
932,Hayden Christensen,19 April 1981
425,Chris Pine,26 August 1980
946,Macaulay Culkin,26 August 1980
913,Justin Bartha,21 July 1978
953,Josh Hartnett,21 July 1978
828,Michelle Rodriguez,12 July 1978
964,Topher Grace,12 July 1978
858,John Leguizamo,22 July 1964
970,David Spade,22 July 1964
And above we get a set of 17 pairs of actors from our original list who were born on exactly the same day. To pick out just three pairs we have:
3 December 1960: Julianne Moore and Daryl Hannah.
21 February 1987: Ellen Page and Ashley Greene.
4 June 1975: Angelina Jolie and Russell Brand.
You just never know — this could all come in useful at a pub quiz one day!