Past, Present and Future of Data Science
Q&A with Microsoft’s Jake Hofman
Jake Hofman was a data scientist before that term existed. Trained as an electrical engineer, Hofman came to Columbia to study the philosophy of quantum physics, and in 2008, earned his PhD in physics. Now a senior research scientist at Microsoft, Hofman spends much of his free time teaching. He designed some of Columbia’s earliest data science courses, and this past summer, ran Microsoft’s data science boot camp for NYC college students for a fifth year in a row. His next class at Columbia — Modeling Social Data — will be offered this spring. He spoke with us recently about teaching, research, and where he sees data science heading.
What do you do at Microsoft Research?
A mix of basic and applied research. Most of my work is published in academic journals, but I also work on product prototypes with colleagues throughout Microsoft.
How did you get into computers and data science?
I wanted to do hardware and electronics from an early age, and spent a lot of time fixing and taking apart my computer and building small circuits for radios or guitar pedal effects. I studied electrical engineering at Boston University, but after an internship during the dot-com boom, realized that hardware wasn’t for me. Working in a quantum optics lab at BU, I got interested in physics and the philosophy of physics, which I went to Columbia to study. But I found myself working constantly with data and realized it was the common thread in everything I enjoyed.
You only recently started identifying as a data scientist. Why?
I felt it was a vague, somewhat meaningless phrase that encompassed everything I was already doing as "regular" science. I used to ask myself: what's science without data, anyway? I'm doing the same type of work I’ve done since college, though in different domains. At BU it was data analysis for quantum imaging applications, and at Columbia it was image and video analysis for biological data. It was always a mix of skills, from understanding the domain, to gathering data, to analyzing and modeling it, to presenting results to other people, which are all core parts of data science.
What made you come back to Columbia to teach?
I've always been passionate about teaching. As a PhD student, I taught high school students in Columbia’s Science Honors Program, helped teach the undergraduate physics lab and ran a course to prepare graduate students for their qualifying exams. It seemed natural to continue teaching after graduation, and I'm glad I have. It forces me to learn new things and improve my understanding of topics I think I already know. I also get to meet talented, enthusiastic students, some of whom become research collaborators.
What’s the one skill all data scientists should have?
The ability to focus on the right questions, and to define them precisely. Understanding algorithms and models, coding them up, and making solutions scale to large data sets are all valuable technical skills. But I still spend most of my time formulating questions and refining answers. Simple tools (like linear models) often do the trick. Asking good questions is a central theme in my courses. It’s difficult to teach, but I've found case studies and projects work best.
Can you describe a typical assignment?
I have students use the New York Times API to download story snippets and build a text classifier that predicts which news section the snippet came from. I created this assignment for my Data-Driven Modeling class in 2009, and at the time had PhD students in statistics who had never gathered or cleaned their own data, or fit a model. They told me it was one of the most useful assignments they'd ever done. I still use it.
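The core of that assignment can be sketched in a few lines of scikit-learn. The snippets below are toy stand-ins labeled by hand; a real run would download them from the New York Times Article Search API (which requires an API key):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for NYT story snippets; a real run would fetch
# these from the Article Search API with an API key.
snippets = [
    "The senator introduced a bill on campaign finance reform",
    "The mayor announced a new city council budget vote",
    "The pitcher threw a no-hitter in last night's game",
    "The team clinched the playoffs with a late home run",
]
sections = ["Politics", "Politics", "Sports", "Sports"]

# Bag-of-words features plus a linear classifier: the kind of
# simple tool that often does the trick.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(snippets, sections)

print(model.predict(["The senator announced a budget bill"])[0])
```

With real data, the interesting work is exactly what the assignment targets: gathering and cleaning the snippets, choosing features, and checking where the classifier fails.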
What will a data scientist need to know in five years?
Issues of causality. Economics has focused on causality for a long time; computer science and statistics less so. As data scientists take on more decision making, they need to go beyond describing what's happening to evaluating different courses of action and recommending one. Data science programs should incorporate more formal training around causality, both in designing randomized controlled experiments and in exploiting naturally occurring experiments to infer causal effects from observational data.
In your work at Microsoft, what have you learned about virality and the web?
Contrary to what most people think, most content doesn't spread beyond the person who initially posts it. The popularity of a video or news story is largely driven by how many people receive a broadcast. A tweet that goes viral, by contrast, depends on a wide variety of people reposting the link at the same time. Predicting what will go viral is surprisingly difficult. We’ve built a product prototype, Viral Search, that lets people track specific stories across Twitter.
You’ve also developed tools to help people make sense of big numbers. Why?
People generally have a hard time putting numbers in perspective. A $100 million budget cut may sound like a lot, but if you’re told the cut is 0.003% of the overall budget, you might think differently. We’ve shown in large-scale, randomized controlled experiments that adding context to news articles helps people remember what they've read, estimate unfamiliar amounts, and detect errors in manipulated measurements. We’ve also built this capability into Microsoft’s search engine, Bing. If you ask for the area of Afghanistan, you not only get the answer, 251,827 square miles, but you also get a perspective that tells you it’s “about the size of Texas.”
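As a rough illustration of the idea (not the actual Bing implementation), a perspective sentence can be generated by re-expressing a value as a multiple of a familiar reference quantity. The reference table and function below are hypothetical:

```python
# Hypothetical sketch of a "perspective" generator: it re-expresses
# an unfamiliar quantity as a multiple of a familiar reference.
# Areas are approximate, in square miles.
REFERENCES_SQ_MI = {
    "Texas": 268_596,
    "Rhode Island": 1_214,
}

def perspective(value_sq_mi: float) -> str:
    # Pick the reference whose size is closest to the value (by ratio).
    name, ref = min(REFERENCES_SQ_MI.items(),
                    key=lambda kv: abs(value_sq_mi / kv[1] - 1))
    ratio = value_sq_mi / ref
    if 0.9 <= ratio <= 1.1:
        return f"about the size of {name}"
    return f"about {ratio:.1f} times the size of {name}"

print(perspective(251_827))  # Afghanistan's area -> "about the size of Texas"
```

The deployed system draws on much richer reference data and on user studies of which comparisons people find memorable; this only illustrates the ratio-based framing.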
Your website includes a link to “geek tips.” What’s that about?
It was a way for me to keep track of random things I learned in grad school. It’s driven by a shell script I hacked together and still use: it saves a line of text, with a date, to a file that I can search later. I found myself sending these tips to other students and realized that others could benefit too, so I added a hook to have the script tweet out each tip as I create it. @onelinetips has a small but dedicated following on Twitter. One nifty tip shows how to create an alias for searching my tips from the command line.
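The script's behavior can be sketched in a few lines of POSIX shell; the function and file names here are illustrative, not Hofman's actual script:

```shell
#!/bin/sh
# Illustrative sketch of a one-line-tips script (names are made up).
# "tip" appends a dated line of text to a file; "tips" searches it.
TIPFILE="${TIPFILE:-$HOME/.onelinetips}"

tip() {
    # Save today's date and the tip text, tab-separated.
    printf '%s\t%s\n' "$(date +%Y-%m-%d)" "$*" >> "$TIPFILE"
}

tips() {
    # Case-insensitive search over all saved tips.
    grep -i "$1" "$TIPFILE"
}

# Usage:
#   tip "use ctrl-r to search shell history"
#   tips ctrl-r
```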
What are you reading?
One of Neal Stephenson's newer books, REAMDE. He writes nerdy, entertaining science fiction. Two of my data science favorites are Brian Christian and Tom Griffiths’ Algorithms to Live By, and Jon Gertner’s The Idea Factory. The first explains how algorithms were developed and why people should care about them. The second is an inspiring history of Bell Labs, the birthplace of much of the technology we use today. It’s a reminder that nothing lasts forever — not even Bell Labs. Hans Rosling’s Factfulness is another recent favorite.
If you could recommend one book to aspiring data scientists what would it be?
Garrett Grolemund and Hadley Wickham’s R for Data Science. Not because I'm prescriptive about programming languages, but because few people have spent as much time thinking about the field. The authors hit the right level of abstraction for people to easily interface with data, ask questions, visualize results, build models and iterate. Also, John Tukey's 1977 classic, Exploratory Data Analysis, was one of the first statistics books to treat data as something to explore rather than only a means to confirm existing hypotheses. It's sobering to see how far back these ideas go.
— Kim Martineau