Vince Dorie grew up in the San Francisco Bay Area, where he attended a Jesuit high school in San Jose that instilled in him a love of literature as well as the sciences. Later, at Stanford, he studied computer science and biology and earned a master’s degree in biomedical informatics. Data science was only beginning to emerge at the time, but while at Stanford he helped a history professor build apps that enabled communities to collaboratively document their own histories.

After he got his master’s, he founded a startup company in Durham, North Carolina, called SparkIP. The company analyzed the network of patents through their citations, produced clusters of similar technologies, and used natural language processing to give the clusters human-understandable names. Patents are a rough form of knowledge or innovation, and collections of similar patents come together to form an idea or a field, says Dorie. Given these clusters, people could relate them to each other based on their shared citations, making it possible to see which patent ideas are related. The goal was to design a tool that would allow companies to find technologies in fields related to their own, facilitating research at the university level through partnerships and licensing agreements. The company lasted just a few years, but the work made Dorie realize that to improve upon the ad hoc approach he had used at SparkIP, and to strengthen his limited understanding of machine learning, he needed a deeper grounding in probability and statistics.

Consequently, he entered Columbia’s doctoral program in statistics, where under the direction of Professor Andrew Gelman he focused on Bayesian linear mixed models. In 2014, after earning his Ph.D., he did a Q-TRAIN postdoc at NYU in the Department of Applied Statistics, Social Science, and Humanities. While there, he developed Bayesian nonparametric methods for causal inference and conducted the first large-scale comparison of causal inference methods. He later spent a year as a data fellow for the California Department of Justice, working on then-Attorney General Kamala Harris’ Open Justice initiative. The program was designed to make criminal justice data publicly available, so that research could better inform the department’s policies and enhance accountability. As a first step in building that research capacity at the department, Dorie worked to de-identify several criminal justice data sets.

Now, as an associate research scientist at the Data Science Institute (DSI), he teaches the institute’s Probability and Statistics course and works on various research projects. One project involves building systems to export derived medical information from Columbia University’s Medical Center while respecting data privacy restrictions. For a second project, he is helping the medical center’s development team identify prospective patient-donors. And for a third, he’s helping with a wide-ranging grant to obtain funds for a targeted opioid intervention in New York State.

He says he likes living in New York City and working at Columbia, a liberal arts university where he can keep up with several fields – not only science and tech. He quips that he spends “a lot of time trying to fill the gaps of his science and engineering education by reading classic literature.” He’s also an avid music collector and closely follows national politics. Working in data science allows him to think about and do research in a variety of fields, which he enjoys.

“I’ve studied biology and computer science and statistics,” he says, “and I now get to use all of that in conjunction with data science to help researchers across Columbia accomplish their goals.”