
Exam project

Andreas Bjerre-Nielsen edited this page May 16, 2018 · 4 revisions

Project inspiration for Topics in Social Data Science

Relevant research

For inspiration, we recommend that you read a few recent research articles that have used machine learning to conduct social science research. Below is a selection of topics and associated articles.

High-level perspective: Mullainathan, S. and Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives, 31(2):87--106

Specific subjects: Blumenstock, J., Cadamuro, G., and On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264):1073--1076

Stephens-Davidowitz, S. (2013b). Unreported Victims of an Economic Downturn

Glaeser, E. L., Kominers, S. D., Luca, M., and Naik, N. (2015). Big Data and Big Cities: The Promises and Limitations of Improved Measures of Urban Life. Working Paper 21778, National Bureau of Economic Research

Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Aiden, E. L., and Fei-Fei, L. (2017). Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States.

Wu, A. (2017). Gender Stereotyping in Academia: Evidence from Economics Job Market Rumors Forum. SSRN Scholarly Paper ID 3051462, Social Science Research Network, Rochester, NY

Gentzkow, M., Shapiro, J. M., and Taddy, M. (2016). Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech. Working Paper 22423, National Bureau of Economic Research

Hansen, S., McMahon, M., and Prat, A. (2017). Transparency and deliberation within the FOMC: a computational linguistics approach. The Quarterly Journal of Economics, 133(2):801--870

Mønsted, B., Sapieżyński, P., Ferrara, E., and Lehmann, S. (2017). Evidence of complex contagion of information in social media: An experiment using Twitter bots. PLoS ONE, 12(9):e0184148

Concrete ideas and data

We list a few brief ideas for getting started with the data collection and some questions that can be asked.

Danish house prices

A straightforward example of a project would be to use the public register of house price sales in Denmark. This can be used to ask both simple and interesting questions.

A simple question would be: how good a model can we create for predicting house prices from sales data? The focus of such an exercise would be to use the existing features in the dataset (number of rooms etc.) in a smart way to construct a feature set for predicting the house price. The focus should also be on constructing new features, e.g. neighborhood-level measures and distances to the sea, lakes and forests.
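A distance feature of this kind could be computed along the following lines. This is a minimal sketch, assuming each sale has been geocoded to latitude and longitude and that a set of reference points (e.g. along the coastline) has been obtained separately. The function names and the coordinates are illustrative placeholders, not part of the course material or the register.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_km(lat, lon, points):
    """Distance in km from (lat, lon) to the closest point in `points`."""
    return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)


# Hypothetical reference points; real ones would come from a map data source.
coast_points = [(55.70, 12.60), (56.15, 10.25)]

# Hypothetical geocoded sale.
house = (55.6761, 12.5683)
dist_to_sea = distance_to_nearest_km(house[0], house[1], coast_points)
print(f"Distance to nearest reference point: {dist_to_sea:.1f} km")
```

The resulting value can be added as one more column in the feature set. For many houses and many reference points, a spatial index would be faster than this brute-force minimum.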

A much more ambitious project would be to collect data from other sources and investigate their impact on house prices, e.g. using the method of Gebru et al. to estimate the impact of predicted demographic composition.

See example of scraping in our course repo here.

Analyzing the crypto-blockchain-market

A little more demanding, scraping-wise, would be to analyze the market for blockchain technology. As demonstrated in the short Web Scraping Lecture, it is possible to collect market information and merge it with each company's press material (company homepage, social media feed via the Twitter REST API) and with GitHub activity through the GitHub API.

One question would be whether one could predict market fluctuations (pump and dumps) or market success using information extracted from either project homepages or their social media activity. Possible tools are natural language processing and/or network analysis of the Twitter “follower”, “retweet” or “like” network, or of a bipartite network with “hashtags” as nodes.
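The bipartite idea can be illustrated with a small sketch. It assumes tweets have already been collected and reduced to (user, hashtags) pairs. The toy data and function names below are made up for illustration, and a real project would more likely use a network library for this step.

```python
from collections import defaultdict
from itertools import combinations


def build_bipartite(tweets):
    """Map each hashtag to the set of users who used it (user-hashtag edges)."""
    users_by_tag = defaultdict(set)
    for user, hashtags in tweets:
        for tag in hashtags:
            users_by_tag[tag.lower()].add(user)
    return dict(users_by_tag)


def project_onto_hashtags(users_by_tag):
    """Hashtag-hashtag edges weighted by the number of users who used both."""
    weights = {}
    for a, b in combinations(sorted(users_by_tag), 2):
        shared = len(users_by_tag[a] & users_by_tag[b])
        if shared:
            weights[(a, b)] = shared
    return weights


# Toy data: (user, hashtags in one tweet). Entirely hypothetical.
tweets = [
    ("alice", ["#btc", "#hodl"]),
    ("bob", ["#btc", "#eth"]),
    ("carol", ["#BTC", "#hodl"]),
]

users_by_tag = build_bipartite(tweets)
edges = project_onto_hashtags(users_by_tag)
for (a, b), w in sorted(edges.items()):
    print(a, b, w)
```

Projecting onto hashtags gives a one-mode network in which standard measures (degree, clustering, communities) can be computed and used as features.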

Using the same techniques, an interesting question could be to describe bot-like behavior: network positions, activity patterns, generic and repetitive language etc.
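As a rough illustration, two simple per-account indicators of this kind are sketched below: the share of repeated posts and the regularity of posting intervals. Both measures and their names are assumptions chosen for the example, not an established bot-detection method, and they would need validation against labeled accounts.

```python
from collections import Counter
from statistics import mean, pstdev


def duplicate_share(texts):
    """Fraction of posts whose normalized text occurs more than once."""
    normalized = [" ".join(t.lower().split()) for t in texts]
    if not normalized:
        return 0.0
    counts = Counter(normalized)
    return sum(1 for t in normalized if counts[t] > 1) / len(normalized)


def interval_cv(timestamps):
    """Coefficient of variation of gaps between consecutive posts.

    `timestamps` are numbers (e.g. seconds). Values near 0 indicate very
    regular posting. Returns None when there are too few posts to tell.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)


# Hypothetical account: repeated text posted at perfectly regular intervals.
posts = ["Buy now!", "buy  now!", "Great project"]
times = [0, 60, 120, 180]
print(duplicate_share(posts), interval_cv(times))
```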

Other datasets

Students in our basic course, Social Data Science, have used a large variety of data sources, including:

  • news on DR (Danish Broadcasting Company) and the Danish newspaper Information
  • price of cars for sales on bilbasen
  • analyzing linguistic content on Twitter
  • Airbnb pricing in Copenhagen
  • prediction of bitcoin prices from Reddit data

If you are interested in working with one or more of these datasets, or in seeing the assignments of the students who used them, please contact us and we will put you in touch.

Grading

The grade for this course is determined exclusively by the project handed in. The project will be judged on a number of dimensions, including:

  • how the data was obtained (setting up new data collection);
  • how the tools for working with networks, geo-data, text and machine learning are applied (at least one must be used, two is recommended);
  • how the methods are applied and which methods are used;
  • how results are explained (writing, figures, tables with model output etc.);
  • the research question and its originality and how it is answered.

Requirements for project

The exam projects have a number of requirements that must be met:

  • Research question (hand in April 30, max half page)
  • Groups with up to four members
  • Project formalia
    • Report (.pdf file)
      • The style should be like a light research article (brief literature review, references to methods etc.)
      • Grading will be based on this report, but the process should be documented in a Jupyter Notebook.
      • The maximum number of pages (normalsider) depends on group size: 1 person: 12 pages, 2 persons: 16 pages, 3 persons: 20 pages, 4 persons: 24 pages.
    • Documentation in Jupyter Notebook (.ipynb file)
  • Some advice
    • It is more important that you spend time on calibrating and validating the models you work with than on using as many models as possible.
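One common way to validate a model is k-fold cross-validation. The sketch below is a plain-Python illustration under stated assumptions: the `fit` and `predict` callables, the mean-predictor baseline and the toy data are all hypothetical, and in practice a machine learning library's cross-validation tools would normally be used instead.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k shuffled, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_val_mae(fit, predict, X, y, k=5, seed=0):
    """Mean absolute error on held-out folds, averaged over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors = [abs(predict(model, X[i]) - y[i]) for i in test]
        scores.append(mean(errors))
    return mean(scores)


# Baseline "model": always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, x):
    return model


# Toy data: one feature (rooms) and a price. Entirely made up.
X = [[r] for r in (2, 3, 3, 4, 4, 5, 5, 6, 6, 7)]
y = [1.0, 1.4, 1.5, 2.1, 2.0, 2.6, 2.4, 3.1, 3.0, 3.6]
baseline_mae = cross_val_mae(fit_mean, predict_mean, X, y, k=5)
print(f"Baseline cross-validated MAE: {baseline_mae:.2f}")
```

Comparing each candidate model against such a baseline on the same folds shows whether added complexity improves out-of-sample performance.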
