Description
The objective of analyzing the log is to document data quality, which means being transparent about your data collection. Analytically, you look for signs of systematically missing data (certain error codes concentrated in one part of the scrape, or holes in the time series that indicate an error in the scraping program) and for artifacts (suspiciously similar response sizes or suspiciously short responses).
- Analyze systematic connection errors / error codes and systematically missing data.
  - Plot the number of error codes over time to see whether responses are missing systematically.
  - Plot the number of error codes by subsection of your scrape (e.g. cnn.com/health vs. cnn.com/business) to see whether responses are missing systematically in particular subsections.
  - Plot the time before each response (the delta_t column) over time to see whether server response times are changing, which may indicate problems.
- Look for artifacts and potential signs of different HTML formatting. Systematically different formatting of the HTML will probably force you to design two or more separate parsing procedures.
  - Plot the size distribution (length of the HTML/JSON response), e.g. as a histogram with sns.histplot (the replacement for the deprecated sns.distplot), to look for potential artifacts and errors (unexpectedly small responses, or standard responses with exactly the same length).
  - Plot response size over time, or by subsection (e.g. cnn.com/health vs. cnn.com/business), to look for potential formatting issues or errors in particular subsections.
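The checks in the list above can be sketched with pandas and matplotlib. This is a minimal sketch under stated assumptions, not a prescribed implementation: it assumes the log has been loaded into a DataFrame, and apart from `delta_t` every column name (`t`, `url`, `response_code`, `response_size`) is hypothetical, as are the helper function names. An "error" is assumed to mean a missing response code or an HTTP status of 400 or above.

```python
from urllib.parse import urlparse

import pandas as pd


def first_path_segment(url: str) -> str:
    """Return the first path segment, e.g. 'health' for https://cnn.com/health/x.

    Assumes full URLs that include a scheme (https://...).
    """
    path = urlparse(url).path.strip("/")
    return path.split("/")[0] if path else "(root)"


def add_log_features(log: pd.DataFrame) -> pd.DataFrame:
    """Add helper columns. All column names except delta_t are assumptions."""
    out = log.copy()
    out["t"] = pd.to_datetime(out["t"])
    # Assumed error definition: no response code at all, or HTTP status >= 400.
    out["is_error"] = out["response_code"].isna() | (out["response_code"] >= 400)
    out["subsection"] = out["url"].map(first_path_segment)
    return out


def errors_over_time(log: pd.DataFrame, freq: str = "10min") -> pd.Series:
    """Count errors per time bin; bins without any error show up as 0."""
    return log.set_index("t")["is_error"].resample(freq).sum()


def error_rate_by_subsection(log: pd.DataFrame) -> pd.DataFrame:
    """Requests, errors, and error rate for each subsection of the scrape."""
    return log.groupby("subsection")["is_error"].agg(
        n_requests="size", n_errors="sum", error_rate="mean"
    )


def plot_log_checks(log: pd.DataFrame, freq: str = "10min"):
    """Draw the four diagnostic plots on one figure and return it."""
    # Imported here so the aggregation helpers also work without matplotlib.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    errors_over_time(log, freq).plot(ax=axes[0, 0], title="Errors over time")
    error_rate_by_subsection(log)["error_rate"].plot.bar(
        ax=axes[0, 1], title="Error rate by subsection"
    )
    log.plot(x="t", y="delta_t", ax=axes[1, 0], title="delta_t over time")
    log["response_size"].plot.hist(
        bins=50, ax=axes[1, 1], title="Response size distribution"
    )
    fig.tight_layout()
    return fig
```

A typical call would be `plot_log_checks(add_log_features(log))`. The bin width `freq` should be chosen to match the duration of the scrape; response size over time or by subsection can be plotted the same way by swapping `delta_t` for `response_size` or grouping on `subsection`.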
If any problems are present, this is your chance to demonstrate that you take methodological issues seriously. Sample the anomalies (e.g. breaks in the time series, suspiciously small responses, or responses of identical length such as a standard empty response), inspect them manually to find the explanation, and report what you find.
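Drawing a sample of anomalies for manual inspection could look like the sketch below. It is illustrative only: the column names `t` and `response_size`, the helper names, and all thresholds (`min_size`, `min_repeats`, `max_gap`) are hypothetical and need to be tuned to your own scrape. The flags mark candidates for inspection, not confirmed errors.

```python
import pandas as pd


def flag_anomalies(
    log: pd.DataFrame,
    min_size: int = 500,
    min_repeats: int = 3,
    max_gap: str = "1 hour",
) -> pd.DataFrame:
    """Flag candidate anomalies; thresholds and column names are assumptions."""
    out = log.copy()
    out["t"] = pd.to_datetime(out["t"])
    out = out.sort_values("t")
    # Suspiciously short responses.
    out["too_small"] = out["response_size"] < min_size
    # Exact same length seen many times, e.g. a standard empty response.
    size_counts = out["response_size"].map(out["response_size"].value_counts())
    out["repeated_size"] = size_counts >= min_repeats
    # First request after a hole in the time series.
    out["after_gap"] = out["t"].diff() > pd.Timedelta(max_gap)
    return out


def sample_for_inspection(
    flagged: pd.DataFrame, flag: str, n: int = 10, seed: int = 0
) -> pd.DataFrame:
    """Draw up to n flagged rows at random to open and read by hand."""
    hits = flagged[flagged[flag]]
    return hits.sample(n=min(n, len(hits)), random_state=seed)
```

For example, `sample_for_inspection(flag_anomalies(log), "too_small")` returns up to ten short responses whose URLs can then be opened and read manually. Fixing `seed` keeps the sample reproducible, so the inspected cases can be reported.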
If the issue is real, consider its potential consequences (if any) for your analysis, and comment on its likely causes and explanations. Doing so demonstrates strong methodological scraping skills.