How should I analyze the Log? #41

@snorreralund

The objective of analyzing the log is to document data quality. This means being transparent about your data collection. Analytically, you look for signs of potentially systematic missing data (certain error codes concentrated in part of the scrape, holes in the time series indicating an error in the scraping program) and for artifacts (suspiciously similar response sizes or suspiciously short responses).

  1. Analyze systematic connection errors / error codes and systematically missing data.
  • Plot the number of error codes over time to see whether missing responses follow a systematic pattern.
  • Plot the number of error codes by subsection of your scrape (e.g. cnn.com/health vs. cnn.com/business) to see whether missing responses are concentrated in particular subsections.
  • Plot the time before response (the delta_t column) over time to see whether server response times are changing, which indicates potential problems.
  2. Look for artifacts and signs of differing HTML formatting. Systematically different HTML formatting will probably force you to design two or more separate parsing procedures.
  • Plot the size distribution (length of the html/json response), e.g. as a histogram (sns.distplot, or sns.histplot in newer seaborn versions), to look for artifacts and errors (unexpectedly small responses, standard responses with exactly the same length).
  • Plot response size over time, or by subsection (e.g. cnn.com/health or cnn.com/business), to look for formatting issues or errors in different subsections.
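
As a rough illustration of the grouping behind the plots listed above, here is a minimal sketch using only the Python standard library. The log rows and the field names `t`, `url`, `response_code` and `response_size` are assumptions about what your log contains (only `delta_t` is named above), and the 60-second bucket is an arbitrary choice; adapt both to your own log.

```python
from collections import Counter, defaultdict
from statistics import mean
from urllib.parse import urlparse

# Hypothetical log rows. Field names other than delta_t are assumptions;
# t is taken to be seconds since the scrape started.
log = [
    {"t": 5,   "url": "https://cnn.com/health/a",   "response_code": 200, "response_size": 51234, "delta_t": 0.41},
    {"t": 20,  "url": "https://cnn.com/business/b", "response_code": 200, "response_size": 48710, "delta_t": 0.39},
    {"t": 70,  "url": "https://cnn.com/health/c",   "response_code": 404, "response_size": 512,   "delta_t": 0.12},
    {"t": 95,  "url": "https://cnn.com/health/d",   "response_code": 404, "response_size": 512,   "delta_t": 0.11},
    {"t": 130, "url": "https://cnn.com/business/e", "response_code": 200, "response_size": 50102, "delta_t": 1.85},
    {"t": 150, "url": "https://cnn.com/health/f",   "response_code": 503, "response_size": 0,     "delta_t": 2.40},
]

BUCKET = 60  # seconds per time bucket (arbitrary choice)


def subsection(url):
    """Return the first path segment, e.g. 'health' for cnn.com/health/a."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[0] if parts[0] else "(root)"


def error_counts(rows, key):
    """Count non-200 response codes per group; key(row) names the group."""
    counts = defaultdict(Counter)
    for row in rows:
        if row["response_code"] != 200:
            counts[key(row)][row["response_code"]] += 1
    return {group: dict(codes) for group, codes in counts.items()}


def mean_delta_t(rows, key):
    """Average time before response per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row["delta_t"])
    return {group: mean(values) for group, values in groups.items()}


errors_over_time = error_counts(log, lambda r: r["t"] // BUCKET)
errors_by_section = error_counts(log, lambda r: subsection(r["url"]))
delta_t_over_time = mean_delta_t(log, lambda r: r["t"] // BUCKET)

print(errors_over_time)
print(errors_by_section)
print(delta_t_over_time)
```

Each resulting dictionary maps a group (time bucket or subsection) to the values you would put on the y-axis of the corresponding plot. Groups without errors are absent from the error counts, which is itself worth showing as zero in the plot.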

If any problems are present, you get the chance to demonstrate your serious attitude towards methodological issues. You should sample anomalies (e.g. breaks in the time series, suspiciously short responses, or responses of suspiciously similar length such as a standard empty response), inspect them manually to find the explanation, and report what you find.
If the issue is real, think about its potential consequences (if any) for your analysis and comment on potential causes and explanations, thereby demonstrating strong methodological scraping skills.
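
One possible way to pick anomalies for manual inspection is sketched below with the standard library. The `(url, response_size)` pairs are made-up data, and the two thresholds `MIN_SIZE` and `MIN_REPEATS` are assumptions you would tune after looking at your own size distribution.

```python
import random
from collections import Counter

# Hypothetical (url, response_size) pairs taken from a log.
sizes = [
    ("https://cnn.com/health/a", 51234),
    ("https://cnn.com/health/b", 512),
    ("https://cnn.com/health/c", 512),
    ("https://cnn.com/business/d", 512),
    ("https://cnn.com/business/e", 48710),
    ("https://cnn.com/business/f", 87),
]

MIN_SIZE = 1000   # assumed cutoff for a "suspiciously short" response
MIN_REPEATS = 3   # assumed count at which an identical size looks like a standard response

# Response sizes that occur at least MIN_REPEATS times.
size_counts = Counter(size for _, size in sizes)
repeated_sizes = {size for size, n in size_counts.items() if n >= MIN_REPEATS}

too_short = [url for url, size in sizes if size < MIN_SIZE]
same_size = [url for url, size in sizes if size in repeated_sizes]


def sample_for_inspection(urls, k, seed=0):
    """Draw up to k URLs at random; the fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))


to_inspect = sample_for_inspection(too_short, k=2)
print(to_inspect)
```

Fixing the seed means the same sample can be drawn again later, which makes it easier to report exactly which responses you inspected.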
