Text analytics code for the Lima Andina project. A selection of analyses and visualizations performed on the Lima Andina datasets, which contain more than 6000 El Comercio newspaper announcements from migrant associations, published from 1906 to 1933.

parejar/lima-andina-text-analytics

lima-andina-text-analytics

  • Text analytics code for the Lima Andina project.
  • This is a selection of analyses and visualizations performed on the Lima Andina datasets, which contain more than 6500 El Comercio newspaper ads published by internal migrant organizations in Lima between 1906 and 1933.
  • We have made public two datasets, "Associations" and "Articles", through the Borealis repository.
  • From the description in the Borealis repository:

"The first contains information on the associations, and the second contains information on advertisements published in El Comercio by those associations. These announcements typically concerned upcoming and past meetings, elections and other association activities. In aggregate, the dataset reveals patterns in association activity that in turn illuminate the history of internal migration in early-20th century Peru."

  • Table of contents
  1. Exploratory analysis and visualizations
  2. Pre-processing
  3. Named Entities
  4. Topic modelling
  5. Syntactic similarity
  6. Embeddings
  • Example of a TensorFlow Embedding Projector visualization of semantic relationships in the corpus. External link
    • Change the visualization settings to Uniform Manifold Approximation and Projection (UMAP) for a more complete view of the embedding space.
  • DISCLAIMER: The code in this repository is made available only as a way of documenting the Digital Humanities side of the project. You can download the Jupyter notebooks and experiment with them at your own risk. The dataset used in these notebooks differs from the dataset available in Borealis but contains the same newspaper ad text. Here we use a somewhat untidy dataset to illustrate data cleaning techniques.
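
For readers who want to load their own embeddings into the Embedding Projector mentioned above, the following is a minimal sketch of one way to export them, not the code used in the notebooks. It assumes the projector's tab-separated input format: one vector per line in one file, and one label per line (with no header row) in a single-column metadata file. The function name `write_projector_tsv` and the toy vectors and labels are hypothetical and do not come from the Lima Andina datasets.

```python
import csv
import io


def write_projector_tsv(vectors, labels, vec_file, meta_file):
    """Write vectors and labels as tab-separated text (hypothetical helper).

    vectors: iterable of equal-length sequences of numbers.
    labels: one label per vector, in the same order.
    vec_file, meta_file: writable text file objects.
    """
    vectors = list(vectors)
    labels = list(labels)
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")

    # One embedding per line, dimensions separated by tabs.
    writer = csv.writer(vec_file, delimiter="\t", lineterminator="\n")
    for vec in vectors:
        writer.writerow(vec)

    # One label per line; collapse tabs and newlines inside a label so
    # that every label stays on a single line of a single column.
    for label in labels:
        meta_file.write(" ".join(str(label).split()) + "\n")


# Toy data for illustration only (not taken from the corpus).
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]
toy_labels = ["asamblea", "elecciones"]

vec_buffer, meta_buffer = io.StringIO(), io.StringIO()
write_projector_tsv(toy_vectors, toy_labels, vec_buffer, meta_buffer)
print(vec_buffer.getvalue())
print(meta_buffer.getvalue())
```

To produce files for upload, pass file objects opened with `open("vectors.tsv", "w", encoding="utf-8", newline="")` and `open("metadata.tsv", "w", encoding="utf-8")` instead of the in-memory buffers; both file names are placeholders.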
