AFAIK we should have the possibility to harvest only articles and journals, and to ignore theses and other documents that will not be ingested correctly or that are not supported by all the components of the pipeline.
E.g. a thesis PDF is on average around 200 MB and can go up to several GB.
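
To make the idea concrete, here is a rough sketch of what such a filter could look like. Everything in it is an assumption for illustration: the `Record` fields, the `ALLOWED_TYPES` allow-list, and the 50 MiB cap are hypothetical and do not come from the existing harvester.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allow-list: only these document types get harvested.
ALLOWED_TYPES = frozenset({"article", "journal"})

# Hypothetical size cap (50 MiB); the real limit would depend on the pipeline.
MAX_PDF_BYTES = 50 * 1024 * 1024


@dataclass
class Record:
    """Minimal stand-in for a harvested metadata record (assumed shape)."""
    identifier: str
    doc_type: str
    pdf_bytes: Optional[int] = None  # None when the size is not known up front


def should_harvest(record: Record,
                   allowed_types=ALLOWED_TYPES,
                   max_pdf_bytes: int = MAX_PDF_BYTES) -> bool:
    """Return True if the record has an allowed type and is not too large."""
    # Normalise the type so "Article" and " article " are treated alike.
    if record.doc_type.strip().lower() not in allowed_types:
        return False
    # Skip oversized PDFs; records with unknown size are let through.
    if record.pdf_bytes is not None and record.pdf_bytes > max_pdf_bytes:
        return False
    return True


def filter_records(records: Iterable[Record]) -> List[Record]:
    """Keep only the records that pass should_harvest, preserving order."""
    return [r for r in records if should_harvest(r)]
```

Filtering on type alone would already drop theses; the size check is just an extra guard for large files that happen to be labelled as articles.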