Improve parser detection of unhandled content #80
Open
Conversation
dracos
reviewed
Mar 7, 2017
pyscraper/new_hansard.py
Outdated
tag.append(p)

if len(para) > 1:
    for p in para:
Member
Does this not double the output in the cases it's trying to catch? e.g.
<Question><hs_Para><Number>Q2</Number>.
<Uin>[908984]</Uin>
<Member><B>Mr
Steve Reed</B> (Croydon North) (Lab):</Member>
<QuestionText></QuestionText>I
add my condolences to those already expressed about the former Father
of the House, and I welcome
my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon.
Friend the Member for Stoke-on-Trent Central (Gareth Snell) to his
place.</hs_Para><hs_Para>Young
black men who use mental health services are more likely than other
people to be subject to detention, extreme forms of medication and
severe physical restraint, and, in extreme cases, this has led to
death, including that of my constituent Seni Lewis. Too many black
people with mental ill health are afraid to seek treatment from a
service they fear will not treat them fairly. Will the Prime Minister
meet me and some of the affected families to discuss the need for an
inquiry into institutional racism in the mental health
service<hs_TimeCode time="2017-03-01T12:22:18"></hs_TimeCode>?</hs_Para></Question>
The following-sibling would catch the first para after QuestionText, and then this loop would catch it again.
Fix for the parser failing to pick up all the text if there is more than one hs_Para element inside a Question tag
Store the UID and HRSContentID of handled tags so we can later compare to a list of all IDs in the document
Get a list of all tag IDs in the document and compare to the list we've processed and throw an exception if they don't match.
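The check described in the two commits above can be sketched roughly as follows. This is a minimal illustration using the standard library's ElementTree, not the code in pyscraper/new_hansard.py: the attribute names `UID` and `HRSContentID` come from the commit message, while `UnhandledContentError`, `tag_ids`, `all_ids` and `check_all_handled` are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the commit message above.
ID_ATTRS = ("UID", "HRSContentID")


class UnhandledContentError(Exception):
    """Hypothetical exception type; the PR only says an exception is thrown."""


def tag_ids(element):
    """Return the set of (attribute, value) IDs carried by one element."""
    return {(a, element.get(a)) for a in ID_ATTRS if element.get(a) is not None}


def all_ids(root):
    """Collect every ID in the document, one XPath query per attribute."""
    found = tag_ids(root)  # ".//*" does not match the root itself
    for attr in ID_ATTRS:
        for el in root.findall(".//*[@%s]" % attr):
            found.add((attr, el.get(attr)))
    return found


def check_all_handled(root, seen_ids):
    """Raise if the document contains IDs that the parser never recorded."""
    missing = all_ids(root) - seen_ids
    if missing:
        raise UnhandledContentError("unhandled tags: %s" % sorted(missing))


doc = ET.fromstring(
    '<Debate UID="1"><hs_Para UID="2"/><hs_Para UID="3" HRSContentID="9"/></Debate>'
)
seen = set()
for handled in (doc, doc[0], doc[1]):
    seen |= tag_ids(handled)  # the parser would do this as it handles each tag
check_all_handled(doc, seen)  # passes: every ID in the document was recorded
```

Comparing sets rather than lists means the order in which tags are handled does not matter, only whether each ID was recorded at least once.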
9f0e0a4 to
d511b3e
Compare
Copes with tags that are mostly processed from inside another tag
c4476de to
f96f8f3
Compare
There are lots of tags that we don't directly parse, either because we're interested in their sub tags or because they are parsed as part of the parent. Mark these as seen.
We didn't use namespaces before so they weren't being parsed properly. Correct this and track the tags.
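To illustrate why namespaces matter here, the sketch below shows an unqualified lookup silently finding nothing in a namespaced document, and a helper that recovers the bare tag name. It uses ElementTree for brevity; the namespace URI is made up for the example, and `tag_name_no_ns` is a hypothetical stand-in, not the parser's own `get_tag_name_no_ns` method.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/hansard"  # hypothetical namespace URI for illustration


def tag_name_no_ns(element):
    """Return the local tag name, dropping any '{namespace}' prefix."""
    tag = element.tag
    # Comments and processing instructions have a non-string tag.
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


root = ET.fromstring('<Doc xmlns="%s"><hs_Para UID="1">Text</hs_Para></Doc>' % NS)

bare = root.find("hs_Para")                # no match: the element is namespaced
qualified = root.find("{%s}hs_Para" % NS)  # matches once the namespace is given

print(bare is None, tag_name_no_ns(qualified))
```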
Make sure we are coping with questions where part of the question isn't in the tail of QuestionText but is in following tags. Also cope with oddities like multiple question number tags.
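One way to gather question text that is split between the tail of QuestionText, the tags that follow it, and later hs_Para elements is sketched below. It is an assumption-laden illustration using ElementTree and un-namespaced tags, with a hypothetical `question_paragraphs` function; it does not reproduce the parser's actual logic and does not handle multiple question number tags.

```python
import xml.etree.ElementTree as ET


def question_paragraphs(question):
    """Return the question text as one string per paragraph, each taken once."""
    paragraphs = []
    found = False
    for para in question.findall("hs_Para"):
        if found:
            # Paragraphs after the one holding QuestionText belong to the question.
            paragraphs.append("".join(para.itertext()))
            continue
        children = list(para)
        for i, child in enumerate(children):
            if child.tag == "QuestionText":
                found = True
                # Text may sit inside QuestionText or, as in the example, in its tail.
                parts = [child.text or "", child.tail or ""]
                for sibling in children[i + 1:]:
                    parts.append("".join(sibling.itertext()))
                    parts.append(sibling.tail or "")
                paragraphs.append("".join(parts))
                break
    # Collapse the line wrapping found in the source XML.
    return [" ".join(p.split()) for p in paragraphs]


xml = (
    "<Question><hs_Para><Number>Q2</Number>. <Uin>[908984]</Uin>"
    "<Member><B>Mr Steve Reed</B> (Croydon North) (Lab):</Member>"
    "<QuestionText></QuestionText>I welcome\n"
    'my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon. Friend.'
    "</hs_Para><hs_Para>Will the Prime Minister meet me?</hs_Para></Question>"
)
for text in question_paragraphs(ET.fromstring(xml)):
    print(text)
```

Because each hs_Para is visited exactly once, the second paragraph is not collected twice, which is the doubling the review comment asks about.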
Clause tags actually relate to the text after them so ignore them at the top level and then go back and parse them as part of the following heading tag. Then add them as the first part of the first speech under the heading. Fixes #53
If there's more than one heading or procedure in a new debate tag then make those into paragraphs in the first speech of the debate.
Rather than parsing it all into a single line of text, parse all the paragraphs and indents so that we retain a bit more structure.
Scans the list of seen files, picks out the latest one and re-parses it. Assumes that the files in the list are in date order.
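The ordering assumption in that commit can be made concrete with a small sketch. The format of the seen-files list is not shown in this thread, so the plain list of names, the file names themselves, and the `latest_seen` helper are all hypothetical.

```python
def latest_seen(seen_files):
    """Return the most recent entry, assuming the list is already in date order."""
    if not seen_files:
        raise ValueError("no files have been seen yet")
    return seen_files[-1]


# Hypothetical file names, oldest first.
seen = ["debates2017-02-27a.xml", "debates2017-02-28a.xml", "debates2017-03-01a.xml"]
print(latest_seen(seen))
```

If the list were ever not in date order, taking the last entry would quietly pick the wrong file, so sorting on an explicit date would be the safer variant.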
f96f8f3 to
97d679c
Compare
dracos
reviewed
Dec 13, 2017
)
for t in following_tags:
    tag_name = self.get_tag_name_no_ns(t)
    self.handle_tag(tag_name, t)
Member
I've adapted part of this commit in master to fix a recent issue. Note this doesn't fully work, in that any subsequent paragraphs would become a new no-speaker speech. What I've done in e8acc13 is make sure this uses new_speech() so current_speech is set and then they'll be attached correctly. This simplifies the function a bit too.
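The point about `new_speech()` and `current_speech` can be illustrated with a toy model. Only those two names come from the comment above; the `SpeechBuilder` class, its `add_paragraph` method and the dictionary layout are hypothetical and much simpler than the real parser.

```python
class SpeechBuilder:
    """Toy stand-in for the parser's speech handling (hypothetical)."""

    def __init__(self):
        self.speeches = []
        self.current_speech = None

    def new_speech(self, speaker=None):
        # Always record the speech as current so later paragraphs attach to it.
        self.current_speech = {"speaker": speaker, "paragraphs": []}
        self.speeches.append(self.current_speech)
        return self.current_speech

    def add_paragraph(self, text):
        # Only start a no-speaker speech when there is nothing to attach to.
        if self.current_speech is None:
            self.new_speech()
        self.current_speech["paragraphs"].append(text)


builder = SpeechBuilder()
builder.new_speech("Mr Steve Reed")
builder.add_paragraph("First paragraph of the question.")
builder.add_paragraph("Second paragraph of the question.")
print(len(builder.speeches), len(builder.current_speech["paragraphs"]))
```

If the speech were created without setting `current_speech`, the second `add_paragraph` call in this model would open a new no-speaker speech, which is the failure the comment describes.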
403ee7b to
0c4983b
Compare
bc05e4e to
cf4da9e
Compare
The parser now tracks every tag it handles, by tag ID, and then compares those IDs to a list of IDs extracted using XPath. If the two lists differ it throws an exception.
There are also a number of parser improvements in here, found in the process of making sure that it parsed things correctly.
It also adds a script to make re-parsing easier.
Fixes #54
Fixes #66