Sentence tokenizer not working on Full stop  #76

@shivambatra76


I passed the following input to clean_text_by_sentences, imported as:

from summa.preprocessing.textcleaner import clean_text_by_sentences as _clean_text_by_sentences

text='''Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.
'''
This is the output I received after preprocessing. As you can see, the second unit should be split at the full stop after "year-earlier", but instead the text is only split at the line break entered with the Enter key.

[Original unit: 'Ad sales boost Time Warner profit' --- Processed unit: 'ad sale boost time warner profit',
Original unit: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.' --- Processed unit: 'quarter profit media giant timewarn jump bn £m month decemb m year earlier firm biggest investor googl benefit sale high speed internet connect higher advert sale',
Original unit: 'TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.' --- Processed unit: 'timewarn said fourth quarter sale rose bn bn',
Original unit: 'Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.' --- Processed unit: 'profit buoy gain offset profit dip warner bros user aol']
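
A possible workaround, sketched under the assumption that the split fails because there is no whitespace after the full stop in "year-earlier.The": normalise the text before tokenizing. The helper name add_missing_sentence_spaces is hypothetical and not part of summa. It only touches a full stop that sits directly between a lowercase letter and an uppercase letter, so figures such as $1.13bn are left alone.

```python
import re

# Hypothetical helper (not part of summa): matches a full stop that is
# directly preceded by a lowercase letter and directly followed by an
# uppercase letter, i.e. a sentence boundary with the space missing.
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def add_missing_sentence_spaces(text):
    """Insert a space after a full stop that is glued to the next sentence."""
    return _MISSING_SPACE.sub(". ", text)


sample = "jumped 76% to $1.13bn, from $639m year-earlier.The firm benefited."
print(add_missing_sentence_spaces(sample))
# jumped 76% to $1.13bn, from $639m year-earlier. The firm benefited.
```

The normalised string can then be passed to clean_text_by_sentences as before. This is a heuristic: it would also add a space inside tokens such as "example.Com", so it may need adjusting for other inputs.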
