Extracting portions of text from text file

I was trying to read the full book of abstracts from a conference earlier and finding it tedious to copy portions of desired paragraphs for my summary report to be fed into my simple auto-summarized module.

I came up with the following script that allows users to put a specific symbol such as “@” at the start and end of the paragraph to mark those paragraphs or sentences to be extracted. More than one portion can be selected and they can be returned as a list for further processing. For my case, each of the paragraph outputted will be auto summarized.

The following diagram illustrated the two different kinds of extraction.

Illustration of extraction type

The script scans all the lines of the text file, looking for the key_symbol (“@” in this case) and marks the index of the selected lines. The present method only use string “startwith” function. It can be expanded to be using regular expression.

Depending on the mode (overlapping or non-overlapping), it will calculate the portion of the text to be selected and output as a list which can be use for further processing.

Script can be found here.

 

One comment

Leave a comment