The attached file contains a number of sentences. Each word in these sentences is annotated with one of a set of tags. I would like a script that splits the sentences into chunks based on the annotation tags (tag1 and tag2). The chunks must be written to two files according to their tag: a tag1 file and a tag2 file. The sentence number, chunk number and word index must be preserved.
To clarify, I will use the example in the attached file:
There are 4 sentences:
sentence1: He is a good person .
sentence2: Thank you so much !!
sentence3: John Adam likes to play with his friends :)
sentence4: Netflix has almost 75 million global subscribers.
These sentences must be split into chunks and written to two files as follows:
For sentence1, there are two chunks:
Chunk-1: He is >> written in tag1 file
Chunk-2: a good person . >> written in tag2 file
For sentence2, there are two chunks:
Chunk-1: Thank you >> written in tag2 file
Chunk-2: so much !! >> written in tag1 file
For sentence3, there are four chunks:
Chunk-1: John Adam likes >> written in tag1 file
Chunk-2: to play >> written in tag2 file
Chunk-3: with his >> written in tag1 file
Chunk-4: friends :) >> written in tag2 file
For sentence4, there are four chunks:
Chunk-1: Netflix has >> written in tag1 file
Chunk-2: almost 75 million >> written in tag2 file
Chunk-3: global >> written in tag1 file
Chunk-4: subscribers >> written in tag2 file
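The chunk boundaries listed above can be sketched in Python as follows. This is only a rough illustration, not a finished script. It assumes a rule inferred from the examples: a run of consecutive words with the same tag (tag1 or tag2) forms one chunk, a word with any other tag (NE, number, punctuation, emoticon) joins the chunk before it, and such words at the start of a sentence join the first chunk that follows. The name chunk_sentence and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, str, str]      # (word index, word, tag)
Chunk = Tuple[str, List[Token]]   # (chunk tag, tokens in the chunk)

SPLIT_TAGS = {"tag1", "tag2"}     # only these tags open or extend a chunk


def chunk_sentence(tokens: List[Token]) -> List[Chunk]:
    """Split one sentence into chunks (assumed rule, inferred from the examples)."""
    chunks: List[Chunk] = []
    pending: List[Token] = []     # other-tag words seen before the first tag1/tag2 word
    for token in tokens:
        tag = token[2]
        if tag in SPLIT_TAGS:
            if chunks and chunks[-1][0] == tag:
                chunks[-1][1].append(token)            # same tag: extend the current chunk
            else:
                chunks.append((tag, pending + [token]))  # tag changed: open a new chunk
                pending = []
        elif chunks:
            chunks[-1][1].append(token)                # other tag: attach to the previous chunk
        else:
            pending.append(token)                      # other tag at sentence start: hold it
    # A sentence with no tag1/tag2 word yields no chunks; the examples do not cover that case.
    return chunks


sentence3 = [
    (0, "John", "NE"), (1, "Adam", "NE"), (2, "likes", "tag1"),
    (3, "to", "tag2"), (4, "play", "tag2"), (5, "with", "tag1"),
    (6, "his", "tag1"), (7, "friends", "tag2"), (8, ":)", "emoticon"),
]
for number, (tag, toks) in enumerate(chunk_sentence(sentence3), start=1):
    print(number, tag, " ".join(word for _, word, _ in toks))
# prints:
# 1 tag1 John Adam likes
# 2 tag2 to play
# 3 tag1 with his
# 4 tag2 friends :)
```

Each token keeps its original word index and tag inside the chunk, so nothing is lost when a chunk is written to the tag1 or tag2 file.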
As mentioned above, the following information must be maintained: sentence number, chunk number and word index. This information is needed to reconstruct the sentences, so the script should be able to use the two files (the tag1 and tag2 files) to rebuild the original file (the attached file).
I'm attaching only a sample of the sentences. I will test the script on the original file, which contains a very large number of sentences.
You can write two scripts, one for splitting into two files and the other for joining the two files to form the original file, or just one script that does both tasks.
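One possible way to keep that information is sketched below in Python. The record layout is an assumption, not a required format: each line of a tag file is taken to hold the sentence number, chunk number, word index, word and original tag, separated by tabs. The helper names format_record, parse_record and join_records are hypothetical.

```python
from collections import defaultdict


def format_record(sentence, chunk, word_index, word, tag):
    """One line per word: sentence, chunk, word index, word, original tag (assumed layout)."""
    return "\t".join([str(sentence), str(chunk), str(word_index), word, tag])


def parse_record(line):
    """Inverse of format_record."""
    sentence, chunk, word_index, word, tag = line.rstrip("\n").split("\t")
    return int(sentence), int(chunk), int(word_index), word, tag


def join_records(tag1_lines, tag2_lines):
    """Merge the lines of both files and restore the word order of each sentence."""
    records = [parse_record(line) for line in [*tag1_lines, *tag2_lines] if line.strip()]
    records.sort(key=lambda r: (r[0], r[2]))   # sentence number, then word index
    sentences = defaultdict(list)
    for sentence, _chunk, word_index, word, tag in records:
        sentences[sentence].append((word_index, word, tag))
    return dict(sentences)


# Lines as they might appear in the two files for sentence1 and sentence2.
tag1_lines = [
    format_record(1, 1, 0, "He", "tag1"), format_record(1, 1, 1, "is", "tag1"),
    format_record(2, 2, 2, "so", "tag1"), format_record(2, 2, 3, "much", "tag1"),
    format_record(2, 2, 4, "!!", "punctuation"),
]
tag2_lines = [
    format_record(1, 2, 2, "a", "tag2"), format_record(1, 2, 3, "good", "tag2"),
    format_record(1, 2, 4, "person", "tag2"), format_record(1, 2, 5, ".", "punctuation"),
    format_record(2, 1, 0, "Thank", "tag2"), format_record(2, 1, 1, "you", "tag2"),
]
restored = join_records(tag1_lines, tag2_lines)
for number in sorted(restored):
    print(number, " ".join(word for _, word, _ in restored[number]))
# prints:
# 1 He is a good person .
# 2 Thank you so much !!
```

Sorting on sentence number and word index is enough to rebuild the original rows. The chunk number is not needed for the join itself, but it is carried along so that each file can still be read chunk by chunk.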
Word-Index | Word | Tag
0 | He | tag1
1 | is | tag1
2 | a | tag2
3 | good | tag2
4 | person | tag2
5 | . | punctuation
0 | Thank | tag2
1 | you | tag2
2 | so | tag1
3 | much | tag1
4 | !! | punctuation
0 | John | NE
1 | Adam | NE
2 | likes | tag1
3 | to | tag2
4 | play | tag2
5 | with | tag1
6 | his | tag1
7 | friends | tag2
8 | :) | emoticon
0 | Netflix | NE
1 | has | tag1
2 | almost | tag2
3 | 75 | number
4 | million | number
5 | global | tag1
6 | subscribers | tag2