Segmentation - Sablayrolles/debates GitHub Wiki
You first need to start the Stanford CoreNLP server with:

```shell
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
```
## Segmentation of discourse into a list of sentences
```python
import my_coreNLP.parseNLP as parseNLP

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."

sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences)  # split into sentences
```
## Segmentation of sentences into EDUs using punctuation
```python
import my_coreNLP.parseNLP as parseNLP
import my_coreNLP.segment as segment

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."

sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences)  # split into sentences

sSpliter = segment.Spliter(sNLP)
# Use this instead to define exactly where you want to cut:
# sSpliter = segment.Spliter(sNLP, list_punct_simple=[';', ':', '(', ')'], list_punct_cmplx=["--"])

EDU_punct_tab = []
for s in sentences_tab:
    EDU_punct_tab.extend(sSpliter.punct_split(s))  # split on punctuation marks
```
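To illustrate what punctuation-based splitting does, here is a minimal, self-contained sketch. It is not the project's implementation (`Spliter.punct_split` works on CoreNLP output, not raw strings); it only approximates the idea with a regex, reusing the default punctuation lists shown in the commented-out constructor call above (`;`, `:`, `(`, `)` and the complex marker `--`).

```python
import re

def punct_split_sketch(sentence,
                       simple=(';', ':', '(', ')'),
                       complex_marks=('--',)):
    """Cut a sentence into EDU candidates at punctuation marks."""
    # Put multi-character marks first so '--' is matched as a whole,
    # not consumed as two separate hyphens.
    marks = list(complex_marks) + list(simple)
    pattern = '|'.join(re.escape(m) for m in marks)
    parts = re.split(pattern, sentence)
    return [p.strip() for p in parts if p.strip()]

print(punct_split_sketch("so what -- is so bad"))
# ['so what', 'is so bad']
```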
## Segmentation of EDUs into smaller EDUs using link words
```python
import my_coreNLP.parseNLP as parseNLP
import my_coreNLP.segment as segment

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."

sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences)  # split into sentences

sSpliter = segment.Spliter(sNLP)
# Use this instead to define exactly where you want to cut:
# sSpliter = segment.Spliter(sNLP, list_punct_simple=[';', ':', '(', ')'], list_punct_cmplx=["--"])

EDU_punct_tab = []
for s in sentences_tab:
    EDU_punct_tab.extend(sSpliter.punct_split(s))  # split on punctuation marks

# Use this instead to define your own set of link words:
# linkW = segment.LinksWords()
# linkW.add('and')
# linkW.add('also')
# linkW.add('although')
# linkW.add('as')
# linkW.add('because')
# sSpliter = segment.Spliter(sNLP, link_words=linkW)

EDUs = sSpliter.linkwords_split(EDU_punct_tab)  # split on link words and minor punctuation
```
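The link-word pass can be sketched in the same spirit. This is a naive approximation, not `Spliter.linkwords_split` itself: the real splitter has access to the CoreNLP parse and can decide which occurrences of a link word actually open a new clause, while this sketch cuts at every occurrence (so it wrongly splits coordinated phrases like "Republicans and Democrats").

```python
def linkword_split_sketch(edus, link_words=('and', 'because', 'but')):
    """Cut each EDU before every occurrence of a link word."""
    out = []
    for edu in edus:
        current = []
        for tok in edu.split():
            if tok.lower() in link_words and current:
                out.append(' '.join(current))
                current = [tok]  # the link word opens the new EDU
            else:
                current.append(tok)
        if current:
            out.append(' '.join(current))
    return out

print(linkword_split_sketch(["Republicans and Democrats agree but we have no leadership"]))
# ['Republicans', 'and Democrats agree', 'but we have no leadership']
```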
## Results

Three splitters were evaluated:

- Sentences splitter: splits on sentence boundaries only
- Punctuation splitter: splits on sentence boundaries and punctuation
- Linkword splitter: splits on sentence boundaries, punctuation and link words
Precision (p) and recall (r) for each splitter, by paragraph size:

Paragraph size | Sentences splitter | Punctuation splitter | Linkword splitter |
---|---|---|---|
small | p = 1.0, r = 0.74 | p = 0.94, r = 0.83 | p = 0.89, r = 1.0 |
medium | p = 0.97, r = 0.74 | p = 0.92, r = 0.80 | p = 0.89, r = 0.81 |
long | p = 1.0, r = 0.64 | p = 1.0, r = 0.72 | p = 0.92, r = 0.88 |
Corpus | p = 0.99, r = 0.73 | p = 0.91, r = 0.75 | p = 0.89, r = 0.95 |
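Since precision and recall move in opposite directions across the splitters, the F1 score (the harmonic mean of p and r) makes the corpus-level comparison easier. Computed from the Corpus row of the table:

```python
# F1 = 2pr / (p + r), using the Corpus row values.
corpus = {
    "Sentences splitter": (0.99, 0.73),
    "Punctuation splitter": (0.91, 0.75),
    "Linkword splitter": (0.89, 0.95),
}
for name, (p, r) in corpus.items():
    print(f"{name}: F1 = {2 * p * r / (p + r):.2f}")
# Sentences splitter: F1 = 0.84
# Punctuation splitter: F1 = 0.82
# Linkword splitter: F1 = 0.92
```

The linkword splitter trades a little precision for much higher recall and ends up with the best corpus-level F1.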
Some example files can help you use this project:
File | Description |
---|---|
example_s_segmentation.py | Segmentation of a paragraph into a list of sentences |
example_sp_segmentation.py | Segmentation of a paragraph into a list of EDUs using punctuation |
example_spl_segmentation.py | Segmentation of a paragraph into a list of EDUs using punctuation and link words |