Debates wiki -- Segmentation

Requirements

You need to run the CoreNLP server with the following command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
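
To check that the server is actually answering on port 9000, here is a minimal sketch using the requests library (requests is not part of this project, and the URL is an assumption about a default local setup):

import requests

# The CoreNLP server serves a small landing page on its root URL,
# so a simple GET is enough to verify it is reachable.
try:
    response = requests.get("http://localhost:9000", timeout=5)
    print("CoreNLP server is up, HTTP status:", response.status_code)
except requests.exceptions.ConnectionError:
    print("CoreNLP server is not reachable on port 9000")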

Sentence segmentation

Segmentation of a discourse into a list of sentences.

import my_coreNLP.parseNLP as parseNLP

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."
 
sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences) # sentence-level segmentation
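
Assuming segmente returns a plain Python list with one string per sentence (which is how the rest of this page uses it), you can inspect the result directly:

# Print each detected sentence with its index.
for i, sentence in enumerate(sentences_tab):
    print(i, sentence)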

EDU segmentation

Punctuation

Segmentation of sentences into EDUs using punctuation.

import my_coreNLP.parseNLP as parseNLP
import my_coreNLP.segment as segment

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."
 
sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences) # sentence-level segmentation

sSpliter = segment.Spliter(sNLP)
#Use this instead to define exactly where you want to cut:
#
#sSpliter = segment.Spliter(sNLP, list_punct_simple = [';',':','(',')'], list_punct_cmplx = ["--"])

EDU_punct_tab = []
for s in sentences_tab:
    EDU_punct_tab.extend(sSpliter.punct_split(s)) # punctuation-based segmentation
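
Since the loop above extends EDU_punct_tab with each result, punct_split presumably returns the EDUs of one sentence, so EDU_punct_tab ends up as a flat list of EDUs for the whole paragraph. A quick way to inspect it:

# Compare granularity: every sentence yields one or more punctuation-based EDUs.
print(len(sentences_tab), "sentences ->", len(EDU_punct_tab), "EDUs")
for edu in EDU_punct_tab:
    print("-", edu)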

Punctuation and link words

Segmentation of EDUs into smaller EDUs using link words.

import my_coreNLP.parseNLP as parseNLP
import my_coreNLP.segment as segment

sentences = "It's a great thing for companies to expand. And when these people are going to put billions and billions of dollars into companies, and when they're going to bring $2.5 trillion back from overseas, where they can't bring the money back, because politicians like Secretary Clinton won't allow them to bring the money back, because the taxes are so onerous, and the bureaucratic red tape, so what -- is so bad. So what they're doing is they're leaving our country, and they're, believe it or not, leaving because taxes are too high and because some of them have lots of money outside of our country. And instead of bringing it back and putting the money to work, because they can't work out a deal. We have a president that can't sit them around a table and get them to approve something. And here's the thing. Republicans and Democrats agree but we have no leadership. And honestly, that starts with Secretary Clinton."
 
sNLP = parseNLP.StanfordNLP()
sentences_tab = sNLP.segmente(sentences) # sentence-level segmentation

sSpliter = segment.Spliter(sNLP)
#Use this instead to define exactly where you want to cut:
#
#sSpliter = segment.Spliter(sNLP, list_punct_simple = [';',':','(',')'], list_punct_cmplx = ["--"])
EDU_punct_tab = []
for s in sentences_tab:
    EDU_punct_tab.extend(sSpliter.punct_split(s)) # punctuation-based segmentation
    
#Use this instead to define exactly where you want to cut
#and which link words should trigger a split:
#
#sSpliter = segment.Spliter(sNLP, list_punct_simple = [';',':','(',')'], list_punct_cmplx = ["--"])
#linkW = segment.LinksWords()
#linkW.add('and')
#linkW.add('also')
#linkW.add('although')
#linkW.add('as')
#linkW.add('because')
#
#sSpliter = segment.Spliter(sNLP, link_words=linkW)
EDUs = sSpliter.linkwords_split(EDU_punct_tab) # segmentation on link words and finer punctuation
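
To see how much finer each stage segments the paragraph, you can compare the sizes of the three results:

# Compare granularity across the three segmentation levels.
print("sentences:", len(sentences_tab))
print("punctuation EDUs:", len(EDU_punct_tab))
print("final EDUs:", len(EDUs))
for edu in EDUs:
    print("-", edu)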

Precision and recall for splitters

  • Sentence splitter: splits on sentence boundaries only
  • Punctuation splitter: splits on sentence boundaries and punctuation
  • Linkword splitter: splits on sentence boundaries, punctuation and link words
| Paragraph size | Sentence splitter  | Punctuation splitter | Linkword splitter  |
|----------------|--------------------|----------------------|--------------------|
| small          | p = 1.0, r = 0.74  | p = 0.94, r = 0.83   | p = 0.89, r = 1.0  |
| medium         | p = 0.97, r = 0.74 | p = 0.92, r = 0.80   | p = 0.89, r = 0.81 |
| long           | p = 1.0, r = 0.64  | p = 1.0, r = 0.72    | p = 0.92, r = 0.88 |
| Corpus         | p = 0.99, r = 0.73 | p = 0.91, r = 0.75   | p = 0.89, r = 0.95 |
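
For reference, precision and recall for a splitter can be computed by comparing predicted segment boundaries with gold-standard boundaries. The following is a generic sketch of that computation (an assumption about the evaluation protocol, not this project's own evaluation code):

def boundary_precision_recall(predicted, gold):
    """Precision/recall of predicted boundaries against gold boundaries,
    both given as sets of positions. Generic sketch, not project code."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Example: 3 of 4 predicted boundaries are correct (p = 0.75),
# and 3 of 5 gold boundaries are found (r = 0.6).
p, r = boundary_precision_recall({5, 12, 20, 31}, {5, 12, 20, 27, 40})
print(p, r)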

Examples

The example files below can help you use this project:

| File                         | Description                                                                      |
|------------------------------|----------------------------------------------------------------------------------|
| example_s_segmentation.py    | Segmentation of a paragraph into a list of sentences                             |
| example_sp_segmentation.py   | Segmentation of a paragraph into a list of EDUs using punctuation                |
| example_spl_segmentation.py  | Segmentation of a paragraph into a list of EDUs using punctuation and link words |

Wiki home page: Home
