WeSearch_DescriptiveStatistics - delph-in/docs GitHub Wiki
We begin by reproducing the methodology of Baldwin et al. (2013) using the WDC, with the following exceptions:
- tokenisation using REPP, with punctuation removed from tokens
Language Mix
According to langid.py (Lui and Baldwin, 2012), 100% of the WDC is identified as English. Reassuring.
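As a minimal sketch of how such a language-mix percentage can be computed, assuming langid.py has already been run to produce one language label per document (the `labels` data below is illustrative, not actual WDC output; langid.py's `classify()` returns a `(language, score)` pair, from which the label would be taken):

```python
from collections import Counter

def language_mix(labels):
    """Return the percentage of documents carrying each language label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}

# Illustrative per-document labels, as might be collected from langid.py
labels = ["en"] * 98 + ["de", "fr"]
print(language_mix(labels))  # {'en': 98.0, 'de': 1.0, 'fr': 1.0}
```

A result of `{'en': 100.0}` over the whole collection would correspond to the figure reported above.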
References
Baldwin, T., Cook, P., Lui, M., MacKinlay, A., and Wang, L. (2013). "How Noisy Social Media Text, How Diffrnt Social Media Sources?" in Proceedings of the International Joint Conference on Natural Language Processing, pp. 356-364.
Lui, M. and Baldwin, T. (2012). "langid.py: An Off-the-shelf Language Identification Tool" in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Demo Session, pp. 25-30.