SciELO - Scientific Electronic Library Online

 
vol.32 número2Pairwise networks for feature ranking of a geomagnetic storm modelBenign interpolation of noise in deep learning índice de autoresíndice de materiabúsqueda de artículos
Home Pagelista alfabética de revistas  

Servicios Personalizados

Articulo

Indicadores

Links relacionados

  • En proceso de indezaciónCitado por Google
  • En proceso de indezaciónSimilares en Google

Compartir


South African Computer Journal

versión On-line ISSN 2313-7835
versión impresa ISSN 1015-7999

Resumen

ORIOLA, Oluwafemi  y  KOTZE, Eduan. Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter. SACJ [online]. 2020, vol.32, n.2, pp.56-79. ISSN 2313-7835.  http://dx.doi.org/10.18489/sacj.v32i2.847.

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.CATEGORIES: Computing computing ~ Natural language processing and sentiment analysis Computing methodologies ~ Text classification and information extraction

Palabras clave : South Africa; Twitter; abusive language; machine learning; semi-supervised learning.

        · texto en Inglés     · Inglés ( pdf )

 

Creative Commons License Todo el contenido de esta revista, excepto dónde está identificado, está bajo una Licencia Creative Commons