SciELO RSS <![CDATA[South African Computer Journal]]> http://www.scielo.org.za/rss.php?pid=2313-783520200002&lang=es vol. 32 num. 2 http://www.scielo.org.za

<![CDATA[<b>Editorial: More Covid</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200001&lng=es&nrm=iso&tlng=es

<![CDATA[<b>Guest Editorial: FAIR 2019 special issue</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200002&lng=es&nrm=iso&tlng=es

<![CDATA[<b>Clustering Residential Electricity Consumption Data to Create Archetypes that Capture Household Behaviour in South Africa</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200003&lng=es&nrm=iso&tlng=es Clustering is frequently used in the energy domain to identify dominant electricity consumption patterns of households, which can be used to construct customer archetypes for long-term energy planning. Selecting a useful set of clusters, however, requires extensive experimentation and domain knowledge. While internal clustering validation measures are well established in the electricity domain, they are of limited use for selecting useful clusters. Based on an application case study in South Africa, we present an approach for formalising implicit expert knowledge as external evaluation measures to create customer archetypes that capture variability in residential electricity consumption behaviour. By combining internal and external validation measures in a structured manner, we were able to evaluate clustering structures based on the utility they present for our application. We validate the selected clusters in a use case where we successfully reconstruct customer archetypes previously developed by experts. Our approach shows promise for transparent and repeatable cluster ranking and selection by data scientists, even if they have limited domain knowledge. CATEGORIES: • Computing methodologies ~ Cluster analysis • Applied computing ~ Engineering
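As a hedged illustration of the kind of combined ranking this abstract describes (the paper's actual external measures are not given in this feed, so the expert score below is an invented stand-in), one might weight an internal index such as the silhouette score against a domain-informed external score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))  # stand-in for hourly household load profiles

def expert_score(labels, X):
    """Hypothetical external measure: reward clusterings whose cluster-mean
    profiles differ strongly, a crude proxy for 'distinct archetypes'."""
    means = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    return np.ptp(means, axis=0).mean()

results = []
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    internal = silhouette_score(X, labels)   # internal validation measure
    external = expert_score(labels, X)       # formalised expert knowledge
    results.append((k, internal, external, internal * external))

# Rank candidate clusterings by the combined measure.
for k, i, e, c in sorted(results, key=lambda r: -r[3]):
    print(f"k={k}: internal={i:.3f} external={e:.3f} combined={c:.3f}")
```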
<![CDATA[<b>Pairwise networks for feature ranking of a geomagnetic storm model</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200004&lng=es&nrm=iso&tlng=es Feedforward neural networks provide the basis for complex regression models that produce accurate predictions in a variety of applications. However, they generally do not explicitly provide any information about the utility of each of the input parameters in terms of their contribution to model accuracy. With this in mind, we develop the pairwise network, an adaptation to the fully connected feedforward network that allows the ranking of input parameters according to their contribution to model output. The application is demonstrated in the context of a space physics problem. Geomagnetic storms are multi-day events characterised by significant perturbations to the magnetic field of the Earth, driven by solar activity. Previous storm forecasting efforts typically use solar wind measurements as input parameters to a regression problem tasked with predicting a perturbation index such as the 1-minute cadence symmetric-H (Sym-H) index. We re-visit the task of predicting Sym-H from solar wind parameters, with two 'twists': (i) Geomagnetic storm phase information is incorporated as model inputs and shown to increase prediction performance. (ii) We describe the pairwise network structure and training process - first validating ranking ability on synthetic data, before using the network to analyse the Sym-H problem. CATEGORIES: • Computing methodologies ~ Neural networks • Applied computing ~ Earth and atmospheric sciences

<![CDATA[<b>Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200005&lng=es&nrm=iso&tlng=es Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper therefore presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Network classifiers is evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique than by existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter. CATEGORIES: • Computing methodologies ~ Natural language processing and sentiment analysis • Computing methodologies ~ Text classification and information extraction
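A minimal sketch of the cluster-based majority voting this abstract outlines might look as follows; the toy tweets, labels, and the two feature extractors are invented for illustration, since the paper's exact syntactic and semantic feature pipelines are not given in this feed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def majority(values):
    """Majority vote over a 1-D array of non-negative integer labels."""
    return int(np.bincount(values).argmax())

labelled = ["you are scum", "have a great day", "awful human being", "lovely work"]
y = np.array([1, 0, 1, 0])                      # toy labels: 1 = abusive
unlabelled = ["total scum move", "what a lovely gesture"]
corpus = labelled + unlabelled
n = len(labelled)

# Two assumed feature sets (word and character n-grams) standing in for the
# paper's syntactic and semantic features.
feature_sets = [TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
                TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))]

votes = []
for vec in feature_sets:
    X = vec.fit_transform(corpus).toarray()
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Each cluster inherits the majority label of its labelled members.
    label_of = {c: majority(y[clusters[:n] == c]) for c in set(clusters[:n])}
    votes.append([label_of.get(c, 0) for c in clusters[n:]])

# Final pseudo-label for each unlabelled tweet: majority vote across feature sets.
pseudo = [majority(np.array(col)) for col in zip(*votes)]
print(dict(zip(unlabelled, pseudo)))
```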
<![CDATA[<b>Benign interpolation of noise in deep learning</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200006&lng=es&nrm=iso&tlng=es The understanding of generalisation in machine learning is in a state of flux, in part due to the ability of deep learning models to interpolate noisy training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about the bias-variance tradeoff in learning. We expand upon relevant existing work by discussing local attributes of neural network training within the context of a relatively simple framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the deep learning model to generalise in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterised multilayer perceptrons and controlled training data noise. The main insights are that deep learning models are optimised for training data modularly, with different regions in the function space dedicated to fitting distinct types of sample information. Additionally, we show that models tend to fit uncorrupted samples first. Based on this finding, we propose a conjecture to explain an observed instance of the epoch-wise double-descent phenomenon. Our findings suggest that the notion of model capacity needs to be modified to consider the distributed way in which training data is fitted across sub-units. CATEGORIES: • Computing methodologies ~ Machine learning • Computing methodologies ~ Neural networks • Theory of computation ~ Sample complexity and generalisation bounds

<![CDATA[<b>Using Summary Layers to Probe Neural Network Behaviour</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200007&lng=es&nrm=iso&tlng=es No framework exists that can explain and predict the generalisation ability of deep neural networks in general circumstances. In fact, this question has not been answered for some of the least complicated neural network architectures: fully connected feedforward networks with rectified linear activations and a limited number of hidden layers. For such an architecture, we show how adding a summary layer to the network makes it more amenable to analysis, and allows us to define the conditions that are required to guarantee that a set of samples will all be classified correctly. This process does not describe the generalisation behaviour of these networks, but produces a number of metrics that are useful for probing their learning and generalisation behaviour. We support the analytical conclusions with empirical results, both to confirm that the mathematical guarantees hold in practice, and to demonstrate the use of the analysis process. CATEGORIES: • Computing methodologies ~ Neural networks • Theory of computation ~ Machine learning theory
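The paper's actual summary-layer construction is not reproduced in this feed; as a generic sketch in the same spirit, one can forward samples through a fully connected ReLU network and record simple per-sample probe statistics (classification margin, count of active units), the kind of metrics such an analysis might track:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fully connected ReLU network (the abstract's setting); random weights
# stand in for a trained model.
sizes = [10, 32, 32, 3]                      # input, two hidden layers, classes
weights = [(rng.normal(scale=0.3, size=(a, b)), np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]

def forward_with_probe(x):
    """Forward pass that also returns simple per-sample 'summary' statistics:
    classification margin and the number of active ReLU units."""
    active = 0
    for W, b in weights[:-1]:
        x = np.maximum(0.0, x @ W + b)       # ReLU hidden layer
        active += (x > 0).sum(axis=1)
    W, b = weights[-1]
    logits = x @ W + b
    top2 = np.sort(logits, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]         # gap between best and runner-up
    return logits.argmax(axis=1), margin, active

X = rng.normal(size=(5, sizes[0]))
pred, margin, active = forward_with_probe(X)
for i in range(len(X)):
    print(f"sample {i}: class={pred[i]} margin={margin[i]:.3f} active_units={active[i]}")
```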
<![CDATA[<b>Ht-index for empirical evaluation of the sampled graph-based Discrete Pulse Transform</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200008&lng=es&nrm=iso&tlng=es The Discrete Pulse Transform (DPT) makes use of LULU smoothing to decompose a signal into block pulses. The most recent and effective implementation of the DPT is an algorithm called the Roadmaker's Pavage, which uses a graph-based algorithm that produces a hierarchical tree of pulses as its final output, shown to have important applications in artificial intelligence and pattern recognition. Even though the Roadmaker's Pavage is an efficient implementation, the theoretical structure of the DPT results in a slow, deterministic algorithm. This paper examines the use of the spectral domain of graphs and the design of graph filter banks to downsample the algorithm. We investigate the extent to which this speeds up the algorithm and allows parallel processing. Converting graph signals to the spectral domain can also be a costly overhead, so methods of estimating filter banks are examined, as well as the design of a good filter bank that may be reused without recalculation. The sampled version requires different hyperparameters to reconstruct the same textures of the image as the original algorithm; these were previously selected either through trial and error (subjective) or grid search (costly), which prevented studying the results on many images effectively. Here an objective and efficient way of deriving similar results between the original Roadmaker's Pavage and our proposed Filtered Roadmaker's Pavage is provided. The method makes use of the Ht-index, which separates the distribution of information in the graph at scale intervals by recursively calculating averages on decreasing subsections of the stored scale data. This has enabled empirical research using benchmark datasets, providing improved results. The results of these empirical tests showed that the Filtered Roadmaker's Pavage algorithm consistently runs faster and uses fewer computational resources, while having a positive SSIM (structural similarity) with low variance. This provides an informative and faster approximation to the nonlinear DPT, a property which is not standardly achievable. CATEGORIES: • Mathematics of computing ~ Probability and statistics • Mathematics of computing ~ Probabilistic algorithms

<![CDATA[<b>Algorithmic definitions for KLM-style defeasible disjunctive Datalog</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200009&lng=es&nrm=iso&tlng=es Datalog is a declarative logic programming language that uses classical logical reasoning as its basic form of reasoning. Defeasible reasoning is a form of non-classical reasoning that is able to deal with exceptions to general assertions in a formal manner. The KLM approach to defeasible reasoning is an axiomatic approach based on the concept of plausible inference. Since Datalog uses classical reasoning, it is currently not able to handle defeasible implications and exceptions. We aim to extend the expressivity of Datalog by incorporating KLM-style defeasible reasoning into classical Datalog. We present a systematic approach for extending the KLM properties and a well-known form of defeasible entailment: Rational Closure. We conclude by exploring Datalog extensions of less conservative forms of defeasible entailment: Relevant and Lexicographic Closure. We provide algorithmic definitions for these forms of defeasible entailment and prove that the definitions are LM-rational. CATEGORIES: • Theory of computation ~ Automated reasoning • Theory of computation ~ Logic and databases • Computing methodologies ~ Nonmonotonic, default reasoning and belief revision

<![CDATA[<b>Defeasibility applied to Forrester's paradox</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200010&lng=es&nrm=iso&tlng=es Deontic logic is a logic often used to formalise scenarios in the legal domain. Within the legal domain there are many exceptions and conflicting obligations. This motivates the enrichment of deontic logic with not only the notion of defeasibility, which allows for reasoning about exceptions, but also a stronger notion of typicality that is based on defeasibility. KLM-style defeasible reasoning is a logic system that employs defeasibility, while Propositional Typicality Logic (PTL) is a logic that does the same for the notion of typicality. Deontic paradoxes are often used to examine logic systems, as the paradoxes provide undesirable results even when the scenarios seem intuitive. Forrester's paradox is one of the most famous of these paradoxes. This paper shows that KLM-style defeasible reasoning and PTL can be used to represent and reason with Forrester's paradox in such a way as to block undesirable conclusions without completely sacrificing desirable deontic properties. CATEGORIES: • Theory of computation ~ Logic • Theory of computation ~ Semantics and reasoning
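Both of the preceding entries build on the Rational Closure form of KLM-style defeasible entailment. A toy brute-force propositional sketch of its exceptionality ranking, using the classic birds/penguins example rather than the papers' Datalog or deontic settings, illustrates the mechanics:

```python
from itertools import product

ATOMS = ("bird", "penguin", "flies")

def assignments():
    """All truth assignments over ATOMS."""
    for vals in product((False, True), repeat=len(ATOMS)):
        yield dict(zip(ATOMS, vals))

def entails(rules, formula):
    """Classical entailment, brute force: every model of the materialised
    rules (antecedent -> consequent) satisfies `formula`."""
    return all(formula(m) for m in assignments()
               if all((not a(m)) or c(m) for a, c in rules))

def exceptional(rules, antecedent):
    """An antecedent is exceptional iff the materialised rules entail its negation."""
    return entails(rules, lambda m: not antecedent(m))

def ranks(kb):
    """Partition defeasible rules by exceptionality rank (the Rational Closure ranking)."""
    levels, current = [], list(kb)
    while current:
        lower = [r for r in current if not exceptional(current, r[0])]
        if not lower:               # leftover rules are infinitely exceptional
            levels.append(current)
            break
        levels.append(lower)
        current = [r for r in current if r not in lower]
    return levels

def rc_entails(kb, antecedent, consequent):
    """Does `antecedent |~ consequent` hold under Rational Closure? Drop the
    lowest ranks until the antecedent is no longer exceptional, then check
    classical entailment of the materialised query."""
    levels = ranks(kb)
    for i in range(len(levels) + 1):
        remaining = [r for lvl in levels[i:] for r in lvl]
        if not exceptional(remaining, antecedent):
            return entails(remaining, lambda m: (not antecedent(m)) or consequent(m))
    return True                     # antecedent unsatisfiable at every level

bird, penguin = (lambda m: m["bird"]), (lambda m: m["penguin"])
flies, grounded = (lambda m: m["flies"]), (lambda m: not m["flies"])

kb = [(bird, flies),        # birds typically fly
      (penguin, bird),      # penguins are typically birds
      (penguin, grounded)]  # penguins typically do not fly

print(rc_entails(kb, bird, flies))       # True: birds still fly by default
print(rc_entails(kb, penguin, flies))    # False: the exception is respected
print(rc_entails(kb, penguin, grounded)) # True
```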
<![CDATA[<b>DDLV: A system for rational preferential reasoning for Datalog</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200011&lng=es&nrm=iso&tlng=es Datalog is a powerful language that can be used to represent explicit knowledge and compute inferences in knowledge bases. Datalog cannot, however, represent or reason about contradictory rules. This is a limitation, as contradictions are often present in domains that contain exceptions. In this paper, we extend Datalog to represent contradictory and defeasible information. We define an approach to efficiently reason about contradictory information in Datalog and show that it satisfies the KLM requirements for a rational consequence relation. We introduce DDLV, a defeasible Datalog reasoning system that implements this approach. Finally, we evaluate the performance of DDLV. CATEGORIES: • Computing methodologies ~ Artificial intelligence • Theory of computation ~ Logic

<![CDATA[<b>Exchanging image processing and OCR components in a Setswana digitisation pipeline</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200012&lng=es&nrm=iso&tlng=es As more natural language processing (NLP) applications benefit from neural network based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline includes several components handling the material in sequence, and image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two different approaches for visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper, and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline (Tesseract 3 with ImageMagick/Unpaper) by over 30%, achieving a mean character error rate of 1.69 across all combined test data. CATEGORIES: • Applied computing ~ Optical character recognition • Computing methodologies ~ Image processing
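The improvement above is reported as character error rate (CER). As a self-contained sketch, with an invented Setswana line and hypothetical OCR outputs since the actual test data is not in this feed, CER can be computed from edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Invented ground truth and two hypothetical OCR outputs.
truth = "Dumela mma, o tsogile jang?"
print(cer(truth, "Dumela mma, o tsogile jang?"))   # 0.0, perfect recognition
print(cer(truth, "Dumcla mrna, o tsogile jang?"))  # a few character errors
```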
<![CDATA[<b>Decoding the underlying cognitive processes and related support strategies utilised by expert instructors during source code comprehension</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200013&lng=es&nrm=iso&tlng=es Many novice programmers fail to comprehend source code and its related concepts in the same way that their instructors do. As emphasised in the Decoding the Disciplines (DtDs) framework, each discipline (including Computer Science) has its own unique set of mental operations. However, instructors often take certain important mental operations for granted and do not explain these 'hidden' steps explicitly when modelling problem solutions. A clear understanding of the underlying cognitive processes and related support strategies employed by experts during source code comprehension (SCC) could ultimately be utilised to help novice programmers to better execute the cognitive processes necessary to efficiently comprehend source code. Positioned within Step 2 of the DtDs framework, this study employed decoding interviews and observations, followed by narrative data analysis, to identify the underlying cognitive processes and related (though often 'hidden') support strategies utilised by a select group of experienced programming instructors during an SCC task. The insights gained were then used to formulate a set of important cognitive-related support strategies for efficient SCC. Programming instructors are encouraged to continuously emphasise strategies like these when modelling their expert ways of thinking regarding efficient SCC more explicitly to their novice students. CATEGORIES: • Social and professional topics ~ Computer science education

<![CDATA[<b>A survey of benchmarking frameworks for reinforcement learning</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200014&lng=es&nrm=iso&tlng=es Reinforcement learning has recently experienced increased prominence in the machine learning community. There are many approaches to solving reinforcement learning problems, with new techniques developed constantly. When solving problems using reinforcement learning, there are various difficult challenges to overcome. To ensure progress in the field, benchmarks are important for testing new algorithms and comparing them with other approaches. The reproducibility of results for fair comparison is therefore vital in ensuring that improvements are accurately judged. This paper provides an overview of different contributions to reinforcement learning benchmarking and discusses how they can assist researchers to address the challenges facing reinforcement learning. The contributions discussed are the most used and most recent in the literature. The paper discusses the contributions in terms of their implementation, their tasks, and the algorithm implementations provided with benchmarks. The survey aims to bring attention to the wide range of reinforcement learning benchmarking tasks available and to encourage research to take place in a standardised manner. Additionally, this survey acts as an overview for researchers not familiar with the different tasks that can be used to develop and test new reinforcement learning algorithms. CATEGORIES: • Computing methodologies ~ Reinforcement learning

<![CDATA[<b>Invited Lecture: Notions of 'Theory' and their Practical Consequences in the Discipline of Software 'Engineering' (including Information Systems Design)</b>]]> http://www.scielo.org.za/scielo.php?script=sci_arttext&pid=S2313-78352020000200015&lng=es&nrm=iso&tlng=es
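The benchmarking survey above stresses reproducibility for fair comparison. As a minimal, hedged sketch (assuming the Gymnasium API and a random stand-in policy, neither of which the survey prescribes), a seeded evaluation loop might look like:

```python
import gymnasium as gym
import numpy as np

def evaluate(env_id: str, episodes: int = 10, seed: int = 0) -> float:
    """Run a random policy for a fixed number of episodes with explicit
    seeding, so the benchmark run is reproducible."""
    env = gym.make(env_id)
    env.action_space.seed(seed)
    returns = []
    for ep in range(episodes):
        obs, info = env.reset(seed=seed + ep)   # per-episode seed
        done, total = False, 0.0
        while not done:
            action = env.action_space.sample()  # stand-in for a learned policy
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))

print(evaluate("CartPole-v1"))
```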