George K. Thiruvathukal
Title/s: Professor of Computer Science
Director of CS Department Computing, Visiting Faculty at Argonne National Laboratory
Specialty Area: high performance & distributed computing, cyber-physical systems, software engineering, programming languages and systems, history of computing, computational and data science, computing education, and ethical/legal/social issues in CS.
Office #: Doyle 301
External Webpage: https://thiruvathukal.com/
E-Commons: https://works.bepress.com/gkthiruvathukal
George K. Thiruvathukal holds a PhD (1995) and an MS (1990) in Computer Science from Illinois Institute of Technology and a BA (1988) in Computer Science and Physics with a Mathematics Minor from Lewis University in Romeoville, IL. He is Professor of Computer Science at Loyola University Chicago and Visiting Faculty at Argonne National Laboratory.
For a more detailed bio-sketch, see Dr. Thiruvathukal's website, thiruvathukal.com.
For a list of publications, see Dr. Thiruvathukal's Digital Commons page, works.bepress.com/gkthiruvathukal/.
Publications
Tests as Maintainable Assets Via Auto-generated Spies: A case study involving the Scala collections library's Iterator trait
In testing stateful abstractions, it is often necessary to record interactions, such as method invocations, and express assertions over these interactions. Following the Test Spy design pattern, we can reify such interactions programmatically through additional mutable state. Alternatively, a mocking framework, such as Mockito, can automatically generate test spies that allow us to record the interactions and express our expectations in a declarative domain-specific language. According to our study of the test code for Scala’s Iterator trait, the latter approach can lead to a significant reduction of test code complexity in terms of metrics such as code size (in some cases over 70% smaller), cyclomatic complexity, and amount of additional mutable state required. In this tools paper, we argue that the resulting test code is not only more maintainable, readable, and intentional, but also a better stylistic match for the Scala community than manually implemented, explicitly stateful test spies.
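The contrast between a hand-written spy and an auto-generated one can be sketched as follows. This is a minimal illustration only: it uses Python's unittest.mock rather than the Scala and Mockito setup studied in the paper, and the names ListSource, ManualSpy, and take are hypothetical, chosen to mirror an iterator with hasNext/next methods.

```python
from unittest.mock import Mock


class ListSource:
    """Hypothetical stateful abstraction, loosely modeled on an iterator."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._items)

    def next_item(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


def take(source, n):
    """Code under test: pull at most n items from the source."""
    out = []
    while len(out) < n and source.has_next():
        out.append(source.next_item())
    return out


class ManualSpy:
    """Hand-written spy: records interactions in extra mutable state."""

    def __init__(self, underlying):
        self._underlying = underlying
        self.next_calls = 0  # mutable counter that exists only for testing

    def has_next(self):
        return self._underlying.has_next()

    def next_item(self):
        self.next_calls += 1
        return self._underlying.next_item()


# Manual approach: assert over the explicit counter.
manual = ManualSpy(ListSource([1, 2, 3, 4]))
manual_result = take(manual, 2)

# Auto-generated approach: Mock(wraps=...) forwards calls to the real
# object and records them, so no spy class has to be written by hand.
auto = Mock(wraps=ListSource([1, 2, 3, 4]))
auto_result = take(auto, 2)
```

With the generated spy, the expectation is stated directly against the recorded calls (for example, auto.next_item.call_count), and the ManualSpy class and its counter disappear from the test code.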
Many computational theories have been developed to improve artificial phonetic classification performance from linguistic auditory streams. However, less attention has been given to psycholinguistic data and neurophysiological features recently found in cortical tissue. We focus on a context in which basic linguistic units, such as phonemes, are extracted and robustly classified by humans and other animals from complex acoustic streams in speech data. We are especially motivated by the fact that 8-month-old human infants can accomplish segmentation of words from fluent audio streams based exclusively on the statistical relationships between neighboring speech sounds without any kind of supervision. In this paper, we introduce a biologically inspired and fully unsupervised neurocomputational approach that incorporates key neurophysiological and anatomical cortical properties, including columnar organization, spontaneous micro-columnar formation, adaptation to contextual activations and Sparse Distributed Representations (SDRs) produced by means of partial N-Methyl-D-aspartic acid (NMDA) depolarization. Its feature abstraction capabilities show promising phonetic invariance and generalization attributes. Our model improves the performance of a Support Vector Machine (SVM) classifier for monosyllabic, disyllabic and trisyllabic word classification tasks in the presence of environmental disturbances such as white noise, reverberation, and pitch and voice variations. Furthermore, our approach emphasizes potential self-organizing cortical principles achieving improvement without any kind of optimization guidance which could minimize hypothetical loss functions by means of, for example, backpropagation. Thus, our computational model outperforms multiresolution spectro-temporal auditory feature representations using only the statistical sequential structure immersed in the phonotactic rules of the input stream.
As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory access and data sharing are becoming performance bottlenecks. Cloud computing employs a data processing paradigm typically built on a loosely connected group of low-cost computing nodes without relying upon shared storage and/or memory. Apache Spark is a popular engine for large-scale data analysis in the cloud, which we have successfully deployed via job submission scripts on production clusters.
In this paper, we describe common parallel analysis dataflows for both Message Passing Interface (MPI) and cloud-based applications. We developed an effective benchmark to measure the performance characteristics of these tasks using both types of systems, specifically comparing MPI/C-based analyses with Spark. The benchmark is a data processing pipeline representative of a typical analytics framework implemented using map-reduce. In the case of Spark, we also consider whether language plays a role by writing tests using both Python and Scala, a language built on the Java Virtual Machine (JVM). We include performance results from two large systems at Argonne National Laboratory, including Theta, a Cray XC40 supercomputer on which our experiments run with 65,536 cores (1024 nodes with 64 cores each). The results of our experiments are discussed in the context of their applicability to future HPC architectures. Beyond understanding performance, our work demonstrates that technologies such as Spark, while typically aimed at multi-tenant cloud-based environments, show promise for data analysis needs in a traditional clustering/supercomputing environment.
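The abstract does not spell out the benchmark's stages, so the following is only a toy sketch of the general map-reduce shape such a pipeline takes. It uses the Python standard library rather than Spark or MPI, and the record layout (key, value) and the per-key mean are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Map: emit (key, (value, 1)) so sums and counts can be combined."""
    for key, value in records:
        yield key, (value, 1)


def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine (sum, count) pairs per key and compute the mean."""
    means = {}
    for key, values in groups.items():
        total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
        means[key] = total / count
    return means


def mean_by_key(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical input records: (key, measurement)
sample = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
result = mean_by_key(sample)
```

In a distributed engine the map and reduce functions keep this shape, while the shuffle step becomes network communication between nodes; that step is where the memory access and data sharing costs mentioned above tend to appear.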
Background: Developers face challenges in building high-quality research software due to its inherent complexity. These challenges can reduce the confidence users have in the quality of the result produced by the software. Use of a defined software development process, which divides the development into distinct phases, results in improved design, more trustworthy results, and better project management. Aims: This paper focuses on gaining a better understanding of the use of software development processes for research software. Method: We surveyed research software developers to collect information about their use of software development processes. We analyze whether demographic factors influence the respondents' use of and perceived value in a defined process. Results: Based on 98 responses, research software developers appear to follow a defined software development process at least some of the time. The respondents also have a strong positive perception about the value of following processes. Conclusions: To produce high-quality and reliable research software, which is critical for many research domains, research software developers must follow a proper software development process. The results indicate a positive perception of value about using defined development processes that should lead to both short-term benefits through improved results and long-term benefits through more maintainable software.
A recent editorial in Nature Methods, “Giving Software its Due”, described challenges related to the development of research software and highlighted, in particular, the challenge of software publication and citation. Here, we call attention to a system that we have developed that enables community-driven software review, publication, and citation: The Journal of Open Source Software (JOSS) is an open-source project and an open access journal that provides a light-weight publishing process for research software. Focused on and based in open platforms and on a community of contributors, JOSS evidently satisfies a pressing need, having already published more than 500 articles in approximately three years of existence.
This paper shows how students can be guided to integrate elementary mathematical analyses with motion planning for typical educational robots. Rather than using calculus as in comprehensive works on motion planning, we show students can achieve interesting results using just simple linear regression tools and trigonometric analyses. Experiments with one robotics platform show that use of these tools can lead to passable navigation through dead reckoning even if students have limited experience with use of sensors, programming, and mathematics.
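A minimal sketch of the kind of calculation described, under stated assumptions: the calibration data (motor run time versus distance travelled), the function names, and the units are all hypothetical and are not taken from the paper. The regression step calibrates how far the robot moves per second, and the trigonometric step turns a target position into a heading and a distance.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def plan_leg(x, y, target_x, target_y):
    """Heading (degrees, counterclockwise from the x axis) and distance to a target."""
    dx, dy = target_x - x, target_y - y
    return math.degrees(math.atan2(dy, dx)), math.hypot(dx, dy)


def drive_time(distance, slope, intercept):
    """Invert the fitted line to get the run time needed for a distance."""
    return (distance - intercept) / slope


# Hypothetical calibration runs: seconds of motor time, centimetres travelled.
times = [1.0, 2.0, 3.0]
distances = [11.0, 21.0, 31.0]
slope, intercept = fit_line(times, distances)

# Plan a move from the origin to the point (30, 40), in centimetres.
heading, distance = plan_leg(0.0, 0.0, 30.0, 40.0)
seconds = drive_time(distance, slope, intercept)
```

Because dead reckoning uses no sensor feedback, any error in the fitted slope accumulates over successive legs, which fits the abstract's description of the resulting navigation as passable rather than precise.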
Computer vision relies on labeled datasets for training and evaluation in detecting and recognizing objects. The popular computer vision program, YOLO ("You Only Look Once"), has been shown to accurately detect objects in many major image datasets. However, the images found in those datasets are independent of one another and cannot be used to test YOLO's consistency at detecting the same object as its environment (e.g., ambient lighting) changes. This paper describes a novel effort to evaluate YOLO's consistency for large-scale applications. It does so by (a) working at large scale and (b) using consecutive images from a curated network of public video cameras deployed in a variety of real-world situations, including traffic intersections, national parks, shopping malls, university campuses, etc. We specifically examine YOLO's ability to detect objects in different scenarios (e.g., daytime vs. night), leveraging the cameras' ability to rapidly retrieve many successive images for evaluating detection consistency. Using our camera network and advanced computing resources (supercomputers), we analyzed more than 5 million images captured by 140 network cameras in 24 hours. Compared with labels marked by humans (considered as "ground truth"), YOLO struggles to consistently detect the same humans and cars as their positions change from one frame to the next; it also struggles to detect objects at nighttime. Our findings suggest that state-of-the-art vision solutions should be trained on data from network cameras with contextual information before they can be deployed in applications that demand high consistency in object detection.
The human brain is the most complex object created by evolution in the known universe. Yet, how much of this complexity is devoted exclusively to carrying out its algorithmic capabilities and how much of it has been inherited from biological paths of evolution in order to work properly in its physical environment? What if the information processing properties of the brain could be reduced to a few simple columnar rules replicated throughout the neocortex? In our research project we seek those principles by developing computational models of the neocortex.
This repository holds three files:
- File S1 Data. Cortical Spectro-Temporal Model (CSTM) experimental results data. It contains a spreadsheet including all the numerical results returned by the experiments as well as the complete Statistical Significance tests conducted in this work.
- File S2 Data. CSTM complementary experimental results data. It contains a spreadsheet that includes all the numerical results returned by a complementary set of experiments.
- File S1 Appendix. Computational Setup and Complementary Experiments. An appendix including the Computational Setup of our CSTM, which describes its object oriented inheritance structure as well as the parallelization strategy used for its implementation in High Performance Computing (HPC) resources. We also include Strong and Weak scaling tests of our implementation on HPC resources. An appendix including a battery of complementary experiments showing the classification accuracy levels of different instances of the EL in the CSTM.
ZIP files of folders containing all the datasets (audio file corpora) employed in our research to train the Encoder Layer (EL) and the SVMs and to test the complete CSTM. This folder includes a set of 840 corpora which are distributed in 2 corpora for each configuration organized by 2 sets of synthesized voices, 3 syllabic conditions (i.e. mono-, di- and tri-syllabic English words) and 10 completely different vocabularies all distributed in 6 acoustic variants, beyond the original version of the corpora.
The 6 acoustic variants correspond to: two levels of white noise (19.8 dB and 13.8 dB Signal to Noise Ratio (SNR) average Root Mean Square (RMS) power rate), two levels of reverberation (Reverberation-Time 60 dB (RT-60) value of 0.61 seconds and 1.78 seconds) and variations of pitch in both directions (from E to G and from E to C).
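The total of 840 corpora can be cross-checked by enumerating the configurations described above. This sketch reflects one reading of the description (2 corpora for each combination of voice set, syllabic condition, vocabulary, and acoustic version, where the original plus the 6 variants give 7 versions); the labels are hypothetical and do not correspond to actual folder names in the dataset.

```python
from itertools import product

CORPORA_PER_CONFIGURATION = 2
VOICE_SETS = ["voices_1", "voices_2"]  # hypothetical labels
SYLLABIC_CONDITIONS = ["mono", "di", "tri"]
VOCABULARIES = list(range(10))
ACOUSTIC_VERSIONS = [
    "original",
    "white_noise_snr_19.8dB",
    "white_noise_snr_13.8dB",
    "reverb_rt60_0.61s",
    "reverb_rt60_1.78s",
    "pitch_E_to_G",
    "pitch_E_to_C",
]


def enumerate_corpora():
    """Yield one tuple per corpus under the assumed layout."""
    for voices, syllables, vocabulary, version in product(
        VOICE_SETS, SYLLABIC_CONDITIONS, VOCABULARIES, ACOUSTIC_VERSIONS
    ):
        for index in range(CORPORA_PER_CONFIGURATION):
            yield voices, syllables, vocabulary, version, index


total = sum(1 for _ in enumerate_corpora())
```

Under this reading the count is 2 x 2 x 3 x 10 x 7 = 840, which matches the stated size of the collection.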