Jiasi Shen: Papers

Peer-Reviewed Publications

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim. TBD. A Survey on Large Language Models for Code Generation. To appear in TOSEM. https://arxiv.org/abs/2406.00515.
Jiasi Shen, Martin Rinard, Nikos Vasilakis. 2022. POSTER: Automatic Synthesis of Parallel Unix Commands and Pipelines with KumQuat. 27th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), poster.
Abstract Paper (pdf) DOI Slides (pptx)

We present KumQuat, a system for automatically generating data-parallel implementations of Unix shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the original pipeline commands to process the splits in parallel, then combine the resulting parallel outputs to produce the final output stream. KumQuat automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners. We present experimental results that show that these combiners enable the effective parallelization of our benchmark scripts.
Paper (pdf)
Nikos Vasilakis, Achilles Benetopoulos, Shivam Handa, Alizee Schoen, Jiasi Shen, Martin Rinard. 2021. Supply-Chain Vulnerability Elimination via Active Learning and Regeneration. ACM SIGSAC Conference on Computer and Communications Security (CCS).
Abstract Paper (pdf) DOI

Software supply-chain attacks target components that are integrated into client applications. Such attacks often target widely-used components, with the attack taking place via operations (for example, file system or network accesses) that do not affect those aspects of component behavior that the client observes. We propose new active library learning and regeneration (ALR) techniques for inferring and regenerating the client-observable behavior of software components. Using increasingly sophisticated rounds of exploration, ALR generates inputs, provides these inputs to the component, and observes the resulting outputs to infer a model of the component's behavior as a program in a domain-specific language. We present Harp, an ALR system for string processing components. We apply Harp to successfully infer and regenerate string-processing components written in JavaScript and C/C++. Our results indicate that, in the majority of cases, Harp completes the regeneration in less than a minute, remains fully compatible with the original library, and delivers performance indistinguishable from the original library. We also demonstrate that Harp can eliminate vulnerabilities associated with libraries targeted in several highly visible security incidents, specifically event-stream, left-pad, and string-compare.
Paper (pdf)
Jiasi Shen, Martin Rinard. 2021. Active Learning for Inference and Regeneration of Applications that Access Databases. ACM Transactions on Programming Languages and Systems (TOPLAS).
Abstract Paper (pdf) DOI

We present Konure, a new system that uses active learning to infer models of applications that retrieve data from relational databases. Konure comprises a domain-specific language (each model is a program in this language) and associated inference algorithm that infers models of applications whose behavior can be expressed in this language. The inference algorithm generates inputs and database configurations, runs the application, then observes the resulting database traffic and outputs to progressively refine its current model hypothesis. Because the technique works with only externally observable inputs, outputs, and database configurations, it can infer the behavior of applications written in arbitrary languages using arbitrary coding styles (as long as the behavior of the application is expressible in the domain-specific language). Konure also implements a regenerator that produces a translated Python implementation of the application that systematically includes relevant security and error checks.
Paper (pdf)
Jürgen Cito¹, Jiasi Shen¹, Martin Rinard. 2020. An Empirical Study on the Impact of Deimplicitization on Comprehension in Programs Using Application Frameworks. 17th International Conference on Mining Software Repositories (MSR), registered report.
Abstract Paper (pdf) DOI Talk

Background: Application frameworks, such as Ruby on Rails, introduce abstractions with the goal of simplifying development for particular application domains, such as web development. While experts enjoy increased productivity due to these abstractions, the flow of the programs is often hard to understand for non-experts and newcomers due to implicit flow and concealed lower level action that seems like "magic".
Objective: We conjecture that converting these implicit flows into an explicit and unified form can help non-experts comprehend the programs using these frameworks. We call the process of unifying distributed, implicit flows into a single routine deimplicitization.
Method: We want to conduct an experiment that studies the impact of deimplicitization on program comprehension. Particularly, we want to study how software developers with different expertise (novices/students, framework experts/professional developers) can answer comprehension questions differently with respect to time and correctness, under the treatments of either a deimplicitized version of the program in Python or the original version of the program in Ruby on Rails.
Paper (pdf)
José Cambronero, Thurston Dang, Nikos Vasilakis, Jiasi Shen, Jerry Wu, Martin Rinard. 2019. Active Learning for Software Engineering. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!).
Abstract Paper (pdf) DOI

Software applications have grown increasingly complex to deliver the features desired by users. Software modularity has been used as a way to mitigate the costs of developing such complex software. Active learning-based program inference provides an elegant framework that exploits this modularity to tackle development correctness, performance and cost in large applications. Inferred programs can be used for many purposes, including generation of secure code, code re-use through automatic encapsulation, adaptation to new platforms or languages, and optimization. We show through detailed examples how our approach can infer three modules in a representative application. Finally, we outline the broader paradigm and open research questions.
Paper (pdf)
José Cambronero¹, Jiasi Shen¹, Jürgen Cito¹, Elena Glassman, Martin Rinard. 2019. Characterizing Developer Use of Automatically Generated Patches. IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), short paper.
Abstract Paper (pdf) DOI

We present a study that characterizes the way developers use automatically generated patches when fixing software defects. Our study tasked two groups of developers with repairing defects in C programs. Both groups were provided with the defective line of code. One was also provided with five automatically generated and validated patches, all of which modified the defective line of code, and one of which was correct. Contrary to our initial expectations, the group with access to the generated patches did not produce more correct patches and did not produce patches in less time. We characterize the main behaviors observed in experimental subjects: a focus on understanding the defect and the relationship of the patches to the original source code. Based on this characterization, we highlight various potentially productive directions for future developer-centric automatic patch generation systems.
Paper (pdf)
Jiasi Shen, Martin Rinard. 2019. Using Active Learning to Synthesize Models of Applications that Access Databases. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
Abstract Paper (pdf) DOI Code Slides (pptx) Talk

We present Konure, a new system that uses active learning to infer models of applications that access relational databases. Konure comprises a domain-specific language (each model is a program in this language) and associated inference algorithm that infers models of applications whose behavior can be expressed in this language. The inference algorithm generates inputs and database configurations, runs the application, then observes the resulting database traffic and outputs to progressively refine its current model hypothesis. Because the technique works with only externally observable inputs, outputs, and database configurations, it can infer the behavior of applications written in arbitrary languages using arbitrary coding styles (as long as the behavior of the application is expressible in the domain-specific language). Konure also implements a regenerator that produces a translated Python implementation of the application that systematically includes relevant security and error checks.
Paper (pdf)
Martin Rinard, Jiasi Shen, Varun Mangalick. 2018. Active Learning for Inference and Regeneration of Computer Programs that Store and Retrieve Data. Proceedings of the 2018 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!).
Abstract Paper (pdf) DOI Slides (pptx)

As modern computation platforms become increasingly complex, their programming interfaces are increasingly difficult to use. This complexity is especially inappropriate given the relatively simple core functionality that many of the computations implement. We present a new approach for obtaining software that executes on modern computing platforms with complex programming interfaces. Our approach starts with a simple seed program, written in the language of the developer's choice, that implements the desired core functionality. It then systematically generates inputs and observes the resulting outputs to learn the core functionality. It finally automatically regenerates new code that implements the learned core functionality on the target computing platform. This regenerated code contains boilerplate code for the complex programming interfaces that the target computing platform presents. By providing a productive new mechanism for capturing and encapsulating knowledge about how to use modern complex interfaces, this new approach promises to greatly reduce the developer effort required to obtain secure, robust software that executes on modern computing platforms.
Paper (pdf)
Jiasi Shen, Martin Rinard. 2017. Robust Programs with Filtered Iterators. Proceedings of 2017 ACM SIGPLAN International Conference on Software Language Engineering (SLE). Distinguished Artifact Award.
Abstract Paper (pdf) DOI Artifact Slides (pptx) Talk

We present a new language construct, filtered iterators, for robust input processing. Filtered iterators are designed to eliminate many common input processing errors while enabling robust continued execution. The design is inspired by (1) observed common input processing errors and (2) successful strategies implemented by human developers fixing input processing errors. Filtered iterators decompose inputs into input units and atomically and automatically discard units that trigger errors. Statistically significant results from a developer study demonstrate the effectiveness of filtered iterators in enabling developers to produce robust input processing code without common input processing defects.
Paper (pdf)
Jun Sun, Yizhou Duan, Qi Zhang, Jiasi Shen, Zongming Guo. 2015. Towards Rate-Distortion Analysis of General Source Distributions: Property and Principles. IEEE 17th International Workshop on Multimedia Signal Processing (MMSP).

(1: Equal contribution.)

Patents

Jiasi Shen, Homer Strong, Daniel George Peebles, Neha Rungta. 2022. Generating Access Management Policies from Example Requests. United States Patent (Patent No.: US 11,483,353 B1).
Abstract Paper (pdf)

Access management policies may be generated from example requests. An access management policy may be received. One or more example requests that have expected results when evaluated with respect to the access management policy may be received. Updates to the access management policy may be determined that cause the expected results to occur when a new version of the access management policy based on the updates is enforced. The new version of the access management policy may be generated based on the updates.
Paper (pdf)

Theses

Jiasi Shen. 2022. Program Inference and Regeneration via Active Learning. Ph.D. Thesis, Massachusetts Institute of Technology.
Abstract Paper (pdf) DOI Benchmarks

Software now plays a central role in numerous aspects of human society. Current software development practices involve significant developer effort in all phases of the software life cycle, including the development of new software, detection and elimination of defects and security vulnerabilities in existing software, maintenance of legacy software, and integration of existing software into more contexts, with the quality of the resulting software still leaving much to be desired. The goal of my research is to improve software quality and reduce costs by automating tasks that currently require substantial manual engineering effort.
I present a novel approach for program inference and regeneration, which takes an existing program, learns its core functionality as a black box, builds a model that captures this functionality, and uses the model to generate a new program. The new program delivers the same core functionality but is potentially augmented or transformed to eliminate defects, systematically introduce safety or security checks, or operate successfully in different environments.
This research enables the rejuvenation and retargeting of existing software and provides a powerful way for developers to express program functionality that adapts flexibly to a variety of contexts. For instance, one benefit is enabling new development methodologies that work with simple prototype implementations as specifications, then use regeneration to automatically obtain clean, efficient, and secure implementations. Another benefit is automatically improving program comprehension and producing cleaner code, making the code more transparent and the developers more productive. A third benefit is automatically extracting the human knowledge crystallized and encapsulated in legacy software systems and retargeting it to new languages and platforms, including languages and platforms that provide more powerful features.
In this thesis, I present two systems that implement this approach for database-backed programs.
Paper (pdf)
Jiasi Shen. 2015. RIFL: A Language with Filtered Iterators. Master's Thesis, Massachusetts Institute of Technology.
Abstract DOI

RIFL is a new programming language that enables developers to write only common-case code to robustly process structured inputs. RIFL eliminates the need to manually handle errors with a new control structure, filtered iterators. A filtered iterator treats inputs as collections of input units, iterates over the units, uses the program itself to filter out unanticipated units, and atomically updates program state for each unit. Filtered iterators can greatly simplify the development of robust programs. We formally define filtered iterators in RIFL. The semantics of filtered iterators ensure that each input unit affects program execution atomically. Our benchmarks show that using filtered iterators reduces an average of 41.7% lines of code, or 58.5% conditional clauses and 33.4% unconditional computation, from fully manual implementations.

Preprints

Tanapoom Sermchaiwong, Jiasi Shen. TBD. Dynamic Graph-based Fingerprinting of In-browser Cryptomining. In submission. arXiv:2505.02493.
Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen. TBD. OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification. In submission. arXiv:2504.20964.
Jingyi Chen, Songqiang Chen, Jialun Cao, Jiasi Shen, Shing-Chi Cheung. TBD. When LLMs Meet API Documentation: Can Retrieval Augmentation Aid Code Generation Just as It Helps Developers?. In submission. arXiv:2503.15231.
Sizhe Zhong, Jiasi Shen. TBD. Automated algorithm for refactoring data structure usages. In submission. https://hdl.handle.net/1783.1/137754.
Jiasi Shen, Martin Rinard, Nikos Vasilakis. TBD. Automatic Synthesis of Parallel Unix Commands and Pipelines with KumQuat. In submission. arXiv:2012.15443.
Jiasi Shen, Martin Rinard. TBD. Active Loop Detection for Applications that Access Databases. In submission. https://hdl.handle.net/1721.1/138144.