BugsC++: A Highly Usable Real World Defect Benchmark for C/C++

Gabin An, Minhyuk Kwon, Kyunghwa Choi, Jooyong Yi, Shin Yoo

KAIST, Suresoft Technologies, UNIST

BugsC++ (formerly Defects4C++): A Highly Usable Real World Defect Benchmark for C/C++

BugsC++ /bʌgsi/

Benjamin Siegel (1906-1947)

"one of the most infamous and feared gangsters of his day"

Fellow gang members commonly called him "Bugsy" due to his violent temper -- he was "as crazy as a bedbug"

Benchmarks Drive Technology Advancement

  • Machine Learning: MNIST, SuperGLUE, ...
  • Systems: SPEC, MLPerf, ...
  • Automated Debugging (FL, APR): SIR, ManyBugs, Defects4J, ...

Language Problem

  • New FL/APR approaches are typically evaluated with only one language.

APR/FL Researchers Have Fallen in Love with Java?

TIOBE Index for August 2023

  • Java: 9.49%
  • C: 11.27%
  • C++: 10.65%

Why do researchers favor Java over C/C++?

In the Early Days of APR

Approach Venue Language Remarks
GenProg ICSE'09 C ICSE MIP Award
SemFix ICSE'13 C ICSE MIP Award
PAR ICSE'13 Java
RSRepair ICSE'14 C
SPR FSE'15 C
DirectFix ICSE'15 C
Prophet POPL'16 C
Angelix ICSE'16 C

Why do researchers favor Java over C/C++?

  • Answer: Defects4J

Why do researchers favor Java over C?

  • Answer: Defects4J
  • Researchers have fallen in love with Defects4J.

Strengths of Defects4J

  • Easy to use
    • It provides handy CLIs.
      • check out buggy and fixed versions, build them, and run associated tests (see the example below).
  • Diverse real-world bugs
    • It contains 835 bugs from 17 open-source projects.
  • Reproducibility
    • Java 1.8
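
To give a concrete feel for this workflow, a typical Defects4J session looks roughly like the following (the project name Lang and bug ID 1 are merely illustrative):

$ defects4j checkout -p Lang -v 1b -w /tmp/lang_1_buggy
$ cd /tmp/lang_1_buggy
$ defects4j compile
$ defects4j test

BugsC++ deliberately mirrors this checkout/build/test interface, as the demo later in this talk shows.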

Defects4J vs. Existing C Benchmarks

Benchmark Language CLI? Real-World Bugs? # Projects
Defects4J Java O O 17
C-Pack-IPAs C X X -
CodeFlaws C X X -
ITSP C X X -
DBGBench C X O 2
ManyBugs C X O 9

BugsC++

  • Easy to use
    • It provides handy CLIs.
      • check out buggy and fixed versions, build them, and run associated tests.
  • Diverse real-world bugs
  • Reproducibility
    • via Docker

BugsC++ Demo

$ python bugscpp/bugscpp.py checkout cpp_peglib 1 --buggy
$ python bugscpp/bugscpp.py build ./cpp_peglib/buggy-1
$ python3 bugscpp/bugscpp.py test ./cpp_peglib/buggy-1 -c 1-4 --output-dir=test_result
$ cat test_result/cpp_peglib-buggy-1-1/1.test
$ cat test_result/cpp_peglib-buggy-1-4/4.test
$ cat test_result/cpp_peglib-buggy-1-4/4.output

Code Coverage

$ python bugscpp/bugscpp.py build ./cpp_peglib/buggy-1 --coverage
$ python3 bugscpp/bugscpp.py test ./cpp_peglib/buggy-1 -c 1 --output-dir=test_result --coverage
$ ls test_result/cpp_peglib-buggy-1-1
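
As a minimal sketch of what you could do with the coverage data, assuming the gcov .gcda/.gcno files are left in the build directory (lcov and genhtml are standard external tools, not part of BugsC++), the raw gcov output can be turned into an HTML report:

$ lcov --capture --directory ./cpp_peglib/buggy-1 --output-file coverage.info
$ genhtml coverage.info --output-directory coverage_html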

Bug Types

$ python bugscpp/bugscpp.py search memory-error
$ python bugscpp/bugscpp.py search single_line

C++ Bugs

cpp_peglib, cppcheck, exiv2, proj, yaml_cpp

Towards More Balanced Evaluation

https://github.com/Suresoft-GLaDOS/bugscpp

As Arie just mentioned, I am going to talk a little bit about a benchmark, a benchmark for automated debugging. This work was carried out in partnership with industry collaborators, Minhyuk Kwon and Kyunghwa Choi, from Suresoft Technologies. Gabin An and Shin Yoo are from KAIST. And I am Jooyong from UNIST.

Here is a one-line summary of this talk. Essentially, we propose a Defects4J-style benchmark for C and C++ programs.

Before we proceed, let me talk a little bit about the name of the benchmark. For ease of pronunciation, I will call it 'Bugsy', omitting the ++. Has anyone watched this movie?

This movie is based on the true story of a gangster named Benjamin Siegel. His fellow gang members commonly called him Bugsy due to his violent temper; they said he was as crazy as a bedbug. What a perfect name for a bug benchmark!

Alright, enough about names. Let's move on to the main topic. Needless to say, benchmarks are very important, so most scientific and engineering fields have their own. Machine learning has MNIST and SuperGLUE, among many others. The systems folks commonly use SPEC and, more recently, MLPerf. In our automated debugging field, we have several benchmarks such as SIR, ManyBugs, and Defects4J, to name a few. The question is: are they good enough?

One issue we have to think about is the language problem. Whenever someone comes up with a new FL or APR approach, it is typically evaluated with only one language. And which language is most commonly used these days?

As you know very well, Java is the most commonly used language these days when evaluating automated debugging approaches. This pie chart shows how often Java and C are used in the papers listed on the program-repair.org website. Clearly, Java is disproportionately used in the evaluation of APR. Although not shown here, the situation is similar in FL.

However, this proportion does not match the popularity of the languages. According to the latest TIOBE Index, C is a little more popular than Java, and the same goes for C++.

Are We Going Against the Current?

Well, in fact, that was not the case in the early days of APR and FL research. For example, this table shows representative early-day APR approaches, including two ICSE MIP award winners, GenProg and SemFix. At that time, C programs were the most common evaluation target, with the exception of PAR.

Going back to the question of why researchers favor Java over C/C++, we think the answer is deeply related to Defects4J.

What is so nice about Defects4J? First of all, Defects4J is very easy to use since it provides many handy CLI commands. Using those commands, you can easily check out buggy and fixed versions, build them, and run their associated tests. Another nice thing about Defects4J is that it contains diverse real-world bugs: it currently contains 835 bugs from 17 open-source projects. Lastly, you can easily reproduce the same results as long as you use Java 1.8.

Let's compare Defects4J with existing C benchmarks. Most notably, none of the existing C benchmarks provide CLIs. Many of them do not contain real-world bugs, and they do not provide bugs as diverse as those in Defects4J. All in all, there has been no benchmark for C/C++ programs that can match Defects4J.

But not anymore. We introduce Bugsy, which is easy to use and contains diverse real-world bugs. It also supports reproducibility by dockerizing each subject program in the benchmark.

Let's see how to use Bugsy. Bugsy is written as a Python script. This is the command you can use to check out the first buggy version of a subject program called cpp_peglib. And this is the command to build the program. As usual, the build takes some time, so I am not going to run it here; I already ran it beforehand. So, let's move on. This is the command to test the subject program using test cases 1 through 4.

There's more you can do with Bugsy. You can easily measure code coverage by passing the --coverage option. When this option is turned on, the subject program is instrumented with gcov. As an example, let's run the first test case with the coverage option enabled. It takes a little longer than before because we are now running the instrumented code. And here is the result. We now have all these gcov-generated files from which we can extract code coverage information.

There's still more you can do with Bugsy. Let's say you want to evaluate APR tools that fix memory errors. Then, you can easily search for memory-error bugs using this search command. Our GitHub page maintains a list of bug types you can search for.

Lastly, it is worth mentioning that, unlike the existing C benchmarks, Bugsy also contains C++ bugs for these subjects.

So, this is it. We offer Bugsy. Enjoy!