BugsC++: A Highly Usable Real World Defect Benchmark for C/C++

Gabin An, Minhyuk Kwon, Kyunghwa Choi, Jooyong Yi, Shin Yoo

KAIST, Suresoft Technologies, UNIST

BugsC++ Defects4C++: A Highly Usable Real World Defect Benchmark for C/C++

BugsC++ /bʌgsi/

Benjamin Siegel (1906-1947)

"one of the most infamous and feared gangsters of his day"

Fellow gang members commonly called him "Bugsy" due to his violent temper -- he was "as crazy as a bedbug"

Benchmarks Drive Technology Advancement

  • Machine Learning: MNIST, SuperGLUE, ...
  • Systems: SPEC, MLPerf, ...
  • Automated Debugging (FL, APR): SIR, ManyBugs, Defects4J, ...

Language Problem

  • New FL/APR approaches are typically evaluated with only one language.

APR/FL Researchers Have Fallen in Love with Java?

TIOBE Index for August 2023

  • Java: 9.49%
  • C: 11.27%
  • C++: 10.65%

Why do researchers favor Java over C/C++?

In the Early Days of APR

Approach    Venue     Language  Remarks
GenProg     ICSE'09   C         ICSE MIP Award
SemFix      ICSE'13   C         ICSE MIP Award
PAR         ICSE'13   Java
RSRepair    ICSE'14   C
DirectFix   ICSE'15   C
Prophet     POPL'16   C
Angelix     ICSE'16   C

Why do researchers favor Java over C/C++?

  • Answer: Defects4J

Why do researchers favor Java over C?

  • Answer: Defects4J
  • Researchers have fallen in love with Defects4J.

Strengths of Defects4J

  • Easy to use
    • It provides handy CLIs.
      • check out buggy and fixed versions, build them, and run associated tests.
  • Diverse real-world bugs
    • It contains 835 bugs from 17 open-source projects.
  • Reproducibility
    • Java 1.8

Defects4J vs. Existing C Benchmarks

Benchmark     Language  CLI?  Real-World Bugs?  # Projects
Defects4J     Java      O     O                 17
C-Pack-IPAs   C         X     X                 -
CodeFlaws     C         X     X                 -
DBGBench      C         X     O                 2
ManyBugs      C         X     O                 9


  • Easy to use
    • It provides handy CLIs.
      • check out buggy and fixed versions, build them, and run associated tests.
  • Diverse real-world bugs
  • Reproducibility
    • via Docker

BugsC++ Demo

$ python bugscpp/bugscpp.py checkout cpp_peglib 1 --buggy
$ python bugscpp/bugscpp.py build ./cpp_peglib/buggy-1
$ python3 bugscpp/bugscpp.py test ./cpp_peglib/buggy-1 -c 1-4 --output-dir=test_result
$ cat test_result/cpp_peglib-buggy-1-1/1.test
$ cat test_result/cpp_peglib-buggy-1-4/4.test
$ cat test_result/cpp_peglib-buggy-1-4/4.output

Code Coverage

$ python bugscpp/bugscpp.py build ./cpp_peglib/buggy-1 --coverage
$ python3 bugscpp/bugscpp.py test ./cpp_peglib/buggy-1 -c 1 --output-dir=test_result --coverage
$ ls test_result/cpp_peglib-buggy-1-1

Bug Types

$ python bugscpp/bugscpp.py search memory-error
$ python bugscpp/bugscpp.py search single_line

C++ Bugs

cpp_peglib, cppcheck, exiv2, proj, yaml_cpp

Towards More Balanced Evaluation


As Arie just mentioned, I am going to talk a little bit about a benchmark, a benchmark for automated debugging. This work was carried out in partnership with industry collaborators, Minhyuk Kwon and Kyunghwa Choi, from Suresoft Technologies. Gabin An and Shin Yoo are from KAIST. And I am Jooyong from UNIST.

Here is a one-line summary of this talk. Essentially, we propose a Defects4J-style benchmark for C and C++ programs.

Before we proceed, let me talk a little bit about the name of the benchmark. For simplicity in pronunciation, I will call it 'Bugsy', omitting the ++. Has anyone watched this movie?

This movie is based on the true story of a gangster named Benjamin Siegel. His fellow gang members commonly called him Bugsy due to his violent temper. They said he was as crazy as a bedbug. What a perfect name for a bug benchmark!

Alright. Enough about names. Let's move on to the main topic. Needless to say, benchmarks are very important, so most scientific and engineering fields have their own. Machine learning has MNIST and SuperGLUE, among many others. The systems folks commonly use SPEC and, more recently, MLPerf. In our automated debugging field, we have several benchmarks such as SIR, ManyBugs, and Defects4J, to name a few. The question is: are they good enough?

One issue we have to think about is a language problem. Whenever someone comes up with a new FL approach or a new APR approach, these approaches are typically evaluated with only one language. And which language is most commonly used these days?

As you know very well, Java is the most commonly used language these days when evaluating automated debugging approaches. This pie chart shows how often Java and C are used in the papers listed on the program-repair.org website. Clearly, Java is disproportionately used in the evaluation of APR. Although not shown here, the situation is similar in FL.

However, this proportion does not match the popularity of the languages. According to the latest TIOBE Index, C is a little more popular than Java. And the same goes for C++.

Are We Going Against the Current?

Well, in fact, that was not the case in the early days of APR and FL research. For example, this table shows representative early-day APR approaches, including two ICSE MIP award winners, GenProg and SemFix. At that time, C programs were the most common target, with the exception of PAR.

Going back to the question of why researchers favor Java over C/C++, we think the answer is closely related to Defects4J.

What is so nice about Defects4J? First of all, Defects4J is very easy to use since it provides handy CLIs. Using those CLIs, you can easily check out buggy and fixed versions, build them, and run their associated tests. Another nice thing about Defects4J is that it contains diverse real-world bugs. Defects4J currently contains 835 bugs from 17 open-source projects. Lastly, you can easily reproduce the same results as long as you use Java 1.8.

Let's compare Defects4J with existing C benchmarks. Most notably, none of the existing C benchmarks provides a CLI. Many of them do not contain real-world bugs, and those that do cover fewer projects than Defects4J. All in all, there has been no benchmark for C/C++ programs that can match Defects4J.

But not anymore. We introduce Bugsy, which is easy to use and contains diverse real-world bugs. It also supports reproducibility by dockerizing each subject program in the benchmark.

Let's see how to use Bugsy. Bugsy is written as a Python script. This is the command you can use to check out the first buggy version of a subject program called cpp_peglib. And this is the command to build the program. As usual, the build takes some time, so I am not going to run it here. I already ran it beforehand. So, let's move on. This is the command to test the subject program using test cases 1 through 4.
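For scripted experiments, the three demo steps can be assembled programmatically. The following is a minimal Python sketch, not part of Bugsy itself: the helper name `demo_commands` is hypothetical, the script path and the `./<project>/buggy-<index>` working directory are taken from the demo slide, and `python3` is assumed as the interpreter (the slide uses both `python` and `python3`). It only prints the command lines and does not execute them.

```python
import shlex

BUGSCPP = "bugscpp/bugscpp.py"  # script path as shown on the demo slide


def demo_commands(project, index, cases, output_dir="test_result"):
    """Return the checkout, build, and test steps of the demo as argv lists.

    Hypothetical helper: the working-directory layout is an assumption
    based on the paths shown on the demo slide.
    """
    workdir = f"./{project}/buggy-{index}"
    return [
        ["python3", BUGSCPP, "checkout", project, str(index), "--buggy"],
        ["python3", BUGSCPP, "build", workdir],
        ["python3", BUGSCPP, "test", workdir, "-c", cases,
         f"--output-dir={output_dir}"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them (the build step is slow).
    for argv in demo_commands("cpp_peglib", 1, "1-4"):
        print(shlex.join(argv))
```

Passing each list to `subprocess.run` would execute the steps in order; keeping them as argv lists avoids shell quoting problems when project names or paths vary.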

There's more you can do with Bugsy. You can easily measure the code coverage by giving the --coverage option. When this option is turned on, the subject program will be instrumented with gcov. As an example, let's run the first test case with the coverage option enabled. It takes a little bit longer than before because we are now running the instrumented code. And here is the result. We now have all these gcov-generated files from which we can get code coverage information.
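As an illustration of how such coverage output might be consumed, for example by a fault localization tool, here is a small Python sketch that reads line-level coverage from gcov text reports. It assumes the files use gcov's standard `count:lineno:source` layout and carry a `.gcov` suffix; both are assumptions about the output directory, and `line_coverage` is a hypothetical helper, not a Bugsy API.

```python
import sys
from pathlib import Path


def line_coverage(gcov_text):
    """Split one gcov text report into (executed, missed) line-number sets."""
    executed, missed = set(), set()
    for raw in gcov_text.splitlines():
        parts = raw.split(":", 2)
        if len(parts) < 3:
            continue  # not a "count:lineno:source" row
        count, lineno = parts[0].strip(), parts[1].strip()
        if not lineno.isdigit() or int(lineno) == 0:
            continue  # line 0 carries metadata such as "Source:"
        if count in ("#####", "====="):
            missed.add(int(lineno))  # executable but never run
        elif count.rstrip("*").isdigit() and int(count.rstrip("*")) > 0:
            executed.add(int(lineno))
        # a count of "-" marks a non-executable line and is ignored
    return executed, missed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed layout): python coverage_summary.py test_result/cpp_peglib-buggy-1-1
    for path in sorted(Path(sys.argv[1]).glob("*.gcov")):
        executed, missed = line_coverage(path.read_text(errors="replace"))
        print(f"{path.name}: {len(executed)} executed, {len(missed)} missed")
```

Collecting the executed-line sets per test case, together with each test's pass/fail outcome, gives the kind of input that spectrum-based fault localization techniques typically start from.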

There's still more you can do with Bugsy. Let's say you want to evaluate APR tools that fix memory errors. Then, you can easily search for memory-error bugs using this search command. Our GitHub page maintains a list of bug types you can search for.
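The search step can be wrapped in the same way when selecting subjects for an experiment. This is a hedged Python sketch: `search_bugs` is a hypothetical wrapper, the tags `memory-error` and `single_line` are the two shown on the slide, and the format of the search output is not assumed, so the raw text is returned unparsed.

```python
import shlex
import subprocess


def search_bugs(tag, dry_run=True):
    """Build (and optionally run) the search command for one bug-type tag."""
    argv = ["python3", "bugscpp/bugscpp.py", "search", tag]
    if dry_run:
        return shlex.join(argv)  # only show what would be executed
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout  # raw output; its format is not assumed here


if __name__ == "__main__":
    for tag in ("memory-error", "single_line"):
        print(search_bugs(tag))
```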

Lastly, it is worth mentioning that, unlike the existing C benchmarks, Bugsy also contains C++ bugs from these subjects.

So, this is it. We offer Bugsy. Enjoy!