This post is an extension of my paper titled Redundant State Detection for Dynamic Symbolic Execution, which was presented at the 2013 USENIX Annual Technical Conference (USENIX ATC '13).

1 Implementation

The state space reduction analysis was implemented by augmenting a copy of the open source version of KLEE, a symbolic virtual machine capable of automatically generating test inputs for complex programs written in C, such as device drivers, network drivers, and utility programs. For the remainder of this post, this unmodified open source version of KLEE is referred to as KLEE-Base, and the augmentation of KLEE with the redundant state detector is referred to as KLEE-Reduce. This section discusses practical implementation issues needed to amplify the effectiveness of the redundant state detector.

1.1 Searcher

The KLEE-Base system uses a searcher that interleaves nearest-uncovered-line search with random path selection. In contrast, the KLEE-Reduce system interleaves depth-first search with random path selection in order to bias execution toward deeper parts of the program. This depth bias is needed because the detector can only compare a running state to a terminated state: the more states terminate, the more opportunities the detector has to match states.
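
To make the interleaving concrete, here is a minimal sketch of such a searcher. The `ExecutionState` struct and `Searcher` interface below are simplified stand-ins rather than KLEE's actual classes, and the random strategy is approximated as a uniform choice over running states; only the round-robin interleaving of a depth-biased and a random strategy reflects the behavior described above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Simplified stand-in for KLEE's execution state.
struct ExecutionState { /* path constraints, registers, heap, ... */ };

// Simplified searcher interface: pick the next state to run.
class Searcher {
public:
  virtual ~Searcher() = default;
  virtual ExecutionState *selectState(std::vector<ExecutionState *> &states) = 0;
};

// Depth-first: always pick the most recently forked (deepest) state.
class DFSSearcher : public Searcher {
public:
  ExecutionState *selectState(std::vector<ExecutionState *> &states) override {
    return states.back();  // assumes states is non-empty
  }
};

// Random selection, approximated here as a uniform choice over running
// states; KLEE's real random-path searcher walks the process tree instead.
class RandomSearcher : public Searcher {
public:
  ExecutionState *selectState(std::vector<ExecutionState *> &states) override {
    return states[std::rand() % states.size()];
  }
};

// Round-robin interleaving of the two strategies, biasing execution toward
// deeper parts of the program while retaining some randomness.
class InterleavedSearcher : public Searcher {
  std::vector<std::unique_ptr<Searcher>> searchers;
  std::size_t next = 0;

public:
  InterleavedSearcher() {
    searchers.push_back(std::make_unique<DFSSearcher>());
    searchers.push_back(std::make_unique<RandomSearcher>());
  }

  ExecutionState *selectState(std::vector<ExecutionState *> &states) override {
    Searcher *s = searchers[next].get();
    next = (next + 1) % searchers.size();
    return s->selectState(states);
  }
};
```

In the engine's main loop, `selectState` is called each scheduling step with the set of running states to decide which one executes next.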

1.2 Hybrid Execution

KLEE-Reduce matches a running state against all terminated states every time the running state executes a basic block. Because of this expensive matching, KLEE-Reduce executes fewer instructions per second than KLEE-Base. In addition, it uses more memory per generated state because it keeps snapshots of previous states in memory for matching. Note that while it uses more memory per generated state, KLEE-Reduce quite often uses less overall memory than KLEE-Base because it generates fewer states. Despite this overhead, KLEE-Reduce's effectiveness at pruning states enables it to reach the same coverage an order of magnitude faster on most programs, to reach higher coverage, and, in many cases, to terminate execution entirely, as discussed in Section 2. However, for a handful of benchmarks, this overhead prevents KLEE-Reduce from reaching the last few uncovered lines because it cannot explore enough paths before the time limit and memory limit. To remedy this, KLEE-Reduce runs in two phases. The first phase runs with the detector enabled until either no new coverage is reached for 3 minutes or 3 GB of memory is used. The second phase then disables the detector, deallocates its data structures, and continues symbolically executing the program until the time limit or memory limit is reached.
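
The phase switch amounts to a small policy check in the interpreter loop. The sketch below illustrates one way to structure it; the `RedundantStateDetector` and `HybridController` interfaces are hypothetical simplifications rather than the actual implementation, and only the thresholds (3 minutes without new coverage, 3 GB of detector memory) come from the description above.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical, simplified stand-ins; the real logic lives inside
// KLEE's executor and differs in detail.
struct ExecutionState { /* path constraints, registers, heap, ... */ };

class RedundantStateDetector {
  std::uint64_t snapshotBytes = 0;  // memory held by terminated-state snapshots

public:
  // Compare the running state against every terminated-state snapshot;
  // returns true on a match (matching logic omitted in this sketch).
  bool matchAgainstTerminated(const ExecutionState &state) {
    (void)state;
    return false;
  }
  std::uint64_t memoryUsageBytes() const { return snapshotBytes; }
  void deallocate() { snapshotBytes = 0; /* free all snapshots */ }
};

class HybridController {
  using Clock = std::chrono::steady_clock;

  RedundantStateDetector detector;
  bool detectorEnabled = true;  // phase 1 while true, phase 2 once false
  Clock::time_point lastNewCoverage = Clock::now();

  static constexpr auto kCoverageStall = std::chrono::minutes(3);
  static constexpr std::uint64_t kMemoryLimit = 3ULL << 30;  // 3 GB

public:
  // Call whenever a previously uncovered line is executed.
  void onNewCoverage() { lastNewCoverage = Clock::now(); }

  // Called once per basic block executed by a running state. Returns true
  // if the state matched a terminated state and should be pruned.
  bool step(const ExecutionState &state) {
    if (!detectorEnabled)
      return false;

    // Switch to phase 2: disable the detector once coverage has stalled
    // for 3 minutes or its snapshots exceed the 3 GB budget.
    bool stalled = Clock::now() - lastNewCoverage > kCoverageStall;
    bool overBudget = detector.memoryUsageBytes() > kMemoryLimit;
    if (stalled || overBudget) {
      detectorEnabled = false;
      detector.deallocate();
      return false;
    }

    return detector.matchAgainstTerminated(state);
  }
};
```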

1.3 Performance Metrics

The following simple metrics gave useful insight into how the optimization was performing:

  • Percent statement coverage
  • Memory used
  • Memory used per generated state
  • Number of instructions executed
  • Number of instructions executed per second
  • Percent solver time
  • Individual query solver time
  • Number of states generated
  • Number of states running

The values of these metrics varied greatly over time during symbolic execution. Thus, it was essential to visualize them in time charts, side by side with the base system. Section 2 shows several examples of these time charts.
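
One lightweight way to produce such charts is to sample the metrics at a fixed interval and append each sample to a CSV file that can be plotted afterwards. The sketch below is only an illustration; the field names mirror the bullet list above, but the `MetricsSample` struct and `appendSample` helper are hypothetical, and per-query solver times would be logged separately as each query completes.

```cpp
#include <cstdint>
#include <cstdio>

// One row of a time chart; the fields mirror the metric list above.
struct MetricsSample {
  double elapsedSeconds;
  double statementCoveragePercent;
  double memoryUsedMB;
  double memoryPerGeneratedStateKB;
  std::uint64_t instructionsExecuted;
  double instructionsPerSecond;
  double solverTimePercent;
  std::uint64_t statesGenerated;
  std::uint64_t statesRunning;
};

// Append one sample as a CSV row. Sampling periodically for both KLEE-Base
// and KLEE-Reduce yields two files that can be plotted side by side
// (e.g. with gnuplot or matplotlib).
void appendSample(std::FILE *out, const MetricsSample &s) {
  std::fprintf(out, "%.1f,%.2f,%.1f,%.1f,%llu,%.0f,%.2f,%llu,%llu\n",
               s.elapsedSeconds, s.statementCoveragePercent, s.memoryUsedMB,
               s.memoryPerGeneratedStateKB,
               static_cast<unsigned long long>(s.instructionsExecuted),
               s.instructionsPerSecond, s.solverTimePercent,
               static_cast<unsigned long long>(s.statesGenerated),
               static_cast<unsigned long long>(s.statesRunning));
}
```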

2 Experiments

This section discusses in detail the performance of KLEE-Reduce compared to KLEE-Base on five benchmarks: join, mkdir, mktemp, rmdir, and csplit.

[Figure 1: join]

Figure 1 above shows the states generated by KLEE-Base and KLEE-Reduce on the join benchmark. The vertical, dashed gray line indicates the start of phase 2 of the hybrid execution, when the detector is disabled, as discussed in Section 1.2.

[Figure 2: join]

Figure 2 above shows the coverage reached and states generated on the join benchmark, the ninth largest of the 63 benchmarks in this evaluation. KLEE-Base reaches its maximum coverage of 84% in 49 minutes by generating 156,000 states and executing 28 million instructions, while KLEE-Reduce reaches its maximum coverage of 87% in 9 minutes by generating only 7,600 states and executing only 1.7 million instructions. Neither system terminates the symbolic execution.

[Figure 3: join]

Figure 3 above shows the total percent time spent in the solver by KLEE-Base and KLEE-Reduce on the join benchmark. Observe how KLEE-Base's percent solver time plateaus about 10 minutes into the symbolic execution. In contrast, KLEE-Reduce's percent solver time continues to increase because its searcher is biased to explore deeper parts of the program, as discussed in Section 1.1, and thus encounters harder, more time-consuming queries.

[Figure 4: join]

Figure 4 above is a scatter plot of individual constraint solver query times for KLEE-Base and KLEE-Reduce on the join benchmark, superimposed over the total percent time spent in the solver.

[Figure 5: join]

Figure 5 shows the number of symbolically executed instructions per second on the join benchmark. KLEE-Base's instruction rate is significantly higher than KLEE-Reduce's, even after the detector is disabled, because KLEE-Reduce encounters harder queries.

[Figure 6: join]

Figure 6 shows the memory (GB) used over time by KLEE-Base and KLEE-Reduce on the join benchmark.

[Figure 7: join]

Figure 7 shows the memory (KB) per generated state for KLEE-Base and KLEE-Reduce on the join benchmark. The metric is calculated by dividing the total memory used by the total number of states generated. KLEE-Base's memory per generated state decreases over time because of KLEE's copy-on-write optimization, which creates new memory objects only when a write is performed; otherwise, when a state is forked, the two child states share the same memory objects. This optimization is an example of the sophistication and maturity of the KLEE system.
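
The effect of copy-on-write on this metric can be illustrated with a small sketch. This is not KLEE's actual address-space machinery, just a simplified model of the sharing behavior: forked states share memory objects through reference-counted pointers and take a private copy only on the first write.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// A memory object's concrete bytes (symbolic contents omitted for brevity).
struct MemoryObject {
  std::vector<std::uint8_t> bytes;
};

// Each state holds shared pointers to memory objects. Forking copies only
// the pointer table, so both children share the underlying bytes.
struct State {
  std::vector<std::shared_ptr<MemoryObject>> objects;

  State fork() const { return *this; }  // shallow copy: objects stay shared

  void write(std::size_t obj, std::size_t offset, std::uint8_t value) {
    // Copy-on-write: if another state still shares this object,
    // make a private copy before modifying it.
    if (objects[obj].use_count() > 1)
      objects[obj] = std::make_shared<MemoryObject>(*objects[obj]);
    objects[obj]->bytes[offset] = value;
  }
};
```

With this scheme, a newly forked state adds almost no memory until it diverges from its parent, which is why the average memory per generated state drops as more states are created.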

[Figure 8: mkdir]

Figure 8 shows the coverage reached and states generated on the mkdir benchmark. KLEE-Base reaches 100% coverage after 40 minutes, generating 68,210 states, whereas KLEE-Reduce reaches 100% coverage within the first 10 seconds, generating only 573 states.

[Figure 9: mktemp]

Figure 9 shows the coverage reached and states generated on the mktemp benchmark. KLEE-Base reaches a maximum coverage of 94.22% after 12 minutes and continues executing until the time limit is reached, generating 154,000 states. In contrast, KLEE-Reduce reaches a higher coverage of 96.53% within the first 8 seconds, generating 3,200 states, and then immediately terminates, thereby proving that the remaining uncovered lines are dead code with respect to the modelled environment.

[Figure 10: rmdir]

Figure 10 shows the coverage reached and states generated on the rmdir benchmark. KLEE-Base reaches a maximum coverage of 76.87% after 17 minutes and continues executing until the time limit is reached, generating 290,000 states, whereas KLEE-Reduce reaches a higher coverage of 78.91% within the first 5 seconds, generating 360 states.

[Figure 11: csplit]

Figure 11 shows the coverage reached and states generated on the csplit benchmark. KLEE-Base reaches a maximum coverage of 71.39% after 30 minutes, whereas KLEE-Reduce reaches a coverage of only 67.66%. This figure shows that KLEE-Reduce encounters harder, more time-consuming queries because of the depth bias explained in Section 1.1, and thus spends more of its time in the constraint solver.