Caper: helper tooling for tcpdump and BPF

If you use or develop packet filters then you might be interested in a survey we recently ran to poll the current usage of -- and perceived problems with -- traffic filtering and packet monitoring tools. The results will help guide new academic research and will be shared with the research community. Contact me to find out more.

What is Caper? Caper is a freely available, open-source tool for processing packet filters. Filters are used as the first processing stage when capturing, diverting, or dropping network traffic. Caper processes filters by converting them among different representations to clarify their meaning. The meaning of a filter is the set of network traffic it matches. That is rather abstract, so we express the meaning using a representation: a syntax for writing that meaning down. As we'll see below, different representations have different purposes.

What does a filter look like? Filters are written in a special, concise language such as those used by tcpdump and Wireshark. An example filter is dst port http, which filters for network traffic whose destination port is http -- that is, network traffic that is trying to reach an HTTP (web) server. That syntax is an example representation of a filter. This type of representation is called a tcpdump or pcap expression. In practice, filters can be complex but this page will base its description on that simple filter.
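To make the named port concrete, here is a minimal Python sketch that rewrites service names in a filter into port numbers, so that dst port http becomes dst port 80. The table WELL_KNOWN_PORTS and the function resolve_port_names are hypothetical names introduced for illustration. The table is hard-coded here as an assumption, whereas tcpdump itself looks names up in the system's services database.

```python
import re

# Assumption: a tiny hard-coded table of well-known ports, for illustration only.
# tcpdump consults the system's services database instead.
WELL_KNOWN_PORTS = {"http": 80, "domain": 53, "ftp": 21, "ftp-data": 20}


def resolve_port_names(expression: str) -> str:
    """Replace 'port <service-name>' with 'port <number>' where the name is known."""

    def substitute(match: re.Match) -> str:
        keyword, name = match.group(1), match.group(2)
        port = WELL_KNOWN_PORTS.get(name)
        # Unknown names are left untouched rather than guessed.
        return match.group(0) if port is None else f"{keyword}{port}"

    # A service name starts with a letter, so numeric ports are never rewritten.
    return re.sub(r"\b(port\s+)([a-z][a-z0-9-]*)", substitute, expression)


if __name__ == "__main__":
    print(resolve_port_names("dst port http"))
```

Numeric ports and unknown names pass through unchanged, so the sketch can be applied to a filter that already uses numbers.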

How do filters work? They get converted into a different representation that consists of a sequence of machine instructions. These instructions are applied to each packet to check if it matches the filter or not. For example, on a standard Ethernet/IPv4 network the intended meaning of dst port http is converted to:

     l000: ldh [12]                        ; Inspect the "ethertype" field, i.e., the two bytes starting at byte offset 12 in the frame.
     l001: jeq #0x800 , l002 , unmatched   ; If that field's value is IPv4 then proceed to the next line, otherwise go to "unmatched" (last line).
     l002: ldb [23]                        ; Inspect the "protocol" field in the IPv4 packet.
     l003: jeq #0x6 , l004 , unmatched     ; If that field's value is TCP then proceed to the next line, otherwise go to "unmatched".
     l004: ldh [20]                        ; Inspect the "flags" and "fragment offset" fields in the IPv4 packet.
     l005: jset #0x1fff , unmatched , l006 ; If this is not the first or only fragment then go to "unmatched", otherwise proceed.
     l006: ldxb 4 * ([14] & 0xf)           ; Calculate the size of the IPv4 header, to work out where the TCP header starts.
     l007: ldh [x + 16]                    ; Inspect the "destination port" field in the TCP header.
     l008: jeq #0x50 , matched , unmatched ; If the field's value is 80 (or 0x50 in hexadecimal) then match the packet, otherwise don't.
  matched: ret #1514                       ; A packet is kept if it matches the filter: return the full Ethernet frame.
unmatched: ret #0                          ; A packet that doesn't match the filter is discarded: return nothing.

This type of representation is called BPF assembly. (This is a predecessor of eBPF, which you might have heard of.) In each line, the text following the semicolon is a comment that briefly describes what that line is doing. (Find out in more detail what the above code means).

But the two representations are actually unequal: they represent different filters. dst port http matches more packets than the BPF code above. The two filters differ in two ways: (1) By default, tcpdump will generate IPv6-related code unless the expression specifies a network protocol, and dst port http does not specify IPv4. (2) By default, tcpdump will generate code for other protocols that have "port" fields. We usually carry HTTP over TCP only, but tcpdump caters for the general case.

For such a simple example, this might all seem a bit academic, but it does not get easier for complex filters: reading the BPF code representation is more difficult than reading the tcpdump filter representation. The devil is in the detail.
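
To see how the BPF assembly above is applied to a packet, the following Python sketch mirrors it instruction by instruction. It is an illustrative transcription, not how the kernel or Caper executes filters. The names run_filter, make_frame, load_half and load_byte are hypothetical, and make_frame builds a minimal test frame with only the fields that the filter reads filled in.

```python
import struct


def load_half(frame: bytes, offset: int):
    """Load a big-endian 16-bit value, or None if the load is out of bounds."""
    if offset + 2 > len(frame):
        return None
    return struct.unpack_from("!H", frame, offset)[0]


def load_byte(frame: bytes, offset: int):
    """Load one byte, or None if the load is out of bounds."""
    return frame[offset] if offset < len(frame) else None


def run_filter(frame: bytes) -> int:
    """Return how many bytes to keep; 0 means the frame is discarded."""
    if load_half(frame, 12) != 0x0800:         # l000-l001: ethertype must be IPv4
        return 0
    if load_byte(frame, 23) != 0x6:            # l002-l003: IPv4 protocol must be TCP
        return 0
    flags_and_offset = load_half(frame, 20)    # l004
    if flags_and_offset is None or flags_and_offset & 0x1FFF:
        return 0                               # l005: not the first or only fragment
    first_ip_byte = load_byte(frame, 14)
    if first_ip_byte is None:
        return 0
    x = 4 * (first_ip_byte & 0xF)              # l006: IPv4 header length in bytes
    if load_half(frame, x + 16) != 0x50:       # l007-l008: TCP destination port 80
        return 0
    return 1514                                # matched


def make_frame(ethertype=0x0800, proto=6, frag=0, dport=80) -> bytes:
    """Build a minimal Ethernet/IPv4 frame (hypothetical test helper)."""
    ethernet = bytes(12) + struct.pack("!H", ethertype)
    ip = bytearray(20)
    ip[0] = 0x45                               # version 4, header length 5 words
    struct.pack_into("!H", ip, 6, frag)        # flags and fragment offset
    ip[9] = proto
    transport = struct.pack("!HH", 12345, dport) + bytes(16)
    return ethernet + bytes(ip) + transport


if __name__ == "__main__":
    print(run_filter(make_frame()))            # TCP to port 80
    print(run_filter(make_frame(proto=17)))    # UDP to port 80
```

A UDP packet to port 80 is discarded by this code even though dst port http would accept it, which is one of the two differences described above.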

What problem does Caper solve? Caper helps to make filters fully understandable without having to read their BPF code. For example, it expands dst port http to the following tcpdump representation, turning its implicit meaning into explicit syntax:

        ether proto \ip   &&
        (ip proto \sctp   &&   sctp dst port 80
              ||   ip proto \tcp   &&   tcp dst port 80
              ||   ip proto \udp   &&   udp dst port 80)
        ||
        ether proto \ip6   &&
        (ip6 proto \sctp   &&   sctp dst port 80
              ||   ip6 proto \tcp   &&   tcp dst port 80
              ||   ip6 proto \udp   &&   udp dst port 80)

For the BPF assembly shown above, the corresponding tcpdump expression is ip and tcp and dst port 80 or, equivalently, ether proto \ip && ip proto \tcp && tcp dst port 80. (Find out how to read tcpdump expression syntax).
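
The difference between the two tcpdump expressions above can be checked by brute force over a small space of packets. The sketch below is only an illustration of what "different meanings" amounts to, and it is not how Caper reasons about filters. Packet, expanded_filter, narrow_filter and disagreements are hypothetical names, packets are reduced to three fields, and fragment handling is left out as a simplifying assumption.

```python
from itertools import product
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet summary holding only the fields the filters inspect."""
    network: str    # "ip" or "ip6"
    transport: str  # "tcp", "udp" or "sctp"
    dst_port: int


def expanded_filter(p: Packet) -> bool:
    """Mirrors the expansion of `dst port http` (fragment handling omitted)."""
    transport_match = (
        (p.transport == "sctp" and p.dst_port == 80)
        or (p.transport == "tcp" and p.dst_port == 80)
        or (p.transport == "udp" and p.dst_port == 80)
    )
    return (p.network == "ip" and transport_match) or (
        p.network == "ip6" and transport_match
    )


def narrow_filter(p: Packet) -> bool:
    """Mirrors `ip and tcp and dst port 80`."""
    return p.network == "ip" and p.transport == "tcp" and p.dst_port == 80


def disagreements() -> list:
    """Enumerate a small packet space and collect packets the filters disagree on."""
    space = product(("ip", "ip6"), ("tcp", "udp", "sctp"), (80, 443))
    return [
        pkt
        for pkt in (Packet(*fields) for fields in space)
        if expanded_filter(pkt) != narrow_filter(pkt)
    ]


if __name__ == "__main__":
    for pkt in disagreements():
        print(pkt)
```

Every packet this reports is matched by the expanded filter and rejected by the narrow one, in line with the observation that dst port http matches more packets.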

Why does this matter? Caper helps with reasoning about equivalences between filters. (Find out in more detail how Caper works).

Reasoning about filters helps you get exactly the filter you want. In the above example, even if you only want to match TCP packets, the BPF assembly generated for dst port http would needlessly check for an SCTP match, and that check would fail for every packet. If the BPF assembly that captures your intended meaning is shorter than what your tcpdump expression produces, then it will usually require less time to execute. This allows you to process more packets in the same amount of time. For high-throughput links, the effort saved for each packet quickly adds up.

How do I use it? The simplest approach is to use Caper online at the wonderful BPF Exam and BPF Simulator sites. You can also download the source code from the Caper repo, then build and run it locally. The repo contains scripts that automate this for you by creating a fresh container or VM that contains the required dependencies.

Is there such a thing as Filter Filtering? Glad you asked. Click here to play Filter Filtering. It involves matching different representations of the same filter. For example, matching the English description to BPF code, or to the implicit or expanded tcpdump expressions.

How can I learn more about pcap and BPF? Is this related to eBPF? We're preparing a tutorial; please check back later.

How can I contribute? There is a wide variety of contributions that could really help this project! These are some entry-level ideas:

  • Documentation: Helping with making a Caper tutorial.
  • Documentation: Putting together usage examples of Caper in various network scenarios.
  • Packaging: Packaging Caper for opam, Debian, Homebrew, etc.
  • Coding: Reducing the number of compiler warnings.
  • Coding: Adding test cases.
  • Coding: Making Caper scripts more portable across OSs (e.g., use of #! arguments), and applying Shellcheck to find potential issues.
  • Documentation: Improving this website (including its accessibility).

Acknowledgement. Huge thanks to Caper's contributors for their careful improvements, and to the BPF Exam project for feedback on Caper. The template for this site is based on the default Apache2 page on Debian, and the invocation of Caper is based on that in BPF Exam. The quiz below benefited from the following features:

  • Hyunsuk Bang's BPF code generator.
  • Marelle León's English-to-pcap converter.

See anything that needs fixing? Please send feedback and suggestions to Nik Sultana.

Filter Filtering

"Filter Filtering" involves matching two representations of the same filter. Consider the following representation:

((ethernet that has a proto of IPv4, and IPv4 that has a proto of tcp), or (ethernet that has a proto of IPv6, and IPv6 that has a proto of tcp)), and (tcp that has a destination port of domain, or tcp that has a destination port of ftp, or tcp that has a destination port of ftp-data)

Which of the expressions below does the above filter correspond to?