Calculating Bandwidth and Binning Time

I found Michael Baker’s blog and wanted to summarize the part of it that covers calculating bandwidth.

Start with the Packetpig custom loaders, which allow you to access specific information in packet captures. There are a number of them, but for this post we’ll use:

  • PacketLoader() allows you to access protocol information (Layer-3 and Layer-4) from packet captures.
  • SnortLoader() inspects traffic using the Snort intrusion detection software.

PacketLoader() provides access to the IP, TCP and UDP headers of each packet in the capture. A great example of its use is the ‘binning.pig’ script. This script allows you to calculate the bandwidth used by TCP and UDP packets, as well as total bandwidth, over any period you define. You might want to calculate these totals every minute, hour, day, week or month to produce a graph.

First, run the binning script using the following command:

./ -x local -r data/web.pcap -f pig/examples/binning.pig

Then open up output/binning/part-r-00000 in a text editor to see the output.

Now let’s walk through the script. First, include all the JARs required for Packetpig and binning.pig to run:

%DEFAULT includepath pig/include.pig
RUN $includepath;

Then set the amount of time you want to bin your values into. In this case we specify one minute (60 seconds); the commented-out alternative bins by the hour:

%DEFAULT time 60
--%DEFAULT time 3600

Then load the data from the packet captures into quite a large schema using PacketLoader():

packets = load '$pcap' using com.packetloop.packetpig.loaders.pcap.packet.PacketLoader() AS (

This is a very rich data model. By leveraging the timestamp (ts), the size of the IP packet (ip_total_length), and the TCP (tcp_len) and UDP (udp_len) lengths, we can calculate total and per-protocol bandwidth at any interval. The beauty of Pig is that we can easily home in on specific hosts by grouping on the source IP, destination IP and destination port, but let’s keep things simple in this post.

The ip_proto field allows us to filter all packets based on protocol. TCP is IP protocol 6 and UDP is IP protocol 17.

tcp = FILTER packets BY ip_proto == 6;
udp = FILTER packets BY ip_proto == 17;
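The protocol numbers used in these filters are the IANA-assigned values, which Python’s standard socket module also exposes as constants. As a small sanity check outside Pig, here is a sketch that filters some made-up packet records the same way. The dictionaries, the "length" key and all values are illustrative assumptions, not PacketLoader output.

```python
import socket

# Hypothetical packet records; only ip_proto mirrors a field named in the post.
packets = [
    {"ip_proto": 6, "length": 1500},
    {"ip_proto": 17, "length": 81},
    {"ip_proto": 1, "length": 64},  # ICMP, matched by neither filter
]

# Same predicates as the two Pig FILTER statements.
tcp = [p for p in packets if p["ip_proto"] == socket.IPPROTO_TCP]  # protocol 6
udp = [p for p in packets if p["ip_proto"] == socket.IPPROTO_UDP]  # protocol 17

print(len(tcp), len(udp))
```

Packets that are neither TCP nor UDP, such as ICMP, fall out of both filters but are still counted in the total bandwidth, because that figure is computed from all packets.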

Once filtered, we can bin each packet into a time period and then project a summary of the data, with the sizes of all TCP packets in that time period (bin) summed.

tcp_grouped = GROUP tcp BY (ts / $time * $time);
tcp_summary = FOREACH tcp_grouped GENERATE group, SUM(tcp.tcp_len) AS tcp_len;
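The grouping key relies on integer arithmetic. Assuming ts is an integer number of seconds (the ILLUSTRATE output in this post shows the group as an int), ts / $time * $time rounds each timestamp down to the start of its bin. The same idea in plain Python, using made-up timestamps and lengths, might look like this:

```python
from collections import defaultdict

BIN_SECONDS = 60  # plays the role of $time in the Pig script


def bin_start(ts, size=BIN_SECONDS):
    """Round an integer timestamp down to the start of its bin."""
    return ts // size * size


# Hypothetical (ts, tcp_len) pairs, not taken from a real capture.
tcp_packets = [(1322644980, 1000), (1322645010, 1080), (1322645045, 500)]

# Equivalent of GROUP ... BY bin followed by SUM(tcp_len).
tcp_summary = defaultdict(int)
for ts, tcp_len in tcp_packets:
    tcp_summary[bin_start(ts)] += tcp_len

print(dict(tcp_summary))
```

The first two packets are 30 seconds apart and land in the same one-minute bin, while the third falls into the next one.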

And then the same for UDP.

udp_grouped = GROUP udp BY (ts / $time * $time);
udp_summary = FOREACH udp_grouped GENERATE group, SUM(udp.udp_len) AS udp_len;

To calculate total bandwidth of all IP packets we bin all packets using the same time period and then sum ip_total_length.

bw_grouped = GROUP packets BY (ts / $time * $time);
bw_summary = FOREACH bw_grouped GENERATE group, SUM(packets.ip_total_length) AS bw;

The output we are looking for is comma-separated values for timestamp, TCP bandwidth, UDP bandwidth and total bandwidth. This is produced by a final join and projection.

joined = JOIN tcp_summary BY group, udp_summary BY group, bw_summary BY group;
summary = FOREACH joined GENERATE tcp_summary::group, tcp_len, udp_len, bw;

It may seem a little cryptic, but the JOIN statement simply joins on the group key that all the summaries share, which is the time period. If you ILLUSTRATE the joined relation you will see the data is there, but not in the format we are looking for.

| joined | tcp_summary::group:int | tcp_summary::tcp_len:long | udp_summary::group:int | udp_summary::udp_len:long | bw_summary::group:int | bw_summary::bw:long |
| | 1322644980 | 2080 | 1322644980 | 81 | 1322644980 | 2305 |
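Pig’s JOIN is an inner join by default, so a time bin only reaches the output if it appears in all three summaries; a minute with TCP traffic but no UDP traffic would be dropped. A rough Python equivalent shows the effect. The per-bin totals are hypothetical, except that the first bin borrows the values from the ILLUSTRATE row above.

```python
# Hypothetical per-bin totals keyed by bin start time.
tcp_summary = {1322644980: 2080, 1322645040: 500}
udp_summary = {1322644980: 81}  # no UDP traffic in the second bin
bw_summary = {1322644980: 2305, 1322645040: 560}

# Inner join: keep only bins present in all three summaries.
common_bins = sorted(tcp_summary.keys() & udp_summary.keys() & bw_summary.keys())

# Projection: one (timestamp, tcp, udp, total) row per surviving bin.
summary = [(b, tcp_summary[b], udp_summary[b], bw_summary[b]) for b in common_bins]

for row in summary:
    print(",".join(str(v) for v in row))
```

Only the first bin survives the join here. Pig also offers outer joins, which could keep such bins at the cost of handling nulls.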

The summary projection generates the output the way we want it, and we store it in CSV format using PigStorage(',').

STORE summary INTO '$output/binning' USING PigStorage(',');
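Each row of the stored output holds totals per bin rather than a rate. To graph bandwidth, one option is to divide by the bin size. The sketch below assumes 60-second bins and that all three lengths are in bytes (as the IP total length field is), and it parses a sample line in the timestamp, TCP, UDP, total layout described above.

```python
import csv
import io

BIN_SECONDS = 60  # must match the $time value used when running binning.pig

# Sample line in the timestamp,tcp,udp,total layout; values from the ILLUSTRATE row.
sample = "1322644980,2080,81,2305\n"


def to_bits_per_second(byte_count, seconds=BIN_SECONDS):
    """Convert a per-bin byte total into an average rate in bits per second."""
    return byte_count * 8 / seconds


rates = []
for ts, tcp_len, udp_len, bw in csv.reader(io.StringIO(sample)):
    rates.append((
        int(ts),
        to_bits_per_second(int(tcp_len)),
        to_bits_per_second(int(udp_len)),
        to_bits_per_second(int(bw)),
    ))

print(rates)
```

To process a real run, open output/binning/part-r-00000 in place of the StringIO sample.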