I was having problems with binaries I was downloading with a particular application the other day. As part of the debugging process at one point, I was taking packet captures with Wireshark inside the client LAN, at the client router’s WAN, and tcpdump from the server. I was then reassembling the file from the stream in each packet capture and comparing them to see where the corruption was occurring relative to the copy that resided on the server.
To accomplish this, I was going to the HTTP GET message packet in Wireshark. Then I would right-click on the packet and select Follow Stream. Next I would select only the direction of traffic from the server to the client (since this was a download). Then I would make sure RAW was selected and save the file. Finally I would open the file up in a hex editor, remove the HTTP header that winds up prepended to the file, and save it. Annnnd then the file was corrupted.
Doing a binary diff of a valid copy of the file with the reconstructed file using 010 Editor I could see that the only differences were several small sections of the file with values like these spaced throughout the file:
Hex: 0D 0A 31 30 30 30 0D 0A
and one of these at the end of the file:
Hex: 0D 0A 00 00 0D 0A
I confirmed that each of the packet captures at the various points along the way all had the same result. Where the heck was this random data getting injected into my stream and better still, why?!
The first clue that it wasn’t truly random data was the \r\n values. Carriage Return – Line Feed (CRLF) is a staple demarcation value in the HTTP protocol. My second clue was that the values were typically 1000 and 0. Although respresented with ASCII codes in the file, if you interpret them as hex they are 4096 and 0. When doing buffered I/O a 4K buffer is very common as is getting a 0 back from a read function when you reach EOF.
As it turns out, the particular behavior I was seeing was a feature of the HTTP/1.1 Protocol called Chunked Transfer Encoding. The wikipedia article does a great job explaining it, but basically it allows for content to be sent prior to knowing the exact size of that content. It does this by prepending the size to the each chunk:
The size of each chunk is sent right before the chunk itself so that a client can tell when it has finished receiving data for that chunk. The data transfer is terminated by a final chunk of length zero.
Ah-ha! So my naïve manual file reconstruction from the Wireshark packet capture of the HTTP download was flawed. Or was it? I checked the file on disk and sure enough it too had these extra data values present.
Once again, Wikipedia to the rescue (emphasis mine):
For version 1.1 of the HTTP protocol, the chunked transfer mechanism is considered to be always acceptable, even if not listed in the TE request header field, and when used with other transfer mechanisms, should always be applied last to the transferred data and never more than one time
The server was utilizing chunked transfer encoding but the application I was using wasn’t fully HTTP/1.1 compliant and was thus doing a naïve reconstruction just like me! So, if you find yourself doing file reconstruction from packet captures of HTTP downloads, make sure you take chunked transfer encoding into account.