Where Did Part Of My File Go?

I’ve primarily been a Windows software developer for the last 12 years. During that time, I’ve written lots of web sites, desktop applications, and server applications. I recently changed jobs and am now working mainly on an IBM mainframe and Unix. I occasionally get to do some Windows applications, but they are few and far between. Luckily for me, today was one of the days when I got to work on a Windows application. It was a simple job: pull a file from a server, do some minor processing, and then FTP it to a Unix server for final processing.

In my attempt to be a good developer, I spent a significant portion of my time testing the application and adding try/catch blocks to make it as safe as possible from any issues. As part of that safety, I decided to validate that each file I FTPed to the Unix server did indeed get there. After I finished the FTP, I did a directory listing with “ls -l” (the Unix equivalent of “dir” in DOS). The listing came back with all the files in the directory and their sizes.

I wrote a loop to compare the sizes of the Windows files on the local machine with the sizes of the files transferred to Unix. To my surprise, none of the files I transferred were the same size! I was perplexed by this. After downloading the plain text .csv file from the Unix server to a different folder, I checked the properties, and it was indeed smaller than the original file I had uploaded. I tried to open it, figuring that it would fail, but it opened correctly. I looked at a sampling of the lines in the file and they had the same data.

I was quite frustrated at this and Googled around looking for answers. There were many posts about files being encoded differently, but research in this realm brought me no closer to the answer. Both files appeared to be encoded as ANSI. I decided to resort to the lowest level of debugging I could think of: I downloaded a hex editor to look inside the files and see what data they held. There in the hex, comparing the files side by side, I found the issue that had been eluding me.

Windows terminates its lines with CHR(13) CHR(10) [Carriage Return, Line Feed]
Unix terminates its lines with CHR(10) only [Line Feed].

I was losing one byte of size per newline in the file I uploaded. It seems the line endings were converted during the FTP upload; this is what FTP does when a transfer runs in ASCII (text) mode, which I assume is what happened here. When downloaded and tested on my Windows machine, the smaller .csv file could still be opened and processed by Microsoft Excel. This explains why the file size was smaller, but all of the data was still accessible.
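To make the arithmetic concrete, here is a minimal sketch that counts how many bytes a piece of text occupies once each CR LF pair is collapsed to a bare LF. The class and method names (LineEndingDemo, SizeAfterCrLfToLf) are made up for illustration, and it assumes plain single-byte (ASCII) text.

```csharp
using System;
using System.Text;

public static class LineEndingDemo
{
    // Hypothetical helper: byte count of the text after every
    // CR LF (0x0D 0x0A) pair is collapsed to a single LF (0x0A).
    // Assumes the text is plain ASCII, one byte per character.
    public static int SizeAfterCrLfToLf(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        int size = bytes.Length;

        for (int i = 0; i + 1 < bytes.Length; i++)
        {
            // Each CR LF pair loses exactly one byte (the CR).
            if (bytes[i] == 0x0D && bytes[i + 1] == 0x0A)
                size--;
        }

        return size;
    }
}
```

For example, "id,name\r\n1,Hogan\r\n" is 18 bytes with Windows line endings, and LineEndingDemo.SizeAfterCrLfToLf returns 16 for it: two lines, two bytes lost.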

Windows Hex Output. Note the highlighted square “0D”. That is hex 0x0D, chr(13), for carriage return. It is followed by 0x0A, chr(10), for line feed.

Note that the Unix file below has only the 0x0A, chr(10), line feed in the same position; the carriage return has been stripped out. This accounts for the file size difference.
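If you don’t have a hex editor handy, a few lines of C# can show the same thing. This is only a sketch: HexPeek and FirstBytesAsHex are names invented for this example, and it simply prints the first few bytes of a file as hex pairs.

```csharp
using System;
using System.IO;

public static class HexPeek
{
    // Hypothetical helper: returns up to 'count' bytes from the start
    // of the file as space-separated hex pairs, e.g. "61 2C 62 0D 0A".
    public static string FirstBytesAsHex(string path, int count)
    {
        byte[] buffer = new byte[count];
        int total = 0;

        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            // Read may return fewer bytes than requested, so loop
            // until the buffer is full or the file ends.
            while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
                total += read;
        }

        // BitConverter.ToString yields "61-2C-62"; swap dashes for spaces.
        return BitConverter.ToString(buffer, 0, total).Replace("-", " ");
    }
}
```

Run against a file containing the line "a,b" with a Windows line ending, this gives "61 2C 62 0D 0A"; the same line with a Unix ending gives "61 2C 62 0A".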

To continue my quest for fewer errors and validated data, I wrote a method, which I’ll share with you, to get the size the file will have on Unix.

// Requires: using System.IO;

/// <summary>
/// Get the size a file will have on a Unix system. This entails counting all of the
/// CHR(13) characters and subtracting that count from the overall
/// size, as Unix doesn’t use them in a newline. Stupid Unix!
/// </summary>
/// <param name="fileName">Path to the local (Windows) file.</param>
/// <returns>The Windows file size minus one byte per carriage return.</returns>
private static long GetUnixFileSize(string fileName)
{
    long windowsFileSize = new FileInfo(fileName).Length;
    long unixFileSize = windowsFileSize;

    using (StreamReader sr = new StreamReader(fileName, true))
    {
        char[] c = new char[1];
        const char char13 = (char)13;
        while (sr.Read(c, 0, 1) == 1)
        {
            // If we find a carriage return, decrement the count.
            if (c[0] == char13)
                unixFileSize--;
        }
    }

    return unixFileSize;
}
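One caveat with the method above: it subtracts a byte for every carriage return, even one that isn’t followed by a line feed. Here is a slightly stricter variant as a sketch. It reads raw bytes and only discounts CR LF pairs, on the assumption (mine, not verified against any particular FTP server) that the transfer converts CR LF to LF and leaves a lone CR alone. UnixSizeEstimator and EstimateUnixFileSize are illustrative names.

```csharp
using System.IO;

public static class UnixSizeEstimator
{
    // Hypothetical variant: estimated size of the file after each
    // CR LF (0x0D 0x0A) pair is converted to a single LF.
    // Assumption: a CR not followed by LF is kept as-is.
    public static long EstimateUnixFileSize(string fileName)
    {
        long size = 0;

        using (FileStream fs = File.OpenRead(fileName))
        {
            int previous = -1;
            int current;

            // ReadByte returns -1 at end of file.
            while ((current = fs.ReadByte()) != -1)
            {
                size++;

                // A CR LF pair collapses to one byte on the Unix side.
                if (previous == 0x0D && current == 0x0A)
                    size--;

                previous = current;
            }
        }

        return size;
    }
}
```

The result can then be compared against the size reported in the remote directory listing. For a normal Windows text file where every carriage return is part of a CR LF pair, both versions give the same answer.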

Hope this saves some of you time!

Hogan
