Reading the BIC file: a programmer's perspective


    So, I have been working on a little tool to read in the Civilization III scenario files and represent the data in them in some kind of nicely structured way, and that has me stumped. While I am able to extract what I need, the structured part bothers me. So, here is a question to all of you who have dealt with this: how do you read the data from the BIC file and store it?

    I have identified three major approaches to reading the file:

    1. The straightforward approach: hard-code the structure of the file into your program, and read the file sequentially. In other words, your code would look like this:

    Read 4 bytes
    put that into the file identifier variable

    Read 4 bytes
    check that this says VER#

    Read 4 bytes
    Put that into the number of headers variable

    Read 4 bytes
    Put that into the header length variable

    Read 4 bytes
    Put that into the major version variable

    etc, etc, ad infinitum
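To make the sequential approach concrete, here is a minimal Python sketch. The field order simply mirrors the pseudocode above, which the post itself says is written from memory, so it should not be taken as the documented BIC layout. The function names, the little-endian assumption, and the sample bytes are made up for illustration.

```python
import io
import struct

def read_u32(stream):
    """Read one little-endian unsigned 32-bit integer (assumed byte order)."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("unexpected end of file")
    return struct.unpack("<I", data)[0]

def read_header(stream):
    """Read the first few fields strictly in file order (approach 1)."""
    file_id = stream.read(4)               # file identifier
    tag = stream.read(4)                   # must say VER#
    if tag != b"VER#":
        raise ValueError(f"expected VER#, got {tag!r}")
    return {
        "file_id": file_id,
        "header_count": read_u32(stream),
        "header_length": read_u32(stream),
        "major_version": read_u32(stream),
    }

# Made-up sample bytes, only to exercise the reader.
sample = b"BICX" + b"VER#" + struct.pack("<III", 1, 720, 11)
header = read_header(io.BytesIO(sample))
```

Every additional field means another hand-written read in exactly the right place, which is the maintenance burden the post complains about.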

    This approach seems like a very big pain in the @$$, because you essentially need to put all of the structure of the file into one big blob of code. Of course, you could break it up into pieces, and then from your main reading method say:

    ReadFileIdentifier();
    ReadVersion();
    ReadBuildings();
    etc;

    but that still doesn't really eliminate the problem.

    2. Read the file and when a known header is encountered, read the section. So, your code would look like this:

    Read four bytes.
    If they are BLDG, read the building section.
    If they are ESPN, read the espionage section.
    etc...
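A rough Python sketch of this dispatch-by-header idea follows. It assumes every section is laid out as a 4-byte tag, a 4-byte entry count, and then a 4-byte length plus payload per entry, which is what lets an unknown section be skipped. The handlers and the sample bytes are placeholders, not real BIC decoding.

```python
import io
import struct

def read_entries(stream):
    """Assumed layout: u32 entry count, then (u32 length, payload) per entry."""
    (count,) = struct.unpack("<I", stream.read(4))
    entries = []
    for _ in range(count):
        (length,) = struct.unpack("<I", stream.read(4))
        entries.append(stream.read(length))
    return entries

# Hypothetical handlers: real ones would decode each payload.
HANDLERS = {
    b"BLDG": lambda entries: [("building", e) for e in entries],
    b"ESPN": lambda entries: [("espionage", e) for e in entries],
}

def read_sections(stream):
    sections, skipped = {}, []
    while True:
        tag = stream.read(4)
        if len(tag) < 4:
            break                       # end of file
        entries = read_entries(stream)
        handler = HANDLERS.get(tag)
        if handler is None:
            skipped.append(tag)         # unknown section: ignore it
        else:
            sections[tag] = handler(entries)
    return sections, skipped

def make_section(tag, payloads):
    """Build made-up section bytes for the demo below."""
    body = struct.pack("<I", len(payloads))
    for p in payloads:
        body += struct.pack("<I", len(p)) + p
    return tag + body

sample = (make_section(b"ESPN", [b"spy"])
          + make_section(b"ZZJY", [b"??"])
          + make_section(b"BLDG", [b"a", b"b"]))
sections, skipped = read_sections(io.BytesIO(sample))
```

The sections arrive in a different order than the handler table lists them, and the unknown ZZJY section is skipped without breaking anything.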

    This does improve on the code a little bit, because now you are more or less independent of the order of sections, and if the developers decide to add a new section to the BIC file, that won't break your code - it will just ignore the new section. However, that still does not solve all of the problems: you are still dependent on the hard-coded format of one entry. For instance, for a citizen entry, you would still have to read it the same way you would have read the whole file:

    Read four bytes
    that is the length of data

    read 32 bytes
    that is the name of the citizen

    read 32 bytes
    that is the pedia entry for the citizen

    etc.
    etc.

    (Note: the numbers may not be correct, as I am doing this from memory. It is not my intention to provide the BIC format document here, though, but rather to just illustrate the point.)

    3. Sort of a variation on #2: read through the whole file, searching for four consecutive capital letters followed by a number. These conditions are only satisfied by section identifiers, such as BLDG, CTZN, etc. Store the positions of the headers in a vector. Then go through the positions, identify what kind of section each one is, and read the data from there appropriately. This approach still has the same problem as #2: you hard-code the format of a single entry of data for each type.
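The scanning step could be sketched in Python roughly as below. Since any four bytes can be read as a number, the sketch adds a sanity bound on the count that follows the tag. That bound, the function name, and the sample bytes are assumptions for illustration.

```python
import re
import struct

# Lookahead so that overlapping runs of capital letters are all examined.
TAG = re.compile(rb"(?=([A-Z]{4}))")
MAX_COUNT = 10_000   # hypothetical sanity bound on the entry count

def find_section_candidates(data):
    """Return (offset, tag, count) for every plausible section header."""
    found = []
    for match in TAG.finditer(data):
        pos = match.start()
        if pos + 8 > len(data):
            continue                     # no room for a count after the tag
        (count,) = struct.unpack_from("<I", data, pos + 4)
        if count <= MAX_COUNT:
            found.append((pos, match.group(1).decode("ascii"), count))
    return found

# Made-up bytes: two section headers separated by filler.
data = (b"\x00\x00" + b"BLDG" + struct.pack("<I", 2) + b"\x07" * 6
        + b"CTZN" + struct.pack("<I", 1) + b"\x00\x00\x00")
positions = find_section_candidates(data)
```

Note that a tag such as VER# contains a non-letter, so a strict "four capital letters" pattern would not find it.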

    Now, the question is: why is it even so important to be concerned with hard-coding the format of the BIC file? Well, the problem is that the format apparently changes quite a bit. And especially if we want to read all the different versions of the BIC file, we need quite a bit of flexibility.

    Soooo, the way I am reading and storing the data is this:

    I have a separate class for each type of entry. Each class has data members corresponding to the data format of the entry. So, for instance, a class for a citizen would look like this:

    Code:
    public class Citizen
    {
        // Fields are declared in the order they appear in the file.
        public int Length;
        public int DefaultCitizen;
        public string SingularName;
        public string CivilopediaEntry;
        public string PluralName;
        public int Prerequisite;
        public int LuxuryBonus;
        public int ResearchBonus;
        public int TaxBonus;
    }
    Then, when I encounter a known section, I get all of the data members of the class through reflection, and read the data according to them. The problem here, however, is that the order in which I read the data is important, for obvious reasons. That is no big deal if the compiler does not perform any optimizations during compilation, because then all of the data members of the class are returned in the exact same order as they were declared in the source. If I compile with optimizations, however, the order of the data members is changed, and I can no longer rely on this system to support keeping track of the file format.
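For comparison, here is a sketch of the same reflection-driven idea in Python rather than .NET. It uses `dataclasses.fields()`, which returns fields in declaration order, so the read order does not depend on compiler settings. The field widths (32-byte strings, 4-byte integers) follow the rough numbers quoted earlier in the post and are assumptions, as are the helper names `spec` and `read_record` and the sample values.

```python
import io
import struct
from dataclasses import dataclass, field, fields

def spec(fmt):
    """Attach a struct format code to a dataclass field."""
    return field(metadata={"fmt": fmt})

@dataclass
class Citizen:
    length: int = spec("<i")
    default_citizen: int = spec("<i")
    singular_name: str = spec("<32s")
    civilopedia_entry: str = spec("<32s")
    plural_name: str = spec("<32s")
    prerequisite: int = spec("<i")
    luxury_bonus: int = spec("<i")
    research_bonus: int = spec("<i")
    tax_bonus: int = spec("<i")

def read_record(cls, stream):
    """Read one record by walking the class's fields in declaration order."""
    values = []
    for f in fields(cls):
        fmt = f.metadata["fmt"]
        raw = stream.read(struct.calcsize(fmt))
        (value,) = struct.unpack(fmt, raw)
        if isinstance(value, bytes):
            value = value.rstrip(b"\x00").decode("latin-1")
        values.append(value)
    return cls(*values)

# Made-up sample record, only to exercise the reader.
sample = struct.pack("<ii32s32s32siiii", 116, 1, b"Laborer",
                     b"CTZN_Laborer", b"Laborers", -1, 0, 0, 0)
citizen = read_record(Citizen, io.BytesIO(sample))
```

The layout lives in one place (the class declaration), and the reader itself never mentions a citizen-specific field.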

    So, that's that. Now it's your turn to share the experience.
    XBox Live: VovanSim
    xbox.com (login required)
    Halo 3 Service Record (I fail at FPS...)
    Spore page

  • #2
    I take a similar approach to your #2 in Perl. Note that the hardcoding of the sections does not actually seem like that big a deal to me, because they normally try to maintain forward compatibility. What I do, algorithmically, is this:
    1. Read the first 4 bytes to see if it is BIC/BICX. If not, run it through the decompressor and try again. If still not, give up.
    2. Read the next 4 bytes. If those correspond to one of the 27 sections I think I can handle (e.g. VER# or BLDG but not ZZJY), I hand it off to a generic parser function, telling that function what the section ID is. If they don't correspond, I spit out a warning but hand it off to the generic parser anyhow. Since the generic parser always gets the next data, I could probably just read it there, but this is legacy from my original code, which was similar to option 1, and I'm not changing it if it still works.
    3. The Generic Parser is now in control. It reads the 4 bytes for the number of things to process and then loops over them, reading the length of the entry and then calling a specific parser to handle the details of the entry. (e.g. if this is the BLDG section, a parseBLDG() function is called during each loop iteration storing the first 64 bytes as a 'description' string, the next 32 bytes as a 'name' string, etc.). The specific parsers return the number of bytes they actually processed and the Generic parser then error-checks this, spitting out warnings on mismatches and shoving any unused data into an 'excess' area of the data structure for that entry. The generic parser makes a valiant attempt at handling unknown sections by trying to simply store the entire data length as a hex string in a manner similar to the excess data section on known sections.
    4. The specific parsers are pretty well hardcoded. They assume the data is in a given format and parse it accordingly. They should probably check the given entry length to make sure they don't try to read more than is there, but it's currently not that robust. In a half-dozen or fewer cases where I know of major differences between version 4.01 and 11.18 I make checks against the VER# entry to see what version this is and handle it accordingly.
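The generic/specific parser split described in steps 3 and 4 might look like the following Python sketch (the post's own code is Perl and is not shown). The 64-byte description and 32-byte name come from the post's BLDG example; the function names, the `excess` key, and the sample bytes are hypothetical.

```python
import io
import struct

def parse_bldg(payload):
    """Hypothetical specific parser: returns (fields, bytes_consumed)."""
    description = payload[0:64].rstrip(b"\x00").decode("latin-1")
    name = payload[64:96].rstrip(b"\x00").decode("latin-1")
    return {"description": description, "name": name}, 96

SPECIFIC_PARSERS = {b"BLDG": parse_bldg}

def parse_section(tag, stream, warnings):
    """Generic parser: read the count, then loop over length-prefixed entries."""
    (count,) = struct.unpack("<I", stream.read(4))
    entries = []
    for index in range(count):
        (length,) = struct.unpack("<I", stream.read(4))
        payload = stream.read(length)
        parser = SPECIFIC_PARSERS.get(tag)
        if parser is None:
            # Unknown section: keep the raw bytes as a hex string.
            entries.append({"excess": payload.hex()})
            continue
        entry, used = parser(payload)
        if used != length:
            warnings.append(
                f"{tag.decode()}[{index}]: used {used} of {length} bytes")
        entry["excess"] = payload[used:].hex()   # whatever was not parsed
        entries.append(entry)
    return entries

# Made-up entry with two trailing bytes the specific parser does not know.
payload = (b"Palace desc".ljust(64, b"\x00")
           + b"Palace".ljust(32, b"\x00") + b"\x01\x02")
section = struct.pack("<II", 1, len(payload)) + payload
warnings = []
entries = parse_section(b"BLDG", io.BytesIO(section), warnings)
```

Because the generic parser always consumes exactly the stated length, a mismatch in a specific parser produces a warning instead of knocking the rest of the file out of alignment.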

      Several assumptions are made here. I assume the VER# section has already been handled, and I assume that all BICs are essentially the same, as are all BIXes, etc. These kinds of shortcuts can be taken because the userbase is small. 99% of the things I parse are going to be created by the editor, and I know what to expect from it; most people who have programs that write BICs are also going to try to keep the format as close as possible to that produced by the editor (and posted here), to be sure that the game engine reads it correctly.


    As for data storage, Perl allows me great flexibility and I use that to my advantage. The entire BIC file is stored in a single hash. The first-level entries are one for each section type. The second-level entries are a count of how many things of this type I have and an individual entry for each thing. The third level is broken up by data type, simply to make it easier for me to read the values out on a printed dump of the structure; the fourth level is essentially a variable name; and the fifth level is the actual data. Normally this is a scalar, but in some situations it is a hash as well. So, if I want to know how many BLDG definitions there are, I check $BIC{'BLDG'}{'count'}, and if I want to know what the movement cost of the first terrain type is, I check $BIC{'TERR'}{0}{'value'}{'movement_cost'}.
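The same nested layout can be sketched with plain Python dictionaries. The `store` helper, the `"string"` data-type key, and all of the stored values are assumptions for illustration; only the `count`, `value`, and `movement_cost` keys come from the post.

```python
def store(bic, section, index, kind, name, data):
    """Place one value at bic[section][index][kind][name], keeping a count."""
    sec = bic.setdefault(section, {"count": 0})
    if index not in sec:
        sec[index] = {}
        sec["count"] += 1
    sec[index].setdefault(kind, {})[name] = data

bic = {}
store(bic, "TERR", 0, "value", "movement_cost", 1)
store(bic, "TERR", 0, "string", "name", "Desert")
store(bic, "BLDG", 0, "string", "name", "Palace")
store(bic, "BLDG", 1, "string", "name", "Barracks")

building_count = bic["BLDG"]["count"]
first_terrain_cost = bic["TERR"][0]["value"]["movement_cost"]
```

The two lookups at the end correspond to the Perl expressions $BIC{'BLDG'}{'count'} and $BIC{'TERR'}{0}{'value'}{'movement_cost'}.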

    I think option 2 makes far more sense than option 3, because you are less likely to run into problems. The only way the format will change is if a game patch changes it. These are infrequent occurrences, and odds are you will know about them and be able to adjust if necessary. Here's an admittedly contrived example to show the problem with option 3. Say I add a Technocrat citizen and I just slap TECH into the name entries on it as a placeholder. You now run the risk of mistakenly thinking that's the start of the TECH section. It seems to me that this is far more likely to happen than the section definitions changing so completely that well-written option 2 code breaks.
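The Technocrat example can be reproduced with a few lines of Python. The byte layout below is invented (two 32-byte name fields followed by a real section header); it is only meant to show how a scanner that looks for four capital letters gets fooled.

```python
import re

# Overlapping search for four consecutive capital letters.
TAG = re.compile(rb"(?=([A-Z]{4}))")

# Made-up bytes: a citizen entry whose two 32-byte name fields hold the
# placeholder "TECH", followed by the start of a real TECH section.
citizen_names = b"TECH".ljust(32, b"\x00") + b"TECH".ljust(32, b"\x00")
real_section = b"TECH" + (5).to_bytes(4, "little")
data = b"\x00" * 4 + citizen_names + real_section

hits = [m.start() for m in TAG.finditer(data) if m.group(1) == b"TECH"]
false_positives = hits[:-1]   # everything before the real header
```

The scanner reports three TECH headers where only the last one is real, and the "followed by a number" check does not help, because the zero padding after each placeholder is itself a valid number.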

    If they really mess with the BIC format in the future (like for example when they removed some initial unknown values in CULT) I will simply adjust to it and/or cut off support for the old versions. I don't think you really need all that much flexibility; it's not like you're getting a patch a week with format changes. I am quite willing to only support vanilla version 1.29f and tailor my PTW support to the latest patch as well, handling small inconsistencies with previous versions as they show up and are reported. Granted, my scripts will have a far smaller user base once they are released than something like Gramphos' Multitool, but I think the same general approach would apply.



    • #3
      In C3MT the approach is like your #2, with the addition that it always reads the length of a section. That way it handles many versions by simply stopping when the length runs out, or by skipping the remaining bytes when all known data has been loaded but the length hasn't been reached.
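A length-bounded entry reader along these lines could be sketched as follows. This is not C3MT's code (which is written in VB and not shown in the thread); the field names, defaults, and sample values are invented to illustrate the stop-early and skip-the-tail behaviour.

```python
import struct

# Hypothetical entry layout: (name, struct format, default when absent).
KNOWN_FIELDS = [
    ("cost", "<i", 0),
    ("culture", "<i", 0),
    ("upkeep", "<i", 0),    # pretend this was added in a later version
]

def read_entry(payload):
    """Decode as many known fields as the payload holds; report the rest."""
    entry = {name: default for name, _, default in KNOWN_FIELDS}
    offset = 0
    for name, fmt, _ in KNOWN_FIELDS:
        size = struct.calcsize(fmt)
        if offset + size > len(payload):
            break                          # older, shorter entry: stop here
        (entry[name],) = struct.unpack_from(fmt, payload, offset)
        offset += size
    skipped = len(payload) - offset        # newer, longer entry: ignore tail
    return entry, skipped

old_entry, old_skipped = read_entry(struct.pack("<ii", 80, 3))
new_entry, new_skipped = read_entry(struct.pack("<iiii", 80, 3, 1, 99))
```

The caller is expected to read exactly the stated length from the file before calling `read_entry`, so the stream stays aligned whichever version wrote the entry.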

      However, the system I use to load the data isn't optimized for the file structure, and can't handle a large expansion of the format. Therefore I'm planning to rewrite the BIC/BIX handling part of my tool. For now I'm following the principle "If it works, don't change it." However, I think I'll have to change it by the time Conquest appears, so I should probably start building a new system before it arrives.

      One of the problems that I deal with is that the file format isn't very VB friendly...
      Creator of the Civ3MultiTool



      • #4
        Thanks for the responses, pdescobar and Gramphos.

        So, it looks like there isn't any particularly easy way to have the flexibility in the code, yet it also seems there isn't THAT much need for it. Well, I guess I can leave my code for loading the BIC alone for now.


        • #5
          Yes. I don't think we will see any big changes to the BIC format before Conquest gets out. But by then I'll probably need to have changed my code to a more flexible way of loading. Otherwise, the time to make C3MT compatible with Conquest might be extra long, or I'll have to do some really bad workarounds to keep building on my current system, which got close to its limits with the release of PTW.


          • #6
            In my "tinker" program, I am also using Perl.

            I took the approach of using Perl's OO programming capabilities and created a base class that handles all of the basic file I/O.

            What I do is create an array of format strings. These format strings are the same strings used by pack and unpack. So you'll end up with an array something like ("a4", "V", "V", "C", "v")... If you know Perl, these will make sense. ;-) Then, using this array, I read in the appropriate number of bytes from the file, and unpack them into another array at the same position.

            So in the derived classes, this format array is defined as well as any "access" functions for particular data fields.
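Python's `struct` module plays the same role as Perl's pack and unpack, so the format-array design can be sketched as below. The codes `<4s`, `<I`, `<B`, and `<H` are the Python counterparts of Perl's `a4`, `V`, `C`, and `v`. The class names, the `tag` accessor, and the sample values are hypothetical and do not describe the real terrain layout.

```python
import io
import struct

class Section:
    """Base class: does all the I/O; subclasses only declare FORMATS."""
    FORMATS = []    # one struct code per field, in file order

    def __init__(self):
        self.values = []

    def read(self, stream):
        self.values = []
        for fmt in self.FORMATS:
            raw = stream.read(struct.calcsize(fmt))
            self.values.append(struct.unpack(fmt, raw)[0])
        return self

    def write(self, stream):
        # The same format list drives writing, so no second set of
        # hand-written pack calls is needed.
        for fmt, value in zip(self.FORMATS, self.values):
            stream.write(struct.pack(fmt, value))

class Terrain(Section):
    # Python equivalents of Perl's ("a4", "V", "V", "C", "v")
    FORMATS = ["<4s", "<I", "<I", "<B", "<H"]

    @property
    def tag(self):
        """Example accessor for one particular field."""
        return self.values[0]

# Made-up sample bytes, only to exercise the round trip.
sample = struct.pack("<4sIIBH", b"TERR", 14, 200, 1, 3)
terrain = Terrain().read(io.BytesIO(sample))
out = io.BytesIO()
terrain.write(out)
```

Since one list describes the layout in both directions, writing the data back out costs nothing extra once reading works.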

            I currently only have the terrain section completed, as I was tinkering with the concept of a new map generation algorithm.


            I would imagine that with C++ you could do something similar.

            There are really only a small number of data types.

            fixed length text
            4 byte integer (little endian, intel/vax order)
            2 byte integer (little endian)
            1 byte int/char (I treat them as unsigned for now)

            I think that was it.



            • #7
              BlueWlvrn: I am pleasantly surprised to see that someone else is using Perl to play around with the BICs. It sounds like you've got a "cleaner" method for dealing with the data than I use, though, as it'll be a real pain for me to add writing the BIC data back out (9,000,000 pack statements would have to be added to undo my 9,000,000 unpacks with my current method).

              To go a bit off-topic, how are you handling resource allocation in your map generator? The standard game allocator confuses me greatly.



              • #8
                Originally posted by pdescobar

                To go a bit off-topic, how are you handling resource allocation on your map generator? The standard game allocator confuses me greatly
                I haven't actually gotten that far. I was mostly tinkering with the landscape first. (And coming up w/a clean way to read and write the BIC)

                I haven't done much with it in a week or so.
