GDA Release, October 14, 2002. 1. LICENSE This software is provided subject to the following terms and conditions, which you should read carefully before using the software. Using this software indicates your acceptance of these terms and conditions. If you do not agree with these terms and conditions, do not use the software. Copyright (c) 2002, Agere Systems All rights reserved. Redistribution and use in source or binary forms, with or without modifications, are permitted provided that the following conditions are met: o Redistributions of source code must retain the above copyright notice, this list of conditions and the following Disclaimer in comments in the code as well as in the documentation and/or other materials provided with the distribution. o Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following Disclaimer in the documentation and/or other materials provided with the distribution. o Neither the name of Agere Systems nor the names of the contributors may be used to endorse or promote products derived from this software without specific prior written permission. Disclaimer THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, INFRINGEMENT AND THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. ANY USE, MODIFICATION OR DISTRIBUTION OF THIS SOFTWARE IS SOLELY AT THE USER'S OWN RISK. IN NO EVENT SHALL AGERE SYSTEMS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, INCLUDING, BUT NOT LIMITED TO, CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 2. DESCRIPTION GDA is a tool for scalable analysis of C code. Built into this distribution of gda is an implementation of Andersen's aliasing analysis using sub-transitive graphs, as well as a dependence analysis for tracking dependencies between variables, fields etc. In addition, this release contains separate source code for demand-driven pointer analysis -- see Section 6 and ddpa/README. Central to the design of gda is a database view of programs. The first phase of gda is a compilation process that parses source files and extracts assignments and function calls/returns/definitions, and writes an object file that is basically an indexed database structure of these basic program components. Complex assignments are broken down into primitive ones by introducing temporary variables. The elements of these database files involve variables/fields and (typically) at most one operation. These databases can be used for a wide variety of analyses, including essentially any analysis that reasons about the flow of data from one variable/field to another e.g. aliasing analysis, dependence analysis, constant propagation analysis etc. The format of the databases follows the design of elf executable files, and consists of header section followed by a arbitrary number of data sections. One advantage of this format is that it supports the addition of new sections with new program information in a way that does not break previous generations of gda-based tools. The intent of the database infrastructure is to maximize infrastructure reuse during analysis implementation. Client analyses interact with the databases using a generic interface. Critical to performance is the indexing scheme used. This scheme allows most analyses to locate necessary information in one lookup (e.g. variables/fields are reference by the offset of their definition block within the database). It also supports load-on-demand schemes to reduce memory requirements of client analyses i.e. clients do not need to load the entire database into memory. 3. BUILDING GDA These build notes assume a Linux environment and have been tested on Redhat Linux 7.0. gda was originally developed under MIPS/Irix and then ported to Linux. The source should still compile under MIPS/Irix (but might require a tweak of compilation flags). gda consists of three components. gdac: this is the component that parses C code and generates the database files; it is written in sml and uses the ckit C parser (see ml/) gdal: this component links database files together (see c/) gda: this tool performs Anderson's analysis and a dependence analysis that track dependencies (see c/) To build gdac, you will need smlnj 110 or later -- see http://cm.bell-labs.com/cm/cs/what/smlnj/. 1. cd to ml/ 2. start sml 3. at the sml prompt, type: CM.make() 4. the resulting build will generate a file gdac.MACHINE-TYPE, where MACHINE-TYPE is your machine/os configuration 5. if you are running Linux then skip to step 6. if you are NOT running Linux then edit gdac and replace ./gdac.x86-linux by ./gdac.MACHINE-TYPE, where MACHINE-TYPE is your machine/os configuration 6. run ./gdac and you should get: Usage: gdac -f configfile or gdac file Use -help flag for more options To build gdal and gda 1. cd to c/ 2. type make 3. run ./gdal and you should get: Usage: "gdal [-d] [-q] [-o targetfile] [-f configfile] [sourcefile1] [sourcefile2] ..." Where: targetfile is the name of the output file ("gdal.gda" by default), sourcefile1 sourcefile2 ... are input files, configfile is a file containing the names of other input files. 4. run ./gda and you should get: usage: gda - "file.gda" for interactive mode or: gda "file.gda" ["targets" ["nontargets"]] in batch mode flags in batch mode: -q: quiet mode -qq: quiet mode -nr: no results The directory test/ contains a trivial test of gda. To use this test, run: % cd ml % ./gdac ../test/test.c % cd ../test % ../c/gda - test.gda ? lvals ? seed x ? go ? deps ? quit % For help and information about options: gdac: use the "-help option" gdal: running gdal without arguments gives usage information. gda: running "gda -" puts gda into interactive mode; typing "help" at the interactive prompt lists the commands available. A full transcript of this test is given in test/test.transcript Documentation on the various options for the tools is given in Appendices I-IV. 4. PRE-COMPILED BINARIES The directory pre-compiled-bin/ contains pre-compiled versions of gda, gdal and gdac. Building from scratch is strongly recommended; these binaries are provided if that is not possible. These were built using Standard ML of New Jersey, Version 110.0.7, September 28, 2000 on Redhat Linux 7.0. Copy them across using: % cp pre-compiled-bin/gdac.x86-linux ml % cp pre-compiled-bin/gda c % cp pre-compiled-bin/gdal c 5. CODE OVERVIEW ml/ This directory contains the ml sources for gdac compiler. This is built on top of ckit (in ml/ckit/). drive.sml: top level function; parses options; calls ckit on individual sources. Note: the exportFn at the end of drive.sml, which generates the gdac "binary" (see documentation for smlnj). read-config.sml: reads a configuration file Note: only readConfig and getFileNames are used at present; the rest of the code is no longer used by gda. GDAFile.sml: writes gda database file track.sml: walks over ckit Ast trees and builds list of primitive dependencies. output.sml: library for writing binary files. types.sml: some basic types used througout gdac c/ This contains code for both gda and gdal (the linker). Note that the linker just collects the primitive dependencies from a collection of database files and re-indexes them. The output of gdal is is another database file. The current gdal is functional, but not fast -- it is the slowest part of the system at present, and no effort has been put into speeding it up (any volunteers?). hashtable.c list.c bigarray.c, storage.c: generic code for various kinds of storage needs. mapfile.c: generic code for mapping a file into memory. fatal.c: error handling code. main.c: top-level code for gda. set.c: implementation of points-to sets using the sub-stransitive graph algorithm. aliasing.c: points-to analysis code. analyze.c: dependence analysis code. targets.c: code for initializing and managing targets (for the dependence analysis). database.c: code for reading the gda database files. print.c: code for printing analysis results e.g. points-to sets, dependencies, ... link.c: top-level code for linker. readGda.c writeGda.c: access to GDA files within linker. registry.c: datastructure for linking together references across different database files. file.c: some low-level helpers for writing database files. ddpa/ Demand-driven pointer analysis code -- section Section 6 and ddpa/README. 6. DEMAND-DRIVEN POINTER ANALYSIS The directory ddpa/ contains complete source code for an implementation of demand-driven pointer analysis based on GDA. See ddpa/README for more details. 7. RELATED PAPERS Postscript for these papers appears in the papers/ directory. "Analysis of Large Code Bases: The Compile-Link-Analyze Model", Nevin Heintze, unpublished report, November 1999. [papers/cla.ps] "Ultra-fast Aliasing Analysis using CLA: A Million Lines of C Code in a Second" Nevin Heintze and Olivier Tardieu, PLDI'01. [papers/pldi01-1m.ps] "Demand-Driven Pointer Analysis" Nevin Heintze and Olivier Tardieu, PLDI'01. [papers/pldi01-ddpa.ps] APPENDIX I: gda interactive mode options * help -- print list of commands * quit -- end gda interactive session * show [strings] [types] [locations] [anonymous] * hide [strings] [types] [locations] [anonymous] -- These two commands control how dependency chains are printed out. "show" explicitly requests that the information be printed and "hide" explicitly requests that the information be hidden. "strings" determines whether to print names of variables/fields, "types" determines whether to print types of variables/fields, "locations" determines whether to print locations of variables/fields, and "anonymous" determines whether to anonyous variables (e.g. temporary variables introduced by gdac). The default is to print strings, types and loctions, but not to print omit anonymous variables. seed ["pattern"] ... -- Sets the target(s) for dependence analysis. As a simple example, "seed w" indicates that the dependence analysis should find all variables/fields dependent on variables/fields with name "x" (i.e. "x" is the target variable). A list of patterns can be given. Each pattern has the format: name[/type][] where name is a variable or field name, type is a C type, and location is a file location (either of form file:n or file:n-m e.g. "eg.c:34" or "eg.c"34-40"). All but the name component are optional. skip ["pattern"] ... -- Specifies variables/fields to be ignored. Multiple patterns can be specified. As for the seed command, each pattern has the format name[/type][] load "file" -- Load seed patterns from a file loadn "file" -- Load skip patterns from a file. reset -- Reset the state of the system. limit "number" -- For commands that potentially generate many lines of output (e.g. allvars, depvars, lvals, etc), a limit is placed on the number of items printed. More items can be printed using the "next" command. The default number of items printed is 20; the limit command changes this default limit count. forward -- Perform forward dependency analysis. backward -- Perform backward dependency analysis. unification -- Perform simultaneous backward and forward dependence analysis. go -- Run dependence analysis (aliasing analysis is run when the interactive loop is started). filter ["pattern"] -- Only show dependencies that match this filter pattern next -- Print next block of items. allvars -- Print all variables/fields (subject to filter and skip settings) depvars -- Print dependent variables/fields. targets -- Print target variables/fields. nontargets -- Print dependent variables/fields that are not targets. lvals -- Print variables/fields that have non-empty points-to sets. chains -- Print the chains of dependencies (i.e. for each dependent variable, print the chain of dependencies that shows why this variable is dependent on the target). deps -- Only prints out the first item in the dependence chain (i.e. the item that this variable/field is directly dependent on) revdeps -- Print out the reverse of the command "deps" (i.e. view the output of deps as a set of edges x -> y, then revdeps prints out the edges y -> x). APPENDIX II: gda batch mode options Invoking gda with an empty command line gives the following usage information: usage: gda - "file.gda" for interactive mode or: gda "file.gda" ["targets" ["nontargets"]] in batch mode flags in batch mode: -q: quiet mode -qq: quiet mode -nr: no results The second usage line is for batch mode. In this mode, a gda file is given and optionally target and non-target files are given. If no target file is given, the points-to sets are printed. The flags -q and -qq are identical; both supress statistics information such as the size of the points-to sets, targets and dependencies. The flag -nr suppresses points-to set and dependency output. APPENDIX III: gdac options Invoking gdac -help gives the following usage information: Usage: gdac -f configfile or gdac file Additional Flags: -malloc mallocName: treats function mallocName as a malloc function; each location of a call to mallocName generates a new lval. -noFields: field of structs and unions are ignored e.g. e.f1, e.f2, ... are treated as e. -commonTypeFields: common initial sequences of fields in structs are identified (common initial sequence == same sequence of types). -commonSizeFields: common initial sequences of fields in structs are identified (common initial sequence == same sequence of types sizes). -alwaysTrackLvals: track lvals through all arithmetic operations. -outputDir dir: creates .gda files in directory dir. -debug: dumps dependencies on stdout. -pid: uses pids instead of names. -noStrings: ignores strings. gdac can be run on an individual source file, or it can be run using the "-f" option with a configfile that lists the set of sources to be compiled. There are a number of other options that control how the compilation proceeds, as listed above. The -malloc flag controls which functions are treated as memory allocation functions -- each occurrence of such a function generates a new lval that will be tracked by the points-to analysis. The -noFields flag ignores fields of structs and unions. The -commonTypeFields deals with the issue that structs can be safely inter-cast under certain circumstances. With this flag set, the gdac compiler generates the same variable for two fields f1 and f2 in two different structs S1 and S2 if f1 and f2 have the same type and the sequence of types in S1 preceeding f1 is the same as the sequence of sizes of types in S2 preceeding f2. The -commonSizeFields flag is similar to -commonTypeFields but implements an even weaker condition -- two fields f1 and f2 in two different structs S1 and S2 are considered interchangeable if f1 and f2 have the same size and the sequence of sizes of types in S1 preceeding f1 is the same as the the sequence of sizes of types in S2 preceeding f2. The -alwasyTrackLvals tracks lvals through all arithmetic operations -- gdac's default behaviour is to ignore operations such as *, ||, && etc. that in practice do not yield pointer values. The -outputDir flag controls where the .gda files are generated. The -debug flag generates diagnostic information about gdac compilation. The -pid flag uses pids instead of names to identify variables/fields. A pid is a unique integer used to identify a variable; pids are generated within ckit. The "-noStrings" flag controls whether strings are treated as pointers and should be tracked during points-to analysis (see the StringConst case in ml/track.sml). APPENDIX IV: gdal options Invoking gdal with an empty command line gives the following usage information: Usage: "gdal [-d] [-q] [-o targetfile] [-f configfile] [sourcefile1] [sourcefile2] ..." Where: targetfile is the name of the output file ("gdal.gda" by default), sourcefile1 sourcefile2 ... are input files, configfile is a file containing the names of other input files. gdal can be used in two modes. It can be called with a series of database files (sourcefile1, ...), and it will link the database files together to produce one database file. By default the output database is gdal.gda; this can be changed by the "-o" flag. Alternatively, a configuration file that lists the input database files can be specified by the "-f" flag. The "-q" flag suppresses statistics about the generated database file. The "-d" flag generates debug diagnostics for the generated database file.