GDA Release, October 14, 2002.


1. LICENSE

  This software is provided subject to the following terms
  and conditions, which you should read carefully before
  using the software. Using this software indicates your
  acceptance of these terms and conditions.  If you do not
  agree with these terms and conditions, do not use the
  software. 
  
  Copyright (c) 2002, Agere Systems
  All rights reserved. 
  
  Redistribution and use in source or binary forms, with or
  without modifications, are permitted provided that the
  following conditions are met: 
  
  o Redistributions of source code must retain the above
    copyright notice, this list of conditions and the
    following Disclaimer in comments in the code as well as
    in the documentation and/or other materials provided with
    the distribution.
  o Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    Disclaimer in the documentation and/or other materials
    provided with the distribution.
  o Neither the name of Agere Systems nor the names of the
    contributors may be used to endorse or promote products
    derived from this software without specific prior written
    permission. 
  
  Disclaimer 
  
  THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
  WARRANTIES, INCLUDING, BUT NOT LIMITED TO, INFRINGEMENT AND THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
  PURPOSE ARE DISCLAIMED. ANY USE, MODIFICATION OR DISTRIBUTION OF
  THIS SOFTWARE IS SOLELY AT THE USER'S OWN RISK. IN NO EVENT SHALL
  AGERE SYSTEMS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, INCLUDING, BUT NOT LIMITED
  TO, CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 


2. DESCRIPTION

GDA is a tool for scalable analysis of C code.  Built into this distribution of
gda is an implementation of Andersen's aliasing analysis using sub-transitive
graphs, as well as a dependence analysis for tracking dependencies between
variables, fields etc.  In addition, this release contains separate source code
for demand-driven pointer analysis -- see Section 6 and ddpa/README.

Central to the design of gda is a database view of programs.  The first phase of
gda is a compilation process that parses source files and extracts assignments
and function calls/returns/definitions, and writes an object file that is
basically an indexed database structure of these basic program components.
Complex assignments are broken down into primitive ones by introducing temporary
variables.  The elements of these database files involve variables/fields and
(typically) at most one operation.  These databases can be used for a wide
variety of analyses, including essentially any analysis that reasons about the
flow of data from one variable/field to another e.g. aliasing analysis,
dependence analysis, constant propagation analysis etc.  The format of the
databases follows the design of elf executable files, and consists of header
section followed by a arbitrary number of data sections.  One advantage of this
format is that it supports the addition of new sections with new program
information in a way that does not break previous generations of gda-based
tools.

The intent of the database infrastructure is to maximize infrastructure reuse
during analysis implementation.  Client analyses interact with the databases
using a generic interface.  Critical to performance is the indexing scheme
used.  This scheme allows most analyses to locate necessary information in one
lookup (e.g. variables/fields are reference by the offset of their definition block
within the database).  It also supports load-on-demand schemes to reduce memory
requirements of client analyses i.e. clients do not need to load the entire
database into memory.

3. BUILDING GDA

These build notes assume a Linux environment and have been tested on Redhat
Linux 7.0.  gda was originally developed under MIPS/Irix and then ported to
Linux.  The source should still compile under MIPS/Irix (but might require a
tweak of compilation flags).

gda consists of three components.

gdac: this is the component that parses C code and generates the database files;
      it is written in sml and uses the ckit C parser (see ml/)
gdal: this component links database files together (see c/)
gda:  this tool performs Anderson's analysis and a dependence analysis that
      track dependencies (see c/)

To build gdac, you will need smlnj 110 or later -- see
http://cm.bell-labs.com/cm/cs/what/smlnj/.

  1. cd to ml/
  2. start sml
  3. at the sml prompt, type: CM.make()
  4. the resulting build will generate a file gdac.MACHINE-TYPE, where
     MACHINE-TYPE is your machine/os configuration
  5. if you are running Linux then skip to step 6.
     if you are NOT running Linux then
       edit gdac and replace ./gdac.x86-linux by ./gdac.MACHINE-TYPE, where
       MACHINE-TYPE is your machine/os configuration
  6. run ./gdac and you should get:
       Usage: gdac -f configfile or gdac file
       Use -help flag for more options

To build gdal and gda
 
  1. cd to c/
  2. type make
  3. run ./gdal and you should get:
       Usage: "gdal [-d] [-q] [-o targetfile] [-f configfile] [sourcefile1] [sourcefile2] ..."
       Where: targetfile is the name of the output file ("gdal.gda" by default),
              sourcefile1 sourcefile2 ... are input files,
              configfile is a file containing the names of other input files.
  4. run ./gda and you should get:
       usage: gda - "file.gda" for interactive mode
          or: gda "file.gda" ["targets" ["nontargets"]] in batch mode
       
       flags in batch mode:
        -q:  quiet mode
        -qq: quiet mode
        -nr: no results

The directory test/ contains a trivial test of gda.
To use this test, run:
% cd ml
% ./gdac ../test/test.c 
% cd ../test
% ../c/gda - test.gda
? lvals
? seed x
? go
? deps
? quit
%

For help and information about options:

gdac: use the "-help option"
gdal: running gdal without arguments gives usage information.
gda: running "gda -" puts gda into interactive mode; typing "help"
     at the interactive prompt lists the commands available.


A full transcript of this test is given in test/test.transcript

Documentation on the various options for the tools is given in Appendices I-IV.


4. PRE-COMPILED BINARIES

The directory pre-compiled-bin/ contains pre-compiled versions of gda, gdal and
gdac.  Building from scratch is strongly recommended; these binaries are
provided if that is not possible.  These were built using 
Standard ML of New Jersey, Version 110.0.7, September 28, 2000
on Redhat Linux 7.0.  Copy them across using:

 % cp pre-compiled-bin/gdac.x86-linux ml
 % cp pre-compiled-bin/gda c
 % cp pre-compiled-bin/gdal c



5. CODE OVERVIEW

ml/
 This directory contains the ml sources for gdac compiler.  This is built on top
 of ckit (in ml/ckit/). 

 drive.sml: top level function; parses options; calls ckit on individual sources.
            Note: the exportFn at the end of drive.sml,
                  which generates the gdac "binary" (see documentation for smlnj).

 read-config.sml: reads a configuration file
            Note: only readConfig and getFileNames are used at present; the rest
                  of the code is no longer used by gda.

 GDAFile.sml: writes gda database file
 
 track.sml: walks over ckit Ast trees and builds list of primitive dependencies.

 output.sml: library for writing binary files.

 types.sml: some basic types used througout gdac

c/
 This contains code for both gda and gdal (the linker).
 Note that the linker just collects the primitive dependencies from 
 a collection of database files and re-indexes them.  The output of gdal is
 is another database file.  The current gdal is functional, but not fast -- 
 it is the slowest part of the system at present, and no effort has been
 put into speeding it up (any volunteers?).

 hashtable.c list.c bigarray.c, storage.c: generic code for various kinds of storage needs.

 mapfile.c: generic code for mapping a file into memory.

 fatal.c:  error handling code.

 main.c: top-level code for gda.

 set.c: implementation of points-to sets using the sub-stransitive graph
        algorithm.

 aliasing.c: points-to analysis code.
 
 analyze.c: dependence analysis code.

 targets.c: code for initializing and managing targets (for the dependence analysis).

 database.c: code for reading the gda database files.

 print.c: code for printing analysis results e.g. points-to sets, dependencies, ...

 link.c: top-level code for linker.

 readGda.c writeGda.c: access to GDA files within linker.

 registry.c: datastructure for linking together references across different
             database files.

 file.c: some low-level helpers for writing database files.

ddpa/
 Demand-driven pointer analysis code -- section Section 6 and ddpa/README.

6. DEMAND-DRIVEN POINTER ANALYSIS

The directory ddpa/ contains complete source code for an implementation of
demand-driven pointer analysis based on GDA.  See ddpa/README for more details.


7. RELATED PAPERS

Postscript for these papers appears in the papers/ directory.

"Analysis of Large Code Bases: The Compile-Link-Analyze Model", Nevin Heintze,
unpublished report, November 1999. [papers/cla.ps]

"Ultra-fast Aliasing Analysis using CLA: A Million Lines of C Code in a Second"
Nevin Heintze and Olivier Tardieu,  PLDI'01. [papers/pldi01-1m.ps]

"Demand-Driven Pointer Analysis" Nevin Heintze and Olivier Tardieu,  PLDI'01.
[papers/pldi01-ddpa.ps]


APPENDIX I: gda interactive mode options

* help
 -- print list of commands
* quit 
 -- end gda interactive session
* show [strings] [types] [locations] [anonymous]
* hide [strings] [types] [locations] [anonymous]
 -- These two commands control how dependency chains are printed out.
    "show" explicitly requests that the information be printed and "hide"
    explicitly requests that the information be hidden.
    "strings" determines whether to print names of variables/fields, 
    "types" determines whether to print types of variables/fields,
    "locations" determines whether to print locations of variables/fields,
    and "anonymous" determines whether to anonyous variables (e.g. temporary
    variables introduced by gdac).  The default is to print strings, types and
    loctions, but not to print omit anonymous variables.

seed ["pattern"] ...
 -- Sets the target(s) for dependence analysis.  As a simple example, "seed w"
    indicates that the dependence analysis should find all variables/fields
    dependent on variables/fields with name "x" (i.e. "x" is the target
    variable).  A list of patterns can be given.  Each pattern has the format:

       name[/type][<location>]
   
    where name is a variable or field name, type is a C type, and location is a
    file location (either of form file:n or file:n-m e.g. "eg.c:34" or "eg.c"34-40").
    All but the name component are optional. 
    
skip ["pattern"] ...
 -- Specifies variables/fields to be ignored.  Multiple patterns can be
    specified.  As for the seed command, each pattern has the format
     name[/type][<location>]
 
load "file"
 -- Load seed patterns from a file

loadn "file"
 -- Load skip patterns from a file.

reset
 -- Reset the state of the system.

limit "number"
 -- For commands that potentially generate many lines of output (e.g. allvars,
    depvars, lvals, etc), a limit is placed on the number of items printed.
    More items can be printed using the "next" command.
    The default number of items printed is 20; 
    the limit command changes this default limit count.

forward
 -- Perform forward dependency analysis.

backward
 -- Perform backward dependency analysis.

unification
 -- Perform simultaneous backward and forward dependence analysis.

go
 -- Run dependence analysis (aliasing analysis is run when the
    interactive loop is started).

filter ["pattern"]
 -- Only show dependencies that match this filter pattern

next
 -- Print next block of items.

allvars
 -- Print all variables/fields (subject to filter and skip settings)

depvars
 -- Print dependent variables/fields.

targets
 -- Print target variables/fields.

nontargets
 -- Print dependent variables/fields that are not targets.

lvals
 -- Print variables/fields that have non-empty points-to sets.

chains
 -- Print the chains of dependencies (i.e. for each dependent variable, print
    the chain of dependencies that shows why this variable is dependent on the target).

deps
 -- Only prints out the first item in the dependence chain (i.e. the item that
    this variable/field is directly dependent on)

revdeps
 -- Print out the reverse of the command "deps" (i.e. view the output of deps as
    a set of edges x -> y, then revdeps prints out the edges y -> x).


APPENDIX II: gda batch mode options

Invoking gda with an empty command line gives the following usage information:

  usage: gda - "file.gda" for interactive mode
     or: gda "file.gda" ["targets" ["nontargets"]] in batch mode

  flags in batch mode:
   -q:  quiet mode
   -qq: quiet mode
   -nr: no results

The second usage line is for batch mode.  In this mode, a gda file is given and
optionally target and non-target files are given.  If no target file is given,
the points-to sets are printed.  The flags -q and -qq are identical; both
supress statistics information such as the size of the points-to sets, targets
and dependencies.  The flag -nr suppresses points-to set and dependency output.

APPENDIX III: gdac options

Invoking gdac -help gives the following usage information:

Usage: gdac -f configfile or gdac file
Additional Flags:
   -malloc mallocName: treats function mallocName as a malloc function;
                       each location of a call to mallocName generates a new lval.

   -noFields:          field of structs and unions are ignored
                       e.g. e.f1, e.f2, ... are treated as e.

   -commonTypeFields:  common initial sequences of fields in structs are identified
                       (common initial sequence == same sequence of types).

   -commonSizeFields:  common initial sequences of fields in structs are identified
                       (common initial sequence == same sequence of types sizes).

   -alwaysTrackLvals:  track lvals through all arithmetic operations.

   -outputDir dir:     creates .gda files in directory dir.

   -debug:             dumps dependencies on stdout.

   -pid:               uses pids instead of names.

   -noStrings:         ignores strings.


gdac can be run on an individual source file, or it can be run using the "-f"
option with a configfile that lists the set of sources to be compiled.  
There are a number of other options that control how the compilation proceeds,
as listed above.  The -malloc flag controls which functions are treated as
memory allocation functions -- each occurrence of such a function generates a
new lval that will be tracked by the points-to analysis.  The -noFields flag
ignores fields of structs and unions.

The -commonTypeFields deals with the issue that structs can be safely inter-cast
under certain circumstances.  With this flag set, the gdac compiler generates
the same variable for two fields f1 and f2 in two different structs S1 and S2 if
f1 and f2 have the same type and the sequence of types in S1 preceeding f1 is
the same as the sequence of sizes of types in S2 preceeding f2.  The
-commonSizeFields flag is similar to -commonTypeFields but implements an even
weaker condition -- two fields f1 and f2 in two different structs S1 and S2 are
considered interchangeable if f1 and f2 have the same size and the sequence of
sizes of types in S1 preceeding f1 is the same as the the sequence of sizes of
types in S2 preceeding f2.

The -alwasyTrackLvals tracks lvals through all arithmetic operations -- gdac's
default behaviour is to ignore operations such as *, ||, && etc. that in
practice do not yield pointer values.

The -outputDir flag controls where the .gda files are generated.
The -debug flag generates diagnostic information about gdac compilation.
The -pid flag uses pids instead of names to identify variables/fields.
A pid is a unique integer used to identify a variable; pids are
generated within ckit. 
The "-noStrings" flag controls whether strings are treated as pointers and
should be tracked during points-to analysis (see the StringConst case in
ml/track.sml).

APPENDIX IV: gdal options

Invoking gdal with an empty command line gives the following usage information:

Usage: "gdal [-d] [-q] [-o targetfile] [-f configfile] [sourcefile1] [sourcefile2] ..."
Where: targetfile is the name of the output file ("gdal.gda" by default),
       sourcefile1 sourcefile2 ... are input files,
       configfile is a file containing the names of other input files.

gdal can be used in two modes.  It can be called with a series of database files
(sourcefile1, ...), and it will link the database files together to produce one
database file.  By default the output database is gdal.gda; this can be changed
by the "-o" flag.  Alternatively, a configuration file that lists the input
database files can be specified by the "-f" flag.  The "-q" flag suppresses
statistics about the generated database file.  The "-d" flag generates
debug diagnostics for the generated database file.