Past Research Accomplishments
The Cornell Program Synthesizer
The Cornell Program Synthesizer [TR81],
which Tim Teitelbaum
and I started working on in 1978, was the first interactive, WYSIWYG
(What You See Is What You Get) programming environment for a
strongly-typed, block-structured, imperative programming language. It
had a tightly integrated combination of facilities for programmers to
create, edit, execute, and debug their PL/C programs. In other words,
by 1979, we had created the first version of an interactive programming
environment similar to modern tools, such as
Visual Studio
and Eclipse.
At the time that it was created (i.e., 1978-79), the Cornell Program Synthesizer
was a radically new kind of programming tool. Along with the
Mentor
system of Donzeau-Gouge, Kahn, Huet, and Lang, it showed that
the kind of direct-manipulation programming tools that had been
pioneered in the Smalltalk
and Interlisp systems
could be created for a mainstream programming language (i.e., one that was
strongly-typed, block-structured, and had imperative programming constructs).
The Cornell Program Synthesizer was used successfully in introductory
programming classes by about 20,000 students over a five-year period
at Cornell, Rutgers, Princeton, UCLA, and several other universities.
Technology Transfer
The Cornell Program Synthesizer was distributed to researchers at
about 110 universities, research institutes, and industry
laboratories.
Through lectures that were given about the system and copies that were
distributed, it is possible to trace its direct influence on several
commercial products, including DEC's
Language Sensitive Editor,
Rational's Ada programming environment,
and Symantec's
LightspeedC and LightspeedPascal programming environments.
It is more difficult to connect the dots to the similar products that
hundreds of thousands or millions of people use on a daily basis today;
however, together with the initial paper about the system
[TR81],
the Cornell Program Synthesizer was very influential: as a model for
programming tools of the future, it helped to spark widespread
interest in the question of how to create better tools for developing,
analyzing, testing, and debugging programs.
The Synthesizer Generator
My Ph.D. dissertation [R84]
investigated the problem of automating the creation of programming
tools similar to the Cornell Program Synthesizer, but for languages
other than PL/I.
The goal of the research was to create an editor-generator program to
enable editors for different languages to be created automatically
from a language description. In the solution I proposed, language
descriptions are specified by giving
- a grammar for the language,
- sets of ``attributes'' (annotations) to be attached to nodes of the
grammar's derivation trees, and
- rules that describe how information from one node's attributes
affects the attributes of other nodes.
In essence, each editor can be thought of as a specialized
``spreadsheet'' for documents written in the given language. After an
editing operation changes the document's underlying tree, some of the
tree's attributes may no longer be consistent with one another;
attributes must be updated to reestablish consistency (similar to the
way values are updated in a spreadsheet program). A major result of
my Ph.D. research was an optimal algorithm that I devised to solve the
attribute-updating problem
[R82,
RTD83,
R84
].
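To make the ``spreadsheet'' analogy concrete, here is a minimal Python
sketch of change-driven attribute re-evaluation (illustrative only; this
is not the optimal algorithm of [R82, RTD83, R84], and the names are
invented): after an edit changes one attribute, only the attributes that
depend on it, directly or transitively, are recomputed.

    class Attr:
        """An attribute attached to a tree node, plus its dependences."""
        def __init__(self, name, rule=None, value=None):
            self.name = name          # e.g., "decl.type" or "use.ok"
            self.rule = rule          # function: list of input values -> value
            self.value = value        # cached value, as in a spreadsheet cell
            self.inputs = []          # attributes this one reads
            self.dependents = []      # attributes that read this one

        def depends_on(self, *attrs):
            for a in attrs:
                self.inputs.append(a)
                a.dependents.append(self)

    def reevaluate(changed):
        """Re-run rules of (transitive) dependents until values stabilize."""
        worklist = list(changed.dependents)
        while worklist:
            a = worklist.pop()
            new_value = a.rule([i.value for i in a.inputs])
            if new_value != a.value:          # propagate only real changes
                a.value = new_value
                worklist.extend(a.dependents)

    # Toy example: a declaration's type must match the type expected at a use.
    decl = Attr("decl.type", value="int")
    expected = Attr("use.expected", value="float")
    ok = Attr("use.ok", rule=lambda vs: vs[0] == vs[1])
    ok.depends_on(decl, expected)

    reevaluate(decl)      # initial evaluation
    print(ok.value)       # False: "int" vs. "float"
    decl.value = "float"  # an editing operation changes the declaration...
    reevaluate(decl)      # ...and only the affected attribute is recomputed
    print(ok.value)       # True: consistency is reestablished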
As part of my Ph.D. research, I developed a prototype implementation
of an editor-generating system that made use of the ideas described
above. I spent 1982-85 as a post-doc at Cornell (in residence during
1982-83 at INRIA, the national institute for computer science in
France) collaborating with my advisor, Tim Teitelbaum,
on the design and implementation of the Synthesizer Generator,
a full-scale version of such a system
[RT84,
RT87,
RT88a
].
The Synthesizer Generator creates a language-specific editor from an
input specification that defines a language's syntax, display format,
transformation rules for restructuring ``documents'' written in the
language, and attribution rules that codify context-sensitive
relationships between document elements. From this specification, the
Synthesizer Generator creates a specialized editor for manipulating
documents according to these rules. The Synthesizer Generator has
proven to be of great utility in a wide range of applications,
including program editors, text-preparation systems, and tools that
verify the correctness of proofs (for several different kinds of
mathematical and programming logics).
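As a rough illustration of how a language description drives a generated
tool, the following hypothetical Python sketch pairs the node kinds of a
tiny grammar with attribution rules and uses a generic,
language-independent evaluator to check a ``document''; the real system
is driven by specifications written in its own specification language,
and all names below are invented.

    # Attribution rules: for each node kind, compute the node's attributes
    # from its children's attributes (display and transformation rules omitted).
    attribution_rules = {
        "num":  lambda kids: {"type": "int"},
        "real": lambda kids: {"type": "float"},
        "plus": lambda kids: {
            # context-sensitive check: '+' requires operands of one type
            "type": kids[0]["type"] if kids[0]["type"] == kids[1]["type"]
                    else "TYPE ERROR"
        },
    }

    def attribute(node):
        """The language-specific part is just the rule table above;
        this evaluator is the same for every generated editor."""
        kind, children = node[0], node[1:]
        return attribution_rules[kind]([attribute(c) for c in children])

    doc = ("plus", ("num",), ("real",))   # a tiny "document": 1 + 2.0
    print(attribute(doc))                 # {'type': 'TYPE ERROR'}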
Technology Transfer
Beginning in August 1985, the Synthesizer Generator was made available
to universities and industrial research organizations.
Since distribution began it has been licensed
to about 330 sites in 23 countries. Two books describing the system,
which Teitelbaum and I co-authored, were published by Springer-Verlag
in 1988 [RT88a,
RT88b].
Also in 1988, Teitelbaum and I founded a company
(GrammaTech, Inc.) to
commercialize the
Synthesizer Generator technology.
GrammaTech subsequently created two products for the Ada programming language
based on the Synthesizer Generator,
Ada-ASSURED
and Ada-Utilities,
and, since the mid-90's, these have been used in the development of almost every major
Department of Defense system written in Ada.
Incremental Computing
The attribute-updating algorithm that I developed as part of my
Ph.D. research is an example of an
``incremental updating algorithm''---an algorithm that makes use
of the solution to one problem instance to
find the solution to a nearby problem instance.
This kind of problem has remained one of my interests, and is
something that I have returned to occasionally over the years
in a variety of contexts, including
interactive program-development environments
[R82,
RTD83,
R84,
RMT86,
R88],
graph algorithms
[RR94a,
RR94b,
RR96a,
RR96b],
and the construction of abstract transformers for program-analysis
problems
[RSL03].
Incremental computing was the subject of G. Ramalingam's
Ph.D. research
[R96],
which was carried out under my supervision.
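To give the flavor of what ``incremental'' means here, the following
minimal Python sketch (not one of the algorithms from the papers above;
the graph and names are made up) updates single-source shortest-path
distances after one edge weight decreases, re-examining only the
vertices whose distances can actually improve.

    import heapq

    def decrease_edge(graph, dist, u, v, new_w):
        """graph: {node: {successor: weight}}; dist: current shortest distances."""
        graph[u][v] = new_w
        if dist[u] + new_w >= dist[v]:
            return                           # no distance improves: no work at all
        dist[v] = dist[u] + new_w
        heap = [(dist[v], v)]                # propagate the improvement outward
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x]:
                continue                     # stale queue entry
            for y, w in graph[x].items():
                if dist[x] + w < dist[y]:
                    dist[y] = dist[x] + w
                    heapq.heappush(heap, (dist[y], y))

    graph = {"s": {"a": 5, "b": 2}, "a": {"t": 1}, "b": {"a": 10, "t": 9}, "t": {}}
    dist = {"s": 0, "a": 5, "b": 2, "t": 6}  # shortest distances before the change
    decrease_edge(graph, dist, "b", "a", 1)  # one edge becomes cheaper
    print(dist)                              # {'s': 0, 'a': 3, 'b': 2, 't': 4}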
Technology Transfer
The incremental shortest-hyperpath algorithm from
[RR96b]
has been used by AT&T to improve the performance of Open Shortest Path
First (OSPF), the most commonly used intra-domain Internet routing
protocol.
The incremental shortest-hyperpath algorithm is used as a subroutine
in a heuristic search algorithm for optimizing OSPF weights (for a
given set of projected demands), thereby reducing congestion
[Fortz and Thorup 04].
The speed-up obtained by using our incremental algorithm was reported
to be a factor of 15 to 20.
In 2004, Mikkel Thorup told me that AT&T was using their results to adjust
capacities in the AT&T network when they had to schedule link downtimes
for maintenance. Apparently this happens quite frequently: he indicated
that on the order of a dozen links had to be serviced each day.
Program Slicing
Starting in 1986, I carried out a great deal of work exploring how
program slicing
(an operation originally proposed in 1979 by the late Mark Weiser)
can serve as the basis for semantics-based program-manipulation operations.
The slice of a program with respect to a set of program elements
S is
a projection of the program that includes only program elements that
might affect (either directly or transitively) the values of the
variables used at members of S. Slicing allows one to find
semantically meaningful decompositions of programs, where the
decompositions consist of elements that are not textually contiguous.
Program slicing is a fundamental operation that can aid in solving
many software-engineering problems
[HR92].
For instance, it has applications in program understanding, maintenance, debugging,
testing,
differencing [HR91],
specialization [RT96],
reuse, and merging of program variants
[HPR89].
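As a toy illustration of the operation itself (a hypothetical example;
the algorithms in [HRB88, HRB90] work on program and system dependence
graphs and handle procedure calls), a backward slice can be computed as
reachability over dependence edges:

    # Program:           Dependence edges (a statement points to the
    #  1: x = 1          statements whose values it uses):
    #  2: y = 2
    #  3: z = x + y
    #  4: print(z)
    deps = {4: {3}, 3: {1, 2}, 2: set(), 1: set()}

    def backward_slice(deps, criterion):
        """All statements that might affect the statements in `criterion`."""
        slice_, worklist = set(), list(criterion)
        while worklist:
            s = worklist.pop()
            if s not in slice_:
                slice_.add(s)
                worklist.extend(deps[s])
        return slice_

    print(sorted(backward_slice(deps, {4})))   # [1, 2, 3, 4]
    print(sorted(backward_slice(deps, {3})))   # [1, 2, 3]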
The projects that I carried out with my students and co-workers were
aimed at developing these applications of slicing.
(Click here
for the home page of the Wisconsin Program-Slicing Project.)
The conference paper on interprocedural slicing that
Susan Horwitz,
David Binkley,
and I wrote [HRB88]
was selected for inclusion in a special SIGPLAN collection of the
50 most influential papers from the SIGPLAN Conference on Programming
Language Design and Implementation from 1979 to 1999:
-
Horwitz, S., Reps, T., and Binkley, D.,
Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 232-243.
A journal version of
[HRB88]
appeared as
[HRB90].
(Of all my papers, this is the most highly-cited;
e.g., see Google Scholar
and Citeseer.)
A retrospective on [HRB88]
was published as
-
Horwitz, S., Reps, T., and Binkley, D.,
Retrospective: Interprocedural slicing using dependence graphs.
20 Years of the ACM SIGPLAN Conference on Programming Language Design
and Implementation (1979 - 1999): A Selection, K.S. McKinley, ed.,
ACM SIGPLAN Notices 39, 4 (April 2004), 229-231.
[retrospective (PS);
retrospective (PDF)]
In more recent work, my students, colleagues, and I have developed even
better methods for program analysis and slicing based on a formalism called
weighted pushdown systems.
Technology Transfer
In collaboration with Genevieve Rosay and Susan Horwitz,
I built a system, called the
Wisconsin Program-Slicing Tool,
that implemented slicing of ANSI C programs.
Although the Wisconsin Program-Slicing Tool
was distributed from 1996-2000 for not-for-profit
research purposes under license from the University of Wisconsin,
it is no longer being distributed.
However, there is a commercial tool available, named
CodeSurfer,
that is derived from the Wisconsin implementation and is available from
GrammaTech, Inc.,
which licensed the technology from the Wisconsin Alumni Research Foundation (a
not-for-profit foundation that uses royalty income from the
technologies it licenses to support research at the University of
Wisconsin).
GrammaTech has improved on the Wisconsin implementation considerably
(both in terms of speed and space efficiency), and extended it to
support C++.
GrammaTech also offers
CodeSurfer to academic researchers
on very favorable terms.
In 2007, Tavis Ormandy of Gentoo Linux said the following about CodeSurfer:
``Our team is responsible for investigating and verifying security problems
in any of the 10,000+ open source programs we distribute. Codesurfer has
made a dramatic impact on the time it takes us to isolate and understand
security flaws, allowing us to traverse complex and unfamiliar codebases
easily and trace problems to their source.''
A group at SAIC/NASA had this to say about CodeSurfer:
``Our group analyzes many mission-critical software projects to
reduce defects and ensure that the software meets requirements. We
conducted a formal study to see if CodeSufer could improve our
software inspections. We found that CodeSurfer reduced inspection
time by an average of 30%. In addition, when using CodeSurfer the
number of defects found more than doubled.''
Slicing was also used in many commercial tools designed to address the
Year 2000 problem. In IBM's VisualAge for COBOL, for example, slicing
was used to support application understanding, program conversion,
and program structuring, as well as the Year 2000 and other
code-renovation issues.
Since 2000, slicing has continued to live on in tools for software renovation (e.g., to
modify financial software to account for conversion to the Euro).
(A list of more than 20 such tools can be found
here;
another list of slicing tools is available
here.)
Interprocedural Dataflow Analysis and Context-Free-Language Reachability
Work that I carried out with Susan Horwitz, Mooly Sagiv, and David
Melski established some interesting connections between the work on
interprocedural program slicing and a number of other program-analysis
problems, such as classical interprocedural dataflow analysis.
In particular, we showed that a large class of program-analysis
problems can be solved by transforming them into instances of a
special kind of
graph-reachability problem,
called context-free-language reachability (CFL-reachability)
[RHS95,
HRS95,
R95b,
R96,
MR98,
R98,
R00
].
The notion of CFL-reachability was originated by
Mihalis Yannakakis,
who developed it in the context of
graph-theoretic methods in database theory.
Context-free-language reachability problems can be solved in
polynomial time by algorithms similar to the ones we originally developed
for interprocedural slicing [HRB90,
RHSR94].
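The following minimal Python sketch illustrates CFL-reachability itself
(it is not one of the algorithms from the papers above): facts are
labeled edges, and grammar productions combine existing edges into new
ones until nothing more can be added. The language here is balanced
parentheses, the usual stand-in for matched calls and returns.

    def cfl_reachability(edges, unary, binary, nullable):
        """edges: set of (u, label, v).  unary[X] = A means A -> X.
        binary[(X, Y)] = A means A -> X Y.  nullable: nonterminals deriving eps."""
        facts = set(edges)
        nodes = {u for u, _, v in edges} | {v for u, _, v in edges}
        for a in nullable:                        # A -> eps: a self-loop everywhere
            facts |= {(n, a, n) for n in nodes}
        worklist = list(facts)
        while worklist:
            u, x, v = worklist.pop()
            new = set()
            if x in unary:                                    # A -> x
                new.add((u, unary[x], v))
            for (w, y, u2) in list(facts):                    # A -> y x
                if u2 == u and (y, x) in binary:
                    new.add((w, binary[(y, x)], v))
            for (v2, y, w) in list(facts):                    # A -> x y
                if v2 == v and (x, y) in binary:
                    new.add((u, binary[(x, y)], w))
            for f in new - facts:
                facts.add(f)
                worklist.append(f)
        return facts

    # Grammar for matched parentheses:  M -> eps | "(" Mc     Mc -> M ")"
    # Graph:  a --(--> b --(--> c --)--> d --)--> e
    edges = {("a", "(", "b"), ("b", "(", "c"), ("c", ")", "d"), ("d", ")", "e")}
    binary = {("(", "Mc"): "M", ("M", ")"): "Mc"}
    facts = cfl_reachability(edges, unary={}, binary=binary, nullable={"M"})
    print(("a", "M", "e") in facts)   # True:  "(())" is matched
    print(("a", "M", "d") in facts)   # False: "(()"  is not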
David Melski and I studied the relationship [MR98]
between CFL-reachability and the program-analysis formalism of set constraints, which was
introduced by John Reynolds
and subsequently studied by
Alex Aiken and
Nevin Heintze.
The PowerPoint presentation prepared for the tutorial that I gave on
this material at PLDI '00 can be found
here.
Technology Transfer
In addition to the CFL-reachability-based slicing algorithms that are
used in
CodeSurfer,
the work on program analysis via CFL-reachability had direct impact on the
well-known SLAM tool
from Microsoft.
In particular, as described in
[BR01],
one of the key algorithms used in SLAM is based on the interprocedural dataflow-analysis
algorithm given in [RHS95].
The SLAM work even got on Bill Gates's radar screen.
For instance, in a keynote address at the Windows
Hardware Engineering Conference in 2002, Gates
touted the work on SLAM as follows:
``... software verification ... has been the Holy Grail of Computer
Science for many decades, but now in some very key areas, for
example, driver verification, we're building tools that can do
actual proof about the software and how it works in order to
guarantee the reliability.''
In more recent work, my students, colleagues, and I have developed even
better methods for interprocedural dataflow analysis based on a formalism called
weighted pushdown systems.
Techniques for Software Renovation
Mike Siff and I investigated techniques for converting C programs
to C++ programs that take advantage of the improved features
of C++, such as templates and classes.
For instance, in
[SR96,
S98]
we considered the problem of software generalization: Given a program
component C, create a parameterized program component
C' such that C' is usable in a wider variety of
syntactic contexts than C. Furthermore, C' should
be a semantically meaningful generalization of C; namely,
there must exist an instantiation of C' that is equivalent in
functionality to C.
We gave an algorithm that generalizes C functions via type inference.
The original functions operate on specific data
types; the result of generalization is a collection of C++ function
templates that operate on parameterized types. This version of the
generalization problem is useful in the context of converting existing
C programs to C++.
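The following toy Python sketch (hypothetical; the actual work
generalizes C functions to C++ function templates via type inference)
illustrates the requirement stated above: the parameterized component
C' has an instantiation that reproduces C exactly, yet C' is usable in
more contexts.

    from typing import Callable, List, TypeVar

    def sum_ints(xs: List[int]) -> int:         # the original component C
        total = 0
        for x in xs:
            total = total + x
        return total

    T = TypeVar("T")

    def fold(xs: List[T], zero: T,              # the generalized component C'
             plus: Callable[[T, T], T]) -> T:
        total = zero
        for x in xs:
            total = plus(total, x)
        return total

    # One instantiation of C' is equivalent in functionality to C ...
    assert fold([1, 2, 3], 0, lambda a, b: a + b) == sum_ints([1, 2, 3])
    # ... but C' also works in contexts where C does not, e.g., on strings.
    print(fold(["a", "b", "c"], "", lambda a, b: a + b))    # "abc"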
We also described a general technique for identifying modules in legacy
code [SR98,
S98].
The method is based on concept analysis---a branch of lattice
theory that can be used to identify similarities among a set of
objects based on their attributes. We showed how concept analysis can
identify potential modules using both ``positive'' and ``negative''
information, and presented an algorithmic framework to construct a
lattice of concepts from a program, where each concept represents a
potential module.
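To illustrate the underlying idea, the following minimal Python sketch
(with invented data) applies concept analysis to a tiny relation between
functions and the global variables they use; each resulting concept, a
maximal set of functions together with all the variables they share, is
a candidate module.

    from itertools import combinations

    uses = {                                     # which globals each function uses
        "read_config":  {"cfg"},
        "write_config": {"cfg"},
        "push":         {"stack", "top"},
        "pop":          {"stack", "top"},
    }
    all_attrs = set().union(*uses.values())

    def extent(attrs):       # functions that use every variable in attrs
        return {f for f, a in uses.items() if attrs <= a}

    def intent(funcs):       # variables used by every function in funcs
        return set.intersection(*(uses[f] for f in funcs)) if funcs else all_attrs

    concepts = set()
    for r in range(len(all_attrs) + 1):          # brute force is fine for a toy
        for attrs in combinations(sorted(all_attrs), r):
            e = extent(set(attrs))
            concepts.add((frozenset(e), frozenset(intent(e))))

    for e, i in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(e), "share", sorted(i))
    # Output includes the concepts ({'pop', 'push'}, {'stack', 'top'}) and
    # ({'read_config', 'write_config'}, {'cfg'}): two plausible candidate modules.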
The Use of Program Profiling for Software Maintenance
In 1996, I was asked by the Defense Advanced Research Projects Agency
(DARPA) to plan a project aimed at reducing the impact of the Year
2000 problem on the Department of Defense. DARPA was particularly
interested in whether there were ``any techniques in the research
community that could be applied to the Y2K problem and have impact
beyond present commercial Y2K products and services.'' I spent a
semester as a kind of talent scout for DARPA to see what ideas might
be worth pursuing. The most exciting of the ideas that turned up
concerned a method for using path profiling as a heuristic to locate
some of the sites in a program where there are problematic date
manipulations [RBDL97]
(and more generally as an aid for testing and debugging).
A path profiler instruments a program so that the number of times each
different loop-free path executes is accumulated during an execution
run. With such an instrumented program, each run of the program
generates a path spectrum for the execution---a distribution of the
paths that were executed during that run. A path spectrum is a
finite, easily obtainable characterization of a program's execution on
a dataset, and provides a behavior signature for a run of the program.
Our techniques are based on the idea of comparing path spectra from
different runs of the program. When different runs produce different
spectra, the spectral differences can be used to identify paths in the
program along which control diverges in the two runs. By choosing
input datasets to hold all factors constant except one, the divergence
can be attributed to this factor. The point of divergence itself may
not be the cause of the underlying problem, but it provides a starting
place for a programmer to begin exploring.
One application of this technique is in the ``Year 2000 Problem'' (i.e.,
the problem of fixing computer systems that use only 2-digit year
fields in date-valued data). In this context, path-spectrum
comparison provides a heuristic for identifying paths in a program
that are good candidates for being date-dependent computations.
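A minimal Python sketch of the idea follows (illustrative only; real
path profilers use an efficient path-numbering instrumentation rather
than building explicit path tuples, and the program and data below are
made up):

    from collections import Counter

    def process(year):
        """A toy date-dependent computation; returns the loop-free path it took."""
        path = []
        if year < 50:                 # 2-digit year assumed to mean 20xx
            path.append("assume-20xx")
            year += 2000
        else:                         # otherwise assumed to mean 19xx
            path.append("assume-19xx")
            year += 1900
        path.append("pre-2000" if year < 2000 else "post-2000")
        return tuple(path)

    def spectrum(years):
        """Path spectrum: the distribution of paths executed on one dataset."""
        return Counter(process(y) for y in years)

    run_a = spectrum([97, 98, 99])       # all factors held constant except ...
    run_b = spectrum([97, 98, 99, 1])    # ... one post-1999 record is added
    print((run_b - run_a) + (run_a - run_b))
    # Counter({('assume-20xx', 'post-2000'): 1}): the path where control diverges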
A few years later, we found out that essentially the same idea had been discovered
earlier by J. Collofello and L. Cousin [CC87],
as well as by N. Wilde
(who has done extensive research on the subject, and uses the term
``software reconnaissance'',
e.g., see [W94,
WC96]).
Computational Differentiation and Computational Divided Differencing
Tools for computational differentiation transform a program that
computes a numerical function F(x) into a related program that
computes F'(x) (the derivative of F).
Louis Rall (of the UW Mathematics Department) and I
described how techniques similar to those used in computational-differentiation
tools can be used to implement other program transformations---in
particular, a variety of transformations for computational divided
differencing
[RR00,
RR03
].
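As a small taste of these transformations (a sketch, not the method of
[RR00, RR03]; the names are invented), the following Python class
propagates first divided differences through a computation in the style
of dual-number automatic differentiation, avoiding the cancellation
that the naive formula (F(x0) - F(x1)) / (x0 - x1) suffers when x0 and
x1 are close:

    class DD:
        """Carries f(x0), f(x1), and the first divided difference f[x0,x1]."""
        def __init__(self, v0, v1, dd):
            self.v0, self.v1, self.dd = v0, v1, dd

        def __add__(self, other):
            return DD(self.v0 + other.v0, self.v1 + other.v1, self.dd + other.dd)

        def __mul__(self, other):
            # (f*g)[x0,x1] = f(x0)*g[x0,x1] + f[x0,x1]*g(x1): no subtraction needed
            return DD(self.v0 * other.v0, self.v1 * other.v1,
                      self.v0 * other.dd + self.dd * other.v1)

    def var(x0, x1):  return DD(x0, x1, 1.0)
    def const(c):     return DD(c, c, 0.0)

    def F(x):                            # the "program" being transformed: x*x + 3*x
        return x * x + const(3.0) * x

    print(F(var(2.0, 2.0 + 1e-9)).dd)    # ~7.0, with no catastrophic cancellation
    print(F(var(2.0, 2.0)).dd)           # exactly 7.0: degenerates to F'(2.0)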