package
1.0.1
Repository: https://github.com/atlantistechnology/sdt.git
Documentation: pkg.go.dev

# README

This small tool is a wrapper around the tree-sitter-cli framework (see https://github.com/tree-sitter/tree-sitter). The tree-sitter library provides parsing support for a large number of languages via subpackages, each of which provides the grammar for one language. Each grammar must be installed using the framework, and the grammars rely on Rust, Node.js, C, C++, and bash tooling.

For the current tool, we call `tree-sitter parse file.lang`. If language support is available for the given file, we then massage the resulting parse tree so that it includes the features needed by Semantic Diff Tool (SDT) and is in a format that SDT can process more easily. Specifically, tree-sitter might provide a parse tree such as:

```
% tree-sitter parse samples/hello0.c
(translation_unit [0, 0] - [15, 0]
  (comment [0, 0] - [0, 44])
  (comment [2, 0] - [2, 41])
  (preproc_include [3, 0] - [4, 0]
    path: (system_lib_string [3, 9] - [3, 18]))
 [...]
   (function_definition [7, 0] - [14, 1]
     type: (primitive_type [7, 0] - [7, 3])
     declarator: (function_declarator [7, 4] - [7, 10]
       declarator: (identifier [7, 4] - [7, 8])
 [...]
         (string_literal [11, 11] - [11, 24]))))
   (return_statement [13, 4] - [13, 13]
     (number_literal [13, 11] - [13, 12])))))
```
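
Each line of that output has the same shape: indentation, an optional field name, a node type, and a start and end position. The following is a minimal sketch of how one such line could be split into its parts in Go. The `Node` type, the `ParseNodeLine` function, and the regular expression are hypothetical names made up for illustration; they are not taken from the treesit source.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Node is a hypothetical record for one line of tree-sitter parse output.
type Node struct {
	Indent   int    // number of leading spaces
	Text     string // e.g. "(comment" or "path: (system_lib_string"
	StartRow int    // 0-based, as reported by tree-sitter
	StartCol int
	EndRow   int
	EndCol   int
	Closers  string // trailing ")" characters, if any
}

// nodeLine matches: indentation, node text, "[r, c] - [r, c]", closing parens.
var nodeLine = regexp.MustCompile(`^( *)(.*?) \[(\d+), (\d+)\] - \[(\d+), (\d+)\](\)*)$`)

// ParseNodeLine splits one output line into its parts. The second return
// value is false when the line does not carry a position range.
func ParseNodeLine(line string) (Node, bool) {
	m := nodeLine.FindStringSubmatch(line)
	if m == nil {
		return Node{}, false
	}
	// The regexp only admits digit runs, so the conversion error is ignored.
	atoi := func(s string) int { n, _ := strconv.Atoi(s); return n }
	return Node{
		Indent:   len(m[1]),
		Text:     m[2],
		StartRow: atoi(m[3]),
		StartCol: atoi(m[4]),
		EndRow:   atoi(m[5]),
		EndCol:   atoi(m[6]),
		Closers:  m[7],
	}, true
}

func main() {
	n, ok := ParseNodeLine("    path: (system_lib_string [3, 9] - [3, 18]))")
	fmt.Println(ok, n.Indent, n.Text, n.StartRow, n.StartCol, n.EndRow, n.EndCol, n.Closers)
}
```

Keeping the trailing parentheses as a separate field makes it possible to drop the position range later without unbalancing the tree.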

For SDT we want several things to be different, since our interest is simply in identifying which lines might contain semantically meaningful changes. Tree-sitter gives us both too much and too little for this goal, but in a form that we can mechanically transform into what we need. The transformed format is (generally) compatible with that produced by gotree.

Differences in line number and column number are not semantically meaningful (in most programming languages); however, changes in names and literals ARE important. Comments are discarded by treesit unless the environment variable TREESIT_COMMENTS is set to a non-blank value.
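
As a rough illustration of the comment rule, the sketch below filters comment nodes out of a list of parse-tree lines unless TREESIT_COMMENTS is set. The helper names `isSet` and `DropComments` are hypothetical, and reading "non-blank" as "not empty after trimming whitespace" is an assumption, not something confirmed by the treesit source. The sketch also ignores one wrinkle: a comment that is the last child of its parent carries the parent's closing parenthesis on its line.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isSet reports whether an environment value counts as "non-blank".
// Assumption: whitespace-only values are treated as blank.
func isSet(value string) bool {
	return strings.TrimSpace(value) != ""
}

// DropComments removes lines that open a comment node unless keep is true.
func DropComments(lines []string, keep bool) []string {
	if keep {
		return lines
	}
	var out []string
	for _, line := range lines {
		if strings.HasPrefix(strings.TrimLeft(line, " "), "(comment ") {
			continue // discard the comment node
		}
		out = append(out, line)
	}
	return out
}

func main() {
	lines := []string{
		"(translation_unit [0, 0] - [15, 0]",
		"  (comment [0, 0] - [0, 44])",
		"  (preproc_include [3, 0] - [4, 0]",
	}
	keep := isSet(os.Getenv("TREESIT_COMMENTS"))
	for _, line := range DropComments(lines, keep) {
		fmt.Println(line)
	}
}
```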

To recover lines, we move their numbers to the left-hand column. We also fill in the names and literals that matter to us, pulling them from the source file by their line/column offsets. For example (note that SDT wants 1-based line numbers whereas tree-sitter uses 0-based):

```
% treesit samples/hello0.c
SrcLn | Node
00001 | (translation_unit
00004 |   (preproc_include
00004 |     path: (system_lib_string <stdio.h>))
[...]
00008 |    (function_definition
00008 |      type: (primitive_type)
00008 |      declarator: (function_declarator
00008 |        declarator: (identifier main)
[...]
00012 |          (string_literal "Hello World"))))
00014 |    (return_statement
00014 |      (number_literal 0)))))
```
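
The per-line transformation can be sketched in Go roughly as follows. This is an illustration under stated assumptions rather than the tool's actual code: the `Render` function and its parameters are hypothetical, the caller decides which node types get their source text filled in, columns are treated as byte offsets, and only nodes that start and end on the same line are handled. The source lines in `main` are stand-ins, not the real contents of samples/hello0.c.

```go
package main

import (
	"fmt"
	"strings"
)

// Render formats one node in the "SrcLn | Node" layout. row, startCol and
// endCol are tree-sitter's 0-based values; src holds the source file's lines.
// When fill is true, the source text in [startCol, endCol) is appended.
func Render(indent int, text string, row, startCol, endCol int,
	closers string, src []string, fill bool) string {
	out := strings.Repeat(" ", indent) + text
	if fill && row < len(src) && startCol <= endCol && endCol <= len(src[row]) {
		out += " " + src[row][startCol:endCol]
	}
	// SDT wants 1-based line numbers, zero-padded to five digits.
	return fmt.Sprintf("%05d | %s%s", row+1, out, closers)
}

func main() {
	// Stand-in source lines (assumed for illustration only).
	src := []string{
		"/* first comment */",
		"",
		"/* second comment */",
		"#include <stdio.h>",
	}
	fmt.Println(Render(0, "(translation_unit", 0, 0, 0, "", src, false))
	fmt.Println(Render(2, "(preproc_include", 3, 0, 0, "", src, false))
	fmt.Println(Render(4, "path: (system_lib_string", 3, 9, 18, "))", src, true))
}
```

Passing the closing parentheses through unchanged keeps the tree balanced once the position ranges are gone.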