github.com/google/licenseclassifier
Version: 2.0.0+incompatible
Repository: https://github.com/google/licenseclassifier.git
Documentation: pkg.go.dev

# README

License Classifier

Introduction

The license classifier is a library and set of tools that analyze text to determine what type of license it contains. It searches for license texts in a file and compares them to an archive of known licenses. These files could be, e.g., LICENSE files containing one or more licenses, or source code files with the license text in a comment.

A "confidence level" is associated with each result indicating how close the match was. A confidence level of 1.0 indicates an exact match, while a confidence level of 0.0 indicates that no license was able to match the text.
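To make the 0.0 to 1.0 scale concrete, here is a minimal sketch that scores two strings by normalized edit distance. It only illustrates how such a score behaves at the extremes; it is not taken from the classifier's source, and the names `levenshtein`, `similarity`, and `min3` are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarity maps the edit distance onto [0, 1]: 1.0 for identical
// strings, 0.0 when every position differs. (Assumed scoring rule.)
func similarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Printf("%.2f\n", similarity("MIT License", "MIT License")) // 1.00
	fmt.Printf("%.2f\n", similarity("abcd", "abcf"))               // 0.75
	fmt.Printf("%.2f\n", similarity("abc", "xyz"))                 // 0.00
}
```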

Adding a new license

Adding a new license is straightforward:

  1. Create a file in licenses/.

    • The filename should be the name of the license or its abbreviation. If the license is an Open Source license, use the appropriate identifier specified at https://spdx.org/licenses/.
    • If the license is the "header" version of the license, append the suffix ".header" to it. See licenses/README.md for more details.
  2. Add the license name to the list in license_type.go.

  3. Regenerate the licenses.db file by running the license serializer:

    $ license_serializer -output licenseclassifier/licenses
    
  4. Create and run appropriate tests to verify that the license is indeed present.
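The naming rule in step 1 can be sketched as a small helper. This is an illustration only: `licenseFilePath` is a hypothetical function, not part of the repository, and any file extension the repository expects is not covered by the steps above, so it is left out here.

```go
package main

import (
	"fmt"
	"path"
)

// licenseFilePath builds the path of a new license file under licenses/.
// id is the license name or its SPDX identifier; header selects the
// ".header" variant described in step 1.
func licenseFilePath(id string, header bool) string {
	name := id
	if header {
		name += ".header"
	}
	return path.Join("licenses", name)
}

func main() {
	fmt.Println(licenseFilePath("Apache-2.0", false)) // licenses/Apache-2.0
	fmt.Println(licenseFilePath("Apache-2.0", true))  // licenses/Apache-2.0.header
}
```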

Tools

Identify license

identify_license is a command line tool that can identify the license(s) within a file.

$ identify_license LICENSE
LICENSE: GPL-2.0 (confidence: 1, offset: 0, extent: 14794)
LICENSE: LGPL-2.1 (confidence: 1, offset: 18366, extent: 23829)
LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)
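If the output needs to be consumed by another program, a line of it can be parsed roughly as follows. The line format is inferred only from the sample output above and may not cover every case the tool prints; the `result` type, `lineRE`, and `parseLine` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// result holds the fields of one output line.
type result struct {
	File       string
	License    string
	Confidence float64
	Offset     int
	Extent     int
}

// lineRE matches "FILE: NAME (confidence: C, offset: O, extent: E)".
var lineRE = regexp.MustCompile(
	`^(.+): (\S+) \(confidence: ([0-9.]+), offset: (\d+), extent: (\d+)\)$`)

// parseLine converts one output line into a result.
func parseLine(line string) (result, error) {
	m := lineRE.FindStringSubmatch(line)
	if m == nil {
		return result{}, fmt.Errorf("unrecognized line: %q", line)
	}
	conf, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return result{}, err
	}
	offset, err := strconv.Atoi(m[4])
	if err != nil {
		return result{}, err
	}
	extent, err := strconv.Atoi(m[5])
	if err != nil {
		return result{}, err
	}
	return result{File: m[1], License: m[2], Confidence: conf, Offset: offset, Extent: extent}, nil
}

func main() {
	r, err := parseLine("LICENSE: MIT (confidence: 1, offset: 17255, extent: 1059)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.License, r.Confidence, r.Offset, r.Extent)
}
```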

License serializer

The license_serializer tool regenerates the licenses.db archive. The archive contains preprocessed license texts for quicker comparisons against unknown texts.

$ license_serializer -output licenseclassifier/licenses

This is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.

# Packages

Package commentparser does a basic parse over a source file and returns all of the comments from the code.
No description provided by the author
Package serializer normalizes the license text and calculates the hash values for all substrings in the license.
Package stringclassifier finds the nearest match between a string and a set of known values.
No description provided by the author
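The serializer description above mentions hashing substrings of a normalized license text. A minimal sketch of that general idea follows, assuming word-level windows of a fixed size and CRC-32 checksums; both choices are assumptions made for illustration, and `hashWords` and `windowHashes` are hypothetical names rather than the package's API.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"strings"
)

// hashWords returns the CRC-32 checksum of the words joined by spaces.
func hashWords(words []string) uint32 {
	return crc32.ChecksumIEEE([]byte(strings.Join(words, " ")))
}

// windowHashes lowercases text, splits it on whitespace, and maps the
// checksum of every run of n consecutive words to the word positions
// at which that run starts.
func windowHashes(text string, n int) map[uint32][]int {
	hashes := make(map[uint32][]int)
	if n <= 0 {
		return hashes
	}
	words := strings.Fields(strings.ToLower(text))
	for i := 0; i+n <= len(words); i++ {
		sum := hashWords(words[i : i+n])
		hashes[sum] = append(hashes[sum], i)
	}
	return hashes
}

func main() {
	known := windowHashes("Permission is hereby granted", 2)
	// Look up a two-word phrase from an unknown text in the index.
	positions := known[hashWords([]string{"is", "hereby"})]
	fmt.Println(positions)
}
```

Precomputing such an index once is what lets later lookups against unknown text avoid rescanning every known license.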

# Functions

Archive is an OptionFunc to specify the location of the license archive file.
ArchiveBytes is an OptionFunc that provides the contents of the license archive file.
ArchiveFunc is an OptionFunc that provides a function that must return the contents of the license archive file.
CopyrightHolder finds a copyright notification, if it exists, and returns the copyright holder.
LicenseType returns the type the license has.
New creates a license classifier and pre-loads it with known open source licenses.
NewWithForbiddenLicenses creates a license classifier and pre-loads it with known open source licenses which are forbidden.
NormalizeEquivalentWords normalizes equivalent words that are interchangeable.
NormalizePunctuation takes all hyphens and quotes and normalizes them.
RemoveNonWords removes non-words from the string.
TrimExtraneousTrailingText removes any text that follows an obvious end of the license and is not part of the license's substantive text.
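Several of the functions listed above are text normalizers. The sketch below shows how a chain of such functions might be applied in order. The implementations are simplified stand-ins written for this example, not the package's own code, and `normalizer`, `normalizePunctuation`, `collapseWhitespace`, and `applyAll` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizer rewrites a string into a more canonical form.
type normalizer func(string) string

// punctuation maps curly quotes to straight quotes and long dashes to
// a plain hyphen. The exact character set is an assumption.
var punctuation = strings.NewReplacer(
	"\u2018", "'", "\u2019", "'",
	"\u201c", "\"", "\u201d", "\"",
	"\u2013", "-", "\u2014", "-",
)

// normalizePunctuation applies the punctuation replacements.
func normalizePunctuation(s string) string {
	return punctuation.Replace(s)
}

// collapseWhitespace replaces every run of whitespace with one space
// and trims the ends.
func collapseWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// applyAll runs the normalizers over s in the order given.
func applyAll(s string, fns ...normalizer) string {
	for _, fn := range fns {
		s = fn(s)
	}
	return s
}

func main() {
	in := "\u201cAS IS\u201d  \u2014 no   warranty"
	fmt.Println(applyAll(in, normalizePunctuation, collapseWhitespace))
}
```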

# Constants

The names come from the https://spdx.org/licenses website, and are also the filenames of the licenses in licenseclassifier/licenses.
Canonical names of the licenses.
DefaultConfidenceThreshold is the minimum confidence level we're willing to accept in order to say that a match is good.
ForbiddenLicenseArchive is the name of the archive containing preprocessed forbidden license texts only.
LicenseArchive is the name of the archive containing preprocessed license texts.
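DefaultConfidenceThreshold is described above as the minimum confidence accepted as a good match. A minimal sketch of applying such a cutoff follows; the `match` type, the `filterByConfidence` function, and the 0.8 cutoff used in `main` are all assumptions chosen for illustration, not values taken from the package.

```go
package main

import "fmt"

// match pairs a license name with the confidence of the match.
type match struct {
	Name       string
	Confidence float64
}

// filterByConfidence keeps only the matches whose confidence is at
// least threshold, preserving their order.
func filterByConfidence(matches []match, threshold float64) []match {
	var kept []match
	for _, m := range matches {
		if m.Confidence >= threshold {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	candidates := []match{
		{Name: "MIT", Confidence: 0.97},
		{Name: "ISC", Confidence: 0.41},
	}
	for _, m := range filterByConfidence(candidates, 0.8) {
		fmt.Println(m.Name, m.Confidence)
	}
}
```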

# Variables

LicenseTypes is a set of the types of licenses Google recognizes.
Normalizers is a list of functions that get applied to the strings before they are registered with the string classifier.
ReadLicenseDir reads the directory containing the license files.
ReadLicenseFile locates and reads the license archive file.

# Structs

License is a classifier pre-loaded with known open source licenses.

# Type aliases

OptionFunc sets options on a License struct.
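Option functions of this kind usually follow Go's functional options pattern. The sketch below shows the pattern with stand-in types; `classifier`, `option`, `withArchiveBytes`, `withArchiveFile`, and `newClassifier` are hypothetical names that mirror the roles described above, and the package's real signatures may differ.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// classifier is a stand-in for a struct configured through options.
type classifier struct {
	threshold float64
	archive   func() ([]byte, error) // lazily yields the archive contents
}

// option mutates a classifier during construction.
type option func(*classifier) error

// withArchiveBytes supplies the archive contents directly.
func withArchiveBytes(b []byte) option {
	return func(c *classifier) error {
		c.archive = func() ([]byte, error) { return b, nil }
		return nil
	}
}

// withArchiveFile supplies the archive by file path, read on demand.
func withArchiveFile(path string) option {
	return func(c *classifier) error {
		c.archive = func() ([]byte, error) { return os.ReadFile(path) }
		return nil
	}
}

// newClassifier validates the threshold and applies each option in order.
func newClassifier(threshold float64, opts ...option) (*classifier, error) {
	if threshold < 0 || threshold > 1 {
		return nil, errors.New("threshold must be between 0 and 1")
	}
	c := &classifier{threshold: threshold}
	for _, opt := range opts {
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newClassifier(0.9, withArchiveBytes([]byte("archive contents")))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	data, _ := c.archive()
	fmt.Println(c.threshold, len(data))
}
```

The pattern keeps the constructor signature stable while letting callers choose where the archive comes from.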