Module: github.com/ullaakut/astronomer
Version: 0.4.0
Repository: https://github.com/ullaakut/astronomer.git
Documentation: pkg.go.dev

# README

Astronomer

Astronomer is a tool that fetches data from every GitHub user who starred a common repository and computes how likely it is that those users are real humans. The goal of Astronomer is to detect illegitimate GitHub stars from bot accounts, which could be used to artificially increase the popularity of an open source project.

Trust algorithm

Trust is computed based on many different factors:

  • The average amount of lifetime contributions among stargazers
  • The average amount of private contributions
  • The average amount of public created issues
  • The average amount of public authored commits
  • The average amount of public opened pull requests
  • The average amount of public code reviews
  • The average weighted contribution score (weighted by making older contributions more trustworthy)
  • Every 5th percentile, from 5 to 95, of the weighted contribution score
  • The average account age, older is more trustworthy
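
As a rough illustration of the statistics listed above, the sketch below computes an average and every 5th percentile (5 to 95) over a slice of per-stargazer scores. It is not Astronomer's actual code: the function names are hypothetical, the sample scores are made up, and the nearest-rank percentile method is an assumption, since the README does not say which method is used.

```go
package main

import (
	"fmt"
	"sort"
)

// average returns the arithmetic mean of values, or 0 for an empty slice.
func average(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

// percentile returns the p-th percentile (1 <= p <= 100) of values using the
// nearest-rank method. The choice of method is an assumption of this sketch.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...) // copy, so the input is left untouched
	sort.Float64s(sorted)
	// rank = ceil(p * n / 100), computed with integers to avoid rounding issues.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// everyFifthPercentile returns the 5th, 10th, ..., 95th percentiles keyed by p.
func everyFifthPercentile(values []float64) map[int]float64 {
	out := make(map[int]float64, 19)
	for p := 5; p <= 95; p += 5 {
		out[p] = percentile(values, p)
	}
	return out
}

func main() {
	// Made-up weighted contribution scores for ten stargazers.
	scores := []float64{0, 2, 4, 10, 15, 30, 80, 120, 400, 2500}

	fmt.Println("average:", average(scores))
	percentiles := everyFifthPercentile(scores)
	for p := 5; p <= 95; p += 5 {
		fmt.Printf("p%d = %v\n", p, percentiles[p])
	}
}
```

Looking at percentiles alongside the average helps because a handful of very active accounts can pull the average up even when most stargazers have almost no activity.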

How to use it

Docker image

To use Astronomer, you'll need a GitHub token with repo read rights. You can generate one in your GitHub Settings > Developer settings > Personal Access Tokens. Make sure to keep this token secret. You will also need to have Docker installed.

Run docker pull ullaakut/astronomer.

Then, use the astronomer docker image like so: docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/path/to/your/cache/folder:/data/" ullaakut/astronomer repositoryOwner/repositoryName -d

  • The -t flag allows you to get a colored output. You can remove it from the command line if you don't care about this.
  • The -e GITHUB_TOKEN=<your_token> option is mandatory. The GitHub API won't authorize any requests without it.
  • The -v "/path/to/your/cache/folder:/data/" option can be used to cache the responses from the GitHub API on your machine. This means that the next time you run a scan, Astronomer will simply update its cache with the new stargazers since your last scan, and compute the trust levels again. It is highly recommended to use cache if you plan on scanning popular repositories (more than 1000 stars) more than once.
  • The -d flag enables more detailed trust factor computation.
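
To make the caching idea concrete, here is a minimal sketch of a file-based response cache. It only illustrates the general technique and is not how Astronomer stores its cache: the cachedFetch and countFetches helpers, the hashed file names, and the cache key are all hypothetical. It assumes Go 1.16 or later for os.ReadFile and os.WriteFile.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// cachedFetch returns the cached response for key if one exists in dir.
// Otherwise it calls fetch, stores the result on disk, and returns it.
func cachedFetch(dir, key string, fetch func() ([]byte, error)) ([]byte, error) {
	// Hash the key so that any string maps to a safe file name.
	sum := sha256.Sum256([]byte(key))
	path := filepath.Join(dir, hex.EncodeToString(sum[:])+".json")

	if data, err := os.ReadFile(path); err == nil {
		return data, nil // cache hit: no API request needed
	}

	data, err := fetch()
	if err != nil {
		return nil, err
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	return data, os.WriteFile(path, data, 0o644)
}

// countFetches calls cachedFetch n times with the same key in a fresh
// temporary directory and reports how often the real fetch function ran.
func countFetches(n int) (int, error) {
	dir, err := os.MkdirTemp("", "astronomer-cache-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	calls := 0
	for i := 0; i < n; i++ {
		_, err := cachedFetch(dir, "stargazers:page=1", func() ([]byte, error) {
			calls++ // stands in for a real GitHub API request
			return []byte(`{"users":[]}`), nil
		})
		if err != nil {
			return calls, err
		}
	}
	return calls, nil
}

func main() {
	calls, err := countFetches(3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("real fetches for 3 lookups:", calls)
}
```

Because responses live in a directory rather than in memory, mounting that directory from the host is what lets a later container run reuse them.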

Binary

You can also install the Go binary by enabling Go modules and running go install github.com/ullaakut/astronomer. Make sure that your Go version is at least 1.11.x.

You can verify your Go version by running go version.

The astronomer binary will then be available in $GOPATH/bin/astronomer.

Arguments and options

  • It is required to specify a repository in the form repositoryOwner/repositoryName. This argument's position does not matter.
  • -c, --cachedir (string): Set the directory in which to store cache data (default: ./data)
  • -s, --stars: Maximum number of stars to scan (picked randomly), if fast mode is enabled (default: 1000)
  • -d, --details: Show more detailed trust factors, such as percentiles (default: false)
  • --fast: Enable fast mode in order to scan random stargazers instead of all of them (slightly less accurate) (default: true)
  • --scanFirstStars: Scan the first stars of the repository (overrides fast mode). Set the number of stars with the --stars option (default: false)

Example use cases

Scanning a repository with detailed statistics

With the binary: astronomer --details ${repoOwner}/${repoName}

With the docker image: docker run -e GITHUB_TOKEN="$GITHUB_TOKEN" -v "/tmp:/data" -t ullaakut/astronomer --details ${repoOwner}/${repoName}

Note that you can also use the -d flag instead of --details.

Run a full scan of a repository with detailed statistics

With the binary: astronomer --details --fast=false ${repoOwner}/${repoName}

With the docker image: docker run -e GITHUB_TOKEN="$GITHUB_TOKEN" -v "/tmp:/data" -t ullaakut/astronomer --details --fast=false ${repoOwner}/${repoName}

Make sure to mount a directory that is safe from deletion (not /tmp like in this example), as the scan might take hours or days. If the scan is stopped while fetching data, the cache folder lets you resume from where fetching stopped in a few seconds. If the cache is lost, however, you will need to restart the scan.

Scanning the first 500 stars of a repository, with detailed statistics

With the binary: astronomer --scanFirstStars --stars="500" --details ${repoOwner}/${repoName}

With the docker image: docker run -e GITHUB_TOKEN="$GITHUB_TOKEN" -v "/tmp:/data" -t ullaakut/astronomer --scanFirstStars --stars="500" --details ${repoOwner}/${repoName}

Upcoming features

In the future, Astronomer will hopefully:

  • Generate GitHub badges to show off that your repository is legit (thanks @emilevauge for the idea!)
  • Provide a web interface to request scans and get your badge

Keep in mind that scans are always going to be long for huge repositories. A repository with 10K+ stars will take multiple hours to scan.

Questions & Answers

How accurate is this algorithm? Why does my repository have a low trust level?

Astronomer only attempts to estimate a trust level. The more stargazers there are on a repository, the more accurate it will be. Since the algorithm compares averages of the scanned repositories with global averages, if your repository has only two stars and both are from new accounts with low contributions, it will seem extremely fishy to Astronomer, even if those are probably real stars. Astronomer is more oriented towards popular projects with thousands of stars, where the first few thousand might have come from bot accounts used to boost the project's popularity.

Why would fake stars be an issue? The number of stars doesn't really matter.

Repositories with high numbers of stars, especially when they arrive in bursts, are often found in GitHub trending. They are also emailed to people who subscribed to the GitHub Explore daily newsletter. This means that an open source project can get actual users to use its software by bringing attention to it using illegitimate bot accounts. Many startups are known for choosing technologies to use based on GitHub stars, since they provide the comforting thought that the project is backed by a strong community. Unfortunately, as far as I know, GitHub currently does not attempt to prevent this from happening.

Why is Astronomer so slow? It's been scanning a project for hours.

If you disabled fast mode, this is normal. Astronomer is fetching all contributions from each individual stargazer of this repository. In most cases, a scan with fast mode enabled (which is the default) should not take more than 30 minutes, unless you have network issues. It will be slightly less accurate, but significantly faster.

With fast mode disabled, Astronomer needs to make a lot of queries to the GitHub API in order to fetch all of the user data. It typically needs one request per page of stargazers per year of contributions (as of 2019, that's 11 requests per 30 users). The issue is that the GitHub API is rate limited to 5000 requests per hour, so a scan of 25000 stars, for example, requires about 9000 requests, which results in at least a two-hour scan (it takes about 6 hours on my machine/network). I plan on contacting GitHub to try to get a token with more flexible rate limiting, since I believe this project is beneficial to their business, but I'm not confident this request will be accepted.
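
The arithmetic behind those figures can be reproduced with a short sketch. The constants come from the paragraph above (30 users per page, 11 requests per page as of 2019, 5000 requests per hour); the function names are hypothetical, and the duration assumes requests are spread evenly at exactly the rate limit, so it ignores network latency and any other overhead.

```go
package main

import (
	"fmt"
	"time"
)

const (
	usersPerPage    = 30   // stargazers returned per page
	requestsPerPage = 11   // one request per year of contributions, as of 2019
	requestsPerHour = 5000 // GitHub API rate limit quoted above
)

// estimateRequests returns the number of API requests needed for a full scan.
func estimateRequests(stars int) int {
	pages := (stars + usersPerPage - 1) / usersPerPage // round up to whole pages
	return pages * requestsPerPage
}

// minScanDuration returns how long the requests take when they are sent at
// exactly the rate limit, with no other delay.
func minScanDuration(requests int) time.Duration {
	return time.Duration(requests) * time.Hour / requestsPerHour
}

func main() {
	stars := 25000
	requests := estimateRequests(stars)
	fmt.Printf("%d stars: %d requests, about %v at the rate limit\n",
		stars, requests, minScanDuration(requests))
}
```

For 25000 stars this gives 834 pages and 9174 requests, which is a little under two hours of rate-limit budget before any real-world slowdown is counted.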

How can I contribute to this project?

If you have a strong math background, knowledge in statistics and analytics, or in general believe you could make the trust algorithm smarter, please contact me, or at least feel free to open a feature request describing what algorithm you think would work better. A feature that I would be especially interested in is computing the curve of percentile values for each trust factor and comparing it to a reference curve, in order to detect inconsistencies.

If you are a software engineer or a web developer (or both), you could also participate in helping to build the next version of Astronomer: an API and a web application to let people scan whatever repositories they want for fake stars, and see previously generated reports through a web interface. It would make it easy for everyone to check whether or not a repository's stargazers are legit.

Also, if you have data to back up a claim that you have a better value for the good/bad constants (used to determine what is a good or bad value for a specific metric), feel free to reach out to me. This is an essential part of having a precise estimation of how legit a repository is, and improving these constants would improve the overall quality of the algorithm.

What is the end goal of this project?

Ideally, this should be a GitHub feature. The issue is that it's almost impossible to differentiate between a bot account and the account of someone who just created a GH account to star a repository and show their support, which can lead to angry customers for GitHub if they choose to ban potentially illegitimate accounts. It's also very easy for people who make bot accounts to make them seem legit by creating private repositories with daily contributions, but this can also be detected to some extent, if it's a trend that ends up appearing.

What's the strange hardcoded skip in the query.go file?

Unfortunately, there's an issue in the GitHub API: this user has so many contributions that every API request that would include his contributions consistently times out. Since he starred containous/traefik, I had to hardcode a skip to allow the scan to continue. The GH API's only method of pagination is to use the cursor returned by the user node, so I had to get his cursor value manually and hardcode it. Writing logic to handle this case generically whenever it happens would be possible, but I'm not sure it's a priority right now. I've sent a support request to GitHub, so when they fix it, I'll make sure to remove this skip.
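
For readers unfamiliar with cursor-based pagination, the sketch below shows the general shape of such a skip. It is not the code from query.go: the page type, the fetchFunc signature, the fakeFetch data, and the cursor values are all hypothetical. The skips map plays the role of the hardcoded value: when the loop reaches a cursor whose page is known to time out, it jumps straight to a cursor that was obtained by hand.

```go
package main

import (
	"errors"
	"fmt"
)

// page is a hypothetical, simplified view of one page of stargazers.
type page struct {
	users      []string
	nextCursor string // empty when there are no more pages
}

// fetchFunc fetches the page that starts after the given cursor.
type fetchFunc func(cursor string) (page, error)

// paginate walks all pages starting from the empty cursor. skips maps a
// cursor whose page is known to fail to the cursor to resume from; targets
// must be non-empty cursors located further along in the listing.
func paginate(fetch fetchFunc, skips map[string]string) ([]string, error) {
	var users []string
	cursor := ""
	for {
		if next, ok := skips[cursor]; ok {
			cursor = next // jump over the page that cannot be fetched
			continue
		}
		p, err := fetch(cursor)
		if err != nil {
			return users, fmt.Errorf("fetching page after cursor %q: %w", cursor, err)
		}
		users = append(users, p.users...)
		if p.nextCursor == "" {
			return users, nil
		}
		cursor = p.nextCursor
	}
}

// fakeFetch simulates an API in which the page after cursor "c1" always
// times out.
func fakeFetch(cursor string) (page, error) {
	switch cursor {
	case "":
		return page{users: []string{"alice", "bob"}, nextCursor: "c1"}, nil
	case "c1":
		return page{}, errors.New("request timed out")
	case "c2":
		return page{users: []string{"dave"}}, nil
	default:
		return page{}, errors.New("unknown cursor")
	}
}

func main() {
	users, err := paginate(fakeFetch, map[string]string{"c1": "c2"})
	fmt.Println(users, err)
}
```

The trade-off is that every stargazer on the skipped page is left out of the scan, which is acceptable for a single known case but would need care in a generic solution.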

Thanks

Thanks to the authors of spencerkimball/stargazers who greatly inspired the early design of this project 🙏

The original Go gopher was designed by Renee French.

License

Copyright 2019 Ullaakut

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.