Quick Start Guide: Web Scraper

In this guide you'll create a Pachyderm pipeline that scrapes web pages. We'll use a standard Unix tool, wget, to do the scraping.

Setup

This guide assumes that you already have a Pachyderm cluster running and have configured pachctl to talk to the cluster. Detailed setup instructions can be found here.

Create a Repo

A repo is the highest-level primitive in pfs (the Pachyderm File System). Like all primitives in pfs, it shares its name with a primitive in Git and is designed to behave analogously. Generally, a repo should be dedicated to a single source of data, such as log messages from a particular service. Repos are dirt cheap, so don't be shy about making them very specific.

For this demo we'll simply create a repo called “urls” to hold a list of urls that we want to scrape.

$ pachctl create-repo urls
$ pachctl list-repo
urls

Start a Commit

Now that we’ve created a repo, we’ve got a place to add data. If you try writing to the repo right away, though, it will fail, because you can't write directly to a repo. In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data, and they are what give Pachyderm its version-control properties for data. Unlike in Git, though, commits in Pachyderm must be explicitly started and finished.

Let's start a new commit in the “urls” repo:

$ pachctl start-commit urls
6a7ddaf3704b4cb6ae4ec73522efe05f

This returns a brand new commit ID. Yours will be different from the one shown here. Now if we take a look inside our repo, we can see that a directory has been created for the new commit:

$ pachctl list-commit urls
6a7ddaf3704b4cb6ae4ec73522efe05f

A new directory has been created for our commit, and now we can start adding data. Data for this example is just a single file with a list of URLs. We've provided a sample file for you with just three URLs: Google, Reddit, and Imgur. We're going to write that data as a file called “urls” in pfs.

# Write sample data into pfs
$ cat examples/scraper/urls | pachctl put-file urls 6a7ddaf3704b4cb6ae4ec73522efe05f urls

However, you'll notice that we can't read the file “urls” yet. This is because the commit hasn’t been completed yet and in a distributed environment could contain some dirty state.

$ pachctl get-file urls 6a7ddaf3704b4cb6ae4ec73522efe05f urls
File not found

Finish a Commit

Pachyderm won't let you read data from a commit until the commit is finished. This prevents reads from racing with writes. Furthermore, every write to pfs is atomic. Now let's finish the commit:

$ pachctl finish-commit urls 6a7ddaf3704b4cb6ae4ec73522efe05f

Now we can view the file:

$ pachctl get-file urls 6a7ddaf3704b4cb6ae4ec73522efe05f urls
www.google.com
www.reddit.com
www.imgur.com

However, we've lost the ability to write to this commit since finished commits are immutable. In Pachyderm, a commit is always either write-only when it's been started and files are being added, or read-only after it's finished.
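The write-only, then read-only lifecycle described above can be sketched as a small toy model. This is only an illustration of the rule, not Pachyderm's implementation; the class name `ToyCommit` and its methods are made up for this example.

```python
# Toy model only: NOT Pachyderm's implementation, just the lifecycle rule
# "write-only while open, read-only once finished".

class ToyCommit:
    def __init__(self):
        self._files = {}        # path -> bytes written so far
        self._finished = False

    def put_file(self, path, data):
        if self._finished:
            raise PermissionError("finished commits are immutable")
        self._files[path] = self._files.get(path, b"") + data

    def finish(self):
        self._finished = True

    def get_file(self, path):
        if not self._finished:
            raise FileNotFoundError("commit is still open; finish it before reading")
        return self._files[path]


commit = ToyCommit()
commit.put_file("urls", b"www.google.com\n")
try:
    commit.get_file("urls")           # too early: the commit is still open
except FileNotFoundError as err:
    print("read before finish:", err)

commit.finish()
print(commit.get_file("urls").decode(), end="")
```

After `finish()`, a further `put_file` call raises `PermissionError`, mirroring the fact that a finished commit can no longer be written to.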

Create a Pipeline

Now that we've got some data in our repo it's time to do something with it. Pipelines are the core primitive for Pachyderm's processing system (pps) and they're specified with a JSON encoding. We're going to create a pipeline that simply scrapes each of the web pages in “urls.”

+----------+     +---------------+     
|input data| --> |scrape pipeline| 
+----------+     +---------------+

The pipeline we're creating can be found at examples/scraper/scraper.json. The full content is also below.

{
  "pipeline": {
    "name": "scraper"
  },
  "transform": {
    "cmd": [ "wget",
        "--recursive",
        "--level", "1",
        "--accept", "jpg,jpeg,png,gif,bmp",
        "--page-requisites",
        "--adjust-extension",
        "--span-hosts",
        "--no-check-certificate",
        "--timestamping",
        "--directory-prefix",
        "/pfs/out",
        "--input-file", "/pfs/urls/urls"
    ],
    "acceptReturnCode": [4,5,6,7,8]
  },
  "parallelism": "1",
  "inputs": [
    {
      "repo": {
        "name": "urls"
      }
    }
  ]
}

In this pipeline, we’re just using wget to scrape the content of our input web pages. “--level” sets how many levels of links wget will follow recursively. We currently have it set to 1, which will only scrape the home page, but you can crank it up later if you want.

Also notice that we read input data from /pfs/urls/urls (input repos appear at /pfs/[input_repo_name]) and write output data to /pfs/out/. wget creates a directory for each URL in “urls” with all of the relevant scrapes as files.
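Because the command has to agree with the declared inputs, it can be handy to check a spec before creating the pipeline. The following is a minimal sketch of such a check, assuming the spec layout shown above; `check_paths` is a hypothetical helper, not part of pachctl, and `SPEC` is a trimmed copy of scraper.json with only the fields the check reads.

```python
import json

# Trimmed copy of the spec above (only the fields this check reads).
SPEC = """
{
  "pipeline": {"name": "scraper"},
  "transform": {
    "cmd": ["wget", "--directory-prefix", "/pfs/out",
            "--input-file", "/pfs/urls/urls"]
  },
  "inputs": [{"repo": {"name": "urls"}}]
}
"""


def check_paths(spec):
    """Return a list of problems with the /pfs paths used in the command."""
    repos = {inp["repo"]["name"] for inp in spec["inputs"]}
    problems = []
    for arg in spec["transform"]["cmd"]:
        if not arg.startswith("/pfs/"):
            continue
        # "/pfs/urls/urls".split("/") -> ["", "pfs", "urls", "urls"]
        top = arg.split("/")[2]
        if top != "out" and top not in repos:
            problems.append(f"{arg}: no input repo named {top!r}")
    return problems


spec = json.loads(SPEC)
print(check_paths(spec) or "all /pfs paths match an input repo or /pfs/out")
```

Parsing the file with `json.loads` also catches syntax slips, such as a stray curly quote, before the spec ever reaches the cluster.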

Now let's create the pipeline in Pachyderm:

$ pachctl create-pipeline -f examples/scraper/scraper.json

What Happens When You Create a Pipeline

Creating a pipeline tells Pachyderm to run your code on every finished commit in a repo as well as all future commits that happen after the pipeline is created. Our repo already had a commit with the file “urls” in it so Pachyderm will automatically launch a job to scrape those webpages.

You can view the job with:

$ pachctl list-job
ID                                 OUTPUT                                     STATE
09a7eb68995c43979cba2b0d29432073   scraper/2b43def9b52b4fdfadd95a70215e90c9   JOB_STATE_RUNNING

Depending on how quickly you do the above, you may see JOB_STATE_RUNNING or JOB_STATE_SUCCESS.
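If you want to check the job state from a script, the listing can be parsed. This is a rough sketch that assumes the output looks like the sample above: a header row followed by whitespace-separated columns with no spaces inside a value. `parse_jobs` is a hypothetical helper, and the real output format may differ between versions.

```python
def parse_jobs(text):
    """Turn `pachctl list-job` text into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].lower().split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample copied from the listing above.
SAMPLE = """\
ID                                 OUTPUT                                     STATE
09a7eb68995c43979cba2b0d29432073   scraper/2b43def9b52b4fdfadd95a70215e90c9   JOB_STATE_RUNNING
"""

for job in parse_jobs(SAMPLE):
    # Assumption: OUTPUT has the form <repo>/<commit id>.
    repo, commit = job["output"].split("/")
    print(job["id"], job["state"], "-> output repo", repo, "commit", commit)
```

In practice you would feed the function the captured output of `pachctl list-job` and repeat until the state is JOB_STATE_SUCCESS.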

Pachyderm jobs are implemented as Kubernetes jobs, so you can also see your job with:

$ kubectl get job
JOB                                CONTAINER(S)   IMAGE(S)             SELECTOR                                                         SUCCESSFUL
09a7eb68995c43979cba2b0d29432073   user           pachyderm/job-shim   app in (09a7eb68995c43979cba2b0d29432073),suite in (pachyderm)   1

Every pipeline creates a corresponding repo with the same name where it stores its output results. In our example, the pipeline was named “scraper” so it created a repo called “scraper” which contains the final output.

Reading the Output

There are a couple of different ways to retrieve the output. We can read a single output file from the “scraper” repo in the same fashion that we read the input data:

$ pachctl list-file scraper 2b43def9b52b4fdfadd95a70215e90c9 urls
$ pachctl get-file scraper 2b43def9b52b4fdfadd95a70215e90c9 urls/www.imgur.com/index.html

Using get-file is good if you know exactly what file you’re looking for, but for this example we want to just see all the scraped pages. One great way to do this is to mount the distributed file system locally and then just poke around.

Mount the Filesystem

First create the mount point:

$ mkdir ~/pfs

And then mount it:

# We background this process because it blocks.
$ pachctl mount ~/pfs &

This will mount pfs on ~/pfs, so you can inspect it like you would any other local filesystem. Try:

$ ls ~/pfs
urls

You should see the urls repo that we created.

Now you can simply ls and cd around the file system. Try pointing your browser at the scraped output files!
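As an alternative to poking around by hand, a short script can list everything under the mount point. This is a minimal sketch that only assumes pfs is mounted at ~/pfs as above; it makes no assumption about how repos and commits are laid out inside the mount, and `list_files` is a name chosen for this example.

```python
import os


def list_files(root):
    """Return every file under `root` as a sorted list of relative paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    mount = os.path.expanduser("~/pfs")   # the mount point created above
    for path in list_files(mount):
        print(path)
```

If nothing is mounted (or the directory does not exist), `os.walk` yields nothing and the script prints no lines.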

Processing More Data

Pipelines can be triggered manually, but they also automatically process the data in new commits as those commits are finished. Think of pipelines as being subscribed to any new commits that are finished on their input repo(s).
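The subscription idea can be pictured with a toy model: a new pipeline first catches up on commits that were already finished, and is then notified about each later one. This is purely illustrative and says nothing about how Pachyderm implements it; `ToyRepo` and its methods are invented for the sketch.

```python
# Toy model only: illustrates "pipelines subscribe to finished commits".

class ToyRepo:
    def __init__(self, name):
        self.name = name
        self.finished = []       # finished commit IDs, oldest first
        self.subscribers = []    # callables invoked with each finished commit

    def subscribe(self, pipeline):
        # A new pipeline first catches up on commits that already exist...
        for commit in self.finished:
            pipeline(commit)
        # ...and is then notified about every later one.
        self.subscribers.append(pipeline)

    def finish_commit(self, commit):
        self.finished.append(commit)
        for pipeline in self.subscribers:
            pipeline(commit)


processed = []
urls = ToyRepo("urls")
urls.finish_commit("commit-1")        # finished before the pipeline exists
urls.subscribe(processed.append)      # pipeline created: processes commit-1
urls.finish_commit("commit-2")        # triggers the pipeline automatically
print(processed)                      # ['commit-1', 'commit-2']
```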

If we want to re-scrape some of our urls to see if the sites have changed, we can use the run-pipeline command:

$ pachctl run-pipeline scraper
fab8c59c786842ccaf20589e15606604

Next, let’s add additional urls to our input data. We're going to append more urls from “urls2” to the file “urls.”

We first need to start a new commit to add more data. Similar to Git, commits have a parental structure that tracks how files change over time. Specifying a parent is optional when creating a commit (notice we didn't specify a parent when we created the first commit), but in this case we're going to be adding more data to the same file “urls.”

Let's create a new commit with our previous commit as the parent:

$ pachctl start-commit urls -p 6a7ddaf3704b4cb6ae4ec73522efe05f
e2b8c59c786842ccaf20589e15606604

Append more data to our urls file in the new commit:

$ cat examples/scraper/urls2 | pachctl put-file urls e2b8c59c786842ccaf20589e15606604 urls
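One way to picture what appending in a child commit means is a toy model in which each commit records only what it added, and a read walks the parent chain. This is an assumption made for illustration, not a description of pfs internals; `ChainedCommit` and `read_file` are invented names, and `www.example.com` is a placeholder because the contents of urls2 are not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ChainedCommit:
    parent: Optional["ChainedCommit"] = None
    appended: Dict[str, str] = field(default_factory=dict)  # path -> text added here


def read_file(commit, path):
    """Concatenate what the commit and its ancestors appended to `path`, oldest first."""
    chunks = []
    while commit is not None:
        chunks.append(commit.appended.get(path, ""))
        commit = commit.parent
    return "".join(reversed(chunks))


first = ChainedCommit(
    appended={"urls": "www.google.com\nwww.reddit.com\nwww.imgur.com\n"}
)
# Placeholder URL: the real urls2 file is not reproduced in this guide.
second = ChainedCommit(parent=first, appended={"urls": "www.example.com\n"})

print(read_file(second, "urls"), end="")
```

Reading `first` still returns only the original three lines, which matches the idea that a finished commit is an immutable snapshot.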

Finally, we'll want to finish our second commit. After it's finished, we can read the latest commit in the “scraper” repo to see all the scrapes.

$ pachctl finish-commit urls e2b8c59c786842ccaf20589e15606604

Finishing this commit will also automatically trigger the pipeline to run on the new data we've added. We'll see a corresponding commit to the output “scraper” repo with data from our newly added sites.

$ pachctl list-file scraper d161c59c786842ccaf20589e1525ecd5 urls
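The start, write, finish sequence used twice in this guide can be wrapped in a small script. The sketch below only uses the pachctl subcommands shown in this guide (start-commit with the optional -p flag, put-file reading from stdin, finish-commit) and assumes start-commit prints just the new commit ID, as in the transcripts. `commit_file` is a hypothetical helper; the `run` parameter exists so the function can be exercised without a cluster.

```python
import subprocess


def commit_file(repo, path, data, parent=None, run=subprocess.run):
    """Start a commit, write `data` to `path`, finish the commit, return its ID."""
    start = ["pachctl", "start-commit", repo]
    if parent is not None:
        start += ["-p", parent]
    # Assumption: start-commit prints only the new commit ID on stdout.
    result = run(start, check=True, capture_output=True, text=True)
    commit_id = result.stdout.strip()

    # put-file reads the file contents from stdin, as in `cat ... | pachctl put-file`.
    run(["pachctl", "put-file", repo, commit_id, path],
        input=data, check=True, text=True)
    run(["pachctl", "finish-commit", repo, commit_id], check=True)
    return commit_id


# Example call (needs a running cluster and a configured pachctl):
# new_id = commit_file("urls", "urls", "www.example.com\n",
#                      parent="6a7ddaf3704b4cb6ae4ec73522efe05f")
```

Because `check=True` is passed, any failing pachctl call raises `subprocess.CalledProcessError` rather than leaving a half-written commit unnoticed.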

Next Steps

You've now got a working Pachyderm cluster with data and a pipeline! Here are a few ideas for next steps that expand on your working setup.

Add a bunch more urls and crank up the “level” in the pipeline. You’ll have to delete the old pipeline and re-create it, or give your pipeline a new name.
Add a new pipeline that does something interesting with the scraper output. Image or text processing could be fun. Just create a pipeline with the scraper repo as an input.

We'd love to help and to see what you come up with, so submit any issues or questions you come across, or email us at [email protected] if you want to show off anything nifty you've created!
