A Tour of Go #69 Exercise: Web Crawler

I can’t resist. The difficulty level of the exercise is similar to much of what I’ve posted on this blog and I just want to talk about it.

For any of you who don’t regularly follow the Go Blog or Go Nuts, I’m talking about the final exercise, #69, of the recently published A Tour of Go. This is a perfect little introduction to the language that starts with the most basic elements of the language and progresses through some of the best and most interesting elements of the language. There are exercises along the way. The first exercises are easy, then they become more challenging. I believe most people new to Go will find the final exercise in particular to be a satisfying challenge. If you haven’t taken the tour yet, I encourage you to do so and to attempt the exercises, especially #69, the subject of my post here.

Done? Good. :) Of the two TODO items, “Don’t fetch the same URL twice” is easy to add.  One quick and dirty solution is to replace the Crawl function with the following,

func Crawl(url string, depth int, fetcher Fetcher) {
    c2(url, depth, fetcher, map[string]bool{url: true})
}

func c2(url string, depth int, fetcher Fetcher, m map[string]bool) {
    // TODO: Fetch URLs in parallel.
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if !m[u] {
            m[u] = true
            c2(u, depth-1, fetcher, m)
        }
    }
    return
}

This works, but some people might not like a couple of things here. It introduces a new function, c2, and c2 has a fourth parameter. Do you agonize about the clutter of another function or the overhead of pushing another parameter on the stack? There are other ways to do it. You could make the map a package variable. Bad idea. I won’t even show it. An excellent and popular technique is a recursive closure:

func Crawl(url string, depth int, fetcher Fetcher) {
    m := map[string]bool{url: true}
    var c2 func(string, int)
    c2 = func(url string, depth int) {
        // TODO: Fetch URLs in parallel.
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        for _, u := range urls {
            if !m[u] {
                m[u] = true
                c2(u, depth-1)
            }
        }
        return
    }
    c2(url, depth)
}

Alternatively, you could use a struct and a method:

func Crawl(url string, depth int, fetcher Fetcher) {
    c := &crawlHistory{fetcher, map[string]bool{url: true}}
    c.c2(url, depth)
}

type crawlHistory struct {
    fetcher Fetcher
    m       map[string]bool
}

func (c *crawlHistory) c2(url string, depth int) {
    // TODO: Fetch URLs in parallel.
    if depth <= 0 {
        return
    }
    body, urls, err := c.fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if !c.m[u] {
            c.m[u] = true
            c.c2(u, depth-1)
        }
    }
    return
}

Which of these forms appeal to you is likely to depend on what other languages you know. If you know a language with closures, you probably like the first one. If your experience is object oriented, the second probably looks good to you. It really doesn’t matter.

One elegant little change that can be made to the second version is to embed Fetcher in crawlHistory. Remove the field name from Fetcher so that the struct definition reads simply

type crawlHistory struct {
    Fetcher
    m map[string]bool
}

The call to Fetch can now be written

    body, urls, err := c.Fetch(url)

Embedding wasn’t covered in the tour, but that’s okay because we’re not going to complete this exercise without a couple of other things that weren’t covered in the tour.

Next on the TODO list is fetching URLs in parallel. We’re going to start multiple goroutines and let them crawl different urls found on the page. A pretty big thing not covered in the tour is that concurrent map access is not safe without synchronization. That means these multiple goroutines need to access to the map one at a time. If we don’t serialize access to the map, we risk wrong answers or even corrupt memory and a crash.

Also not covered is that fmt.Println should be synchronized as well. It turns out that we have no guarantee from the operating system even that output from multiple threads won’t get interleaved.

Finally, covered so subtly that you probably missed it is that a Go program terminates when main terminates, without waiting for other goroutines. (Take another look at the output of example 61.) We will start goroutines to crawl urls found on the page, but then we need to do something to wait for them to complete.

So we’re going to need lots of synchronization to do this exercise! The tour did mention the sync package, and there is stuff there to do the job, but we can do the exercise easily without the sync package so I won’t go into it just yet. Go channels, as shown in the tour, represent a very general purpose mechanism that easily handle all of the synchronization needs for this exercise.

The pattern I use for serializing access to the shared resources—the map and fmt.Println—uses channels with capacity=1. One channel is created for a resource that is to be shared and this one channel is passed to all goroutines that want to access the resource. When a goroutine wants to access the resource, it must take (some reference to) the resource from the channel, use the resource, and then return the resource to the channel. It’s like the salt shaker on the table. Only one person uses it at a time, and otherwise it sits in a place where everyone can reach it.

The pattern I use to wait for goroutine completion is that if a function launches goroutines, it must also create a channel for them to report completion. It counts the number of goroutines it launches, then waits for exactly that many completion reports.

Simple enough? Here’s the code:

// struct renamed crawlShared to reflect that it provides access to
// things that are shared across goroutines (the goroutines being
// instances of c2, specifically.)
type crawlShared struct {
    Fetcher     // embedded for Fetch method.

    // the "salt shaker" pattern:  a channel holds the shared map
    mapAccess   chan map[string]bool

    // about the same pattern, except that there is nothing obvious
    // to put in the channel.  instead, a boolean value is used as
    // a token.  the value is ignored.  all that matters is the
    // presence or absence of the token in the channel.
    printAccess chan bool
}

func Crawl(url string, depth int, fetcher Fetcher) {
    c := &crawlShared{
        fetcher,
        make(chan map[string]bool, 1),
        make(chan bool, 1),
    }

    // put the salt shaker on the table.  that is, put the map
    // in the channel, making it available to goroutines.
    c.mapAccess <- map[string]bool{url: true}

    // same with the token to serialize printing
    c.printAccess <- true

    // run goroutine to crawl top level url.
    // since we are starting exactly one goroutine here, we wait
    // for a single completion report.  receipt means that all
    // lower levels have also completed and it is safe to return
    // --and allow the caller to return, in this case, main().
    done := make(chan bool)
    go c.c2(url, depth, done)
    <-done
}

func (c *crawlShared) c2(url string, depth int, pageDone chan bool) {
    // the function has multiple return points.  all of them must
    // report goroutine completion by sending a value on pageDone.
    if depth <= 0 {
        pageDone <- true
        return
    }

    body, urls, err := c.Fetch(url)
    if err != nil {
        // here's how to print:
        // take the token (waiting for it if it's not there.)
        <-c.printAccess
        // do whatever you need to do while other goroutines are
        // excluded from printing.
        fmt.Println(err)
        // put the token back, allowing others to print again.
        c.printAccess <- true

        pageDone <- true
        return
    }

    // same sequence of steps to print found message
    <-c.printAccess
    fmt.Printf("found: %s %q\n", url, body)
    c.printAccess <- true

    // "found" means the url was fetched without error and that urls
    // on the fetched page are collected in the slice "urls."
    // synchronization to crawl these urls in parallel is implemeted
    // with the uDone channel.  create the channel, count the number of
    // goroutines started, then wait for exactly that many completions.
    uDone := make(chan bool)
    uCount := 0

    // salt shaker pattern for map access:  get the map from
    // the channel, and then hold it while iterating over urls.
    // this works with the assumption that all of the operations here
    // take trivial time compared to the relatively lengthy time
    // to fetch a url.  other than map access (which is what we need
    // exclusive access for!) the only operations are iterating over
    // a string slice, incrementing an integer, and starting
    // a goroutine.  these all run very fast so it is reasonable and
    // best to hold "the lock" that is, exclusive map access, while
    // running through this loop.
    m := <-c.mapAccess
    for _, u := range urls {
        if !m[u] {
            m[u] = true
            uCount++
            go c.c2(u, depth-1, uDone)
        }
    }
    c.mapAccess <- m

    // wait for the number of goroutines started just above.
    for ; uCount > 0; uCount-- {
        <-uDone
    }

    // and finally, report completion of this level
    pageDone <- true
}

So that’s kind of the safety scissors version. I think it shows a correct solution while introducing a minimum of concepts that were not covered in the tour.

I’ll leave you with this alternative that uses a few more tricks. The tricks require greater knowledge of the language and libraries, but make for code that is both more concise and faster. This version dispenses with the struct by going back to the recursive closure, trades printAccess for a little trick with the log package, trades mapAccess for a sync.Mutex, and replaces all the done channels with a single sync.WaitGroup. With the waitGroup there is no longer any need to run the first call to c2 as a goroutine, mm, and there’s a defer statement in there too.

import (
    "log"
    "os"
    "sync"
)

var fmt = log.New(os.Stdout, "", 0)

func Crawl(url string, depth int, fetcher Fetcher) {
    m := map[string]bool{url: true}
    var mx sync.Mutex
    var wg sync.WaitGroup
    var c2 func(string, int)
    c2 = func(url string, depth int) {
        defer wg.Done()
        if depth <= 0 {
            return
        }
        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Printf("found: %s %q\n", url, body)
        mx.Lock()
        for _, u := range urls {
            if !m[u] {
                m[u] = true
                wg.Add(1)
                go c2(u, depth-1)
            }
        }
        mx.Unlock()
    }
    wg.Add(1)
    c2(url, depth)
    wg.Wait()
}

Oh, and without the fmt package, there’s a line in fakeFetcher.Fetch that now has to read

    return "", nil, os.NewError("not found: " + url)
About these ads
This entry was posted in golang. Bookmark the permalink.

4 Responses to A Tour of Go #69 Exercise: Web Crawler

  1. Michael says:

    Hey there Sonia. Thanks for posting your solution – it’s great being able to see different approaches. The salt-n-shaker approach of putting the map in the channel is quite interesting – I’d be keen to find out whether putting largish structs into a channel has drawbacks. If you’re interested, I’ve posted mine here – I serialized access to the map not by putting the map itself in a channel, but by updating/reading the map only via channel communication.

    • Sonia says:

      A map (on my machine anway) is 8 bytes. I presume 4 of those bytes are a pointer to something on the heap that contains the map data. So I wouldn’t consider a map a largish strut at all. I think 8 bytes can be moved around pretty quickly.

  2. andrelaszlo says:

    The “salt shaker” pattern fits nicely with the “don’t communicate by sharing memory – share memory by communicating” approach to concurrency described in this article:

    http://golang.org/doc/codewalk/sharemem/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s