Build & Deploy a reusable Web Scraper with Elixir Livebook
With recent updates, you can deploy a Livebook notebook as an app. I wanted a reusable web scraper where I can enter the URL and the selectors for the details I want, and save the scraped data to a `txt` file. I also wanted to handle cases where I may need to scrape multiple pages, e.g. an Amazon search page, and to get information from individual links nested on the main page.
You can fork and use the HuggingFace repo.
Here’s how I built the scraper:
Install the dependencies:
The first cell in Livebook lets you select dependencies from a searchable menu and install them directly.
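Under the hood that generates a `Mix.install` setup cell; a rough equivalent looks like this (the version constraints here are assumptions, the Livebook menu pins current ones):

```elixir
# Setup cell: the libraries used in the rest of the notebook.
# Version constraints are assumptions; pick whatever is current.
Mix.install([
  {:kino, "~> 0.12"},
  {:httpoison, "~> 2.2"},
  {:floki, "~> 0.36"},
  {:jason, "~> 1.4"}
])
```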
Design a Form with Kino
We need inputs for:
The main URL and/or a list of URLs
Selectors for each detail I want to extract from a page, e.g. title, date
A toggle for `pagination` - if there are multiple pages with the same structure
Another toggle for whether I want to scrape links nested on the main URL
Finally, we need a button to submit the form inputs and process them.
Building a Form with Kino
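A minimal sketch of how such a form can be wired up with `Kino.Control.form`; the field names mirror the keys read by the `Scraper` module below, while the labels and the individual detail selector fields are assumptions:

```elixir
# A sketch of the form; field names mirror the keys the Scraper module reads,
# while labels and the detail selector fields are assumptions.
scraper_form =
  Kino.Control.form(
    [
      main_url: Kino.Input.text("Main URL"),
      url_list: Kino.Input.textarea("List of URLs (comma separated)"),
      has_pagination: Kino.Input.checkbox("Pagination?"),
      main_url_slug: Kino.Input.text("URL slug with a \#{} placeholder for the page number"),
      num_pages: Kino.Input.text("Number of pages"),
      start_from: Kino.Input.text("Start page number"),
      page_change_delta: Kino.Input.text("Page number increment"),
      page_change_direction:
        Kino.Input.select("Increment direction", [increase: "Increase", decrease: "Decrease"]),
      scrape_each_page: Kino.Input.checkbox("Scrape links nested on the main page?"),
      relative_url_prefix: Kino.Input.text("Prefix for relative links"),
      main_item_selector: Kino.Input.text("CSS selector for each item"),
      # Illustrative detail selectors; the real form can have one per detail
      title_selector: Kino.Input.text("CSS selector for the title"),
      date_selector: Kino.Input.text("CSS selector for the date")
    ],
    submit: "Process"
  )
```

Rendering `scraper_form` as the last expression of a cell displays the form in the notebook.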
Define a Scraper module
Handle multiple pages - there are two ways to do this: (1) pass a list of URLs, or (2) pass a slug with a `#{}` placeholder where the page number is substituted, along with the starting number, how much to change it by to get the next page, and the direction of the change. This handles URLs like ["https://example.com/?page=1", "https://example.com/?page=2"] as well as ["https://example.com/?page=2023", "https://example.com/?page=2022"].
```elixir
def scrape_all(data) do
  url_list = to_string(data.url_list) |> String.trim()

  urls =
    if String.length(url_list) > 0 do
      # Option 1: an explicit, comma-separated list of URLs
      String.split(url_list, ",")
      |> Enum.map(&String.trim(&1))
      |> Enum.filter(fn x -> x != "" end)
    else
      # Option 2: a slug with a #{} placeholder for the page number
      url = data.main_url_slug |> to_string()
      num_pages = get_clean_input(data.num_pages, :int)
      start_from = get_clean_input(data.start_from, :int)
      page_change_delta = get_clean_input(data.page_change_delta, :int)

      pages_range =
        case data.page_change_direction do
          :increase ->
            Enum.to_list(start_from..(start_from + (num_pages - 1) * page_change_delta))

          :decrease ->
            Enum.to_list(start_from..(start_from - (num_pages - 1) * page_change_delta))

          _ ->
            []
        end

      pages_range
      |> Enum.map(fn page -> url |> String.replace("\#{}", Integer.to_string(page)) end)
    end

  IO.inspect(urls)

  output =
    (for x <- urls, do: get_page(Map.merge(data, %{main_url: x})))
    |> List.flatten()

  IO.inspect(length(output))
  output
end
```
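For instance, a slug of `https://example.com/?page=#{}` with a start of 1, an increment of 1, and three pages expands like this (values are illustrative):

```elixir
# Illustrative expansion of a slug containing a #{} placeholder.
slug = "https://example.com/?page=\#{}"

Enum.map([1, 2, 3], fn page ->
  String.replace(slug, "\#{}", Integer.to_string(page))
end)
# => ["https://example.com/?page=1", "https://example.com/?page=2", "https://example.com/?page=3"]
```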
Handle the case where the information I need is either on the main page itself or in nested URLs, each of which needs to be opened and extracted.
```elixir
def get_page(data) do
  # should run scrape on a single page
  if data.scrape_each_page == true do
    page_links =
      scrape(data.main_url, data.main_item_selector)
      |> Enum.map(fn x ->
        partial_url = Floki.attribute(x, "href") |> Enum.at(0) |> to_string()
        full_url = to_string(data.relative_url_prefix) <> partial_url
        full_url
      end)

    # go to each page, get the attributes from each page
    output =
      Enum.map(page_links, fn url ->
        body = scrape_page(url)
        get_attributes(body, data)
      end)
      |> Enum.filter(fn x -> x != nil end)

    output
  else
    # get_attributes for each item and then add the static information to each item
    items = scrape(data.main_url, data.main_item_selector)

    Enum.map(items, &get_attributes(&1, data))
    |> Enum.filter(fn x -> x != nil end)
  end
end
```
Handle URLs whose certificates may be unsafe or follow older security protocols - we do this by passing SSL options to `HTTPoison.get`:
```elixir
HTTPoison.get(url, [],
  ssl: [
    verify_fun:
      {fn _, reason, state ->
         case reason do
           {:bad_cert, :cert_expired} -> {:valid, state}
           {:bad_cert, :unknown_ca} -> {:valid, state}
           {:extension, _} -> {:valid, state}
           :valid -> {:valid, state}
           :valid_peer -> {:valid, state}
           error -> {:fail, error}
         end
       end, []}
  ]
)
```
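The `scrape/2` and `scrape_page/1` helpers used in `get_page/1` are not shown above; here's a rough sketch, assuming they wrap `HTTPoison.get` with the SSL options from the previous snippet and parse the body with Floki (`get_attributes/2`, which reads the user's detail selectors, is also left out here):

```elixir
defmodule Scraper do
  # ... scrape_all/1, get_page/1 and the other functions shown above ...

  # Assumed helper: fetch a single page and parse it into a Floki document.
  defp scrape_page(url) do
    {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url, [], ssl: ssl_options())
    {:ok, document} = Floki.parse_document(body)
    document
  end

  # Assumed helper: fetch a page and return the elements matching a CSS selector.
  defp scrape(url, selector) do
    url
    |> scrape_page()
    |> Floki.find(selector)
  end

  # The relaxed certificate checks from the snippet above, extracted for reuse.
  defp ssl_options do
    [
      verify_fun:
        {fn _, reason, state ->
           case reason do
             {:bad_cert, :cert_expired} -> {:valid, state}
             {:bad_cert, :unknown_ca} -> {:valid, state}
             {:extension, _} -> {:valid, state}
             :valid -> {:valid, state}
             :valid_peer -> {:valid, state}
             error -> {:fail, error}
           end
         end, []}
    ]
  end
end
```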
Get data from selectors
```elixir
defp get_data_from_selector(item, selector) do
  Floki.find(item, selector)
  |> Floki.text()
  |> to_string()
  |> String.replace("Published:", "")
  |> String.replace("Recent Publication:", "")
  |> String.trim()
end
```
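To see what that pipeline returns, here's a quick example on a hypothetical HTML fragment:

```elixir
# Hypothetical HTML fragment, just to show the selector pipeline in isolation.
html = ~s(<div class="card"><h2 class="title">Published: Hello Floki</h2></div>)
{:ok, document} = Floki.parse_document(html)

Floki.find(document, ".title")
|> Floki.text()
|> String.replace("Published:", "")
|> String.trim()
# => "Hello Floki"
```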
The form inputs are not of type `string` by default; we use `get_clean_input` to handle this, as well as any extra whitespace included by the user.
```elixir
defp get_clean_input(item, type) do
  cleaned = to_string(item) |> String.trim()

  output =
    case type do
      :int ->
        {num, _} = Integer.parse(cleaned)
        num

      :str ->
        cleaned

      _ ->
        cleaned
    end

  output
end
```
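The same cleanup applied directly, with illustrative inputs:

```elixir
# Illustrative inputs: trim stray whitespace, then parse an integer when needed.
{num, _} = " 3 " |> String.trim() |> Integer.parse()
num
# => 3

" https://example.com " |> String.trim()
# => "https://example.com"
```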
Listen to Form Submissions and Process Them
You can listen to events from Kino widgets with `Kino.listen` and create a download button with `Kino.Download.new` (thanks, Jonatan Klosko). Create a new frame in a Livebook cell:
```elixir
frame = Kino.Frame.new()
```
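In Livebook the frame is displayed when it is the last expression of the cell; you can also render it explicitly:

```elixir
# Explicitly render the (still empty) frame; the download button will be
# placed into it once a form submission is processed.
Kino.render(frame)
```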
Now add an event listener to handle form submissions. We use an if/else to handle the cases with and without pagination:
```elixir
Kino.listen(scraper_form, fn event ->
  IO.inspect(event)

  # Create a function that generates the JSON output
  content_fun = fn ->
    output =
      if event.data.has_pagination == true do
        Scraper.scrape_all(event.data) |> Jason.encode!()
      else
        Scraper.get_page(event.data) |> Jason.encode!()
      end

    output
  end

  # Create the download button
  Kino.Frame.render(frame, Kino.Download.new(content_fun), to: event.origin)
end)
```
When the `Process` button is clicked, an `:ok` shows up in the output of the cell with the event listener and a `Download` button appears in the frame above. The form is processed after you click `Download`, and a file download popup will show up. You can save the file as `.txt` or `.json`.
Links: