Build & Deploy a reusable Web Scraper with Elixir Livebook
With recent updates, you can deploy a Livebook notebook as an app. I wanted a reusable web scraper where I can enter the URL and the selectors for the details I want, and save the scraped data to a `txt` file. I also wanted to handle cases where I may need to scrape multiple pages, e.g. an Amazon search page, and to get information from individual links nested on the main page.
You can fork and use the HuggingFace repo.
Here’s how I built the scraper:
Install the dependencies:
The first cell in Livebook lets you select dependencies from a searchable menu and install them directly.
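Under the hood that generates a `Mix.install` setup cell; a rough equivalent looks like this (the version constraints here are assumptions, the Livebook menu pins current ones):

```elixir
# Setup cell: the libraries used in the rest of the notebook.
# Version constraints are assumptions; pick whatever is current.
Mix.install([
  {:kino, "~> 0.12"},
  {:httpoison, "~> 2.2"},
  {:floki, "~> 0.36"},
  {:jason, "~> 1.4"}
])
```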
Design a Form with Kino
We need inputs for:
The main URL and/or a list of URLs
Selectors for each detail I want to extract from a page, e.g. title, date
A toggle for `pagination` - if there are multiple pages with the same structure
Another toggle for whether I want to scrape links nested on the main URL
Finally, we need a button to submit the form inputs and process them.
Building a Form with Kino
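A minimal sketch of how such a form can be wired up with `Kino.Control.form`; the field names mirror the keys read by the `Scraper` module below, while the labels and the individual detail selector fields are assumptions:

```elixir
# A sketch of the form; field names mirror the keys the Scraper module reads,
# while labels and the detail selector fields are assumptions.
scraper_form =
  Kino.Control.form(
    [
      main_url: Kino.Input.text("Main URL"),
      url_list: Kino.Input.textarea("List of URLs (comma separated)"),
      has_pagination: Kino.Input.checkbox("Pagination?"),
      main_url_slug: Kino.Input.text("URL slug with a \#{} placeholder for the page number"),
      num_pages: Kino.Input.text("Number of pages"),
      start_from: Kino.Input.text("Start page number"),
      page_change_delta: Kino.Input.text("Page number increment"),
      page_change_direction:
        Kino.Input.select("Increment direction", [increase: "Increase", decrease: "Decrease"]),
      scrape_each_page: Kino.Input.checkbox("Scrape links nested on the main page?"),
      relative_url_prefix: Kino.Input.text("Prefix for relative links"),
      main_item_selector: Kino.Input.text("CSS selector for each item"),
      # Illustrative detail selectors; the real form can have one per detail
      title_selector: Kino.Input.text("CSS selector for the title"),
      date_selector: Kino.Input.text("CSS selector for the date")
    ],
    submit: "Process"
  )
```

Rendering `scraper_form` as the last expression of a cell displays the form in the notebook.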
Define a Scraper module
Handle multiple pages - there are two ways to do this: (1) pass a list of URLs, or (2) pass a slug with a `#{}` placeholder where the page number is substituted, along with the starting number, how much to change it by to get the next page, and the direction of the change. This handles URLs like ["https://example.com/?page=1", "https://example.com/?page=2"] as well as ["https://example.com/?page=2023", "https://example.com/?page=2022"].
```elixir
def scrape_all(data) do
  url_list = to_string(data.url_list) |> String.trim()

  urls =
    if String.length(url_list) > 0 do
      # Option 1: an explicit, comma-separated list of URLs
      String.split(url_list, ",")
      |> Enum.map(&String.trim(&1))
      |> Enum.filter(fn x -> x != "" end)
    else
      # Option 2: a slug with a #{} placeholder for the page number
      url = data.main_url_slug |> to_string()
      num_pages = get_clean_input(data.num_pages, :int)
      start_from = get_clean_input(data.start_from, :int)
      page_change_delta = get_clean_input(data.page_change_delta, :int)

      pages_range =
        case data.page_change_direction do
          :increase ->
            Enum.to_list(start_from..(start_from + (num_pages - 1) * page_change_delta))

          :decrease ->
            Enum.to_list(start_from..(start_from - (num_pages - 1) * page_change_delta))

          _ ->
            []
        end

      pages_range
      |> Enum.map(fn page -> url |> String.replace("\#{}", Integer.to_string(page)) end)
    end

  IO.inspect(urls)

  output =
    (for x <- urls, do: get_page(Map.merge(data, %{main_url: x})))
    |> List.flatten()

  IO.inspect(length(output))
  output
end
```
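For instance, a slug of `https://example.com/?page=#{}` with a start of 1, an increment of 1, and three pages expands like this (values are illustrative):

```elixir
# Illustrative expansion of a slug containing a #{} placeholder.
slug = "https://example.com/?page=\#{}"

Enum.map([1, 2, 3], fn page ->
  String.replace(slug, "\#{}", Integer.to_string(page))
end)
# => ["https://example.com/?page=1", "https://example.com/?page=2", "https://example.com/?page=3"]
```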
Handle the case where the information I need is either on the main page itself or in nested URLs, each of which needs to be opened and extracted.
```elixir
def get_page(data) do
  # should run scrape on a single page
  if data.scrape_each_page == true do
    page_links =
      scrape(data.main_url, data.main_item_selector)
      |> Enum.map(fn x ->
        partial_url = Floki.attribute(x, "href") |> Enum.at(0) |> to_string()
        full_url = to_string(data.relative_url_prefix) <> partial_url
        full_url
      end)

    # go to each page, get the attributes from each page
    output =
      Enum.map(page_links, fn url ->
        body = scrape_page(url)
        get_attributes(body, data)
      end)
      |> Enum.filter(fn x -> x != nil end)

    output
  else
    # get_attributes for each item and then add the static information to each item
    items = scrape(data.main_url, data.main_item_selector)

    Enum.map(items, &get_attributes(&1, data))
    |> Enum.filter(fn x -> x != nil end)
  end
end
```
Handle URLs whose certificates may be unsafe or follow older security protocols - we do this by passing SSL options to `HTTPoison.get`:
```elixir
HTTPoison.get(url, [],
  ssl: [
    verify_fun:
      {fn _, reason, state ->
         case reason do
           {:bad_cert, :cert_expired} -> {:valid, state}
           {:bad_cert, :unknown_ca} -> {:valid, state}
           {:extension, _} -> {:valid, state}
           :valid -> {:valid, state}
           :valid_peer -> {:valid, state}
           error -> {:fail, error}
         end
       end, []}
  ]
)
```
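The `scrape/2` and `scrape_page/1` helpers used in `get_page/1` are not shown above; here's a rough sketch, assuming they wrap `HTTPoison.get` with the SSL options from the previous snippet and parse the body with Floki (`get_attributes/2`, which reads the user's detail selectors, is also left out here):

```elixir
defmodule Scraper do
  # ... scrape_all/1, get_page/1 and the other functions shown above ...

  # Assumed helper: fetch a single page and parse it into a Floki document.
  defp scrape_page(url) do
    {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url, [], ssl: ssl_options())
    {:ok, document} = Floki.parse_document(body)
    document
  end

  # Assumed helper: fetch a page and return the elements matching a CSS selector.
  defp scrape(url, selector) do
    url
    |> scrape_page()
    |> Floki.find(selector)
  end

  # The relaxed certificate checks from the snippet above, extracted for reuse.
  defp ssl_options do
    [
      verify_fun:
        {fn _, reason, state ->
           case reason do
             {:bad_cert, :cert_expired} -> {:valid, state}
             {:bad_cert, :unknown_ca} -> {:valid, state}
             {:extension, _} -> {:valid, state}
             :valid -> {:valid, state}
             :valid_peer -> {:valid, state}
             error -> {:fail, error}
           end
         end, []}
    ]
  end
end
```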
Get data from selectors
```elixir
defp get_data_from_selector(item, selector) do
  Floki.find(item, selector)
  |> Floki.text()
  |> to_string()
  |> String.replace("Published:", "")
  |> String.replace("Recent Publication:", "")
  |> String.trim()
end
```
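To see what that pipeline returns, here's a quick example on a hypothetical HTML fragment:

```elixir
# Hypothetical HTML fragment, just to show the selector pipeline in isolation.
html = ~s(<div class="card"><h2 class="title">Published: Hello Floki</h2></div>)
{:ok, document} = Floki.parse_document(html)

Floki.find(document, ".title")
|> Floki.text()
|> String.replace("Published:", "")
|> String.trim()
# => "Hello Floki"
```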
The form inputs are not of type `string` by default; we use `get_clean_input` to handle this, as well as any extra whitespace included by the user.
```elixir
defp get_clean_input(item, type) do
  cleaned = to_string(item) |> String.trim()

  output =
    case type do
      :int ->
        {num, _} = Integer.parse(cleaned)
        num

      :str ->
        cleaned

      _ ->
        cleaned
    end

  output
end
```
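The same cleanup applied directly, with illustrative inputs:

```elixir
# Illustrative inputs: trim stray whitespace, then parse an integer when needed.
{num, _} = " 3 " |> String.trim() |> Integer.parse()
num
# => 3

" https://example.com " |> String.trim()
# => "https://example.com"
```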
Listen to Form Submissions and Process Them
You can listen to events from Kino widgets with `Kino.listen` and create a download button with `Kino.Download.new` (thanks, Jonatan Klosko). Create a new frame in a Livebook cell:
```elixir
frame = Kino.Frame.new()
```
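In Livebook the frame is displayed when it is the last expression of the cell; you can also render it explicitly:

```elixir
# Explicitly render the (still empty) frame; the download button will be
# placed into it once a form submission is processed.
Kino.render(frame)
```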
Now add an event listener to handle form submissions. We use an if/else to handle the cases with and without pagination:
```elixir
Kino.listen(scraper_form, fn event ->
  IO.inspect(event)

  # Create a function that generates the JSON output
  content_fun = fn ->
    output =
      if event.data.has_pagination == true do
        Scraper.scrape_all(event.data) |> Jason.encode!()
      else
        Scraper.get_page(event.data) |> Jason.encode!()
      end

    output
  end

  # Create the download button
  Kino.Frame.render(frame, Kino.Download.new(content_fun), to: event.origin)
end)
```
When the `Process` button is clicked, an `:ok` shows up in the output of the cell with the event listener and a `Download` button appears in the frame above. The form is processed after you click `Download`, and a file download popup will show up. You can save the file as `.txt` or `.json`.
Links: