GitHub - elixir-crawly/crawly: Crawly, a high-level web crawling & scraping framework for Elixir.

Crawly

Overview

Crawly is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.

Requirements

  1. Elixir ~> 1.14
  2. Works on GNU/Linux, Windows, macOS, and BSD.

Quickstart

  1. Create a new project: mix new quickstart --sup

  2. Add Crawly as a dependency:

    # mix.exs
    defp deps do
      [
        {:crawly, "~> 0.17.2"},
        {:floki, "~> 0.33.0"}
      ]
    end
    
  3. Fetch dependencies: $ mix deps.get

  4. Create a spider

     # lib/crawly_example/books_to_scrape.ex
     defmodule BooksToScrape do
       use Crawly.Spider
    
       @impl Crawly.Spider
       def base_url(), do: "https://books.toscrape.com/"
    
       @impl Crawly.Spider
       def init() do
         [start_urls: ["https://books.toscrape.com/"]]
       end
    
       @impl Crawly.Spider
       def parse_item(response) do
         # Parse response body to document
         {:ok, document} = Floki.parse_document(response.body)
    
         # Create items (for pages where items exist)
         items =
           document
           |> Floki.find(".product_pod")
           |> Enum.map(fn x ->
             %{
               title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
               price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
               url: response.request_url
             }
           end)
    
         next_requests =
           document
           |> Floki.find(".next a")
           |> Floki.attribute("href")
           |> Enum.map(fn url ->
             Crawly.Utils.build_absolute_url(url, response.request.url)
             |> Crawly.Utils.request_from_url()
           end)
    
         %Crawly.ParsedItem{items: items, requests: next_requests}
       end
     end
    

    New in 0.15.0:

    You can speed up spider creation with a generator command that produces a file with all required callbacks: mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape

  5. Configure Crawly

    By default, Crawly does not require any configuration, but you will likely want one to fine-tune your crawls (in config/config.exs):

    
     import Config
    
     config :crawly,
       closespider_timeout: 10,
       concurrent_requests_per_domain: 8,
       closespider_itemcount: 100,
    
       middlewares: [
         Crawly.Middlewares.DomainFilter,
         Crawly.Middlewares.UniqueRequest,
         {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
       ],
       pipelines: [
         {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
         {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
         Crawly.Pipelines.JSONEncoder,
         {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
       ]
    
    

    New in 0.15.0:

    You can generate an example config with the following command: mix crawly.gen.config

  6. Start the Crawl:

      iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
    
  7. Results can be seen with:

    $ cat /tmp/BooksToScrape_<timestamp>.jl
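
Before items reach the file shown in step 7, they pass through the pipelines configured in step 5. The sketch below is a conceptual model of the first two stages only (validation of required fields, then de-duplication by title), written in plain Elixir with no Crawly dependency. The module name `PipelineSketch`, the sample items, and the exact "empty" values checked are assumptions made for illustration; this is not Crawly's implementation.

```elixir
# Conceptual model only: plain Elixir, no Crawly dependency.
# PipelineSketch is a hypothetical name, not part of Crawly.
defmodule PipelineSketch do
  @required_fields [:url, :title, :price]

  # Mirrors the intent of `Validate, fields: [:url, :title, :price]`:
  # keep an item only if every required field is present and non-empty.
  def valid?(item) do
    Enum.all?(@required_fields, fn field ->
      Map.get(item, field) not in [nil, ""]
    end)
  end

  # Mirrors the intent of `DuplicatesFilter, item_id: :title`:
  # keep the first item seen for each title and drop later ones.
  def process(items) do
    items
    |> Enum.filter(&valid?/1)
    |> Enum.uniq_by(& &1.title)
  end
end

# Made-up sample items shaped like the maps built in parse_item/1.
items = [
  %{title: "Book A", price: "£10.00", url: "https://books.toscrape.com/"},
  %{title: "Book A", price: "£10.00", url: "https://books.toscrape.com/"},
  %{title: "Book B", price: "", url: "https://books.toscrape.com/"},
  %{title: "Book C", price: "£7.50", url: "https://books.toscrape.com/"}
]

kept = PipelineSketch.process(items)
IO.inspect(Enum.map(kept, & &1.title))
# prints ["Book A", "Book C"]
```

In this model the duplicate "Book A" and the price-less "Book B" are dropped, which is why the results file can contain fewer lines than the number of products the spider saw.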
    

Running Crawly without Elixir or Elixir projects

It's possible to run Crawly in standalone mode, where Crawly runs as a tiny Docker container and spiders are just YML files or Elixir modules mounted inside it.

Please read more about it here:

Need more help?

Please use discussions for all conversations related to the project.

Browser rendering

Crawly can be configured so that all fetched pages are rendered in a browser, which is very useful when you need to extract data from pages that have many asynchronous elements (for example, parts loaded by AJAX).
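
As a rough sketch of what such a configuration might look like, the fragment below points Crawly's fetcher at a Splash rendering service. The `Crawly.Fetchers.Splash` module and its `base_url` option reflect our reading of the Crawly documentation, and the localhost URL is an assumption about where Splash is running; verify both against the documentation for your Crawly version.

```elixir
# config/config.exs
import Config

config :crawly,
  # Assumption: a Splash instance is listening locally on port 8050.
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```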

You can read more here:

Simple management UI (New in 0.15.0)

Crawly provides a simple management UI, available by default at localhost:4001.

It allows you to:

  • Start spiders
  • Stop spiders
  • Preview scheduled requests
  • View/Download items extracted
  • View/Download logs

NOTE: You can disable the simple management UI (and the REST API) by setting start_http_api?: false in the Crawly configuration.
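
For example, a minimal config fragment that turns the UI off could look like the following. The option name comes from the note above; placing it in config/config.exs mirrors the quickstart and is otherwise an assumption about your project layout.

```elixir
# config/config.exs
import Config

config :crawly,
  # Disables the built-in management UI and REST API.
  start_http_api?: false
```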

You can choose to run the management UI as a plug in your application.

defmodule MyApp.Router do
  use Plug.Router

  ...
  # Plug.Router's forward/2 expects the target plug under the :to option
  forward "/admin", to: Crawly.API.Router
  ...
end

Experimental UI [Deprecated]

We no longer have the capacity to work on the experimental UI built with Phoenix and LiveView; it is kept here mainly for demo purposes.

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders. Check out the code on GitHub.

Roadmap

To be discussed

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring
  7. Building a Chrome-based fetcher for Crawly

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.

Production usages

Using Crawly in production? Please let us know about your case!

Copyright and License

Copyright (c) 2019 Oleg Tarasenko

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

How to release:

  1. Update version in mix.exs
  2. Update version in quickstart (README.md, this file)
  3. Commit and create a new tag: git commit && git tag 0.xx.0 && git push origin master --follow-tags
  4. Build docs: mix docs
  5. Publish hex release: mix hex.publish
