krawler alternatives and similar libraries
Based on the "Web" category.
- javalin: DISCONTINUED. A simple and modern Java and Kotlin web framework [Moved to: https://github.com/javalin/javalin]
- apollo-android: :rocket: A strongly-typed, caching GraphQL client for the JVM, Android, and Kotlin multiplatform.
- http4k: The Functional toolkit for Kotlin HTTP applications. http4k provides a simple and uniform way to serve, consume, and test HTTP services.
- skrape.it: A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
- hexagon: Hexagon is a microservices toolkit written in Kotlin. Its purpose is to ease the building of services (web applications or APIs) that run inside a cloud platform.
- firefly: Firefly is an asynchronous web framework for rapid development of high-performance web applications.
- tekniq: A framework designed around Kotlin providing a RESTful HTTP client, JDBC DSL, loading cache, configurations, validations, and more.
- Pellet: An opinionated, Kotlin-first web framework that helps you write fast, concise, and correct backend services 🚀.
- bootique-kotlin: DISCONTINUED. RETIRED. Provides extension functions and features for smooth development with Bootique and Kotlin.
- Zeko-RestApi: Asynchronous web framework for Kotlin. Create REST APIs in Kotlin easily, with automatic Swagger/OpenAPI doc generation.
- komock: KoMock, a simple HTTP/Consul/SpringConfig HTTP server framework written in Kotlin, covering Wiremock use cases.
- voyager-server-spring-boot-starter: Easily create REST endpoints with permissions (access control levels) and hooks included.
README
About
Krawler is a web crawling framework written in Kotlin. It is heavily inspired by crawler4j by Yasser Ganjisaffar. The project is still very new, and those looking for a mature, well tested crawler framework should likely still use crawler4j. For those who can tolerate a bit of turbulence, Krawler should serve as a replacement for crawler4j with minimal modifications to existing applications.
Some neat features and benefits of Krawler include:
- Kotlin project!
- Krawler differentiates between a "check" and a "visit". Checks are used to verify the status code of a resource by issuing an HTTP HEAD request rather than a GET request. Each policy (get or check) can have its own logic associated with it by implementing either `shouldCheck` or `shouldVisit`, and `check` or `visit` (see the sketch after this list).
- Krawler's politeness delay is per-host rather than global. This way servers aren't overwhelmed, but crawls visiting many hosts in parallel are not effectively serialized by the politeness delay.
- Krawler uses Jsoup for parsing HTML files while harvesting links, making it more tolerant of malformed or poorly written websites, and thus less likely to error out during a crawl. The original HTML of the page is still available to facilitate validation and checking though.
- Krawler collects full anchor tags including all attributes and anchor text.
- Krawler currently has no proxy support, but it is on the roadmap. :(
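As an illustration of the check/visit split, a crawler that only verifies link health could override the check-side hooks. The code below is a rough, untested sketch: `shouldCheck` is assumed to mirror `shouldVisit`'s signature, and the status-code parameter on `check` is an assumption, so consult the Krawler API for the exact signatures.

```kotlin
// Illustrative sketch of the check-side hooks. The exact signatures are
// assumptions: shouldCheck is assumed to mirror shouldVisit, and check is
// assumed to receive the resolved HTTP status code. Verify against the
// Krawler API before relying on this.
class LinkChecker(config: KrawlConfig = KrawlConfig()) : Krawler(config) {

    // Pages on our own (hypothetical) host are fully visited with a GET...
    override fun shouldVisit(url: KrawlUrl): Boolean = url.host == "example.com"

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        // Nothing to do here; this crawler only harvests and checks links.
    }

    // ...while external links are only checked with a HEAD request.
    override fun shouldCheck(url: KrawlUrl): Boolean = url.host != "example.com"

    override fun check(url: KrawlUrl, statusCode: Int) {
        if (statusCode >= 400) {
            println("Broken link: ${url.canonicalForm} returned $statusCode")
        }
    }
}
```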
Add Dependency
Krawler is published through jitpack.io at https://jitpack.io/#brianmadden/krawler/. Add jitpack.io as a repository and Krawler as a dependency to use Krawler in your project:
Using Gradle
```groovy
repositories {
    jcenter()
    maven { url "https://jitpack.io" }
}

dependencies {
    compile 'com.github.brianmadden:krawler:0.4.4'
}
```
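On current Gradle versions the `compile` configuration has been removed and `jcenter()` is read-only; a hedged Kotlin DSL equivalent (not tested against this project) might look like:

```kotlin
// build.gradle.kts: an untested sketch assuming a recent Gradle version.
// The coordinates match the Groovy snippet above; only the configuration
// names differ (implementation instead of compile, mavenCentral instead of jcenter).
repositories {
    mavenCentral()
    maven("https://jitpack.io")
}

dependencies {
    implementation("com.github.brianmadden:krawler:0.4.4")
}
```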
Using Maven
```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependency>
    <groupId>com.github.brianmadden</groupId>
    <artifactId>krawler</artifactId>
    <version>0.4.4</version>
</dependency>
```
Usage
Using the Krawler framework is fairly simple. Minimally, there are two methods that must be overridden in order to use the framework. Overriding the `shouldVisit` method dictates what should be visited by the crawler, and the `visit` method dictates what happens once the page is visited. Overriding these two methods is sufficient for creating your own crawler; however, there are additional methods that can be overridden to provide more robust behavior.
The full code for this simple example can also be found in the [example project](...):
```kotlin
// JDK imports used by the example; the Krawler types (Krawler, KrawlConfig,
// KrawlUrl, KrawlDocument) are imported from the Krawler library itself.
import java.time.LocalTime
import java.util.concurrent.ConcurrentSkipListSet
import java.util.concurrent.atomic.AtomicInteger

class SimpleExample(config: KrawlConfig = KrawlConfig()) : Krawler(config) {

    private val FILTERS: Regex = Regex(".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|" +
            "mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz|tar|ico))$", RegexOption.IGNORE_CASE)

    /**
     * Threadsafe whitelist of acceptable hosts to visit
     */
    val whitelist: MutableSet<String> = ConcurrentSkipListSet()

    override fun shouldVisit(url: KrawlUrl): Boolean {
        val withoutGetParams: String = url.canonicalForm.split("?").first()
        return (!FILTERS.matches(withoutGetParams) && url.host in whitelist)
    }

    private val counter: AtomicInteger = AtomicInteger(0)

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        println("${counter.incrementAndGet()}. Crawling ${url.canonicalForm}")
    }

    override fun onContentFetchError(url: KrawlUrl, reason: String) {
        println("${counter.incrementAndGet()}. Tried to crawl ${url.canonicalForm} but failed to read the content.")
    }

    private var startTimestamp: Long = 0
    private var endTimestamp: Long = 0

    override fun onCrawlStart() {
        startTimestamp = LocalTime.now().toNanoOfDay()
    }

    override fun onCrawlEnd() {
        endTimestamp = LocalTime.now().toNanoOfDay()
        println("Crawled $counter pages in ${(endTimestamp - startTimestamp) / 1000000000.0} seconds.")
    }
}
```
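To run the example, the crawler must be instantiated, given at least one whitelisted host, and started with one or more seed URLs. The driver below is a rough sketch only: the `start` method taking a list of seed URLs is an assumption, so check the example project linked above for the exact entry point.

```kotlin
// Hypothetical driver for SimpleExample. The start(...) call is an assumption
// about the Krawler API; the example project shows the real entry point.
fun main() {
    val crawler = SimpleExample()

    // shouldVisit above only allows hosts present in the whitelist.
    crawler.whitelist.add("en.wikipedia.org")

    // Seed the crawl; per-host politeness delays are handled by Krawler.
    crawler.start(listOf("https://en.wikipedia.org/wiki/Main_Page"))
}
```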
Roadmap
- Proxy support
- Headless Chrome support for crawling JavaScript-driven sites
Release Notes
0.4.4 (2020-1-29)
- Upgrade Kotlin to 1.3.61
- Upgrade `kotlinx.coroutines`. This required an update to some of the places where coroutine builders were called internally.
- Upgrade Gradle wrapper
0.4.3 (2017-11-20)
- Added ability to clear crawl queues by RequestId and Age; see `Krawler#removeUrlsByRootPage` and `Krawler#removeUrlsByAge`
- Added config option to prevent crawler shutdown on empty queues
- Added new single-byte priority field to `KrawlQueueEntry`. Queues will always attempt to pop the lowest priority entry available. Priority can be assigned by overriding the `Krawler#assignQueuePriorty` method.
- Update dependencies
0.4.2 (2017-10-25)
- Updated to Kotlin Runtime 1.1.51, kotlinx-coroutines 0.19.2
- Reworked KrawlUrl class internals to handle spaces in URLs better which should result in more stability when crawling.
0.4.1 (2017-8-15)
- Removed logging implementation from dependencies to prevent logging conflicts when used as a library.
- Updated Kotlin version to 1.1.4
- Updated `kotlinx.coroutines` to .17
0.4.0 (2017-5-17)
- Rewrote the core crawl loop to use Kotlin 1.1 coroutines. This has effectively turned the crawl process into a multi-stage pipeline, and the architecture change removed the need for some locking by eliminating resource contention between multiple threads.
- Updated the build file to build the simple example as a runnable jar.
- Minor bug fixes in the KrawlUrl class.
0.3.2 (2017-3-3)
- Fixed a number of bugs that could crash a worker thread, which resulted in an incorrect count of crawled pages and caused slowdowns due to a reduced number of worker threads.
- Added a new utility function to wrap `doCrawl` and log any uncaught exceptions during crawling.
0.3.1 (2017-2-2)
- Created a 1:1 mapping between threads and the number of queues used to serve URLs to visit. URLs have an affinity for a particular queue based on their domain, so all URLs from a domain end up in the same queue. This improves parallel crawl performance by reducing how often the politeness delay affects requests. For crawls bound to fewer domains than queues, the excess queues are not used.
- Many bug fixes, including a fix that eliminates accidental over-crawling.
0.2.2 (2017-1-21)
- Added an additional configuration option for redirect handling in KrawlConfig. Setting `useFastRedirectHandling = true` (when redirects are enabled) will cause Krawler to automatically follow redirects, keeping a history of the transitions and status codes. This history is present in the `KrawlDocument#redirectHistory` property (see the sketch below).
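As a rough illustration of that option, the snippet below is untested and makes two assumptions: that `useFastRedirectHandling` can be passed to the `KrawlConfig` constructor (the release note does not say whether it is a constructor argument or a property), and that `redirectHistory` is iterable.

```kotlin
// Untested sketch; see the assumptions noted above.
val config = KrawlConfig(useFastRedirectHandling = true)

class RedirectAwareCrawler(config: KrawlConfig) : Krawler(config) {
    override fun shouldVisit(url: KrawlUrl): Boolean = true

    override fun visit(url: KrawlUrl, doc: KrawlDocument) {
        // Inspect the chain of redirects (and status codes) that led to this page.
        doc.redirectHistory.forEach { println(it) }
    }
}
```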
0.2.1 (2017-1-20)
- Redirect handling has been changed. Redirects can be followed or not via a configuration option in `KrawlConfig`. When redirects are enabled, the redirected-to URL will be added to the queue as part of the link harvesting phase of Krawler.
- If an anchor tag specifies `rel='canonical'`, the `canonicalForm` will not be subject to further processing.
- `KrawlUrl.new`'s implementation has been changed to prevent `null` from being returned in certain circumstances.
0.2.0 (2017-1-18)
- Krawler now respects robots.txt. This feature can be configured by passing a custom `RobotsConfig` to your `Krawler` instance. By default, Krawler will respect robots.txt without any additional configuration.
- Krawler now collects outgoing links from `src` attributes of tags in addition to the `href` of anchor tags.
- Minor bug fixes and refactorings.