lingua v0.6.0 Release Notes

Release Date: 2020-01-05
  • Languages

    • ➕ added 11 new languages: Armenian, Bosnian, Azerbaijani, Esperanto, Georgian, Kazakh, Macedonian, Marathi, Mongolian, Serbian, Ukrainian

    🔋 Features

    🚀 There are some breaking changes in this release:

    • 🚚 The support for MapDB has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models: it used a lot of disk space and made language detection slow. With the long-term goal of creating a multiplatform library, only features that also support JavaScript will be implemented in the future.
    • 🚚 The dependency on the fastutil library has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models.
    • 🚚 The method LanguageDetector.detectLanguagesOf(text: Iterable<String>) has been removed because the order of the returned languages was undefined for input collections such as a HashSet. From now on, LanguageDetector.detectLanguageOf(text: String) is the only detection method.
    • The LanguageDetector can now be built with the following additional methods:
      • LanguageDetectorBuilder.fromIsoCodes639_1(vararg isoCodes: IsoCode639_1)
      • LanguageDetectorBuilder.fromIsoCodes639_3(vararg isoCodes: IsoCode639_3)
    • The following method has been removed: LanguageDetectorBuilder.fromIsoCodes(isoCode: String, vararg isoCodes: String)
    • 🚚 The Gson library has been replaced with kotlinx-serialization for loading the JSON language models. This significantly reduces the amount of code and makes reflection unnecessary, so the dependency on kotlin-reflect has been removed as well.
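
    Code that relied on the removed detectLanguagesOf() can call the single-text method once per input instead. The following is a minimal sketch, not part of lingua: detectEach is a hypothetical helper name, and the detection function is passed in as a parameter so that the snippet compiles without the library. Pairing each text with its own result keeps the association explicit even when the input collection has no defined order.

```kotlin
// Hypothetical helper, not part of lingua: applies a single-text detection
// function to every input and pairs each text with its own result, so the
// outcome does not depend on the iteration order of the input collection.
fun <T> detectEach(texts: Iterable<String>, detect: (String) -> T): List<Pair<String, T>> =
    texts.map { text -> text to detect(text) }
```

    Assuming a detector built with one of the builder methods listed above, a call could look like detectEach(texts) { detector.detectLanguageOf(it) }.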

    👌 Improvements

    • The overall detection algorithm has again been improved in several iterations, fixing a number of detection bugs.

Previous changes from v0.5.0

  • Languages

    • ➕ added 12 new languages: Bengali, Chinese (not differentiated between traditional and simplified, as of now), Gujarati, Hebrew, Hindi, Japanese, Korean, Punjabi, Tamil, Telugu, Thai, Urdu

    🔋 Features

    👍 The LanguageDetectorBuilder now supports the additional method withMinimumRelativeDistance(), which specifies the minimum distance between the logarithmized and summed-up probabilities of the possible languages. If two or more languages yield nearly the same probability for a given input text, the wrong language is likely to be returned. With a higher minimum relative distance, Language.UNKNOWN is returned in such cases instead of risking a false positive.
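
    The idea behind the minimum relative distance can be illustrated with a small, self-contained sketch. This is not lingua's actual implementation: pickLanguage, the string constant standing in for Language.UNKNOWN, and in particular the way the gap is normalized (divided by the magnitude of the runner-up score) are assumptions made for illustration only.

```kotlin
import kotlin.math.abs

// Stand-in for Language.UNKNOWN; plain strings keep the sketch self-contained.
const val UNKNOWN = "UNKNOWN"

// Hypothetical helper. `scores` maps a language name to its summed log
// probability (values <= 0.0, higher means more likely).
fun pickLanguage(scores: Map<String, Double>, minimumRelativeDistance: Double): String {
    val ranked = scores.entries.sortedByDescending { it.value }
    if (ranked.isEmpty()) return UNKNOWN
    if (ranked.size == 1) return ranked[0].key

    val best = ranked[0].value
    val runnerUp = ranked[1].value
    if (best == runnerUp) return UNKNOWN // exact tie: no reliable decision

    // Assumed normalization: gap between the two best scores, relative to
    // the magnitude of the runner-up score.
    val relativeDistance = (best - runnerUp) / abs(runnerUp)
    return if (relativeDistance < minimumRelativeDistance) UNKNOWN else ranked[0].key
}
```

    With scores of -10.0 and -10.5 for the two best candidates, the relative distance under this assumed formula is about 0.048, so a threshold of 0.1 yields UNKNOWN while a threshold of 0.0 returns the top candidate.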

    ✅ Test report generation can now use multiple CPU cores, running as many reports in parallel as there are CPU cores available. This has been implemented as an additional attribute for the respective Gradle task: ./gradlew writeAccuracyReports -PcpuCores=...

    The REPL now lets you freely specify the languages you want to try out by entering the desired ISO 639-1 codes. Previously, it was only possible to choose between certain language combinations.

    👌 Improvements

    • The overall detection algorithm has been improved, yielding slightly more accurate results for those languages that are based on the Latin alphabet.

    🐛 Bug Fixes

    🛠 Thanks to the great work of contributor Bernhard Geisberger, two bugs have been fixed.

    The fix in pull request #8 solves the problem of not being able to recreate the MapDB cache files automatically in case the data has been corrupted.

    The fix in pull request #9 makes the class LanguageDetector completely thread-safe. Previously, in some rare cases it was possible that two threads mutated one of the internal variables at the same time, yielding inaccurate language detection results.

    Thank you, Bernhard.