NSK and SRCE jointly conduct ninth annual Croatian web domain crawl
From 24 December 2019 to 3 January 2020, the National and University Library in Zagreb (NSK) and the University of Zagreb Computing Centre (SRCE) jointly conducted the ninth Croatian web domain crawl.
The crawl covered all publicly available web content on the Croatian top-level domain, .hr, as well as on .from.hr and .com.hr. It included 164,433,348 seeds and harvested 16 TB of web content, which was captured in the WARC file format and subsequently compressed to 9.3 TB.
The Library used the list of active domains provided by the Croatian Academic and Research Network (CARNET) Domain Name System Service, and the Heritrix open-source web crawler, which identified itself as Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-05-22T20:43:22Z +http://haw.nsk.hr/faq).
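For site operators who want to spot this crawler in their server logs, the following is a minimal sketch of how the user-agent string quoted above could be recognised. The regular expression and the function name `parse_heritrix_ua` are illustrative assumptions, not part of Heritrix or of the Library's tooling; only the user-agent string itself comes from the announcement.

```python
import re

# User-agent string exactly as quoted in the announcement.
CRAWL_UA = (
    "Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-05-22T20:43:22Z "
    "+http://haw.nsk.hr/faq)"
)

# Hypothetical pattern (an assumption for illustration): a "heritrix/<version>"
# token followed by a "+<operator URL>" token, as in the string above.
HERITRIX_RE = re.compile(
    r"heritrix/(?P<version>\S+)\s+\+(?P<url>[^)\s]+)", re.IGNORECASE
)


def parse_heritrix_ua(user_agent):
    """Return (version, operator_url) if the string looks like a Heritrix
    user agent of the form shown above, otherwise None."""
    match = HERITRIX_RE.search(user_agent)
    if match is None:
        return None
    return match.group("version"), match.group("url")


if __name__ == "__main__":
    print(parse_heritrix_ua(CRAWL_UA))
```

Called on an ordinary browser user agent, the function returns `None`; called on the string above, it returns the version token and the FAQ URL that the crawler advertises.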
All content harvested in this latest national domain crawl will soon be made available on the website of the Library’s Croatian Web Archive (Hrvatski arhiv Weba – HAW), which also provides access to content archived in previous domain crawls, several themed collections, and content captured through the Archive’s selective harvesting.