NSK and SRCE jointly conduct tenth annual Croatian web domain crawl
From 22 December 2020 to 7 January 2021, the National and University Library in Zagreb (NSK) and the University of Zagreb Computing Centre (SRCE), which this year celebrates the 50th anniversary of its establishment, jointly conducted the tenth Croatian web domain crawl.
The crawl covered all publicly available web content on the Croatian top-level .hr domain, as well as on .from.hr and .com.hr. It included 180,379,532 seeds and harvested 19 TB of web content, which was captured in the WARC file format and subsequently compressed to 11 TB.
The crawl was based on the list of active domains provided by the Croatian Academic and Research Network (CARNET) Domain Name System Service and was carried out with the Heritrix open-source web crawler, which identified itself as Mozilla/5.0 (compatible; heritrix/3.4.x; + https://haw.nsk.hr/cesta-pitanja/).
All content harvested in this latest national domain crawl is available on the website of the Library’s Croatian Web Archive (Hrvatski arhiv weba – HAW), which also provides access to content archived in all previous domain crawls, several thematic collections, and content captured through the Archive’s selective harvesting.