Shadow library scraped 86 million Spotify tracks for preservation

Shadow library challenge Anna’s Archive introduced on December 20, 2025, that it efficiently scraped Spotify’s music catalog, releasing 256 million rows of monitor metadata and getting ready to distribute 86 million audio information representing roughly 99.6 % of all listens on the streaming platform. The operation totals roughly 300 terabytes of knowledge distributed by torrent information.

In line with the announcement printed on Anna’s Archive weblog, the challenge scraped the metadata and music information immediately from Spotify’s servers utilizing undisclosed technical strategies. The distribution represents what the group characterizes because the world’s first “preservation archive” for music that may be simply mirrored by anybody with adequate storage capability.

The metadata database comprises info for an estimated 99.9 % of tracks obtainable on Spotify. Anna’s Archive launched this part first by torrents on its devoted torrents web page in December 2025. Music information will comply with in subsequent releases ordered by reputation, together with extra file metadata and album art work.

Spotify responded to the scraping operation by disabling accounts recognized as liable for the information extraction. In line with an announcement supplied within the supply supplies, the corporate stated it has “stood with the artist group towards piracy” since its founding and is “actively working with our trade companions to guard creators and defend their rights.”

The technical implementation concerned a number of high quality tiers based mostly on Spotify’s inside reputation metric. For tracks with reputation scores above zero, Anna’s Archive preserved the unique OGG Vorbis format at 160 kilobits per second. Songs with reputation scores of zero have been reencoded to OGG Opus at 75 kilobits per second, which the group states sounds an identical to most listeners however could also be noticeable to specialists.

The dataset contains 256 million whole tracks obtainable on Spotify. Anna’s Archive preserved 86 million of those as audio information, representing roughly 37 % of whole songs however accounting for 99.6 % of precise listening exercise on the platform. The group prioritized tracks based mostly on Spotify’s proprietary reputation algorithm, which calculates scores between 0 and 100 based mostly totally on whole performs and recency of these performs.

In line with metadata evaluation printed by Anna’s Archive, tracks with reputation scores between 50 and 80 account for almost all of listens regardless of representing solely roughly 0.1 % of obtainable songs. The group recognized that 70 % of songs on Spotify have fewer than 1,000 streams. The highest three songs as of the information assortment interval—”Die With A Smile” by Girl Gaga and Bruno Mars, “BIRDS OF A FEATHER” by Billie Eilish, and “DtMF” by Unhealthy Bunny—have accrued billions of streams individually.

The metadata part constitutes lower than 200 gigabytes compressed and contains almost lossless representations of Spotify API JSON responses transformed to queryable SQLite databases. Secondary metadata masking audio evaluation totals 4 terabytes compressed. The group took care throughout conversion to make sure minimal knowledge loss from unique API responses by JSON reconstruction validation.

Anna’s Archive added embedded metadata to every audio file that was absent in Spotify’s unique format. This contains monitor titles, URLs, Worldwide Commonplace Recording Codes (ISRCs), Common Product Codes (UPCs), album art work, and replaygain info. The group stripped an invalid OGG knowledge packet that Spotify prepends to each monitor file, preserving this info individually for customers who want to reconstruct unique Spotify information.

The database comprises 186 million distinctive ISRCs, considerably exceeding the 5 million distinctive codes cataloged by MusicBrainz. This makes Anna’s Archive’s dataset the biggest publicly obtainable music metadata database in keeping with comparative evaluation printed within the announcement. Different metadata sources usually comprise between 50 million and 150 million tracks with much less complete annotation.

Purchase advertisements on PPC Land. PPC Land has normal and native advert codecs through main DSPs and advert platforms like Google Advertisements. By way of an public sale CPM, you’ll be able to attain trade professionals.

Learn more

For tracks with reputation higher than zero, the group archived information representing near all tracks obtainable on the platform. The cutoff date was July 2025, which means content material launched after that date might not seem within the assortment, although some later releases are current. The group stopped archiving the lengthy tail of zero-popularity tracks on account of diminishing preservation worth relative to storage necessities, which might have exceeded 700 extra terabytes for marginal profit.

The announcement states that the group found the scraping methodology “some time in the past” however gives no particular timeline for when systematic assortment started. The challenge crew determined to construct a music archive primarily geared toward preservation after figuring out what they characterize as vital gaps in present music preservation efforts.

Anna’s Archive recognized three main points with present music preservation approaches. First, extreme deal with the most well-liked artists leaves a protracted tail of music preserved solely when single people care sufficient to share particular works, usually with poor seeding infrastructure. Second, audiophile communities prioritize highest doable high quality codecs like lossless FLAC, inflating file sizes and making complete archival impractical. Third, no authoritative torrent checklist exists aggregating all music humanity has produced, corresponding to Anna’s Archive’s present guide torrent aggregations from Library Genesis, Sci-Hub, and Z-Library.

The group acknowledges that Spotify doesn’t comprise all music in existence however describes the platform as “an incredible begin” for constructing a preservation archive. The announcement emphasizes that this represents a humble try and create such infrastructure for music, just like present shadow library preservation efforts for textual content material.

The information will probably be launched in levels by Anna’s Archive’s torrents infrastructure. The primary stage, metadata distribution, accomplished in December 2025. Music information will comply with in reputation order, adopted by extra file metadata containing torrent paths and checksums, then album art work, and eventually .zstdpatch information enabling reconstruction of unique Spotify information earlier than metadata embedding.

The group presently positions this as a torrents-only preservation archive however solicited suggestions about whether or not customers would need particular person file downloading capabilities added to Anna’s Archive’s major interface. The announcement contains direct requests for consumer enter on this potential function enlargement.

Anna’s Archive operates as a non-profit shadow library challenge that aggregates and preserves content material from varied sources. The group launched in November 2022 shortly after regulation enforcement seized Z-Library domains and arrested its alleged operators. Anna’s Archive emerged from the Pirate Library Mirror challenge, which accomplished a full Z-Library copy in September 2022.

The positioning describes itself as having two targets: backing up all data and tradition of humanity, and making this accessible to anybody worldwide. All code and knowledge are launched as fully open supply. The group’s typical focus facilities on textual content material together with books, papers, comics, and magazines. In line with its FAQ web page, the challenge defined in a weblog submit titled “The crucial window of shadow libraries” that it prioritizes textual content due to its highest info density amongst media sorts.

The Spotify scraping operation represents a departure from this textual focus. In line with the December 20 announcement, the group’s mission of “preserving humanity’s data and tradition” doesn’t distinguish amongst media sorts, and generally alternatives come up exterior the textual area. The crew characterised the Spotify assortment as one such alternative.

The group’s infrastructure depends on distributed preservation by torrenting, which creates many copies around the globe and makes the gathering resilient to takedowns. Anna’s Archive presently preserves content material from shadow libraries together with Sci-Hub and Library Genesis that already supply bulk distribution, whereas additionally “liberating” libraries that don’t present bulk downloads comparable to Z-Library, or that aren’t shadow libraries in any respect together with Web Archive and DuXiu.

The group maintains official domains at annas-archive.org, annas-archive.se, and annas-archive.li. A number of nations have blocked entry to those domains. In line with Wikipedia documentation, Anna’s Archive has confronted blocking orders in Italy as of January 2024, the Netherlands as of March 2024, the UK as of December 2024, Belgium as of July 2025, and Germany as of October 2025.

America Commerce Consultant included Anna’s Archive domains in its annual Infamous Markets Record since 2023, describing the location as associated to Sci-Hub and Library Genesis. The Affiliation of American Publishers recognized the location as infringing in feedback submitted to the Workplace in July 2023, analyzing cryptocurrency wallets to find out the challenge had acquired over $29,000 in funds as of that date.

Anna’s Archive confronted a lawsuit filed by OCLC in January 2024 after the shadow library scraped WorldCat, the world’s largest bibliographic database. OCLC sought over $5 million in damages and an injunction to cease scraping or sharing its proprietary knowledge. The group described the WorldCat scrape in October 2023 as “a significant milestone in mapping out all of the books on this planet.” As of November 2025, OCLC dropped its demand for damages, focusing efforts on acquiring an injunction compelling third-party intermediaries to cease sharing the information.

Inside emails unsealed in February 2025 throughout copyright litigation towards Meta in California federal court docket revealed that the corporate downloaded over 81 terabytes of knowledge by Anna’s Archive torrents for synthetic intelligence mannequin coaching. Authors together with Richard Kadrey, Sarah Silverman, and Christopher Golden alleged that Meta CEO Mark Zuckerberg personally approved the usage of shadow libraries for AI coaching. In June 2025, Decide Vince Chhabria partially dominated in favor of Meta, discovering the coaching “extremely transformative” and subsequently honest use, whereas noting the plaintiffs did not develop sturdy arguments about market dilution.

Google eliminated 749 million Anna’s Archive URLs from search outcomes by November 2025, representing 5 % of all takedown requests despatched to the search engine since 2012. These requests originated from over 1,000 authors and publishers. The positioning ranks amongst Google Search’s ten most reported domains for DMCA takedowns as of June 2024.

Anna’s Archive generates income by paid membership tiers providing quicker obtain speeds. The group describes itself as nonprofit, claiming that membership charges and donations fund largely servers, storage, and bandwidth, with no cash going to crew members personally. The positioning awards memberships and financial bounties to some volunteer contributors.

The group affords high-speed entry to its full assortment through SFTP to teams coaching massive language fashions in change for big contributions of cash or knowledge. As of January 2025, the shadow library supplied such entry to roughly 30 corporations, based in China, together with each LLM corporations and knowledge brokers. DeepSeek’s VL mannequin was partly skilled on book knowledge obtained from the location.

Anna has justified opposition to copyright on moral grounds, stating that preserving and internet hosting information is “morally proper” and that shadow librarians imagine “info desires to be free.” In February 2025, the group urged copyright regulation reform as a matter of nationwide safety, proposing that Western nations create authorized carveouts for textual content and knowledge mining to stay forward in what it characterised as an AI arms race.

Subscribe PPC Land e-newsletter ✉️ for related tales like this one

The group cites programmer and data activist Aaron Swartz as inspiration for its metadata assortment efforts. The positioning recommends Swartz’s writings, Stephen Witt’s “How Music Acquired Free,” and Michele Boldrin and David Okay. Levine’s “In opposition to Mental Monopoly” as sources supporting its philosophical method to copyright and mental property.

The Spotify scraping operation intersects with broader tensions in digital promoting and content material markets over unauthorized knowledge assortment for business functions. Meta faced scrutiny after leaked documents revealed systematic scraping of approximately 6 million unique websites for AI model training, together with information organizations, academic platforms, and copyrighted content material repositories. Google filed lawsuits against companies scraping its search results while simultaneously conducting massive web scraping operations for AI training, exposing asymmetrical energy dynamics in content material distribution.

Music streaming platforms have confronted copyright challenges in promoting and advertising and marketing contexts. Sony Music Entertainment filed a copyright infringement lawsuit against Designer Shoe Warehouse in August 2025, alleging the retailer systematically used copyrighted musical recordings in social media advertising and marketing campaigns with out correct licensing on TikTok and Instagram platforms.

The authorized panorama surrounding AI coaching knowledge and copyright stays contested. A federal judge ruled in June 2025 that Meta’s use of 666 copyrighted books to train Llama large language models constitutes fair use, although the choice utilized solely to particular plaintiffs and left different authors free to pursue copyright claims. The U.S. Copyright Office released a major report in May 2025 analyzing how generative AI improvement implicates copyright regulation, establishing a nuanced framework for case-by-case analysis slightly than sweeping determinations.

Venture capital firm Andreessen Horowitz argued in October 2023 comments to the Copyright Office that utilizing copyrighted content material to coach AI fashions constitutes honest use as a result of coaching extracts statistical patterns slightly than storing copyrighted content material, warning that licensing frameworks would show administratively inconceivable given the billions of textual content items concerned. Professor Carys Craig’s research published in the Chicago-Kent Law Review cautioned in August 2024 towards utilizing copyright regulation as the first AI regulation instrument, arguing that requiring permissions may restrict competitors by creating cost-prohibitive limitations favoring highly effective market gamers.

Vermont Senator Peter Welch introduced the TRAIN Act in July 2025, establishing an administrative subpoena mechanism permitting copyright house owners to find out which protected works have been used to coach synthetic intelligence fashions, representing a middle-ground method between expansive copyright enforcement and unrestricted AI improvement.

The music streaming trade has navigated complicated regulatory environments globally. Spotify announced plans to exit Uruguay in December 2023 over copyright regulation adjustments requiring the corporate to pay twice for a similar songs—as soon as to file labels and publishers, and once more to a government-created fund. Spotify reported strong fourth quarter 2023 advertising performance in February 2024, with advert income rising 12 % year-over-year pushed by double-digit music promoting progress and wholesome podcast promoting enlargement.

The Anna’s Archive metadata evaluation revealed technical particulars about Spotify’s catalog composition and consumer conduct patterns. Songs grouped by period present peaks at complete minutes, notably 2:00, 3:00, and 4:00, although the group requested consumer enter to elucidate this phenomenon. The dataset reveals 221.4 million tracks marked as express content material versus 34.6 million clear variations.

Market availability varies considerably throughout Spotify’s geographic distribution. The evaluation reveals that filtering to solely in style songs reveals variations in market availability that seem flat when inspecting all tracks no matter reputation. Spotify gives style lists per artist slightly than per music, with the commonest particular genres distributed throughout a whole lot of distinct classifications.

Album launch knowledge demonstrates accelerating content material addition to Spotify over time, with the group noting that a lot current progress doubtless stems from mechanically generated content material. The prevalence of procedurally and AI-generated materials complicates efforts to establish beneficial content material, in keeping with the evaluation. Nearly all of tracks on Spotify exist as singles slightly than as elements of full albums.

Audio options scraped from Spotify’s API present technical traits throughout the catalog. Loudness correlates with vitality metrics. Beats per minute follows a traditional distribution with a imply round 120. The dataset contains roughly 40 million audio evaluation API responses, although many songs wouldn’t have audio evaluation obtainable, returning 404 errors.

The playlist database comprises 6.6 million playlists representing 1.7 billion playlist monitor entries. Most playlists with fewer than 1,000 followers have been excluded from the gathering. The group additionally scraped roughly 700,000 audiobook entries, 20 million audiobook chapter entries, 5 million podcast present entries, and 54 million podcast episode entries, although these collections are incomplete.

The technical structure makes use of Anna’s Archive Containers format, a standardization system created by the group for distributing information throughout a number of torrents. This differs from Superior Audio Coding format regardless of the AAC acronym. Information are listed into directories utilizing filename prefixes, with metadata saved individually from audio knowledge to allow environment friendly querying and reconstruction.

The group acknowledges a identified bug the place the REPLAYGAIN_ALBUM_PEAK vorbiscomment tag worth comprises copied knowledge from REPLAYGAIN_ALBUM_GAIN as an alternative of the right peak worth for a lot of information. Customers looking for to reconstruct unique Spotify information can make the most of the .zstdpatch archive that will probably be launched in later distribution levels.

The announcement contains instance queries demonstrating the true shuffle performance throughout all Spotify tracks, contrasting with consumer complaints about Spotify’s native shuffle algorithm. The group gives each unfiltered true shuffle playlists and variations filtered to solely considerably in style songs, showcasing the database’s queryability.

The metadata contains track-level info absent from Spotify’s public API for a lot of use circumstances. Every monitor exists in precisely one album, however tracks and albums can have a number of artists. The database shops artist genres as separate desk entries since Spotify gives style lists per artist. Accessible markets are saved in a separate desk with comma-separated ISO 3166-1 alpha-2 nation codes to save lots of area by deduplication.

Artist metadata contains follower counts and recognition scores calculated from monitor reputation. Album metadata comprises label info, copyright notices, launch dates with precision indicators, and Common Product Codes the place obtainable. Observe metadata contains preview URLs, disc numbers, monitor numbers, period in milliseconds, express content material flags, and obtainable markets.

The group stripped invalid OGG packets that Spotify prepends to trace information however preserved this knowledge individually. These packets comprise unknown info plus replaygain values saved as float32 little-endian values at particular byte offsets. The archive’s SQLite databases allow reconstruction of unique Spotify API JSON responses with minimal exceptions.

Telegram suspended Anna’s Archive’s channel in January 2025 for copyright infringement regardless of precautions to keep away from infringing posts. Z-Library’s Telegram channel confronted suspension the identical week. Neither group acquired alerts concerning the motion, which was purported to hyperlink to authorized motion by an Indian court docket.

The Spotify preservation effort represents the group’s largest departure from textual content material preservation. Anna’s Archive usually focuses on books, papers, comics, and magazines aggregated from varied shadow libraries and official library collections. The group’s major search interface as of December 2025 claims to index 61.7 million books and 95.7 million papers, describing itself as “the biggest actually open library in human historical past.”

The challenge crew explicitly requested group suggestions on whether or not so as to add particular person file downloading capabilities to Anna’s Archive’s major interface past the present torrents-only distribution mannequin. The announcement concludes by requesting assist preserving information by donations and torrent seeding, emphasizing that humanity’s musical heritage faces threats from pure disasters, wars, funds cuts, and different catastrophes.

Subscribe PPC Land e-newsletter ✉️ for related tales like this one

Timeline

Subscribe PPC Land e-newsletter ✉️ for related tales like this one

Abstract

Who: Anna’s Archive, a non-profit shadow library challenge launched in November 2022 by pseudonymous operator Anna, scraped Spotify’s music catalog. Spotify responded by disabling accounts liable for the information extraction.

What: The group distributed 256 million rows of monitor metadata and ready distribution of 86 million audio information totaling roughly 300 terabytes, representing 99.6 % of all listening exercise on Spotify. The metadata database covers an estimated 99.9 % of tracks obtainable on the platform.

When: Anna’s Archive introduced the scraping operation on December 20, 2025, by its weblog. The information assortment lined content material by July 2025. Metadata distribution started in December 2025, with music information deliberate for subsequent launch in reputation order.

The place: The scraping focused Spotify’s servers immediately utilizing undisclosed technical strategies. Distribution happens globally by torrent infrastructure, although a number of nations together with Italy, the Netherlands, the UK, Belgium, and Germany have blocked entry to Anna’s Archive domains.

Why: Anna’s Archive characterised the hassle as addressing three main gaps in music preservation: extreme deal with in style artists leaving a protracted tail poorly preserved, audiophile communities prioritizing impractically massive lossless codecs, and absence of an authoritative aggregated torrent checklist for all music. The group describes the operation as a “humble try” to construct a preservation archive for music corresponding to its present efforts preserving textual content material.

Source link

Shadow library scraped 86 million Spotify tracks for preservation

Timeline

Abstract

[email protected]

Leave a Reply Cancel reply

Changing Elements – HTML5 Game Template

AI Chatbot Use Hits 49%, But Skepticism Stays High

Mangoshop – E-Commerce App Flutter Template

Press ESC to close

Free weekly e-newsletter

Timeline

Abstract

Share Article:

Box Boy – Construct 3

Hiccups Guy – HTML5 – Construct 3

Leave a Reply Cancel reply