Building a Bulk Asynchronous Bird Recipient Validation Tool

May 26, 2022

Published by Zachary Samuels

Category: Email

One of the questions we occasionally receive is: how can I bulk validate email lists with Recipient Validation? There are two options: upload a file through the SparkPost UI for validation, or make individual calls per email to the API (as the API validates a single email per call).

The first option works great but has a limit of 20 MB (about 500,000 addresses). What if someone has an email list containing millions of addresses? That could mean splitting it into 1,000 CSV files.

Since uploading thousands of CSV files seems a little far-fetched, I took that use case and began to wonder how fast I could get the API to run. In this blog post, I will explain what I tried and how I eventually arrived at a program that could complete around 100,000 validations in 55 seconds (whereas in the UI I got around 100,000 validations in 1 minute 10 seconds). While about 654 million validations would still take about 100 hours at that rate, the script can run in the background, saving significant time.
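The 100-hour figure follows from the measured rate. A quick back-of-the-envelope check, using only the numbers quoted above (the function name is just for illustration):

```javascript
// Estimate total run time from a measured benchmark:
// benchCount validations took benchSeconds, so how long do totalCount take?
function estimateHours(totalCount, benchCount, benchSeconds) {
  const perSecond = benchCount / benchSeconds; // measured throughput
  return totalCount / perSecond / 3600; // seconds -> hours
}

// 100,000 validations in 55 seconds, scaled to 654 million addresses
const hours = estimateHours(654_000_000, 100_000, 55);
console.log(hours.toFixed(1)); // roughly 100 hours
```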

The final version of this program can be found here.

My first mistake: using Python

Python is one of my favorite programming languages. It excels in many areas and is incredibly straightforward. However, one area it does not excel in is concurrent processes. While Python can run asynchronous functions, it has what is known as the Python Global Interpreter Lock, or GIL.

"The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter.

This means that only one thread can be in a state of execution at any point in time. The impact of the GIL isn't visible to developers who execute single-threaded programs, but it can be a performance bottleneck in CPU-bound and multi-threaded code.

Since the GIL allows only one thread to execute at a time even in a multi-threaded architecture with more than one CPU core, the GIL has gained a reputation as an 'infamous' feature of Python." (https://realpython.com/python-gil/)

In the beginning, I wasn't aware of the GIL, so I started programming in Python. In the end, even though my program was asynchronous, it kept getting locked up, and no matter how many threads I added, I still only got about 12-15 iterations per second.

The main portion of the asynchronous function in Python can be seen below:

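    import aiohttp
    import requests
    from tqdm import tqdm

    # `url` is the base URL of the validation endpoint, defined elsewhere in the script.
    async def validateRecipients(f, fh, apiKey, snooze, count):
        h = {'Authorization': apiKey, 'Accept': 'application/json'}
        with tqdm(total=count) as pbar:  # progress bar
            async with aiohttp.ClientSession() as session:
                for address in f:  # each row of the CSV reader
                    for i in address:  # each email in the row
                        thisReq = requests.compat.urljoin(url, i)
                        # Each request is awaited before the next one starts
                        async with session.get(thisReq, headers=h, ssl=False) as resp:
                            content = await resp.json()
                            row = content['results']
                            row['email'] = i
                            fh.writerow(row)
                            pbar.update(1)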

So I scrapped Python and went back to the drawing board...

I settled on NodeJS because it handles non-blocking I/O operations extremely well. I am also fairly familiar with programming in NodeJS.

Using the asynchronous features of NodeJS ended up working well. For more details about asynchronous programming in NodeJS, see https://blog.risingstack.com/node-hero-async-programming-in-node-js/
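To illustrate what non-blocking I/O buys here, the sketch below starts several simulated requests at once and waits for all of them. `fakeValidate` is a made-up stand-in for an HTTP call (it simply resolves after a delay) and is not part of the actual tool.

```javascript
// Stand-in for a network call: resolves after `ms` milliseconds.
// Hypothetical helper, used only to illustrate the idea.
function fakeValidate(email, ms = 50) {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ email, valid: true }), ms)
  );
}

// Start every request before awaiting any of them, so the waits overlap.
async function validateAll(emails) {
  const pending = emails.map((email) => fakeValidate(email));
  return Promise.all(pending);
}

async function main() {
  const emails = ["a@example.com", "b@example.com", "c@example.com"];
  const started = Date.now();
  const results = await validateAll(emails);
  // Total time is close to one delay, not three, because the waits overlap.
  console.log(results.length, "results in", Date.now() - started, "ms");
}

main();
```

Because the event loop is free while each request is waiting, the total time stays close to the slowest single call rather than the sum of all calls.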

My second mistake: trying to read the file into memory

My initial idea was as follows:

First, ingest a CSV list of emails. Second, load the emails into an array and check that they are in the correct format. Third, asynchronously call the Recipient Validation API. Fourth, wait for the results and load them into a variable. Finally, output this variable to a CSV file.

This worked very well for smaller files. The issue arose when I tried to run 100,000 emails through it: the program stalled at around 12,000 validations. With the help of one of our front-end developers, I saw that the issue was loading all the results into a variable (and therefore quickly running out of memory). If you would like to see the first iteration of this program, I have linked it here: Version 1 (NOT RECOMMENDED).

I therefore revised the approach. First, ingest a CSV list of emails. Second, count the number of emails in the file for reporting purposes. Third, as each line is read asynchronously, call the Recipient Validation API and output the results to a CSV file.

So, for each line read in, I call the API and write out the results asynchronously, so that none of this data is kept in long-term memory. I also removed the email syntax check after speaking with the Recipient Validation team, who told me that Recipient Validation already has built-in checks for whether an email is valid.
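A minimal sketch of this read, validate, write flow, using only Node's built-in `fs` and `readline` modules rather than the CSV parser the final program uses. The `validate` argument is a placeholder for the API call, the function name is invented, and the two-column output is simplified for illustration.

```javascript
const fs = require("fs");
const readline = require("readline");

// Reads `inPath` line by line and writes one row per address to `outPath`.
// `validate` is a placeholder for the real API call; it must return a promise.
// Error handling and backpressure are left out to keep the sketch short.
async function streamValidate(inPath, outPath, validate) {
  const output = fs.createWriteStream(outPath);
  const lines = readline.createInterface({
    input: fs.createReadStream(inPath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  const pending = new Set(); // only the calls that are still in flight
  let completed = 0;
  for await (const line of lines) {
    const email = line.trim();
    if (!email) continue; // skip blank lines
    // Start the call and write the row as soon as it resolves;
    // the result is not stored anywhere after it has been written.
    const call = validate(email)
      .then((result) => {
        output.write(`${email},${result.valid}\n`);
        completed++;
      })
      .finally(() => pending.delete(call));
    pending.add(call);
  }
  await Promise.all(pending); // wait for the calls still running
  await new Promise((resolve) => output.end(resolve)); // flush the outfile
  return completed;
}
```

The point of the design is that memory use depends on how many calls are in flight, not on the size of the input file.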

Breaking down the final code

After reading in and validating the terminal arguments, I run the following code. First, I read in the CSV file of emails and count each line. This function serves two purposes: 1) it lets me report accurately on file progress [as we will see later], and 2) it lets me stop a timer when the number of emails in the file equals the number of completed validations. I added a timer so I can run benchmarks and make sure I am getting good results.

    let count = 0; // Line count
    require("fs")
      .createReadStream(myArgs[1])
      .on("data", function (chunk) {
        // Reads the infile and increases the count for each line
        for (let i = 0; i < chunk.length; ++i) if (chunk[i] == 10) count++;
      })
      .on("close", function () {
        // At the end of the infile, after all lines have been counted,
        // run the recipient validation function
        validateRecipients.validateRecipients(count, myArgs);
      });
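The comparison `chunk[i] == 10` works because 10 is the ASCII code for the newline character `\n`. A small sketch of the same counting logic on in-memory buffers (the helper name is illustrative) shows why a file without a trailing newline yields one fewer than its number of rows:

```javascript
// Count "\n" bytes (ASCII 10) in a Buffer, as the stream handler above does per chunk.
function countNewlines(chunk) {
  let count = 0;
  for (let i = 0; i < chunk.length; ++i) {
    if (chunk[i] === 10) count++;
  }
  return count;
}

// Three rows, no newline after the last one: only two "\n" bytes.
console.log(countNewlines(Buffer.from("a@example.com\nb@example.com\nc@example.com"))); // 2
// The same rows with a trailing newline: three "\n" bytes.
console.log(countNewlines(Buffer.from("a@example.com\nb@example.com\nc@example.com\n"))); // 3
```

This suggests that correcting the count by adding one assumes the input file does not end with a newline.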

Next, I call the validateRecipients function. Note that this function is asynchronous. After validating that the infile and outfile are CSV files, I write a header row and start a program timer using the JSDOM library.

    async function validateRecipients(email_count, myArgs) {
      if (
        // If both the infile and outfile are in .csv format
        extname(myArgs[1]).toLowerCase() == ".csv" &&
        extname(myArgs[3]).toLowerCase() == ".csv"
      ) {
        let completed = 0; // Counter for each API call
        email_count++; // Line counter returns #lines - 1, this is done to correct the number of lines

        // Start a timer
        const { window } = new JSDOM();
        const start = window.performance.now();

        const output = fs.createWriteStream(myArgs[3]); // Outfile
        output.write(
          "Email,Valid,Result,Reason,Is_Role,Is_Disposable,Is_Free,Delivery_Confidence\n"
        ); // Write the headers in the outfile
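The timer above takes `performance.now()` from a JSDOM window. As an alternative sketch (not what the original program uses), Node's built-in `perf_hooks` module exposes the same kind of high-resolution clock without an extra dependency. The `startTimer` helper is a name invented for this example.

```javascript
const { performance } = require("perf_hooks");

// Returns a function that reports the seconds elapsed since the timer was created.
function startTimer() {
  const start = performance.now(); // milliseconds, high resolution
  return () => (performance.now() - start) / 1000;
}

const elapsed = startTimer();
// ... run the validations here ...
console.log(`Finished in ${elapsed().toFixed(3)} seconds`);
```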

The following script is really the bulk of the program, so I will break it up and explain what is happening. For each line of the infile:

Asynchronously take that line and call the Recipient Validation API.

    fs.createReadStream(myArgs[1])
      .pipe(csv.parse({ headers: false }))
      .on("data", async (email) => {
        // For each row read in from the infile, call the SparkPost Recipient Validation API
        let url =
          SPARKPOST_HOST + "/api/v1/recipient-validation/single/" + email;
        await axios
          .get(url, {
            headers: {
              Authorization: SPARKPOST_API_KEY,
            },
          })

Then, on the response:

Add the email to the JSON (to be able to print the email in the CSV)
Check whether reason is null, and if so, fill in an empty value (this keeps the CSV format consistent, as the reason is only given in the response in some cases)
Set the options and keys for the json2csv module
Convert the JSON to CSV and output it (using json2csv)
Write progress to the terminal
Finally, if the number of emails in the file equals the number of completed validations, stop the timer and print the results

    .then(function (response) {
      // Adds the email as a key/value pair to the response JSON to be used for output
      response.data.results.email = String(email);
      // If reason is null, set it to blank so the CSV is uniform
      response.data.results.reason ? null : (response.data.results.reason = "");

      // Utilizes json-2-csv to convert the JSON to CSV format and output
      let options = {
        prependHeader: false, // Disables JSON values from being added as header rows for every line
        keys: [
          "results.email",
          "results.valid",
          "results.result",
          "results.reason",
          "results.is_role",
          "results.is_disposable",
          "results.is_free",
          "results.delivery_confidence",
        ], // Sets the order of keys
      };
      let json2csvCallback = function (err, csv) {
        if (err) throw err;
        output.write(`${csv}\n`);
      };
      converter.json2csv(response.data, json2csvCallback, options);

      completed++; // Increase the API counter
      // Output status of Completed / Total to the console without showing new lines
      process.stdout.write(`Done with ${completed} / ${email_count}\r`);

      // If all emails have completed validation
      if (completed == email_count) {
        const stop = window.performance.now(); // Stop the timer
        console.log(
          `All emails successfully validated in ${
            (stop - start) / 1000
          } seconds`
        );
      }
    })
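The reason handling and key ordering can be shown in isolation. The sketch below builds one CSV row by hand instead of using json-2-csv. The `toCsvRow` helper and the sample result object are invented for illustration and only mirror the fields named in the key list above.

```javascript
// Column order, matching the header row written to the outfile.
const KEYS = ["email", "valid", "result", "reason", "is_role",
  "is_disposable", "is_free", "delivery_confidence"];

// Turn one result object into a CSV line; missing or null fields become "".
function toCsvRow(email, results) {
  const row = { ...results, email: String(email) };
  return KEYS.map((key) => {
    const text = String(row[key] ?? "");
    // Quote fields that contain a comma, a quote, or a line break.
    return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  }).join(",");
}

// Invented sample data, not a real API response.
const sample = { valid: true, result: "valid", is_role: false,
  is_disposable: false, is_free: true, delivery_confidence: 90 };
console.log(toCsvRow("user@example.com", sample));
// user@example.com,true,valid,,false,false,true,90
```

Filling the absent `reason` with an empty string keeps every row at the same number of columns, which is the purpose of the null check in the original code.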

One final issue I ran into was that, while this worked great on Mac, I hit the following error on Windows after around 10,000 validations:

Error: connect ENOBUFS XX.XX.XXX.XXX:443 - Local (undefined:undefined) with email XXXXXXX@XXXXXXXXXX.XXX

After doing some further research, it appears to be an issue with the NodeJS HTTP client connection pool not reusing connections. I found this Stack Overflow article on the issue, and after further digging, found a good default configuration for the axios library that resolved it. I am still not certain why this issue only happens on Windows and not on Mac.
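The linked configuration is not reproduced here, but a common way to make Node reuse connections is a keep-alive agent. The sketch below is an assumption about what such a setup can look like, not the exact configuration referred to above. It builds an `https.Agent` with `keepAlive` enabled and places it in an options object under `httpsAgent`, the request option axios uses for a custom HTTPS agent. The socket limit and timeout are arbitrary illustrative values.

```javascript
const https = require("https");

// Reuse TCP/TLS connections instead of opening a new one per request.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // keep idle sockets open for later requests
  maxSockets: 100, // illustrative cap on concurrent sockets per host
});

// Options object that could be handed to axios, e.g. axios.create(axiosDefaults).
// axios itself is not imported here so the snippet runs without dependencies.
const axiosDefaults = {
  httpsAgent: keepAliveAgent,
  timeout: 60000, // illustrative request timeout in milliseconds
};

console.log(keepAliveAgent.maxSockets); // 100
```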

Next steps

For anyone looking for a simple, fast program that takes in a CSV, calls the Recipient Validation API, and outputs a CSV, this program is for you.

Some additions to this program would be the following:

Build a front end or easier UI for use
Better error and retry handling, because if the API throws an error for some reason, the program currently does not retry the call
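As a rough idea of what retry handling could look like, here is a generic retry helper with exponential backoff. It is a sketch under assumed defaults (three attempts, delays doubling from 100 ms) and is not part of the current program. `withRetry` and `sleep` are names invented for this example.

```javascript
// Pause for `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `fn` until it succeeds or `maxAttempts` is reached,
// doubling the wait between attempts (exponential backoff).
async function withRetry(fn, maxAttempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError; // every attempt failed
}

// Usage with a stand-in call that fails twice before succeeding.
let calls = 0;
withRetry(async () => {
  calls++;
  if (calls < 3) throw new Error("temporary failure");
  return "ok";
}, 3, 10).then((value) => console.log(value, "after", calls, "calls"));
// ok after 3 calls
```

In the validation program, the function passed to `withRetry` would wrap the API request for a single email.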

I would also be curious to see whether faster results could be achieved with another language, such as Golang or Erlang/Elixir.

Please feel free to send me any feedback or suggestions for expanding this project.

