Building an Email Archiving System: The Challenges and of Course the Solution - Part 1

Feb 4, 2019

Published by Jeff Goldstein

Category: Email


Over a year ago I wrote a blog post on how to retrieve copies of emails for archival and viewing, but I did not broach the actual storing of the email or related data. More recently I wrote a blog post on storing all of the event data for an email (i.e., when the email was sent, plus opens, clicks, bounces, unsubscribes, etc.) for the purpose of auditing, but chose not to create any supporting code.

With the increase of email usage in regulatory environments, I have decided it is time to start a new project that pulls all of this together, with code samples on how to store the email body and all of its associated data. Over the next year, I will continue to build on this project with the aim of creating a working storage and viewing application for archived emails and all log information produced by SparkPost. SparkPost does not have a system that archives the email body, but it does make building an archiving platform fairly easy.

In this blog series, I will describe the process I went through in order to store the email body on S3 (Amazon’s Simple Storage Service) and all relevant log data in MySQL for easy cross-referencing. Ultimately, this is the starting point for building an application that will allow for easy searching of archived emails, then displaying those emails along with the event (log) data. The code for this project can be found in the following GitHub repository: https://github.com/jeff-goldstein/PHPArchivePlatform

This first entry of the blog series describes the challenge and lays out an architecture for the solution. The rest of the posts will detail portions of the solution along with code samples.

The first step in my process was to figure out how to obtain a copy of the email sent to the original recipient. In order to obtain a copy of the email, you need to either:

1. Capture the email body before sending the email
2. Get the email server to store a copy
3. Have the email server create a copy for you to store

If the email server adds items like link tracking or open tracking, you can't use #1 because the captured body will not reflect the open/click tracking changes.

That means either the server has to store the email or it has to somehow offer you a copy of that email for storage. Since SparkPost does not have a storage mechanism for email bodies but does have a way to create a copy of the email, we will have SparkPost send us a duplicate of the email so we can store it in S3.

This is done by using SparkPost’s Archive feature, which lets the sender tell SparkPost to send a duplicate of the email to one or more email addresses, using the same tracking and open links as the original. The SparkPost documentation defines the Archive feature in the following manner:

Recipients in the archive list will receive an exact replica of the message that was sent to the RCPT TO address. In particular, any encoded links intended for the RCPT TO recipient will be identical in the archive messages.

The only differences from the RCPT TO email are that some of the headers will be different, since the target address for the archive email is different, but the body of the email will be an exact replica!

If you want a deeper explanation, see the SparkPost documentation on creating duplicate (or archive) copies of an email.

By the way, SparkPost lets you send emails to cc, bcc, and archive email addresses. For this solution, we are focused on the archive addresses.

* Note * Archived emails can ONLY be created when injecting emails into SparkPost via SMTP!
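To make the SMTP requirement a little more concrete, here is a minimal sketch of a message that asks for an archive copy. It only builds the message and does not send anything. The X-MSYS-API header is the one SparkPost reads during SMTP injection; the JSON layout of the `archive` list, the example addresses, and the helper name `build_archived_message` are assumptions for illustration and should be checked against the SparkPost SMTP API documentation.

```python
import json
from email.message import EmailMessage


def build_archived_message(sender, recipient, archive_address, subject, html_body):
    """Build an email whose X-MSYS-API header requests an archive copy.

    Assumption: the header carries a JSON object with an "archive" list of
    {"email": ...} entries. Verify the layout against the SparkPost docs.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # The archive address is a placeholder; it should point at an inbox
    # (or inbound domain) that you control.
    msg["X-MSYS-API"] = json.dumps({"archive": [{"email": archive_address}]})
    msg.set_content("This message requires an HTML-capable mail client.")
    msg.add_alternative(html_body, subtype="html")
    return msg


message = build_archived_message(
    "sender@example.com",
    "customer@example.com",
    "archive@archive.example.com",
    "Your statement",
    "<html><body><p>Your statement is ready.</p></body></html>",
)
print(message["X-MSYS-API"])
```

Passing the result to `smtplib.SMTP.send_message` over an authenticated connection would perform the injection; the host and credentials depend on your SparkPost account.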

Now that we know how to obtain a copy of the original email, we need to look at the log data that is produced and some of the subtle nuances within that data. SparkPost tracks everything that happens on its servers and offers that information up to you in the form of message-events. Those events are stored on SparkPost for 10 days and can be pulled from the server via a RESTful API called message-events, or you can have SparkPost push those events to any number of collecting applications that you wish. The push mechanism uses webhooks and works in real time.
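As a rough idea of what the receiving end of the push mechanism has to do, the sketch below flattens one webhook POST body into a list of plain event dictionaries. It assumes a batch is a JSON array whose items wrap each event in a `msys` object under a key naming the event class (for example `message_event` or `track_event`). That envelope layout, the function name `unwrap_webhook_batch`, and the sample values are assumptions and should be verified against the SparkPost webhook documentation.

```python
import json


def unwrap_webhook_batch(raw_body):
    """Flatten one webhook POST body into a list of plain event dicts."""
    events = []
    for item in json.loads(raw_body):
        # Each item is assumed to look like {"msys": {"<event_class>": {...}}}.
        for event in item.get("msys", {}).values():
            if isinstance(event, dict):
                events.append(event)
    return events


# Invented sample batch; real events carry many more fields.
sample_batch = json.dumps([
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "100"}}},
    {"msys": {"track_event": {"type": "open", "transmission_id": "100"}}},
    {"msys": {}},  # an empty envelope is simply skipped
])

for event in unwrap_webhook_batch(sample_batch):
    print(event["type"], event["transmission_id"])
```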

Currently, there are 14 different events that may happen to an email. Here is a list of the current events:

Bounce
Click
Delay
Delivery
Generation Failure
Generation Rejection
Initial Open
Injection
Link Unsubscribe
List Unsubscribe
Open
Out of Band
Policy Rejection
Spam Complaint

* Follow this link for an up-to-date reference guide that describes each event along with the data that is shared for each event.

Each event has numerous fields that match the event type. Some fields like the transmission_id are found in every event, but other fields may be more event-specific; for example, only open and click events have geotag information.

One message event field that is very important to this project is the transmission_id. All of the message event entries for the original email, the archived email, and any cc and bcc addresses will share the same transmission_id.

There is also a common entry called the message_id that will have the same id for each entry of the original email and the archived email. Any cc or bcc addresses will have their own id for the message_id entry.
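The relationship between the two ids can be sketched as a small grouping step: everything from one send shares a transmission_id, and within it the original and archive copies share a message_id while cc and bcc copies get their own. The event dictionaries, the sample ids and addresses, and the function name `group_events` are made up for illustration; only the transmission_id and message_id behavior comes from the description above.

```python
from collections import defaultdict


def group_events(events):
    """Group events by transmission_id, then by message_id within each."""
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event["transmission_id"]][event["message_id"]].append(event)
    return grouped


# Invented sample events: original and archive share M1, the bcc copy gets M2.
events = [
    {"type": "injection", "transmission_id": "T1", "message_id": "M1",
     "rcpt_to": "customer@example.com"},
    {"type": "injection", "transmission_id": "T1", "message_id": "M1",
     "rcpt_to": "archive@archive.example.com"},
    {"type": "injection", "transmission_id": "T1", "message_id": "M2",
     "rcpt_to": "bcc@example.com"},
]

for transmission_id, by_message in group_events(events).items():
    for message_id, related in by_message.items():
        recipients = [e["rcpt_to"] for e in related]
        print(transmission_id, message_id, recipients)
```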

So far this sounds great and frankly fairly easy, but now comes the challenging part. Remember, in order to get the archive email, we have SparkPost send a duplicate of the original email to another email address which corresponds to some inbox that you have access to. But in order to automate this solution and store the email body, I’m going to use another SparkPost feature called Inbound Email Relaying. It takes all emails sent to a specific domain and processes them: it rips each email apart and creates a JSON structure, which is then delivered to an application via a webhook. See Appendix A for a sample JSON.

If you look really carefully, you will notice that the JSON structure from the inbound relay is missing a very important field: the transmission_id. All of the outbound events carry the same transmission_id, which binds together the data from the original email, archive, cc, and bcc addresses, but SparkPost has no way to know that the email captured by the inbound process is connected to any of the outbound emails. The inbound process simply knows that an email was sent to a specific domain and that it should parse the email. That’s it. It will treat any email sent to that domain the same way, be it a reply from a customer or the archive email sent from SparkPost.

So the trick is: how do you glue the outbound data to the inbound process that just grabbed the archived version of the email? What I decided to do was hide a unique id in the body of the email. How you do that is up to you, but I simply created an input field with the hidden attribute turned on.

I also added that field to the metadata block of the X-MSYS-API header that is passed to SparkPost during injection. This hidden UID ends up being the glue for the whole process and is a main component of the project, which will be discussed in depth in the following blog posts.
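Here is a minimal sketch of that glue in both directions: generating a UID, hiding it in the HTML body, placing it in the metadata block of the X-MSYS-API header, and later pulling it back out of the HTML that the inbound relay hands over. The field name `ArchiveCode`, the metadata key `UID`, and all function names are hypothetical choices for illustration, not necessarily what the project code uses. The extraction also assumes the hidden field survives in the relayed HTML with the `name` attribute appearing before `value`.

```python
import html
import json
import re
import uuid


def make_uid():
    """Generate a unique id to tie outbound events to the inbound archive copy."""
    return uuid.uuid4().hex


def hidden_uid_field(uid, field_name="ArchiveCode"):
    """Return a hidden HTML input carrying the UID (field name is hypothetical)."""
    return '<input type="hidden" name="{}" value="{}">'.format(
        html.escape(field_name, quote=True), html.escape(uid, quote=True))


def metadata_header(uid):
    """Return an X-MSYS-API header value with the UID in its metadata block."""
    # The key name "UID" is an assumption made for this sketch.
    return json.dumps({"metadata": {"UID": uid}})


def extract_uid(relay_html, field_name="ArchiveCode"):
    """Pull the UID back out of the HTML body delivered by the inbound relay."""
    # Matches the field as written by hidden_uid_field: name first, then value.
    pattern = r'<input[^>]*name="{}"[^>]*value="([^"]*)"'.format(
        re.escape(field_name))
    match = re.search(pattern, relay_html)
    return html.unescape(match.group(1)) if match else None


uid = make_uid()
body = "<html><body><p>Your statement is ready.</p>{}</body></html>".format(
    hidden_uid_field(uid))
print(metadata_header(uid))
print(extract_uid(body) == uid)
```

Because the same value travels in the metadata (and therefore shows up with the outbound events) and in the body (and therefore shows up in the inbound copy), it can serve as the join key between the two data sets.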

Now that we have the UID that will glue this project together and understand why it is necessary, I can start to build out the vision of the overall project and the corresponding blog posts.

Capture and store the archive email, along with a database entry for searching/indexing
Capture all message event data
Create an application to view the email and all corresponding data

Here is a simple diagram of the project:

[Diagram: build an email archiving system]

The first drop of code will cover the archive process and storing the email onto S3, while the second code drop will cover storing all of the log data from message-events into MySQL. You can expect the first two code drops and blog entries sometime in early 2019. If you have any questions or suggestions, please feel free to pass them along.

Happy sending.

- Jeff