
Discovery


"If you don't understand the Way right before you, how will you know the path as you walk?" -Sekito Kisen


Introduction


I need a gedcom parser, but which one? I looked at parsers written in python, typescript, and rust. The python and typescript implementations worked. The rust implementation broke with the first sample gedcom I tried.

I don't want to just do some statistics and make a pretty front end. The rust crate is an opportunity to better understand what a parser actually is and potentially make a contribution to someone else's project. Perhaps I'll need to fall back to the python/typescript versions, but I'll probably know more about the world when I do.

Here is a link to the rust gedcom parser:

> rust-gedcom

Summary


The parser is made up of three pieces:


The tokenizer is responsible for taking a character stream, converting it into tokens and maintaining any meta information relevant to how the gedcom data is structured. For example, in the record:

0 @I2644@ INDI

1 NAME Henry C. /Marone/

1 SEX M

1 BIRT

2 DATE 1895

1 DEAT

2 DATE 1968

1 FAMS @F1250@


The number at the start of each line is its level, which indicates which earlier line it belongs to. 0 marks the start of a record: an individual, family, etc. The next three lines provide data about that individual, and are thus labelled 1. The fifth line is labelled 2 because it describes the line before it (it provides the date of the individual's birth). The tokenizer must track each line's level, and provide functionality for getting the next pointer (@I2644@), tag (INDI, NAME, SEX, etc.), and value (Henry C. /Marone/, M, 1895, etc.). The last line (1 FAMS @F1250@) links Henry to his family using a pointer to that family's record (@F1250@).
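The level bookkeeping can be sketched in a few lines of Rust. This is only an illustration of the idea, not the crate's tokenizer: parent_lines is a hypothetical helper, and it assumes levels never jump by more than one, as in well-formed gedcom.

```rust
/// For each gedcom line, find the index of the line it belongs to.
/// Level 0 lines start a record and have no parent (None).
fn parent_lines(lines: &[&str]) -> Vec<Option<usize>> {
    // stack[k] holds the index of the most recent line seen at level k.
    let mut stack: Vec<usize> = Vec::new();
    let mut parents = Vec::with_capacity(lines.len());

    for (idx, line) in lines.iter().enumerate() {
        // The level is the first whitespace-separated field on the line.
        let level: usize = line
            .split_whitespace()
            .next()
            .and_then(|field| field.parse().ok())
            .unwrap_or(0); // assumption: treat malformed lines as level 0

        // Forget everything at this level or deeper, then look one level up.
        stack.truncate(level);
        parents.push(stack.last().copied());
        stack.push(idx);
    }
    parents
}
```

Run over the eight lines of Henry's record, this reports that each DATE line belongs to the BIRT or DEAT line just above it, and that every level 1 line belongs to the INDI line.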

The parser creates a tokenizer from the given gedcom file; loops through the tokens identified by the tokenizer; and parses them into the desired output data structure.

The data structure is an important part of the parser, because it determines a lot about how the data can be used going forward. For example, the current GedcomData struct keeps a lot of important information in Vecs. This will be problematic because if I want to, say, find an individual's parents, I'll need to loop through the Vec<Family> until I find the correct family, then extract the spouses of that family and loop through the Vec<Individual> to find them. One goal I have is to rewrite this structure to maintain HashMaps of the data, so that I can look up families and individuals in O(1) average time. I hope I get a chance to connect with the crate's author so I can ask about their design choices!
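Here is a rough sketch of what the HashMap version of the parent lookup could look like. Everything in it is hypothetical: Tree, the trimmed-down Individual and Family structs, and the field names are mine, not the crate's.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down records; the crate's real types carry more fields.
struct Individual {
    name: String,
    family_as_child: Option<String>, // xref of the family this person is a child of
}

struct Family {
    spouses: Vec<String>, // xrefs of the spouses
}

struct Tree {
    individuals: HashMap<String, Individual>, // keyed by xref, e.g. "@I2644@"
    families: HashMap<String, Family>,        // keyed by xref, e.g. "@F1250@"
}

impl Tree {
    /// Names of an individual's parents, found with hash lookups instead of Vec scans.
    fn parent_names(&self, xref: &str) -> Vec<&str> {
        self.individuals
            .get(xref)
            .and_then(|person| person.family_as_child.as_deref())
            .and_then(|family_xref| self.families.get(family_xref))
            .map(|family| {
                family
                    .spouses
                    .iter()
                    .filter_map(|spouse| self.individuals.get(spouse.as_str()))
                    .map(|parent| parent.name.as_str())
                    .collect::<Vec<&str>>()
            })
            .unwrap_or_default() // unknown xref or no recorded family: no parents
    }
}
```

Each step is a single hash lookup, so the cost depends on the number of spouses in one family rather than on the size of the whole tree.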

The following is informal technical documentation for the rust-gedcom crate. It is high-level. In general, I don't specify input/output types, lifetimes, etc.

File Structure


src/



src/types/



tests/



tests/fixtures/



src/bin.rs


This file defines the cli binary interface for the program:


src/lib.rs


This file declares modules:


types:


and defines a helper function:


src/parser.rs


This file defines the parser type. The parser struct contains a single field:


and implements functionality:

new


Takes chars, a Chars type, and returns a new parser initialized from those chars. A Chars is an iterator over the chars of a string slice. So we pass this function an iterator of chars, and the Tokenizer type creates a new tokenizer from that iterator. The chars must have the same lifetime as the parser, since the parser holds on to them through its tokenizer.
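To make the lifetime point concrete, here is a stripped-down sketch of the plumbing. The struct bodies are my own stand-ins (I'm assuming the parser's field is its tokenizer, and next_char is a made-up method), so read it as an illustration of the lifetime parameter rather than the crate's code.

```rust
use std::str::Chars;

// Hypothetical stand-ins for the crate's types, reduced to the lifetime plumbing.
struct Tokenizer<'a> {
    chars: Chars<'a>, // borrows from the string being parsed
}

impl<'a> Tokenizer<'a> {
    fn new(chars: Chars<'a>) -> Tokenizer<'a> {
        Tokenizer { chars }
    }

    /// Hand out the next character, or None once the input is exhausted.
    fn next_char(&mut self) -> Option<char> {
        self.chars.next()
    }
}

struct Parser<'a> {
    tokenizer: Tokenizer<'a>,
}

impl<'a> Parser<'a> {
    /// The same 'a flows from the Chars into the Tokenizer and the Parser,
    /// so the compiler rejects a parser that outlives the source string.
    fn new(chars: Chars<'a>) -> Parser<'a> {
        Parser { tokenizer: Tokenizer::new(chars) }
    }
}
```

With this shape, Parser::new(text.chars()) compiles only while text is still alive, which is what "same lifetime" amounts to in practice.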

parse_record


Takes self by mutable reference and returns a GedcomData. Creates a new GedcomData variable, then loops through the tokens of the tokenizer, adding each correctly parsed token to the tree.

At the start of each loop a pointer is created and is used when parsing Tag tokens. The following are the primary functions for parsing:

parse_header


parse_individual


parse_submitter


parse_family


parse_source


parse_repository


The following are for processing secondary and tertiary tags:

parse_custom_tags


parse_gedcom_data



parse_repo_citation


parse_gender


parse_name


parse_event


parse_address


parse_citation


The following are helper functions:

take_continued_text


take_line_value


dbg


There's a lot in the functions listed above. I'll try going into depth with each one at a later date.

src/tokenizer.rs


This file defines the Tokenizer type and the Token type.

It makes use of:

> gedcom standard 5.5.1

Which boils down to (I'll have to go check though):

"gedcom_line: level + delim + [optional_xref_ID] + tag + [optional_line_value] + terminator"

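As a sanity check on that grammar, here is a small sketch that splits one line into its fields. It works on whole string slices instead of a character stream, so it is not how the crate's tokenizer does it; GedcomLine and parse_line are names I made up, and the sketch assumes a single space as the delimiter.

```rust
/// One gedcom line split into the fields named in the grammar.
#[derive(Debug, PartialEq)]
struct GedcomLine<'a> {
    level: u8,
    xref: Option<&'a str>,
    tag: &'a str,
    value: Option<&'a str>,
}

/// Returns None when the line has no level, or no tag after the level.
fn parse_line(line: &str) -> Option<GedcomLine<'_>> {
    // Drop the terminator (\n or \r\n).
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');

    // level + delim
    let (level, rest) = line.split_once(' ')?;
    let level: u8 = level.parse().ok()?;

    // [optional_xref_ID]: wrapped in '@' and placed before the tag.
    let (xref, rest) = if rest.starts_with('@') {
        let (xref, rest) = rest.split_once(' ')?;
        (Some(xref), rest)
    } else {
        (None, rest)
    };

    // tag + [optional_line_value]: whatever follows the tag is the value.
    let (tag, value) = match rest.split_once(' ') {
        Some((tag, value)) => (tag, Some(value)),
        None => (rest, None),
    };
    if tag.is_empty() {
        return None;
    }
    Some(GedcomLine { level, xref, tag, value })
}
```

Note that a pointer in the value position (as in 1 FAMS @F1250@) comes back as the value, not as the xref, because only a pointer that appears before the tag identifies the record itself.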

The token type is an enum defining the possible tokens and their associated types:


The tokenizer type has fields:


Tokenizer implements:


next_line, new, and done are the public functions. next_line checks for special characters (\0, \r, \n), skips whitespace, then sets the new current token based on the previous current token.

src/tree.rs


Tree uses the types defined in src/types to define the GedcomData type. This type has fields:


src/util.rs


Contains a macro for displaying Option<T>s in debug mode.

src/types/address.rs


A struct with Option<String> fields. It would be nice if there were a way to geolocate these.

src/types/family.rs


The family type is a struct with the fields one would expect: reference id, individual 1 (husband), individual 2 (wife), children, etc. family events is currently a private field and I'm not sure why (figured that out, it's because there's an explicit getter method in the HasEvents trait). individual 1 and 2 are the author's choice to allow a wider variety of family structures. Could go further there and make it a HashMap called Spouses or something, but that could come with complications.

Family implements methods: creating a new family, setting the spouses and adding children.

Family implements traits: HasEvents. HasEvents allows adding events and returning a clone of the family's events. Why a clone I wonder? Probably just easier.

src/types/event.rs


The EventType is an enum with categories for each type of gedcom event (adoption, birth, marriage, etc.). The Event type is a struct with a field for EventType, date, place, and citations. The Event type implements methods for creating an event from a tag, adding/reporting citations, and converting an event to have EventType of SourceData(String).

It doesn't look like there is functionality for adding times/places.

event.rs also defines a public trait called HasEvents. Types implementing HasEvents must define a method to add an event and a method to return their events. HasEvents then gives those types methods to return all dates and locations of their events.
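The required-versus-provided split is the part that clicked for me, so here is a simplified sketch of it. The method names (add_event, events, dates, places) and the two-field Event are my guesses at the shape, not copied from the crate.

```rust
// Hypothetical, simplified Event; the crate's version has more fields.
#[derive(Clone, Debug)]
struct Event {
    date: Option<String>,
    place: Option<String>,
}

trait HasEvents {
    // Required: each implementing type decides where its events live.
    fn add_event(&mut self, event: Event);
    fn events(&self) -> Vec<Event>;

    // Provided: written once here and inherited by every implementor.
    fn dates(&self) -> Vec<String> {
        self.events().into_iter().filter_map(|e| e.date).collect()
    }
    fn places(&self) -> Vec<String> {
        self.events().into_iter().filter_map(|e| e.place).collect()
    }
}

#[derive(Default)]
struct Family {
    events: Vec<Event>, // private: reachable only through the trait methods
}

impl HasEvents for Family {
    fn add_event(&mut self, event: Event) {
        self.events.push(event);
    }
    fn events(&self) -> Vec<Event> {
        self.events.clone() // hands back a copy, so callers cannot mutate the original
    }
}
```

Family only writes the two required methods and gets dates and places for free. Returning a clone keeps the signature simple at the cost of copying the Vec on every call; returning a slice would avoid the copy.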

This is the first time I've seen traits (or anything like it... I guess interfaces are similar) in the wild, so I'm pretty stoked.

src/types/header.rs


The header type is a struct which contains all 'metadata' (Encoding, copyright, corp, language, etc) of the file.

src/types/source.rs


Two structs are defined here: Source and SourceData.

SourceData contains events and agency, an optional string. I might be getting ahead of myself here, but it looks like add_event was mistakenly implemented in an inherent impl block on the SourceData type, instead of in an implementation of the HasEvents trait. Enhancement!

Source contains a SourceData, reference, abbreviation, title, and repo citations.

Source implements methods for creating a new source and adding a repo citation.

src/types/submitter.rs


Defines a struct for Submitter containing a reference, name, address, and phone number. A submitter is a submitter of a genealogical fact, or so a comment tells me. Implements a method for creating a new submitter.

src/types/individual.rs


Individuals are slightly more complicated than the other types. The individual type is a struct with an option<xref> (reminder xref is a string), option<name>, sex: gender, Vec<FamilyLink>, Vec<CustomData>, last_updated: Option<String>, and Vec<Event>. So what is this family link? More later...

Individual implements new, adding a family (actually adds a FamilyLink), and add_custom_data.

Individual also implements trait HasEvents.

There is an enum for gender (M, F, Nonbinary, Unknown).

A Family link is a tuple struct of an xref, FamilyLinkType, and option<pedigree>.

pedigree is an enum: adopted, birth, foster, sealing.

FamilyLinkType is an enum: Spouse, Child.

FamilyLink implements new from an Xref and a tag ('FAMC', 'FAMS'), and set_pedigree.
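Putting the FamilyLink pieces together, a sketch might look like the following. The type and variant names follow the description above, but the signatures and the handling of unknown tags and pedigree strings are my assumptions.

```rust
type Xref = String;

#[derive(Debug, PartialEq)]
enum FamilyLinkType {
    Spouse,
    Child,
}

#[derive(Debug, PartialEq)]
enum Pedigree {
    Adopted,
    Birth,
    Foster,
    Sealing,
}

#[derive(Debug, PartialEq)]
struct FamilyLink(Xref, FamilyLinkType, Option<Pedigree>);

impl FamilyLink {
    /// Returns None for any tag other than FAMC/FAMS (an assumption of this sketch).
    fn new(xref: Xref, tag: &str) -> Option<FamilyLink> {
        let link_type = match tag {
            "FAMC" => FamilyLinkType::Child,  // family in which the person is a child
            "FAMS" => FamilyLinkType::Spouse, // family in which the person is a spouse
            _ => return None,
        };
        Some(FamilyLink(xref, link_type, None))
    }

    /// Unrecognized pedigree strings clear the field (also an assumption).
    fn set_pedigree(&mut self, value: &str) {
        self.2 = match value.to_lowercase().as_str() {
            "adopted" => Some(Pedigree::Adopted),
            "birth" => Some(Pedigree::Birth),
            "foster" => Some(Pedigree::Foster),
            "sealing" => Some(Pedigree::Sealing),
            _ => None,
        };
    }
}
```

For Henry's record, FamilyLink::new("@F1250@".to_string(), "FAMS") would give a Spouse link with no pedigree set.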

Last, Name is a struct with a bunch of option elements. Could add title here. I'm not sure how to use it though. It derives some traits, so maybe the answer is there. Ya I see it. It has a Name::default() and then gets overwritten in the parser.

src/types/mod.rs


Brings the type modules into scope. Also defines a few structs (SourceCitation, RepoCitation, and CustomData). Apparently using a mod.rs file is the older style of laying out modules.

tests/json_feature.rs


Ok this is cool. If I look back at the Name type defined in src/types/individual.rs, the line:

#[cfg_attr(feature = "json", derive(Serialize, Deserialize))]


declares an optional build configuration. If the build includes the feature 'json' then the compiler derives the Serialize and Deserialize traits for Name. These are defined in the serde crate and allow serialization/deserialization of the Name type to/from json.

All the type declarations have this line, which means a gedcom tree can be serialized/deserialized to json.
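For reference, this is roughly how the attribute sits on a type. The field names below are placeholders of mine, and the sketch assumes the crate's Cargo.toml declares a json feature that pulls in serde as an optional dependency.

```rust
// With the `json` feature off, the cfg_attr line expands to nothing,
// so this compiles without serde being present at all.
// With it on, it becomes #[derive(serde::Serialize, serde::Deserialize)].
#[cfg_attr(feature = "json", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Default, PartialEq)]
struct Name {
    // Hypothetical fields; the real Name has a bunch of Option elements.
    value: Option<String>,
    given: Option<String>,
    surname: Option<String>,
}
```

The ordinary derives are unconditional, so Name::default() is always available, whether or not json is switched on.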

The json_feature.rs file runs tests to make sure that serialization to json works on the rust-gedcom types.

tests/lib.rs


This file holds a test of the parser on a simple gedcom file. HashMap implements Index (but not IndexMut), though only for lookups by a borrowed key, so after a switch to HashMaps, accessing individuals and submitters via data.individuals[0] would likely need to become a lookup by xref. There could probably be more tests here!
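A quick sketch of what HashMap indexing allows, assuming a map keyed by xref. The three functions are made up for illustration, and the map holds plain name strings instead of the crate's Individual type.

```rust
use std::collections::HashMap;

/// Build a tiny xref -> name map from the sample record.
fn sample_individuals() -> HashMap<String, String> {
    let mut individuals = HashMap::new();
    individuals.insert("@I2644@".to_string(), "Henry C. /Marone/".to_string());
    individuals
}

fn name_of<'a>(individuals: &'a HashMap<String, String>, xref: &str) -> &'a str {
    // Index takes a borrowed key and panics if the key is absent;
    // use .get(xref) instead when the key might be missing.
    &individuals[xref]
}

fn rename(individuals: &mut HashMap<String, String>, xref: &str, name: &str) {
    // `individuals[xref] = ...` would not compile: HashMap has no IndexMut.
    if let Some(entry) = individuals.get_mut(xref) {
        *entry = name.to_string();
    }
}
```

So reads through square brackets still work, but only with a key, and writes have to go through insert or get_mut.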

Enhancements