Discovery
"If you don't understand the Way right before you, how will you know the path as you walk?" -Sekito Kisen
Introduction
I need a gedcom parser, but which one? I looked at parsers written in python, typescript, and rust. The python and typescript implementations worked. The rust implementation broke with the first sample gedcom I tried.
I don't want to just do some statistics and make a pretty front end. The rust crate is an opportunity to better understand what a parser actually is and potentially make a contribution to someone else's project. Perhaps I'll need to fall back to the python/typescript versions, but I'll probably know more about the world when I do.
Here is a link to the rust gedcom parser:
> rust-gedcom
Summary
The parser is made up of three pieces:
- the tokenizer
- the parser
- the output (gedcom) data structure
The tokenizer is responsible for taking a character stream, converting it into tokens and maintaining any meta information relevant to how the gedcom data is structured. For example, in the record:
0 @I2644@ INDI
1 NAME Henry C. /Marone/
1 SEX M
1 BIRT
2 DATE 1895
1 DEAT
2 DATE 1968
1 FAMS @F1250@
Each leading number is the line's level: it says which earlier line the data belongs to. Level 0 starts a record: an individual, a family, etc. The next three lines provide data about that individual, and are thus labelled 1. The fifth line is labelled 2 because it describes the line above it (it provides the date of the individual's birth). The tokenizer must track each line's level, and provide functionality for getting the next pointer (@I2644@), tag (INDI, NAME, SEX, etc.), and value (Henry C. /Marone/, M, 1895, etc.). The last line (1 FAMS @F1250@) links Henry to his family using a pointer to that family's record (@F1250@).
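To make that concrete, here is a rough sketch (mine, not the crate's code) of the pieces a tokenizer has to pull out of each line: the level, an optional pointer, the tag, and an optional value.

```rust
// Rough illustration only: split one gedcom line into level, optional
// pointer, tag, and optional value.
fn split_line(line: &str) -> (u8, Option<&str>, &str, Option<&str>) {
    // The level always comes first.
    let (level_str, rest) = line.split_once(' ').unwrap_or((line, ""));
    let level: u8 = level_str.parse().unwrap_or(0);

    // An optional pointer like @I2644@ sits between the level and the tag.
    let (pointer, rest) = if rest.starts_with('@') {
        let (p, r) = rest.split_once(' ').unwrap_or((rest, ""));
        (Some(p), r)
    } else {
        (None, rest)
    };

    // The tag comes next; anything after it is the line value.
    let (tag, value) = match rest.split_once(' ') {
        Some((t, v)) => (t, Some(v)),
        None => (rest, None),
    };

    (level, pointer, tag, value)
}

fn main() {
    println!("{:?}", split_line("0 @I2644@ INDI"));
    println!("{:?}", split_line("1 NAME Henry C. /Marone/"));
    println!("{:?}", split_line("2 DATE 1895"));
}
```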
The parser creates a tokenizer from the given gedcom file; loops through the tokens identified by the tokenizer; and parses them into the desired output data structure.
The data structure is an important part of the parser, because it determines a lot about how that data can be used going forward. For example, the current GedcomData struct keeps a lot of important information in Vecs. This will be problematic going forward: if I want to, say, find an individual's parents, I'll need to loop through the Vec<Family> until I find the correct family, then extract the spouses of that family and loop through the Vec<Individual> to find them. One goal I have is to re-write this structure to maintain HashMaps of the data, keyed by xref, so that I can look up families and individuals in O(1) time. I hope I get a chance to connect with the crate's author so I can ask about their design choices!
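As a toy illustration of the difference (my own sketch, not the crate's types): with a Vec you scan for the family you want, while a HashMap keyed by xref gives you a direct lookup.

```rust
use std::collections::HashMap;

// Toy types to illustrate Vec-scan vs HashMap lookup; not rust-gedcom's real
// Family/Individual definitions.
struct Family {
    xref: String,
    husband: Option<String>, // individual xref
    wife: Option<String>,
}

fn main() {
    let families = vec![Family {
        xref: "@F1250@".to_string(),
        husband: Some("@I2644@".to_string()),
        wife: None,
    }];

    // With a Vec, finding Henry's family means scanning every family.
    let by_scan = families.iter().find(|f| f.xref == "@F1250@");
    println!("scan: {:?}", by_scan.map(|f| (&f.husband, &f.wife)));

    // With a HashMap keyed by xref, it's a single lookup.
    let by_xref: HashMap<&str, &Family> =
        families.iter().map(|f| (f.xref.as_str(), f)).collect();
    println!("map:  {:?}", by_xref.get("@F1250@").map(|f| (&f.husband, &f.wife)));
}
```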
The following is informal technical documentation for the rust-gedcom crate. It is high-level. In general, I don't specify input/output types, lifetimes, etc.
File Structure
src/
- bin.rs
- lib.rs
- parser.rs
- tokenizer.rs
- tree.rs
- util.rs
src/types/
- address.rs
- event.rs
- family.rs
- header.rs
- individual.rs
- mod.rs
- source.rs
- submitter.rs
tests/
- json_feature.rs
- lib.rs
tests/fixtures/
- .ged files
src/bin.rs
This file defines the cli binary interface for the program:
- main - reads in data from the given file and parses it as a gedcom file.
- usage - prints cli usage to stdout.
- read_relative - takes a filepath and returns a Result<String> of the file data.
- exit_with_error - reports an error message to stdout then exits the program.
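Based on those descriptions, I'd guess the flow looks roughly like the sketch below. The function names mirror the list above; the bodies are my assumptions, not the crate's actual code.

```rust
use std::fs;
use std::process;

// Hedged sketch of the bin.rs flow: read a gedcom file, or report an error
// and exit. Names follow the descriptions above; implementations are guesses.
fn read_relative(path: &str) -> std::io::Result<String> {
    fs::read_to_string(path)
}

fn exit_with_error(message: &str) -> ! {
    println!("{}", message); // per the notes above, errors go to stdout
    process::exit(1);
}

fn main() {
    let filename = std::env::args()
        .nth(1)
        .unwrap_or_else(|| exit_with_error("usage: <gedcom binary> path/to/file.ged"));

    match read_relative(&filename) {
        Ok(contents) => println!("read {} characters", contents.chars().count()),
        Err(e) => exit_with_error(&format!("could not read {}: {}", filename, e)),
    }
}
```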
src/lib.rs
This file declares modules:
- util
- parser (pub)
- tokenizer (pub)
- types (pub)
- tree
types:
- tree::GedcomData (pub)
and defines a helper function:
- parse - converts a file content stream to parsed data.
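Put together, I'd expect using the library to look roughly like this. I haven't verified the exact signature or module path of parse, so treat everything in this snippet as an assumption.

```rust
// Assumed usage, per the descriptions above: `parse` takes the file contents
// as a char iterator and returns the parsed GedcomData. The fixture path is a
// placeholder.
use gedcom::parse;

fn main() {
    let contents = std::fs::read_to_string("path/to/sample.ged")
        .expect("could not read the gedcom file");

    let data = parse(contents.chars());
    println!("parsed {} individuals", data.individuals.len());
}
```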
src/parser.rs
This file defines the parser type. The parser struct contains a single field:
- tokenizer
and implements functionality:
new
takes chars, a Chars type, and returns a new parser initialized from those chars. A Chars is an iterator over the chars of a string slice. So we pass this function an iterator of chars, and the Tokenizer type creates a new tokenizer from that iterator. The chars must have the same lifetime as the parser.
parse_record
Operates mutably by reference on self and returns a GedcomData type. Creates a new GedcomData variable, then loops through the tokens of the tokenizer adding each correctly parsed token to the tree.
At the start of each loop a pointer is created and is used when parsing Tag tokens. The following are the primary functions for parsing:
parse_header
parse_individual
parse_submitter
parse_family
parse_source
parse_repository
The following are for processing secondary and tertiary tags:
parse_custom_tags
parse_gedcom_data
parse_family_link
parse_repo_citation
parse_gender
parse_name
parse_event
parse_address
parse_citation
The following are helper functions:
take_continued_text
take_line_value
dbg
There's a lot in the functions listed above. I'll try going into depth with each one at a later date.
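For now, here's a heavily simplified sketch of the shape parse_record seems to have: loop over tokens, grab a pointer if one precedes the tag, and dispatch on the level-0 tag. The types and helpers below are stand-ins, not the crate's.

```rust
// Stand-in token and output types; the real ones live in tokenizer.rs and
// tree.rs. This only sketches the dispatch-on-level-0-tags shape.
#[derive(Debug)]
enum Token {
    Level(u8),
    Tag(String),
    Pointer(String),
    EOF,
}

#[derive(Debug, Default)]
struct GedcomData {
    individuals: Vec<String>, // placeholder: really Vec<Individual>
    families: Vec<String>,    // placeholder: really Vec<Family>
}

fn parse_record(tokens: Vec<Token>) -> GedcomData {
    let mut data = GedcomData::default();
    let mut iter = tokens.into_iter().peekable();

    while let Some(token) = iter.next() {
        // A new record starts at level 0.
        if let Token::Level(0) = token {
            // A pointer (if any) comes before the tag.
            let mut pointer = None;
            if matches!(iter.peek(), Some(Token::Pointer(_))) {
                if let Some(Token::Pointer(p)) = iter.next() {
                    pointer = Some(p);
                }
            }
            // Dispatch on the tag, the way parse_header / parse_individual /
            // parse_family and friends do in the real parser.
            if let Some(Token::Tag(tag)) = iter.next() {
                match tag.as_str() {
                    "INDI" => data.individuals.push(pointer.unwrap_or_default()),
                    "FAM" => data.families.push(pointer.unwrap_or_default()),
                    _ => {} // HEAD, SUBM, SOUR, REPO, custom tags, ...
                }
            }
        }
        // In the real parser, Token::EOF signals the end of input.
    }
    data
}

fn main() {
    let tokens = vec![
        Token::Level(0),
        Token::Pointer("@I2644@".to_string()),
        Token::Tag("INDI".to_string()),
        Token::EOF,
    ];
    println!("{:?}", parse_record(tokens));
}
```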
src/tokenizer.rs
This file defines the Tokenizer type and the Token type.
It makes use of:
> gedcom standard 5.5.1
Which boils down to (I'll have to go check though):
"gedcom_line: level + delim + [optional_xref_ID] + tag + [optional_line_value] + terminator"
The token type is an enum defining the possible tokens and their associated types:
- Level(u8) - denotes depth within the tree
- Tag(String) - a short code (INDI, NAME, SEX, etc.) that identifies the kind of data on the line
- LineValue(String) - the value of the data
- Pointer(String) - a cross-reference to another record (e.g. @F1250@)
- CustomTag(String) - user defined tag (always starts with an underscore)
- EOF - end-of-file indicator
- None - initial token value
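As a sketch (the real enum lives in src/tokenizer.rs), the first two lines of the sample record from earlier would tokenize into something like:

```rust
// Sketch of the Token enum as described above, plus the token stream for the
// first two lines of the sample record. Not the crate's code.
#[allow(dead_code)] // CustomTag and None aren't exercised in this toy example
#[derive(Debug)]
enum Token {
    Level(u8),
    Tag(String),
    LineValue(String),
    Pointer(String),
    CustomTag(String),
    EOF,
    None,
}

fn main() {
    // 0 @I2644@ INDI
    // 1 NAME Henry C. /Marone/
    let tokens = vec![
        Token::Level(0),
        Token::Pointer("@I2644@".to_string()),
        Token::Tag("INDI".to_string()),
        Token::Level(1),
        Token::Tag("NAME".to_string()),
        Token::LineValue("Henry C. /Marone/".to_string()),
        Token::EOF,
    ];
    println!("{:?}", tokens);
}
```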
The tokenizer type has fields:
- current_token - a Token
- current_char - a char
- chars - a Chars iterator
- line - a u32 of the current line
Tokenizer implements:
- new - takes a Chars and returns a new Tokenizer
- done - ends tokenization, setting current_token to Token::EOF
- next_token - loads the next token into the state
- next_char - gets the next character from chars
- extract_number - extracts a number from the chars iterator at the current state
- extract_word - extracts a word from chars
- extract_value - extracts a value from chars
- skip_whitespace - moves the iterator past any whitespace characters
- is_nonnewline_whitespace - checks if the current char is whitespace other than a newline
next_token, new, and done are the public functions. next_token checks for special characters (\0, \r, \n), skips whitespace, then sets the new current token based on the previous current token.
src/tree.rs
Tree uses the types defined in src/types to define the GedcomData type. This type has fields:
- header - a Header type
- submitters - a Vec of Submitter
- individuals - a Vec of Individual
- families - a Vec of Family
- repositories - a Vec of Repository
- sources - a Vec of Source
- multimedia - a Vec of Media
src/util.rs
Contains a macro for displaying Option<T>s in debug mode.
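I haven't dug into the macro itself yet; a hypothetical version of the idea (render an Option<T> nicely while debugging) might look like this. This is my own sketch, not the crate's macro.

```rust
// Hypothetical macro along the lines described above, NOT the crate's util
// macro: render an Option<T: Debug> as either its value or "None".
macro_rules! fmt_optional_value {
    ($opt:expr) => {
        match &$opt {
            Some(value) => format!("{:?}", value),
            None => String::from("None"),
        }
    };
}

fn main() {
    let name: Option<String> = Some("Henry C. /Marone/".to_string());
    let date: Option<String> = None;
    println!(
        "name: {}, date: {}",
        fmt_optional_value!(name),
        fmt_optional_value!(date)
    );
}
```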
src/types/address.rs
A struct with Option<String> fields. It would be nice if there were a way to geolocate these.
src/types/family.rs
The family type is a struct with the fields one would expect: reference id, individual 1 (husband), individual 2 (wife), children, etc. Family events is currently a private field and I wasn't sure why at first (figured it out: there's an explicit getter method in the HasEvents trait). Individual 1 and 2 are the author's choice to allow a wider variety of family structures. Could go further there and make it a HashMap called Spouses or something, but that could come with complications.
Family implements methods: creating a new family, setting the spouses and adding children.
Family implements traits: HasEvents. HasEvents allows adding events and returning a clone of the family's events. Why a clone I wonder? Probably just easier.
src/types/event.rs
The EventType is an enum with categories for each type of gedcom event (Adoption, Birth, Marriage, etc.). The Event type is a struct with a field for EventType, date, place, and citations. The Event type implements methods for creating an event from a tag, adding/reporting citations, and converting an event to have an EventType of SourceData(String).
It doesn't look like there is functionality for adding times/places.
Event.rs also defines a public trait called HasEvents. Types implementing HasEvents must define methods to add an event and to return its events. HasEvents then gives those types methods to return all dates and locations of their events.
This is the first time I've seen traits (or anything like it... I guess interfaces are similar) in the wild, so I'm pretty stoked.
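Here's a minimal sketch of the pattern, with made-up field names and a simplified Event: implementors supply add_event and events, and the trait derives date/place listings from them.

```rust
// Simplified sketch of the HasEvents pattern described above; field names and
// method bodies are assumptions, not rust-gedcom's definitions.
#[derive(Clone, Debug)]
struct Event {
    date: Option<String>,
    place: Option<String>,
}

trait HasEvents {
    fn add_event(&mut self, event: Event);
    fn events(&self) -> Vec<Event>;

    // Derived accessors the trait provides for free.
    fn dates(&self) -> Vec<String> {
        self.events().iter().filter_map(|e| e.date.clone()).collect()
    }
    fn places(&self) -> Vec<String> {
        self.events().iter().filter_map(|e| e.place.clone()).collect()
    }
}

#[derive(Default)]
struct Family {
    events: Vec<Event>,
}

impl HasEvents for Family {
    fn add_event(&mut self, event: Event) {
        self.events.push(event);
    }
    fn events(&self) -> Vec<Event> {
        self.events.clone() // a clone, like the crate's getter
    }
}

fn main() {
    let mut family = Family::default();
    family.add_event(Event { date: Some("1895".to_string()), place: None });
    println!("dates: {:?}, places: {:?}", family.dates(), family.places());
}
```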
src/types/header.rs
The header type is a struct which contains all 'metadata' (Encoding, copyright, corp, language, etc) of the file.
src/types/source.rs
Two structs are defined here: Source and SourceData.
SourceData contains events and agency, an optional string. I might be getting ahead of myself here, but it looks like add_event was mistakenly implemented in an inherent impl block for the SourceData type, instead of as part of the HasEvents trait. Enhancement! (There's a sketch of the fix at the end of this section.)
Source contains a SourceData, reference, abbreviation, title, and repo citations.
Source implements methods for creating a new source and adding a repo citation.
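Sketch of the enhancement mentioned above: move add_event out of the inherent impl SourceData block and into an impl of the HasEvents trait. The trait and types here are simplified stand-ins, as in the event.rs sketch.

```rust
// Simplified stand-ins; the point is only where add_event lives.
#[derive(Clone, Debug)]
struct Event(String);

trait HasEvents {
    fn add_event(&mut self, event: Event);
    fn events(&self) -> Vec<Event>;
}

#[derive(Default)]
struct SourceData {
    events: Vec<Event>,
    agency: Option<String>,
}

// The suggested change: implement the trait rather than writing
// `impl SourceData { fn add_event(...) { ... } }`.
impl HasEvents for SourceData {
    fn add_event(&mut self, event: Event) {
        self.events.push(event);
    }
    fn events(&self) -> Vec<Event> {
        self.events.clone()
    }
}

fn main() {
    let mut data = SourceData::default();
    data.add_event(Event("publication".to_string()));
    println!("{} event(s), agency: {:?}", data.events().len(), data.agency);
}
```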
src/types/submitter.rs
Defines a struct for Submitter containing a reference, name, address, and phone number. A submitter is a submitter of a genealogical fact, or so a comment tells me. Implements a method for creating a new submitter.
src/types/individual.rs
Individuals are slightly more complicated than the other types. The individual type is a struct with an Option<Xref> (reminder: an Xref is a String), Option<Name>, sex: Gender, Vec<FamilyLink>, Vec<CustomData>, last_updated: Option<String>, and Vec<Event>. So what is this FamilyLink? More later...
Individual implements new, adding a family (actually adds a FamilyLink), and add_custom_data.
Individual also implements trait HasEvents.
There is an enum for gender (M, F, Nonbinary, Unknown).
A FamilyLink is a tuple struct of an Xref, a FamilyLinkType, and an Option<Pedigree>.
Pedigree is an enum: Adopted, Birth, Foster, Sealing.
FamilyLinkType is an enum: Spouse, Child.
FamilyLink implements new from an Xref and a tag ('FAMC', 'FAMS'), and set_pedigree (see the sketch at the end of this section).
Last, Name is a struct with a bunch of Option fields. Could add title here. I'm not sure how to use it though. It derives some traits, so maybe the answer is there. Yeah, I see it: it has a Name::default() that then gets overwritten in the parser.
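A sketch of those FamilyLink pieces (names follow the notes above; the crate's definitions may differ):

```rust
// Sketch only; types and tag handling mirror the notes above.
type Xref = String;

#[derive(Debug)]
enum FamilyLinkType {
    Spouse,
    Child,
}

#[allow(dead_code)] // only Birth is exercised in this toy example
#[derive(Debug)]
enum Pedigree {
    Adopted,
    Birth,
    Foster,
    Sealing,
}

#[derive(Debug)]
struct FamilyLink(Xref, FamilyLinkType, Option<Pedigree>);

impl FamilyLink {
    // "FAMS" links an individual into a family as a spouse, "FAMC" as a child.
    fn new(xref: Xref, tag: &str) -> FamilyLink {
        let link_type = match tag {
            "FAMC" => FamilyLinkType::Child,
            _ => FamilyLinkType::Spouse, // "FAMS"
        };
        FamilyLink(xref, link_type, None)
    }

    fn set_pedigree(&mut self, pedigree: Pedigree) {
        self.2 = Some(pedigree);
    }
}

fn main() {
    // Henry's "1 FAMS @F1250@" line from the sample record earlier.
    let spouse_link = FamilyLink::new("@F1250@".to_string(), "FAMS");

    // A hypothetical child link (made-up xref), where a pedigree makes sense.
    let mut child_link = FamilyLink::new("@F9999@".to_string(), "FAMC");
    child_link.set_pedigree(Pedigree::Birth);

    println!("{:?}\n{:?}", spouse_link, child_link);
}
```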
src/types/mod.rs
Brings the type modules into scope. Also defines a few structs (SourceCitation, RepoCitation, and CustomData). Apparently this is the older style of doing things.
tests/json_feature.rs
Ok, this is cool. If I look back at the Name type defined in src/types/individual.rs, the line:
#[cfg_attr(feature = "json", derive(Serialize, Deserialize))]
declares an optional build configuration. If the build includes the feature 'json', then the compiler derives the Serialize and Deserialize traits for Name. These are defined in the serde crate and allow serialization/deserialization of the Name type to/from json.
All the type declarations have this line, which means a gedcom tree can be serialized/deserialized to json.
The json_feature.rs file runs tests to make sure that serialization to json works on the rust-gedcom types.
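To see the mechanism in isolation, here is a toy version of the same pattern. It's my own example: the struct fields are made up, and it assumes a Cargo feature named json that turns on optional serde/serde_json dependencies.

```rust
// Toy demonstration of the cfg_attr pattern described above; not code from
// rust-gedcom. Assumes a `json` feature gating optional serde/serde_json deps.
#[cfg(feature = "json")]
use serde::{Deserialize, Serialize};

#[cfg_attr(feature = "json", derive(Serialize, Deserialize))]
#[derive(Debug, Default)]
struct Name {
    value: Option<String>,   // made-up fields for illustration
    surname: Option<String>,
}

fn main() {
    let name = Name {
        value: Some("Henry C. /Marone/".to_string()),
        surname: Some("Marone".to_string()),
    };

    // With `--features json`, the derives exist and we can serialize.
    #[cfg(feature = "json")]
    println!("{}", serde_json::to_string(&name).unwrap());

    #[cfg(not(feature = "json"))]
    println!("{:?} (rebuild with --features json to serialize)", name);
}
```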
tests/lib.rs
This file holds a test of the parser on a simple gedcom file. One thing to note for the HashMap idea: HashMap implements Index (but not IndexMut), and it's indexed by key rather than position, so accesses like data.individuals[0] in these tests would need to become xref lookups. There could probably be more tests here!
Enhancements
- Parse values into useful types. Dates and Locations especially shouldn't be raw Option<String>s.
- Change GedcomData to maintain HashMap collections of Individuals, Families, etc. (keyed by xref) instead of Vecs.
- Implement the HasEvents trait for SourceData.
- Add title to src/types/individual/name