MediaBeacon University

Preparing Metadata Files

Guidelines for metadata to be used for Import.

When metadata is imported into assets already ingested by MediaBeacon (the most typical metadata import), the import files must be formatted to specific standards.

A good way to determine the proper formatting for a metadata file is to export a representative subset of MediaBeacon asset metadata once the List component has been configured. See Using the Export Function for more info.

Metadata File Terminology

Most of MediaBeacon's supported metadata file formats are character-delimited formats: CSV, Tab Delimited, and Custom Delimited. Below is a sample CSV file that illustrates several key concepts:

file_name,keyword_field,date_field,textarea_field
file01.jpg,keyword,2018-05-01,String of text
file02.jpg,"hello,world",2018-05-01,"Text that contains
an example of a line break."
file03.jpg,,yellow green,
file04.jpg,apples,2018-06-07,1234567890
  • Row: Each line of text ending with a line break.
  • Record: Generally, a row of text, delimited into fields. Some data types (textarea fields) may contain line breaks within values, causing a record to span multiple rows.
    • In character-delimited files, it is very important that every record has the same number of fields, and that those fields are in the same order in every record. Null values are used to accomplish this.
  • Fields: Discrete sections of rows that contain values, separated by delimiters.
  • Delimiter: The character sequence that defines the boundary between fields. Single comma and tab characters are common.
  • Value: The substring of a particular row that exists in a particular field.
  • Null value: Fields that contain no value still need to be represented in each record to maintain the ordering of fields in each record. This is expressed as two or more delimiters next to each other: For example, adjacent commas in a CSV file.
  • Column: This refers to all values that exist in all records in the same position, e.g. every third field in every row.
  • Header (aka header row): The first line in a character-delimited file does not contain a record of data, but a delimited sequence of values that serve as field identifiers. Best practice is to ensure that each of these is unique and known to correspond to a specific field in MediaBeacon.
  • Escape character (aka text qualifier or enclosing character): When the boundaries of fields are ambiguous, this character is used to enclose the contents of a field. The straight double quote character (") is common.
  • Line break: The non-printing character (or characters) that cause text to break onto the next line. This may variously be referred to as a "line feed" (aka "newline"), "carriage return", or "end of line". Keep in mind that those terms refer to specific and different ASCII characters, and that different data standards use different ones to effect a line break. Best practice is to ensure line breaking is accomplished by the same method in all places in the metadata file.
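
The terms above can be illustrated by parsing the sample file. This is a minimal sketch using Python's standard csv module, which is only a stand-in for illustration and not part of MediaBeacon; the variable names are hypothetical.

```python
import csv
import io

# The sample CSV from this section, held in memory for illustration.
SAMPLE = (
    "file_name,keyword_field,date_field,textarea_field\n"
    "file01.jpg,keyword,2018-05-01,String of text\n"
    'file02.jpg,"hello,world",2018-05-01,"Text that contains\n'
    'an example of a line break."\n'
    "file03.jpg,,yellow green,\n"
    "file04.jpg,apples,2018-06-07,1234567890\n"
)

rows = SAMPLE.splitlines()                        # physical lines of text
reader = csv.reader(io.StringIO(SAMPLE, newline=""))
header = next(reader)                             # field identifiers
records = list(reader)                            # logical records

print(len(rows))     # 6 rows: the header plus five lines of data
print(len(records))  # 4 records: the file02.jpg record spans two rows
print(records[2])    # null values are read back as empty strings
```

Every record has the same number of fields as the header, even though the rows differ in how many values they visibly carry.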

XML files are also supported as metadata files, but are more complex and covered in more detail in the File Formats article.

Preparation Checklist

  1. Choose a Key Field
  2. Determine Field Identifiers
  3. Determine Column Order
  4. Check Character Encoding
  5. Truncate Length-Delimited Data
  6. Check OS Localization
  7. Check for Non-Printing Characters
  8. Format Data Types
  9. Use Appropriate File Format

Choose a Key Field

See the Choosing a Key Field section.

Determine Field Identifiers

CSV files (and other character-delimited formats) need header row values that identify which metadata field in MediaBeacon corresponds to each column of values in the metadata file. Below is an example of a correctly formatted CSV file.

file_name,http://purl.org/dc/elements/1.1/ subject,http://purl.org/dc/elements/1.1/ title,http://purl.org/dc/elements/1.1/ date
file01.xmp,"hello,world",File One,2018-03-09
file02.xmp,Example_Keyword,File Two,2017-10-26

Header Row values can be defined in a number of ways:

  • Arbitrary Names: Each column is given a title, but these are not synchronized to the MediaBeacon Field names in any particular way. This is not recommended, as it will require a manual rearrangement of the columns at time of import, introducing opportunity for human error.
  • "Namemapped Column Headings": Each column is identified with the string used by MediaBeacon to identify a field. This is not considered best practice, since these names may not be unique.
  • XML Expanded Name: This method (the default for MediaBeacon's Export function, and shown above) uses an unambiguous field identifier, and is recommended as a best practice.
    • Example: The Dublin Core Keywords field XML Expanded name (Schema URI + Internal Name) is: "http://purl.org/dc/elements/1.1/ subject"
    • Some database-only fields, such as File Name, do not have an identifier in this format. They use short identifiers instead, such as "file_name", "record_id", and "long_name".

A quick method to get the correct list of XML Expanded names is to configure the List component with the desired fields and then perform an Export on a single asset. The resulting CSV will have the expanded names in the header row.
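
Before import, a header row can be checked for duplicate or ambiguous identifiers. The sketch below is a hypothetical helper, not a MediaBeacon tool. It assumes an XML Expanded Name has the shape "schema URI, space, internal name" with an http(s) URI, and it only knows the three database-only identifiers named in this article.

```python
from collections import Counter

# Database-only identifiers named in this article; the real list may be longer.
DATABASE_ONLY = {"file_name", "record_id", "long_name"}

def check_header(header):
    """Return (duplicates, suspicious) for a list of header row values."""
    counts = Counter(header)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    suspicious = []
    for name in header:
        if name in DATABASE_ONLY:
            continue
        # Assumed shape of an XML Expanded Name: "<schema URI> <internal name>".
        uri, _, internal = name.rpartition(" ")
        if not (uri.startswith(("http://", "https://")) and internal):
            suspicious.append(name)
    return duplicates, suspicious

header = ["file_name", "http://purl.org/dc/elements/1.1/ subject", "Title", "Title"]
print(check_header(header))
```

A header value flagged as suspicious is not necessarily wrong; it is simply not in the unambiguous form recommended above and deserves a manual look.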

Determine Column Order

It is best practice to have the metadata file's columns (the sequence of fields in each record) match the List component's column header arrangement. This includes:

  • Order of columns
  • Number of columns
  • "Key field" as the leftmost field in each record.

Although a metadata file's field values can be "soft arranged" or even ignored during Import, doing so adds unnecessary steps to the import process.
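
Rearranging columns ahead of time can be scripted. The following is a minimal sketch with Python's csv module; the function name and the column names in the usage example are hypothetical, and the wanted order would come from the List component's configuration.

```python
import csv
import io

def reorder_columns(csv_text, wanted):
    """Return csv_text with only the `wanted` columns, in that order."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=wanted, lineterminator="\n")
    writer.writeheader()
    for record in reader:
        # Columns missing from the source become null (empty) values.
        writer.writerow({name: record.get(name, "") for name in wanted})
    return out.getvalue()

source = "title,file_name,extra\nFile One,file01.xmp,x\n"
# Key field first, then the remaining columns in List component order.
print(reorder_columns(source, ["file_name", "title"]))
```

Columns not listed in `wanted` are dropped, which matches the "Number of columns" point above.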

Check Character Encoding

MediaBeacon is UTF-8 compliant, and it is best practice to keep all metadata sources in that encoding during all operations prior to Import.

Main Article: [Character Encoding]
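
If a source file arrives in another encoding, it can be re-encoded before import. This is a minimal sketch: the function name is hypothetical, and the default source encoding (Windows-1252) is only an illustrative guess. The actual encoding of the source must be known, because it cannot be detected reliably from the bytes alone.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return raw re-encoded as UTF-8, assuming source_encoding if it is not UTF-8."""
    try:
        # Bytes that already decode as UTF-8 are passed through unchanged.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Otherwise decode with the assumed source encoding and re-encode.
        return raw.decode(source_encoding).encode("utf-8")

legacy = "caf\u00e9".encode("cp1252")   # b'caf\xe9', not valid UTF-8
print(to_utf8(legacy))
```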

Truncate Length-Delimited Data

Some metadata sources (usually databases) may store data in length-delimited fields. This type of data construction stores different "fields" in a single long string, writing each piece of data to a fixed-length segment and padding the remainder with spaces. Segments are often 64 characters in length, but there are no "breaking" characters to indicate where one segment ends and the next begins.

  • Character delimited records:
2018-01-01,hello
  • Length delimited records (spaces are simulated with underscore characters in this example):
2018-01-01______________________________________________________hello___________________________________________________________

MediaBeacon requires character-delimited files (or XML): CSV, Tab Delimited, or Custom Delimited. When processing a length-delimited source, any extra spaces should be removed; otherwise they will be recorded as part of the metadata values.
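
The conversion from length-delimited to character-delimited records can be sketched as follows. The function name is hypothetical, and the 64-character segment width is an assumption taken from the example above; a real source may use a different width, or different widths per field.

```python
import csv
import io

def fixed_width_to_csv(lines, width=64):
    """Split each line into width-character segments, trim padding, emit CSV."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        segments = [line[i:i + width] for i in range(0, len(line), width)]
        # strip() removes the padding spaces so they are not imported as data.
        writer.writerow([segment.strip() for segment in segments])
    return out.getvalue()

record = "2018-01-01".ljust(64) + "hello".ljust(64)
print(fixed_width_to_csv([record]))
```

Note that strip() also removes leading spaces, so a source that pads on the left is handled the same way.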

Format Data Types

Certain types of data are expected to be in a specific format:

  • Date fields must be in ISO 8601 format (2018-01-01).
  • Multivalue* fields must be escaped in (noncurly) double quotes:
    • "Hello,world"
  • Multiline* (textarea) fields must be escaped in (noncurly) double quotes:
    • "I do not think I shall ever see
      A poem as lovely as a tree."
  • Strings that contain the delimiting or escaping characters* (, and " respectively for CSV format) must be properly formatted.
    • "Hello" is represented by """Hello"""
    • A value of Doe, Jane would be written as "Doe\, Jane"
  • Curly quotes: Beware of curly (typographic) double quotes being used as the escape character (aka text qualifier) in CSV (or other character-delimited file types). Use the straight double quote (") instead.

*These restrictions apply to character-delimited file formats; XML files are structured differently. See the File Formats article for more info.
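
Most of these quoting rules are applied automatically by a CSV library. The sketch below uses Python's csv module with hypothetical names; it covers only standard CSV quoting (enclosing in double quotes and doubling embedded quotes) and ISO 8601 dates, not the backslash convention shown above for commas inside a value.

```python
import csv
import io
from datetime import date

def write_record(values):
    """Serialize one record as a CSV line with standard quoting."""
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerow(values)
    return out.getvalue()

line = write_record([
    "file01.jpg",
    "Hello,world",                    # contains the delimiter
    date(2018, 1, 1).isoformat(),     # ISO 8601 date: 2018-01-01
    "I do not think I shall ever see\nA poem as lovely as a tree.",
    '"Hello"',                        # contains the escape character
])
print(line)
```

The writer always emits the straight double quote as the text qualifier, which also avoids the curly quote problem.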

Check OS Localization

Non-English language settings may cause issues with importing and exporting CSV files. This is due to some localizations using a non-comma character in the "List separator" setting.

This may be remedied by using the English localization, or setting "List separator" to the comma ( , ) character.
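
When a file has already been saved with a different list separator, it can be converted rather than re-exported. This is a minimal sketch that assumes the source uses a semicolon, a common separator in some European localizations; the function name is hypothetical.

```python
import csv
import io

def to_comma_delimited(text, source_delimiter=";"):
    """Rewrite delimited text so that it uses a comma as the delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=source_delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    # Values that contain a comma are quoted by the writer as needed.
    writer.writerows(reader)
    return out.getvalue()

print(to_comma_delimited("file_name;keywords\nfile01.jpg;hello,world\n"))
```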

Check for Non-Printing Characters

Data from some metadata sources can contain non-printing Unicode characters that can be misinterpreted when importing text. These should be purged whenever possible.

  • Non-Printing / Control characters including, but not limited to:
    • File Separator "FS" (U+001C)
    • Group Separator "GS" (U+001D)
    • Record Separator "RS" (U+001E)
    • Unit Separator "US" (U+001F)
  • Line Break characters:
    • Some metadata sources use different non-printing characters, or sometimes multiple Unicode characters, to represent a single line break. This can manifest as textarea values using a different line-breaking character sequence than the one that separates records. These should be synchronized if a mismatch occurs.
      • Windows "CRLF" (carriage return, line feed)
      • Unix / macOS "LF" (line feed)
      • Mac OS 9 "CR" (carriage return, very uncommon)
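
Both cleanup steps can be sketched in a few lines. The helper below is hypothetical: it deletes control characters outright, which assumes the source does not rely on them as separators (replace them with a delimiter instead if it does), and it defaults to LF line breaks as an arbitrary choice.

```python
import re

# C0 control characters other than tab, line feed, and carriage return, plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(text, line_break="\n"):
    """Remove control characters and make all line breaks use one sequence."""
    text = CONTROL_CHARS.sub("", text)
    # Collapse CRLF and lone CR into LF first, then apply the chosen style.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", line_break)

print(repr(clean_text("a\x1cb\r\nc\rd\ne")))
```

Running this over the whole file, before it is parsed, keeps line breaks inside textarea values consistent with the line breaks between records.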

Use Appropriate File Format

Main Article: File Formats
