Difference between revisions of "Workdocumentation 2021-03-30"

Revision as of 09:24, 2 April 2021

Red Links and Data Fixations

Broken redirects

Broken file links

Broken properties and errors

  • The property Property:Has improper value for stores all invalid values; a query sketch follows below.
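To get an overview of the affected pages, the property can be queried. The following is a minimal sketch using the Semantic MediaWiki 'ask' API from Python; the api.php endpoint path is an assumption derived from the wiki base URL used elsewhere on this page.

# Minimal sketch (assumption): list pages carrying any 'Has improper value for'
# annotation via the Semantic MediaWiki 'ask' API. The endpoint URL below is an
# assumed path; adjust it to the actual wiki installation.
import requests

API_URL = "https://confident.dbis.rwth-aachen.de/ormk/api.php"  # assumed endpoint

params = {
    "action": "ask",
    "query": "[[Has improper value for::+]]|?Has improper value for|limit=500",
    "format": "json",
}
response = requests.get(API_URL, params=params, timeout=30)
results = response.json().get("query", {}).get("results", {})
for page, data in results.items():
    print(page, data["printouts"].get("Has improper value for"))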

Ordinals

  • To fix the ordinals, the following approaches can be used:
    • This approach finds all events with improper ordinals and fixes them:
wikiedit -t wikiId -q "[[Has improper value for::Ordinal]]" --search "(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b" --replace "\1"
  • A code snippet coupled with wikibackup and bash tools can be used for targeted editing of pages: Code Snippet
  • Pipeline usage (a simplified sketch of the replacement follows below):
grep Ordinal /path/to/backup -l -r | python ordinal_to_cardinal.py -stdin -d '../dictionary.yaml' -ro -f | wikirestore -t ormk -stdinp -ui
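The ordinal_to_cardinal.py script is not reproduced here; below is a minimal sketch of the core replacement it is assumed to perform, mirroring the wikiedit search/replace pattern above (the actual script additionally takes a dictionary file via -d, which this sketch omits).

# Simplified sketch (assumption): strip English ordinal suffixes from the Ordinal
# field of wiki pages whose paths arrive on STDIN, mirroring the regex used above.
import re
import sys

ORDINAL_PATTERN = re.compile(r"(\|Ordinal=[0-9]+)(?:st|nd|rd|th)\b")

def ordinal_to_cardinal(text):
    """Replace e.g. '|Ordinal=32nd' with '|Ordinal=32'."""
    return ORDINAL_PATTERN.sub(r"\1", text)

if __name__ == "__main__":
    # Page paths come from STDIN, as produced by the grep in the pipeline above.
    for line in sys.stdin:
        path = line.strip()
        with open(path) as f:
            content = f.read()
        fixed = ordinal_to_cardinal(content)
        if fixed != content:
            with open(path, "w") as f:
                f.write(fixed)
            print(path)  # report changed pages, e.g. for piping onward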

Improper Null values for Has person

  • Has person was using "some person" as a null value. There was incorrect usage where, in the free text, events would use "some person" while the Wikison format info would contain the actual person name.
  1. The first way of doing this is to remove the free text altogether. A code snippet was used coupled with the bash utility grep.

Code Snippet:

import ntpath  # os.path would also work; ntpath matches the original script

# getHomePath and ensureDirectoryExists are helpers assumed to be defined elsewhere
# in scripts/Data_Fixes.py (home directory path and directory creation).

def remove_FreeText(path, restoreOut=False):
    '''
    Remove free text from a given wiki page, keeping only the {{...}} template blocks.
    Args:
        path(str): path to the wiki file
        restoreOut(bool): if True, write the cleaned page to the rdfPages
            backup directory so it can be piped to wikirestore
    '''
    f = open(path, 'r')
    eventlist = f.readlines()
    i = 0
    newwiki = ''
    # Keep only the lines between "{{" and the matching "}}"; everything else is free text.
    while i < len(eventlist):
        if "{{" in eventlist[i]:
            while True:
                newwiki += eventlist[i]
                if "}}" in eventlist[i] or i == len(eventlist) - 1:
                    break
                i += 1
        i += 1
    if restoreOut:
        if newwiki == '':
            return
        dirname = getHomePath + '/wikibackup/rdfPages/'
        fullpath = dirname + ntpath.basename(path)
        print(fullpath)
        f.close()
        ensureDirectoryExists(dirname)
        f = open(fullpath, 'w')
        f.write(newwiki)
        f.close()

Usage:

 
grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rmf | wikirestore -t ormk -stdinp -ui

Output result: ICFHR 2020

  2. The second way to do this is to remove only the 'some person' entry from the wiki free text. A Python snippet is used with the bash utility grep.

Code Snippet:

import ntpath
import re

def remove_DuplicateFields(path, restoreOut=False):
    '''
    Remove 'some person' entries from a wiki page when the same property also
    occurs with a real value, i.e. drop the redundant null-value line.
    Args:
        path(str): path to the wiki file
        restoreOut(bool): if True, write the cleaned page to the rdfPages backup directory
    '''
    f = open(path, 'r')
    event = f.read()
    flag = False
    eventlist = event.split('\n')
    for i in eventlist:
        if 'some person' in i:
            # Extract the property name, e.g. 'Has person' from '[[Has person::some person]]'.
            chk = re.findall(r'\[\[(.*?)::', i)
            # Drop the line only if the property occurs more than once on the page,
            # i.e. a proper value exists alongside the 'some person' null value.
            if chk and event.count(chk[0]) > 1:
                flag = True
                event = event.replace(i, '')
    if restoreOut:
        if flag:
            # getHomePath and ensureDirectoryExists are the assumed project helpers (see above).
            dirname = getHomePath + '/wikibackup/rdfPages/'
            fullpath = dirname + ntpath.basename(path)
            print(fullpath)
            f.close()
            ensureDirectoryExists(dirname)
            f = open(fullpath, 'w')
            f.write(event)
            f.close()

Usage:

grep 'some person' -r '/path/to/backup' -l | python scripts/Data_Fixes.py -stdin -ro -rdf | wikirestore -t ormk -stdinp -ui

Output result: ACCV_2020


Main function for command-line usage:

import argparse
import sys

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', dest="path", help='path to the file')
    parser.add_argument('-stdin', dest="pipein", action='store_true', help='use the input from STDIN via pipes')
    parser.add_argument('-ro', dest="ro", action='store_true', help='output to STDOUT for wikirestore by pipe')
    parser.add_argument('-rmf', dest="rmf", action='store_true', help='flag to remove free text from the wiki file')
    parser.add_argument('-rdf', dest="rdf", action='store_true', help='flag to remove duplicate null value fields from the wiki file')
    args = parser.parse_args()
    if args.pipein:
        # Read the list of page paths from STDIN once, so -rmf and -rdf can be combined.
        allx = sys.stdin.readlines()
        if args.rmf:
            for i in allx:
                remove_FreeText(i.strip(), restoreOut=args.ro)
        if args.rdf:
            for i in allx:
                remove_DuplicateFields(i.strip(), restoreOut=args.ro)
    elif args.path is not None:
        if args.rmf:
            remove_FreeText(args.path, restoreOut=args.ro)
  • Usage Statistics:
    • The null field 'some person' has been used 223 times

Dates

  • 153 improper start date entries.
  • 6 improper end date entries.
  • Reference page
  • End dates, and dates in general, are sometimes filled with arbitrary strings.
  • Decision needed: remove the field altogether or fix the values?
    • Fixing them requires manual intervention.
    • Removing a field altogether can be done with a small code snippet (see the sketch below).
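As an illustration of the removal option, the sketch below strips a single field line from backed-up wiki pages; the field name ('End date') and the STDIN-based invocation are examples, not the decided fix.

# Illustrative sketch: drop a '|<field>=...' line (here 'End date' as an example)
# from wiki backup pages whose paths arrive on STDIN. Field name and usage are
# placeholders; no decision on removing vs. fixing is implied.
import re
import sys

def remove_field(path, field="End date"):
    with open(path) as f:
        content = f.read()
    # Remove the whole '|End date=...' line inside the WikiSon template block.
    cleaned = re.sub(r"^\|%s=.*\n?" % re.escape(field), "", content, flags=re.MULTILINE)
    if cleaned != content:
        with open(path, "w") as f:
            f.write(cleaned)
        print(path)

if __name__ == "__main__":
    for line in sys.stdin:
        remove_field(line.strip())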

Acceptance Rate Issue

  • Statistics for missing "Submitted papers" / "Accepted papers" values:
  • With bash utilities these can be found:
    • Number of pages that have the field "Submitted papers": 1716
    • Number of pages that have the field "Accepted papers": 1965
  • With a small Python code snippet the following can be found (see the sketch below):
    • Number of pages that have the field "Submitted papers" but no "Accepted papers" field: approximately 63
    • Number of pages that have the field "Accepted papers" but no "Submitted papers" field: approximately 302
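A possible way to reproduce the last two counts is sketched below: it walks a backup directory and counts pages that contain one of the two fields but not the other. The directory path and the .wiki file extension are placeholders.

# Sketch (assumptions: backup directory layout and .wiki extension): count pages
# that have 'Submitted papers' but not 'Accepted papers', and vice versa.
import glob

BACKUP_DIR = "/path/to/backup"  # placeholder

submitted_only = 0
accepted_only = 0
for path in glob.glob(BACKUP_DIR + "/**/*.wiki", recursive=True):
    with open(path, errors="ignore") as f:
        text = f.read()
    has_submitted = "Submitted papers" in text
    has_accepted = "Accepted papers" in text
    if has_submitted and not has_accepted:
        submitted_only += 1
    elif has_accepted and not has_submitted:
        accepted_only += 1

print("Submitted papers but no Accepted papers:", submitted_only)
print("Accepted papers but no Submitted papers:", accepted_only)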