# Invited Speakers

**Philip Bille **

Philip Bille is associate professor in the Algorithms, Logic, and Graph Theory group at the Technical University of Denmark, Department of Applied Mathematics and Computer Science (DTU Compute). His scientific interest is in algorithms and data structures with focus on pattern matching, data compression, and parallelism in modern computer architectures. He is the deputy head of the Center for Compressed Computing and the head of the Research Academy at DTU Compute.

**Nataša Przulj**

Nataša Przulj is professor of biomedical data science at the Department of Computer Science at University College London. She is recognized for initiating extraction of biomedical knowledge from the wiring patterns (topology, structure) of “Big Data” real-world molecular (omics) and other networks. That is, she views the wiring patterns of large and complex omics networks, disease ontologies, clinical patient data, drug-drug and drug-target interaction networks etc., as a new source of information that complements the genetic sequence data and needs to be mined and meaningfully integrated to gain deeper biomedical understanding. Her recent work includes designing machine learning methods for integration of heterogeneous biomedical and molecular data, applied to advancing biological and medical knowledge. She also applies her methods to economics.

**Rossano Venturini**

Rossano Venturini is assistant professor in the Department of Computer Science at the University of Pisa. His research focusses on compressed data structures, for which he won the Best PhD Thesis award from the Italian Chapter of the EATCS in 2012, and information retrieval, for which he won the Best Paper award at SIGIR 2014 and 2015. He was program committee co-chair for SPIRE 2017 in Palermo.

# Programme

Tuesday, October 9th |
|||

08:45 | Registration | ||

09:00 | Invited (SISAP) | Humans, Machines, and Work: The Future is Now | Moshe Vardi (US) |

10:00 | Coffee | ||

10:10 | SISAP | On the Analysis of Compressed Chemical Fingerprints | Ricardo C. Sperandio (France), Simon Malinowski (France), Laurent Amsaleg (France) and Romain Tavenard (France). |

SISAP | Time Series Retrieval using DTW-Preserving Shapelets | Fabio Grandi (Italy) | |

Joint with SISAP | Adaptive Computation of the Discrete Frechet Distance | Jérémy Barbay (Chile) | |

Joint with SISAP | Computing Burrows-Wheeler Similarity Distributions for String Collections | Felipe A. Louza (Brazil), Guilherme P. Telles (Brazil), Simon Gog (US) and Liang Zhao (Brazil) | |

11:50 | Coffee | ||

12:00 | Invited | Mining the Integrated Connectedness of Biomedical Systems | Nataša Pržulj (UK) |

13:00 | SISAP Closing | ||

13:10 | Lunch | ||

14:30 | Excursion and Dinner | ||

23:00 | Finish |

Wednesday, October 10th |
|||

09:00 | Registration | ||

10:00 | Invited | Data Compression: The Whole is Larger than the Sum of its Parts | Rossano Venturini (Italy) |

11:00 | Coffee | ||

11:20 | Information Retrieval | Fast and Effective Neural Networks for Translating Natural Language into Denotations | Tiago Pimentel (Brazil), Juliano Viana (Brazil), Adriano Veloso (Brazil) and Nivio Ziviani (Brazil) |

11:40 | Information Retrieval | Early Commenting Features for Emotional Reactions Prediction | Anastasia Giachanou (Switzerland), Paolo Rosso (Spain), Ida Mele (Italy) and Fabio Crestani (Switzerland) |

12:00 | Lunch | ||

14:00 | LCS and LCP | Compressed Communication Complexity of Longest Common Prefixes | Philip Bille (Denmark), Mikko Berggreen Ettienne (Denmark), Roberto Grossi (Italy), Inge Li Gørtz (Denmark) and Eva Rotenberg (Denmark) |

14:20 | LCS and LCP | Better heuristic algorithms for the Repetition Free LCS and other variants | Radu-Stefan Mincu (Romania) and Alexandru Popa (Romania) |

14:40 | LCS and LCP | Longest Property-Preserved Common Factor | Lorraine A.K. Ayad (UK), Giulia Bernardini (Italy), Roberto Grossi (Italy), Costas Iliopoulos (UK), Nadia Pisanti (Italy), Solon Pissis (UK) and Giovanna Rosone (Italy) |

14:55 | LCS and LCP | Indexed Dynamic Programming to boost Edit Distance and LCSS Computation | Jérémy Barbay (Chile) and Andrés Olivares (Chile) |

15:15 | LCS and LCP | Longest Common Prefixes with k-Errors and Applications | Lorraine Ayad (UK), Carl Barton (UK), Panagiotis Charalampopoulos (UK), Costas Iliopoulos (UK) and Solon Pissis (UK) |

15:35 | Coffee | ||

15:55 | (Re)Construction | Optimal In-Place Suffix Sorting | Zhize Li (China), Jian Li (China) and Hongwei Huo (China) |

16:15 | (Re)Construction | Fast Wavelet Tree Construction in Practice | Yusaku Kaneta (Japan) |

16:35 | (Re)Construction | Recovering, counting and enumerating strings from forward and backward suffix arrays | Yuki Kuhara (Japan), Yuto Nakashima (Japan), Shunsuke Inenaga (Japan), Hideo Bannai (Japan) and Masayuki Takeda (Japan) |

16:55 | (Re)Construction | Linear-Time Online Algorithm Inferring the Shortest Path from a Walk | Shintaro Narisada (Japan), Diptarama Hendrian (Japan), Ryo Yoshinaka (Japan) and Ayumi Shinohara (Japan) |

17:15 | Finish |

# Spire 2018 Proceedings

As in past editions, the proceedings of SPIRE 2018 are published by Springer in the Lecture Notes in Computer Science (LNCS 11147) series.

Limited-time free access to the conference proceedings is available from here.

# Accepted Papers

Recoloring the Colored de Bruijn Graph

**Abstract:**The colored de Bruijn graph, an extension of the de Bruijn graph, is routinely applied for variant calling, genotyping, genome assembly, and various other applications [11]. In this data struc- ture, the edges are labelled with one or more colors from a set C, and are stored as a m × |C| matrix, where m is the number of edges. Since both m and |C| can be significantly large, the matrix should be represented in a space-efficient manner. Recently, there has been a significant amount of work in de- veloping compacted representations of this color matrix but all existing methods have focused on com- pressing the color matrix. In this paper, we explore the problem of recoloring in order to reduce the size of the color matrix. We show that finding the minimum number of colors to have a valid recoloring is a NP-hard problem, and thus, motivate our development of a recoloring heuristic that greedily merge the colors in the colored de Bruijn graph. Our results show that this heuristic is able to reduce the number of colors between one and two orders of magnitude, and the size of the matrix by almost half. This work is publicly available at https://github.com/baharpan/cosmo/tree/Recoloring.

Indexed Dynamic Programming to boost Edit Distance and LCSS Computation

**Abstract:**There are efficient dynamic programming solutions to the computation of the Edit Distance from $S\in[1..\sigma]^n$ to $T\in[1..\sigma]^m$, for many natural subsets of edit operations, typically in time within $O(nm)$ in the worst-case over strings of respective lengths $n$ and $m$ (which is likely to be optimal), and in time within $O(n{+}m)$ in some special cases (e.g. disjoint alphabets). We describe how indexing the strings (in linear time), and using such an index to refine the recurrence formulas underlying the dynamic programs, yield faster algorithms in a variety of models, on a continuum of classes of instances of intermediate difficulty between the worst and the best case, thus refining the analysis beyond the worst case analysis. As a side result, we describe similar properties for the computation of the Longest Common Sub Sequence $\idtt{LCSS}(S,T)$ between $S$ and $T$, since it is a particular case of Edit Distance, and we discuss the application of similar algorithmic and analysis techniques for other dynamic programming solutions. More formally, we propose a parameterized analysis of the computational complexity of the Edit Distance for various set of operators and of the Longest Common Sub Sequence in function of the area of the dynamic program matrix relevant to the computation.

Adaptive Computation of the Discrete Frechet Distance

**Abstract:**The discrete Fr{\’e}chet distance is a measure of similarity between point sequences which permits to abstract differences of resolution between the two curves, approximating the original Fr{\’e}chet distance between curves. Such distance between sequences of respective length $n$ and $m$ can be computed in time within $O(nm)$ and space within $O(n+m)$ using classical dynamic programing techniques, a complexity likely to be optimal in the worst case over sequences of similar lenght unless the Strong Exponential Hypothesis is proved incorrect. We propose a parameterized analysis of the computational complexity of the discrete Fr{\’e}chet distance in fonction of the area of the dynamic program matrix relevant to the computation, measured by its \emph{certificate width} $\omega$. We prove that the discrete Fr{\’e}chet distance can be computed in time within $((n+m)\omega)$ and space within $O(n+m+\omega)$.

Compressed Range Minimum Queries

**Abstract:**Given a string S of n integers in [0,sigma), a range minimum query RMQ(i,j) asks for the index of the smallest integer in S[i…j]. It is well known that the problem can be solved with a data structure of size O(n) and constant query-time. In this paper we show how to preprocess S into a compressed representation that allows fast range minimum queries. This allows for sublinear size data structures with logarithmic query time.

The most natural approach is to use string compression and construct a data structure for answering range minimum queries directly on the compressed string. We investigate this approach using grammar compression. We then consider an alternative approach. Even if S is not compressible, its cartesian tree necessarily is. Therefore, instead of compressing S using string compression, we compress the cartesian tree of S using tree compression. We show that this approach can be exponentially better than the former, and is never worse by more than an O(sigma) factor (i.e. for constant alphabets it is never asymptotically worse).

Maximal Motif Discovery in a Sliding Window

**Abstract:**With the current explosion of genomic data, the need to analyse it efficiently is growing. As next-generation sequencing technology advances, there is an increase in the production of genomic data that requires de novo assembly and analyses. One such analysis is motif discovery. Motifs are relatively short sub-sequences that are biologically significant, and may contain don’t cares (special symbols that match any symbol in the alphabet). Examples of these are protein-binding sites, such as transcription factor recognition sites. Maximal motifs are motifs that cannot be extended or specialised without losing occurrences. We present an on-line algorithm that finds all maximal motifs, that occur at least k times and contain no more than d don’t cares, in a sliding window on the input string.

Faster Recovery of Approximate Periods over Edit Distance (short paper)

**Abstract:**The approximate period recovery problem asks to compute all approximate word-periods of a given word S of length n: all primitive words P (|P|=p) which have a periodic extension at edit distance smaller than τ from S, where τ < ⌊ n / ((3.75+ε) · |P|) ⌋ for some ε>0. Here, the set PerExt(P) of periodic extensions of P consists of all finite prefixes of P^∞.

We improve time complexity of the fastest known algorithm for this problem of Amir et al. (TCS 2018) from O(n^{4/3}) to O(n log n). Our tool is a fast algorithm for Approximate Pattern Matching in Periodic Text. We consider only verification for the period recovery problem when the candidate approximate word-period P is explicitly given up to cyclic rotation; the algorithm of Amir et al. reduces the general problem in O(n) time to a logarithmic number of such more specific instances.

New structures to solve aggregated queries for trips over public transportation networks

**Abstract:**Representing the trajectories of mobile objects is a hot topic from the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains) where user’s trips can be seen as a sequence of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces the redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey. To analyze those problems, we designed a conceptual model that gave us a better insight into this data domain and allowed us the definition of relevant terms and the detection of redundancy sources among those data. Then, we designed two compact representations focused in users’ trips (TTCTR) and in vehicle trips (AcumM) respectively. Each approach owns some strengths and is able to answer efficiently some queries.

In this paper, we present all that work and experimental results over synthetic trips generated from accurate schedules obtained from a real GTFS network description (from Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.

Faster and Smaller Two-Level Index for Network-based Trajectories

**Abstract:**Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top-level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the \textit{interval-intersection} problem, but it has been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure on the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.

Fast and Effective Neural Networks for Translating Natural Language into Denotations

**Abstract:**In this paper we study the semantic parsing problem of mapping natural language utterances into machine interpretable meaning representations. We consider a text-to-denotation application scenario in which a user interacts with a non-human assistant by entering a question, which is then translated into a logical structured query and the result of running this query is finally returned as response to the user. We propose encoder-decoder models that are trained end-to-end using the input questions and the corresponding logical structured queries. In order to ensure fast response times, our models do not condition the target string generation on previously generated tokens. We evaluate our models on real data obtained from a conversational banking chat service, and we show that conditionally-independent translation models offer similar accuracy numbers when compared with sophisticate translation models and present one order of magnitude faster response times.

3DGraCT: A Grammar based Compresed representation of 3D Trajectories

**Abstract:**The management of trajectories has attracted much research work, in most cases focused on trajectories on the ground or the sea, and probably, aircraft trajectories have received less attention. However, the need for a more efficient management of the airspace launched several ambitious research projects, where, as in the traditional case, space and query response times are important issues.

This work presents a method for representing aircraft trajectories. The new method uses a compact data structure approach, and then, it is possible to directly query the compressed data without a previous decompression. In addition, in the same compressed space, the data structure includes several access methods to accelerate the searches.

Fast Wavelet Tree Construction in Practice

**Abstract:**The wavelet tree and matrix are compact data structures that support a wide range of operations on a sequence of $n$ integers in~$[0,\sigma)$ using $n\lg\sigma + o(n\lg\sigma)$ bits of space. Although Munro et al. (SPIRE 2014 and Theoretical Computer Science) and Babenko et al. (SODA 2015) showed that wavelet trees (and matrices) can be constructed in $O(n\lg\sigma/\sqrt{\lg n})$ time, there has been no empirical study on their construction method, possibly due to its heavy use of precomputed tables, seemingly limiting its practicality. In this paper, we propose practical variants of their fast construction of wavelet trees. Instead of using huge precomputed tables, we introduce new techniques based on broadword programming and special CPU instructions available for modern processors. Our experiments using real-world datasets showed that our proposed methods were up to 2.2 and 4.5 times as fast as a naive one for both wavelet trees and matrices, respectively, and up to 1.9 times as fast as existing ones for wavelet matrices.

Compressed Communication Complexity of Longest Common Prefixes

**Abstract:**We consider the communication complexity of fundamental longest common prefix (LCP) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common prefix l=LCP(A,B) using as few rounds and bits of communication as possible. We show that if the longest common prefix of A and B is compressible, then we can significantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication.

Namely, if the longest common prefix has an LZ77 parse of z phrases, only O(lg z) rounds and O(lg l) total communication is necessary.

We extend the result to the natural case when Bob holds a set of strings B1, …, Bk, and the goal is to find the length of the maximal longest prefix shared by A and any of B1, …, Bk. Here, we give a protocol with O(log z) rounds and O(lg z lg k + lg l) total communication.

We present our result in the public-coin model of computation but by a standard technique our results generalize to the private-coin model. Furthermore, if we view the input strings as integers the problems are the greater-than problem and the predecessor problem.

Recovering, counting and enumerating strings from forward and backward suffix arrays

**Abstract:**The suffix array SA_w of a string w of length n is a permutation of [1..n] such that SA_w[i] = j iff w[j..n] is the lexicographically i-th suffix of w. In this paper, we consider variants of the reverse-engineering problem on suffix arrays with two given permutations P and Q of [1..n], such that P refers to the forward suffix array of some string w and Q refers to the backward suffix array of the reversed string w^R. Our results are the following: (1) An algorithm which computes a solution string over an alphabet of the smallest size, in O(n) time. (2) The exact number of solution strings over an alphabet of size \sigma. (3) An efficient algorithm which computes all solution strings in the lexicographical order, in time near optimal up to \log n factor.

Early Commenting Features for Emotional Reactions Prediction

**Abstract:**Nowadays, one of the main sources for people to access and read news are social media platforms. Different types of news trigger different emotional reactions to users who may feel happy or sad after reading a news article. In this paper, we focus on the problem of predicting emotional reactions that are triggered on users after they read a news post. In particular, we try to predict the number of emotional reactions that users express regarding a news post that is published on social media. In this paper, we propose features extracted from users’ comments published about a news post shortly after its publication to predict users’ the triggered emotional reactions. We explore two different sets of features extracted from users’ comments. The first group represents the activity of users in publishing comments whereas the second refers to the comments’ content. In addition, we combine the features extracted from the comments with textual features extracted from the news post. Our results show that features extracted from users’ comments are very important for the emotional reactions prediction of news posts and that combining textual and commenting features can effectively address the problem of emotional reactions prediction.

Towards a compact representation of temporal rasters

**Abstract:**Big research efforts have been devoted to efficiently manage spatio-temporal data. However, most works focused on vectorial data, and much less, on raster data. This work presents a new representation for raster data that evolve along time named Temporal k2-raster. It faces the two main issues that arise when dealing with spatio-temporal data: the space consumption and the query response times. It

extends a compact data structure for raster data in order to manage time and thus, it is possible to query it directly in compressed form, instead of the classical approach that requires a complete decompression before any manipulation.

In addition, in the same compressed space, the new data structure includes two indexes: a spatial index and an index on the values of the cells, thus becoming a self-index for raster data.

[short paper] Block Palindromes: A New Generalization of Palindromes

**Abstract:**We propose a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and in particular, study substrings of a string which are block palindromes. In so doing, we introduce the notion of maximal block palindromes, which are a compact representation of all block palindromes that occur in a string. We also propose an algorithm which enumerates all maximal block palindromes that appear in a given string T in O(|T | + ∥MBP (T )∥) time, where ∥MBP(T)∥ is the output size, which is optimal unless all the maximal block palindromes can be represented in a more compact way.

Truncated DAWGs and their application to minimal absent word problem

**Abstract:**The \emph{directed acyclic word graph} (\emph{DAWG}) of a string $y$ is the smallest (partial) DFA which recognizes all suffixes of $y$ and has $O(n)$ nodes and edges. Na et al. proposed $k$-truncated suffix tree which is a compressed trie that represents substrings of a string whose length is $k$ and suffixes whose length is less than $k$. In this paper, we present a new data structure called \emph{$k$-truncated DAWGs}, which can be obtained by pruning the DAWGs. We show that the size complexity of the $k$-truncated DAWG of a string $y$ of length $n$ is $O(\min\{n,kz\})$ which is equal to the truncated suffix tree’s one, where $z$ is the size of LZ77 factorization of $y$. We also present an $O(n\log \sigma)$ time and $O(\min\{ n,kz\})$ space algorithm for constructing the $y$-truncated DAWG of $y$, where $\sigma$ is the alphabet size. As an application of the truncated DAWGs, we show that the set $\MAW_k(y)$ of all minimal absent words of $y$ whose size is smaller than or equal to $k$ can be computed by using $k$-truncated DAWG of $y$ in $O(\min\{ n, kz\} + |\MAW_k(y)|)$ time and $O(\min\{ n,kz\})$ working space.

On Extended Special Factors of a Word

**Abstract:**An extended special factor of a word x is a factor of x whose longest infix can be extended by at least two distinct letters to the left or to the right and still occur in x. It is called extended bispecial if it can be extended in both directions and still occur in x. Let f(n) be the maximum number of extended bispecial factors over all words of length n. Almirantis et al have shown that 2n – 6 ≤ f(n) ≤ 3n-4 (WABI 2017). In this article, we show that there is no constant c<3 such that f(n) ≤ cn. We then exploit the connection between extended special factors and minimal absent words to construct a data structure for computing minimal absent words of a specific length in optimal time for integer alphabets generalising a result by Fujishige et al (MFCS 2016). As an application of our data structure, we show how to compare two words over an integer alphabet in optimal time improving on another result by Crochemore et al (LATIN 2016).

Searching for a Modified Pattern in a Changing Text

**Abstract:**Much attention has been devoted recently to the dynamic model of pattern matching. In this model the input is updated or changed locally. One is interested in obtaining the appropriate search result in time that is shorter than the time necessary to search without any previous knowledge. In particular, searching for a pattern P in an indexed text is done in optimal O(|P |) time. There has been work done in searching for a fixed pattern in a dynamic text, as well as finding all maximum common factors of two strings in a dynamic setting.

There are real-world applications where the text is unchanged and the pattern is slightly modified at every query. However, in the current state- of-the-art, a new search is required if the pattern is modified. In this paper we present an algorithm that reduces this search time to be sublinear, for a variety of types of pattern modification – in addition to the insertion, deletion, and replacement of symbols, we allow copy-paste and delete substring operations.

We also make a step toward a fully dynamic pattern matching model by also supporting text changes, albeit much more modest than the pattern changes. We support dynamic pattern matching where symbols may also be either added or deleted to either the beginning or the end of the text. We show that we can support such a model in time O(log n) for every pattern modification or text change. We can then report all occ occurrences of P in the text in O(occ) time.

Optimal In-Place Suffix Sorting

**Abstract:**The suffix array is a fundamental data structure for many applications that involve string searching and data compression.

Designing time/space-efficient suffix array construction algorithms has attracted significant attentions and considerable advances have been made for the past 20 years.

We obtain the \emph{first} in-place linear time suffix array construction algorithms that are optimal both in time and space for (read-only) integer alphabets.

Our algorithm settles the open problem posed by Franceschini and Muthukrishnan in ICALP 2007. The open problem asked to design in-place algorithms in $o(n\log n)$ time and ultimately, in $O(n)$ time for (read-only) integer alphabets with $|\Sigma| \leq n$.

Our result is in fact slightly stronger since we allow $|\Sigma|=O(n)$.

Besides, we provide an optimal in-place $O(n\log n)$ time suffix sorting algorithm for read-only general alphabets (i.e., only comparisons are allowed), recovering the result obtained by Franceschini and Muthukrishnan which was an open problem posed by Manzini and Ferragina in ESA 2002.

Trickier XBWT Tricks

**Abstract:**A trie is one of the best data structures for implementing and searching a dictionary. However, to build the trie structure for larger

collections of strings takes up a lot of memory. Since the eXtended Burrows-Wheeler Transform (XBWT) is able to compactly represent a labeled tree, it can naturally be used to succinctly represent a trie. The XBWT also supports navigational operations on the trie, but it does not support failure links. For example, the Aho-Corasick algorithm for simultaneously searching for several patterns in a text achieves its good worst-case time complexity only with the aid of failure links. Manzini showed that a balanced parentheses sequence P can be used to support failure links in constant time with only 2n+o(n) bits of space, where n is the number of internal nodes in the trie. Besides practical algorithms that construct the XBWT, he also provided two different algorithms that construct P. In this paper, we suggest an alternative way for constructing P that outperforms the previous algorithms.

Computing Burrows-Wheeler Similarity Distributions for String Collections

**Abstract:**In this article we present practical and theoretical improvements to the computation of the Burrows-Wheeler similarity distribution (BWSD) for all pairs of strings in a collection. Our algorithms take advantage of the Burrows-Wheeler transform (BWT) computed for the concatenation of all strings, instead of the pairwise construction of BWTs performed by the straightforward approach, and use compressed data structures that allow reductions of running time while still keeping a small memory footprint, as shown by a set of experiments with real datasets.

Longest Common Prefixes with k-Errors and Applications

**Abstract:**Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ that occurs elsewhere in the string with $k$-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for {\em non-constant $k$} and using only {\em linear space} under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $\cO(n\frac{(c\log n)^k}{k!})$ time on average, where $c>1$ is a constant, using $\cO(n)$ space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant $k$ and using $\cO(n)$ space.

Linear-Time Online Algorithm Inferring the Shortest Path from a Walk

**Abstract:**We consider the problem of inferring an edge-labeled graph from the sequence of edge labels seen in a walk of that graph.

It has been known that this problem is solvable in $O(n \log n)$ time when the targets are path or cycle graphs.

This paper presents an online algorithm for the problem of this restricted case that runs in $O(n)$ time, based on Manacher’s algorithm for computing all the maximal palindromes in a string.

Longest Property-Preserved Common Factor

**Abstract:**In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider the fundamental property of periodicity under two different settings. In the first one, we are given a string $x$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find a longest periodic factor common to $x$ and $y$. In the second, we are given $k$ strings and an integer $1 < k’\leq k$ and we are asked to find a longest periodic factor common to at least $k’$ strings. We present linear-time solutions for both settings.

**Abstract:**Sequence mappability is an important factor in genome sequencing. In the $(k,m)$-mappability problem, for a given a sequence $T$ of length $n$ our goal is to compute a table $A$ such that $A[i]$ is the number of indices $j$ such that $m$-length substrings of $T$ starting at positions $i$ and $j$ are at Hamming distance at most $k$. Previous work on this problem focused on heuristic approaches that provided a rough approximation of the result or on the special case of $k=1$. We present several efficient algorithms for the general case of the problem. The main result is an algorithm that works in time $O(n \min(\log^{k+1} n, m^k))$ and linear space. It requires a careful adaptation of the technique of Cole et al. (STOC 2004) in order to avoid multiple counting of pairs of substrings. We also show $O(n^2)$-time algorithms that compute all results for a fixed $m$ and all $k=1,\ldots,n$ or a fixed $k$ and all $m=1,\ldots,n$. Finally we show that $(k,m)$-mappability cannot be computed in strongly subquadratic time for $k,m = \Omega(\log n)$ unless SETH fails.

Better heuristic algorithms for the Repetition Free LCS and other variants

**Abstract:**In Discrete Applied Mathematics 2010, Adi et al. introduce and study a variant of the well known Longest Common Subsequence problem, named \emph{Repetition Free Longest Common Subsequence (RFLCS)}. In RFLCS the input consists of two strings $A$ and $B$ over an alphabet $\Sigma$ and the goal is to find the longest common subsequence containing only distinct characters from $\Sigma$. Adi et al. prove that the problem is $\mathcal{APX}$-hard and show three approximation algorithms. Castelli et al. (Operations Research Letters 2013) propose a heuristic approximation algorithm and Blum and Blesa introduce an exact algorithm based on ILP (Journal of Heuristics 2017).

In this paper we design and test several new approximation algorithms for RFLCS. The first algorithm, which we believe is the most important contribution, uses dynamic programming and on our tests performs noticeably better than the algorithms of Adi et al.. The second algorithm transforms the RFLCS instance into an instance of the Maximum Independent Set (MIS) problem with the same value of the optimum solution. Then, we apply known algorithms for the MIS problem. We also augment one of the approximation algorithms of Adi et al. and we prove that we achieve an approximation of factor $2\sqrt{\min\{|A|,|B|\}}$.

Finally, we introduce two variants of the LCS problem. For one of the problems, named \emph{Pinned Longest Common Subsequence (PLCS)} we present an exact polynomial time algorithm based on dynamic programming. The other variant, named \emph{Multiset Restricted Common Subsequence (MRCS)} is a generalization of RFLCS. We present an exact polynomial time algorithm for MRCS for constant size alphabet. Also, we show that MRCS admits a $2\sqrt{\min\{|A|,|B|\}}$ approximation.

The colored longest common prefix array computed via sequential scans

**Abstract:**Due to the increased availability of large datasets of biological sequences, the tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most of the alignment-free approaches require the computation of statistics of the sequences in the dataset. Such computations become impractical in internal memory when very large collections of long sequences are considered.

In this paper, we present a new conceptual data structure, the colored longest prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string ACS problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the effectiveness of our approach.

# Excursion to Pachacamac

Pachacamac is an archaeological site located on the right bank of the Lurin River, very close to the Pacific Ocean and in front of a group of homonymous islands. It is located in the district of Lurin (2 hours from Lima), in the province of Lima, in Peru. It contains the remains of various buildings, dating from the Early Intermediate (3rd century) to the Late Horizon (15th century), with the best preserved Inca buildings (1450-1532).

## Your browser does not support inline frames!