Tag: algorithms

Entries for tag "algorithms", ordered from most recent. Entry count: 65.

Note: the information on this page is over 5 years old. I still make it available, but it probably no longer reflects my current knowledge or beliefs.

# Processing File Paths in C++

Tue, 12 Jan 2010

Paths to files and directories are a special kind of string. Sometimes they have to be processed, e.g. merged or split into parts like drive letter, directory, file name and extension. The C runtime library provides two functions for this: makepath and splitpath. They use char* strings and always operate on all possible parts (drive, dir, fname, ext). In my opinion they are not very convenient. On the other hand, modern programming languages offer more functionality in this field. For example, .NET has the System.IO.Path static class with methods like GetExtension, GetFileName, GetDirectoryName.

To make operations on file paths convenient in C++, some time ago I developed my own functions for that, which operate on std::string strings. I've included them in my CommonLib library. They are a bit similar to those from Delphi :) Here is how I approach this topic: first, a common task is to extract a part of a path (e.g. "C:\Windows\notepad.exe"), like the file name or the extension, as sketched below.
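
A minimal sketch of such helpers could look like this (hypothetical names operating on std::string - illustrative, not the actual CommonLib API):

#include <string>

// "C:\Windows\notepad.exe" -> ".exe" (empty string if there is no extension)
std::string ExtractFileExt(const std::string &path)
{
  size_t dotPos = path.find_last_of('.');
  size_t sepPos = path.find_last_of("\\/");
  if (dotPos == std::string::npos ||
    (sepPos != std::string::npos && dotPos < sepPos))
    return std::string();
  return path.substr(dotPos);
}

// "C:\Windows\notepad.exe" -> "notepad.exe"
std::string ExtractFileName(const std::string &path)
{
  size_t sepPos = path.find_last_of("\\/");
  return sepPos == std::string::npos ? path : path.substr(sepPos + 1);
}

// "C:\Windows\notepad.exe" -> "C:\Windows\"
std::string ExtractFilePath(const std::string &path)
{
  size_t sepPos = path.find_last_of("\\/");
  return sepPos == std::string::npos ? std::string() : path.substr(0, sepPos + 1);
}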

I also have some more functions for general processing of paths.

A path can be absolute or relative. In Windows, absolute paths start with a drive letter like "C:\" (which we can recognize by looking for the colon) or a network share (paths that begin with "\\"). Relative paths (of which the most basic example is a single file name) make sense only in some context. When we use relative paths to open real files, we address them relative to the "Current Directory" (which is a separate topic - I will write about it another time). To deal with absolute and relative paths, I've coded some more functions.

The NormalizePath function processes all the "." and ".." directories in a path. A single dot used as a directory name means "my own directory" (it simply does nothing) and two dots mean going up one level to the parent directory. So the path we consider here is logically equal to "C:\Windows\.\system32\..\notepad.exe". Paths like this still work, but they can look a bit unclear, so you can use the NormalizePath function to clean them up.
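
A sketch of how such normalization could work (hypothetical and simplified: it outputs the Windows separator and ignores edge cases like a ".." leading above the root):

#include <string>
#include <vector>

std::string NormalizePath(const std::string &path)
{
  std::vector<std::string> parts;
  std::string part;
  for (size_t i = 0; i <= path.size(); i++)
  {
    if (i == path.size() || path[i] == '\\' || path[i] == '/')
    {
      if (part == "..")
      {
        // Go up one level: drop the previous directory if there is one.
        if (!parts.empty() && parts.back() != "..")
          parts.pop_back();
        else
          parts.push_back(part);
      }
      else if (!part.empty() && part != ".")
        parts.push_back(part); // A single "." does nothing - skip it.
      part.clear();
    }
    else
      part += path[i];
  }
  std::string result;
  for (size_t i = 0; i < parts.size(); i++)
  {
    if (i > 0) result += '\\';
    result += parts[i];
  }
  return result;
}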

Paths in Linux are quite different. My functions also support them, although I haven't tested this case in real situations yet. First of all, there are no drive letters in Linux, so absolute paths just start with '/', like "/etc/passwd". Second, a slash is used as the separator instead of a backslash (interestingly, slash also works in Windows). It's also more common in Linux than in Windows to encounter files that have no extension or have a double extension (like "MyFile.tar.gz"). A question arises here: does what we call the "extension" start from the first or from the second dot? :)

Comments | #algorithms #c++ #commonlib Share

# Writing Parsers is Easy

Sun, 27 Dec 2009

Using binary file formats is easier, as data can be read and written almost exactly as it is laid out in memory, but sometimes we have to use a text format. Whenever it is neither a standard format like XML or YAML, nor so simple that we can load it with std::cin or scanf, we have to write a parser to read it. Writing a parser - code that reads and interprets some text format (e.g. a file format or a network protocol) - looks like a very complex task, but here I want to convince you that's not true: it is actually quite easy.

Why do I think so? It may look like a very long way from reading characters from a file to interpreting them as data records containing numbers, strings and other meaningful information. My answer is to split this problem into three layers.

The lowest layer is called CharReader. All it can do is read single characters. It provides a layer of abstraction over reading data from a real source, so you can forget whether you read from a file (hopefully with some buffering - never do file reads byte-after-byte!) or just iterate over a string fully loaded into memory. You can also forget about encoding, so you see your data as characters, not bytes. It's a good idea to keep the last read character somewhere so it can be "peeked" (read without consuming it) many times before going to the next one. The interface of a simple CharReader may look like this:

class CharReader {
public:
  CharReader(const std::string &filePath);
  // Returns the current character as out, without consuming it.
  // Returns false if the cursor is at the end.
  bool Peek(char &out);
  // Moves cursor one character forward.
  void Next();
};
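
As a rough illustration, here is a minimal implementation that iterates over a string fully loaded into memory (an assumption for brevity - a file-based version would add buffered reading behind the same interface):

#include <string>

class StringCharReader
{
public:
  StringCharReader(const std::string &data) : m_Data(data), m_Pos(0) { }
  // Returns the current character without consuming it.
  // Returns false if the cursor is at the end.
  bool Peek(char &out)
  {
    if (m_Pos >= m_Data.length())
      return false;
    out = m_Data[m_Pos];
    return true;
  }
  // Moves the cursor one character forward.
  void Next()
  {
    if (m_Pos < m_Data.length())
      m_Pos++;
  }
private:
  std::string m_Data;
  size_t m_Pos;
};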

The next layer is called Tokenizer. We need it because it's still complicated to interpret characters directly as structured information. Data in almost every text format is made of multi-character pieces of information like numbers, strings etc., so it's convenient to provide another layer of abstraction and see the data as a sequence of tokens, not single characters. Typical kinds of tokens, which can also be found in programming languages, are: identifiers, keywords, numbers, strings, and symbols (operators, punctuation).

The Tokenizer uses a CharReader to read characters and interprets them as tokens. Just like CharReader, it buffers and provides access to the single token that was parsed most recently, so you can read it many times before advancing to the next one. It's also the Tokenizer's responsibility to skip comments and whitespace between tokens. An example interface may look like this:

class Tokenizer {
public:
  Tokenizer(CharReader &cr);
  
  enum TOKEN_TYPE {
    TOKEN_END,
    TOKEN_IDENTIFIER,
    TOKEN_STRING,
    TOKEN_SYMBOL,
  };
  // Returns type of the last read token.
  TOKEN_TYPE GetType();
  // Returns content of the last read token.
  const std::string & GetContent();
  // Moves cursor one token forward.
  void Next();
};

And finally, once you are able to see the file as a sequence of tokens, it's easy to code the third layer - the Parser - on top of that. The Parser reads subsequent tokens and interprets them as meaningful data. For example, you can parse a Key="Value" pair by expecting a token of type identifier and saving its content as Key, then expecting the (=) symbol and finally expecting a string token and saving its content as Value. Of course you also have to find a way to report errors at all these stages, including information about the current row and column.
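
As an illustration, a fragment parsing the Key="Value" pair described above could look like this (a sketch built on the Tokenizer interface shown earlier; exception-based error reporting is just one possible choice):

#include <stdexcept>
#include <string>

// Parses a single Key="Value" pair using the Tokenizer interface above.
void ParseKeyValue(Tokenizer &t, std::string &outKey, std::string &outValue)
{
  if (t.GetType() != Tokenizer::TOKEN_IDENTIFIER)
    throw std::runtime_error("Expected identifier.");
  outKey = t.GetContent();
  t.Next();
  if (t.GetType() != Tokenizer::TOKEN_SYMBOL || t.GetContent() != "=")
    throw std::runtime_error("Expected '='.");
  t.Next();
  if (t.GetType() != Tokenizer::TOKEN_STRING)
    throw std::runtime_error("Expected string value.");
  outValue = t.GetContent();
  t.Next();
}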

Comments | #algorithms Share

# Random Choice with Given Probabilities

Fri, 11 Dec 2009

When we want to randomly choose one option from a set of possible items, we usually do something as simple as (rand() % numberOfItems). But what if we want to vary the probabilities of these items? Here is an algorithm for that:

As input data, we have the number of items and an array of their probabilities. Each probability is a nonnegative real number. I like to allow these numbers to be as big as I want, knowing that an item with probability 2 (or 200) is just twice as probable as an item with probability 1 (or 100). I will normalize them later.

// == Preprocessing step ==

// Input:
std::vector<float> ItemProbabilities;
// Output:
std::vector<float> ProcessedProbabilities;

// Algorithm:

size_t count = ItemProbabilities.size();
assert(count > 0);

// First pass: summation
float sum = 0.f;
for (size_t i = count; i--; ) // count-1, count-2, ..., 0
{
  float p = ItemProbabilities[i];
  assert(p >= 0.f);
  // Optional: delete items with zero probability
  if (p == 0.f)
  {
    // Items is the parallel collection holding the items themselves.
    Items.RemoveByIndex(i);
    // That's how we delete an element from std::vector by index :P
    ItemProbabilities.erase(ItemProbabilities.begin()+i);
    continue;
  }
  sum += p;
}
// The vector may have shrunk, so refresh the count before the next pass.
count = ItemProbabilities.size();
assert(count > 0 && sum > 0.f);
float sumInv = 1.f / sum;

// Second pass: building preprocessed array
ProcessedProbabilities.resize(count);
ProcessedProbabilities[0] = ItemProbabilities[0] * sumInv;
for (size_t i = 1; i < count; i++)
  ProcessedProbabilities[i] =
    ProcessedProbabilities[i-1] + ItemProbabilities[i] * sumInv;

The algorithm above does two things: it prepares the ProcessedProbabilities array so that each entry is the item's own probability plus the sum of the probabilities of all previous items, and it normalizes them to the range 0..1. As a result, the output array forms an ascending sequence, where the difference from the previous element is proportional to the item's probability with respect to the full 0..1 range. For example:

Input: 100, 100, 200
Temporary data: count=3, sum=400, sumInv=0.0025
Output: 0.25, 0.5, 1

Now that we have this preprocessed data, generating random choices with the given probabilities is fast and simple. We just generate a random real number in the range 0..1 and choose the first item for which ProcessedProbabilities[i] is greater than or equal to this number. An additional optimization would be to use binary search here - see the sketch after the code below.

// == Random choice ==

// Input:
std::vector<float> ProcessedProbabilities;
// Output:
size_t choice;

// Algorithm:
choice = 0;
float randNum = RandFloat(); // 0..1
while (choice < ProcessedProbabilities.size()-1
  && ProcessedProbabilities[choice] < randNum)
{
  choice++;
}
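
Since ProcessedProbabilities forms an ascending sequence, the binary search mentioned above maps directly onto std::lower_bound. A sketch, assuming the same RandFloat helper:

#include <algorithm> // for std::lower_bound

float randNum = RandFloat(); // 0..1
// Find the first element >= randNum.
std::vector<float>::const_iterator it = std::lower_bound(
  ProcessedProbabilities.begin(), ProcessedProbabilities.end(), randNum);
// Guard against randNum landing past the last element (e.g. exactly 1.0
// combined with floating-point rounding in the last entry).
if (it == ProcessedProbabilities.end())
  --it;
choice = it - ProcessedProbabilities.begin();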

Comments | #algorithms #math Share

# Calculating Linear and Quadratic Equation Coefficients

Sun, 30 Aug 2009

Some time ago I wrote about formulas to calculate the coefficients of linear and quadratic equations [pl] when given 2 or 3 points (x,y). Yesterday I suddenly noticed that my code needs to do it every frame, so I need functions to calculate these coefficients. Below you can find the code of my functions. They are templates, so they work with many types, including float and double, as well as vectors of any dimension and other types which have addition, subtraction and assignment operators, as well as multiplication and division by a float scalar.
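
The full code is in the linked entry; a minimal sketch of the idea could look like this (hypothetical names, derived from straightforward Lagrange interpolation - not necessarily the actual functions):

// Finds a, b such that f(x) = a*x + b passes through (x1,y1) and (x2,y2).
template <typename T>
void CalcLinearCoeffs(T &outA, T &outB,
  float x1, const T &y1, float x2, const T &y2)
{
  outA = (y2 - y1) / (x2 - x1);
  outB = y1 - outA * x1;
}

// Finds a, b, c such that f(x) = a*x^2 + b*x + c passes through the
// three given points, using Lagrange interpolation.
template <typename T>
void CalcQuadraticCoeffs(T &outA, T &outB, T &outC,
  float x1, const T &y1, float x2, const T &y2, float x3, const T &y3)
{
  float d1 = (x1 - x2) * (x1 - x3);
  float d2 = (x2 - x1) * (x2 - x3);
  float d3 = (x3 - x1) * (x3 - x2);
  outA = y1 / d1 + y2 / d2 + y3 / d3;
  outB = y1 * (-(x2 + x3) / d1) + y2 * (-(x1 + x3) / d2)
       + y3 * (-(x1 + x2) / d3);
  outC = y1 * (x2 * x3 / d1) + y2 * (x1 * x3 / d2) + y3 * (x1 * x2 / d3);
}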

Read full entry > | Comments | #rendering #math #algorithms Share

# Source Code and Ray Tracer

Wed, 19 Aug 2009

As some of you are interested in seeing the source code of my 2D Software Renderer, I'll fulfill this request, although I don't think there is anything special about this code. Here is the file: SoftRender01.zip, and here is a brief description of this project:

BTW, I've started coding a ray tracer. Of course, the first primitive I support is a sphere and the first effects I've coded are shadows and reflections. Ray tracing is not magic. It's just "for each pixel, for each object: render", while rasterization is the opposite: "for each object, for each pixel: render". But it's quite fascinating that a ray tracer directly supports some effects that require special preprocessing in standard rasterization (like preparing a shadow map and an environment map in the case of my effects).

[Image: Raytracer]

Comments | #productions #rendering #algorithms Share

# Three Ways to Calculate Mean and Variance

Mon, 20 Jul 2009

There is a free e-book on DSP (Digital Signal Processing) called The Scientist and Engineer's Guide to Digital Signal Processing (by Steven W. Smith, Ph.D.). As I'm on holiday now, I've started reading it and have found some interesting information even in the introductory chapters. For example, there are three algorithms to calculate the mean and variance. Let's say we have some data in a vector of bytes and we want to calculate its statistics.

typedef unsigned int uint; // Shorthand used in the code samples below.

std::vector<unsigned char> Bytes;
// Fill Bytes...
uint N = (uint)Bytes.size();

The traditional algorithm looks like this:

float Mean = 0.f, Variance = 0.f;
for (uint i = 0; i < N; i++)
  Mean += (float)Bytes[i];
Mean /= (float)N;
for (uint i = 0; i < N; i++)
  Variance += ((float)Bytes[i] - Mean) * ((float)Bytes[i] - Mean);
Variance /= (float)(N-1);

Now the incremental one, which can return the current mean and variance at any time, while you can still post new pieces of data. All we need to implement it is to keep track of the current sample count, the sum of the samples and the sum of the squared samples. I've created a template class for that, which can be parametrized with float, double, int or other numeric types.

template <typename T>
class IncrementalMeanAndVarianceCalc
{
public:
  IncrementalMeanAndVarianceCalc() : m_Sum(T()), m_SqSum(T()), m_Count(0) { }
  void Clear() { m_Sum = m_SqSum = T(); m_Count = 0; }
  
  void Add(const T &v)
  {
    m_Sum += v;
    m_SqSum += v*v;
    m_Count++;
  }
  void Add(const T *values, uint valCount)
  {
    for (uint i = 0; i < valCount; i++)
    {
      m_Sum += values[i];
      m_SqSum += values[i]*values[i];
    }
    m_Count += valCount;
  }
  
  bool IsEmpty() { return m_Count == 0; }
  uint GetCount() { return m_Count; }
  T GetSum() { return m_Sum; }
  T GetSqSum() { return m_SqSum; }
  
  T GetMean()
  {
    assert(!IsEmpty());
    return m_Sum / m_Count;
  }
  T GetVariance(bool varianceBiased = true)
  {
    // Dividing by (m_Count - 1) needs at least 2 samples.
    assert(m_Count > (varianceBiased ? 1u : 0u));
    if (varianceBiased)
      return (m_SqSum - m_Sum*m_Sum/m_Count) / (m_Count-1);
    else
      return (m_SqSum - m_Sum*m_Sum/m_Count) / m_Count;
  }
  void GetMeanAndVariance(T &outMean, T &outVariance, bool varianceBiased = true)
  {
    assert(!IsEmpty());
    outMean = m_Sum / m_Count;
    if (varianceBiased)
      outVariance = (m_SqSum - m_Sum*m_Sum/m_Count) / (m_Count - 1);
    else
      outVariance = (m_SqSum - m_Sum*m_Sum/m_Count) / m_Count;
  }

private:
  T m_Sum, m_SqSum;
  uint m_Count;
};
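
Usage could then look like this:

IncrementalMeanAndVarianceCalc<float> calc;
calc.Add(1.f);
calc.Add(2.f);
calc.Add(4.f);
float mean, variance;
calc.GetMeanAndVariance(mean, variance);
// mean = 7/3 ~= 2.33, variance = (21 - 49/3) / 2 ~= 2.33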

Finally, there is an algorithm to calculate the mean and variance from a histogram. It is a very efficient method. If there are only 256 possible values for each sample, we don't have to do any binning, and calculating the histogram is very simple:

uint Histogram[256] = { 0 };
for (uint i = 0; i < N; i++)
  Histogram[Bytes[i]]++;

Now, the mean and variance can be calculated this way:

float Mean = 0.f, Variance = 0.f;
for (uint i = 0; i < 256; i++)
  Mean += (float)i * Histogram[i];
Mean /= (float)N;
for (uint i = 0; i < 256; i++)
  Variance += (i - Mean)*(i - Mean) * Histogram[i];
Variance /= (float)(N - 1);

Comments | #algorithms #math Share

# What is GUID and how to use it

Thu, 16 Jul 2009

There are many ways to identify objects in a collection. One of them is to keep direct pointers to objects. Another is to simply index their positions in an array. We can make the user see only some opaque indirect "handles" without any obvious meaning, like HWND or FILE*. We can also give objects string names or some numeric identifiers. SQL databases often use an integer number to identify a record in a table, with new identifiers generated as subsequent numbers. Another solution is to use GUIDs.

GUID means Globally Unique Identifier and it is a kind of general-purpose, 128-bit numeric identifier. GUIDs are generated from some random data. We can be confident about their uniqueness in any context because of their size, as 2^128 possible values is more than the number of stars in the whole observable universe :) That's why they are called "globally unique".

We could just generate 128 bits of data from any good pseudorandom number generator and call them good identifiers, but to standardize this idea a standard was created - RFC 4122 (it actually defines UUID, but let's ignore the difference). This standard says how to express GUIDs as strings made of hexadecimal digits separated by dashes (for example "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"). You can find such GUIDs everywhere. For example, run "regedit.exe" and look into the key HKEY_CLASSES_ROOT\CLSID, or open Visual C++ and run the command Tools / Create GUID from the main menu.

The standard also defines the bit structure of a UUID and the algorithms to generate it. There are several possible algorithms. Some of them use the system time or even the network card's MAC address. The simplest one is the version 4 algorithm, which uses only a random number generator. Some of the bits have a special meaning and encode which algorithm was used to generate a particular GUID.
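
As an illustration, a minimal version 4 generator might look like this (a sketch, not my actual code; it uses rand() for brevity, while a serious implementation would use a stronger random source):

#include <cstdlib>

struct Guid
{
  unsigned char Data[16];
};

// Fills out with a version 4 (purely random) GUID as defined by RFC 4122.
void GenerateGuidVersion4(Guid &out)
{
  for (int i = 0; i < 16; i++)
    out.Data[i] = (unsigned char)(rand() & 0xFF);
  // Upper 4 bits of byte 6 encode the version: 0100 = version 4.
  out.Data[6] = (out.Data[6] & 0x0F) | 0x40;
  // Upper 2 bits of byte 8 encode the variant: 10 = RFC 4122.
  out.Data[8] = (out.Data[8] & 0x3F) | 0x80;
}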

Here is my code in C++ defining GUID structure and the algorithm to generate it.

Read full entry > | Comments | #c++ #algorithms Share

# Efficient Finding of Duplicated Files

Tue, 14 Jul 2009

I recently wrote a tool for finding duplicated files in a given directory. I wanted to do it efficiently, and here is my algorithm.

My basic idea is to first recursively enumerate all files in the directory and its subdirectories and sort their descriptions by file size. Having done that, we can iterate through this list and, for each file, look ahead to see how many other files have the same size. If two files have different sizes, they are obviously not equal in terms of content. So if there is only one file with a particular size, it can just be skipped. If there are two files with the same size, we simply compare them by content.

If there are many files with the same size, things become more complicated, as any possible combination of them can turn out to be identical. My solution is to compare two files first by a "header" (let's say the first 1024 bytes) and, if those are equal, by the MD5 checksum of their whole content. The checksum of each file is lazily evaluated, so it's calculated at most once, the first time it is needed.

As I iterate through the sorted file list, I mark reported files so as not to process them again. I can do this because I ensure that when a file is reported, all other identical files are reported along with it. In my real code I do the same with files that cause errors, but here I wanted to simplify the code. A sketch of the main loop is shown below.
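
Here is such a simplified sketch (hypothetical types and helpers - ContentEqual would do the header comparison plus the lazy MD5 check, Report would print the pair; this is not the actual tool's code):

#include <algorithm>
#include <string>
#include <vector>

struct FileDesc
{
  std::string Path;
  unsigned long long Size;
  bool Reported; // Initially false.
};

// Hypothetical helpers, not shown here:
// ContentEqual compares the first 1024 bytes, then lazily computed
// MD5 checksums of the whole content.
bool ContentEqual(FileDesc &f1, FileDesc &f2);
void Report(const FileDesc &f1, const FileDesc &f2);

static bool BySize(const FileDesc &f1, const FileDesc &f2)
{
  return f1.Size < f2.Size;
}

void FindDuplicates(std::vector<FileDesc> &files)
{
  std::sort(files.begin(), files.end(), BySize);
  for (size_t i = 0; i < files.size(); i++)
  {
    if (files[i].Reported)
      continue;
    // Look ahead: the only candidates are following files of equal size.
    for (size_t j = i + 1;
      j < files.size() && files[j].Size == files[i].Size; j++)
    {
      if (files[j].Reported)
        continue;
      if (ContentEqual(files[i], files[j]))
      {
        Report(files[i], files[j]);
        files[j].Reported = true;
      }
    }
  }
}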

Read full entry > | Comments | #algorithms #winapi Share
